
TOPIC: Harmonie crashes/restarts Kernel panic and related

Harmonie crashes/restarts Kernel panic and related 9 months 2 weeks ago #2368


Some time ago we started testing both cy40h1.2 and a flavor of cy43 (beta2, before the switch to python/ecflow, though that is unrelated to this case) with both Intel 18.0.3 and Intel 19.0.2, using the same optimization levels as in our previous operational setup (17.0.4 worked like a charm).

Arome 3DVAR | domain Lithuania | domain DKOEXP | defaults (enough for initial testing)

What followed were occasional crashes of MASTERODB, mostly in the initial forecast phase, without any clear explanation in the logs. The first crash looked like a familiar situation where a simple restart would do the trick and the computation would run through. Once in a while it did, but mostly a failed forecast run would not continue without a significant change such as a cold start or a time step reduction.

Before that, we checked and ruled out most of the following possibilities:

sysctl settings (limits, etc.)
failed memory slots/faulty physical blades (the crashes hit random blades, not specific ones, and those blades were healthy with other applications)
Harmonie-related settings (MBX_size and the like)
Spectre/Meltdown (the kernels loaded on the compute blades were way too old for that, and the newer ones showed the same result).

In the beginning we failed to notice that it was actually the compute blades that were crashing with a kernel panic, with the controlling software rebooting them immediately afterwards. After checking the compute blade logs it was clear that the combination of the newer Intel versions and probably our MPI had been crashing in, or via, the xpmem Linux kernel module.
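
Since the panic only showed up in the blade logs, a small scan like the one below can help spot it earlier. This is only a sketch: the regular expression, the 20-line window, the function name `find_panic_contexts` and the sample lines are all made up for illustration and are not copied from our logs, so adjust them to whatever your console or kernel logs really contain.

```python
import re

# Assumption: typical wording of kernel panic/oops messages; yours may differ.
PANIC_RE = re.compile(r"kernel panic|BUG: unable to handle|Oops", re.IGNORECASE)


def find_panic_contexts(lines, module="xpmem", window=20):
    """Return (index, line) for panic-like lines that have `module`
    mentioned within `window` lines before or after them."""
    hits = []
    for i, line in enumerate(lines):
        if PANIC_RE.search(line):
            context = lines[max(0, i - window): i + window + 1]
            if any(module in c.lower() for c in context):
                hits.append((i, line.rstrip()))
    return hits


# Made-up sample lines, for illustration only.
sample = [
    "[ 1234.5] some normal message",
    "[ 1240.1] BUG: unable to handle kernel paging request",
    "[ 1240.2] Modules linked in: xpmem(O)",
    "[ 1240.9] Kernel panic - not syncing: Fatal exception",
]

for idx, text in find_panic_contexts(sample):
    print(idx, text)
```

For real use, read the saved console log of each blade with `open(path).readlines()` and pass the result in as `lines`.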

In any case, a kernel panic (KP) should not be caused by a user-land process, so it probably means that our OS kernel (4.4.59, which perhaps narrows it down to the 4.4 series) has a flaw of its own, with the Harmonie code acting like an exploit in this case. We should have received a more familiar Harmonie crash log, but we mostly got nothing. Also, we never tested downgraded optimization levels for any of the Harmonie subroutines: a KP is a clear enough signal, since user-land code should not be able to bring down the kernel in any combination.

We came up with changes which might be significant to the HPE/SGI MPI environment only (I never tested this elsewhere myself), as the latter had inherited some of the HPE/SGI software ecosystem settings. Disabling memory allocations via the xpmem kernel module did the trick: no crashes, at the cost of a small wall-time overhead.

Yet another test with Intel 19.0.5 is due next week. I'll post the results.

I would still stress that all of the above should apply to the HPE/SGI software stack specifically, unless other vendors behave the same way, especially in memory reservation/handling.

Last Edit: 9 months 2 weeks ago by Martynas Kazlauskas.

Harmonie crashes/restarts Kernel panic and related 7 months 3 weeks ago #2390


Intel 19.0.5 seems to be somewhat healthy again with default settings and has been running for some time now.

It has never crashed the kernel under the same factory settings (while the 18.0.x and 19.0.2 builds did).

Last Edit: 7 months 3 weeks ago by Martynas Kazlauskas.