HELPDESK


TOPIC: irregular failure on cca

irregular failure on cca 5 months 17 hours ago #2349

Hi,

I get the following failure with the forecast. A simple re-run always makes it go through. Any idea how to get rid of this? I tried playing with the buffer size, but it didn't help.

03:06:56 STEP 14 H= 0:14 +CPU= 3.032
Rank 256 [Thu Jul 11 03:06:59 2019] [c0-0c1s14n0] Fatal error in MPI_Bsend: Invalid buffer pointer, error stack:
MPI_Bsend(192).......: MPI_Bsend(buf=0x7506a930, count=703840, dtype=0x4c000829, dest=131, tag=52000, MPI_COMM_WORLD) failed
MPIR_Bsend_isend(348): Insufficient space in Bsend buffer; requested 5630720; total buffer size is 628000000
aborting job:
Fatal error in MPI_Bsend: Invalid buffer pointer, error stack:
MPI_Bsend(192).......: MPI_Bsend(buf=0x7506a930, count=703840, dtype=0x4c000829, dest=131, tag=52000, MPI_COMM_WORLD) failed
MPIR_Bsend_isend(348): Insufficient space in Bsend buffer; requested 5630720; total buffer size is 628000000
xpmem_attach error: : No such file or directory
Rank 255 [Thu Jul 11 03:06:59 2019] [c0-0c1s14n0] Fatal error in MPI_Waitany: Other MPI error, error stack:
MPI_Waitany(243)..................: MPI_Waitany(count=5, req_array=0x7c997e70, index=0x7fffff78b7ac, status=0x7fffff78b7f0) failed
MPIDI_CH3I_Progress(568)..........:
pkt_RTS_handler(306)..............:
do_cts(637).......................:
MPID_nem_lmt_xpmem_start_recv(980):
MPID_nem_lmt_send_COOKIE(462).....:
MPID_nem_lmt_send_COOKIE(403).....: xpmem_attach failed on rank 255 (src_rank 256, vaddr 0x2aaad1b78c30, len 1462240)
aborting job:
Fatal error in MPI_Waitany: Other MPI error, error stack:
MPI_Waitany(243)..................: MPI_Waitany(count=5, req_array=0x7c997e70, index=0x7fffff78b7ac, status=0x7fffff78b7f0) failed
MPIDI_CH3I_Progress(568)..........:
pkt_RTS_handler(306)..............:
do_cts(637).......................:
MPID_nem_lmt_xpmem_start_recv(980):
MPID_nem_lmt_send_COOKIE(462).....:
MPID_nem_lmt_send_COOKIE(403).....: xpmem_attach failed on rank 255 (src_rank 256, vaddr 0x2aaad1b78c30, len 1462240)
xpmem_attach error: : No such file or directory
Rank 254 [Thu Jul 11 03:06:59 2019] [c0-0c1s14n0] Fatal error in MPI_Waitany: Other MPI error, error stack:
MPI_Waitany(243)..................: MPI_Waitany(count=5, req_array=0x74c82690, index=0x7fffff78b7ac, status=0x7fffff78b7f0) failed
MPIDI_CH3I_Progress(568)..........:
pkt_RTS_handler(306)..............:
do_cts(637).......................:
MPID_nem_lmt_xpmem_start_recv(980):
MPID_nem_lmt_send_COOKIE(462).....:
MPID_nem_lmt_send_COOKIE(403).....: xpmem_attach failed on rank 254 (src_rank 256, vaddr 0x2aaad1a13bf8, len 1462240)
aborting job:
Fatal error in MPI_Waitany: Other MPI error, error stack:
MPI_Waitany(243)..................: MPI_Waitany(count=5, req_array=0x74c82690, index=0x7fffff78b7ac, status=0x7fffff78b7f0) failed
MPIDI_CH3I_Progress(568)..........:
pkt_RTS_handler(306)..............:
do_cts(637).......................:
MPID_nem_lmt_xpmem_start_recv(980):
MPID_nem_lmt_send_COOKIE(462).....:
MPID_nem_lmt_send_COOKIE(403).....: xpmem_attach failed on rank 254 (src_rank 256, vaddr 0x2aaad1a13bf8, len 1462240)
xpmem_attach error: : No such file or directory
Rank 253 [Thu Jul 11 03:06:59 2019] [c0-0c1s14n0] Fatal error in MPI_Waitany: Other MPI error, error stack:
MPI_Waitany(243)..................: MPI_Waitany(count=5, req_array=0x8b39b560, index=0x7fffff78b7ac, status=0x7fffff78b7f0) failed
MPIDI_CH3I_Progress(568)..........:
pkt_RTS_handler(306)..............:
do_cts(637).......................:
MPID_nem_lmt_xpmem_start_recv(980):
MPID_nem_lmt_send_COOKIE(462).....:
MPID_nem_lmt_send_COOKIE(403).....: xpmem_attach failed on rank 253 (src_rank 256, vaddr 0x2aaad18aebc0, len 1462240)
aborting job:
Fatal error in MPI_Waitany: Other MPI error, error stack:
MPI_Waitany(243)..................: MPI_Waitany(count=5, req_array=0x8b39b560, index=0x7fffff78b7ac, status=0x7fffff78b7f0) failed
MPIDI_CH3I_Progress(568)..........:
pkt_RTS_handler(306)..............:
do_cts(637).......................:
MPID_nem_lmt_xpmem_start_recv(980):
MPID_nem_lmt_send_COOKIE(462).....:
MPID_nem_lmt_send_COOKIE(403).....: xpmem_attach failed on rank 253 (src_rank 256, vaddr 0x2aaad18aebc0, len 1462240)
[NID 00120] 2019-07-11 03:06:59 Apid 348397947: initiated application termination
Application 348397947 exit codes: 255
Application 348397947 exit signals: Killed
Application 348397947 resources: utime ~139s, stime ~600s, Rss ~1004104, inblocks ~2789587, outblocks ~9133024
Dir is /scratch/ms/no/sbt/hm_home/TEST43N/20180301_18/forecast


Thanks in advance.
Roger
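
For reference, a minimal sketch of how to confirm which Bsend buffer size a failed run actually used, by pulling it out of the log (the log file name here is only a placeholder, adjust to your setup):

# Extract the Bsend buffer size reported by MPI, and the size of the failed
# request, from a forecast log like the one above (log name is an assumption).
grep -o 'total buffer size is [0-9]*' forecast.log | sort -u
grep -o 'requested [0-9]*' forecast.log | sort -u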

irregular failure on cca 4 months 3 weeks ago #2353

Colm Clancy
I got a similar error recently, and a restart also worked.

I hadn't changed from the default nprocx=nprocy=16. For subsequent experiments I increased these and haven't had the problem since. Not sure whether it's related.

Colm

irregular failure on cca 4 months 2 weeks ago #2354

I also increased the processor numbers, but the system still failed irregularly.

A restart always went through.

Roger

irregular failure on cca 2 months 3 weeks ago #2369

Hi Roger,
others,

I don't know if this helps, as your environment seems to be specific to ECMWF (or similar), but still:

hirlam.org/index.php/forum/2-harmonie-sy...nic-and-related#2368

irregular failure on cca 1 month 3 days ago #2388

Eoin Whelan
Hi Roger et al,

I received the following suggestion from ECMWF:
This may not be the problem, but I have in the back of my mind that there's an environment variable that can be set to change the size of the buffer or "mailbox" used for buffered sends. Maybe this needs to be increased for the larger domains?

According to Sami Saarinen, you can do this by setting MPL_MBX_SIZE, e.g.:

export MPL_MBX_SIZE=${MPL_MBX_SIZE:-128000000} # 128,000,000 bytes


Sami says that the same value also needs to be set in the NAMPAR0 namelist variable MBX_SIZE (Sami sets both in our benchmarks).

However, you need to be careful. The maximum size is 2 GiB and you already seem to have a value of 628,000,000. Sami suggests you try some tests with 1 GB, i.e. 1,000,000,000 bytes, and increase only gradually if needed.
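
A minimal sketch of the combined change, assuming the 1 GB test value suggested above (where exactly the NAMPAR0 namelist gets written depends on your Harmonie setup):

# Buffered-send "mailbox" size in the environment: 1,000,000,000 bytes.
export MPL_MBX_SIZE=${MPL_MBX_SIZE:-1000000000}

# The same value also has to reach the NAMPAR0 namelist, i.e. a fragment like:
#   &NAMPAR0
#     MBX_SIZE=1000000000,
#   /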

I hope that helps (and makes some sense within the Harmonie set-up!).

irregular failure on cca 4 weeks 1 day ago #2391

Eoin Whelan
Erik Gregow (FMI) carried out some tests:
It seems to work well now! A 2-day experiment went through without problems (no crashes in the Forecast or elsewhere).

This is what I changed, following Eoin's recommendations (a quick check is sketched below):
config-sh/config.ecgb-cca: added export MPL_MBX_SIZE=1000000000 at the beginning of the file
nam/harmonie_namelists.pm: 'MBX_SIZE' => '1000000000,', under "Host specific settings" (namelist variable in NAMPAR0)
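
A minimal check that both edits are in place before launching, assuming the paths quoted above are relative to the experiment directory:

# Confirm the environment variable and the namelist entry (paths as quoted above).
grep MPL_MBX_SIZE config-sh/config.ecgb-cca
grep MBX_SIZE nam/harmonie_namelists.pm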

I used the METCOOP25D domain (interpolation of the METCOOP25B structure functions) with the normal Timestep=75.

You can see my experiment at EC: /home/ms/fi/fie/hm_home/Exp_ew_c43_ref_25D/