TOPIC: Problem in defining de-composition ECMWF

Problem in defining de-composition ECMWF 8 years 8 months ago #162

We are now testing a possible HIRLAM 7.4 beta at ECMWF. The area definition is as follows:

RCR_7.4)
NLON=1030 # areasize
NLAT=728
NLEV=65
SOUTH=-24.00 # area boundaries
WEST=-33.50
NORTH=25.436
EAST=36.472
POLAT=-30.0 # coordinates of South pole
POLON=.0
NPBPTS=2 # Number of passive boundary points
NBNDRY=16 # width of boundary zone
GPHALO=8 # Number of halo-zone points in grid-point model
NDTIME=150 # dynamics time step (s)
HDFSET=9000 # old, regular, 9000: climate data set projection
HIRES=0.0125 # resolution of climate data sets (as per available)

Running with the current default submission.db hangs in the LSMIX re-run at the end of the forecast: once the forecast is ready, it prints the ddr and then stops forever. An example of the output is in
/scratch/ms/fi/fne/hl_home/RCR_K74/HL_Cycle_2010091706r.html


In this run we used the default submission.db of trunk:

if ( $COMPCENTRE eq "ECMWF" ) {
    # default nprocx, nprocy, nproc_hgs, LL_NODE and LL_TASKS_PER_NODE:
    if ( $ENSSIZE >= 0 ) {
        $ll_node = 1;
        $ll_tasks_per_node = 60;
        $nprocx = 10;
        $nprocy = 6;
    }
    else {
        $ll_node = 4;
        $ll_tasks_per_node = 51;
        $nprocx = 10;
        $nprocy = 20;
    }
    $nproc_hgs = $ll_node * $ll_tasks_per_node - $nprocx * $nprocy;
    if ( $nproc_hgs < 0 ) { print STDERR "Illegal nproc_hgs: $nproc_hgs\n"; exit 1 }


Then we tried a modified submission.db:

if ( $COMPCENTRE eq "ECMWF" ) {
    # default nprocx, nprocy, nproc_hgs, LL_NODE and LL_TASKS_PER_NODE:
    if ( $ENSSIZE >= 0 ) {
        $ll_node = 1;
        $ll_tasks_per_node = 60;
        $nprocx = 10;
        $nprocy = 6;
    }
    else {
        $ll_node = 2;
        $ll_tasks_per_node = 32;
        $nprocx = 8;
        $nprocy = 8;
    }
    $nproc_hgs = $ll_node * $ll_tasks_per_node - $nprocx * $nprocy;
    if ( $nproc_hgs < 0 ) { print STDERR "Illegal nproc_hgs: $nproc_hgs\n"; exit 1 }

With this, both the forecast and the re-run forecast worked, but of course much more slowly (about 10 s per time step versus ~4-5 s with the default).
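
As a quick check of the arithmetic in the two submission.db variants quoted in this post, the nproc_hgs expression can be evaluated on its own. The sketch below is in Python rather than Perl. The function name and the dictionary labels are illustrative only, and it assumes that the else branch is the one taken for this experiment.

```python
def nproc_hgs(ll_node, ll_tasks_per_node, nprocx, nprocy):
    """Tasks left over once the nprocx * nprocy forecast tasks are placed."""
    return ll_node * ll_tasks_per_node - nprocx * nprocy

# Values copied from the else branches of the two submission.db excerpts.
settings = {
    "default": dict(ll_node=4, ll_tasks_per_node=51, nprocx=10, nprocy=20),
    "modified": dict(ll_node=2, ll_tasks_per_node=32, nprocx=8, nprocy=8),
}

for label, cfg in settings.items():
    total = cfg["ll_node"] * cfg["ll_tasks_per_node"]
    print(f"{label}: {total} tasks in total, nproc_hgs = {nproc_hgs(**cfg)}")
```

With these numbers the default leaves 4 tasks over (nproc_hgs = 4), while the modified settings leave none (nproc_hgs = 0), so the two runs differ in more than the number of forecast tasks.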

Strangely enough, the hang occurred only in the LSMIX re-run, not in the normal forecast run.

The experiments are RCR_K74 and RCR_K74_new in
/home/ms/fi/fne/hl_home

We don't understand what is going on here, or how to define the optimal decomposition and the number of nodes/processors.
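
To get a rough feel for what a given nprocx x nprocy choice means for this area, one can look at how many grid points each task would own. The sketch below assumes a plain block split of NLON by nprocx and NLAT by nprocy, with any remainder spread over the first tasks. This is a Python illustration and not necessarily how HIRLAM distributes points; `block_sizes` is a made-up helper name.

```python
def block_sizes(npoints, nparts):
    """Split npoints into nparts contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(npoints, nparts)
    return [base + 1 if i < extra else base for i in range(nparts)]

NLON, NLAT = 1030, 728  # area size from the RCR_7.4 definition

for nprocx, nprocy in [(10, 20), (8, 8)]:
    x_blocks = block_sizes(NLON, nprocx)
    y_blocks = block_sizes(NLAT, nprocy)
    print(f"{nprocx} x {nprocy}: "
          f"x blocks {min(x_blocks)}-{max(x_blocks)} points, "
          f"y blocks {min(y_blocks)}-{max(y_blocks)} points")
```

Under this assumption the 10 x 20 layout gives blocks of 103 by 36-37 points, and the 8 x 8 layout gives 128-129 by 91 points. With GPHALO=8, the halo is a larger fraction of a 36-point-wide block than of a 91-point-wide one.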

Kalle

Re:Problem in defining de-composition ECMWF 8 years 8 months ago #164

Xiaohua Yang
Even though it may not be very likely, could you please redo the previous test with the HGS server turned off, i.e. change the default of 4 nodes x 51 MPI tasks to 4 nodes x 50 tasks?

There was one item on our to-do list before 7.4: another sanity check of the HGS server. For 7.3 I recently discovered a reproducibility problem when the HGS server is used on the IBM platform. This needs to be confirmed in more tests of the same kind, and the bugs removed. I think similar tests with nhgs =/= 0 should be repeated on all other platforms.

Re:Problem in defining de-composition ECMWF 8 years 8 months ago #166

So far I have found that "no HGS" runs work. I have tried two settings in submission.db that result in nproc_hgs=0, and there have been no problems in those runs.
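
Settings that give nproc_hgs=0 are those where ll_node * ll_tasks_per_node equals nprocx * nprocy exactly. A small Python sketch for listing such combinations is given below. The function name is invented, and `max_nodes` and `max_tasks_per_node` are hypothetical limits chosen only for illustration; they should be replaced by whatever the actual machine and queue allow.

```python
def zero_hgs_layouts(nprocx, nprocy, max_nodes, max_tasks_per_node):
    """Return (ll_node, ll_tasks_per_node) pairs for which nproc_hgs == 0."""
    ntasks = nprocx * nprocy
    layouts = []
    for ll_node in range(1, max_nodes + 1):
        tasks_per_node, remainder = divmod(ntasks, ll_node)
        # Keep only layouts that fill every node evenly and fit on a node.
        if remainder == 0 and tasks_per_node <= max_tasks_per_node:
            layouts.append((ll_node, tasks_per_node))
    return layouts

# 10 x 20 decomposition; the limits of 8 nodes and 64 tasks per node are arbitrary.
print(zero_hgs_layouts(10, 20, max_nodes=8, max_tasks_per_node=64))
```

For the 10 x 20 decomposition and these illustrative limits, the list includes the 4 nodes x 50 tasks layout suggested in the previous reply.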

But I still wonder why the hang occurs only in re-run forecasts, and only when writing the +06 forecast, while writing the +03 forecast succeeds.

The difference between my experiment and the trunk (which works with nproc_HGS=4) is that in the re-run, one extra file is read in during the DFI phase.

Kalle