TOPIC: exit code: HARMONIE is always successful

exit code: HARMONIE is always successful 1 month 2 weeks ago #2413

  • Bert van Ulft
Dear all,

Last week an HCLIM experiment of Danijel's was killed on the cca because of problems elsewhere on the node it was running on. According to the operators, however, his batch job still reported exit code 0, and they advised us to change this so that terminated jobs can be identified easily.
This behaviour seems to be caused by the ERROR function in ecf/head.h, which is included in all job scripts. The function is called when a script exits with a non-zero exit code and does the required work (such as setting the task to aborted in ecFlow), but at the end it always exits with code 0.
The following small script contains the important bits:
#!/bin/bash
set -e
# Define an error handler
ERROR() {
   set +e
   echo "ERROR CALLED"
   trap 0                      # Remove the trap
   exit 0                      # End the script
}

# Trap any calls to exit and errors caught by the -e flag
trap ERROR 0

# Trap any signal that may cause the script to fail
trap '{ echo "Killed by a signal"; ERROR ; }' 1 2 3 4 5 6 7 8 10 12 13 15

# Give some time to type Ctrl-C
sleep 10

# Some command that fails
ls blahblah || exit

# Normal exit
echo "Success"
trap - 0
exit

When submitted on the KNMI HPC, this script completes successfully according to the accounting status. When the "exit 0" in the ERROR function is changed to just "exit", the job shows up as FAILED with the corresponding exit code, 2 in this case. But if the script is terminated with Ctrl-C, it still exits with code 0. Other signals, such as SIGFPE and SIGTERM, seem to work.
Does anybody know whether there is a reason for the "exit 0" line at the end of the ERROR function, or whether it is safe to change "exit 0" to "exit"? Any suggestions for other improvements to the exit status?
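For anyone who wants to reproduce the difference without submitting a batch job, here is a minimal sketch, assuming bash is available. The names run_variant and failing_command are hypothetical and only used for illustration; failing_command stands in for the "ls blahblah" line and always returns 2.

```bash
#!/bin/bash
# Hypothetical harness (not part of HARMONIE): run a tiny job in a child bash
# whose ERROR handler ends with the statement given as $1, then print the
# exit status that the parent (think: the batch system) gets to see.
run_variant() {
    bash -c '
        failing_command() { return 2; }   # stand-in for "ls blahblah"
        set -e
        ERROR() {
            set +e
            echo "ERROR CALLED" >&2
            trap - 0                      # remove the EXIT trap
            '"$1"'                        # "exit 0" or plain "exit"
        }
        trap ERROR 0
        failing_command || exit           # exits with the status of failing_command
        echo "Success"
        trap - 0
        exit
    '
    echo "$1 -> $?"
}

run_variant "exit 0"
run_variant "exit"
```

With bash, the first call should report status 0 and the second status 2, which matches what the accounting showed for the two versions of the script.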

best wishes,

Bert

exit code: HARMONIE is always successful 1 month 2 weeks ago #2414

  • Bert van Ulft
Something like this might do the trick:
#!/bin/bash
set -e
# Define an error handler
ERROR() {
   set +e
   echo "ERROR CALLED"
   exit
}

# Trap any calls to exit and errors caught by the -e flag
trap 'ERROR' 0

# Trap any signal that may cause the script to fail
trap '{ rc=$?; echo "Killed by a signal"; exit $rc; }' 1 2 3 4 5 6 7 8 10 12 13 15

# Give some time to type Ctrl-C
sleep 5

# Some command that fails
ls blahblah || exit

# Normal exit
echo "Success"
trap - 0
exit

Catching a signal stores the exit code as rc, echoes some info, then exits with exit code $rc. The exit trap then calls the ERROR function, which in the end exits with the provided exit code.
I am not sure it is foolproof, but we could test this in HCLIM and, if it works, put it in Harmonie/develop at some point.
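One possible weak spot: inside a signal trap, $? holds the status of the last command that completed, which may not say anything about the signal itself. A variation, sketched below under the assumption that bash is used, installs one trap per signal and exits with 128 plus the signal number, the same value a shell reports for a command killed by that signal. The function name demo_signal_exit is hypothetical, and the "kill -TERM $$" line only simulates the batch system terminating the job.

```bash
#!/bin/bash
# Hypothetical demo (not part of HARMONIE): the child bash installs one trap
# per signal, is then hit by SIGTERM, and the parent prints the status it sees.
demo_signal_exit() {
    bash -c '
        ERROR() {
            rc=$?                       # status that triggered the EXIT trap
            trap - 0                    # remove the EXIT trap
            echo "ERROR CALLED, rc=$rc" >&2
            exit "$rc"                  # pass the original status on
        }
        trap ERROR 0

        # Same signal list as in the script above, one trap per signal
        for sig in 1 2 3 4 5 6 7 8 10 12 13 15; do
            trap "echo \"Killed by signal $sig\" >&2; exit $((128 + sig))" "$sig"
        done

        kill -TERM $$                   # simulate the job being terminated
        sleep 5                         # not reached: the trap exits first
        trap - 0
        exit 0
    '
    echo "status seen by parent: $?"
}

demo_signal_exit
```

Here the parent should see status 143 (128 + 15), so the signal can be read back from the exit code regardless of which command happened to be running.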

exit code: HARMONIE is always successful 1 month 2 weeks ago #2415

  • Ole Vignes
Looking through the version control history, it seems the "exit 0" was introduced in the ecflow branch (subversion) and then brought to trunk around 2014, in this changeset (as seen from git): hirlam.org/trac/changeset/3daf3dca8d54cb...327a25e0b9a/Harmonie

In the subversion trunk, in sms/sms.h the ERROR function ends with "exit 1".
I think that (or just exit) is the proper thing to do, not "exit 0".

exit code: HARMONIE is always successful 1 month 2 weeks ago #2416

Hi,

I am guilty of the "exit 0". I took this head.h as a base for the mSMS-ecFlow adaptation:
#!%SHELL:/bin/ksh%
set -e          # stop the shell on first error
set -u          # fail when using an undefined variable
set -x          # echo script lines as they are executed
set -o pipefail # fail if last(rightmost) command exits with a non-zero status
 
# Defines the variables that are needed for any communication with ECF
export ECF_PORT=%ECF_PORT%    # The server port number
export ECF_HOST=%ECF_HOST%    # The host name where the server is running
export ECF_NAME=%ECF_NAME%    # The name of this current task
export ECF_PASS=%ECF_PASS%    # A unique password, used for job validation & zombie detection
export ECF_TRYNO=%ECF_TRYNO%  # Current try number of the task
export ECF_RID=$$             # record the process id. Also used for zombie detection
# export NO_ECF=1             # uncomment to run as a standalone task on the command line
 
# Define the path where to find ecflow_client
# make sure client and server use the *same* version.
# Important when there are multiple versions of ecFlow
export PATH=/usr/local/apps/ecflow/%ECF_VERSION%/bin:$PATH
 
# Tell ecFlow we have started
ecflow_client --init=$$
 
 
# Define an error handler
ERROR() {
   set +e                      # Clear -e flag, so we don't fail
   wait                        # wait for background process to stop
   ecflow_client --abort=trap  # Notify ecFlow that something went wrong, using 'trap' as the reason
   trap 0                      # Remove the trap
   exit 0                      # End the script
}
 
 
# Trap any calls to exit and errors caught by the -e flag
trap ERROR 0
 
 
# Trap any signal that may cause the script to fail
trap '{ echo "Killed by a signal"; ERROR ; }' 1 2 3 4 5 6 7 8 10 12 13 15


This code is from the ecFlow tutorial at ECMWF.

ecFlow tutorial

exit code: HARMONIE is always successful 1 month 2 weeks ago #2417

  • Bert van Ulft
Thanks for your feedback, Ole and Daniel.
As stated in the tutorial Daniel linked to, ECMWF uses "exit 1" operationally, so I will put it like that in Harmonie/develop (hirlam.org/trac/log/Harmonie/ecf?rev=c0d...eb924c87b1a3fadfc465). For HCLIM, I will implement the option from my last post, as it preserves the exit code and so possibly gives a bit more information about why a job terminated (committed to HCLIM38h1, hirlam.org/trac/changeset/18798).
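To make the difference between the two choices concrete, here is a rough sketch, assuming bash. It is not the committed code: run_handler and the ENDING variable are hypothetical, ecflow_client is replaced by a stub function so the example runs without an ecFlow server, and "(exit 42)" stands in for a failing task.

```bash
#!/bin/bash
# Hypothetical comparison (not the committed head.h): the same ERROR handler
# ending either with a fixed "exit 1" or with the preserved exit code.
run_handler() {
    ENDING=$1 bash -c '
        ecflow_client() { echo "stub ecflow_client $*" >&2; }   # stand-in only
        ERROR() {
            rc=$?                       # status that triggered the EXIT trap
            set +e                      # do not fail inside the handler
            wait                        # wait for background processes
            ecflow_client --abort=trap  # report the failure to ecFlow
            trap - 0                    # remove the EXIT trap
            if [ "$ENDING" = preserve ]; then
                exit "$rc"              # keep the original exit code
            fi
            exit 1                      # always report a generic failure
        }
        set -e
        trap ERROR 0
        (exit 42)                       # stand-in for a failing task
        echo "not reached"
    '
    echo "$1: $?"
}

run_handler fixed
run_handler preserve
```

The "fixed" variant should report 1 and the "preserve" variant 42. Both mark the job as failed for the batch system; only the second keeps the code of the command that actually failed.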