Shutdown: BAD_WORK_UNIT (when run with 10 CPUs)
Posted: Thu Mar 19, 2020 10:49 am
Hi,
I am getting this error in my logs on all thirteen nodes I am currently trying to fold on (one 28-core VM and twelve 128-core EPYC physical systems):
10:47:43:WU00:FS00:0xa7:Project: 13851 (Run 0, Clone 14914, Gen 0)
10:47:43:WU00:FS00:0xa7:Unit: 0x00000000287234c95e7301ac882be327
10:47:43:WU00:FS00:0xa7:Reading tar file core.xml
10:47:43:WU00:FS00:0xa7:Reading tar file frame0.tpr
10:47:43:WU00:FS00:0xa7:Digital signatures verified
10:47:43:WU00:FS00:0xa7:Calling: mdrun -s frame0.tpr -o frame0.trr -x frame0.xtc -e frame0.edr -cpt 15 -nt 128
10:47:43:WU00:FS00:0xa7:Steps: first=0 total=500000
10:47:43:WU00:FS00:0xa7:ERROR:
10:47:43:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
10:47:43:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
10:47:43:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
10:47:43:WU00:FS00:0xa7:ERROR:
10:47:43:WU00:FS00:0xa7:ERROR:Fatal error:
10:47:43:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 96 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
10:47:43:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
10:47:43:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
10:47:43:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
10:47:43:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
10:47:43:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
10:47:48:WU00:FS00:0xa7:WARNING:Unexpected exit() call
10:47:48:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
10:47:48:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
10:47:48:WU00:FS00:0xa7:Saving result file md.log
10:47:48:WU00:FS00:0xa7:Saving result file science.log
10:47:48:WU00:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
10:47:48:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
Any advice on how to get around this?
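For comparing the failure across several nodes, a small script can pull the relevant numbers out of each log. The sketch below is only an illustration: it assumes the log lines look exactly like the ones pasted above, and the names `summarize_failure` and `sample` are made up for this example, not part of the Folding@home client or GROMACS. Note that in the log above mdrun is called with -nt 128 while the error message refers to 96 ranks, so it is worth recording both values.

```python
import re

# Patterns follow the log lines pasted in this post (an assumption);
# other client or core versions may format these lines differently.
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
NT_RE = re.compile(r"Calling: mdrun .*?-nt (\d+)")
DD_RE = re.compile(
    r"no domain decomposition for (\d+) ranks .*? minimum cell size of ([\d.]+) nm"
)


def summarize_failure(log_text):
    """Return a dict holding whichever of the known fields appear in log_text."""
    summary = {}

    match = PROJECT_RE.search(log_text)
    if match:
        project, run, clone, gen = (int(value) for value in match.groups())
        summary.update(project=project, run=run, clone=clone, gen=gen)

    match = NT_RE.search(log_text)
    if match:
        # Thread count passed to mdrun via -nt.
        summary["threads"] = int(match.group(1))

    match = DD_RE.search(log_text)
    if match:
        # Rank count and minimum cell size quoted in the GROMACS error.
        summary["ranks"] = int(match.group(1))
        summary["min_cell_nm"] = float(match.group(2))

    return summary


# Three lines copied from the log in this post.
sample = """\
10:47:43:WU00:FS00:0xa7:Project: 13851 (Run 0, Clone 14914, Gen 0)
10:47:43:WU00:FS00:0xa7:Calling: mdrun -s frame0.tpr -o frame0.trr -x frame0.xtc -e frame0.edr -cpt 15 -nt 128
10:47:43:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 96 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
"""

if __name__ == "__main__":
    print(summarize_failure(sample))
```

Running the same function over the log from each node would show whether the thread count, rank count, and project are the same on every machine that fails.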