[Linux] Problem with BAD_WORK_UNIT

Posted: Sun Apr 26, 2020 10:41 pm
by krasny
Hi there:

I'm currently folding on headless Linux machines and I've noticed that I sometimes get these errors:

22:31:44:WU01:FS00:0xa7:*********************** Log Started 2020-04-26T22:31:43Z ***********************
22:31:44:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
22:31:44:WU01:FS00:0xa7:       Type: 0xa7
22:31:44:WU01:FS00:0xa7:       Core: Gromacs
22:31:44:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 706 -lifeline 36972 -checkpoint 15 -np
22:31:44:WU01:FS00:0xa7:             39
22:31:44:WU01:FS00:0xa7:************************************ CBang *************************************
22:31:44:WU01:FS00:0xa7:       Date: Nov 5 2019
22:31:44:WU01:FS00:0xa7:       Time: 06:06:57
22:31:44:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
22:31:44:WU01:FS00:0xa7:     Branch: master
22:31:44:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
22:31:44:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
22:31:44:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
22:31:44:WU01:FS00:0xa7:       Bits: 64
22:31:44:WU01:FS00:0xa7:       Mode: Release
22:31:44:WU01:FS00:0xa7:************************************ System ************************************
22:31:44:WU01:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
22:31:44:WU01:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 79 Stepping 1
22:31:44:WU01:FS00:0xa7:       CPUs: 40
22:31:44:WU01:FS00:0xa7:     Memory: 62.65GiB
22:31:44:WU01:FS00:0xa7:Free Memory: 61.41GiB
22:31:44:WU01:FS00:0xa7:    Threads: POSIX_THREADS
22:31:44:WU01:FS00:0xa7: OS Version: 3.10
22:31:44:WU01:FS00:0xa7:Has Battery: false
22:31:44:WU01:FS00:0xa7: On Battery: false
22:31:44:WU01:FS00:0xa7: UTC Offset: 2
22:31:44:WU01:FS00:0xa7:        PID: 36976
22:31:44:WU01:FS00:0xa7:        CWD: /var/lib/fahclient/work
22:31:44:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
22:31:44:WU01:FS00:0xa7:    Version: 0.0.18
22:31:44:WU01:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
22:31:44:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
22:31:44:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
22:31:44:WU01:FS00:0xa7:       Date: Nov 5 2019
22:31:44:WU01:FS00:0xa7:       Time: 06:13:26
22:31:44:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
22:31:44:WU01:FS00:0xa7:     Branch: master
22:31:44:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
22:31:44:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
22:31:44:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
22:31:44:WU01:FS00:0xa7:       Bits: 64
22:31:44:WU01:FS00:0xa7:       Mode: Release
22:31:44:WU01:FS00:0xa7:************************************ Build *************************************
22:31:44:WU01:FS00:0xa7:       SIMD: avx_256
22:31:44:WU01:FS00:0xa7:********************************************************************************
22:31:44:WU01:FS00:0xa7:Project: 16417 (Run 1751, Clone 0, Gen 110)
22:31:44:WU01:FS00:0xa7:Unit: 0x0000007a96880e6e5e8a608553ba549c
22:31:44:WU01:FS00:0xa7:Reading tar file core.xml
22:31:44:WU01:FS00:0xa7:Reading tar file frame110.tpr
22:31:44:WU01:FS00:0xa7:Digital signatures verified
22:31:44:WU01:FS00:0xa7:Reducing thread count from 39 to 38 to avoid domain decomposition with large prime factor 13
22:31:44:WU01:FS00:0xa7:Reducing thread count from 38 to 37 to avoid domain decomposition with large prime factor 19
22:31:44:WU01:FS00:0xa7:Reducing thread count from 37 to 36 to avoid domain decomposition by a prime number > 3
22:31:44:WU01:FS00:0xa7:Calling: mdrun -s frame110.tpr -o frame110.trr -x frame110.xtc -cpt 15 -nt 36
22:31:44:WU01:FS00:0xa7:Steps: first=27500000 total=250000
22:31:44:WU01:FS00:0xa7:ERROR:
22:31:44:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
22:31:44:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
22:31:44:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
22:31:44:WU01:FS00:0xa7:ERROR:
22:31:44:WU01:FS00:0xa7:ERROR:Fatal error:
22:31:44:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 30 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
22:31:44:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
22:31:44:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
22:31:44:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
22:31:44:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
22:31:44:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
22:31:49:WU01:FS00:0xa7:WARNING:Unexpected exit() call
22:31:49:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
22:31:49:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
22:31:49:WU01:FS00:0xa7:Saving result file md.log
22:31:49:WU01:FS00:0xa7:Saving result file science.log
22:31:49:WU01:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
22:31:49:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)

As far as I have read, this is due to the number of cores used, but I can't change it via the GUI, and the nodes have 40 cores each. Is there any way to configure a smaller number of cores using config.xml? Which number of cores do you recommend to avoid this type of error? Is there any way to auto-relaunch the client to avoid wasting time?

Thank you!

Re: [Linux] Problem with BAD_WORK_UNIT

Posted: Sun Apr 26, 2020 10:44 pm
by PantherX
This would be an option in config.xml:

  <slot id='0' type='CPU'>
    <cpus v='32'/>
  </slot>

32 is a safe number, but it may have fewer projects available than lower CPU counts do.

Re: [Linux] Problem with BAD_WORK_UNIT

Posted: Sun Apr 26, 2020 10:48 pm
by krasny
Thank you very much! I'll try with this config.

Re: [Linux] Problem with BAD_WORK_UNIT

Posted: Sun Apr 26, 2020 10:52 pm
by bruce
FAH doesn't like any slot with more than 32 CPUs.

You can break the 40 up into whatever makes sense to you to avoid the re-launch. Maybe something like this:

<slot type="CPU" id="0">
  <cpus v="16"/>
</slot>
<slot type="CPU" id="1">
  <cpus v="12"/>
</slot>
<slot type="CPU" id="2">
  <cpus v="12"/>
</slot>

It likes cpus to have factors of 2 or 3. Large prime factors should be avoided.
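
A quick way to sanity-check a split like the one above before editing config.xml is a small script. This is only a sketch: the function names are made up for illustration, and the rules it encodes (thread counts with only 2 and 3 as prime factors, at most 32 per slot) are taken from the advice in this thread, not from the FAH client itself.

```python
def only_factors_2_and_3(n: int) -> bool:
    """Return True if n has no prime factor other than 2 or 3."""
    if n < 1:
        return False
    for p in (2, 3):
        while n % p == 0:
            n //= p
    return n == 1


def check_split(total_threads: int, slots: list[int]) -> list[str]:
    """Return a list of problems with a proposed slot split (empty if none found)."""
    problems = []
    if sum(slots) != total_threads:
        problems.append(f"slots use {sum(slots)} threads, machine has {total_threads}")
    for i, cpus in enumerate(slots):
        # Rule of thumb from this thread, not a limit read from the client.
        if cpus > 32:
            problems.append(f"slot {i}: {cpus} exceeds 32")
        if not only_factors_2_and_3(cpus):
            problems.append(f"slot {i}: {cpus} has a prime factor larger than 3")
    return problems


print(check_split(40, [16, 12, 12]))  # the split suggested above
print(check_split(40, [40]))          # a single 40-thread slot
```

The first call prints an empty list, since 16 + 12 + 12 = 40 and each count factors into 2s and 3s only. The second reports two problems, because 40 is above 32 and 40 = 2 x 2 x 2 x 5.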

Re: [Linux] Problem with BAD_WORK_UNIT

Posted: Sun Apr 26, 2020 11:03 pm
by JimboPalmer
Windows has a limit at 32 for most non server versions. You did not give us the configuration section of the log, but one path hints this is Linux.
22:31:44:WU01:FS00:0xa7: CWD: /var/lib/fahclient/work

It is still true that selecting a number of CPUs that is prime, or a multiple of a prime larger than 3, can sometimes fail to fold. For larger numbers of CPUs (threads, really, but F@H calls them CPUs) the rules get complex and the number of Work Units available goes down, so even on Linux 32 is not a bad choice. It is just not the hard limit it is on Windows.

2, 3, 4, 6, 8, 9, 12, 16, 18, 20, 21, 24, 27, 30, 32 are good numbers of CPUs to choose. (_r2w_ben has advised me of more good numbers.)
5, 10, 15, 20, 25, 28 may work most of the time. Other numbers will bite you.
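
To see why the core in the first post's log stepped down from 39 to 36 threads, it can help to print the largest prime factor of each candidate count. The helper below is a plain trial-division sketch with a made-up name. It only does the arithmetic and does not reproduce the core's actual decision logic, which is not shown in this thread.

```python
def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n (for n >= 2) by trial division."""
    largest = 1
    p = 2
    while p * p <= n:
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    if n > 1:
        # Whatever remains after dividing out the small factors is itself prime.
        largest = n
    return largest


# Thread counts the core tried in the log from the first post.
for count in (39, 38, 37, 36):
    print(count, largest_prime_factor(count))
```

This prints 13 for 39, 19 for 38, 37 for 37, and 3 for 36, which matches the prime factors named in the "Reducing thread count" lines of the log. The same function can be run over any count from the lists above.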