[Linux] Problem with BAD_WORK_UNIT

krasny
Posts: 5
Joined: Thu Apr 23, 2020 4:56 pm

[Linux] Problem with BAD_WORK_UNIT

Post by krasny »

Hi there:

I'm currently folding on headless Linux machines, and I've noticed that I sometimes get these errors:

Code:

22:31:44:WU01:FS00:0xa7:*********************** Log Started 2020-04-26T22:31:43Z ***********************
22:31:44:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
22:31:44:WU01:FS00:0xa7:       Type: 0xa7
22:31:44:WU01:FS00:0xa7:       Core: Gromacs
22:31:44:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 706 -lifeline 36972 -checkpoint 15 -np
22:31:44:WU01:FS00:0xa7:             39
22:31:44:WU01:FS00:0xa7:************************************ CBang *************************************
22:31:44:WU01:FS00:0xa7:       Date: Nov 5 2019
22:31:44:WU01:FS00:0xa7:       Time: 06:06:57
22:31:44:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
22:31:44:WU01:FS00:0xa7:     Branch: master
22:31:44:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
22:31:44:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
22:31:44:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
22:31:44:WU01:FS00:0xa7:       Bits: 64
22:31:44:WU01:FS00:0xa7:       Mode: Release
22:31:44:WU01:FS00:0xa7:************************************ System ************************************
22:31:44:WU01:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
22:31:44:WU01:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 79 Stepping 1
22:31:44:WU01:FS00:0xa7:       CPUs: 40
22:31:44:WU01:FS00:0xa7:     Memory: 62.65GiB
22:31:44:WU01:FS00:0xa7:Free Memory: 61.41GiB
22:31:44:WU01:FS00:0xa7:    Threads: POSIX_THREADS
22:31:44:WU01:FS00:0xa7: OS Version: 3.10
22:31:44:WU01:FS00:0xa7:Has Battery: false
22:31:44:WU01:FS00:0xa7: On Battery: false
22:31:44:WU01:FS00:0xa7: UTC Offset: 2
22:31:44:WU01:FS00:0xa7:        PID: 36976
22:31:44:WU01:FS00:0xa7:        CWD: /var/lib/fahclient/work
22:31:44:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
22:31:44:WU01:FS00:0xa7:    Version: 0.0.18
22:31:44:WU01:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
22:31:44:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
22:31:44:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
22:31:44:WU01:FS00:0xa7:       Date: Nov 5 2019
22:31:44:WU01:FS00:0xa7:       Time: 06:13:26
22:31:44:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
22:31:44:WU01:FS00:0xa7:     Branch: master
22:31:44:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
22:31:44:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
22:31:44:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
22:31:44:WU01:FS00:0xa7:       Bits: 64
22:31:44:WU01:FS00:0xa7:       Mode: Release
22:31:44:WU01:FS00:0xa7:************************************ Build *************************************
22:31:44:WU01:FS00:0xa7:       SIMD: avx_256
22:31:44:WU01:FS00:0xa7:********************************************************************************
22:31:44:WU01:FS00:0xa7:Project: 16417 (Run 1751, Clone 0, Gen 110)
22:31:44:WU01:FS00:0xa7:Unit: 0x0000007a96880e6e5e8a608553ba549c
22:31:44:WU01:FS00:0xa7:Reading tar file core.xml
22:31:44:WU01:FS00:0xa7:Reading tar file frame110.tpr
22:31:44:WU01:FS00:0xa7:Digital signatures verified
22:31:44:WU01:FS00:0xa7:Reducing thread count from 39 to 38 to avoid domain decomposition with large prime factor 13
22:31:44:WU01:FS00:0xa7:Reducing thread count from 38 to 37 to avoid domain decomposition with large prime factor 19
22:31:44:WU01:FS00:0xa7:Reducing thread count from 37 to 36 to avoid domain decomposition by a prime number > 3
22:31:44:WU01:FS00:0xa7:Calling: mdrun -s frame110.tpr -o frame110.trr -x frame110.xtc -cpt 15 -nt 36
22:31:44:WU01:FS00:0xa7:Steps: first=27500000 total=250000
22:31:44:WU01:FS00:0xa7:ERROR:
22:31:44:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
22:31:44:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
22:31:44:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
22:31:44:WU01:FS00:0xa7:ERROR:
22:31:44:WU01:FS00:0xa7:ERROR:Fatal error:
22:31:44:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 30 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
22:31:44:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
22:31:44:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
22:31:44:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
22:31:44:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
22:31:44:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
22:31:49:WU01:FS00:0xa7:WARNING:Unexpected exit() call
22:31:49:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
22:31:49:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
22:31:49:WU01:FS00:0xa7:Saving result file md.log
22:31:49:WU01:FS00:0xa7:Saving result file science.log
22:31:49:WU01:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
22:31:49:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)

As far as I have read, this is due to the number of cores used, but I can't change it via the GUI, and the nodes have 40 cores each. Is there any way to configure a smaller number of cores using config.xml? Which number of cores do you recommend to avoid this type of error? Is there any way to auto-relaunch the client to avoid wasting time?

Thank you!
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud
Contact:

Re: [Linux] Problem with BAD_WORK_UNIT

Post by PantherX »

This would be an option in config.xml:

Code:

  <slot id='0' type='CPU'>
    <cpus v='32'/>
  </slot>

32 is a safe number, but it may not have as many projects available as lower CPU counts do.
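
If you manage several nodes, the slot entry can be generated rather than typed by hand. This is only a rough sketch: `slot_fragment` is a made-up helper name, it emits just the `<slot>`/`<cpus>` elements shown above (not a complete config.xml), and it assumes you paste the output into your existing file yourself.

```python
import xml.etree.ElementTree as ET


def slot_fragment(cpu_counts):
    """Build one <slot> element per entry in cpu_counts (hypothetical helper).

    Only the elements and attributes shown in the post are produced:
    <slot id=... type='CPU'> containing <cpus v=.../>.
    """
    lines = []
    for slot_id, cpus in enumerate(cpu_counts):
        slot = ET.Element("slot", {"id": str(slot_id), "type": "CPU"})
        ET.SubElement(slot, "cpus", {"v": str(cpus)})
        lines.append(ET.tostring(slot, encoding="unicode"))
    return "\n".join(lines)


# One 32-thread CPU slot, as suggested above.
print(slot_fragment([32]))
```

Passing a longer list, for example `[16, 12, 12]`, gives one slot per entry with ids 0, 1, 2.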
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
krasny
Posts: 5
Joined: Thu Apr 23, 2020 4:56 pm

Re: [Linux] Problem with BAD_WORK_UNIT

Post by krasny »

Thank you very much! I'll try with this config.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: [Linux] Problem with BAD_WORK_UNIT

Post by bruce »

FAH doesn't like any slot with more than 32 CPUs.

You can break the 40 up into whatever makes sense to you to avoid the re-launch. Maybe something like this:

<slot type="CPU" id="0">
  <cpus v="16"/>
</slot>
<slot type="CPU" id="1">
  <cpus v="12"/>
</slot>
<slot type="CPU" id="2">
  <cpus v="12"/>
</slot>

It likes cpus to have factors of 2 or 3. Large prime factors should be avoided.
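
To see why the core in the log stepped down from 39 to 36 threads, here is a small sketch of the rule of thumb. It assumes the simplified rule "no prime factor larger than 3"; the real logic inside the FahCore is not shown in this thread and may differ. The function names are my own.

```python
def prime_factors(n):
    """Return the prime factors of n in ascending order, with repeats."""
    factors = []
    divisor = 2
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    if n > 1:
        factors.append(n)
    return factors


def is_friendly(cpus):
    """True if cpus has no prime factor larger than 3 (assumed rule of thumb)."""
    return cpus >= 1 and all(p <= 3 for p in prime_factors(cpus))


def reduce_to_friendly(cpus):
    """Step down one thread at a time until the count is friendly."""
    while cpus > 1 and not is_friendly(cpus):
        cpus -= 1
    return cpus


for requested in (40, 39, 32, 16, 12):
    print(requested, prime_factors(requested), reduce_to_friendly(requested))
```

Under that assumption, 39 (3 x 13), 38 (2 x 19) and 37 (prime) are all rejected and 36 is the first count accepted, which matches the three "Reducing thread count" lines in the log. Note that passing this check did not guarantee success: the run in the log still failed at 36 threads.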
JimboPalmer
Posts: 2522
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: [Linux] Problem with BAD_WORK_UNIT

Post by JimboPalmer »

Windows has a limit of 32 on most non-server versions. You did not give us the configuration section of the log, but one path hints that this is Linux.
22:31:44:WU01:FS00:0xa7: CWD: /var/lib/fahclient/work

It is still true that selecting a number of CPUs which is prime, or a multiple of a prime larger than 3, can sometimes fail to fold. For larger numbers of CPUs (threads, really, but F@H calls them CPUs) the rules get complex and the number of Work Units available goes down, so even in Linux 32 is not a bad choice. It is just not the hard limit it is in Windows.

2, 3, 4, 6, 8, 9, 12, 16, 18, 20, 21, 24, 27, 30, 32 are good numbers of CPUs to choose. (_r2w_ben has advised me of more good numbers)
5, 10, 15, 20, 25, 28 may work most of the time. Other numbers will bite you.
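
For anyone scripting this, the two lists above can be turned into a quick lookup. This is only a sketch: the sets are copied from this post as written (20 appears in both, so the first match wins), `rate_cpu_count` is a made-up name, and the labels are my own wording.

```python
# Copied from the lists in this post; not an official F@H table.
GOOD = {2, 3, 4, 6, 8, 9, 12, 16, 18, 20, 21, 24, 27, 30, 32}
USUALLY_OK = {5, 10, 15, 20, 25, 28}


def rate_cpu_count(cpus):
    """Classify a per-slot CPU count using the lists above."""
    if cpus in GOOD:
        return "good"
    if cpus in USUALLY_OK:
        return "may work most of the time"
    return "not listed (may fail)"


# Compare a single 40-thread slot with a 16/12/12 split.
for cpus in (40, 16, 12, 12):
    print(cpus, rate_cpu_count(cpus))
```

Counts above 32 are not covered by either list, so they all fall into the last category here.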
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends