ERROR: BAD_WORK_UNIT

Moderators: Site Moderators, FAHC Science Team

acnash
Posts: 20
Joined: Sun Mar 11, 2018 4:34 pm
Hardware configuration: 32x Intel (R) Xeon (R) CPU E5-2687W 0 @ 3.10GHz
125.88GiB
1 Nvidia GeForce GTX 970
Mint Linux ver 18
Location: Oxford, UK

ERROR: BAD_WORK_UNIT

Post by acnash »

Hi all,

I've been running F@H for about three years now, and this is the first time I've come across this problem. I've just upgraded my client to 7.6.9, but I think the timing of the upgrade relative to the error is coincidental.

Quick system spec:

Code: Select all

11:16:12:WU02:FS00:0xa7:*********************** Log Started 2020-04-23T11:16:12Z ***********************
11:16:12:WU02:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
11:16:12:WU02:FS00:0xa7:       Type: 0xa7
11:16:12:WU02:FS00:0xa7:       Core: Gromacs
11:16:12:WU02:FS00:0xa7:       Args: -dir 02 -suffix 01 -version 704 -lifeline 10318 -checkpoint 15 -np
11:16:12:WU02:FS00:0xa7:             15
11:16:12:WU02:FS00:0xa7:************************************ CBang *************************************
11:16:12:WU02:FS00:0xa7:       Date: Nov 5 2019
11:16:12:WU02:FS00:0xa7:       Time: 06:06:57
11:16:12:WU02:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
11:16:12:WU02:FS00:0xa7:     Branch: master
11:16:12:WU02:FS00:0xa7:   Compiler: GNU 8.3.0
11:16:12:WU02:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
11:16:12:WU02:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
11:16:12:WU02:FS00:0xa7:       Bits: 64
11:16:12:WU02:FS00:0xa7:       Mode: Release
11:16:12:WU02:FS00:0xa7:************************************ System ************************************
11:16:12:WU02:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
11:16:12:WU02:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 45 Stepping 7
11:16:12:WU02:FS00:0xa7:       CPUs: 32
11:16:12:WU02:FS00:0xa7:     Memory: 125.88GiB
11:16:12:WU02:FS00:0xa7:Free Memory: 43.49GiB
11:16:12:WU02:FS00:0xa7:    Threads: POSIX_THREADS
11:16:12:WU02:FS00:0xa7: OS Version: 4.10
11:16:12:WU02:FS00:0xa7:Has Battery: false
11:16:12:WU02:FS00:0xa7: On Battery: false
11:16:12:WU02:FS00:0xa7: UTC Offset: 1
11:16:12:WU02:FS00:0xa7:        PID: 10322
11:16:12:WU02:FS00:0xa7:        CWD: /var/lib/fahclient/work
11:16:12:WU02:FS00:0xa7:******************************** Build - libFAH ********************************
11:16:12:WU02:FS00:0xa7:    Version: 0.0.18
11:16:12:WU02:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
11:16:12:WU02:FS00:0xa7:  Copyright: 2019 foldingathome.org
11:16:12:WU02:FS00:0xa7:   Homepage: https://foldingathome.org/
11:16:12:WU02:FS00:0xa7:       Date: Nov 5 2019
11:16:12:WU02:FS00:0xa7:       Time: 06:13:26
11:16:12:WU02:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
11:16:12:WU02:FS00:0xa7:     Branch: master
11:16:12:WU02:FS00:0xa7:   Compiler: GNU 8.3.0
11:16:12:WU02:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
11:16:12:WU02:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
11:16:12:WU02:FS00:0xa7:       Bits: 64
11:16:12:WU02:FS00:0xa7:       Mode: Release
11:16:12:WU02:FS00:0xa7:************************************ Build *************************************
11:16:12:WU02:FS00:0xa7:       SIMD: avx_256
11:16:12:WU02:FS00:0xa7:********************************************************************************
Now the actual error:

Code: Select all

11:16:12:WU02:FS00:0xa7:Project: 16417 (Run 1322, Clone 1, Gen 62)
11:16:12:WU02:FS00:0xa7:Unit: 0x0000004996880e6e5e8a61572b189804
11:16:12:WU02:FS00:0xa7:Reading tar file core.xml
11:16:12:WU02:FS00:0xa7:Reading tar file frame62.tpr
11:16:12:WU02:FS00:0xa7:Digital signatures verified
11:16:12:WU02:FS00:0xa7:Calling: mdrun -s frame62.tpr -o frame62.trr -x frame62.xtc -cpt 15 -nt 15
11:16:12:WU02:FS00:0xa7:Steps: first=15500000 total=250000
11:16:12:WU02:FS00:0xa7:ERROR:
11:16:12:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
11:16:12:WU02:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
11:16:12:WU02:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
11:16:12:WU02:FS00:0xa7:ERROR:
11:16:12:WU02:FS00:0xa7:ERROR:Fatal error:
11:16:12:WU02:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
11:16:12:WU02:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
11:16:12:WU02:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
11:16:12:WU02:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
11:16:12:WU02:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
11:16:12:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
11:16:17:WU02:FS00:0xa7:WARNING:Unexpected exit() call
11:16:17:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
11:16:17:WU02:FS00:0xa7:Saving result file ../logfile_01.txt
11:16:17:WU02:FS00:0xa7:Saving result file md.log
11:16:17:WU02:FS00:0xa7:Saving result file science.log
11:16:17:WU02:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
11:16:17:WU02:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
I've tried changing the 'cause' setting, but the client keeps assigning me COVID-19 project 16417 (which is CPU only). I'd like to change that if possible - i.e., how does one get assigned GPU-compatible projects?

The Computational Chemist in me thinks that the

Code: Select all

gmx grompp
of the project topology and coordinate data was set up for fewer cores (hence a domain decomposition issue). However, I would be very surprised if such a work unit made it out to clients if that were the case, so I'll defer to your better judgement on the F@H software side.
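
To make that concrete, here is a toy version of the check I have in mind (illustrative numbers only: the 1.4227 nm comes from the error above, the box size is made up, and real GROMACS domain decomposition also juggles PME ranks and load balancing):

Code: Select all

from itertools import product

# Toy domain-decomposition check: a rank grid (nx, ny, nz) only "fits" if every
# domain edge stays at or above the minimum cell size. The box here is invented.
MIN_CELL = 1.4227                 # nm, taken from the error message above
BOX = (7.0, 7.0, 7.0)             # nm, hypothetical box edges, NOT the real project box

def grids(n_ranks):
    """All (nx, ny, nz) grids whose product is exactly n_ranks."""
    return [(nx, ny, nz)
            for nx, ny, nz in product(range(1, n_ranks + 1), repeat=3)
            if nx * ny * nz == n_ranks]

def fits(grid):
    """True if every domain edge (box edge / cell count) is >= MIN_CELL."""
    return all(edge / cells >= MIN_CELL for edge, cells in zip(BOX, grid))

for n in (12, 15):
    ok = [g for g in grids(n) if fits(g)]
    print(n, "ranks:", ok if ok else "no compatible decomposition")

With those made-up numbers, 15 ranks has no workable grid (every factorisation needs a dimension of 5 or more cells, and 7.0/5 < 1.4227 nm), whereas 12 does - which is the sort of mismatch I suspect is happening here.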

Quick note: I've left this running for three days hoping it would right itself; it continues to pull down work from the same project and fail. One more note: this is my own workhorse machine running my own comp chem calculations, so suggestions on how to restart the F@H core without rebooting the machine would be welcome, if that's all it takes. Thanks.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: ERROR: BAD_WORK_UNIT

Post by Neil-B »

Looks like you may have a 15-core CPU slot ... try backing that down to 12 and it might clear
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
acnash
Posts: 20
Joined: Sun Mar 11, 2018 4:34 pm
Hardware configuration: 32x Intel (R) Xeon (R) CPU E5-2687W 0 @ 3.10GHz
125.88GiB
1 Nvidia GeForce GTX 970
Mint Linux ver 18
Location: Oxford, UK

Re: ERROR: BAD_WORK_UNIT

Post by acnash »

I hope it's that simple!

Sorry if I sound dense, but how do I drop to 12? The Client Control provides "Light = 15, Medium = 30, Full = 31 & with GPU running". It is currently set to "Light".

Nudging it up to "Full" I get:

Code: Select all

12:14:14:WU02:FS00:0xa7:ERROR:There is no domain decomposition for 25 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
This again looks like GROMACS being unable to split the unit cell across that number of ranks.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: ERROR: BAD_WORK_UNIT

Post by Neil-B »

With Linux I am not sure - but if you can find/open Advanced Control, click on Configure, select the Slots tab, click on the CPU folding slot and then click Edit - then change the number of CPU threads to 12.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
acnash
Posts: 20
Joined: Sun Mar 11, 2018 4:34 pm
Hardware configuration: 32x Intel (R) Xeon (R) CPU E5-2687W 0 @ 3.10GHz
125.88GiB
1 Nvidia GeForce GTX 970
Mint Linux ver 18
Location: Oxford, UK

Re: ERROR: BAD_WORK_UNIT

Post by acnash »

Thank you! I hadn't spotted that option before.

I've set it to 4, so the command now comes up as "-nt 4".

It looks to be holding and running without interruption. I'm now just wondering when I can change it back to "-1" and let the client decide; I've never had to set the CPU count manually before.
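
For anyone else doing this on a headless Linux box, the same setting appears to live in the client's config.xml as a slot option, so something along these lines should also work. A sketch only, with my assumptions spelled out: the stock /etc/fahclient/config.xml path, a <slot type='CPU'> element, and the <cpus v='...'/> option that the GUI writes; stop FAHClient before editing, or it may overwrite the file when it exits:

Code: Select all

#!/usr/bin/env python3
# Sketch: set the CPU slot's thread count in FAHClient's config.xml.
# Run as root with the client stopped, otherwise FAHClient may rewrite the file on exit.
# Note: ElementTree drops XML comments, so any comment headers in the file won't survive.
import xml.etree.ElementTree as ET

CONFIG = "/etc/fahclient/config.xml"   # assumed stock path for the Linux package
THREADS = "16"

tree = ET.parse(CONFIG)
for slot in tree.getroot().iter("slot"):
    if slot.get("type", "").upper() != "CPU":
        continue
    cpus = slot.find("cpus")
    if cpus is None:                   # the default config leaves the slot self-closed
        cpus = ET.SubElement(slot, "cpus")
    cpus.set("v", THREADS)
tree.write(CONFIG)

Then start the client again (sudo /etc/init.d/FAHClient start on the stock .deb, as far as I can tell) and check the mdrun line in the log for the new -nt value.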
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: ERROR: BAD_WORK_UNIT

Post by Neil-B »

Right-click on the slot in the status pane and mark it as Finish ... then once it has finished, reset the CPU slot to whatever you like.

On the topic of the number of cores ... -1 is the default but can cause some issues/challenges ... how many cores do you actually have, and what/how many slots do you fold on (CPU & GPU)? ... Do you use the power slider (which doesn't actually work the way you may think it does) or are you happy to just set up a stable set of folding slots? ... If you let us know your priorities/usage style, setting the slot to a specific number of cores might avoid a few challenges in the future - and choosing which core count to use can make quite a difference.

I believe the way the power slider currently works is that High uses the total number of cores your system has, less one for each GPU slot/card ... Medium just reduces the core count by one from High ... Light halves the core count and pauses the GPU slot(s) - it also makes the GPU slots look as if they are paused waiting for idle, which can be confusing.
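
As a toy version of that arithmetic (my reading of it only - the real client behaviour may differ), this reproduces the 15 / 30 / 31 you listed earlier:

Code: Select all

# Toy model of the power slider as described above - my reading only, not official behaviour.
def slider_counts(total_cores, gpu_slots):
    full = total_cores - gpu_slots    # "Full": all cores less one per GPU slot
    medium = full - 1                 # "Medium": one fewer than Full
    light = full // 2                 # "Light": roughly half, with GPU slots paused
    return {"Light": light, "Medium": medium, "Full": full}

print(slider_counts(32, 1))           # {'Light': 15, 'Medium': 30, 'Full': 31}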
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
acnash
Posts: 20
Joined: Sun Mar 11, 2018 4:34 pm
Hardware configuration: 32x Intel (R) Xeon (R) CPU E5-2687W 0 @ 3.10GHz
125.88GiB
1 Nvidia GeForce GTX 970
Mint Linux ver 18
Location: Oxford, UK

Re: ERROR: BAD_WORK_UNIT

Post by acnash »

Thanks for the information. I kind of figured out the slider behaviour by checking the mdrun commands in the system log.

There are 32 cores. If there is any way of running 15/16 of them whilst also using the GPU, that would be awesome. The other half I need for my own work.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: ERROR: BAD_WORK_UNIT

Post by Neil-B »

Where you set 4, once this WU has finished set it to 16 (a better count than 15 - if you are interested, search the forums for "large primes") ... that will limit/run that CPU slot at 16 cores ... however, the FAH cores run at low priority, so you shouldn't have issues with something higher, as your other work should easily take priority (say 24 - if it does have an impact you can step the WU back down to 16 by the method you have just used - but it shouldn't) - GPU is a bit tougher, as the way OSs work GPU is pretty much non-prioritised - if you find you are getting lag then sometimes turning off hardware acceleration can help ... but most importantly, make sure you choose something you are happy with :)
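
If you want a quick way to eyeball candidate core counts, a factorisation check like the sketch below is roughly what the "large primes" rule of thumb boils down to (rough guidance only; as this thread shows, a factor of 5 can bite on smaller boxes):

Code: Select all

# Rule-of-thumb sketch: counts built only from 2s and 3s decompose most flexibly;
# factors of 5 or larger primes can occasionally trip domain decomposition
# (the 15- and 25-rank failures earlier in this thread both contain a factor of 5).
def prime_factors(n):
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors

for count in (15, 16, 24, 25, 30, 32):
    f = prime_factors(count)
    note = "fine" if max(f) <= 3 else "may occasionally hit DD issues"
    print(f"{count:2d} = {' x '.join(map(str, f))}  ->  {note}")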
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
acnash
Posts: 20
Joined: Sun Mar 11, 2018 4:34 pm
Hardware configuration: 32x Intel (R) Xeon (R) CPU E5-2687W 0 @ 3.10GHz
125.88GiB
1 Nvidia GeForce GTX 970
Mint Linux ver 18
Location: Oxford, UK

Re: ERROR: BAD_WORK_UNIT

Post by acnash »

Thanks for the information. The WU finished and now at 16 cores everything seems fine.

During a break from my own calculations, I changed it to 30 cores and the estimated points per day obviously increased; however, my GPU is still idle. The project (14366) is CPU only. Is there any way of forcing the F@H client to pick only CPU+GPU based projects?

Thanks
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: ERROR: BAD_WORK_UNIT

Post by Neil-B »

30 cores is relatively stable (for the most part), but it is a multiple of 5 so may have the occasional issue.

CPU and GPU are totally separate and independent - you will get CPU WUs (GROMACS based) for the CPU slot and GPU WUs (OpenMM based) for the GPU slot, and the client will quite happily run both at once (if the WUs are available) ... There has been a bit of a scarcity of GPU WUs recently, but I believe this is improving and the chances of getting them are increasing ... I believe most people are now back to 24/7 CPU folding ... GPU slots are still showing periods of waiting.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
acnash
Posts: 20
Joined: Sun Mar 11, 2018 4:34 pm
Hardware configuration: 32x Intel (R) Xeon (R) CPU E5-2687W 0 @ 3.10GHz
125.88GiB
1 Nvidia GeForce GTX 970
Mint Linux ver 18
Location: Oxford, UK

Re: ERROR: BAD_WORK_UNIT

Post by acnash »

Thanks for that information; it all makes sense. It's also interesting to learn that F@H folds with OpenMM as well - I always thought it was GROMACS only.

Thanks again.
Anthony
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: ERROR: BAD_WORK_UNIT

Post by Neil-B »

Someone from the "FAH Forum History Department", aka an "Old Timer", will no doubt be able to give chapter and verse on when FAH embraced OpenMM ... I think I spotted somewhere that GROMACS itself is moving to embrace OpenMM in some way - but that may be a fantasy world in my grey matter.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: ERROR: BAD_WORK_UNIT

Post by PantherX »

acnash wrote:...Is there any way of forcing the F@H client to pick only CPU+GPU based projects?...
There's no single WU that can utilize both the CPU and GPU simultaneously yet (it's a cool idea, but I'm not sure whether it's feasible). Instead, you get WUs that are either CPU only or GPU only. You can fold on both CPU and GPU, but they would be processing different WUs, not the same one.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
NoMoreQuarantine
Posts: 168
Joined: Tue Apr 07, 2020 2:38 pm

Re: ERROR: BAD_WORK_UNIT

Post by NoMoreQuarantine »

PantherX wrote:There's no single WU that can utilize both the CPU and GPU simultaneously yet (it's a cool idea, but I'm not sure whether it's feasible). Instead, you get WUs that are either CPU only or GPU only. You can fold on both CPU and GPU, but they would be processing different WUs, not the same one.
I've been wondering for a while now why GPU acceleration isn't used for the Gromacs core: http://www.gromacs.org/Documentation/Ac ... lelization
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: ERROR: BAD_WORK_UNIT

Post by bruce »

"Old Timer" here.

History: FAH was started about 20 years ago at Stanford University. Early simulations were done with a variety of open-source analysis packages, but gradually GROMACS dominated the field. It was updated from plain x86 code to SSE and 3DNow!, and before long SSE2 was adopted as a requirement. Meanwhile, the GPUs had been mostly idle, and a team was put together at Stanford to support the new hardware with OpenMM. A lot has changed since then, but FAH has adopted GROMACS exclusively for CPUs and OpenMM exclusively for GPUs. (The stand-alone versions for individual scientists are not exclusive.)

The OpenMM project is still at Stanford, while FAH has moved out to a consortium of many universities.
https://foldingathome.org/about/the-fol ... onsortium/
https://foldingathome.org/project-timeline/