Project 16404 (0, 4835, 72) -- no domain decomposition

Posted: Sun Apr 19, 2020 9:55 pm
by rusty
Hello,

I have received a WU that repeatedly fails with the error message reproduced below: there is no domain decomposition for 20 ranks that is compatible with the given box.

So, that CPU slot was stuck in a loop, attempting to run the WU, erroring out, and then trying again.

I manually reduced my number of usable threads to 18 and that seems to have gotten the unit running again.

I just wanted to be sure that this issue is known. It seems that this WU should not have been served to my configuration.

Thanks in advance. Details follow.

Machine:

Code:

21:36:50:WU01:FS00:Started FahCore on PID 378401
21:36:50:WU01:FS00:Core PID:378405
21:36:50:WU01:FS00:FahCore 0xa7 started
21:36:50:WU01:FS00:0xa7:*********************** Log Started 2020-04-19T21:36:50Z ***********************
21:36:50:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
21:36:50:WU01:FS00:0xa7:       Type: 0xa7
21:36:50:WU01:FS00:0xa7:       Core: Gromacs
21:36:50:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 378401 -checkpoint 15 -np
21:36:50:WU01:FS00:0xa7:             29
21:36:50:WU01:FS00:0xa7:************************************ CBang *************************************
21:36:50:WU01:FS00:0xa7:       Date: Nov 5 2019
21:36:50:WU01:FS00:0xa7:       Time: 06:06:57
21:36:50:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
21:36:50:WU01:FS00:0xa7:     Branch: master
21:36:50:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
21:36:50:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
21:36:50:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
21:36:50:WU01:FS00:0xa7:       Bits: 64
21:36:50:WU01:FS00:0xa7:       Mode: Release
21:36:50:WU01:FS00:0xa7:************************************ System ************************************
21:36:50:WU01:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
21:36:50:WU01:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 63 Stepping 2
21:36:50:WU01:FS00:0xa7:       CPUs: 32
21:36:50:WU01:FS00:0xa7:     Memory: 503.81GiB
21:36:50:WU01:FS00:0xa7:Free Memory: 417.90GiB
21:36:50:WU01:FS00:0xa7:    Threads: POSIX_THREADS
21:36:50:WU01:FS00:0xa7: OS Version: 5.5
21:36:50:WU01:FS00:0xa7:Has Battery: false
21:36:50:WU01:FS00:0xa7: On Battery: false
21:36:50:WU01:FS00:0xa7: UTC Offset: -4
21:36:50:WU01:FS00:0xa7:        PID: 378405
21:36:50:WU01:FS00:0xa7:        CWD: /opt/fah/work
21:36:50:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
21:36:50:WU01:FS00:0xa7:    Version: 0.0.18
21:36:50:WU01:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
21:36:50:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
21:36:50:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
21:36:50:WU01:FS00:0xa7:       Date: Nov 5 2019
21:36:50:WU01:FS00:0xa7:       Time: 06:13:26
21:36:50:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
21:36:50:WU01:FS00:0xa7:     Branch: master
21:36:50:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
21:36:50:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
21:36:50:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
21:36:50:WU01:FS00:0xa7:       Bits: 64
21:36:50:WU01:FS00:0xa7:       Mode: Release
21:36:50:WU01:FS00:0xa7:************************************ Build *************************************
21:36:50:WU01:FS00:0xa7:       SIMD: avx_256
21:36:50:WU01:FS00:0xa7:********************************************************************************
Error Message:

Code:

21:36:50:WU01:FS00:0xa7:Project: 16404 (Run 0, Clone 4835, Gen 72)
21:36:50:WU01:FS00:0xa7:Unit: 0x0000004fa8f5c67d5e7eb9072a30cb57
21:36:50:WU01:FS00:0xa7:Reading tar file core.xml
21:36:50:WU01:FS00:0xa7:Reading tar file frame72.tpr
21:36:50:WU01:FS00:0xa7:Digital signatures verified
21:36:50:WU01:FS00:0xa7:Reducing thread count from 29 to 28 to avoid domain decomposition by a prime number > 3
21:36:50:WU01:FS00:0xa7:Calling: mdrun -s frame72.tpr -o frame72.trr -x frame72.xtc -cpt 15 -nt 28
21:36:50:WU01:FS00:0xa7:Steps: first=36000000 total=500000
21:36:50:WU01:FS00:0xa7:ERROR:
21:36:50:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
21:36:50:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
21:36:50:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
21:36:50:WU01:FS00:0xa7:ERROR:
21:36:50:WU01:FS00:0xa7:ERROR:Fatal error:
21:36:50:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
21:36:50:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
21:36:50:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
21:36:50:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
21:36:50:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
21:36:50:WU01:FS00:0xa7:ERROR:-------------------------------------------------------

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Posted: Sun Apr 19, 2020 9:57 pm
by Neil-B
Try changing the CPU slot to 24 cores ... 25 through 31 are prone to issues

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Posted: Sun Apr 19, 2020 10:04 pm
by rusty
Fair enough. The system has two 8-core CPUs (with SMT), so I split the slot into two 16-thread CPU slots. Hopefully FAH is smart enough to set the affinity to 1 CPU per WU (or maybe it just punts to the kernel's scheduler?)

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Posted: Sun Apr 19, 2020 10:09 pm
by Neil-B
If you have that and aren't running a GPU, then run a single 32-core slot ... in my experience it is very stable and better for the science and points than 2x 16 ... I can't see from your logs why it was running as a 29-core slot

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Posted: Sun Apr 19, 2020 10:11 pm
by rusty
Never mind. No, it is not smart enough.

Code:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
 379034 fah       39  19 1391160 257568  13068 R  1596   0.0 133:59.22 FahCore+ 
 379060 fah       39  19 1284932 106348  13200 R  1594   0.0 131:33.24 FahCore+ 

Code:

$ taskset -cp 379060
pid 379060's current affinity list: 0-31
$ taskset -cp 379034
pid 379034's current affinity list: 0-31
Oh well...
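
For anyone who wants to script the same affinity check, here is a minimal Python sketch (Linux only). It uses the standard library calls os.sched_getaffinity and os.sched_setaffinity. The helper names (show_affinity, pin_to_cpus, pin_all_threads) and the CPU ranges in the comments are made up for illustration. They are not part of the FAH client, and which CPU numbers belong to which socket is an assumption that should be checked with lscpu.

```python
import os


def show_affinity(pid=0):
    """Return the sorted list of CPUs the task may run on (0 = this process)."""
    return sorted(os.sched_getaffinity(pid))


def pin_to_cpus(pid, cpus):
    """Restrict one task (a single thread ID) to the given CPUs."""
    os.sched_setaffinity(pid, set(cpus))
    return show_affinity(pid)


def pin_all_threads(pid, cpus):
    """Restrict every existing thread of a process to the given CPUs.

    sched_setaffinity acts on one thread at a time, so the thread IDs are
    read from /proc/<pid>/task and pinned one by one.
    """
    pinned = []
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            os.sched_setaffinity(int(tid), set(cpus))
            pinned.append(int(tid))
        except ProcessLookupError:
            # The thread exited between listing and pinning; skip it.
            continue
    return sorted(pinned)


if __name__ == "__main__":
    print("current affinity:", show_affinity())
    # Hypothetical use, assuming CPUs 0-15 sit on one socket (verify with lscpu):
    #   pin_all_threads(379034, range(0, 16))
    #   pin_all_threads(379060, range(16, 32))
```

Pinning a process owned by another user (fah here) needs the matching privileges. The setting applies only to the process that is running now, so it would presumably have to be reapplied whenever a new FahCore process starts.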

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Posted: Sun Apr 19, 2020 10:17 pm
by rusty
Well, I was running it at 30 (not 32, because the FAH client bumped it down to 30 upon install) to, presumably, avoid issues like the one I just ran into with the rank decomposition.

The default setting of 30 had been working without issue since I commissioned this machine for folding about a month ago.

In any case, I'm surprised that the work server passed this WU to my configuration.

Thanks for your help. I'll keep playing with this.

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Posted: Sun Apr 19, 2020 10:23 pm
by Neil-B
30 is divisible by 5, which is sometimes an issue ... the install may have used 30 to leave cores for GPUs ... if not using GPUs for folding, 32 would be a solid choice tbh
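
The divisibility point can be checked with a small Python sketch. It factors each candidate thread count and flags any count that has a prime factor larger than 3. That rule is only an assumed heuristic, pieced together from the "prime number > 3" line in the log and the note on 5 above. It is not the client's or GROMACS's actual decision logic, and prime_factors and looks_safe are illustrative names.

```python
def prime_factors(n):
    """Return the prime factors of n in ascending order, with repeats."""
    factors = []
    divisor = 2
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    if n > 1:
        # Whatever remains after trial division is itself prime.
        factors.append(n)
    return factors


def looks_safe(threads):
    """Assumed heuristic: a count is 'safe' if its only prime factors are 2 and 3."""
    return all(p <= 3 for p in prime_factors(threads))


if __name__ == "__main__":
    for threads in range(16, 33):
        factors = " x ".join(str(p) for p in prime_factors(threads))
        verdict = "ok" if looks_safe(threads) else "risky"
        print(f"{threads:2d} = {factors}: {verdict}")
```

Under this heuristic 18, 24 and 32 pass, while 28 (factor 7), 29 (prime) and 30 (factor 5) are flagged. The rank count in the error message, 20, also has a factor of 5. The heuristic is not the whole story: whether a decomposition exists also depends on the box and the minimum cell size, as the error text says, so a count marked "ok" here can still fail on a particular WU.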

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Posted: Sun Apr 19, 2020 10:25 pm
by Neil-B
A quick search for "large primes" on these forums should find you a thread where JimboPalmer explains the best core numbers and why

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Posted: Sun Apr 19, 2020 10:27 pm
by rusty
Ah, yes... that's right. It left the other 2 cores for the two GPUs in this machine.

Thanks for the tip on divisibility by 5. Now let me see if I can find a sustainable configuration I don't have to keep an eye on... :roll:

Re: Project 16404 (0, 4835, 72) -- no domain decomposition

Posted: Sun Apr 19, 2020 10:29 pm
by Neil-B
My 32 core has never faulted ... I probably shouldn't have typed that ?!