Fixed: Project: 14524 - ERROR:There is no domain for 10 rank

Posted: Sun Apr 26, 2020 12:46 am
by TheSnowedone
Hello all,
I'm getting the following error on my Ubuntu 20.04 install (11 cores are currently dedicated to FAHClient). Any suggestions on how to remedy this would be appreciated.

Code:

00:38:43:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 706 -lifeline 1268 -checkpoint 15 -np 11
00:38:43:WU01:FS00:Started FahCore on PID 113671
00:38:43:WU01:FS00:Core PID:113675
00:38:43:WU01:FS00:FahCore 0xa7 started
00:38:44:WU01:FS00:0xa7:*********************** Log Started 2020-04-26T00:38:43Z ***********************
00:38:44:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
00:38:44:WU01:FS00:0xa7:       Type: 0xa7
00:38:44:WU01:FS00:0xa7:       Core: Gromacs
00:38:44:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 706 -lifeline 113671 -checkpoint 15 -np
00:38:44:WU01:FS00:0xa7:             11
00:38:44:WU01:FS00:0xa7:************************************ CBang *************************************
00:38:44:WU01:FS00:0xa7:       Date: Nov 5 2019
00:38:44:WU01:FS00:0xa7:       Time: 06:06:57
00:38:44:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
00:38:44:WU01:FS00:0xa7:     Branch: master
00:38:44:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
00:38:44:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
00:38:44:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
00:38:44:WU01:FS00:0xa7:       Bits: 64
00:38:44:WU01:FS00:0xa7:       Mode: Release
00:38:44:WU01:FS00:0xa7:************************************ System ************************************
00:38:44:WU01:FS00:0xa7:        CPU: Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz
00:38:44:WU01:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 79 Stepping 1
00:38:44:WU01:FS00:0xa7:       CPUs: 12
00:38:44:WU01:FS00:0xa7:     Memory: 46.97GiB
00:38:44:WU01:FS00:0xa7:Free Memory: 1.69GiB
00:38:44:WU01:FS00:0xa7:    Threads: POSIX_THREADS
00:38:44:WU01:FS00:0xa7: OS Version: 5.4
00:38:44:WU01:FS00:0xa7:Has Battery: false
00:38:44:WU01:FS00:0xa7: On Battery: false
00:38:44:WU01:FS00:0xa7: UTC Offset: 10
00:38:44:WU01:FS00:0xa7:        PID: 113675
00:38:44:WU01:FS00:0xa7:        CWD: /var/lib/fahclient/work
00:38:44:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
00:38:44:WU01:FS00:0xa7:    Version: 0.0.18
00:38:44:WU01:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
00:38:44:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
00:38:44:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
00:38:44:WU01:FS00:0xa7:       Date: Nov 5 2019
00:38:44:WU01:FS00:0xa7:       Time: 06:13:26
00:38:44:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
00:38:44:WU01:FS00:0xa7:     Branch: master
00:38:44:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
00:38:44:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
00:38:44:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
00:38:44:WU01:FS00:0xa7:       Bits: 64
00:38:44:WU01:FS00:0xa7:       Mode: Release
00:38:44:WU01:FS00:0xa7:************************************ Build *************************************
00:38:44:WU01:FS00:0xa7:       SIMD: avx_256
00:38:44:WU01:FS00:0xa7:********************************************************************************
00:38:44:WU01:FS00:0xa7:Project: 14524 (Run 982, Clone 0, Gen 14)
00:38:44:WU01:FS00:0xa7:Unit: 0x0000001e80fccb0a5e459b8d0f57e0fa
00:38:44:WU01:FS00:0xa7:Reading tar file core.xml
00:38:44:WU01:FS00:0xa7:Reading tar file frame14.tpr
00:38:44:WU01:FS00:0xa7:Digital signatures verified
00:38:44:WU01:FS00:0xa7:Reducing thread count from 11 to 10 to avoid domain decomposition by a prime number > 3
00:38:44:WU01:FS00:0xa7:Calling: mdrun -s frame14.tpr -o frame14.trr -x frame14.xtc -cpt 15 -nt 10
00:38:44:WU01:FS00:0xa7:Steps: first=3500000 total=250000
00:38:44:WU01:FS00:0xa7:ERROR:
00:38:44:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
00:38:44:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
00:38:44:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
00:38:44:WU01:FS00:0xa7:ERROR:
00:38:44:WU01:FS00:0xa7:ERROR:Fatal error:
00:38:44:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
00:38:44:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
00:38:44:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
00:38:44:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
00:38:44:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
00:38:44:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
00:38:49:WU01:FS00:0xa7:WARNING:Unexpected exit() call
00:38:49:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
00:38:49:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
00:38:49:WU01:FS00:0xa7:Saving result file md.log
00:38:49:WU01:FS00:0xa7:Saving result file science.log
00:38:49:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
EDIT: Marking as Fixed - see solution below:
_r2w_ben wrote: This particular work unit does not run on 11 threads, which the core downgrades to 10, and 10 fails as well. Please change the slot configuration to 9 CPUs and it should work. Once the work unit finishes, you can go back to 10/11.

Please do so by editing /etc/fahclient/config.xml

Replace this part

Code:

<slot id='0' type='CPU' />

with this

Code:

<slot id='0' type='CPU'>
    <cpus v='9'/>
</slot>

Re: Project: 14524 - ERROR:There is no domain decomposition f

Posted: Sun Apr 26, 2020 1:22 am
by _r2w_ben
Welcome to the forum TheSnowedone!

This particular work unit does not run on 11 threads, which the core downgrades to 10, and 10 fails as well. Please change the slot configuration to 9 CPUs and it should work. Once the work unit finishes, you can go back to 10/11.

Please do so by editing /etc/fahclient/config.xml

Replace this part

Code:

<slot id='0' type='CPU' />

with this

Code:

<slot id='0' type='CPU'>
    <cpus v='9'/>
</slot>
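
For anyone who would rather script this change than edit the file by hand, here is a minimal Python sketch. It assumes the slot layout shown above and works on an XML string rather than the live file; the function name `set_slot_cpus` is made up for illustration. It is probably safest to stop FAHClient before changing the real /etc/fahclient/config.xml.

```python
import xml.etree.ElementTree as ET


def set_slot_cpus(config_xml: str, slot_id: str, cpus: int) -> str:
    """Return config_xml with <cpus v='N'/> set inside the matching slot.

    Hypothetical helper, not part of FAHClient. Note that the default
    ElementTree parser drops XML comments, so a round trip through this
    function loses any comments in the original file.
    """
    root = ET.fromstring(config_xml)
    for slot in root.iter("slot"):
        if slot.get("id") == slot_id:
            cpus_el = slot.find("cpus")
            if cpus_el is None:
                # The slot was self-closing, so add the child element.
                cpus_el = ET.SubElement(slot, "cpus")
            cpus_el.set("v", str(cpus))
            return ET.tostring(root, encoding="unicode")
    raise ValueError(f"no slot with id {slot_id!r}")


# Assumed minimal config; a real config.xml contains more settings.
example = "<config><slot id='0' type='CPU'/></config>"
updated = set_slot_cpus(example, "0", 9)
print(updated)
```

Calling the function again with 11 after the work unit finishes would restore the earlier thread count.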

Re: Project: 14524 - ERROR:There is no domain decomposition f

Posted: Sun Apr 26, 2020 2:39 am
by TheSnowedone
Thanks for the welcome and the fix - worked like a charm. :)

Re: Fixed: Project: 14524 - ERROR:There is no domain for 10

Posted: Thu Jun 11, 2020 6:57 am
by peepsalot
Could FAHClient please just try another number when it gets these domain decomposition errors instead of hammering the same work unit over and over?

I have a headless 6C/12T computer that I occasionally log in to remotely. I set it up with a mostly blank/auto configuration with power="full", 1 CPU slot and 1 GPU slot. I guess the first "cpu" (a thread, actually) is used by the GPU, leaving only 11 for the other WUs.
When I looked at it the other day, it was getting these errors and retrying multiple times per minute. I have no idea how many days it had been trying the same work unit over and over with the same settings.

Then it took me a couple more days before I actually found this forum post with the solution. It's really frustrating that config.xml doesn't seem to be documented at all; there is only info for GUI users, and command-line users are considered to be "experts". Hell, maybe I am a computer expert, but I'm not a Folding At Home expert!

Also, seeing this message hurts my brain: "Reducing thread count from 11 to 10 to avoid domain decomposition by a prime number > 3"
If it's actually trying for a number not divisible by any prime larger than 3, then it has failed miserably.
Here's a list of all such numbers up to 256:
1,2,3,4,6,8,9,12,16,18,24,27,32,36,48,54,64,72,81,96,108,128,144,162,192,216,243,256
Could it maybe just default to the next number down the list when it gets an error?
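
As a rough illustration of that suggestion, the list above can be regenerated and searched with a few lines of Python. This is only a sketch of the idea, not how FAHClient or FahCore_a7 behaves; the function names are made up for the example.

```python
def largest_prime_factor(n: int) -> int:
    """Largest prime factor of n by trial division (returns 1 for n == 1)."""
    largest = 1
    p = 2
    while p * p <= n:
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    if n > 1:
        # Whatever remains after trial division is itself prime.
        largest = n
    return largest


def smooth_counts(limit: int) -> list:
    """Thread counts up to limit that are not divisible by any prime > 3."""
    return [n for n in range(1, limit + 1) if largest_prime_factor(n) <= 3]


def next_lower(n: int) -> int:
    """Largest count strictly below n with no prime factor above 3 (n >= 2)."""
    return max(smooth_counts(n - 1))


print(smooth_counts(256))
print(largest_prime_factor(10))   # 10 = 2 * 5, so it is still divisible by 5
print(next_lower(11), next_lower(10))
```

Starting from 11 or from 10, the next number down the list is 9, which matches the value that worked for this work unit.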
_r2w_ben wrote:Once the work unit finishes, you can go back to 10/11.
I mean, crisis averted for the moment, and thank you for showing the config settings to fix it. But if I wanted to use as much compute power as possible, am I really expected to babysit the program, keep a list of which work unit types are ok with X number of threads, and change my config file manually each time a bad one pops up?

Re: Fixed: Project: 14524 - ERROR:There is no domain for 10

Posted: Thu Jun 11, 2020 9:04 am
by PantherX
Welcome to the F@H Forum peepsalot,
peepsalot wrote:Could FAHClient please just try another number when it gets these domain decomposition errors instead of hammering the same work unit over and over?...
That was the original idea about 10 years ago, and it worked when quad cores without HT/SMT were mainstream. There are plans to work on this, so let's wait and see what happens.
peepsalot wrote:...Also, seeing this message hurts my brain: "Reducing thread count from 11 to 10 to avoid domain decomposition by a prime number > 3"
If it's actually trying for a number not divisible by any prime larger than 3, then it has failed miserably.
Here's a list of all such numbers up to 256:
1,2,3,4,6,8,9,12,16,18,24,27,32,36,48,54,64,72,81,96,108,128,144,162,192,216,243,256
Could it maybe just default to the next number down the list when it gets an error?...
That message doesn't convey the full picture and is a bit dated. The issue is that some CPU values result in domain decomposition errors. In simple terms, it means that the assigned WU can't be divided successfully across the given number of CPUs. FahCore_a7 knows some common "bad numbers", but that is only part of the picture. Let's hope that the next version (if it is in development) can address this issue more gracefully, without requiring manual intervention.
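
To make "can't be divided" a bit more concrete, here is a simplified Python sketch. It tries every way of splitting a rectangular box into nx * ny * nz cells and checks that no cell is thinner than the minimum cell size from the error message (1.4227 nm). The 5 nm cubic box is purely hypothetical, since the real box is stored in the work unit's .tpr file, and real GROMACS also considers things this sketch ignores (separate PME ranks, non-rectangular boxes, load balancing), so treat it as an illustration only.

```python
from itertools import product

MIN_CELL_NM = 1.4227        # taken from the error message in this thread
BOX_NM = (5.0, 5.0, 5.0)    # HYPOTHETICAL box size, assumed for illustration


def grids(ranks):
    """All (nx, ny, nz) with nx * ny * nz == ranks."""
    divisors = [d for d in range(1, ranks + 1) if ranks % d == 0]
    return [(nx, ny, ranks // (nx * ny))
            for nx, ny in product(divisors, repeat=2)
            if ranks % (nx * ny) == 0]


def fits(ranks, box=BOX_NM, min_cell=MIN_CELL_NM):
    """True if some grid keeps every cell at least min_cell wide."""
    return any(all(length / n >= min_cell for length, n in zip(box, grid))
               for grid in grids(ranks))


for ranks in range(8, 13):
    print(ranks, "ok" if fits(ranks) else "no valid decomposition")
```

With this assumed box, at most 3 cells fit along each side, so 10 (which needs a side split into 5 or 10) and 11 fail, while 9 = 3 x 3 x 1 works. Other box sizes would give a different set of good and bad numbers, which is why the good values differ between projects.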
peepsalot wrote:...But if I wanted to use as much compute power as possible, am I really expected to babysit the program, keep a list of which work unit types are ok with X number of threads, and change my config file manually each time a bad one pops up?
The value of 8 is bullet-proof since 8 CPUs is a standard number of threads in the systems that researchers use. The reason is that GROMACS users tend to run on 100% dedicated systems, so values like 2, 4, 8, 12, 16, 24, 32 are known to always work well since those are typical total CPU counts on a dedicated system. It gets tricky when you deviate from them. Here's a post from _r2w_ben which has a list of currently good CPU values for specific projects: viewtopic.php?f=72&t=34350&start=45