
Fatal Error with WU

Posted: Sat Mar 14, 2020 10:25 am
by Oipo
One of my computers stopped folding.

Code: Select all

10:22:00:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
10:22:00:WU01:FS00:0xa7:    Version: 0.0.18
10:22:00:WU01:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
10:22:00:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
10:22:00:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
10:22:00:WU01:FS00:0xa7:       Date: Nov 5 2019
10:22:00:WU01:FS00:0xa7:       Time: 06:13:26
10:22:00:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
10:22:00:WU01:FS00:0xa7:     Branch: master
10:22:00:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
10:22:00:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
10:22:00:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
10:22:00:WU01:FS00:0xa7:       Bits: 64
10:22:00:WU01:FS00:0xa7:       Mode: Release
10:22:00:WU01:FS00:0xa7:************************************ Build *************************************
10:22:00:WU01:FS00:0xa7:       SIMD: avx_256
10:22:00:WU01:FS00:0xa7:********************************************************************************
10:22:00:WU01:FS00:0xa7:Project: 14245 (Run 0, Clone 41, Gen 221)
10:22:00:WU01:FS00:0xa7:Unit: 0x0000012d80fccb0a5d6fe0b76a9a2ae3
10:22:00:WU01:FS00:0xa7:Reading tar file core.xml
10:22:00:WU01:FS00:0xa7:Reading tar file frame221.tpr
10:22:00:WU01:FS00:0xa7:Digital signatures verified
10:22:00:WU01:FS00:0xa7:Reducing thread count from 23 to 22 to avoid domain decomposition by a prime number > 3
10:22:00:WU01:FS00:0xa7:Reducing thread count from 22 to 21 to avoid domain decomposition with large prime factor 11
10:22:00:WU01:FS00:0xa7:Calling: mdrun -s frame221.tpr -o frame221.trr -x frame221.xtc -cpt 15 -nt 21
10:22:00:WU01:FS00:0xa7:Steps: first=55250000 total=250000
10:22:00:WU01:FS00:0xa7:ERROR:
10:22:00:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
10:22:00:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
10:22:00:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
10:22:00:WU01:FS00:0xa7:ERROR:
10:22:00:WU01:FS00:0xa7:ERROR:Fatal error:
10:22:00:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 16 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
10:22:00:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
10:22:00:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
10:22:00:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
10:22:00:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
10:22:00:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
10:22:05:WU01:FS00:0xa7:WARNING:Unexpected exit() call
10:22:05:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
10:22:05:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
10:22:05:WU01:FS00:0xa7:Saving result file md.log
10:22:05:WU01:FS00:0xa7:Saving result file science.log
10:22:05:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)

Re: Fatal Error with WU

Posted: Sat Mar 14, 2020 4:07 pm
by foldy
10:22:00:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 16 ranks that is compatible with the given box
I had this error too. I reduced CPU slot threads to 15 to make it work again.
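Reducing the thread count helps because the core has to split the simulation box into one domain cell per rank, and each cell must be at least the minimum size quoted in the full error above (about 1.46 nm here); with fewer ranks the cells are larger, so they fit the box again. As a purely hypothetical illustration: if the box were only about 4 nm along each axis, at most 2 cells of 1.46 nm would fit per axis, so no more than 2 x 2 x 2 = 8 domains could be placed and a 16-rank decomposition would be impossible. (The core above was started with 21 threads but only tried to decompose over 16 of them; the rest presumably go to PME.) 15 also factors nicely (3 x 5), which avoids the "large prime factor" reductions shown in the log.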

Re: Fatal Error with WU

Posted: Sat Mar 14, 2020 11:50 pm
by toTOW
Thanks for the report; I have alerted the researcher in charge of this project.

Have you been able to move to a new WU, or do you need assistance?

Re: Fatal Error with WU

Posted: Sun Mar 15, 2020 12:12 pm
by naw
New user here:

I'm having a similar error with project 14523:

Code: Select all

12:08:14:WU00:FS00:Running FahCore: /opt/fah/FAHCoreWrapper /opt/fah/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 725 -checkpoint 15 -np 10
12:08:14:WU00:FS00:Started FahCore on PID 9801
12:08:14:WU00:FS00:Core PID:9805
12:08:14:WU00:FS00:FahCore 0xa7 started
12:08:15:WU00:FS00:0xa7:*********************** Log Started 2020-03-15T12:08:14Z ***********************
12:08:15:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
12:08:15:WU00:FS00:0xa7:       Type: 0xa7
12:08:15:WU00:FS00:0xa7:       Core: Gromacs
12:08:15:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 9801 -checkpoint 15 -np
12:08:15:WU00:FS00:0xa7:             10
12:08:15:WU00:FS00:0xa7:************************************ CBang *************************************
12:08:15:WU00:FS00:0xa7:       Date: Nov 5 2019
12:08:15:WU00:FS00:0xa7:       Time: 06:06:57
12:08:15:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
12:08:15:WU00:FS00:0xa7:     Branch: master
12:08:15:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
12:08:15:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
12:08:15:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
12:08:15:WU00:FS00:0xa7:       Bits: 64
12:08:15:WU00:FS00:0xa7:       Mode: Release
12:08:15:WU00:FS00:0xa7:************************************ System ************************************
12:08:15:WU00:FS00:0xa7:        CPU: AMD Ryzen 5 3600 6-Core Processor
12:08:15:WU00:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
12:08:15:WU00:FS00:0xa7:       CPUs: 12
12:08:15:WU00:FS00:0xa7:     Memory: 31.38GiB
12:08:15:WU00:FS00:0xa7:Free Memory: 233.72MiB
12:08:15:WU00:FS00:0xa7:    Threads: POSIX_THREADS
12:08:15:WU00:FS00:0xa7: OS Version: 5.5
12:08:15:WU00:FS00:0xa7:Has Battery: false
12:08:15:WU00:FS00:0xa7: On Battery: false
12:08:15:WU00:FS00:0xa7: UTC Offset: 1
12:08:15:WU00:FS00:0xa7:        PID: 9805
12:08:15:WU00:FS00:0xa7:        CWD: /opt/fah/work
12:08:15:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
12:08:15:WU00:FS00:0xa7:    Version: 0.0.18
12:08:15:WU00:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
12:08:15:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
12:08:15:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
12:08:15:WU00:FS00:0xa7:       Date: Nov 5 2019
12:08:15:WU00:FS00:0xa7:       Time: 06:13:26
12:08:15:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
12:08:15:WU00:FS00:0xa7:     Branch: master
12:08:15:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
12:08:15:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
12:08:15:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
12:08:15:WU00:FS00:0xa7:       Bits: 64
12:08:15:WU00:FS00:0xa7:       Mode: Release
12:08:15:WU00:FS00:0xa7:************************************ Build *************************************
12:08:15:WU00:FS00:0xa7:       SIMD: avx_256
12:08:15:WU00:FS00:0xa7:********************************************************************************
12:08:15:WU00:FS00:0xa7:Project: 14523 (Run 995, Clone 1, Gen 2)
12:08:15:WU00:FS00:0xa7:Unit: 0x0000000580fccb0a5e459bbf07fe89e0
12:08:15:WU00:FS00:0xa7:Reading tar file core.xml
12:08:15:WU00:FS00:0xa7:Reading tar file frame2.tpr
12:08:15:WU00:FS00:0xa7:Digital signatures verified
12:08:15:WU00:FS00:0xa7:Calling: mdrun -s frame2.tpr -o frame2.trr -x frame2.xtc -cpt 15 -nt 10
12:08:15:WU00:FS00:0xa7:Steps: first=500000 total=250000
12:08:15:WU00:FS00:0xa7:ERROR:
12:08:15:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
12:08:15:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
12:08:15:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
12:08:15:WU00:FS00:0xa7:ERROR:
12:08:15:WU00:FS00:0xa7:ERROR:Fatal error:
12:08:15:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
12:08:15:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
12:08:15:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
12:08:15:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
12:08:15:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
12:08:15:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
12:08:20:WU00:FS00:0xa7:WARNING:Unexpected exit() call
12:08:20:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
12:08:20:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
12:08:20:WU00:FS00:0xa7:Saving result file md.log
12:08:20:WU00:FS00:0xa7:Saving result file science.log
12:08:20:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
12:09:00:WU01:FS01:0x22:Completed 130000 out of 1000000 steps (13%)
12:09:14:WU00:FS00:Starting
12:09:14:WARNING:WU00:FS00:AS lowered CPUs from 11 to 10
The last line is interesting: 12:09:14:WARNING:WU00:FS00:AS lowered CPUs from 11 to 10

My processor is a Ryzen 5 3600: 6 cores, 12 threads.
Any tips on how to proceed? Should I delete the WU?

Re: Fatal Error with WU

Posted: Sun Mar 15, 2020 4:20 pm
by Joe_H
You can try manually reducing the thread count further with FAHControl; alternatively, moving the slider to Light should cause it to run on 6 threads, if I recall correctly.

Re: Fatal Error with WU

Posted: Sun Mar 15, 2020 6:15 pm
by Oipo
Hey Joe_H/toTOW,

I am still stuck on this particular WU. Even changing the slider to Medium or Light does not help. FAHControl is missing on my Linux machine; is there a way I can edit a file to get rid of it instead?

Re: Fatal Error with WU

Posted: Sun Mar 15, 2020 6:51 pm
by naw
Hello

After messing around a bit with the core numbers, I got assigned a new WU (project 13826). I had to wait several hours for it, though.
Oipo wrote:Hey Joe_H/toTOW,

I am still stuck on this particular WU. Even changing the slider to Medium or Light does not help. FAHControl is missing on my Linux machine; is there a way I can edit a file to get rid of it instead?
Edit your config.xml and try:

Code: Select all

  <slot id='0' type='CPU'>
    <cpus v='8'/>
  </slot>
My current config.xml is:

Code: Select all

<config>
  <!-- Network -->
  <proxy v=':8080'/>

  <!-- Slot Control -->
  <power v='FULL'/>

  <!-- User Information -->
  <passkey v='xxxxxxxxxxxxxxxxxxxxx'/>
  <team v='xxxxxxx'/>
  <user v='xxxxxxxxx'/>

  <!-- Folding Slots -->
  <slot id='0' type='CPU'>
    <cpus v='8'/>
  </slot>
  <slot id='1' type='GPU'/>
</config>

Re: Fatal Error with WU

Posted: Mon Mar 16, 2020 7:50 am
by Oipo
That did it, thanks. It's currently folding the WU again.

Re: Fatal Error with WU

Posted: Mon Mar 16, 2020 8:15 am
by muziqaz
The simple way of sorting this issue is to pause the slot, go into Configure, then the Slots tab, then edit your CPU slot and reduce the thread count to, say, 9 or 8. Then restart the slot :)
If you go into Configure and adjust the thread count while the slot is still running, the slot will still treat the CPU as 10 cores.

Re: Fatal Error with WU

Posted: Thu Mar 19, 2020 9:37 am
by dfreeman
I've had this happen twice now. The first time, I accidentally lost the WU while trying to fix the problem, so that one is unfortunately going to time out.

From the logs, I can see that it's setting -np 63, whereas I have 64 cores and I want to use all of them. So I've added <cpus v='64'/> to my config file.
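For reference, that entry just goes inside the CPU slot definition, the same shape as the config.xml posted earlier in this thread (the slot id is whatever yours happens to be):

Code: Select all

  <slot id='0' type='CPU'>
    <cpus v='64'/>
  </slot>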

Unfortunately, yet again I found it stuck in a loop of failing on a WU, the same as other posters have reported. The log contains these lines:

Code: Select all

09:19:29:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
09:19:29:WU01:FS00:Starting
09:19:29:WARNING:WU01:FS00:AS lowered CPUs from 64 to 63
09:19:29:WU01:FS00:Removing old file './work/01/logfile_01-20200319-084819.txt'
09:19:29:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper //cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 1 -checkpoint 15 -np 63
So why is it lowering the CPU count?

I've temporarily solved the problem by setting CPUs to 60, which has no large prime factors.
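(For what it's worth: 63 = 3 x 3 x 7, and 7 is exactly the kind of large prime factor the core tries to avoid; the log earlier in the thread shows it stepping down from 23 and 22 threads for the same reason. 60 = 2 x 2 x 3 x 5 and 64 = 2^6 contain only small factors.)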

Re: Fatal Error with WU

Posted: Thu Mar 19, 2020 4:41 pm
by Joe_H
@dfreeman We would need to see the folding configuration to get an idea of why the server assigned the WU at a count of 63. I will pass on the information, as it shouldn't have used that number; it is a multiple of 7. I just need to know which project the WU is from.

Once a WU is downloaded at a specific thread count, it will not run at a higher count, just a lower one. Usable counts are those made up of factors of 2, 3, and sometimes 5, if the protein system is large enough.
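For a 64-thread machine, for example, the nearby counts that break down into only 2s, 3s and 5s are 64, 60, 54, 50, 48 and 45; 63 (3 x 3 x 7) and 62 (2 x 31) do not.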

Re: Fatal Error with WU

Posted: Thu Mar 19, 2020 7:29 pm
by dfreeman
The use of 63 cores was not specific to the WU or the project; it was doing that the whole time. (I've only been running F@H for a couple of days.)

I was running it in a Docker container, derived from https://github.com/johnktims/folding-at-home .

I've since changed to two containers, with each running a single CPU slot of 32 cores. Each container is pinned to a different set of cores.

If you mean the config.xml, there was nothing really in it, just a CPU slot definition created at first run.

Re: Fatal Error with WU

Posted: Thu Mar 19, 2020 7:49 pm
by Joe_H
Another thought: since you mentioned the slot configuration put in at first run, do you know if the Docker setup used the FAHControl Configure function (or equivalent) to set the slots to a specific CPU thread count? The default settings at install time are normally aimed at home folders, so the client starts at Medium and sets the CPU count to -1 to let the software determine how many cores are available. The install also reserves a CPU thread for any GPU detected, but that does not seem to apply here.

With the default Medium setting, another CPU thread is reserved for the system's user; Full does not reserve one. If things were at default settings, that could explain the request for 63 threads (64 threads minus the one reserved).

Re: Fatal Error with WU

Posted: Fri Mar 20, 2020 2:13 am
by dfreeman
The initial core count was not set; the config file didn't contain an entry for it. The Dockerfile listed in that git repository simply starts with a blank Debian system, installs F@H with its dependencies, snapshots it for future use, and then runs the client with the chosen command-line options. One of those options is gpu=false. There is no explicit configuration step.

Note that the Dockerfile was updated 15 hours ago to switch to a non-root user, but I don't have that version.

I'm just wondering whether there is a hard-coded limit preventing core counts of 64 or more. Are other users running with more? Also, is there an advantage to large core counts? I read that bigadv requires >= 24 cores, but then I also read that bigadv was discontinued. This is confusing, as bigadv is still mentioned on the F@H web site.

It would be most efficient to run one CPU slot per socket, or per node, so as to remove any NUMA bottlenecks. Per socket, that would be four slots of 16 cores each. This would likely give a bit higher raw performance, but would it rule out any WUs? I'd rather be inclusive of WUs that depend on a high core count.
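If I went that route, I imagine the config.xml would just contain four CPU slots along the lines of the example earlier in this thread (an untested sketch; the client itself would not pin each slot to a socket, that would still have to be done externally, e.g. with CPU pinning as I'm doing with the containers now):

Code: Select all

  <slot id='0' type='CPU'>
    <cpus v='16'/>
  </slot>
  <slot id='1' type='CPU'>
    <cpus v='16'/>
  </slot>
  <slot id='2' type='CPU'>
    <cpus v='16'/>
  </slot>
  <slot id='3' type='CPU'>
    <cpus v='16'/>
  </slot>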

Re: Fatal Error with WU

Posted: Fri Mar 20, 2020 2:42 am
by Joe_H
No hard-coded limit that I know of; the client and core have been set up with many more than 64 cores configured. The A7 core will fold on a large number of CPU threads if the protein being simulated is large enough. In the past there have been some tests on WUs that could work with over 100 CPU threads, but recently the large systems have been going to GPU processing.

A number of the COVID-19 CPU projects are large enough to use more cores, though the lack of systems with that many has limited the ability to test just how many. Prior tests with large core-count settings did show that for some WUs there were points at which giving more threads would still work but not give much improvement in processing time. You are welcome to try different thread-count setups and see what works well for your system.

But 16 cores for each of 4 slots will process well too. You mentioned NUMA, and I recall there having been a setting for that in the client or for one of the folding cores, but I do not recall the details and have no idea if it is still there.