WU crashing infinitely

Posted: Fri Nov 08, 2019 3:38 am
by Nuitari

Code: Select all

03:35:38:WU01:FS00:FahCore 0xa7 started
03:35:38:WU01:FS00:0xa7:*********************** Log Started 2019-11-08T03:35:38Z ***********************
03:35:38:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
03:35:38:WU01:FS00:0xa7:       Type: 0xa7
03:35:38:WU01:FS00:0xa7:       Core: Gromacs
03:35:38:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 8918 -checkpoint 15 -np
03:35:38:WU01:FS00:0xa7:             15
03:35:38:WU01:FS00:0xa7:************************************ CBang *************************************
03:35:38:WU01:FS00:0xa7:       Date: Nov 5 2019
03:35:38:WU01:FS00:0xa7:       Time: 06:06:57
03:35:38:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
03:35:38:WU01:FS00:0xa7:     Branch: master
03:35:38:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
03:35:38:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
03:35:38:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
03:35:38:WU01:FS00:0xa7:       Bits: 64
03:35:38:WU01:FS00:0xa7:       Mode: Release
03:35:38:WU01:FS00:0xa7:************************************ System ************************************
03:35:38:WU01:FS00:0xa7:        CPU: AMD Ryzen 7 3700X 8-Core Processor
03:35:38:WU01:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
03:35:38:WU01:FS00:0xa7:       CPUs: 16
03:35:38:WU01:FS00:0xa7:     Memory: 31.35GiB
03:35:38:WU01:FS00:0xa7:Free Memory: 8.31GiB
03:35:38:WU01:FS00:0xa7:    Threads: POSIX_THREADS
03:35:38:WU01:FS00:0xa7: OS Version: 5.3
03:35:38:WU01:FS00:0xa7:Has Battery: false
03:35:38:WU01:FS00:0xa7: On Battery: false
03:35:38:WU01:FS00:0xa7: UTC Offset: -5
03:35:38:WU01:FS00:0xa7:        PID: 8922
03:35:38:WU01:FS00:0xa7:        CWD: /opt/foldingathome/work
03:35:38:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
03:35:38:WU01:FS00:0xa7:    Version: 0.0.18
03:35:38:WU01:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
03:35:38:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
03:35:38:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
03:35:38:WU01:FS00:0xa7:       Date: Nov 5 2019
03:35:38:WU01:FS00:0xa7:       Time: 06:13:26
03:35:38:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
03:35:38:WU01:FS00:0xa7:     Branch: master
03:35:38:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
03:35:38:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
03:35:38:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
03:35:38:WU01:FS00:0xa7:       Bits: 64
03:35:38:WU01:FS00:0xa7:       Mode: Release
03:35:38:WU01:FS00:0xa7:************************************ Build *************************************
03:35:38:WU01:FS00:0xa7:       SIMD: avx_256
03:35:38:WU01:FS00:0xa7:********************************************************************************
03:35:38:WU01:FS00:0xa7:Project: 14246 (Run 0, Clone 69, Gen 68)
03:35:38:WU01:FS00:0xa7:Unit: 0x0000006380fccb0a5d6fe21f1bfc07a1
03:35:38:WU01:FS00:0xa7:Reading tar file core.xml
03:35:38:WU01:FS00:0xa7:Reading tar file frame68.tpr
03:35:38:WU01:FS00:0xa7:Digital signatures verified
03:35:38:WU01:FS00:0xa7:Calling: mdrun -s frame68.tpr -o frame68.trr -x frame68.xtc -cpt 15 -nt 15
03:35:38:WU01:FS00:0xa7:Steps: first=17000000 total=250000
03:35:38:WU01:FS00:0xa7:ERROR:
03:35:38:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
03:35:38:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
03:35:38:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
03:35:38:WU01:FS00:0xa7:ERROR:
03:35:38:WU01:FS00:0xa7:ERROR:Fatal error:
03:35:38:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
03:35:38:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
03:35:38:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
03:35:38:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
03:35:38:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
03:35:38:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
03:35:43:WU01:FS00:0xa7:WARNING:Unexpected exit() call
03:35:43:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
03:35:43:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
03:35:43:WU01:FS00:0xa7:Saving result file md.log
03:35:43:WU01:FS00:0xa7:Saving result file science.log
03:35:43:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
The same thing happened 663 times, wasting 12h of time that could have been used for actual folding.

No overclocking. I've manually removed the WU from the folder, but kept the files in case they are needed.

Re: WU crashing infinitely

Posted: Fri Nov 08, 2019 3:57 am
by JimboPalmer
Your CPU has 16 threads and the slot is using 15 of them. (You did not include the configuration portion of the log, so I can't be sure, but the latest version, 7.5.1, should avoid this.) Your Work Unit won't divide 15 ways. You can configure your CPU slot to use 16 or 12 CPUs.

Re: WU crashing infinitely

Posted: Fri Nov 08, 2019 5:22 am
by Nuitari
AMD Ryzen 7 3700X, so 8 cores and 16 threads with SMT (hyper-threading).
The new version hasn't made it into Gentoo yet, so the update will have to wait until that happens.

It might be nice to consider having the WU handle it and go down automatically to the nearest workable number of threads...

Re: WU crashing infinitely

Posted: Fri Nov 08, 2019 6:09 am
by JimboPalmer
The client is written by a Computer Programmer, the WUs are written by Biochemists.
I am sorry you have not decided to install the latest client. https://packages.gentoo.org/packages/sc ... dingathome
I would set your CPUs to 12 or 16.

Re: WU crashing infinitely

Posted: Fri Nov 08, 2019 6:33 am
by MeeLee
I would try comparing PPD and power consumption, between running with HT disabled (7 threads), and HT enabled (13-14 threads).
You'll need at least 2 cores (if not 3) reserved for non-fah things.
You'll notice, if you run 14, 15 or 16 threads, that your CPU will run at 100% load anyway (probably 13 to 14 cores will get you 95-98% load).
Chances are you'll see very little difference between running on 7 cores and running on 13-14 threads, because with fewer threads the CPU will run cooler and, if your CPU supports it, at higher boost frequencies.

Re: WU crashing infinitely

Posted: Fri Nov 08, 2019 10:44 am
by bollix47
@Nuitari

First let me clear up what may be confusing from the above answers:

The current version of the folding client is 7.5.1
The CPU core will not run when the thread count is a large prime number or a multiple of one. Some projects will have trouble with a prime number as low as 5.

So 7, 13 or 14 will not work. You can, as Jimbo suggested, use 12 or 16.
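
The rule of thumb above can be sketched in a few lines of Python. This is only a heuristic, assuming "safe" means every prime factor of the thread count is 2 or 3; it is not the logic the core itself uses, and the function names are made up for illustration. Whether a given count actually works still depends on the project.

```python
def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n (for n >= 2)."""
    largest = 1
    p = 2
    while p * p <= n:
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    if n > 1:  # whatever is left over is itself prime
        largest = n
    return largest


def nearest_safe_thread_count(requested: int, max_factor: int = 3) -> int:
    """Largest count <= requested whose prime factors are all <= max_factor."""
    for n in range(requested, 1, -1):
        if largest_prime_factor(n) <= max_factor:
            return n
    return 1


# Thread counts mentioned in this thread.
for threads in (7, 13, 14, 15, 16):
    print(threads, "->", nearest_safe_thread_count(threads))
```

With the default limit of 3, both 15 and 14 fall back to 12, while 16 is left alone, which matches the "use 12 or 16" advice.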

Linux has a command called ldd which will list the dependencies needed to run the core. Open your file manager and navigate to the FahCore_a7 location, where you should be able to right-click, choose to open a terminal, and type or copy/paste the following:

Code: Select all

ldd FahCore_a7
That will show you what, if anything, is missing and if you post the results here we can help solve any problems.
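
If you have several copies of the core, the check can be scripted. The sketch below assumes the usual ldd behaviour of printing "=> not found" next to an unresolved library; the function names are my own, and the path you pass in is whatever your install uses.

```python
import subprocess


def missing_libraries(ldd_output: str) -> list:
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # The library name is everything before the "=>" arrow.
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path: str) -> list:
    """Run ldd on one binary and return its unresolved libraries."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)
```

Calling check_binary() on each FahCore_a7 path gives an empty list when nothing is missing.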

See the following post for usage of ldd & strings (your location will be different because you've used a non-default install):
viewtopic.php?p=309712#p309712

Your location may look something like the following but those are only guesses because your log doesn't show enough info:
/opt/foldingathome/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7
OR
/opt/foldingathome/work/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7

Re: WU crashing infinitely

Posted: Fri Nov 08, 2019 3:08 pm
by bruce
In this case, the WU is being partitioned into 3x5x1 segments (-nt 15), so the factor 5 is essentially a "large prime" as far as GROMACS is concerned. Having the software use 12 of your 15 threads might be considered, but there are a lot of WUs that successfully use the factor 5. Dumping the WU after this type of failure, as opposed to retrying it, can also be considered.
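
To make the partitioning idea concrete, here is a simplified sketch of why a factor of 5 can fail. It lists every nx x ny x nz grid for a given number of ranks and keeps those whose cells are no smaller than the minimum cell size from the error message. The 6 nm cubic box is purely hypothetical (the log does not give the real box), and the real GROMACS domain decomposition weighs more than this simple division.

```python
def grids(ranks: int):
    """Yield every (nx, ny, nz) with nx * ny * nz == ranks."""
    for nx in range(1, ranks + 1):
        if ranks % nx:
            continue
        rest = ranks // nx
        for ny in range(1, rest + 1):
            if rest % ny == 0:
                yield nx, ny, rest // ny


def feasible_grids(ranks: int, box_nm, min_cell_nm: float):
    """Grids whose cells are at least min_cell_nm wide along every axis."""
    return [
        g for g in grids(ranks)
        if all(length / n >= min_cell_nm for length, n in zip(box_nm, g))
    ]


BOX_NM = (6.0, 6.0, 6.0)   # hypothetical box, NOT taken from the WU
MIN_CELL_NM = 1.45733      # from the GROMACS error message

for ranks in (10, 12, 15, 16):
    print(ranks, feasible_grids(ranks, BOX_NM, MIN_CELL_NM))
```

In this made-up box at most 4 cells fit along each axis, so any grid that needs 5 cells in one direction is rejected. That leaves 10 and 15 ranks with no valid grid, while 12 and 16 still have several.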

Did the software retry Project: 14246 (Run 0, Clone 69, Gen 68) 663 times or did the same failure process that many different WUs? If it was the latter, what other projects were involved?

I recommend that you avoid the problem entirely by manually creating a CPU slot with 12 threads and another with 3 threads (or, if you don't have a GPU, a single slot with 16 CPUs).

I'm not enough of a GROMACS user to know what -rcon or -dds or the LINCS settings can do, but I'll alert the project owner and let him/her research that possibility (and whether changing those settings can be applied to projects that can be assigned to an unknown number of threads).

Re: WU crashing infinitely

Posted: Sat Nov 09, 2019 2:45 am
by Nuitari
I do run Folding@home 7.5.1 (the original version of your post had 7.5.5.1).

It was the same unit that failed 663 times. The client should probably have dumped it on its own.
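
For anyone who wants to check this in their own logs, a rough sketch of how the failures can be tallied per work unit follows. It relies only on the two line shapes visible in the logs in this thread ("Project: ... (Run, Clone, Gen)" and "FahCore returned: ..."); the function name is made up, and other client versions may format these lines differently.

```python
import re
from collections import Counter

SLOT_RE = re.compile(r"(WU\d+:FS\d+)")
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
RETURNED_RE = re.compile(r"FahCore returned: (\w+) \(\d+ = 0x[0-9a-fA-F]+\)")


def count_core_exits(lines):
    """Count FahCore exits per (project, run, clone, gen, status)."""
    counts = Counter()
    current = {}  # "WUxx:FSxx" -> (project, run, clone, gen) last seen there
    for line in lines:
        slot = SLOT_RE.search(line)
        if not slot:
            continue
        key = slot.group(1)
        project = PROJECT_RE.search(line)
        if project:
            current[key] = tuple(int(x) for x in project.groups())
            continue
        returned = RETURNED_RE.search(line)
        if returned and key in current:
            counts[current[key] + (returned.group(1),)] += 1
    return counts


sample = [
    "03:35:38:WU01:FS00:0xa7:Project: 14246 (Run 0, Clone 69, Gen 68)",
    "03:35:43:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
]
print(count_core_exits(sample))
```

Feeding it the lines of a full log file gives one counter entry per work unit and exit status, which shows whether a single unit was retried or many units failed.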

I've checked through the historical log files and none of the project's WUs ever completed successfully, but the client automatically sent them back as FAULTY.

This is an example with a different WU:

Code: Select all

13:00:09:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:14246 run:0 clone:88 gen:23 core:0xa7 unit:0x0000002380fccb0a5d6fe21e1415b4d9
13:00:48:WU00:FS00:Starting
13:00:48:WU00:FS00:Running FahCore: /opt/foldingathome/FAHCoreWrapper /opt/foldingathome/cores/cores.foldingathome.org/Linux/AMD64/AVX/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 7281 -checkpoint 15 -np 15
13:00:48:WU00:FS00:Started FahCore on PID 9350
13:00:48:WU00:FS00:Core PID:9354
13:00:48:WU00:FS00:FahCore 0xa7 started
13:00:48:WU00:FS00:0xa7:*********************** Log Started 2019-10-04T13:00:48Z ***********************
13:00:48:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
13:00:48:WU00:FS00:0xa7:       Type: 0xa7
13:00:48:WU00:FS00:0xa7:       Core: Gromacs
13:00:48:WU00:FS00:0xa7:    Website: https://foldingathome.org/
13:00:48:WU00:FS00:0xa7:  Copyright: (c) 2009-2018 foldingathome.org
13:00:48:WU00:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
13:00:48:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 9350 -checkpoint 15 -np
13:00:48:WU00:FS00:0xa7:             15
13:00:48:WU00:FS00:0xa7:     Config: <none>
13:00:48:WU00:FS00:0xa7:************************************ Build *************************************
13:00:48:WU00:FS00:0xa7:    Version: 0.0.17
13:00:48:WU00:FS00:0xa7:       Date: Apr 27 2018
13:00:48:WU00:FS00:0xa7:       Time: 19:09:21
13:00:48:WU00:FS00:0xa7: Repository: Git
13:00:48:WU00:FS00:0xa7:   Revision: 21359963583d09ec2063ef946399441c4df4ccd7
13:00:48:WU00:FS00:0xa7:     Branch: master
13:00:48:WU00:FS00:0xa7:   Compiler: GNU 6.3.0 20170516
13:00:48:WU00:FS00:0xa7:    Options: -std=gnu++98 -O3 -funroll-loops
13:00:48:WU00:FS00:0xa7:   Platform: linux2 4.14.0-3-amd64
13:00:48:WU00:FS00:0xa7:       Bits: 64
13:00:48:WU00:FS00:0xa7:       Mode: Release
13:00:48:WU00:FS00:0xa7:       SIMD: avx_256
13:00:48:WU00:FS00:0xa7:************************************ System ************************************
13:00:48:WU00:FS00:0xa7:        CPU: AMD Ryzen 7 3700X 8-Core Processor
13:00:48:WU00:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
13:00:48:WU00:FS00:0xa7:       CPUs: 16
13:00:48:WU00:FS00:0xa7:     Memory: 31.35GiB
13:00:48:WU00:FS00:0xa7:Free Memory: 12.45GiB
13:00:48:WU00:FS00:0xa7:    Threads: POSIX_THREADS
13:00:48:WU00:FS00:0xa7: OS Version: 5.3
13:00:48:WU00:FS00:0xa7:Has Battery: false
13:00:48:WU00:FS00:0xa7: On Battery: false
13:00:48:WU00:FS00:0xa7: UTC Offset: -4
13:00:48:WU00:FS00:0xa7:        PID: 9354
13:00:48:WU00:FS00:0xa7:        CWD: /opt/foldingathome/work
13:00:48:WU00:FS00:0xa7:         OS: Linux 5.3.0-gentoo x86_64
13:00:48:WU00:FS00:0xa7:    OS Arch: AMD64
13:00:48:WU00:FS00:0xa7:********************************************************************************
13:00:48:WU00:FS00:0xa7:Project: 14246 (Run 0, Clone 88, Gen 23)
13:00:48:WU00:FS00:0xa7:Unit: 0x0000002380fccb0a5d6fe21e1415b4d9
13:00:48:WU00:FS00:0xa7:Reading tar file core.xml
13:00:48:WU00:FS00:0xa7:Reading tar file frame23.tpr
13:00:48:WU00:FS00:0xa7:Digital signatures verified
13:00:48:WU00:FS00:0xa7:Calling: mdrun -s frame23.tpr -o frame23.trr -x frame23.xtc -cpt 15 -nt 15
13:00:48:WU00:FS00:0xa7:Steps: first=5750000 total=250000
13:00:48:WU00:FS00:0xa7:ERROR:
13:00:48:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
13:00:48:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20161122-4846b12ba-unknown
13:00:48:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
13:00:48:WU00:FS00:0xa7:ERROR:
13:00:48:WU00:FS00:0xa7:ERROR:Fatal error:
13:00:48:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
13:00:48:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
13:00:48:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
13:00:48:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
13:00:48:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
13:00:48:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
13:00:53:WU00:FS00:0xa7:WARNING:Unexpected exit() call
13:00:53:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
13:00:53:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
13:00:53:WU00:FS00:0xa7:Saving result file md.log
13:00:53:WU00:FS00:0xa7:Saving result file science.log
13:00:53:WU00:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
13:00:54:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
13:00:54:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:14246 run:0 clone:88 gen:23 core:0xa7 unit:0x0000002380fccb0a5d6fe21e1415b4d9
13:00:54:WU00:FS00:Uploading 19.50KiB to 128.252.203.10
13:00:54:WU00:FS00:Connecting to 128.252.203.10:8080
13:00:54:WU00:FS00:Upload complete
13:00:54:WU00:FS00:Server responded WORK_ACK (400)
13:00:54:WU00:FS00:Cleaning up
The WU in question here had this log instead:

Code: Select all

16:35:18:WU01:FS00:Starting
16:35:18:WU01:FS00:Running FahCore: /opt/foldingathome/FAHCoreWrapper /opt/foldingathome/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 7442 -checkpoint 15 -np 15
16:35:18:WU01:FS00:Started FahCore on PID 16797
16:35:18:WU01:FS00:Core PID:16802
16:35:18:WU01:FS00:FahCore 0xa7 started
16:35:18:WU01:FS00:0xa7:*********************** Log Started 2019-11-07T16:35:18Z ***********************
16:35:18:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
16:35:18:WU01:FS00:0xa7:       Type: 0xa7
16:35:18:WU01:FS00:0xa7:       Core: Gromacs
16:35:18:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 16797 -checkpoint 15 -np
16:35:18:WU01:FS00:0xa7:             15
16:35:18:WU01:FS00:0xa7:************************************ CBang *************************************
16:35:18:WU01:FS00:0xa7:       Date: Nov 5 2019
16:35:18:WU01:FS00:0xa7:       Time: 06:06:57
16:35:18:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
16:35:18:WU01:FS00:0xa7:     Branch: master
16:35:18:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
16:35:18:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
16:35:18:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
16:35:18:WU01:FS00:0xa7:       Bits: 64
16:35:18:WU01:FS00:0xa7:       Mode: Release
16:35:18:WU01:FS00:0xa7:************************************ System ************************************
16:35:18:WU01:FS00:0xa7:        CPU: AMD Ryzen 7 3700X 8-Core Processor
16:35:18:WU01:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
16:35:18:WU01:FS00:0xa7:       CPUs: 16
16:35:18:WU01:FS00:0xa7:     Memory: 31.35GiB
16:35:18:WU01:FS00:0xa7:Free Memory: 8.39GiB
16:35:18:WU01:FS00:0xa7:    Threads: POSIX_THREADS
16:35:18:WU01:FS00:0xa7: OS Version: 5.3
16:35:18:WU01:FS00:0xa7:Has Battery: false
16:35:18:WU01:FS00:0xa7: On Battery: false
16:35:18:WU01:FS00:0xa7: UTC Offset: -5
16:35:18:WU01:FS00:0xa7:        PID: 16802
16:35:18:WU01:FS00:0xa7:        CWD: /opt/foldingathome/work
16:35:18:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
16:35:18:WU01:FS00:0xa7:    Version: 0.0.18
16:35:18:WU01:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
16:35:18:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
16:35:18:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
16:35:18:WU01:FS00:0xa7:       Date: Nov 5 2019
16:35:18:WU01:FS00:0xa7:       Time: 06:13:26
16:35:18:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
16:35:18:WU01:FS00:0xa7:     Branch: master
16:35:18:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
16:35:18:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
16:35:18:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
16:35:18:WU01:FS00:0xa7:       Bits: 64
16:35:18:WU01:FS00:0xa7:       Mode: Release
16:35:18:WU01:FS00:0xa7:************************************ Build *************************************
16:35:18:WU01:FS00:0xa7:       SIMD: avx_256
16:35:18:WU01:FS00:0xa7:********************************************************************************
16:35:18:WU01:FS00:0xa7:Project: 14246 (Run 0, Clone 69, Gen 68)
16:35:18:WU01:FS00:0xa7:Unit: 0x0000006380fccb0a5d6fe21f1bfc07a1
16:35:18:WU01:FS00:0xa7:Reading tar file core.xml
16:35:18:WU01:FS00:0xa7:Reading tar file frame68.tpr
16:35:18:WU01:FS00:0xa7:Digital signatures verified
16:35:18:WU01:FS00:0xa7:Calling: mdrun -s frame68.tpr -o frame68.trr -x frame68.xtc -cpt 15 -nt 15
16:35:18:WU01:FS00:0xa7:Steps: first=17000000 total=250000
16:35:18:WU01:FS00:0xa7:ERROR:
16:35:18:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
16:35:18:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
16:35:18:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
16:35:18:WU01:FS00:0xa7:ERROR:
16:35:18:WU01:FS00:0xa7:ERROR:Fatal error:
16:35:18:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
16:35:18:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
16:35:18:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
16:35:18:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
16:35:18:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
16:35:18:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
16:35:23:WU01:FS00:0xa7:WARNING:Unexpected exit() call
16:35:23:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
16:35:23:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
16:35:23:WU01:FS00:0xa7:Saving result file md.log
16:35:23:WU01:FS00:0xa7:Saving result file science.log
16:35:23:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
16:35:24:WU01:FS00:Starting
16:35:24:WU01:FS00:Running FahCore: /opt/foldingathome/FAHCoreWrapper /opt/foldingathome/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 7442 -checkpoint 15 -np 15

None of the FahCore_a7 files are missing dependencies.

Code: Select all

gandalf /opt/foldingathome # ldd ./cores/cores.foldingathome.org/Linux/AMD64/AVX/Core_a7.fah/FahCore_a7
        linux-vdso.so.1 (0x00007ffffc8f8000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd7d7c49000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fd7d7c43000)
        libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/libstdc++.so.6 (0x00007fd7d79c6000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fd7d787a000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/libgcc_s.so.1 (0x00007fd7d7860000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fd7d768e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fd7d9203000)
gandalf /opt/foldingathome # ldd ./cores/cores.foldingathome.org/Linux/AMD64/Core_a7.fah/FahCore_a7
        linux-vdso.so.1 (0x00007ffc4b308000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fc308f5c000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fc308f56000)
        libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/libstdc++.so.6 (0x00007fc308cd9000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fc308b8d000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/libgcc_s.so.1 (0x00007fc308b73000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fc3089a1000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fc30a353000)
gandalf /opt/foldingathome # ldd ./cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7
        linux-vdso.so.1 (0x00007ffe8cc64000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f37f8136000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f37f8130000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f37f7fe4000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f37f7e12000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f37f81bc000)
gandalf /opt/foldingathome # ldd ./cores/fahwebx.stanford.edu/cores/Linux/AMD64/Core_a7.fah/FahCore_a7
        linux-vdso.so.1 (0x00007ffcba349000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fbdf152a000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fbdf1524000)
        libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/libstdc++.so.6 (0x00007fbdf12a7000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fbdf115b000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/libgcc_s.so.1 (0x00007fbdf1141000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fbdf0f6f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fbdf15b0000)
I will play around with the config to split the CPUs.

Re: WU crashing infinitely

Posted: Sat Nov 09, 2019 3:12 am
by bruce
Nuitari wrote: I will play around with the config to split the CPUs
Please report what happens when you try to run those same WUs with fewer CPUs, especially with even numbers of CPUs.

Re: WU crashing infinitely

Posted: Fri Nov 22, 2019 3:34 am
by Nuitari
Project 14245 is also having the problem at 16 CPUs. It works with 8.
Is there a guide somewhere for the syntax in config.xml ?

Re: WU crashing infinitely

Posted: Fri Nov 22, 2019 8:35 pm
by bruce
If you have more than one CPU slot, you can configure each one like this:
<slot type="CPU" id="0">
  <cpus v="6"/>
</slot>

If you want all the slots to be the same (or you only have one), the <cpus v="N"/> element can go in the general section before the slots are defined.
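
As an illustration, the slot section can also be generated with a short Python script. This is a sketch under assumptions: only the slot and cpus elements come from the snippet above, the config root element name is assumed, and a real config.xml normally holds other settings that are left out here, so don't overwrite an existing file with this output.

```python
import xml.etree.ElementTree as ET


def build_config(cpu_slots):
    """Return config XML text with one CPU slot per entry in cpu_slots."""
    root = ET.Element("config")  # root element name is an assumption
    for slot_id, cpus in enumerate(cpu_slots):
        slot = ET.SubElement(root, "slot", {"type": "CPU", "id": str(slot_id)})
        ET.SubElement(slot, "cpus", {"v": str(cpus)})
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Two CPU slots: one with 12 threads and one with 3.
print(build_config([12, 3]))
```

The result can be pasted into the existing config.xml in place of the old slot entries.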

Re: WU crashing infinitely

Posted: Mon Mar 16, 2020 5:37 am
by IvantheDugtrio
I'm getting this issue as well, though I figured it would be kind of a waste to allocate only 12 threads on a 128-thread dual EPYC system. I can't imagine what it's like for people with EPYC Rome servers with twice as many cores. For jobs that cannot handle many threads, is it possible to run multiples of them independently and concurrently? I am using a Docker image of FAHClient v7.5.1.

Re: WU crashing infinitely

Posted: Mon Mar 16, 2020 6:09 am
by Joe_H
Yes, you can run multiple CPU folding slots, each set to use some fraction of the threads you have available. Just don't have the total add up to more CPU threads than you have.
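
One way to work out such a split is sketched below. It assumes you pick a maximum slot size yourself and that each slot should avoid prime factors larger than 3, as discussed earlier in the thread; the function names are hypothetical and the result is a starting point, not a guarantee that every project will run.

```python
def is_smooth(n: int) -> bool:
    """True if n has no prime factor other than 2 and 3."""
    for p in (2, 3):
        while n % p == 0:
            n //= p
    return n == 1


def split_into_slots(total_threads: int, max_per_slot: int) -> list:
    """Split total_threads into slot sizes that never exceed the total."""
    if total_threads < 0 or max_per_slot < 1:
        raise ValueError("need total_threads >= 0 and max_per_slot >= 1")
    slots = []
    remaining = total_threads
    while remaining > 0:
        size = min(remaining, max_per_slot)
        while not is_smooth(size):  # step down to a 2/3-only size
            size -= 1
        slots.append(size)
        remaining -= size
    return slots


print(split_into_slots(128, 12))  # ten slots of 12 plus one of 8
print(split_into_slots(15, 12))   # [12, 3]
```

The slot sizes always add up to exactly the number of threads you pass in, so the total never exceeds what the machine has.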

In the past there have been large WUs for CPU processing, and the CPU core would scale to over 100 CPU threads. But few people have large servers like an EPYC, and many have mid to higher end GPUs. So the largest projects in terms of number of atoms being simulated are targeted towards GPU processing these days.

Re: WU crashing infinitely

Posted: Thu Apr 02, 2020 6:18 pm
by dsmclau
I have the same issue. Has been failing for a couple of days.

Code: Select all

18:04:06:WU00:FS00:0xa7:*********************** Log Started 2020-04-02T18:04:05Z ***********************
18:04:06:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
18:04:06:WU00:FS00:0xa7:       Type: 0xa7
18:04:06:WU00:FS00:0xa7:       Core: Gromacs
18:04:06:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 4274 -checkpoint 15 -np
18:04:06:WU00:FS00:0xa7:             11
18:04:06:WU00:FS00:0xa7:************************************ CBang *************************************
18:04:06:WU00:FS00:0xa7:       Date: Nov 5 2019
18:04:06:WU00:FS00:0xa7:       Time: 06:06:57
18:04:06:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
18:04:06:WU00:FS00:0xa7:     Branch: master
18:04:06:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
18:04:06:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
18:04:06:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
18:04:06:WU00:FS00:0xa7:       Bits: 64
18:04:06:WU00:FS00:0xa7:       Mode: Release
18:04:06:WU00:FS00:0xa7:************************************ System ************************************
18:04:06:WU00:FS00:0xa7:        CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
18:04:06:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
18:04:06:WU00:FS00:0xa7:       CPUs: 12
18:04:06:WU00:FS00:0xa7:     Memory: 31.30GiB
18:04:06:WU00:FS00:0xa7:Free Memory: 27.35GiB
18:04:06:WU00:FS00:0xa7:    Threads: POSIX_THREADS
18:04:06:WU00:FS00:0xa7: OS Version: 5.3
18:04:06:WU00:FS00:0xa7:Has Battery: false
18:04:06:WU00:FS00:0xa7: On Battery: false
18:04:06:WU00:FS00:0xa7: UTC Offset: -4
18:04:06:WU00:FS00:0xa7:        PID: 4278
18:04:06:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
18:04:06:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
18:04:06:WU00:FS00:0xa7:    Version: 0.0.18
18:04:06:WU00:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
18:04:06:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
18:04:06:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
18:04:06:WU00:FS00:0xa7:       Date: Nov 5 2019
18:04:06:WU00:FS00:0xa7:       Time: 06:13:26
18:04:06:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
18:04:06:WU00:FS00:0xa7:     Branch: master
18:04:06:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
18:04:06:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
18:04:06:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
18:04:06:WU00:FS00:0xa7:       Bits: 64
18:04:06:WU00:FS00:0xa7:       Mode: Release
18:04:06:WU00:FS00:0xa7:************************************ Build *************************************
18:04:06:WU00:FS00:0xa7:       SIMD: avx_256
18:04:06:WU00:FS00:0xa7:********************************************************************************
18:04:06:WU00:FS00:0xa7:Project: 13833 (Run 0, Clone 3627, Gen 5)
18:04:06:WU00:FS00:0xa7:Unit: 0x0000000980fccb095e6e55bb1f5d4033
18:04:06:WU00:FS00:0xa7:Reading tar file core.xml
18:04:06:WU00:FS00:0xa7:Reading tar file frame5.tpr
18:04:06:WU00:FS00:0xa7:Digital signatures verified
18:04:06:WU00:FS00:0xa7:Reducing thread count from 11 to 10 to avoid domain decomposition by a prime number > 3
18:04:06:WU00:FS00:0xa7:Calling: mdrun -s frame5.tpr -o frame5.trr -x frame5.xtc -cpt 15 -nt 10
18:04:06:WU00:FS00:0xa7:Steps: first=1250000 total=250000
18:04:06:WU00:FS00:0xa7:ERROR:
18:04:06:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
18:04:06:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
18:04:06:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
18:04:06:WU00:FS00:0xa7:ERROR:
18:04:06:WU00:FS00:0xa7:ERROR:Fatal error:
18:04:06:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
18:04:06:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
18:04:06:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
18:04:06:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
18:04:06:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
18:04:06:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
18:04:08:WU01:FS01:0x22:Completed 170000 out of 1000000 steps (17%)
18:04:10:WU00:FS00:0xa7:WARNING:Unexpected exit() call
18:04:10:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
18:04:10:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
18:04:10:WU00:FS00:0xa7:Saving result file md.log
18:04:10:WU00:FS00:0xa7:Saving result file science.log
18:04:11:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
18:05:05:WU00:FS00:Starting
18:05:05:WU00:FS00:Removing old file './work/00/logfile_01-20200402-173304.txt'
18:05:05:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 2154 -checkpoint 15 -np 11
18:05:05:WU00:FS00:Started FahCore on PID 4344
18:05:05:WU00:FS00:Core PID:4348
18:05:05:WU00:FS00:FahCore 0xa7 started
18:05:06:WU00:FS00:0xa7:*********************** Log Started 2020-04-02T18:05:05Z ***********************
18:05:06:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
18:05:06:WU00:FS00:0xa7:       Type: 0xa7
18:05:06:WU00:FS00:0xa7:       Core: Gromacs
18:05:06:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 4344 -checkpoint 15 -np
18:05:06:WU00:FS00:0xa7:             11
18:05:06:WU00:FS00:0xa7:************************************ CBang *************************************
18:05:06:WU00:FS00:0xa7:       Date: Nov 5 2019
18:05:06:WU00:FS00:0xa7:       Time: 06:06:57
18:05:06:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
18:05:06:WU00:FS00:0xa7:     Branch: master
18:05:06:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
18:05:06:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
18:05:06:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
18:05:06:WU00:FS00:0xa7:       Bits: 64
18:05:06:WU00:FS00:0xa7:       Mode: Release
18:05:06:WU00:FS00:0xa7:************************************ System ************************************
18:05:06:WU00:FS00:0xa7:        CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
18:05:06:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
18:05:06:WU00:FS00:0xa7:       CPUs: 12
18:05:06:WU00:FS00:0xa7:     Memory: 31.30GiB
18:05:06:WU00:FS00:0xa7:Free Memory: 27.25GiB
18:05:06:WU00:FS00:0xa7:    Threads: POSIX_THREADS
18:05:06:WU00:FS00:0xa7: OS Version: 5.3
18:05:06:WU00:FS00:0xa7:Has Battery: false
18:05:06:WU00:FS00:0xa7: On Battery: false
18:05:06:WU00:FS00:0xa7: UTC Offset: -4
18:05:06:WU00:FS00:0xa7:        PID: 4348
18:05:06:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
18:05:06:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
18:05:06:WU00:FS00:0xa7:    Version: 0.0.18
18:05:06:WU00:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
18:05:06:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
18:05:06:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
18:05:06:WU00:FS00:0xa7:       Date: Nov 5 2019
18:05:06:WU00:FS00:0xa7:       Time: 06:13:26
18:05:06:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
18:05:06:WU00:FS00:0xa7:     Branch: master
18:05:06:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
18:05:06:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
18:05:06:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
18:05:06:WU00:FS00:0xa7:       Bits: 64
18:05:06:WU00:FS00:0xa7:       Mode: Release
18:05:06:WU00:FS00:0xa7:************************************ Build *************************************
18:05:06:WU00:FS00:0xa7:       SIMD: avx_256
18:05:06:WU00:FS00:0xa7:********************************************************************************
18:05:06:WU00:FS00:0xa7:Project: 13833 (Run 0, Clone 3627, Gen 5)
18:05:06:WU00:FS00:0xa7:Unit: 0x0000000980fccb095e6e55bb1f5d4033
18:05:06:WU00:FS00:0xa7:Reading tar file core.xml
18:05:06:WU00:FS00:0xa7:Reading tar file frame5.tpr
18:05:06:WU00:FS00:0xa7:Digital signatures verified
18:05:06:WU00:FS00:0xa7:Reducing thread count from 11 to 10 to avoid domain decomposition by a prime number > 3
18:05:06:WU00:FS00:0xa7:Calling: mdrun -s frame5.tpr -o frame5.trr -x frame5.xtc -cpt 15 -nt 10
18:05:06:WU00:FS00:0xa7:Steps: first=1250000 total=250000
18:05:06:WU00:FS00:0xa7:ERROR:
18:05:06:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
18:05:06:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
18:05:06:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
18:05:06:WU00:FS00:0xa7:ERROR:
18:05:06:WU00:FS00:0xa7:ERROR:Fatal error:
18:05:06:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
18:05:06:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
18:05:06:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
18:05:06:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
18:05:06:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
18:05:06:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
18:05:10:WU00:FS00:0xa7:WARNING:Unexpected exit() call
18:05:10:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
18:05:10:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
18:05:10:WU00:FS00:0xa7:Saving result file md.log
18:05:10:WU00:FS00:0xa7:Saving result file science.log
18:05:11:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
18:05:24:WU01:FS01:0x22:Completed 180000 out of 1000000 steps (18%)
18:06:03:FS00:Paused
18:06:03:FS01:Paused
18:06:03:FS01:Shutting core down
18:06:03:WU01:FS01:0x22:Caught signal SIGINT(2) on PID 3289
18:06:03:WU01:FS01:0x22:Exiting, please wait. . .
18:06:03:WU01:FS01:0x22:Folding@home Core Shutdown: INTERRUPTED
18:06:03:WU01:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
18:06:04:Removing old file 'configs/config-20200330-213451.xml'
18:06:04:Saving configuration to /etc/fahclient/config.xml
18:06:04:<config>
18:06:04:  <!-- Client Control -->
18:06:04:  <fold-anon v='true'/>
18:06:04:
18:06:04:  <!-- Network -->
18:06:04:  <proxy v=':8080'/>
18:06:04:
18:06:04:  <!-- Slot Control -->
18:06:04:  <power v='full'/>
18:06:04:
18:06:04:  <!-- User Information -->
18:06:04:  <passkey v='********************************'/>
18:06:04:  <team v='223518'/>
18:06:04:  <user v='Cincinnati_Kid'/>
18:06:04:
18:06:04:  <!-- Folding Slots -->
18:06:04:  <slot id='0' type='CPU'>
18:06:04:    <paused v='true'/>
18:06:04:  </slot>
18:06:04:  <slot id='1' type='GPU'>
18:06:04:    <paused v='true'/>
18:06:04:  </slot>
18:06:04:</config>

Re: WU crashing infinitely

Posted: Thu Apr 02, 2020 7:11 pm
by Neil-B
So 12 cores minus 1 for the GPU makes 11. The client tries to get away from the large prime by reducing the core count, unfortunately to 10, which is a multiple of 5 :( Some projects don't have issues with 5, but this one seems to. Try setting the CPU slot to 9, which may resolve this issue and avoid problems in the future.
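
The arithmetic behind that suggestion can be sketched in Python. This is only an illustration, not the client's actual logic: according to the log, the client steps down only to avoid a prime thread count greater than 3, whereas this sketch also rejects any count with a prime factor above 3 (such as 10 = 2 x 5). The function names and the `max_factor` threshold are assumptions made for the example.

```python
def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n (returns 1 for n < 2)."""
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    if n > 1:
        # Whatever remains is itself a prime factor.
        largest = n
    return largest


def safe_thread_count(requested: int, max_factor: int = 3) -> int:
    """Hypothetical helper: the largest count <= requested whose prime
    factors are all <= max_factor (assumed threshold, not from the client)."""
    for count in range(requested, 1, -1):
        if largest_prime_factor(count) <= max_factor:
            return count
    return 1


if __name__ == "__main__":
    # 12 logical CPUs minus 1 reserved for the GPU slot leaves 11.
    for count in (11, 10, 9):
        print(count, "largest prime factor:", largest_prime_factor(count))
    print("suggested CPU slot size:", safe_thread_count(11))
```

Under these assumptions, 11 is rejected because it is prime, 10 because it contains a factor of 5, and 9 (3 x 3) is the first count that passes.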