Project 16417 fails on high core count machines
Posted: Mon Apr 06, 2020 10:45 pm
by Area256
I'm having a problem with Project 16417 (and possibly similar ones) on very high core count machines (128 threads in this case).
I think the issue is that too many threads are assigned to the core. If I limit the number of threads, the unit runs. Frustratingly, when it fails the client just keeps retrying instead of reporting the failure and fetching another unit, so my system sits idle, constantly retrying, until I limit the number of cores for this unit by switching folding power to "Light".
Full logs:
Code:
22:34:41:WU00:FS00:Starting
22:34:41:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 704 -lifeline 8778 -checkpoint 15 -np 128
22:34:41:WU00:FS00:Started FahCore on PID 11885
22:34:41:WU00:FS00:Core PID:11889
22:34:41:WU00:FS00:FahCore 0xa7 started
22:34:41:WU00:FS00:0xa7:*********************** Log Started 2020-04-06T22:34:41Z ***********************
22:34:41:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
22:34:41:WU00:FS00:0xa7: Type: 0xa7
22:34:41:WU00:FS00:0xa7: Core: Gromacs
22:34:41:WU00:FS00:0xa7: Args: -dir 00 -suffix 01 -version 704 -lifeline 11885 -checkpoint 15 -np
22:34:41:WU00:FS00:0xa7: 128
22:34:41:WU00:FS00:0xa7:************************************ CBang *************************************
22:34:41:WU00:FS00:0xa7: Date: Nov 5 2019
22:34:41:WU00:FS00:0xa7: Time: 06:06:57
22:34:41:WU00:FS00:0xa7: Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
22:34:41:WU00:FS00:0xa7: Branch: master
22:34:41:WU00:FS00:0xa7: Compiler: GNU 8.3.0
22:34:41:WU00:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
22:34:41:WU00:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
22:34:41:WU00:FS00:0xa7: Bits: 64
22:34:41:WU00:FS00:0xa7: Mode: Release
22:34:41:WU00:FS00:0xa7:************************************ System ************************************
22:34:41:WU00:FS00:0xa7: CPU: AMD EPYC 7702 64-Core Processor
22:34:41:WU00:FS00:0xa7: CPU ID: AuthenticAMD Family 23 Model 49 Stepping 0
22:34:41:WU00:FS00:0xa7: CPUs: 128
22:34:41:WU00:FS00:0xa7: Memory: 251.54GiB
22:34:41:WU00:FS00:0xa7:Free Memory: 244.47GiB
22:34:41:WU00:FS00:0xa7: Threads: POSIX_THREADS
22:34:41:WU00:FS00:0xa7: OS Version: 5.3
22:34:41:WU00:FS00:0xa7:Has Battery: false
22:34:41:WU00:FS00:0xa7: On Battery: false
22:34:41:WU00:FS00:0xa7: UTC Offset: 0
22:34:41:WU00:FS00:0xa7: PID: 11889
22:34:41:WU00:FS00:0xa7: CWD: /var/lib/fahclient/work
22:34:41:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
22:34:41:WU00:FS00:0xa7: Version: 0.0.18
22:34:41:WU00:FS00:0xa7: Author: Joseph Coffland <[email protected]>
22:34:41:WU00:FS00:0xa7: Copyright: 2019 foldingathome.org
22:34:41:WU00:FS00:0xa7: Homepage: https://foldingathome.org/
22:34:41:WU00:FS00:0xa7: Date: Nov 5 2019
22:34:41:WU00:FS00:0xa7: Time: 06:13:26
22:34:41:WU00:FS00:0xa7: Revision: 490c9aa2957b725af319379424d5c5cb36efb656
22:34:41:WU00:FS00:0xa7: Branch: master
22:34:41:WU00:FS00:0xa7: Compiler: GNU 8.3.0
22:34:41:WU00:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie
22:34:41:WU00:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
22:34:41:WU00:FS00:0xa7: Bits: 64
22:34:41:WU00:FS00:0xa7: Mode: Release
22:34:41:WU00:FS00:0xa7:************************************ Build *************************************
22:34:41:WU00:FS00:0xa7: SIMD: avx_256
22:34:41:WU00:FS00:0xa7:********************************************************************************
22:34:41:WU00:FS00:0xa7:Project: 16417 (Run 535, Clone 2, Gen 7)
22:34:41:WU00:FS00:0xa7:Unit: 0x0000000796880e6e5e8a61a9533cfa03
22:34:41:WU00:FS00:0xa7:Reading tar file core.xml
22:34:41:WU00:FS00:0xa7:Reading tar file frame7.tpr
22:34:41:WU00:FS00:0xa7:Digital signatures verified
22:34:41:WU00:FS00:0xa7:Calling: mdrun -s frame7.tpr -o frame7.trr -x frame7.xtc -cpt 15 -nt 128
22:34:41:WU00:FS00:0xa7:Steps: first=1750000 total=250000
22:34:41:WU00:FS00:0xa7:ERROR:
22:34:41:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
22:34:41:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
22:34:41:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
22:34:41:WU00:FS00:0xa7:ERROR:
22:34:41:WU00:FS00:0xa7:ERROR:Fatal error:
22:34:41:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 96 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
22:34:41:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
22:34:41:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
22:34:41:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
22:34:41:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
22:34:41:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
22:34:46:WU00:FS00:0xa7:WARNING:Unexpected exit() call
22:34:46:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
22:34:46:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
22:34:46:WU00:FS00:0xa7:Saving result file md.log
22:34:46:WU00:FS00:0xa7:Saving result file science.log
22:34:46:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
My config.xml
Code:
<config>
<!-- Client Control -->
<fold-anon v='true'/>
<!-- Folding Slot Configuration -->
<gpu v='false'/>
<!-- HTTP Server -->
<allow v='127.0.0.1'/>
<!-- Network -->
<proxy v=':8080'/>
<!-- Remote Command Server -->
<password v='***'/>
<!-- Slot Control -->
<power v='light'/>
<!-- User Information -->
<passkey v='***'/>
<team v='***'/>
<user v='***'/>
<!-- Web Server -->
<web-allow v='127.0.0.1'/>
<!-- Folding Slots -->
<slot id='0' type='CPU'>
<client-type v='bigadv'/>
</slot>
</config>
Re: Project 16417 fails on high core count machines
Posted: Mon Apr 06, 2020 10:53 pm
by Joe_H
Only a limited number of CPU projects will run on that many threads; the rest usually have upper limits set for assignment.
What thread count were you able to run this project on? I will pass this information back to the researcher so they can check and adjust the limits for this project.
Re: Project 16417 fails on high core count machines
Posted: Mon Apr 06, 2020 11:59 pm
by Area256
I was able to run it on 64 threads (which seems to be the limit set by the "Light" option). I'm afraid I didn't test anything higher.
Re: Project 16417 fails on high core count machines
Posted: Tue Apr 07, 2020 3:12 am
by Joe_H
Other thread counts would need to be set through FAHControl's Configure, but that at least gives a useful limit.
Re: Project 16417 fails on high core count machines
Posted: Tue Apr 07, 2020 3:20 am
by sukritsingh
Thanks for flagging! I've updated the project thread limits so that it only uses 64 cores or fewer, since that was the highest count you tested.
Re: Project 16417 fails on high core count machines
Posted: Tue Apr 07, 2020 11:24 am
by Neil-B
FYI: a larger core count might be why my Project 16417 (Run 2023, Clone 4, Gen 2) failed on a 24-core slot. It has since been completed successfully by someone else, so the upper limit might be less than 64?
Code:
12:26:12:WU01:FS00:Connecting to 65.254.110.245:8080
12:26:12:WU01:FS00:Assigned to work server 150.136.14.110
12:26:12:WU01:FS00:Requesting new work unit for slot 00: READY cpu:24 from 150.136.14.110
12:26:12:WU01:FS00:Connecting to 150.136.14.110:8080
12:26:13:WU01:FS00:Downloading 2.34MiB
12:26:14:WU01:FS00:Download complete
12:26:14:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16417 run:2023 clone:4 gen:2 core:0xa7 unit:0x0000000296880e6e5e8a604c5fef0873
12:26:14:WU01:FS00:Starting
12:26:14:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\OpDoubleHelix\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah/FahCore_a7.exe -dir 01 -suffix 01 -version 705 -lifeline 13272 -checkpoint 5 -np 24
12:26:14:WU01:FS00:Started FahCore on PID 10236
12:26:14:WU01:FS00:Core PID:10032
12:26:14:WU01:FS00:FahCore 0xa7 started
12:26:14:WU01:FS00:0xa7:*********************** Log Started 2020-04-06T12:26:14Z ***********************
12:26:14:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
12:26:14:WU01:FS00:0xa7: Type: 0xa7
12:26:14:WU01:FS00:0xa7: Core: Gromacs
12:26:14:WU01:FS00:0xa7: Args: -dir 01 -suffix 01 -version 705 -lifeline 10236 -checkpoint 5 -np
12:26:14:WU01:FS00:0xa7: 24
12:26:14:WU01:FS00:0xa7:************************************ CBang *************************************
12:26:14:WU01:FS00:0xa7: Date: Oct 26 2019
12:26:14:WU01:FS00:0xa7: Time: 01:38:25
12:26:14:WU01:FS00:0xa7: Revision: c46a1a011a24143739ac7218c5a435f66777f62f
12:26:14:WU01:FS00:0xa7: Branch: master
12:26:14:WU01:FS00:0xa7: Compiler: Visual C++ 2008
12:26:14:WU01:FS00:0xa7: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
12:26:14:WU01:FS00:0xa7: Platform: win32 10
12:26:14:WU01:FS00:0xa7: Bits: 64
12:26:14:WU01:FS00:0xa7: Mode: Release
12:26:14:WU01:FS00:0xa7:************************************ System ************************************
12:26:14:WU01:FS00:0xa7: CPU: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
12:26:14:WU01:FS00:0xa7: CPU ID: GenuineIntel Family 6 Model 63 Stepping 2
12:26:14:WU01:FS00:0xa7: CPUs: 56
12:26:14:WU01:FS00:0xa7: Memory: 511.75GiB
12:26:14:WU01:FS00:0xa7:Free Memory: 500.27GiB
12:26:14:WU01:FS00:0xa7: Threads: WINDOWS_THREADS
12:26:14:WU01:FS00:0xa7: OS Version: 6.2
12:26:14:WU01:FS00:0xa7:Has Battery: false
12:26:14:WU01:FS00:0xa7: On Battery: false
12:26:14:WU01:FS00:0xa7: UTC Offset: 1
12:26:14:WU01:FS00:0xa7: PID: 10032
12:26:14:WU01:FS00:0xa7: CWD: C:\Users\OpDoubleHelix\AppData\Roaming\FAHClient\work
12:26:14:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
12:26:14:WU01:FS00:0xa7: Version: 0.0.18
12:26:14:WU01:FS00:0xa7: Author: Joseph Coffland <[email protected]>
12:26:15:WU01:FS00:0xa7: Copyright: 2019 foldingathome.org
12:26:15:WU01:FS00:0xa7: Homepage: https://foldingathome.org/
12:26:15:WU01:FS00:0xa7: Date: Oct 26 2019
12:26:15:WU01:FS00:0xa7: Time: 01:52:30
12:26:15:WU01:FS00:0xa7: Revision: c1e3513b1bc0c16013668f2173ee969e5995b38e
12:26:15:WU01:FS00:0xa7: Branch: master
12:26:15:WU01:FS00:0xa7: Compiler: Visual C++ 2008
12:26:15:WU01:FS00:0xa7: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
12:26:15:WU01:FS00:0xa7: Platform: win32 10
12:26:15:WU01:FS00:0xa7: Bits: 64
12:26:15:WU01:FS00:0xa7: Mode: Release
12:26:15:WU01:FS00:0xa7:************************************ Build *************************************
12:26:15:WU01:FS00:0xa7: SIMD: avx_256
12:26:15:WU01:FS00:0xa7:********************************************************************************
12:26:15:WU01:FS00:0xa7:Project: 16417 (Run 2023, Clone 4, Gen 2)
12:26:15:WU01:FS00:0xa7:Unit: 0x0000000296880e6e5e8a604c5fef0873
12:26:15:WU01:FS00:0xa7:Reading tar file core.xml
12:26:15:WU01:FS00:0xa7:Reading tar file frame2.tpr
12:26:15:WU01:FS00:0xa7:Digital signatures verified
12:26:15:WU01:FS00:0xa7:Calling: mdrun -s frame2.tpr -o frame2.trr -x frame2.xtc -cpt 5 -nt 24
12:26:15:WU01:FS00:0xa7:Steps: first=500000 total=250000
12:26:15:WU01:FS00:0xa7:ERROR:
12:26:15:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
12:26:15:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
12:26:15:WU01:FS00:0xa7:ERROR:Source code file: C:\build\fah\core-a7-avx-release\windows-10-64bit-core-a7-avx-release\gromacs-core\build\gromacs\src\gromacs\mdlib\domdec.c, line: 6902
12:26:15:WU01:FS00:0xa7:ERROR:
12:26:15:WU01:FS00:0xa7:ERROR:Fatal error:
12:26:15:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
12:26:15:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
12:26:15:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
12:26:15:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
12:26:15:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
12:26:15:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
12:26:20:WU01:FS00:0xa7:WARNING:Unexpected exit() call
12:26:20:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
12:26:20:WU01:FS00:0xa7:Saving result file ..\logfile_01.txt
12:26:20:WU01:FS00:0xa7:Saving result file md.log
12:26:20:WU01:FS00:0xa7:Saving result file science.log
12:26:20:WU01:FS00:0xa7:WARNING:While cleaning up: boost::filesystem::remove: The process cannot access the file because it is being used by another process: "01/md.log"
12:26:20:WU01:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
12:26:20:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
12:26:20:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:16417 run:2023 clone:4 gen:2 core:0xa7 unit:0x0000000296880e6e5e8a604c5fef0873
12:26:20:WU01:FS00:Uploading 20.00KiB to 150.136.14.110
12:26:20:WU01:FS00:Connecting to 150.136.14.110:8080
12:26:20:WU01:FS00:Upload complete
12:26:20:WU01:FS00:Server responded WORK_ACK (400)
12:26:20:WU01:FS00:Cleaning up
Re: Project 16417 fails on high core count machines
Posted: Tue Apr 07, 2020 5:48 pm
by Neil-B
Further update from another thread …
viewtopic.php?f=19&t=34072#p323443 might have to go as low as 8 cores for this?
Ignore me. Joe_H has responded on the other thread: given that 64 cores worked, he reckons a multiple of 5 is probably the issue. I'm not sure about my 24-core failure, but I'm happy for that to be an anomaly.
Re: Project 16417 fails on high core count machines
Posted: Sun Apr 12, 2020 12:09 am
by PantherX
FYI, I have confirmation from the Project owner that Project 16417 will no longer be assigned to 24 CPUs. Thanks all for your report
Re: Project 16417 fails on high core count machines
Posted: Mon Apr 20, 2020 1:56 pm
by Zzyzx
PantherX wrote:FYI, I have confirmation from the Project owner that Project 16417 will no longer be assigned to 24 CPUs. Thanks all for your report
Hey there! I got assigned 16417 on a 24c/48t machine today and had the same issue:
Code:
13:47:45:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 706 -lifeline 84821 -checkpoint 5 -np 48
13:47:45:WU00:FS00:Started FahCore on PID 64119
13:47:45:WU00:FS00:Core PID:64123
13:47:45:WU00:FS00:FahCore 0xa7 started
13:47:45:WU00:FS00:0xa7:*********************** Log Started 2020-04-20T13:47:45Z ***********************
13:47:45:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
13:47:45:WU00:FS00:0xa7: Type: 0xa7
13:47:45:WU00:FS00:0xa7: Core: Gromacs
13:47:45:WU00:FS00:0xa7: Args: -dir 00 -suffix 01 -version 706 -lifeline 64119 -checkpoint 5 -np
13:47:45:WU00:FS00:0xa7: 48
13:47:45:WU00:FS00:0xa7:************************************ CBang *************************************
13:47:45:WU00:FS00:0xa7: Date: Nov 5 2019
13:47:45:WU00:FS00:0xa7: Time: 06:06:57
13:47:45:WU00:FS00:0xa7: Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
13:47:45:WU00:FS00:0xa7: Branch: master
13:47:45:WU00:FS00:0xa7: Compiler: GNU 8.3.0
13:47:45:WU00:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
13:47:45:WU00:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
13:47:45:WU00:FS00:0xa7: Bits: 64
13:47:45:WU00:FS00:0xa7: Mode: Release
13:47:45:WU00:FS00:0xa7:************************************ System ************************************
13:47:45:WU00:FS00:0xa7: CPU: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
13:47:45:WU00:FS00:0xa7: CPU ID: GenuineIntel Family 6 Model 62 Stepping 4
13:47:45:WU00:FS00:0xa7: CPUs: 48
13:47:45:WU00:FS00:0xa7: Memory: 15.48GiB
13:47:45:WU00:FS00:0xa7:Free Memory: 8.42GiB
13:47:45:WU00:FS00:0xa7: Threads: POSIX_THREADS
13:47:45:WU00:FS00:0xa7: OS Version: 4.18
13:47:45:WU00:FS00:0xa7:Has Battery: false
13:47:45:WU00:FS00:0xa7: On Battery: false
13:47:45:WU00:FS00:0xa7: UTC Offset: -7
13:47:45:WU00:FS00:0xa7: PID: 64123
13:47:45:WU00:FS00:0xa7: CWD: /var/lib/fahclient/work
13:47:45:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
13:47:45:WU00:FS00:0xa7: Version: 0.0.18
13:47:45:WU00:FS00:0xa7: Author: Joseph Coffland <[email protected]>
13:47:45:WU00:FS00:0xa7: Copyright: 2019 foldingathome.org
13:47:45:WU00:FS00:0xa7: Homepage: https://foldingathome.org/
13:47:45:WU00:FS00:0xa7: Date: Nov 5 2019
13:47:45:WU00:FS00:0xa7: Time: 06:13:26
13:47:45:WU00:FS00:0xa7: Revision: 490c9aa2957b725af319379424d5c5cb36efb656
13:47:45:WU00:FS00:0xa7: Branch: master
13:47:45:WU00:FS00:0xa7: Compiler: GNU 8.3.0
13:47:45:WU00:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie
13:47:45:WU00:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
13:47:45:WU00:FS00:0xa7: Bits: 64
13:47:45:WU00:FS00:0xa7: Mode: Release
13:47:45:WU00:FS00:0xa7:************************************ Build *************************************
13:47:45:WU00:FS00:0xa7: SIMD: avx_256
13:47:45:WU00:FS00:0xa7:********************************************************************************
13:47:45:WU00:FS00:0xa7:Project: 16417 (Run 473, Clone 2, Gen 83)
13:47:45:WU00:FS00:0xa7:Unit: 0x0000005a96880e6e5e8a61200c024db9
13:47:45:WU00:FS00:0xa7:Reading tar file core.xml
13:47:45:WU00:FS00:0xa7:Reading tar file frame83.tpr
13:47:45:WU00:FS00:0xa7:Digital signatures verified
13:47:45:WU00:FS00:0xa7:Calling: mdrun -s frame83.tpr -o frame83.trr -x frame83.xtc -cpt 5 -nt 48
13:47:45:WU00:FS00:0xa7:Steps: first=20750000 total=250000
13:47:45:WU00:FS00:0xa7:ERROR:
13:47:45:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
13:47:45:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
13:47:45:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
13:47:45:WU00:FS00:0xa7:ERROR:
13:47:45:WU00:FS00:0xa7:ERROR:Fatal error:
13:47:45:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 40 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
13:47:45:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
13:47:45:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
13:47:45:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
13:47:45:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
13:47:45:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
13:47:50:WU00:FS00:0xa7:WARNING:Unexpected exit() call
13:47:50:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
13:47:50:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
13:47:50:WU00:FS00:0xa7:Saving result file md.log
13:47:50:WU00:FS00:0xa7:Saving result file science.log
13:47:51:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
Here is md.log in case it helps:
Code:
Log file opened on Mon Apr 20 06:47:45 2020
Host: direwolf-fah.wolfeindustrie.com pid: 64123 rank ID: 0 number of ranks: 1
GROMACS: GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
GROMACS is written by:
Emile Apol Rossen Apostolov Herman J.C. Berendsen Par Bjelkmar
Aldert van Buuren Rudi van Drunen Anton Feenstra Sebastian Fritsch
Gerrit Groenhof Christoph Junghans Peter Kasson Carsten Kutzner
Per Larsson Justin A. Lemkul Magnus Lundborg Pieter Meulenhoff
Erik Marklund Teemu Murtola Szilard Pall Sander Pronk
Roland Schulz Alexey Shvetsov Michael Shirts Alfons Sijbers
Peter Tieleman Christian Wennberg Maarten Wolf
and the project leaders:
Mark Abraham, Berk Hess, Erik Lindahl, and David van der Spoel
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2014, The GROMACS development team at
Uppsala University, Stockholm University and
the Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.
GROMACS: GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
Gromacs version: VERSION 5.0.4-20191026-456f0d636-unknown
GIT SHA1 hash: 456f0d636b694d70ef483843dbb1b1383643ee12
Branched from: unknown
Precision: single
Memory model: 64 bit
MPI library: thread_mpi
OpenMP support: disabled
GPU support: disabled
invsqrt routine: gmx_software_invsqrt(x)
SIMD instructions: AVX_256
FFT library: fftw-3.3.8-sse2-avx
RDTSCP usage: disabled
C++11 compilation: disabled
TNG support: enabled
Tracing support: disabled
Built on: Wed Mar 22 01:02:31 UTC 2017
Built by: root@69562b3fdcef [CMAKE]
Build OS/arch: Linux 4.9.0-1-amd64 x86_64
Build CPU vendor: GenuineIntel
Build CPU brand: Intel(R) Core(TM) i7-3770S CPU @ 3.10GHz
Build CPU family: 6 Model: 58 Stepping: 9
Build CPU features: aes apic avx clfsh cmov cx8 cx16 f16c htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
C compiler: /usr/bin/cc GNU 8.3.0
C compiler flags: -mavx -I/host/debian-stable-64bit-core-a7-avx-release/libfah/build/src -I/host/debian-stable-64bit-core-a7-avx-release/cbang/build/include -Wno-maybe-uninitialized -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter -Wno-unknown-pragmas -O3 -DNDEBUG -fomit-frame-pointer -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
C++ compiler: /usr/bin/c++ GNU 8.3.0
C++ compiler flags: -mavx -I/host/debian-stable-64bit-core-a7-avx-release/libfah/build/src -I/host/debian-stable-64bit-core-a7-avx-release/cbang/build/include -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function -Wno-unknown-pragmas -O3 -DNDEBUG -fomit-frame-pointer -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds
Boost version: 1.55.0 (internal)
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess and C. Kutzner and D. van der Spoel and E. Lindahl
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable
molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 435-447
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark and H. J. C.
Berendsen
GROMACS: Fast, Flexible and Free
J. Comp. Chem. 26 (2005) pp. 1701-1719
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
E. Lindahl and B. Hess and D. van der Spoel
GROMACS 3.0: A package for molecular simulation and trajectory analysis
J. Mol. Mod. 7 (2001) pp. 306-317
-------- -------- --- Thank You --- -------- --------
++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
H. J. C. Berendsen, D. van der Spoel and R. van Drunen
GROMACS: A message-passing parallel molecular dynamics implementation
Comp. Phys. Comm. 91 (1995) pp. 43-56
-------- -------- --- Thank You --- -------- --------
Can not increase nstlist because verlet-buffer-tolerance is not set or used
Input Parameters:
integrator = md
tinit = 0
dt = 0.004
nsteps = 250000
init-step = 20750000
simulation-part = 1
comm-mode = Linear
nstcomm = 5
bd-fric = 0
ld-seed = 2396924895
emtol = 10
emstep = 0.01
niter = 20
fcstep = 0
nstcgsteep = 1000
nbfgscorr = 10
rtpi = 0.05
nstxout = 0
nstvout = 0
nstfout = 0
nstlog = 0
nstcalcenergy = 0
nstenergy = 0
nstxout-compressed = 5000
compressed-x-precision = 1000
cutoff-scheme = Verlet
nstlist = 10
ns-type = Grid
pbc = xyz
periodic-molecules = FALSE
verlet-buffer-tolerance = -1
rlist = 1.1
rlistlong = 1.1
nstcalclr = 10
coulombtype = PME
coulomb-modifier = Potential-shift
rcoulomb-switch = 0
rcoulomb = 0.9
epsilon-r = 1
epsilon-rf = inf
vdw-type = Cut-off
vdw-modifier = Potential-shift
rvdw-switch = 0
rvdw = 0.9
DispCorr = EnerPres
table-extension = 1
fourierspacing = 0.12
fourier-nx = 72
fourier-ny = 72
fourier-nz = 72
pme-order = 4
ewald-rtol = 1e-05
ewald-rtol-lj = 0.001
lj-pme-comb-rule = Geometric
ewald-geometry = 0
epsilon-surface = 0
implicit-solvent = No
gb-algorithm = Still
nstgbradii = 1
rgbradii = 1
gb-epsilon-solvent = 80
gb-saltconc = 0
gb-obc-alpha = 1
gb-obc-beta = 0.8
gb-obc-gamma = 4.85
gb-dielectric-offset = 0.009
sa-algorithm = Ace-approximation
sa-surface-tension = 2.05016
tcoupl = V-rescale
nsttcouple = 10
nh-chain-length = 0
print-nose-hoover-chain-variables = FALSE
pcoupl = Parrinello-Rahman
pcoupltype = Isotropic
nstpcouple = 10
tau-p = 1
compressibility (3x3):
compressibility[ 0]={ 4.50000e-05, 0.00000e+00, 0.00000e+00}
compressibility[ 1]={ 0.00000e+00, 4.50000e-05, 0.00000e+00}
compressibility[ 2]={ 0.00000e+00, 0.00000e+00, 4.50000e-05}
ref-p (3x3):
ref-p[ 0]={ 1.00000e+00, 0.00000e+00, 0.00000e+00}
ref-p[ 1]={ 0.00000e+00, 1.00000e+00, 0.00000e+00}
ref-p[ 2]={ 0.00000e+00, 0.00000e+00, 1.00000e+00}
refcoord-scaling = All
posres-com (3):
posres-com[0]= 0.00000e+00
posres-com[1]= 0.00000e+00
posres-com[2]= 0.00000e+00
posres-comB (3):
posres-comB[0]= 0.00000e+00
posres-comB[1]= 0.00000e+00
posres-comB[2]= 0.00000e+00
QMMM = FALSE
QMconstraints = 0
QMMMscheme = 0
MMChargeScaleFactor = 1
qm-opts:
ngQM = 0
constraint-algorithm = Lincs
continuation = TRUE
Shake-SOR = FALSE
shake-tol = 0.0001
lincs-order = 6
lincs-iter = 2
lincs-warnangle = 30
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype[0] = -1
wall-atomtype[1] = -1
wall-density[0] = 0
wall-density[1] = 0
wall-ewald-zfac = 3
pull = no
rotation = FALSE
interactiveMD = FALSE
disre = No
disre-weighting = Conservative
disre-mixed = FALSE
dr-fc = 1000
dr-tau = 0
nstdisreout = 100
orire-fc = 0
orire-tau = 0
nstorireout = 100
free-energy = no
cos-acceleration = 0
deform (3x3):
deform[ 0]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 1]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
deform[ 2]={ 0.00000e+00, 0.00000e+00, 0.00000e+00}
simulated-tempering = FALSE
E-x:
n = 0
E-xt:
n = 0
E-y:
n = 0
E-yt:
n = 0
E-z:
n = 0
E-zt:
n = 0
swapcoords = no
adress = FALSE
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0
grpopts:
nrdf: 86119
ref-t: 300
tau-t: 0.1
annealing: No
annealing-npoints: 0
acc: 0 0 0
nfreeze: N N N
energygrp-flags[ 0]: 0
Initializing Domain Decomposition on 48 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
two-body bonded interactions: 0.429 nm, LJ-14, atoms 4153 4162
multi-body bonded interactions: 0.429 nm, Proper Dih., atoms 4153 4162
Minimum cell size due to bonded interactions: 0.472 nm
Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.138 nm
Estimated maximum distance required for P-LINCS: 1.138 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.17
Will use 40 particle-particle and 8 PME only ranks
This is a guess, check the performance at the end of the log file
Using 8 separate PME ranks, as guessed by mdrun
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 40 cells with a minimum initial size of 1.423 nm
The maximum allowed number of cells is: X 4 Y 4 Z 4
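The last lines of md.log point at the cause: mdrun has to arrange 40 particle-particle ranks on a grid of at most 4 x 4 x 4 cells. The following is a rough sketch of that constraint only, not GROMACS's actual grid optimizer. The function name `dd_grid_options` is made up for illustration, the rank counts 96, 40 and 20 are taken from the error messages in this thread, and applying the same 4 x 4 x 4 limit to the other two failing runs is an assumption (their logs report the same minimum cell size but not the cell limit). The values 64 and 36 are purely illustrative.

```python
from itertools import product


def dd_grid_options(n_ranks, max_cells=(4, 4, 4)):
    """Return every (nx, ny, nz) grid whose product equals n_ranks
    and whose dimensions stay within max_cells."""
    mx, my, mz = max_cells
    return [
        (nx, ny, nz)
        for nx, ny, nz in product(range(1, mx + 1),
                                  range(1, my + 1),
                                  range(1, mz + 1))
        if nx * ny * nz == n_ranks
    ]


if __name__ == "__main__":
    # 96, 40 and 20 are the rank counts named in the errors above;
    # 64 and 36 are illustrative values that do fit.
    for pp_ranks in (96, 40, 20, 64, 36):
        grids = dd_grid_options(pp_ranks)
        status = "fits" if grids else "no valid grid"
        print(pp_ranks, status, grids[:1])
```

Under these assumptions 40 and 20 cannot fit because both contain the prime factor 5, which is larger than any allowed dimension, and 96 cannot fit because it exceeds the 64 cells available.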
After turning FAH down to Medium so that it only requests 47 threads, the unit runs fine:
Code:
13:55:07:WU00:FS00:Starting
13:55:07:WARNING:WU00:FS00:Changed SMP threads from 48 to 47 this can cause some work units to fail
13:55:07:WU00:FS00:Removing old file 'work/00/logfile_01-20200420-131644.txt'
13:55:07:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 706 -lifeline 84821 -checkpoint 5 -np 47
13:55:07:WU00:FS00:Started FahCore on PID 64727
13:55:07:WU00:FS00:Core PID:64731
13:55:07:WU00:FS00:FahCore 0xa7 started
13:55:07:WU00:FS00:0xa7:*********************** Log Started 2020-04-20T13:55:07Z ***********************
13:55:07:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
13:55:07:WU00:FS00:0xa7: Type: 0xa7
13:55:07:WU00:FS00:0xa7: Core: Gromacs
13:55:07:WU00:FS00:0xa7: Args: -dir 00 -suffix 01 -version 706 -lifeline 64727 -checkpoint 5 -np
13:55:07:WU00:FS00:0xa7: 47
13:55:07:WU00:FS00:0xa7:************************************ CBang *************************************
13:55:07:WU00:FS00:0xa7: Date: Nov 5 2019
13:55:07:WU00:FS00:0xa7: Time: 06:06:57
13:55:07:WU00:FS00:0xa7: Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
13:55:07:WU00:FS00:0xa7: Branch: master
13:55:07:WU00:FS00:0xa7: Compiler: GNU 8.3.0
13:55:07:WU00:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
13:55:07:WU00:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
13:55:07:WU00:FS00:0xa7: Bits: 64
13:55:07:WU00:FS00:0xa7: Mode: Release
13:55:07:WU00:FS00:0xa7:************************************ System ************************************
13:55:07:WU00:FS00:0xa7: CPU: Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
13:55:07:WU00:FS00:0xa7: CPU ID: GenuineIntel Family 6 Model 62 Stepping 4
13:55:07:WU00:FS00:0xa7: CPUs: 48
13:55:07:WU00:FS00:0xa7: Memory: 15.48GiB
13:55:07:WU00:FS00:0xa7:Free Memory: 8.38GiB
13:55:07:WU00:FS00:0xa7: Threads: POSIX_THREADS
13:55:07:WU00:FS00:0xa7: OS Version: 4.18
13:55:07:WU00:FS00:0xa7:Has Battery: false
13:55:07:WU00:FS00:0xa7: On Battery: false
13:55:07:WU00:FS00:0xa7: UTC Offset: -7
13:55:07:WU00:FS00:0xa7: PID: 64731
13:55:07:WU00:FS00:0xa7: CWD: /var/lib/fahclient/work
13:55:07:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
13:55:07:WU00:FS00:0xa7: Version: 0.0.18
13:55:07:WU00:FS00:0xa7: Author: Joseph Coffland <[email protected]>
13:55:07:WU00:FS00:0xa7: Copyright: 2019 foldingathome.org
13:55:07:WU00:FS00:0xa7: Homepage: https://foldingathome.org/
13:55:07:WU00:FS00:0xa7: Date: Nov 5 2019
13:55:07:WU00:FS00:0xa7: Time: 06:13:26
13:55:07:WU00:FS00:0xa7: Revision: 490c9aa2957b725af319379424d5c5cb36efb656
13:55:07:WU00:FS00:0xa7: Branch: master
13:55:07:WU00:FS00:0xa7: Compiler: GNU 8.3.0
13:55:07:WU00:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie
13:55:07:WU00:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
13:55:07:WU00:FS00:0xa7: Bits: 64
13:55:07:WU00:FS00:0xa7: Mode: Release
13:55:07:WU00:FS00:0xa7:************************************ Build *************************************
13:55:07:WU00:FS00:0xa7: SIMD: avx_256
13:55:07:WU00:FS00:0xa7:********************************************************************************
13:55:07:WU00:FS00:0xa7:Project: 16417 (Run 473, Clone 2, Gen 83)
13:55:07:WU00:FS00:0xa7:Unit: 0x0000005a96880e6e5e8a61200c024db9
13:55:07:WU00:FS00:0xa7:Reading tar file core.xml
13:55:07:WU00:FS00:0xa7:Reading tar file frame83.tpr
13:55:07:WU00:FS00:0xa7:Digital signatures verified
13:55:07:WU00:FS00:0xa7:Reducing thread count from 47 to 46 to avoid domain decomposition by a prime number > 3
13:55:07:WU00:FS00:0xa7:Reducing thread count from 46 to 45 to avoid domain decomposition with large prime factor 23
13:55:07:WU00:FS00:0xa7:Calling: mdrun -s frame83.tpr -o frame83.trr -x frame83.xtc -cpt 5 -nt 45
13:55:07:WU00:FS00:0xa7:Steps: first=20750000 total=250000
13:55:08:Removing old file 'configs/config-20200407-010322.xml'
13:55:08:Saving configuration to /etc/fahclient/config.xml
13:55:08:<config>
13:55:08: <!-- Folding Core -->
13:55:08: <checkpoint v='5'/>
13:55:08:
13:55:08: <!-- Folding Slot Configuration -->
13:55:08: <gpu v='false'/>
13:55:08:
13:55:08: <!-- HTTP Server -->
13:55:08: <allow v='10.10.10.0/24 127.0.0.1'/>
13:55:08:
13:55:08: <!-- Network -->
13:55:08: <proxy v=':8080'/>
13:55:08:
13:55:08: <!-- Remote Command Server -->
13:55:08: <command-allow-no-pass v='10.10.10.0/24 127.0.0.1'/>
13:55:08: <password v='*****'/>
13:55:08:
13:55:08: <!-- User Information -->
13:55:08: <passkey v='*****'/>
13:55:08: <team v='241312'/>
13:55:08: <user v='whlee'/>
13:55:08:
13:55:08: <!-- Folding Slots -->
13:55:08: <slot id='0' type='CPU'>
13:55:08: <client-type v='bigbeta'/>
13:55:08: </slot>
13:55:08:</config>
13:55:09:WU00:FS00:0xa7:Completed 1 out of 250000 steps (0%)
13:55:24:WU00:FS00:0xa7:Completed 2500 out of 250000 steps (1%)
13:55:38:WU00:FS00:0xa7:Completed 5000 out of 250000 steps (2%)
13:55:52:WU00:FS00:0xa7:Completed 7500 out of 250000 steps (3%)
13:56:06:WU00:FS00:0xa7:Completed 10000 out of 250000 steps (4%)
Re: Project 16417 fails on high core count machines
Posted: Mon Apr 20, 2020 2:24 pm
by Neil-B
You may find setting core count to 32 will complete the WU.
Re: Project 16417 fails on high core count machines
Posted: Mon Apr 20, 2020 6:29 pm
by Joe_H
Neil-B wrote:You may find setting core count to 32 will complete the WU.
However, if the WU was downloaded with the CPU thread count set to 24, you will not be able to raise it above that number.
Re: Project 16417 fails on high core count machines
Posted: Mon Apr 20, 2020 7:22 pm
by Neil-B
Sorry, I was responding to Zzyzx's post, whose log showed it running 48 and then 47 threads. I guess there may be a number between 32 and 47/48 that works as well. My bad.
Re: Project 16417 fails on high core count machines
Posted: Tue Apr 21, 2020 5:39 am
by PantherX
Zzyzx wrote:...
13:55:08: <client-type v='bigbeta'/>
...
Please note that in the current client, there's no argument value called "bigbeta" so you can remove it.
Re: Project 16417 fails on high core count machines
Posted: Wed Apr 22, 2020 6:16 am
by Zzyzx
Neil-B wrote:Sorry, I was responding to Zzyzx's post, whose log showed it running 48 and then 47 threads. I guess there may be a number between 32 and 47/48 that works as well. My bad.
Yeah, I was getting the error with 48 threads. I found that after turning it down to 47 (which the core actually reduced to 45 because of prime factors), it ran just fine.
PantherX wrote:Zzyzx wrote:...
13:55:08: <client-type v='bigbeta'/>
...
Please note that in the current client, there's no argument value called "bigbeta" so you can remove it.
Ah, thanks, updated!
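For anyone wondering how 47 ended up as 45: the two "Reducing thread count" lines in the log above suggest the core steps the count down while it is a prime greater than 3 or has a large prime factor. Below is a minimal sketch of that behaviour. It is not the FahCore source, the function names are hypothetical, and the cutoff `max_factor=5` is an assumption chosen only because the log shows 23 being rejected and 45 (3 x 3 x 5) being accepted.

```python
def largest_prime_factor(n):
    """Largest prime factor of n by trial division (returns 1 for n == 1)."""
    largest = 1
    d = 2
    while d * d <= n:
        while n % d == 0:
            largest = d
            n //= d
        d += 1
    if n > 1:
        largest = n  # whatever remains is itself prime
    return largest


def reduce_threads(n, max_factor=5):
    """Step n down until it is neither a prime > 3 nor divisible by a
    prime larger than max_factor (max_factor is an assumed cutoff)."""
    while n > 1:
        lpf = largest_prime_factor(n)
        is_prime_over_3 = (lpf == n and n > 3)
        if not is_prime_over_3 and lpf <= max_factor:
            break
        n -= 1
    return n


if __name__ == "__main__":
    for requested in (128, 48, 47, 24):
        print(requested, "->", reduce_threads(requested))
```

With these assumptions 47 goes to 46 and then 45, matching the log, while 128, 48 and 24 are left unchanged. That fits the failures reported in this thread: none of those counts trips the prime-factor check, so the problem only shows up later in the domain decomposition.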
Re: Project 16417 fails on high core count machines
Posted: Sat Jun 20, 2020 3:36 pm
by HendricksSA
Project 16417 (Run 904, Clone 3, Gen 243): domain decomposition failure. I thought this would no longer be assigned to 48-thread machines after all this conversation, but I got one this morning. I changed to 45 threads per _r2w_ben's advice and it is processing fine. Just passing this along as an FYI.