[Linux] Problem with BAD_WORK_UNIT
Posted: Sun Apr 26, 2020 10:41 pm
Hi there:
I'm currently folding on headless Linux machines, and I've noticed that I sometimes get the error shown in the log below.
From what I have read, it is caused by the number of CPU cores in use, but I can't change that through the GUI (the machines are headless), and each node has 40 cores. Is there a way to configure a smaller number of cores in config.xml? How many cores do you recommend to avoid this type of error? And is there a way to relaunch the client automatically so that no time is wasted?
Thank you!
22:31:44:WU01:FS00:0xa7:*********************** Log Started 2020-04-26T22:31:43Z ***********************
22:31:44:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
22:31:44:WU01:FS00:0xa7: Type: 0xa7
22:31:44:WU01:FS00:0xa7: Core: Gromacs
22:31:44:WU01:FS00:0xa7: Args: -dir 01 -suffix 01 -version 706 -lifeline 36972 -checkpoint 15 -np
22:31:44:WU01:FS00:0xa7: 39
22:31:44:WU01:FS00:0xa7:************************************ CBang *************************************
22:31:44:WU01:FS00:0xa7: Date: Nov 5 2019
22:31:44:WU01:FS00:0xa7: Time: 06:06:57
22:31:44:WU01:FS00:0xa7: Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
22:31:44:WU01:FS00:0xa7: Branch: master
22:31:44:WU01:FS00:0xa7: Compiler: GNU 8.3.0
22:31:44:WU01:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
22:31:44:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
22:31:44:WU01:FS00:0xa7: Bits: 64
22:31:44:WU01:FS00:0xa7: Mode: Release
22:31:44:WU01:FS00:0xa7:************************************ System ************************************
22:31:44:WU01:FS00:0xa7: CPU: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
22:31:44:WU01:FS00:0xa7: CPU ID: GenuineIntel Family 6 Model 79 Stepping 1
22:31:44:WU01:FS00:0xa7: CPUs: 40
22:31:44:WU01:FS00:0xa7: Memory: 62.65GiB
22:31:44:WU01:FS00:0xa7:Free Memory: 61.41GiB
22:31:44:WU01:FS00:0xa7: Threads: POSIX_THREADS
22:31:44:WU01:FS00:0xa7: OS Version: 3.10
22:31:44:WU01:FS00:0xa7:Has Battery: false
22:31:44:WU01:FS00:0xa7: On Battery: false
22:31:44:WU01:FS00:0xa7: UTC Offset: 2
22:31:44:WU01:FS00:0xa7: PID: 36976
22:31:44:WU01:FS00:0xa7: CWD: /var/lib/fahclient/work
22:31:44:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
22:31:44:WU01:FS00:0xa7: Version: 0.0.18
22:31:44:WU01:FS00:0xa7: Author: Joseph Coffland <[email protected]>
22:31:44:WU01:FS00:0xa7: Copyright: 2019 foldingathome.org
22:31:44:WU01:FS00:0xa7: Homepage: https://foldingathome.org/
22:31:44:WU01:FS00:0xa7: Date: Nov 5 2019
22:31:44:WU01:FS00:0xa7: Time: 06:13:26
22:31:44:WU01:FS00:0xa7: Revision: 490c9aa2957b725af319379424d5c5cb36efb656
22:31:44:WU01:FS00:0xa7: Branch: master
22:31:44:WU01:FS00:0xa7: Compiler: GNU 8.3.0
22:31:44:WU01:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie
22:31:44:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
22:31:44:WU01:FS00:0xa7: Bits: 64
22:31:44:WU01:FS00:0xa7: Mode: Release
22:31:44:WU01:FS00:0xa7:************************************ Build *************************************
22:31:44:WU01:FS00:0xa7: SIMD: avx_256
22:31:44:WU01:FS00:0xa7:********************************************************************************
22:31:44:WU01:FS00:0xa7:Project: 16417 (Run 1751, Clone 0, Gen 110)
22:31:44:WU01:FS00:0xa7:Unit: 0x0000007a96880e6e5e8a608553ba549c
22:31:44:WU01:FS00:0xa7:Reading tar file core.xml
22:31:44:WU01:FS00:0xa7:Reading tar file frame110.tpr
22:31:44:WU01:FS00:0xa7:Digital signatures verified
22:31:44:WU01:FS00:0xa7:Reducing thread count from 39 to 38 to avoid domain decomposition with large prime factor 13
22:31:44:WU01:FS00:0xa7:Reducing thread count from 38 to 37 to avoid domain decomposition with large prime factor 19
22:31:44:WU01:FS00:0xa7:Reducing thread count from 37 to 36 to avoid domain decomposition by a prime number > 3
22:31:44:WU01:FS00:0xa7:Calling: mdrun -s frame110.tpr -o frame110.trr -x frame110.xtc -cpt 15 -nt 36
22:31:44:WU01:FS00:0xa7:Steps: first=27500000 total=250000
22:31:44:WU01:FS00:0xa7:ERROR:
22:31:44:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
22:31:44:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
22:31:44:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
22:31:44:WU01:FS00:0xa7:ERROR:
22:31:44:WU01:FS00:0xa7:ERROR:Fatal error:
22:31:44:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 30 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
22:31:44:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
22:31:44:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
22:31:44:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
22:31:44:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
22:31:44:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
22:31:49:WU01:FS00:0xa7:WARNING:Unexpected exit() call
22:31:49:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
22:31:49:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
22:31:49:WU01:FS00:0xa7:Saving result file md.log
22:31:49:WU01:FS00:0xa7:Saving result file science.log
22:31:49:WU01:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
22:31:49:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
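As a rough illustration of the "Reducing thread count" lines in the log above, here is a minimal Python sketch of what the core appears to be doing: it steps the thread count down until the number no longer has a large prime factor (39 = 3 x 13, 38 = 2 x 19, 37 is prime, 36 = 2 x 2 x 3 x 3). This is only a reading of the log messages, not the actual FahCore code. The function names and the `max_factor` threshold of 3 are assumptions made for illustration.

```python
def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n (returns 1 for n == 1)."""
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    if n > 1:
        # Whatever remains after dividing out the small factors is prime.
        largest = n
    return largest


def reduce_thread_count(requested: int, max_factor: int = 3) -> int:
    """Step down from `requested` until no prime factor exceeds `max_factor`.

    `max_factor` is a hypothetical threshold inferred from the log text
    ("a prime number > 3"); the real core may use a different rule.
    """
    threads = requested
    while threads > 1 and largest_prime_factor(threads) > max_factor:
        threads -= 1
    return threads


print(reduce_thread_count(39))  # 36, matching "-nt 36" in the log
```

Under these assumptions, a count such as 36 or 32 would pass this check unchanged. The work unit in the log still failed at 36 threads, though, so passing the check does not guarantee that GROMACS finds a valid domain decomposition.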