My apologies if this isn't the correct place to post. Overnight I ran into an issue on one of my servers, which has dual Intel Xeon Gold 5220s. Each time the machine tried to run the work unit, it hit this fatal error: There is no domain decomposition for 54 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
I have been able to work around the issue by reducing the original CPU slot from 72 to 36 threads and then creating a second 36-thread slot. I am hoping someone can share best practices for this type of compute environment. Is it better to have one slot with all 72 threads, two with 36, or four with 18? Should I turn off hyper-threading and only let the system run on physical CPU cores?
We have 5 more of these machines sitting idle that I'd like to put to work towards F@H. Your help is appreciated!
Logs below - Thanks!
05:59:12:WU00:FS00:0xa7:*********************** Log Started 2020-03-28T05:59:12Z ***********************
05:59:12:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
05:59:12:WU00:FS00:0xa7: Type: 0xa7
05:59:12:WU00:FS00:0xa7: Core: Gromacs
05:59:12:WU00:FS00:0xa7: Args: -dir 00 -suffix 01 -version 704 -lifeline 1427 -checkpoint 15 -np
05:59:12:WU00:FS00:0xa7: 72
05:59:12:WU00:FS00:0xa7:************************************ CBang *************************************
05:59:12:WU00:FS00:0xa7: Date: Nov 5 2019
05:59:12:WU00:FS00:0xa7: Time: 06:06:57
05:59:12:WU00:FS00:0xa7: Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
05:59:12:WU00:FS00:0xa7: Branch: master
05:59:12:WU00:FS00:0xa7: Compiler: GNU 8.3.0
05:59:12:WU00:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
05:59:12:WU00:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
05:59:12:WU00:FS00:0xa7: Bits: 64
05:59:12:WU00:FS00:0xa7: Mode: Release
05:59:12:WU00:FS00:0xa7:************************************ System ************************************
05:59:12:WU00:FS00:0xa7: CPU: Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz
05:59:12:WU00:FS00:0xa7: CPU ID: GenuineIntel Family 6 Model 85 Stepping 7
05:59:12:WU00:FS00:0xa7: CPUs: 72
05:59:12:WU00:FS00:0xa7: Memory: 15.65GiB
05:59:12:WU00:FS00:0xa7:Free Memory: 15.13GiB
05:59:12:WU00:FS00:0xa7: Threads: POSIX_THREADS
05:59:12:WU00:FS00:0xa7: OS Version: 4.19
05:59:12:WU00:FS00:0xa7:Has Battery: false
05:59:12:WU00:FS00:0xa7: On Battery: false
05:59:12:WU00:FS00:0xa7: UTC Offset: 0
05:59:12:WU00:FS00:0xa7: PID: 1431
05:59:12:WU00:FS00:0xa7: CWD: /var/lib/fahclient/work
05:59:12:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
05:59:12:WU00:FS00:0xa7: Version: 0.0.18
05:59:12:WU00:FS00:0xa7: Author: Joseph Coffland <[email protected]>
05:59:12:WU00:FS00:0xa7: Copyright: 2019 foldingathome.org
05:59:12:WU00:FS00:0xa7: Homepage: https://foldingathome.org/
05:59:12:WU00:FS00:0xa7: Date: Nov 5 2019
05:59:12:WU00:FS00:0xa7: Time: 06:13:26
05:59:12:WU00:FS00:0xa7: Revision: 490c9aa2957b725af319379424d5c5cb36efb656
05:59:12:WU00:FS00:0xa7: Branch: master
05:59:12:WU00:FS00:0xa7: Compiler: GNU 8.3.0
05:59:12:WU00:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie
05:59:12:WU00:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
05:59:12:WU00:FS00:0xa7: Bits: 64
05:59:12:WU00:FS00:0xa7: Mode: Release
05:59:12:WU00:FS00:0xa7:************************************ Build *************************************
05:59:12:WU00:FS00:0xa7: SIMD: avx_256
05:59:12:WU00:FS00:0xa7:********************************************************************************
05:59:12:WU00:FS00:0xa7:Project: 14584 (Run 0, Clone 548, Gen 32)
05:59:12:WU00:FS00:0xa7:Unit: 0x000000210d5262775e7a6b6d7026244c
05:59:12:WU00:FS00:0xa7:Reading tar file core.xml
05:59:12:WU00:FS00:0xa7:Reading tar file frame32.tpr
05:59:12:WU00:FS00:0xa7:Digital signatures verified
05:59:12:WU00:FS00:0xa7:Calling: mdrun -s frame32.tpr -o frame32.trr -x frame32.xtc -cpt 15 -nt 72
05:59:12:WU00:FS00:0xa7:Steps: first=8000000 total=250000
05:59:12:WU00:FS00:0xa7:ERROR:
05:59:12:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
05:59:12:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
05:59:12:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
05:59:12:WU00:FS00:0xa7:ERROR:
05:59:12:WU00:FS00:0xa7:ERROR:Fatal error:
05:59:12:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 54 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
05:59:12:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
05:59:12:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
05:59:12:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
05:59:12:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
05:59:12:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
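
As a rough illustration of what the error is saying: domain decomposition splits the simulation box into an nx x ny x nz grid of cells, one per rank, and no cell may be narrower than the minimum cell size along any axis. The 54 in the message, against -nt 72 in the mdrun call, suggests that 18 of the 72 threads were set aside for something other than domain decomposition (separate PME ranks would be one explanation, though the log does not say so). The Python sketch below is only a simplified model of that check, not what mdrun actually runs. The box dimensions are not in the log, so `BOX_NM` is a made-up value chosen purely for illustration, and the function names are hypothetical.

```python
def grid_candidates(n_ranks):
    """Return every ordered (nx, ny, nz) with nx * ny * nz == n_ranks."""
    grids = []
    for nx in range(1, n_ranks + 1):
        if n_ranks % nx:
            continue
        rest = n_ranks // nx
        for ny in range(1, rest + 1):
            if rest % ny == 0:
                grids.append((nx, ny, rest // ny))
    return grids


def feasible_grids(n_ranks, box_nm, min_cell_nm):
    """Keep only the grids whose cells are at least min_cell_nm wide on every axis."""
    return [
        grid
        for grid in grid_candidates(n_ranks)
        if all(length / n >= min_cell_nm for length, n in zip(box_nm, grid))
    ]


MIN_CELL_NM = 1.45733        # taken from the error message
BOX_NM = (7.0, 7.0, 7.0)     # ASSUMPTION: hypothetical box, the real size is not in the log

for ranks in (54, 36, 27, 18):
    ok = feasible_grids(ranks, BOX_NM, MIN_CELL_NM)
    print(ranks, "ranks:", ok if ok else "no compatible grid")
```

With this assumed box, at most four cells fit along each axis, so 54 = 2 x 3 x 3 x 3 cannot be arranged as a three-factor grid, while 36 (3 x 3 x 4) and 27 (3 x 3 x 3) can. The real limits depend on the actual box in the work unit and on further constraints that mdrun applies, so treat this only as a way to see why some thread counts fail where nearby ones work.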