14700 (689, 1, 0)

Moderators: Site Moderators, FAHC Science Team

DeHackEd
Posts: 5
Joined: Wed May 06, 2020 2:10 am

14700 (689, 1, 0)

Post by DeHackEd »

Code:

01:58:28:WU00:FS00:Connecting to 65.254.110.245:80
01:58:28:WU00:FS00:Assigned to work server 155.247.166.219
01:58:28:WU00:FS00:Requesting new work unit for slot 00: READY cpu:16 from 155.247.166.219
01:58:28:WU00:FS00:Connecting to 155.247.166.219:8080
01:58:28:WU00:FS00:Downloading 3.01MiB
01:58:33:WU00:FS00:Download complete
01:58:33:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:14700 run:689 clone:1 gen:0 core:0xa7 unit:0x000000010002894b5ea9fccab51a8e6a
01:58:33:WU00:FS00:Starting
01:58:33:WU00:FS00:Running FahCore: /opt/foldingathome/FAHCoreWrapper /opt/foldingathome/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 706 -lifeline 4644 -checkpoint 15 -np 16
01:58:33:WU00:FS00:Started FahCore on PID 10645
01:58:34:WU00:FS00:Core PID:10650
01:58:34:WU00:FS00:FahCore 0xa7 started
01:58:34:WU00:FS00:0xa7:*********************** Log Started 2020-05-06T01:58:34Z ***********************
01:58:34:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
01:58:34:WU00:FS00:0xa7:       Type: 0xa7
01:58:34:WU00:FS00:0xa7:       Core: Gromacs
01:58:34:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 706 -lifeline 10645 -checkpoint 15 -np
01:58:34:WU00:FS00:0xa7:             16
01:58:34:WU00:FS00:0xa7:************************************ CBang *************************************
01:58:34:WU00:FS00:0xa7:       Date: Nov 5 2019
01:58:34:WU00:FS00:0xa7:       Time: 06:06:57
01:58:34:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
01:58:34:WU00:FS00:0xa7:     Branch: master
01:58:34:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
01:58:34:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
01:58:34:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
01:58:34:WU00:FS00:0xa7:       Bits: 64
01:58:34:WU00:FS00:0xa7:       Mode: Release
01:58:34:WU00:FS00:0xa7:************************************ System ************************************
01:58:34:WU00:FS00:0xa7:        CPU: AMD Ryzen 9 3900X 12-Core Processor
01:58:34:WU00:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
01:58:34:WU00:FS00:0xa7:       CPUs: 24
01:58:34:WU00:FS00:0xa7:     Memory: 31.22GiB
01:58:34:WU00:FS00:0xa7:Free Memory: 5.56GiB
01:58:34:WU00:FS00:0xa7:    Threads: POSIX_THREADS
01:58:34:WU00:FS00:0xa7: OS Version: 4.9
01:58:34:WU00:FS00:0xa7:Has Battery: false
01:58:34:WU00:FS00:0xa7: On Battery: false
01:58:34:WU00:FS00:0xa7: UTC Offset: 0
01:58:34:WU00:FS00:0xa7:        PID: 10650
01:58:34:WU00:FS00:0xa7:        CWD: /opt/foldingathome/work
01:58:34:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
01:58:34:WU00:FS00:0xa7:    Version: 0.0.18
01:58:34:WU00:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
01:58:34:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
01:58:34:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
01:58:34:WU00:FS00:0xa7:       Date: Nov 5 2019
01:58:34:WU00:FS00:0xa7:       Time: 06:13:26
01:58:34:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
01:58:34:WU00:FS00:0xa7:     Branch: master
01:58:34:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
01:58:34:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
01:58:34:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
01:58:34:WU00:FS00:0xa7:       Bits: 64
01:58:34:WU00:FS00:0xa7:       Mode: Release
01:58:34:WU00:FS00:0xa7:************************************ Build *************************************
01:58:34:WU00:FS00:0xa7:       SIMD: avx_256
01:58:34:WU00:FS00:0xa7:********************************************************************************
01:58:34:WU00:FS00:0xa7:Project: 14700 (Run 689, Clone 1, Gen 0)
01:58:34:WU00:FS00:0xa7:Unit: 0x000000010002894b5ea9fccab51a8e6a
01:58:34:WU00:FS00:0xa7:Reading tar file core.xml
01:58:34:WU00:FS00:0xa7:Reading tar file frame0.tpr
01:58:34:WU00:FS00:0xa7:Digital signatures verified
01:58:34:WU00:FS00:0xa7:Calling: mdrun -s frame0.tpr -o frame0.trr -cpt 15 -nt 16
01:58:34:WU00:FS00:0xa7:Steps: first=0 total=250000
01:58:36:WU00:FS00:0xa7:Completed 1 out of 250000 steps (0%)
01:58:36:WU00:FS00:0xa7:ERROR:
01:58:36:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
01:58:36:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
01:58:36:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
01:58:36:WU00:FS00:0xa7:ERROR:
01:58:36:WU00:FS00:0xa7:ERROR:Fatal error:
01:58:36:WU00:FS00:0xa7:ERROR:4 particles communicated to PME rank 10 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
01:58:36:WU00:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
01:58:36:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
01:58:36:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
01:58:36:WU00:FS00:0xa7:ERROR:-------------------------------------------------------

Code:

[3065017.183142] FahCore_a7[10710]: segfault at 10 ip 0000000001200897 sp 00007ffda3261d60 error 4 in FahCore_a7[406000+10cc000]
Running client-type=advanced

The core is stuck in a crash loop: it restarts and then crashes again. Interestingly, the number of failed particles varies between 1 and 4, but the rest of the error message is consistent.

One other user reported it faulty, so I'm dumping it.
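
For anyone who wants to confirm a crash loop like this from their own log, here is a minimal Python sketch. It assumes the log line layout shown above (timestamp, WU and FS tags, core tag, message). The function name `summarize_crashes` and the second error line in the sample are hypothetical, made up for illustration, and not part of the FAH client.

```python
import re

# Matches the GROMACS fatal-error line seen in the log above.
PME_ERROR = re.compile(r"ERROR:(\d+) particles? communicated to PME rank (\d+)")
# Matches the "Project: P (Run R, Clone C, Gen G)" line the core prints.
PRCG = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def summarize_crashes(lines):
    """Return (prcg, particle_counts) collected from FAH log lines."""
    prcg = None
    counts = []
    for line in lines:
        m = PRCG.search(line)
        if m:
            prcg = tuple(int(g) for g in m.groups())
            continue
        m = PME_ERROR.search(line)
        if m:
            counts.append(int(m.group(1)))
    return prcg, counts


sample = [
    "01:58:34:WU00:FS00:0xa7:Project: 14700 (Run 689, Clone 1, Gen 0)",
    "01:58:36:WU00:FS00:0xa7:ERROR:4 particles communicated to PME rank 10 "
    "are more than 2/3 times the cut-off out of the domain decomposition "
    "cell of their charge group in dimension x.",
    # Hypothetical later attempt, only to illustrate a varying particle count.
    "01:59:02:WU00:FS00:0xa7:ERROR:2 particles communicated to PME rank 10 "
    "are more than 2/3 times the cut-off out of the domain decomposition "
    "cell of their charge group in dimension x.",
]

prcg, counts = summarize_crashes(sample)
# More than one fatal error for the same unit suggests a restart/crash loop.
print(prcg, counts, "crash loop" if len(counts) > 1 else "single failure")
```

Seeing the same fatal error repeat for one project/run/clone/gen with different particle counts is the pattern described above.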
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: 14700 (689, 1, 0)

Post by PantherX »

Welcome to the F@H Forum DeHackEd,

Thanks for reporting this. I have informed the researcher. I think that 16 CPUs might be too many for this project, but let's wait and see what happens.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 14700 (689, 1, 0)

Post by Neil-B »

I ran a number of these whilst the project was in beta. My 24/56-core and 32/56-core slots ran the project with no issues, so I'm not sure it is a too-many-CPUs issue.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
PantherX

Re: 14700 (689, 1, 0)

Post by PantherX »

Humm... In that case, maybe _r2w_ben can explain what this means, as they have been awesome at documenting the results (viewtopic.php?f=72&t=34350):
4 particles communicated to PME rank 10 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x
Neil-B

Re: 14700 (689, 1, 0)

Post by Neil-B »

The only thing I saw that was slightly odd was a stats anomaly for a few of this family of projects (14700, 14717 and 14800): on my 24/56-core slot, WUs returned low PPD (14700 15%, 14717 5% and 14800 10% below the minimum expected range), whereas the 32/56-core WUs for these projects were in the middle of the range. All completed fine, but I mentioned it at the time as an anomaly (probably not relevant).

I have a vague recollection that there was a project a while back that had issues with mid-range CPU counts?
vvoelz
Pande Group Member
Posts: 553
Joined: Sun Dec 02, 2007 8:07 pm
Location: Temple University, Philadelphia PA

Re: 14700 (689, 1, 0)

Post by vvoelz »

Hi PantherX --

Thanks for letting us know about this. Sorry you got stuck in a crash loop!

16 CPUs shouldn't be too many for this project. If anything, I think you have a "bad WU". Each RUN (there are 3000 in this project alone) is a different ligand we are screening as part of the COVID moonshot. Very rarely, we have noticed that a simulation appears to be poorly equilibrated and "blows up". I will stop this clone for now and keep an eye out. Interestingly, you are the first person to simulate this particular RUN (https://apps.foldingathome.org/wu#proje ... ne=1&gen=0), so maybe the clone is bad or the whole RUN is. Other RUNs in the project seem fine.

UPDATE: p14700, r689, c1 has been STOPPED.
DeHackEd
Posts: 5
Joined: Wed May 06, 2020 2:10 am

Re: 14700 (689, 1, 0)

Post by DeHackEd »

To clarify, I opened the thread, and I think I'm the second person to run it. There's one attempt listed on the WU status page (not mine) with a code of "Faulty 2". My client got stuck instead, and I deleted the WU, so I assume my attempt won't be reported at all.
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: 14700 (689, 1, 0)

Post by _r2w_ben »

PantherX wrote: Humm... In that case, maybe _r2w_ben can explain what this means, as they have been awesome at documenting the results (viewtopic.php?f=72&t=34350):
4 particles communicated to PME rank 10 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x

When the failure message mentions equilibration, I think it's a bad work unit.

An incompatibility between the work unit and the number of cores assigned to it looks like this:
There is no domain decomposition for ## ranks that is compatible with the given box and a minimum cell size of ## nm

Code:

08:33:22:WU03:FS00:0xa7:ERROR:-------------------------------------------------------
08:33:22:WU03:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:33:22:WU03:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
08:33:22:WU03:FS00:0xa7:ERROR:
08:33:22:WU03:FS00:0xa7:ERROR:Fatal error:
08:33:22:WU03:FS00:0xa7:ERROR:There is no domain decomposition for 60 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
08:33:22:WU03:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
08:33:22:WU03:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
08:33:22:WU03:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:33:22:WU03:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:33:22:WU03:FS00:0xa7:ERROR:-------------------------------------------------------
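
The distinction between the two failures can be sketched in Python. This is a rough heuristic based only on the two messages quoted in this thread, not an official FAH or GROMACS tool. The function name `classify_gromacs_error` and the category labels are made up for illustration.

```python
import re

# Patterns taken from the two fatal errors quoted in this thread.
BAD_WU = re.compile(r"not well equilibrated")
CORE_COUNT = re.compile(
    r"There is no domain decomposition for (\d+) ranks that is compatible "
    r"with the given box and a minimum cell size of ([\d.]+) nm"
)


def classify_gromacs_error(log_text):
    """Guess the failure type from core log text.

    Returns (label, ranks, min_cell_nm); the last two are None unless the
    domain decomposition message was found.
    """
    m = CORE_COUNT.search(log_text)
    if m:
        return ("core-count mismatch", int(m.group(1)), float(m.group(2)))
    if BAD_WU.search(log_text):
        return ("probable bad work unit", None, None)
    return ("unknown", None, None)


equil = "ERROR:This usually means that your system is not well equilibrated."
domdec = (
    "ERROR:There is no domain decomposition for 60 ranks that is compatible "
    "with the given box and a minimum cell size of 1.37225 nm"
)
print(classify_gromacs_error(equil))
print(classify_gromacs_error(domdec))
```

A "core-count mismatch" result suggests trying a different CPU count for the slot, while a "probable bad work unit" result is worth reporting with the project, run, clone and gen numbers.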
PantherX

Re: 14700 (689, 1, 0)

Post by PantherX »

Ah, thanks for the clarification, vvoelz & _r2w_ben :)