[Bad WU, Possible freak crash] PRCG 13833 (0,4937,10)

Moderators: Site Moderators, FAHC Science Team

HSF
Posts: 5
Joined: Tue Mar 17, 2020 9:17 pm

[Bad WU, Possible freak crash] PRCG 13833 (0,4937,10)

Post by HSF »

Log attached below.

Code:

08:43:05:WU02:FS00:0xa7:Project: 13833 (Run 0, Clone 4937, Gen 10)
08:43:05:WU02:FS00:0xa7:Unit: 0x0000000e80fccb095e6e556528ff8640
08:43:05:WU02:FS00:0xa7:Reading tar file core.xml
08:43:05:WU02:FS00:0xa7:Reading tar file frame10.tpr
08:43:05:WU02:FS00:0xa7:Digital signatures verified
08:43:05:WU02:FS00:0xa7:Calling: mdrun -s frame10.tpr -o frame10.trr -x frame10.xtc -cpt 15 -nt 15
08:43:05:WU02:FS00:0xa7:Steps: first=2500000 total=250000
08:43:05:WU02:FS00:0xa7:ERROR:
08:43:05:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
08:43:05:WU02:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:43:05:WU02:FS00:0xa7:ERROR:Source code file: C:\build\fah\core-a7-avx-release\windows-10-64bit-core-a7-avx-release\gromacs-core\build\gromacs\src\gromacs\mdlib\domdec.c, line: 6902
08:43:05:WU02:FS00:0xa7:ERROR:
08:43:05:WU02:FS00:0xa7:ERROR:Fatal error:
08:43:05:WU02:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
08:43:05:WU02:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
08:43:05:WU02:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
08:43:05:WU02:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:43:05:WU02:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:43:05:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
08:43:10:WU02:FS00:0xa7:WARNING:Unexpected exit() call
08:43:10:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
08:43:10:WU02:FS00:0xa7:Saving result file ..\logfile_01.txt
08:43:10:WU02:FS00:0xa7:Saving result file md.log
08:43:10:WU02:FS00:0xa7:Saving result file science.log
08:43:10:WU02:FS00:0xa7:WARNING:While cleaning up: boost::filesystem::remove: The process cannot access the file because it is being used by another process: "01/md.log"
08:43:10:WU02:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
08:43:10:WARNING:WU02:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
08:43:10:WU02:FS00:Sending unit results: id:02 state:SEND error:FAULTY project:13833 run:0 clone:4937 gen:10 core:0xa7 unit:0x0000000e80fccb095e6e556528ff8640
Considering that I'm running other WUs without any problems, is this possibly a freak crash and/or a bad generation?
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: [Bad WU, Possible freak crash] PRCG 13833 (0,4937,10)

Post by Neil-B »

I think this has been spotted. I believe this project may no longer be issued for 15 cores. Someone will confirm, but there is a recent post on this.

Edit: the post I was thinking of turned out to be about a different project, but it is possibly the same type of issue, relating to the number of cores you are folding with. A search for "large primes" may throw light on it. Most projects can cope with core counts that are a multiple of 5, but some have been sensitive to this.
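
To illustrate why a core count with a factor of 5 can trip this error, here is a rough Python sketch. It assumes a simplified picture of domain decomposition: the box is cut into an nx × ny × nz grid with exactly one cell per rank, and every cell edge must be at least the minimum cell size quoted in the log. The 6 nm cubic box is purely a made-up value (the log does not give the real box), the function names are hypothetical, and the real GROMACS logic is more involved (for example, it may reserve some ranks for PME).

```python
def grid_factorizations(n):
    """Return every ordered grid (nx, ny, nz) with nx * ny * nz == n."""
    grids = []
    for nx in range(1, n + 1):
        if n % nx:
            continue
        rest = n // nx
        for ny in range(1, rest + 1):
            if rest % ny == 0:
                grids.append((nx, ny, rest // ny))
    return grids


def feasible_grids(n_ranks, box_nm, min_cell_nm):
    """Keep only the grids whose cells are at least min_cell_nm on every edge."""
    return [
        grid
        for grid in grid_factorizations(n_ranks)
        if all(length / cells >= min_cell_nm
               for length, cells in zip(box_nm, grid))
    ]


MIN_CELL_NM = 1.45733      # taken from the error message in the log
BOX_NM = (6.0, 6.0, 6.0)   # ASSUMPTION: hypothetical box, not from the log

for n_ranks in (12, 15, 16):
    grids = feasible_grids(n_ranks, BOX_NM, MIN_CELL_NM)
    print(n_ranks, "ranks:", len(grids), "usable grid(s)")
```

With this assumed box, at most 4 cells fit along any edge, so any rank count that forces 5 or more cells along one edge (as 15 = 3 × 5 does) has no usable grid, while 12 or 16 still do. Changing the slot's CPU count to such a value is the usual workaround for this kind of failure.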
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
Joe_H
Site Admin
Posts: 7990
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
Location: W. MA

Re: [Bad WU, Possible freak crash] PRCG 13833 (0,4937,10)

Post by Joe_H »

Another problem I spot is here:

Code:

08:43:10:WU02:FS00:0xa7:WARNING:While cleaning up: boost::filesystem::remove: The process cannot access the file because it is being used by another process: "01/md.log"
Either part of the process had not exited properly and the file was still open when it shouldn't have been, or there is some filesystem problem. I would go with the first explanation, as long as you don't see this kind of error repeating.
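
For anyone curious what "still open" means in practice: on Windows, deleting a file that another process holds open typically fails with a sharing violation, which Python reports as a `PermissionError`. The sketch below is only an illustration of waiting briefly for the other process to let go. `remove_with_retry` is a hypothetical helper, not something the FAH client or core is known to do.

```python
import os
import time


def remove_with_retry(path, attempts=5, delay_s=0.5):
    """Try to delete path, pausing between tries if it is still locked.

    Returns True once the file is gone, False if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            os.remove(path)
            return True
        except FileNotFoundError:
            return True  # nothing left to clean up
        except PermissionError:
            # Typical on Windows while another process has the file open.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    return False
```

If a call like this kept returning False for the same file, that would point towards the second explanation (a filesystem problem or some other program holding the file) rather than a one-off timing issue.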