[Bad WU, Possible freak crash] PRCG 13833 (0,4937,10)

Posted: Mon Mar 30, 2020 8:49 am
by HSF
Log attached below.

Code:

08:43:05:WU02:FS00:0xa7:Project: 13833 (Run 0, Clone 4937, Gen 10)
08:43:05:WU02:FS00:0xa7:Unit: 0x0000000e80fccb095e6e556528ff8640
08:43:05:WU02:FS00:0xa7:Reading tar file core.xml
08:43:05:WU02:FS00:0xa7:Reading tar file frame10.tpr
08:43:05:WU02:FS00:0xa7:Digital signatures verified
08:43:05:WU02:FS00:0xa7:Calling: mdrun -s frame10.tpr -o frame10.trr -x frame10.xtc -cpt 15 -nt 15
08:43:05:WU02:FS00:0xa7:Steps: first=2500000 total=250000
08:43:05:WU02:FS00:0xa7:ERROR:
08:43:05:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
08:43:05:WU02:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:43:05:WU02:FS00:0xa7:ERROR:Source code file: C:\build\fah\core-a7-avx-release\windows-10-64bit-core-a7-avx-release\gromacs-core\build\gromacs\src\gromacs\mdlib\domdec.c, line: 6902
08:43:05:WU02:FS00:0xa7:ERROR:
08:43:05:WU02:FS00:0xa7:ERROR:Fatal error:
08:43:05:WU02:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
08:43:05:WU02:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
08:43:05:WU02:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
08:43:05:WU02:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:43:05:WU02:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:43:05:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
08:43:10:WU02:FS00:0xa7:WARNING:Unexpected exit() call
08:43:10:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
08:43:10:WU02:FS00:0xa7:Saving result file ..\logfile_01.txt
08:43:10:WU02:FS00:0xa7:Saving result file md.log
08:43:10:WU02:FS00:0xa7:Saving result file science.log
08:43:10:WU02:FS00:0xa7:WARNING:While cleaning up: boost::filesystem::remove: The process cannot access the file because it is being used by another process: "01/md.log"
08:43:10:WU02:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
08:43:10:WARNING:WU02:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
08:43:10:WU02:FS00:Sending unit results: id:02 state:SEND error:FAULTY project:13833 run:0 clone:4937 gen:10 core:0xa7 unit:0x0000000e80fccb095e6e556528ff8640
Considering I'm running other WUs completely fine, is this possibly a freak crash and/or a bad generation?

Re: [Bad WU, Possible freak crash] PRCG 13833 (0,4937,10)

Posted: Mon Mar 30, 2020 9:01 am
by Neil-B
I think this has been spotted ... I believe this project may no longer be issued for 15 cores ... someone will confirm, but there is a recent post on this.

Edit ... actually, the post I was thinking of might have been about a different project, so I checked ... it was a different one, but possibly the same type of issue relating to the number of cores you are folding with. A search for "large primes" may throw light on it. Most projects can cope with core counts that are a multiple of 5, but some have been sensitive to this.
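To make the "large primes" point concrete, here is a minimal sketch that factors a thread count and flags counts containing a prime factor of 5 or more. The helper names and the cutoff of 5 are assumptions for illustration, taken only from the remark above. This is not how GROMACS chooses its domain decomposition grid, and a flagged count does not mean a given project will fail.

```python
def prime_factors(n: int) -> list[int]:
    """Return the prime factors of n in ascending order (with repeats)."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors


def may_be_sensitive(threads: int, cutoff: int = 5) -> bool:
    """Heuristic only: True if any prime factor is >= cutoff (assumed value)."""
    return any(p >= cutoff for p in prime_factors(threads))


if __name__ == "__main__":
    for threads in (8, 12, 15, 16, 20, 21):
        print(threads, prime_factors(threads), may_be_sensitive(threads))
```

In the log above, mdrun was started with `-nt 15`, and 15 factors as 3 x 5, so this heuristic would flag it. Counts such as 8, 12 or 16 contain only the factors 2 and 3.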

Re: [Bad WU, Possible freak crash] PRCG 13833 (0,4937,10)

Posted: Mon Mar 30, 2020 4:15 pm
by Joe_H
Another problem I spot is here:

Code:

08:43:10:WU02:FS00:0xa7:WARNING:While cleaning up: boost::filesystem::remove: The process cannot access the file because it is being used by another process: "01/md.log"
Either part of the process had not exited properly and the file was still open when it shouldn't have been, or there is some filesystem problem. I would go with the first explanation, as long as you don't see this kind of error repeating.
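One way to check whether the cleanup warning repeats is to count it per folding slot in the client log. The sketch below is a rough illustration under stated assumptions: the regular expression is inferred only from the log lines quoted in this thread, and the function name and sample text are hypothetical. You would need to pass in the text of your own log file.

```python
import re
from collections import Counter

# Pattern inferred from the log lines in this thread (an assumption, not an
# official log format): time, work unit, slot, then the cleanup warning.
CLEANUP_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:(WU\d+):(FS\d+):.*WARNING:While cleaning up: (.*)$"
)


def count_cleanup_warnings(log_text: str) -> Counter:
    """Count 'While cleaning up' warnings per folding slot (e.g. FS00)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CLEANUP_RE.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Two lines copied from the log posted above, used as sample input.
sample = (
    "08:43:10:WU02:FS00:0xa7:Saving result file science.log\n"
    "08:43:10:WU02:FS00:0xa7:WARNING:While cleaning up: "
    "boost::filesystem::remove: The process cannot access the file because "
    'it is being used by another process: "01/md.log"\n'
)

if __name__ == "__main__":
    print(dict(count_cleanup_warnings(sample)))
```

A single hit, as in the sample, fits the "process had not exited properly" explanation. A count that keeps growing across work units would point more towards a filesystem problem.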