1. This doesn't seem to be due to bad WUs. If they reach 100% and you see this block, they have completed successfully:
17:59:35:WU02:FS02:0x22:Completed 1000000 out of 1000000 steps (100%)
17:59:35:WU02:FS02:0x22:Average performance: 251.895 ns/day
17:59:35:WU02:FS02:0x22:Saving result file ..\logfile_01.txt
17:59:35:WU02:FS02:0x22:Saving result file checkpointState.xml
17:59:35:WU02:FS02:0x22:Saving result file globals.csv
17:59:35:WU02:FS02:0x22:Saving result file positions.xtc
17:59:35:WU02:FS02:0x22:Saving result file science.log
17:59:35:WU02:FS02:0x22:Folding@home Core Shutdown: FINISHED_UNIT
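If you'd rather check this automatically than eyeball the log, here is a rough sketch that scans log lines for the two markers in the block above: a "Completed N out of N steps" line and the FINISHED_UNIT shutdown. The function name `wu_finished_ok` and the regular expression are my own illustration, not part of the client or core, and the sketch assumes the log lines look like the ones pasted here.

```python
import re

# Matches the progress line, e.g. "Completed 1000000 out of 1000000 steps (100%)".
# The pattern is an assumption based on the log excerpt in this post.
COMPLETED = re.compile(r"Completed (\d+) out of (\d+) steps \((\d+)%\)")


def wu_finished_ok(log_lines):
    """Return True if the lines show all steps done and a FINISHED_UNIT shutdown."""
    reached_full = False
    finished = False
    for line in log_lines:
        m = COMPLETED.search(line)
        # Steps done equals total steps: the WU ran to 100%.
        if m and m.group(1) == m.group(2):
            reached_full = True
        if "Core Shutdown: FINISHED_UNIT" in line:
            finished = True
    return reached_full and finished


# Lines copied from the log excerpt above.
sample = [
    "17:59:35:WU02:FS02:0x22:Completed 1000000 out of 1000000 steps (100%)",
    "17:59:35:WU02:FS02:0x22:Saving result file science.log",
    "17:59:35:WU02:FS02:0x22:Folding@home Core Shutdown: FINISHED_UNIT",
]
print(wu_finished_ok(sample))
```

This only tells you the core reported a clean finish. It says nothing about what happens to the result afterwards.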
3. This would appear to be some kind of bug in the core code or libraries, in the core build process, or in how the client handles the data packet.
Did these issues only start appearing with the core22 0.0.10 build? We recently upgraded our internal libcbang/libfah for core builds from 1.2.0 to 1.5.0, and I worry that one of these libraries may have introduced some instability.
My suspicion is that you're seeing this with 13415 in particular because the WUs are short, so the unstable library call gets made more often, which triggers failures more frequently.
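To put rough numbers on the short-WU reasoning, here is a back-of-the-envelope sketch. It assumes the unstable call happens a fixed number of times per WU, so each WU has the same chance of failing regardless of its length. The WU durations and the 1% failure probability are made-up values for illustration, not measurements from any project.

```python
def expected_failures_per_day(wu_hours, p_fail_per_wu):
    """Expected failures per day for one slot running WUs back to back.

    Assumes (hypothetically) a fixed failure probability per WU,
    independent of how long the WU runs.
    """
    wus_per_day = 24.0 / wu_hours
    return wus_per_day * p_fail_per_wu


# Hypothetical numbers: a 30-minute WU versus a 6-hour WU, 1% failure chance each.
short_wu = expected_failures_per_day(0.5, 0.01)
long_wu = expected_failures_per_day(6.0, 0.01)
print(short_wu, long_wu)
```

Under these assumptions the slot running short WUs sees twelve times as many failures per day, simply because it finishes twelve times as many WUs.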
Thanks so much for bearing with us, and for helping us get to the bottom of this! We're still getting a ton of useful data helping us with the COVID Moonshot work, but we're committed to improving the stability to make things better for everyone.
~ John Chodera // MSKCC