Project: 7703 (Run 2, Clone 5, Gen 7) fails

Posted: Mon Dec 19, 2011 11:19 pm
by Jan van de Velde
For the second time, Project: 7703 (Run 2, Clone 5, Gen 7) failed somewhere past the halfway point on my machine.

This time, without any apparent reason, it started all over again when I restarted my machine:

[13:21:16] Called DecompressByteArray: compressed_data_size=1004645 data_size=2267148, decompressed_data_size=2267148 diff=0
[13:21:16] - Digital signature verified
[13:21:16] 
[13:21:16] Project: 7703 (Run 2, Clone 5, Gen 7)
[13:21:16] 
[13:21:19] Assembly optimizations on if available.
[13:21:19] Entering M.D.
[13:21:25] Using Gromacs checkpoints
[13:21:26] Mapping NT from 1 to 1 
[13:21:33] Resuming from checkpoint
[13:21:34] Verified work/wudata_01.log
[13:21:34] Verified work/wudata_01.trr
[13:21:34] Verified work/wudata_01.xtc
[13:21:34] Verified work/wudata_01.edr
[13:21:36] Completed 562800 out of 1000000 steps  (56%)
[13:58:57] Completed 570000 out of 1000000 steps  (57%)
[14:51:15] Completed 580000 out of 1000000 steps  (58%)

Folding@Home Client Shutdown.


--- Opening Log file [December 19 15:03:36 UTC] 


# Windows CPU Systray Edition #################################################
###############################################################################

                       Folding@Home Client Version 6.23

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Program FilesFolding@home\CORE1\Folding@home\Folding@home-x86


[15:03:36] - Ask before connecting: No
[15:03:36] - User name: Jan_van_de_Velde (Team 48658)
[15:03:36] - User ID: 63E313039CFE675
[15:03:36] - Machine ID: 2
[15:03:36] 
[15:03:37] Loaded queue successfully.
[15:03:37] Initialization complete
[15:03:37] 
[15:03:37] + Processing work unit
[15:03:37] Core required: FahCore_a4.exe
[15:03:37] Core found.
[15:03:38] Working on queue slot 01 [December 19 15:03:38 UTC]
[15:03:38] + Working ...
[15:03:39] 
[15:03:39] *------------------------------*
[15:03:39] Folding@Home Gromacs GB Core
[15:03:39] Version 2.27 (Dec. 15, 2010)
[15:03:39] 
[15:03:39] Preparing to commence simulation
[15:03:39] - Looking at optimizations...
[15:03:39] - Files status OK
[15:03:40] - Expanded 1004645 -> 2267148 (decompressed 225.6 percent)
[15:03:40] Called DecompressByteArray: compressed_data_size=1004645 data_size=2267148, decompressed_data_size=2267148 diff=0
[15:03:40] - Digital signature verified
[15:03:41] 
[15:03:41] Project: 7703 (Run 2, Clone 5, Gen 7)
[15:03:41] 
[15:03:41] Assembly optimizations on if available.
[15:03:41] Entering M.D.
[15:03:47] Mapping NT from 1 to 1 
[15:03:54] Completed 0 out of 1000000 steps  (0%)
[16:00:41] Completed 10000 out of 1000000 steps  (1%)
[16:57:25] Completed 20000 out of 1000000 steps  (2%)
[17:54:19] Completed 30000 out of 1000000 steps  (3%)
[18:57:10] Completed 40000 out of 1000000 steps  (4%)
As this was the second time on this same WU (the first time it ran into problems was about a week ago), I have now deleted the entire work folder, and the machine is now working on another project.

Mod Edit: Changed Quote Tags To Code Tags - PantherX

Re: Project: 7703 (Run 2, Clone 5, Gen 7) fails

Posted: Tue Dec 20, 2011 4:44 am
by PantherX
The WU isn't a bad one as it was completed by another donor:
Your WU (P7703 R2 C5 G7) was added to the stats database on 2011-12-06 02:08:20 for 825 points of credit.

Re: Project: 7703 (Run 2, Clone 5, Gen 7) fails

Posted: Sat Dec 24, 2011 12:18 pm
by Jan van de Velde
Can anyone find a reason in those logs why that WU started all over again?

Re: Project: 7703 (Run 2, Clone 5, Gen 7) fails

Posted: Sat Dec 24, 2011 5:10 pm
by bruce
Jan van de Velde wrote: Can anyone find a reason in those logs why that WU started all over again?
When it restarts correctly, you see these messages:
[13:21:33] Resuming from checkpoint
[13:21:34] Verified work/wudata_01.log
[13:21:34] Verified work/wudata_01.trr
[13:21:34] Verified work/wudata_01.xtc
[13:21:34] Verified work/wudata_01.edr

Whenever you restart, the checkpoint information is verified, and if it is found to be corrupt, the WU is restarted from the beginning.
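
To make that comparison less tedious on a long log, here is a minimal Python sketch that reports, for each core start, whether "Resuming from checkpoint" appeared before the first progress line. The function name `classify_starts` is made up for illustration; the only things taken from the logs above are the message texts and the `[HH:MM:SS]` prefix. It assumes every core start logs "Entering M.D.", as both starts in this thread do.

```python
import re

# Patterns copied from the log lines quoted in this thread.
LINE_RE = re.compile(r"\[(\d\d:\d\d:\d\d)\] (.*)")
STEP_RE = re.compile(r"Completed (\d+) out of (\d+) steps")


def classify_starts(log_text):
    """Return one dict per core start found in a client log.

    Each dict holds the start time, whether a checkpoint resume was
    logged, and the step count of the first progress line.
    (Hypothetical helper, not part of the FAH client.)
    """
    starts = []
    current = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # banners, blank lines, shutdown notices
        stamp, message = match.groups()
        if message.startswith("Entering M.D."):
            current = {"time": stamp, "resumed": False, "first_step": None}
            starts.append(current)
        elif current is not None and current["first_step"] is None:
            if message.startswith("Resuming from checkpoint"):
                current["resumed"] = True
            else:
                step = STEP_RE.search(message)
                if step:
                    current["first_step"] = int(step.group(1))
    return starts
```

Feeding it the log from the first post should give two entries: one that resumed and picked up at step 562800, and one with no resume message that began at step 0. It only tells you that a restart from scratch happened, not why.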

Data can be corrupted by some other program modifying something (sometimes an antivirus program or similar), but it can also be corrupted if the OS was shut down in a way that prevented it from writing all FAH data from cache to the hard disk (such as a power failure or BSOD).

Re: Project: 7703 (Run 2, Clone 5, Gen 7) fails

Posted: Mon Jan 02, 2012 1:20 pm
by Jan van de Velde
In other words, there is no clear reason to be found here? In all the years I have been folding I have had the occasional restart, usually traceable to an improper shutdown (e.g. a power failure, indicated in the logs in such a manner that even I could understand that something had gone horribly wrong). Two restarts from scratch on the same WU, without any clear indication of the reason, were then a matter of chance, and the chance of it happening a third time would have been extremely slight.

Well, since this mishap I have turned in a few other WUs without a hitch. So let's keep folding.

Re: Project: 7703 (Run 2, Clone 5, Gen 7) fails

Posted: Mon Jan 02, 2012 5:55 pm
by Joe_H
It does look like no clear reason can be determined from the log file. I can add one other possibility: the checkpoint write could have coincided with the shutdown, which can corrupt it. I have traced a few restarts from the beginning on my folding machines to that. When I can, I check the modified times on the checkpoint files before shutting down and make sure to give the system a minute or two to flush the data all the way to disk.
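
The modified-time check described above could be scripted roughly as follows. This is only a sketch under assumptions: the function names are invented, the `wudata_*` pattern is taken from the file names in the log earlier in the thread (the actual checkpoint file may be named differently), and the 120-second quiet period is just a stand-in for "a minute or two".

```python
import time
from pathlib import Path


def newest_write_age(work_dir, pattern="wudata_*", now=None):
    """Seconds since the most recently modified matching file was written.

    Returns None if no file in work_dir matches the pattern.
    The default pattern is an assumption based on the logged file names.
    """
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime
              for p in Path(work_dir).glob(pattern)
              if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def safe_to_shut_down(work_dir, min_quiet_seconds=120, now=None):
    """True only if matching files exist and none was written recently.

    min_quiet_seconds is an arbitrary illustrative threshold.
    """
    age = newest_write_age(work_dir, now=now)
    return age is not None and age >= min_quiet_seconds
```

You would call `safe_to_shut_down` with the path to the client's work folder just before shutting down. A quiet period does not prove the data has reached the disk; it only makes it less likely that a shutdown lands in the middle of a checkpoint write.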