Project: 5915 (Run 8, Clone 659, Gen 0)

Post by **thecomputerdude** » Thu Nov 26, 2009 8:17 pm

Machine isn't unstable, or even overclocked, but I keep getting this error for this WU (Run 11 doesn't have this issue). Why is it resuming from a checkpoint anyway?

Code:

[17:41:06] + Attempting to send results [November 26 17:41:06 UTC]
[17:41:07] + Results successfully sent
[17:41:07] Thank you for your contribution to Folding@Home.
[17:41:07] + Number of Units Completed: 76

[17:41:11] - Preparing to get new work unit...
[17:41:11] + Attempting to get work packet
[17:41:11] - Connecting to assignment server
[17:41:12] - Successful: assigned to (171.64.65.20).
[17:41:12] + News From Folding@Home: Welcome to Folding@Home
[17:41:12] Loaded queue successfully.
[17:41:13] + Closed connections
[17:41:13] 
[17:41:13] + Processing work unit
[17:41:13] Core required: FahCore_14.exe
[17:41:13] Core found.
[17:41:13] Working on queue slot 05 [November 26 17:41:13 UTC]
[17:41:13] + Working ...
[17:41:13] 
[17:41:13] *------------------------------*
[17:41:13] Folding@Home GPU Core - Beta
[17:41:13] Version 1.26 (Wed Oct 14 13:09:26 PDT 2009)
[17:41:13] 
[17:41:13] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[17:41:13] Build host: vspm46
[17:41:13] Board Type: Nvidia
[17:41:13] Core      : 
[17:41:13] Preparing to commence simulation
[17:41:13] - Assembly optimizations manually forced on.
[17:41:13] - Not checking prior termination.
[17:41:13] - Expanded 66014 -> 357580 (decompressed 541.6 percent)
[17:41:13] Called DecompressByteArray: compressed_data_size=66014 data_size=357580, decompressed_data_size=357580 diff=0
[17:41:13] - Digital signature verified
[17:41:13] 
[17:41:13] Project: 5915 (Run 8, Clone 659, Gen 0)
[17:41:13] 
[17:41:13] Assembly optimizations on if available.
[17:41:13] Entering M.D.
[17:41:19] Will resume from checkpoint file
[17:41:19] Tpr hash work/wudata_05.tpr:  1401122206 2687625502 3103304894 2164034549 1862053871
[17:41:19] Working on Protein
[17:41:20] Client config found, loading data.
[17:41:20] Resuming from checkpoint
[17:41:20] fcCheckPointResume: retrieved and current tpr file hash:
[17:41:20]    0      3800001   1401122206
[17:41:20]    1   3217540665   2687625502
[17:41:20]    2   3220006114   3103304894
[17:41:20]    3   3224298259   2164034549
[17:41:20]    4   3217217635   1862053871
[17:41:20] fcCheckPointResume: file hashes different -- aborting.
[17:41:20] mdrun_gpu returned 
[17:41:20] Checkpoint failure
[17:41:20] 
[17:41:20] Folding@home Core Shutdown: UNSTABLE_MACHINE
[17:41:23] CoreStatus = 7A (122)
[17:41:23] Sending work to server
[17:41:23] Project: 5915 (Run 8, Clone 659, Gen 0)
[17:41:23] - Read packet limit of 540015616... Set to 524286976.
[17:41:23] - Error: Could not get length of results file work/wuresults_05.dat
[17:41:23] - Error: Could not read unit 05 file. Removing from queue.
[17:41:23] - Preparing to get new work unit...
[17:41:23] + Attempting to get work packet
[17:41:23] - Connecting to assignment server
[17:41:23] - Successful: assigned to (171.64.65.20).
[17:41:23] + News From Folding@Home: Welcome to Folding@Home
[17:41:24] Loaded queue successfully.
[17:41:24] + Closed connections

Post by **toTOW** » Thu Nov 26, 2009 10:19 pm

There are probably some checkpoint files left over from earlier failures that weren't cleaned up properly. You should clean your /work folder: check the log to see which queue slot you're working on, then delete every file in /work that doesn't carry that slot's number.
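For anyone who would rather look before deleting, here is a minimal sketch that only lists candidates. It assumes work files carry the slot as a two-digit suffix before the extension, as `wudata_05.tpr` and `wuresults_05.dat` do in the log above. The function name `stale_work_files` is made up for illustration and is not part of the client.

```python
import re
from pathlib import Path

# Assumption: the queue slot appears as "_NN" right before the extension,
# e.g. wudata_05.tpr or wuresults_05.dat (names taken from the log above).
SLOT_SUFFIX = re.compile(r"_(\d{2})\.[^.]+$")


def stale_work_files(work_dir, current_slot):
    """List files in work_dir whose slot suffix differs from current_slot.

    Hypothetical helper: it deletes nothing, it only reports candidates.
    """
    stale = []
    for path in sorted(Path(work_dir).iterdir()):
        if not path.is_file():
            continue
        match = SLOT_SUFFIX.search(path.name)
        # Files without a recognizable slot suffix are left out on purpose.
        if match and int(match.group(1)) != current_slot:
            stale.append(path)
    return stale
```

Calling `stale_work_files("work", 5)` while the client is on queue slot 05 would return the files that belong to other slots. Stop the client and review the list by hand before removing anything. Files that do not match the naming pattern are skipped rather than flagged, which is more conservative than deleting everything without the current slot number.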

Post by **bruce** » Fri Nov 27, 2009 3:14 am

If you post the part of the log that follows the segment you did post, there's a pretty good chance it shows that the same WU was then processed correctly. The message "Checkpoint failure" is a much more accurate description of the problem than the message "UNSTABLE_MACHINE".

As toTOW has suggested, there are circumstances where a WU that was just downloaded gets matched up with a checkpoint left over from some previous error. Under those circumstances, the FahCore validates whether the checkpoint is correct or not, finds that it doesn't match the current project, and issues an error message while discarding the old checkpoint. It just happens that there are a number of other errors which use the same error exit so the message isn't specific.
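To make the check concrete, here is a rough sketch of the comparison the log reports under `fcCheckPointResume`, using the two hash columns from the first post. This is not the FahCore's actual code. The function name `checkpoint_matches` is hypothetical, and the sketch assumes a checkpoint only counts as valid when every hash word is identical.

```python
def checkpoint_matches(retrieved, current):
    """Return True only if every stored hash word equals the current one.

    Illustrative only: the real FahCore logic is not shown in this thread.
    """
    return list(retrieved) == list(current)


# Hash words copied from the log: the left column was read from the old
# checkpoint, the right column belongs to the freshly downloaded tpr file.
retrieved = [3800001, 3217540665, 3220006114, 3224298259, 3217217635]
current = [1401122206, 2687625502, 3103304894, 2164034549, 1862053871]

if checkpoint_matches(retrieved, current):
    print("hashes match: resume from checkpoint")
else:
    print("file hashes different: discard the old checkpoint")
```

With the values from this log none of the five words agree, so the sketch takes the discard branch, which lines up with the "file hashes different -- aborting" message.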

Cleaning up the left-over checkpoint cost you a total of 13 seconds of processing time . . . not worth worrying about.
