Project: 3062 (Run 5, Clone 36, Gen 33) = EUE @ 87%

Posted: Mon Apr 21, 2008 4:02 am
by jevans64
I am currently working on this WU and noticed that FahMon reports I started it TWO days ago and that it isn't going to make the FINAL deadline. I searched through my log and found that my PC had worked on this unit before and that it EUEed at 87% BOTH times. I am running a single SMP instance on a Q6600 OCed to 3.0 GHz, but this is the first EUE I've seen in quite some time.

Log...

[14:42:02] *------------------------------*
[14:42:02] Folding@Home Gromacs SMP Core
[14:42:02] Version 1.74 (March 10, 2007)
[14:42:02] 
[14:42:02] Preparing to commence simulation
[14:42:02] - Ensuring status. Please wait.
[14:42:19] - Assembly optimizations manually forced on.
[14:42:19] - Not checking prior termination.
[14:42:20] - Expanded 607175 -> 3260637 (decompressed 537.0 percent)
[14:42:20] - Starting from initial work packet
[14:42:20] 
[14:42:20] Project: 3062 (Run 5, Clone 36, Gen 33)
[14:42:20] 
[14:42:20] Assembly optimizations on if available.
[14:42:20] Entering M.D.
[14:42:26] Rejecting checkpoint
[14:42:26] ProteinWriting local files
[14:42:26] Extra SSE boost OK.
[14:42:26] 
[14:42:26] Extra SSE boost OK.
[14:42:26] Writing local files
[14:42:26] Completed 0 out of 5000000 steps  (0 percent)
[14:53:11] Writing local files
[14:53:11] Completed 50000 out of 5000000 steps  (1 percent)
[15:03:57] Writing local files
[15:03:57] Completed 100000 out of 5000000 steps  (2 percent)
* * * * *
[06:23:05] Completed 4300000 out of 5000000 steps  (86 percent)
[06:33:48] Writing local files
[06:33:48] Completed 4350000 out of 5000000 steps  (87 percent)
[06:38:38] Warning:  long 1-4 interactions
[06:38:38] Quit 101 - NaN detected: (ener[0])
[06:38:38] 
[06:38:38] Simulation instability has been encountered. The run has entered a
[06:38:38]   state from which no further progress can be made.
[06:38:38] This may be the correct result of the simulation, however if you
[06:38:38]   often see other project units terminating early like this
[06:38:38]   too, you may wish to check the stability of your computer (issues
[06:38:38]   such as high temperature, overclocking, etc.).
[06:38:38] Going to send back what have done.
[06:38:38] logfile size: 116970
[06:38:38] - Writing 117519 bytes of core data to disk...
[06:38:38]   ... Done.
[06:40:38] 
[06:40:38] Folding@home Core Shutdown: EARLY_UNIT_END
[06:40:38] 
[06:40:38] Folding@home Core Shutdown: EARLY_UNIT_END
[06:40:41] CoreStatus = 7B (123)
[06:40:41] Client-core communications error: ERROR 0x7b
[06:40:41] Deleting current work unit & continuing...
[06:43:03] - Warning: Could not delete all work unit files (8): Core returned invalid code
[06:43:03] Trying to send all finished work units
[06:43:03] + No unsent completed units remaining.
[06:43:03] - Preparing to get new work unit...
[06:43:03] + Attempting to get work packet
[06:43:03] - Will indicate memory of 2047 MB
[06:43:03] - Connecting to assignment server
[06:43:03] Connecting to http://assign.stanford.edu:8080/
[06:43:05] Posted data.
[06:43:05] Initial: 40AB; - Successful: assigned to (171.64.65.63).
[06:43:05] + News From Folding@Home: Welcome to Folding@Home
[06:43:05] Loaded queue successfully.
[06:43:05] Connecting to http://171.64.65.63:8080/
[06:43:06] Posted data.
[06:43:06] Initial: 0000; - Receiving payload (expected size: 607687)
[06:43:08] - Downloaded at ~296 kB/s
[06:43:08] - Averaged speed for that direction ~282 kB/s
[06:43:08] + Received work.
[06:43:08] + Closed connections
[06:43:13] 
[06:43:13] + Processing work unit
[06:43:13] Core required: FahCore_a1.exe
[06:43:13] Core found.
[06:43:13] Working on Unit 09 [April 19 06:43:13]
[06:43:13] + Working ...
[06:43:13] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 09 -checkpoint 15 -forceasm -verbose -lifeline 3236 -version 591'

[06:43:14] 
[06:43:14] *------------------------------*
[06:43:14] Folding@Home Gromacs SMP Core
[06:43:14] Version 1.74 (March 10, 2007)
[06:43:14] 
[06:43:14] Preparing to commence simulation
[06:43:14] - Ensuring status. Please wait.
[06:43:31] - Assembly optimizations manually forced on.
[06:43:31] - Not checking prior termination.
[06:43:31] - Expanded 607175 -> 3260637 (decompressed 537.0 percent)
[06:43:31] - Starting from initial work packet
[06:43:31] 
[06:43:31] Project: 3062 (Run 5, Clone 36, Gen 33)
[06:43:31] 
[06:43:31] Assembly optimizations on if available.
[06:43:31] Entering M.D.
[06:43:37] Rejecting checkpoint
[06:43:38] ProteinWriting local files
[06:43:38] Extra SSE boost OK.
[06:43:38] 
[06:43:38] Extra SSE boost OK.
[06:43:38] Writing local files
[06:43:38] Completed 0 out of 5000000 steps  (0 percent)
[06:54:22] Writing local files
[06:54:22] Completed 50000 out of 5000000 steps  (1 percent)
[07:05:10] Writing local files
[07:05:10] Completed 100000 out of 5000000 steps  (2 percent)
[07:15:54] Writing local files
* * * * *
[22:11:01] Completed 4300000 out of 5000000 steps  (86 percent)
[22:21:43] Writing local files
[22:21:43] Completed 4350000 out of 5000000 steps  (87 percent)
[22:26:32] Warning:  long 1-4 interactions
[22:26:33] Quit 101 - NaN detected: (ener[0])
[22:26:33] 
[22:26:33] Simulation instability has been encountered. The run has entered a
[22:26:33]   state from which no further progress can be made.
[22:26:33] This may be the correct result of the simulation, however if you
[22:26:33]   often see other project units terminating early like this
[22:26:33]   too, you may wish to check the stability of your computer (issues
[22:26:33]   such as high temperature, overclocking, etc.).
[22:26:33] Going to send back what have done.
[22:26:33] logfile size: 116970
[22:26:33] - Writing 117519 bytes of core data to disk...
[22:26:33]   ... Done.
[22:26:33] 
[22:26:33] Folding@home Core Shutdown: EARLY_UNIT_END
[22:26:33] 
[22:26:33] Folding@home Core Shutdown: EARLY_UNIT_END
[22:26:36] CoreStatus = 7B (123)
[22:26:36] Client-core communications error: ERROR 0x7b
[22:26:36] Deleting current work unit & continuing...
[22:28:56] - Warning: Could not delete all work unit files (9): Core returned invalid code
[22:28:56] Trying to send all finished work units
[22:28:56] + No unsent completed units remaining.
[22:28:56] - Preparing to get new work unit...
[22:28:56] + Attempting to get work packet
[22:28:56] - Will indicate memory of 2047 MB
[22:28:56] - Connecting to assignment server
[22:28:56] Connecting to http://assign.stanford.edu:8080/
[22:28:59] Posted data.
[22:28:59] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[22:28:59] + News From Folding@Home: Welcome to Folding@Home
[22:28:59] Loaded queue successfully.
[22:28:59] Connecting to http://171.64.65.64:8080/
[22:29:02] Posted data.
[22:29:02] Initial: 0000; - Receiving payload (expected size: 2447185)
[22:29:11] - Downloaded at ~265 kB/s
[22:29:11] - Averaged speed for that direction ~278 kB/s
[22:29:11] + Received work.
[22:29:11] + Closed connections
[22:29:16] 
[22:29:16] + Processing work unit
[22:29:16] Core required: FahCore_a1.exe
[22:29:16] Core found.
[22:29:16] Working on Unit 00 [April 19 22:29:16]
[22:29:16] + Working ...
[22:29:16] - Calling 'mpiexec -channel auto -np 4 FahCore_a1.exe -dir work/ -suffix 00 -checkpoint 15 -forceasm -verbose -lifeline 3236 -version 591'
After that, it completed a Project: 2653 (Run 35, Clone 63, Gen 99) and started on the p3062 again.

Re: Project: 3062 ( R5, C36, G33 ) = EUE @ 87%

Posted: Mon Apr 21, 2008 5:19 am
by bruce
jevans64 wrote:[14:42:20] Project: 3062 (Run 5, Clone 36, Gen 33)
. . .
[06:33:48] Completed 4350000 out of 5000000 steps (87 percent)
[06:38:38] Warning: long 1-4 interactions
[06:38:38] Quit 101 - NaN detected: (ener[0])
[06:40:38] Folding@home Core Shutdown: EARLY_UNIT_END
[06:40:38]
[06:40:38] Folding@home Core Shutdown: EARLY_UNIT_END
[06:40:41] CoreStatus = 7B (123)
[06:40:41] Client-core communications error: ERROR 0x7b
[06:40:41] Deleting current work unit & continuing...

[06:43:13] Working on Unit 09 [April 19 06:43:13]
[06:43:05] Initial: 40AB; - Successful: assigned to (171.64.65.63).
[06:43:31] Project: 3062 (Run 5, Clone 36, Gen 33)
. . .
[22:21:43] Completed 4350000 out of 5000000 steps (87 percent)
[22:26:32] Warning: long 1-4 interactions
[22:26:33] Quit 101 - NaN detected: (ener[0])
[22:26:33] Folding@home Core Shutdown: EARLY_UNIT_END
[22:26:33] Folding@home Core Shutdown: EARLY_UNIT_END
[22:26:36] CoreStatus = 7B (123)
[22:26:36] Client-core communications error: ERROR 0x7b
[22:26:36] Deleting current work unit & continuing...

[22:28:59] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[22:29:16] Working on Unit 00 [April 19 22:29:16]

After that, it completed a Project: 2653 (Run 35, Clone 63, Gen 99) and started on the p3062 again.
Some WUs are predestined to die. When you get the same instability at 87% on the same WU, it's almost certainly a fault of the WU, not your hardware. One of the limitations of the beta software is that it does not capture the necessary data to give partial credit, so something like that happens a lot more frequently than it should.
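
For anyone who wants to check their own FAHlog for this pattern, here is a rough Python sketch. It is not part of the client or FahMon; the function name `find_repeat_eues` is made up for illustration, and it assumes the log lines look like the ones posted above. It records the last reported percentage each time a given Project/Run/Clone/Gen ends with EARLY_UNIT_END, so a WU that dies at the same point more than once stands out.

```python
import re
from collections import defaultdict

# Patterns based on the log lines posted in this thread (an assumption about format).
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
PERCENT_RE = re.compile(r"Completed \d+ out of \d+ steps\s+\((\d+) percent\)")


def find_repeat_eues(lines):
    """Map each (project, run, clone, gen) to the percentages at which it ended early."""
    failures = defaultdict(list)
    current_wu = None
    last_percent = None
    reported = False  # the core prints EARLY_UNIT_END twice per failure
    for line in lines:
        match = PROJECT_RE.search(line)
        if match:
            current_wu = tuple(int(g) for g in match.groups())
            last_percent = None
            reported = False
            continue
        match = PERCENT_RE.search(line)
        if match:
            last_percent = int(match.group(1))
            continue
        if "EARLY_UNIT_END" in line and current_wu is not None and not reported:
            failures[current_wu].append(last_percent)
            reported = True
    return dict(failures)


# A few lines copied from the log above.
sample = [
    "[14:42:20] Project: 3062 (Run 5, Clone 36, Gen 33)",
    "[06:33:48] Completed 4350000 out of 5000000 steps  (87 percent)",
    "[06:38:38] Quit 101 - NaN detected: (ener[0])",
    "[06:40:38] Folding@home Core Shutdown: EARLY_UNIT_END",
    "[06:40:38] Folding@home Core Shutdown: EARLY_UNIT_END",
    "[06:43:31] Project: 3062 (Run 5, Clone 36, Gen 33)",
    "[22:21:43] Completed 4350000 out of 5000000 steps  (87 percent)",
    "[22:26:33] Folding@home Core Shutdown: EARLY_UNIT_END",
    "[22:26:33] Folding@home Core Shutdown: EARLY_UNIT_END",
]
result = find_repeat_eues(sample)
print(result)
```

Two failures at the same percentage on the same WU point at the WU; failures scattered across different WUs and percentages would point more toward the hardware.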

Before the top line of the portion of FahLog that you posted, did it say something like:
[hh:mm:ss] Working on Unit 0x [Month dd hh:mm:ss]?
Was x=8 or x=9? That makes a big difference in what the deadline should be. The data should also be in unitinfo.txt while the WU is running. From the times shown, it looks like the project 3062 WU took about 16 hours to get to 87%, so 100% would have been well within the deadline.
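
The 16-hour figure can be sanity-checked from the log timestamps. The Python sketch below is only an illustration: the helper names are made up, it assumes the log crosses midnight at most once between the two stamps, and it assumes progress is roughly linear. It does not know the project's actual deadline.

```python
from datetime import datetime, timedelta


def elapsed(start_hms, end_hms):
    """Time between two HH:MM:SS log stamps, assuming at most one midnight rollover."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(start_hms, fmt)
    end = datetime.strptime(end_hms, fmt)
    if end < start:
        # The end stamp is on the following day.
        end += timedelta(days=1)
    return end - start


def projected_total(start_hms, end_hms, percent_done):
    """Linear extrapolation of the time needed to reach 100 percent."""
    return elapsed(start_hms, end_hms) * 100 / percent_done


# First attempt in the posted log: 0% at 14:42:26, 87% at 06:33:48 the next day.
to_87 = elapsed("14:42:26", "06:33:48")
total = projected_total("14:42:26", "06:33:48", 87)
print(to_87)                                    # 15:51:22
print(round(total.total_seconds() / 3600, 1))   # 18.2
```

So the first attempt took just under 16 hours to reach 87%, and a full run at that pace would be a little over 18 hours.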

Re: Project: 3062 ( R5, C36, G33 ) = EUE @ 87%

Posted: Mon Apr 21, 2008 11:31 am
by jevans64
bruce wrote:
Some WUs are predestined to die. When you get the same instability at 87% on the same WU, it's almost certainly a fault of the WU, not your hardware. One of the limitations of the beta software is that it does not capture the necessary data to give partial credit, so something like that happens a lot more frequently than it should.

Before the top line of the portion of FahLog that you posted, did it say something like:
[hh:mm:ss] Working on Unit 0x [Month dd hh:mm:ss]?
Was x=8 or x=9? That makes a big difference in what the deadline should be. The data should also be in unitinfo.txt while the WU is running. From the times shown, it looks like the project 3062 WU took about 16 hours to get to 87%, so 100% would have been well within the deadline.
Sorry. I didn't C&P that line.

Here it is... [14:42:02] Working on Unit 08 [April 18 14:42:02]

It ran twice in a row (Unit 08 & 09) and EUEed at 87% both times, then ran another WU (Unit 00), then started on it a third time (Unit 01).

I knew it was a problem with the WU, but since partial results don't get sent back, reporting it here is the only quick way Pande knows a WU is bad. They can look here and see why a particular WU isn't returning results OR inform me that my rig is FUBAR. :D