
Project: 2665 (Run 3, Clone 300, Gen 36)

Posted: Tue Aug 19, 2008 4:41 am
by Zagen30
This WU EUE'd (EARLY_UNIT_END) on me twice (comp specs in profile):


[09:06:21] Completed 242500 out of 250000 steps  (97 percent)
[09:14:18] Warning:  long 1-4 interactions
[09:14:21] Quit 101 - NaN detected: (ener[20])
[09:14:21] 
[09:14:21] Simulation instability has been encountered. The run has entered a
[09:14:21]   state from which no further progress can be made.
[09:14:21] This may be the correct result of the simulation, however if you
[09:14:21]   often see other project units terminating early like this
[09:14:21]   too, you may wish to check the stability of your computer (issues
[09:14:21]   such as high temperature, overclocking, etc.).
[09:14:21] Going to send back what have done.
[09:14:21] logfile size: 205208
[09:14:21] - Writing 205758 bytes of core data to disk...
[09:14:21]   ... Done.
[09:14:21] - Failed to delete work/wudata_04.arc
[09:14:21] No C.P. to delete.
[09:14:21] Warning:  check for stray files
[09:14:21] 
[09:14:21] Folding@home Core Shutdown: EARLY_UNIT_END
[09:14:21] 
[09:14:21] Folding@home Core Shutdown: EARLY_UNIT_END

Folding@Home Client Shutdown at user request.

Folding@Home Client Shutdown.

[14:46:41] - Ask before connecting: No
[14:46:41] - User name: Zagen30 (Team 0)
[14:46:41] - User ID: 4B0CBF697DF8B48F
[14:46:41] - Machine ID: 2
[14:46:41] 
[14:46:41] Loaded queue successfully.
[14:46:41] 
[14:46:41] + Processing work unit
[14:46:41] Core required: FahCore_a1.exe
[14:46:41] Core found.
[14:46:41] Working on Unit 04 [August 18 14:46:41]
[14:46:41] + Working ...
[14:46:42] 
[14:46:42] *------------------------------*
[14:46:42] Folding@Home Gromacs SMP Core
[14:46:42] Version 1.74 (March 10, 2007)
[14:46:42] 
[14:46:42] Preparing to commence simulation
[14:46:42] - Ensuring status. Please wait.
[14:46:59] - Looking at optimizations...
[14:46:59] - Working with standard loops on this execution.
[14:46:59] - Previous termination of core was improper.
[14:46:59] - Files status OK
[14:48:59] 
[14:48:59] Folding@home Core Shutdown: MISSING_WORK_FILES
[14:48:59] Finalizing output
[14:49:02] CoreStatus = 1 (1)
[14:49:02] Client-core communications error: ERROR 0x1
[14:49:02] Deleting current work unit & continuing...
[14:51:23] - Preparing to get new work unit...
[14:51:23] + Attempting to get work packet
[14:51:23] - Connecting to assignment server
[14:51:23] - Successful: assigned to (171.64.65.64).
[14:51:23] + News From Folding@Home: Welcome to Folding@Home
[14:51:24] Loaded queue successfully.
[14:51:44] + Closed connections
[14:51:49] 
[14:51:49] + Processing work unit
[14:51:49] Core required: FahCore_a1.exe
[14:51:49] Core found.
[14:51:49] Working on Unit 05 [August 18 14:51:49]
[14:51:49] + Working ...
[14:51:49] 
[14:51:49] *------------------------------*
[14:51:49] Folding@Home Gromacs SMP Core
[14:51:49] Version 1.74 (March 10, 2007)
[14:51:49] 
[14:51:49] Preparing to commence simulation
[14:51:49] - Ensuring status. Please wait.
[14:52:06] - Looking at optimizations...
[14:52:06] - Working with standard loops on this execution.
[14:52:06] - Previous termination of core was improper.
[14:52:06] - Going toatus OK
[14:52:06] ndard loops.
[14:52:06] - Files status OK
[14:52:29] (decompressed 513.4 percent)
[14:52:30] cket
[14:52:30] 
[14:52:30] Project: 2665 (Run 3, Clone 300, Gen 36)
[14:52:30] 
[14:52:30] 65 (Run 3, Clone 300, Gen 36)
[14:52:30] 
[14:52:33] 65 (Run 3, Clone 300, Gen 36)
[14:52:33] 
[14:52:36] Entering M.D.
[14:52:42] Rejecting checkpoint
[14:52:44] 
[14:52:44] Writing local files
[14:52:45] 
[14:52:45] Writing local files
[14:52:56] Extra SSE boost OK.
[14:52:57] Writing local files
[14:52:58] Completed 0 out of 250000 steps  (0 percent)
[15:18:45] Writing local files
[15:18:45] Completed 2500 out of 250000 steps  (1 percent)
[15:42:46] Writing local files
[15:42:46] Completed 5000 out of 250000 steps  (2 percent)
[16:07:45] Writing local files
[16:07:45] Completed 7500 out of 250000 steps  (3 percent)
[16:32:36] Writing local files
[16:32:36] Completed 10000 out of 250000 steps  (4 percent)
[16:58:17] Writing local files
[16:58:17] Completed 12500 out of 250000 steps  (5 percent)
[17:23:16] Writing local files
[17:23:17] Completed 15000 out of 250000 steps  (6 percent)
[17:48:14] Writing local files
[17:48:14] Completed 17500 out of 250000 steps  (7 percent)
[18:13:20] Writing local files
[18:13:21] Completed 20000 out of 250000 steps  (8 percent)
[18:38:26] Writing local files
[18:38:27] Completed 22500 out of 250000 steps  (9 percent)
[19:03:42] Writing local files
[19:03:42] Completed 25000 out of 250000 steps  (10 percent)
[19:31:27] Writing local files
[19:31:27] Completed 27500 out of 250000 steps  (11 percent)
[19:56:57] Writing local files
[19:56:58] Completed 30000 out of 250000 steps  (12 percent)
[20:22:01] Writing local files
[20:22:01] Completed 32500 out of 250000 steps  (13 percent)
[20:46:57] Writing local files
[20:46:57] Completed 35000 out of 250000 steps  (14 percent)
[21:11:49] Writing local files
[21:11:49] Completed 37500 out of 250000 steps  (15 percent)
[21:36:57] Writing local files
[21:36:58] Completed 40000 out of 250000 steps  (16 percent)
[22:01:33] Writing local files
[22:01:34] Completed 42500 out of 250000 steps  (17 percent)
[22:25:26] Writing local files
[22:25:27] Completed 45000 out of 250000 steps  (18 percent)
[22:49:18] Writing local files
[22:49:18] Completed 47500 out of 250000 steps  (19 percent)
[23:13:01] Writing local files
[23:13:01] Completed 50000 out of 250000 steps  (20 percent)
[23:36:45] Writing local files
[23:36:46] Completed 52500 out of 250000 steps  (21 percent)
[00:00:58] Writing local files
[00:00:59] Completed 55000 out of 250000 steps  (22 percent)
[00:35:03] Writing local files
[00:35:03] Completed 57500 out of 250000 steps  (23 percent)
[01:00:16] Writing local files
[01:00:16] Completed 60000 out of 250000 steps  (24 percent)
[01:24:21] Writing local files
[01:24:21] Completed 62500 out of 250000 steps  (25 percent)
[01:48:19] Writing local files
[01:48:19] Completed 65000 out of 250000 steps  (26 percent)
[02:12:13] Writing local files
[02:12:14] Completed 67500 out of 250000 steps  (27 percent)
[02:36:41] Writing local files
[02:36:42] Completed 70000 out of 250000 steps  (28 percent)
[02:38:55] Warning:  long 1-4 interactions
[02:38:56] Gromacs cannot continue further.
[02:38:56] Going to send back what have done.
[02:38:56] logfile size: 18841
[02:38:56] - Writing 19377 bytes of core data to disk...
[02:38:56]   ... Done.
[02:38:56] - Failed to delete work/wudata_05.sas
[02:38:56] - Failed to delete work/wudata_05.goe
[02:38:56] Warning:  check for stray files
[02:38:56] 
[02:38:56] Folding@home Core Shutdown: EARLY_UNIT_END
[02:38:56] 
[02:38:56] Folding@home Core Shutdown: EARLY_UNIT_END

I tried to qfix it after the first failure, but that didn't work. After the second failure, qfix fixed the first result, and I was able to send that one in, apparently for full credit (the local tally increased by 1). I guess the original was close enough to completion to be fully valid.

Is it normal for qfix to only work if there's more than 1 WU in the queue?
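
To see which queue slot each failure belongs to, a log like the one above can be scanned for the "Working on Unit" and "Core Shutdown" lines. This is only a rough sketch: the patterns are taken from the lines quoted in this post, the function name `shutdowns_by_unit` is made up for illustration, and other client or core versions may word these lines differently.

```python
import re

# Patterns are based only on the log lines quoted in this thread;
# other client versions may format these lines differently.
UNIT_RE = re.compile(r"Working on Unit (\d+)")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")


def shutdowns_by_unit(log_text):
    """Map each queue slot seen in the log to its distinct core shutdown statuses."""
    result = {}
    current = None  # slot named by the most recent "Working on Unit" line
    for line in log_text.splitlines():
        unit = UNIT_RE.search(line)
        if unit:
            current = unit.group(1)
            result.setdefault(current, [])
            continue
        shutdown = SHUTDOWN_RE.search(line)
        # Shutdown lines seen before any "Working on Unit" line are skipped,
        # because the slot they belong to is unknown.
        if shutdown and current is not None:
            status = shutdown.group(1)
            if status not in result[current]:  # the log above repeats the line
                result[current].append(status)
    return result
```

Run against the second half of the log above, this would report MISSING_WORK_FILES for slot 04 and EARLY_UNIT_END for slot 05. The first EARLY_UNIT_END is skipped because the excerpt starts after its "Working on Unit" line.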

Re: Project: 2665 (Run 3, Clone 300, Gen 36)

Posted: Tue Aug 19, 2008 11:45 am
by toTOW
That's because your queue slot is not properly freed. The symptom is the MISSING_WORK_FILES error: the client deletes the work files but fails (I don't know why) to delete the queue information. If you run -delete XX (where XX is the queue number of the faulty WU) before running qfix, qfix will be able to recover it. If qfix finds a result file but the queue slot is not empty, it won't fix anything.

For example, on your first WU:
* fah.exe -delete 04
* qfix.exe
* fah.exe -send all (or simply restarting the client with your usual shortcut)
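
The order of the three steps is what matters, so here is a small sketch that just builds the command lines in that order for a given queue slot. It assumes the executables are named fah.exe and qfix.exe as in the list above (yours may be named differently), and `recovery_commands` is a hypothetical helper, not part of the client. It only returns the commands and does not run anything.

```python
def recovery_commands(slot, client="fah.exe", qfix="qfix.exe"):
    """Return the command lines, in order, for recovering one stuck queue slot.

    The executable names are assumptions taken from the steps listed above.
    """
    if slot < 0:
        raise ValueError("slot must be non-negative")
    return [
        [client, "-delete", f"{slot:02d}"],  # 1. free the stuck queue slot first
        [qfix],                              # 2. then let qfix pick up the result file
        [client, "-send", "all"],            # 3. finally send the recovered result
    ]
```

Calling `recovery_commands(4)` gives the same three commands as the list above, with the slot written as 04.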

Re: Project: 2665 (Run 3, Clone 300, Gen 36)

Posted: Tue Aug 19, 2008 2:08 pm
by Zagen30
Maybe someone should update the FAQ on how to use qfix, because it specifically puts the -delete XX step after running qfix and sending the results.

Re: Project: 2665 (Run 3, Clone 300, Gen 36)

Posted: Tue Aug 19, 2008 6:13 pm
by toTOW
In fact the issue you're seeing is quite new ... I can't remember if we already saw it with v5.9x clients ... :(

Re: Project: 2665 (Run 3, Clone 300, Gen 36)

Posted: Tue Aug 19, 2008 7:30 pm
by Zagen30
Well, you've seen it at least once: after having trouble with 6.22 MPICH, I rolled back to 5.91.