
Project: 3062 (Run 2, Clone 71, Gen 7)

Posted: Tue Mar 18, 2008 3:15 pm
by 314159
Q6600, Linux Client, Stable Machine, Stock Clock

[23:36:42] *------------------------------*
[23:36:42] Folding@Home Gromacs SMP Core
[23:36:59] - Starting from initial work packet 
[23:36:59] Project: 3062 (Run 2, Clone 71, Gen 7)
[23:37:06] Protein: p3062_lambda5_99sbExtra SSE boost OK.
[23:37:06] Extra SSE boost OK.
[23:37:06] Writing local files
[23:37:06] Completed 0 out of 5000000 steps  (0 percent)

[08:57:18] Writing local files
[08:57:18] Completed 2800000 out of 5000000 steps  (56 percent)
[09:07:17] Warning:  long 1-4 interactions
[09:07:21] CoreStatus = 0 (0)
[09:07:21] Client-core communications error: ERROR 0x0
[09:07:21] Deleting current work unit & continuing...

[09:11:55] *------------------------------*
[09:11:55] Folding@Home Gromacs SMP Core
[09:11:55] Version 1.74 (November 27, 2006)
[09:12:12] Project: 3062 (Run 2, Clone 71, Gen 7)
[09:12:19] Protein: p3062_lambda5_99sbExtra SSE boost OK.
[09:12:19] Extra SSE boost OK.
[09:12:19] Completed 0 out of 5000000 steps  (0 percent)
[14:43:14] Completed 1650000 out of 5000000 steps  (33 percent)

I do not understand the "logic" of reassigning the same WU to a machine that has experienced a client-core communications error.
I also do not have time to baby-sit the second (or third) assignment of the same WU.

Last week, when I was unavailable, I noted that one SMP WU had been assigned three times consecutively after 0X0'ing at the same point.
This machine, also a Q6600, then completed three WUs successfully, after which the same offending WU was again assigned.

The error trapping in the SMP clients is quite substandard, and perhaps another approach to this issue would be appropriate until the code is improved.
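
For anyone who wants to check whether the same WU keeps failing without watching the client, here is a rough Python sketch that counts client-core errors per work unit in a log. It assumes the line formats shown in the excerpt above (other core versions may word them differently), and the function name `count_core_errors` is made up for illustration; it is not part of the FAH client.

```python
import re
from collections import Counter

# Patterns copied from the log excerpt in this thread (assumed format).
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
ERROR_RE = re.compile(r"Client-core communications error")


def count_core_errors(log_text):
    """Return a Counter mapping (project, run, clone, gen) to error count."""
    failures = Counter()
    current_wu = None
    for line in log_text.splitlines():
        match = PROJECT_RE.search(line)
        if match:
            # Remember which WU the following log lines belong to.
            current_wu = tuple(int(g) for g in match.groups())
        elif ERROR_RE.search(line) and current_wu is not None:
            failures[current_wu] += 1
    return failures


sample = """\
[23:36:59] Project: 3062 (Run 2, Clone 71, Gen 7)
[09:07:17] Warning:  long 1-4 interactions
[09:07:21] Client-core communications error: ERROR 0x0
[09:12:12] Project: 3062 (Run 2, Clone 71, Gen 7)
[14:43:14] Completed 1650000 out of 5000000 steps  (33 percent)
"""

for wu, count in count_core_errors(sample).items():
    print(wu, count)
```

A WU that shows a count of two or more is the kind of repeat offender described above.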

Re: Project: 3062 (Run 2, Clone 71, Gen 7)

Posted: Wed Mar 19, 2008 12:12 pm
by susato
This particular work unit represents an exception to the rule that when a unit fails from simulation instability (e.g. long 1-4 interactions), it will fail again:

Project 3062, Run 2, Clone 71, Gen 7

Donator: 314159 Team: 1971
CPUId: 759EXXXXXXXXXXXXE8
Credit: 1732 Credit Time: 2008-03-18 20:19:15
Entered into logs at: 2008-03-18 20:00:04
WU assigned to donor at: 2008-03-18 01:09:58
Days taken to complete WU: 0.78
Error code: 0
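
The 0.78 days figure is consistent with the gap between the "WU assigned to donor" and "Entered into logs" timestamps. A quick check in Python, assuming both timestamps are in the same time zone and that this is how the field is derived (the stats record itself does not say):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the stats record above.
assigned = datetime.strptime("2008-03-18 01:09:58", FMT)
logged = datetime.strptime("2008-03-18 20:00:04", FMT)

# Elapsed time in days: 18 h 50 min 6 s = 67806 s.
days_taken = (logged - assigned).total_seconds() / 86400
print(round(days_taken, 2))  # 0.78
```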

Hi 314159 (team 1971),
Your WU (P3062 R2 C71 G7) was added to the stats database on 2008-03-18 20:19:15 for 1732 points of credit.

This entry represents your second try at the work unit, starting at 3/18/08, 9:09 UTC as shown in your FAHlog.txt excerpt.
Congrats on the completion - unusual in my experience with failed units. Did you perhaps stop and restart it partway, to help it finish?

Re: Project: 3062 (Run 2, Clone 71, Gen 7)

Posted: Wed Mar 19, 2008 3:23 pm
by 314159
Contrary to my statement that I did not have time to baby-sit this WU, I spent considerable time attempting to complete it - using every trick known to me. :)

It was my initial attempt at this particular project and I really wanted it to complete. When it did, I broke out the (non-alcoholic) champagne.

Suspecting a Core crash as the cause, I restarted the WU several times and backed it up at each attempt. (I did not have to go to the backups.)

My only concern is whether the results returned are scientifically valid. My conclusion is that they are.

BTW, on two occasions, I happened to be refreshing the Mac SMP client JUST at the time it had logged long 1-4 interactions (talk about luck - and a true story).
I immediately stopped the client and upon restart, it picked up at the previous checkpoint in each case and completed normally!!

Thank you very much for the log detail. (you didn't think that I deleted it, did you?) !!!! :D

Re: Project: 3062 (Run 2, Clone 71, Gen 7)

Posted: Wed Mar 19, 2008 4:05 pm
by 7im
314159 wrote: "My only concern is whether the results returned are scientifically valid. My conclusion is that they are."
Good conclusion. Stanford has security and data checks during upload. If you got points, the data was accepted.