Project: 2665 (Run 3, Clone 826, Gen 25)

Post by **314159** » Mon Aug 25, 2008 5:51 pm

[13:38:00] *------------------------------*
[13:38:00] Folding@Home Gromacs SMP Core
[13:38:00] Version 1.74 (November 27, 2006)
[13:38:00] 
[13:38:00] Preparing to commence simulation
[13:38:00] - Ensuring status. Please wait.
[13:38:17] - Assembly optimizations manually forced on.
[13:38:17] - Not checking prior termination.
[13:38:18] - Expanded 4711387 -> 24426905 (decompressed 518.4 percent)
[13:38:18] - Starting from initial work packet
[13:38:18] 
[13:38:18] Project: 2665 (Run 3, Clone 826, Gen 25)
[13:38:18] 
[13:38:18] Assembly optimizations on if available.
[13:38:18] Entering M.D.
[13:38:24] Rejecting checkpoint
[13:38:25] Protein: HGG in waterExtra SSE boost OK.
[13:38:25] 
[13:38:26] Extra SSE boost OK.
[13:38:26] Writing local files
[13:38:26] Completed 0 out of 250000 steps  (0 percent)
[13:52:56] Writing local files
[13:52:56] Completed 2500 out of 250000 steps  (1 percent)
[14:07:28] Writing local files
[14:07:29] Completed 5000 out of 250000 steps  (2 percent)
[14:22:05] Writing local files
[14:22:06] Completed 7500 out of 250000 steps  (3 percent)
[14:36:42] Writing local files
[14:36:42] Completed 10000 out of 250000 steps  (4 percent)
[14:51:18] Writing local files
[14:51:18] Completed 12500 out of 250000 steps  (5 percent)
[15:05:47] Writing local files
[15:05:47] Completed 15000 out of 250000 steps  (6 percent)
[15:08:22] - Autosending finished units...
[15:08:22] Trying to send all finished work units
[15:08:22] + No unsent completed units remaining.
[15:08:22] - Autosend completed
[15:20:16] Writing local files
[15:20:16] Completed 17500 out of 250000 steps  (7 percent)
[15:29:30] 
[15:29:30] Folding@home Core Shutdown: INTERRUPTED
[15:29:34] CoreStatus = 66 (102)
[15:29:34] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)

(NOT Shutdown by User - Terminal indicates Segmentation Fault)<------------
Identical results with fresh Core, reboot, etc.

[15:29:34] Killing all core threads

Folding@Home Client Shutdown.

Linux Client, Q6600, stock clock.

Ok. I need to ask a question (or two or three):

Is there ANY way to prevent being reassigned this WU several times (or am I missing something)?

(-delete x works fine but one gets the identical WU several times PLUS an unattended machine would simply remain idle until the stopped condition of the client was detected)

What is the proper technique for dealing with these situations?

I have one Quad that was assigned a WU which failed at frame 70; the failure was ultimately detected and the WU was deleted.
Two different WUs were then assigned and completed successfully.
Then the same defective WU was assigned again, only to fail once more at frame 70.
I believe that I reported that one here.

I also do NOT like the fact that the WU count is reset to 0 when one encounters this sort of situation (probably by that "bad" packet from server that one sees).
The machine here has successfully completed over 125 WUs - had one bad one - count reset - ran 10 or 20 more to completion - and I assume is now reset to 0.

I think that that stinks.

I know how to stop a WU prior to failure and either run it on another machine or continue with restart (and I back the darned things up a few times as it progresses).
In most cases, the latter has permitted me to complete the WU.
Was not successful with this one.

BTW, I do NOT want to delete queue.dat since I monitor my farm from one server and the information contained in that file is useful.

Post by **kasson** » Mon Aug 25, 2008 6:41 pm

We're working on better EUE handling, but the faux-INTERRUPTED one is hard. From the client end, it's hard to differentiate this from the user actually hitting Ctrl-C.

I have one idea to fix this from the A1 core end--it's a little hard to track down the problem because the cores aren't giving a lot of debug data, but I have a guess...
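For anyone monitoring a farm from one server in the meantime, a rough sketch of a log scan is shown here. It only flags a log whose core reported INTERRUPTED and notes the last progress line, so that a person can check the terminal. It cannot tell a real Ctrl-C from the faux one, which is exactly the difficulty described above. The patterns are copied from the log excerpt in the first post; the function name `summarize_log` is made up for illustration and is not part of the client.

```python
import re

# Patterns copied from the log excerpt in the first post.
INTERRUPTED = re.compile(r"Folding@home Core Shutdown: INTERRUPTED")
CORE_STATUS = re.compile(r"CoreStatus = (\w+) \((\d+)\)")
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")


def summarize_log(text):
    """Hypothetical helper: report last progress and any INTERRUPTED shutdown."""
    summary = {"last_step": None, "total_steps": None,
               "interrupted": False, "core_status": None}
    for line in text.splitlines():
        m = PROGRESS.search(line)
        if m:
            summary["last_step"] = int(m.group(1))
            summary["total_steps"] = int(m.group(2))
        if INTERRUPTED.search(line):
            summary["interrupted"] = True
        m = CORE_STATUS.search(line)
        if m:
            # Keep both forms as printed in the log, e.g. ("66", 102).
            summary["core_status"] = (m.group(1), int(m.group(2)))
    return summary


sample = """\
[15:20:16] Writing local files
[15:20:16] Completed 17500 out of 250000 steps  (7 percent)
[15:29:30] Folding@home Core Shutdown: INTERRUPTED
[15:29:34] CoreStatus = 66 (102)
"""
print(summarize_log(sample))
```

A log flagged this way on an unattended machine is a candidate for a closer look; whether it was a real user stop still has to be judged by hand.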

Post by **bruce** » Mon Aug 25, 2008 6:49 pm

314159 wrote: I also do NOT like the fact that the WU count is reset to 0 when one encounters this sort of situation (probably by that "bad" packet from server that one sees).
The machine here has successfully completed over 125 WUs - had one bad one - count reset - ran 10 or 20 more to completion - and I assume is now reset to 0.

I think that that stinks.

I do not believe that this error causes the local WU count to be reset. That value is stored in client.cfg and bad packets should not cause changes to client.cfg. I've seen quite a few of these EUEs but my local count has not been reset, though it could depend on the platform. On the other hand, an error which corrupts client.cfg (including improper manual edits) will certainly reset the local WU count.

Next time it happens to you, it would be good if you could post client.cfg both before and after the EUE/Bad-packet condition.
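Since nobody knows in advance when the EUE will hit, the "before" copy has to be one saved earlier. A minimal sketch of comparing two saved copies follows, assuming both snapshots have already been read into strings. The file labels and the sample contents are invented for the example; a real client.cfg will contain other lines.

```python
import difflib


def diff_config(before_text, after_text):
    """Return unified-diff lines between two snapshots of a config file."""
    return list(difflib.unified_diff(
        before_text.splitlines(),
        after_text.splitlines(),
        fromfile="client.cfg.before",   # labels only, not real paths
        tofile="client.cfg.after",
        lineterm="",
    ))


# Invented sample contents, for illustration only.
before = "[settings]\nusername=anon\nlocal=125\n"
after = "[settings]\nusername=anon\nlocal=0\n"

for line in diff_config(before, after):
    print(line)
```

An empty result means the two snapshots are identical, which would support the view that the bad packet does not touch the file.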

Post by **314159** » Mon Aug 25, 2008 7:41 pm

Thank you gentlemen.

I can understand the difficulty in debugging "segmentation" errors but anxiously look forward to a client with improved error handling in general.

Bruce: I agree with you 100% that the error itself does not cause the reset.

However: do a -delete x some time, which was my only viable option in this case, and see what you get.
I could, of course, be wrong on this but my thought is that that "bad packet" message is the count data reset.

I have never manually changed client.cfg. Just FYI. The client does a fine job on this if ever necessary.

Are you absolutely certain that the completed WU count is contained in client.cfg? (Linux SMP console)

What is the proper technique for handling these situations given the idiosyncrasies of the client/cores?
Thanks!

Post by **bruce** » Mon Aug 25, 2008 9:36 pm

314159 wrote: I have never manually changed client.cfg. Just FYI. The client does a fine job on this if ever necessary.
Are you absolutely certain that the completed WU count is contained in client.cfg? (Linux SMP console)

Check client.cfg for the line "LOCAL=nnnn" (I gave my Linux systems some time off.)
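A small sketch for pulling that value out of the file contents, so it can be logged from a monitoring box. It assumes the line has the plain `key=value` shape quoted above and matches the key without regard to case, since the exact capitalisation in the file may differ by platform. The function name `read_local_count` and the sample text are made up for the example.

```python
def read_local_count(cfg_text):
    """Return the integer after 'LOCAL=' in the given text, or None if absent."""
    for line in cfg_text.splitlines():
        key, sep, value = line.partition("=")
        # Case-insensitive match is an assumption, not documented behaviour.
        if sep and key.strip().lower() == "local":
            try:
                return int(value.strip())
            except ValueError:
                return None  # line present but not a number
    return None


# Invented sample contents, for illustration only.
sample_cfg = "[settings]\nusername=anon\nLOCAL=125\n"
print(read_local_count(sample_cfg))
```

Recording this number before and after a `-delete x` would show directly whether the count survives.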
