Project: 3043 (Run 6, Clone 39, Gen 75)

Posted: Mon Feb 25, 2008 4:07 pm
by ChasR
Project: 3043 (Run 6, Clone 39, Gen 75) runs to 8%, reports long 1-4 interactions, and hangs the client. Stopping and restarting the client results in the WU restarting at 8% but quitting with a segmentation fault within a few minutes, whereupon the WU is deleted by the client. The process then repeats with identical results. Q6600 @ 3.4, native Linux SMP client (6.01beta2).

Client Hang:

[09:49:16] Project: 3043 (Run 6, Clone 39, Gen 75)
[09:49:16] 
[09:49:16] Assembly optimizations on if available.
[09:49:16] Entering M.D.
[09:49:22] Protein: 9684 p3029_SMP-emsv-03
[09:49:22] Writing local files
[09:49:22] Extra SSE boost OK.
[09:49:22] 
[09:49:22] Extra SSE boost OK.
[09:49:22] Writing local files
[09:49:22] Completed 0 out of 10000000 steps  (0 percent)
[10:00:12] Writing local files
[10:00:12] Completed 100000 out of 10000000 steps  (1 percent)
[10:08:01] Writing local files
[10:08:01] Completed 200000 out of 10000000 steps  (2 percent)
[10:12:45] - Autosending finished units...
[10:12:45] Trying to send all finished work units
[10:12:45] + No unsent completed units remaining.
[10:12:45] - Autosend completed
[10:17:43] Writing local files
[10:17:43] Completed 300000 out of 10000000 steps  (3 percent)
[10:28:37] Writing local files
[10:28:37] Completed 400000 out of 10000000 steps  (4 percent)
[10:39:29] Writing local files
[10:39:29] Completed 500000 out of 10000000 steps  (5 percent)
[10:50:20] Writing local files
[10:50:20] Completed 600000 out of 10000000 steps  (6 percent)
[11:01:08] Writing local files
[11:01:08] Completed 700000 out of 10000000 steps  (7 percent)
[11:11:57] Writing local files
[11:11:57] Completed 800000 out of 10000000 steps  (8 percent)
[11:20:52] Warning:  long 1-4 interactions
[16:12:45] - Autosending finished units...
[16:12:45] Trying to send all finished work units
[16:12:45] + No unsent completed units remaining.
[16:12:45] - Autosend completed
[22:12:45] - Autosending finished units...
[22:12:45] Trying to send all finished work units
[22:12:45] + No unsent completed units remaining.
[22:12:45] - Autosend completed
[04:12:45] - Autosending finished units...
On restart:

[16:28:15] Project: 3043 (Run 6, Clone 39, Gen 75)
[16:28:15] 
[16:28:15] Assembly optimizations on if available.
[16:28:15] Entering M.D.
[16:28:32]  on if available.
[16:28:32] Entering M.D.
[16:28:38] Protein: 9684 p3029_SMP-emsv-03
[16:28:38] Writing local files
[16:28:38] Completed 800000 out of 10000000 steps  (8 percent)
[16:28:38] Extra SSE boost OK.
[16:28:38] 00 steps  (8 percent)
[16:28:38] Extra SSE boost OK.
[16:30:30] CoreStatus = 1 (1)
[16:30:30] Client-core communications error: ERROR 0x1
[16:30:30] Deleting current work unit & continuing...

Re: Project: 3043 (Run 6, Clone 39, Gen 75)

Posted: Mon Mar 03, 2008 3:18 pm
by ChasR
I got this WU again over the weekend. It needs to be taken out of circulation. If assigned to an unmonitored machine, it may well run doing nothing but autosend until the machine is rebooted. As it was, I lost a day and a half with the client hung.
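
A hang like this shows up in the log as a last "Completed ... steps" line followed only by autosend entries. As a rough sketch (this is not part of the client; the function name `stalled_for` and the idea of scanning the log with a script are assumptions made for illustration), the following Python measures how long a log has gone without progress. It assumes every line starts with a `[HH:MM:SS]` timestamp as in the logs above, and it treats any backwards jump in the clock as a midnight rollover, since the timestamps carry no date.

```python
import re
from datetime import timedelta

# "[11:11:57] Completed 800000 out of 10000000 steps  (8 percent)"
LINE = re.compile(r"^\[(\d\d):(\d\d):(\d\d)\]\s?(.*)$")
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")


def stalled_for(log_lines):
    """Return the time between the last progress line and the last log line.

    Returns None when the log contains no progress line at all.
    """
    day = 0               # number of assumed midnight rollovers so far
    previous = None       # seconds-of-day of the previous timestamped line
    last_progress = None  # elapsed time at the last "Completed" line
    latest = None         # elapsed time at the last timestamped line
    for line in log_lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip lines without a well-formed timestamp
        hours, minutes, seconds, text = match.groups()
        now = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        if previous is not None and now < previous:
            day += 1  # clock went backwards: assume midnight passed
        previous = now
        latest = timedelta(days=day, seconds=now)
        if PROGRESS.search(text):
            last_progress = latest
    if last_progress is None:
        return None
    return latest - last_progress
```

Run against the hang log in the first post, this would report roughly 17 hours between the 8 percent line and the last autosend entry shown. Since each percent took about ten minutes on that machine, a stall of an hour or more could reasonably be treated as a hung client, though the right threshold is a judgment call.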

WU 3043 error

Posted: Thu Mar 20, 2008 3:58 am
by Rolo71
WU 3043 always fails at 64% on my computer.
This is the log:

[20:52:17] Completed 6400000 out of 10000000 steps (64 percent)
[21:01:21] Warning: long 1-4 interactions
[21:01:21] Gromacs cannot continue further.
[21:01:21] Going to send back what have done.
[21:01:21] logfile size: 169213
[21:01:21] - Writing 169749 bytes of core data to disk...
[21:01:21] ... Done.
[21:01:21] - Failed to delete work/wudata_07.chk
[21:01:22] - Failed to delete work/wudata_07.sas
[21:01:22] Warning: check for stray files
[21:03:22]
[21:03:22] Folding@home Core Shutdown: EARLY_UNIT_END
[21:03:22]
[21:03:22] Folding@home Core Shutdown: EARLY_UNIT_END
[21:03:25] CoreStatus = 7B (123)
[21:03:25] Client-core communications error: ERROR 0x7b
[21:03:25] Deleting current work unit & continuing...


I'm running a [email protected] with 2 SMP clients from different folders and with different machine IDs.
I'm also using the FAH affinity changer.

Re: WU 3043 error

Posted: Thu Mar 20, 2008 4:01 am
by anandhanju
Please post the Run, Clone, and Gen numbers from the log in this format:

Project: 3043 (Run xx, Clone yy, Gen zz)
This will help one of the moderators look up the ones that EUEd on your system.
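
If the log is long, the line can be pulled out with a short script. The following Python is only a sketch: the function names `find_work_units` and `format_prcg` are made up for this example, and the pattern assumes the log prints the line exactly as in the logs posted in this thread.

```python
import re

# Matches lines such as "[09:49:16] Project: 3043 (Run 6, Clone 39, Gen 75)"
PRCG_PATTERN = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s*(\d+),\s*Clone\s*(\d+),\s*Gen\s*(\d+)\)"
)


def find_work_units(log_text):
    """Return the distinct (project, run, clone, gen) tuples in log order."""
    seen = []
    for match in PRCG_PATTERN.finditer(log_text):
        prcg = tuple(int(group) for group in match.groups())
        if prcg not in seen:  # SMP logs often print the same line twice
            seen.append(prcg)
    return seen


def format_prcg(prcg):
    """Format a tuple in the form requested for forum reports."""
    return "Project: {} (Run {}, Clone {}, Gen {})".format(*prcg)
```

Calling `find_work_units` on the text of the log file and passing each result to `format_prcg` gives lines that can be pasted straight into a post.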

Re: WU 3043 error

Posted: Thu Mar 20, 2008 1:56 pm
by Rolo71
It's Project: 3043 (Run 6, Clone 39, Gen 75)

Re: WU 3043 error

Posted: Thu Mar 20, 2008 2:48 pm
by Leoslocks
I have completed two 3043 WUs on a stock Q6600. Have you completed WUs without the Affinity Changer?

[00:36:13] Project: 3043 (Run 0, Clone 82, Gen 6)
[00:36:13] 
[00:36:13] Entering M.D.
[00:36:19] 
[00:36:19] Writing local files
[00:36:19] Extra SSE boost OK.
[00:36:19] SMP-emsv-03Extra SSE boost OK.
[00:36:19] 
[00:36:19] Extra SSE boost OK.
[00:36:19] Writing local files
[00:36:19] Completed 0 out of 10000000 steps  (0 percent)
[00:45:40] Writing local files
[00:45:40] Completed 100000 out of 10000000 steps  (1 percent)
///////////////////////////////////////////////////////
[16:21:56] Completed 10000000 out of 10000000 steps  (100 percent)
[16:21:56] Writing final coordinates.
[16:21:56] Past main M.D. loop
[16:21:56] Will end MPI now

Re: WU 3043 error

Posted: Thu Mar 20, 2008 3:36 pm
by anandhanju
Rolo71 wrote:It's Project: 3043 (Run 6, Clone 39, Gen 75)
Please see this topic indicating similar reports with this WU (although at a different %). Can a mod please look this up and advise?

Re: WU 3043 error

Posted: Thu Mar 20, 2008 4:22 pm
by 7im
One record, obviously not completed fully...

Your WU (P3043 R6 C39 G75) was added to the stats database on 2007-12-22 08:45:14 for 172.67 points of credit.

Nothing more recent.

Re: WU 3043 error

Posted: Thu Mar 20, 2008 4:27 pm
by bruce
anandhanju wrote:Please see this topic indicating similar reports with this WU (although at a different %). Can a mod please look this up and advise?
Sure, but I'm not sure that's too helpful. The WU has been returned only once, and that was from Linux for partial credit.
Your WU (P3043 R6 C39 G75) was added to the stats database on 2007-12-22 08:45:14 for 172.67 points of credit.

We don't have access to any numbers about how many times it has been assigned. The fact that Windows Beta SMP tends to delete most WUs after an error makes it very difficult to tell what's going on in situations like this.

I'll merge the two threads since they're about the same WU.

EDIT: Oops. 7im beat me to it.

Re: Project: 3043 (Run 6, Clone 39, Gen 75)

Posted: Thu Mar 20, 2008 5:27 pm
by Rolo71
The log above is from March 19.

This is the log from March 10:

[12:02:53] Working on Unit 06 [March 10 12:02:53]
[12:02:53] + Working ...
[12:02:53]
[12:02:53] *------------------------------*
[12:02:53] Folding@Home Gromacs SMP Core
[12:02:53] Version 1.74 (March 10, 2007)
[12:02:53]
[12:02:53] Preparing to commence simulation
[12:02:53] - Ensuring status. Please wait.
[12:02:54] - Couldn't send HTTP request to server
[12:02:54] + Could not connect to Work Server (results)
[12:02:54] (171.64.65.64:8080)
[12:02:54] - Error: Could not transmit unit 05 (completed March 10) to work server.


[12:02:54] + Attempting to send results
[12:03:10] - Looking at optimizations...
[12:03:10] - Working with standard loops on this execution.
[12:03:10] Examination of work files indicates 8 consecutive improper terminations of core.
[12:03:11] - Expanded 283027 -> 1508541 (decompressed 533.0 percent)
[12:03:11]
[12:03:11] Project: 3043 (Run 6, Clone 39, Gen 75)
[12:03:11]
[12:03:11] Entering M.D.
[12:03:14] - Couldn't send HTTP request to server
[12:03:14] + Could not connect to Work Server (results)
[12:03:14] (171.64.122.76:8080)
[12:03:14] Could not transmit unit 05 to Collection server; keeping in queue.
[12:03:20] Calling FAH init
[12:03:20] Read topology
[12:03:21] (Starting from checkpoint)
[12:03:22] SSE boost OK.
[12:03:22] t
[12:03:22] Protein: 9684 p3029_SMP-emsv-03
[12:03:22] Writing local files
[12:03:22] Extra SSE boost OK.
[12:03:22] Writing local files
[12:03:22] Completed 0 out of 10000000 steps (0 percent)
[12:20:30] Writing local files
[19:13:51] Writing local files
[19:13:51] Completed 3100000 out of 10000000 steps (31 percent)
[19:21:43] Gromacs cannot continue further.
[19:21:43] Going to send back what have done.
[19:21:43] logfile size: 106449
[19:21:43] - Writing 106985 bytes of core data to disk...
[19:21:43] ... Done.
[19:21:43] - Failed to delete work/wudata_06.xtc
[19:21:44] - Failed to delete work/wudata_06.bed
[19:21:44] - Failed to delete work/wudata_06.sas
[19:21:44] - Failed to delete work/wudata_06.goe
[19:21:44] Warning: check for stray files
[19:23:44]
[19:23:44] Folding@home Core Shutdown: EARLY_UNIT_END
[19:23:44]
[19:23:44] Folding@home Core Shutdown: EARLY_UNIT_END
[19:23:47] CoreStatus = 7B (123)
[19:23:47] Client-core communications error: ERROR 0x7b
[19:23:47] Deleting current work unit & continuing...

Re: Project: 3043 (Run 6, Clone 39, Gen 75)

Posted: Thu Mar 20, 2008 5:28 pm
by Rolo71
This is the log from March 13:

[09:48:34] Working on Unit 00 [March 13 09:48:34]
[09:48:34] + Working ...
[09:48:34]
[09:48:34] *------------------------------*
[09:48:34] Folding@Home Gromacs SMP Core
[09:48:34] Version 1.74 (March 10, 2007)
[09:48:34]
[09:48:34] Preparing to commence simulation
[09:48:34] - Ensuring status. Please wait.
[09:48:34] - Starting from initial work packet
[09:48:34]
[09:48:34] Project: 3043 (Run 6, Clone 39, Gen 75)
[09:48:34]
[09:48:34] Assembly optimizations on if available.
[09:48:34] Entering M.D.
[09:48:51] ial work packet
[09:48:52] rting from initial work packet
[09:48:52]
[09:48:52] Project: 3043 (Run 6, Clone 39, Gen 75)
[09:48:52]
[09:48:52] Entering M.D.
[09:48:58]
[09:48:58] Writing local files
[09:48:58] Extra SSE boost OK.
[09:48:58] SMP-emsv-03Extra SSE boost OK.
[09:48:58]
[09:48:58] Extra SSE boost OK.
[09:48:58] Writing local files
[09:48:58] Completed 0 out of 10000000 steps (0 percent)
[10:02:23] Writing local files
[00:26:41] Completed 6400000 out of 10000000 steps (64 percent)
[00:35:48] Warning: long 1-4 interactions
[00:35:48] Gromacs cannot continue further.
[00:35:48] Going to send back what have done.
[00:35:48] logfile size: 169213
[00:35:48] - Writing 169749 bytes of core data to disk...
[00:35:48] ... Done.
[00:35:48] - Failed to delete work/wudata_00.arc
[00:35:48] - Failed to delete work/wudata_00.xtc
[00:35:48] No C.P. to delete.
[00:35:48] - Failed to delete work/wudata_00.sas
[00:35:48] Warning: check for stray files
[00:37:49]
[00:37:49] Folding@home Core Shutdown: EARLY_UNIT_END
[00:37:49]
[00:37:49] Folding@home Core Shutdown: EARLY_UNIT_END
[00:37:53] CoreStatus = 7B (123)
[00:37:53] Client-core communications error: ERROR 0x7b
[00:37:53] Deleting current work unit & continuing...

Re: Project: 3043 (Run 6, Clone 39, Gen 75)

Posted: Thu Mar 20, 2008 5:56 pm
by bruce
So far it appears that every WU is getting ERROR 0x7b which seems to mean very little. All we really know is that Windows cancelled the folding process, with no explanation as to why.
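
To check that claim across several logs, the CoreStatus lines can be tallied. This Python is a sketch under the assumption that the client always writes the line as `CoreStatus = 7B (123)`, hex first and decimal in parentheses, as in the logs above; the function name `tally_core_status` is invented for the example, and the script says nothing about what any code means.

```python
import re
from collections import Counter

# "[21:03:25] CoreStatus = 7B (123)" -> hex code, then decimal in parentheses
CORE_STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def tally_core_status(log_text):
    """Count how often each CoreStatus code appears; keys are integer codes."""
    counts = Counter()
    for hex_code, dec_code in CORE_STATUS.findall(log_text):
        code = int(hex_code, 16)
        # The log prints the same code in hex and decimal; count the line
        # only when the two agree, so garbled lines are ignored.
        if code == int(dec_code):
            counts[code] += 1
    return counts
```

A result where `0x7B` is the only key would back up the observation that every failure in the pasted Windows logs ends the same way.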

You can TRY stopping the folding process shortly before the "Warning: long 1-4 interactions" message would appear. Make a backup, then resume processing. For some unexplained reason, some people have been able to get past the error and complete WUs like this one after a stop/resume.
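
For the backup step, copying the client's working files by hand is enough, but it can also be scripted. The Python below is a minimal sketch with several assumptions: the client is already stopped, the `work` folder sits inside the client folder (the logs above show paths like `work/wudata_07.chk`), and a `queue.dat` file next to it is worth copying if it exists. The function name `backup_client_state` and the backup folder naming are invented for this example.

```python
import shutil
import time
from pathlib import Path


def backup_client_state(client_dir, backup_root):
    """Copy work/ and queue.dat (if present) into a timestamped backup folder.

    Stop the client before calling this so the files are not being written.
    Returns the path of the new backup folder.
    """
    client_dir = Path(client_dir)
    dest = Path(backup_root) / time.strftime("fah-backup-%Y%m%d-%H%M%S")
    dest.mkdir(parents=True)

    # Checkpoint and trajectory files live in the work folder.
    shutil.copytree(client_dir / "work", dest / "work")

    # Assumption: queue.dat holds the client's queue state; copy it if found.
    queue_file = client_dir / "queue.dat"
    if queue_file.exists():
        shutil.copy2(queue_file, dest / queue_file.name)
    return dest
```

If the WU fails again after the resume, the saved copy can be put back (again with the client stopped) for another attempt from the same point.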

Re: Project: 3043 (Run 6, Clone 39, Gen 75)

Posted: Mon Mar 24, 2008 5:41 pm
by Rolo71
I finally managed to fold this WU completely.

I stopped the console at 60%, then restarted my computer and continued. It went straight to 100%.