
Project: 6012 (Run 0, Clone 390, Gen 99)

Posted: Tue Mar 30, 2010 4:53 pm
by Aardvark
Another failed WU. Did not even make it to 1%.

Work file is again not adequate for return to Stanford.

This is starting to get a little stale....

Log file follows:

Code:

[16:33:30] + Connections closed: You may now disconnect
[16:33:35] 
[16:33:35] + Processing work unit
[16:33:35] Core required: FahCore_a3.exe
[16:33:35] Core found.
[16:33:35] Working on queue slot 01 [March 30 16:33:35 UTC]
[16:33:35] + Working ...
[16:33:35] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 2 -checkpoint 15 -verbose -lifeline 7949 -version 629'

[16:33:35] 
[16:33:35] *------------------------------*
[16:33:35] Folding@Home Gromacs SMP Core
[16:33:35] Version 2.17 (Mar 7 2010)
[16:33:35] 
[16:33:35] Preparing to commence simulation
[16:33:35] - Ensuring status. Please wait.
[16:33:45] - Looking at optimizations...
[16:33:45] - Working with standard loops on this execution.
[16:33:45] - Created dyn
[16:33:45] - Files status OK
[16:33:45] - Expanded 1796995 -> 2078149 (decompressed 115.6 percent)
[16:33:45] Called DecompressByteArray: compressed_data_size=1796995 data_size=2078149, decompressed_data_size=2078149 diff=0
[16:33:45] - Digital signature verified
[16:33:45] 
[16:33:45] Project: 6012 (Run 0, Clone 390, Gen 99)
[16:33:45] 
[16:33:45] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=0, HOSTNAME=thread #0
NNODES=2, MYRANK=1, HOSTNAME=thread #1
Reading file work/wudata_01.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
50000004 steps, 100000.0 ps (continuing from step 49500004,  99000.0 ps).
[16:33:52] Completed 0 out of 500000 steps  (0%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
8 particles communicated to PME node 1 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[16:35:04] mdrun returned 255
[16:35:04] Going to send back what have done -- stepsTotalG=500000
[16:35:04] Work fraction=0.0005 steps=500000.
[16:35:05] CoreStatus = 0 (0)
[16:35:05] Sending work to server
[16:35:05] Project: 6012 (Run 0, Clone 390, Gen 99)
[16:35:05] - Error: Could not get length of results file work/wuresults_01.dat
[16:35:05] - Error: Could not read unit 01 file. Removing from queue.
[16:35:05] Trying to send all finished work units
[16:35:05] + No unsent completed units remaining.
[16:35:05] - Preparing to get new work unit...
[16:35:06] > Press "c" to connect to the server to download unit
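An aside on the "Expanded" lines in these logs: the percentage is just decompressed size divided by compressed size. A minimal Python check using the numbers from the log above (the truncation to one decimal place, rather than rounding, is my inference from matching the printed values, not documented behavior):

```python
def decompressed_percent(compressed: int, decompressed: int) -> float:
    """Reproduce the core's 'decompressed N percent' figure.

    Truncates (not rounds) to one decimal place, which is what
    matches the values printed in these logs.
    """
    return int(decompressed / compressed * 1000) / 10

# Values taken from the log above
print(decompressed_percent(1796995, 2078149))  # 115.6, as logged
```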


Re: Various Project: 601x problems

Posted: Tue Mar 30, 2010 5:55 pm
by bruce
Your client is misbehaving and I'm sure that's frustrating, but that doesn't mean you need to abandon FAH. For reliable performance, I suggest you switch to N classic clients (one per physical core) until the issue can be resolved.

As a reminder, please re-read what it says about beta clients on the download page.

Re: Various Project: 601x problems

Posted: Tue Mar 30, 2010 6:00 pm
by AlanH
bruce wrote:As a matter of fact, the main developer is working on the problem. Apparently you didn't see his response to your previous report.
viewtopic.php?f=19&t=13980&p=137199#p137199

Two topics on the same subject merged.
Sure, I saw it.
Kasson indicated that the log info was useful, and that's why I reposted my data here.

Re: Various Project: 601x problems

Posted: Tue Mar 30, 2010 6:01 pm
by P5-133XL
bruce wrote:Your client is misbehaving and I'm sure that's frustrating, but that doesn't mean you need to abandon FAH. For reliable performance, I suggest you switch to N classic clients (one per physical core) until the issue can be resolved.

As a reminder, please re-read what it says about beta clients on the download page.
Or even just drop the -advmethods flag and switch from A3's to A1/A2's. You'll get more points than n uniprocessor clients, but not as many as with A3's.

Re: Various Project: 601x problems

Posted: Tue Mar 30, 2010 7:15 pm
by bruce
Just so you know you're not alone . . . .

Code:

[16:51:24] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 02 -np 4 -nocpulock -checkpoint 15 -forceasm -verbose -lifeline 1004 -version 629'

[16:51:24] 
[16:51:24] *------------------------------*
[16:51:24] Folding@Home Gromacs SMP Core
[16:51:24] Version 2.17 (Mar 12, 2010)
[16:51:24] 
[16:51:24] Preparing to commence simulation
[16:51:24] - Assembly optimizations manually forced on.
[16:51:24] - Not checking prior termination.
[16:51:25] - Expanded 1799234 -> 2396877 (decompressed 133.2 percent)
[16:51:25] Called DecompressByteArray: compressed_data_size=1799234 data_size=2396877, decompressed_data_size=2396877 diff=0
[16:51:25] - Digital signature verified
[16:51:25] 
[16:51:25] Project: 6014 (Run 0, Clone 29, Gen 121)
[16:51:25] 
[16:51:25] Assembly optimizations on if available.
[16:51:25] Entering M.D.
[16:51:32] Completed 0 out of 500000 steps  (0%)
[17:18:48] Completed 5000 out of 500000 steps  (1%)
[17:20:30] - Autosending finished units... [March 29 17:20:30 UTC]
[17:20:30] Trying to send all finished work units
[17:20:30] + No unsent completed units remaining.
[17:20:30] - Autosend completed
[17:45:40] Completed 10000 out of 500000 steps  (2%)
[18:12:11] Completed 15000 out of 500000 steps  (3%)
[18:38:43] Completed 20000 out of 500000 steps  (4%)
[19:05:15] Completed 25000 out of 500000 steps  (5%)
[19:32:02] Completed 30000 out of 500000 steps  (6%)
[19:58:48] Completed 35000 out of 500000 steps  (7%)
[20:25:53] Completed 40000 out of 500000 steps  (8%)
[20:52:50] Completed 45000 out of 500000 steps  (9%)
[21:19:45] Completed 50000 out of 500000 steps  (10%)
[21:46:44] Completed 55000 out of 500000 steps  (11%)
[22:13:42] Completed 60000 out of 500000 steps  (12%)
[22:40:34] Completed 65000 out of 500000 steps  (13%)
[23:07:42] Completed 70000 out of 500000 steps  (14%)
[23:20:29] - Autosending finished units... [March 29 23:20:29 UTC]
[23:20:29] Trying to send all finished work units
[23:20:29] + No unsent completed units remaining.
[23:20:29] - Autosend completed
[23:34:49] Completed 75000 out of 500000 steps  (15%)
[00:01:54] Completed 80000 out of 500000 steps  (16%)
[00:29:18] Completed 85000 out of 500000 steps  (17%)
[00:49:57] Gromacs cannot continue further.
[00:49:57] Going to send back what have done -- stepsTotalG=500000
[00:49:57] Work fraction=-1.#IND steps=500000.
--------- Vista popup on the screen demanded attention and work suspended until I responded ----14+ hrs later -----
[05:20:27] - Autosending finished units... [March 30 05:20:27 UTC]
[05:20:27] Trying to send all finished work units
[05:20:27] + No unsent completed units remaining.
[05:20:27] - Autosend completed
[11:20:26] - Autosending finished units... [March 30 11:20:26 UTC]
[11:20:26] Trying to send all finished work units
[11:20:26] + No unsent completed units remaining.
[11:20:26] - Autosend completed
[15:31:15] CoreStatus = C0000005 (-1073741819)
[15:31:15] Client-core communications error: ERROR 0xc0000005
[15:31:15] Deleting current work unit & continuing...
[15:31:29] Trying to send all finished work units
[15:31:29] + No unsent completed units remaining.
[15:31:29] - Preparing to get new work unit...
P5-133XL wrote:Or even just drop the -advmethods flag and switch from A3's to A1/A2's. You'll get more points than n uniprocessor clients but not as many as A3's
Great idea. It's always a hassle to get all four clients to shut down more or less simultaneously (i.e., the same day) using -oneunit and restart SMP.

Project: 6014 (Run 1, Clone 36, Gen 95)

Posted: Tue Mar 30, 2010 7:24 pm
by Aardvark
I tried to follow P5-133XL's suggestion and did not specify -advmethods. However, I received another a3core WU (See Subject above). I have to assume that the -smp argument is now "hot-wired" into a3 WUs. Is that the way things are supposed to be?

Anyway, it failed at <1%.

Log File follows:

Code:

[18:58:37] + Processing work unit
[18:58:37] Core required: FahCore_a3.exe
[18:58:37] Core found.
[18:58:37] Working on queue slot 02 [March 30 18:58:37 UTC]
[18:58:37] + Working ...
[18:58:37] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 02 -np 2 -checkpoint 15 -verbose -lifeline 695 -version 629'

[18:58:37] 
[18:58:37] *------------------------------*
[18:58:37] Folding@Home Gromacs SMP Core
[18:58:37] Version 2.17 (Mar 7 2010)
[18:58:37] 
[18:58:37] Preparing to commence simulation
[18:58:37] - Ensuring status. Please wait.
[18:58:46] - Looking at optimizations...
[18:58:46] - Working with standard loops on this execution.
[18:58:46] - Created dyn
[18:58:46] - Files status OK
[18:58:47] - Expanded 1798615 -> 2396877 (decompressed 133.2 percent)
[18:58:47] Called DecompressByteArray: compressed_data_size=1798615 data_size=2396877, decompressed_data_size=2396877 diff=0
[18:58:47] - Digital signature verified
[18:58:47] 
[18:58:47] Project: 6014 (Run 1, Clone 36, Gen 95)
[18:58:47] 
[18:58:47] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=0, HOSTNAME=thread #0
NNODES=2, MYRANK=1, HOSTNAME=thread #1
Reading file work/wudata_02.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
48000004 steps,  96000.0 ps (continuing from step 47500004,  95000.0 ps).
[18:58:54] Completed 0 out of 500000 steps  (0%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
3 particles communicated to PME node 1 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[18:59:45] mdrun returned 255
[18:59:45] Going to send back what have done -- stepsTotalG=500000
[18:59:45] Work fraction=0.0004 steps=500000.
[18:59:49] logfile size=11879 infoLength=11879 edr=0 trr=25
[18:59:49] logfile size: 11879 info=11879 bed=0 hdr=25
[18:59:49] - Writing 12417 bytes of core data to disk...
[18:59:49]   ... Done.
[18:59:50] 
[18:59:50] Folding@home Core Shutdown: UNSTABLE_MACHINE
[18:59:50] CoreStatus = 7A (122)
[18:59:50] Sending work to server
[18:59:50] Project: 6014 (Run 1, Clone 36, Gen 95)


[18:59:50] + Attempting to send results [March 30 18:59:50 UTC]
[18:59:50] - Reading file work/wuresults_02.dat from core
[18:59:50]   (Read 12417 bytes from disk)
[18:59:51] > Press "c" to connect to the server to upload results
At least it wrote a good Work file and the info should now have made it to Stanford.....
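For what it's worth, the mdrun headers in these logs make the Gen bookkeeping visible: each Gen in these projects is a 500,000-step slice, and Gen N resumes from step N*500000 + 4. (The +4 offset and the pattern itself are inferred from the headers quoted in this thread, not from any documentation.)

```python
STEPS_PER_GEN = 500_000  # each WU in these projects runs 500,000 steps
OFFSET = 4               # constant offset seen in the mdrun headers

def start_step(gen: int) -> int:
    """Starting step for a given Gen, per the pattern in these logs."""
    return gen * STEPS_PER_GEN + OFFSET

# Matches the headers quoted in this thread:
print(start_step(99))  # 49500004  (Project 6012, Gen 99)
print(start_step(95))  # 47500004  (Project 6014, Gen 95)
```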

I think I am going to take a break from Folding for a while. Will be checking for that "action plan".

Re: Various Project: 601x problems

Posted: Tue Mar 30, 2010 7:27 pm
by ikerekes
I don't know if this qualifies as the same problem, but:

Code:

[03:10:03] + Processing work unit
[03:10:03] Core required: FahCore_a3.exe
[03:10:03] Core found.
[03:10:03] Working on queue slot 02 [March 30 03:10:03 UTC]
[03:10:03] + Working ...
[03:10:03] 
[03:10:03] *------------------------------*
[03:10:03] Folding@Home Gromacs SMP Core
[03:10:03] Version 2.17 (Mar 12, 2010)
[03:10:03] 
[03:10:03] Preparing to commence simulation
[03:10:03] - Ensuring status. Please wait.
[03:10:13] - Looking at optimizations...
[03:10:13] - Working with standard loops on this execution.
[03:10:13] - Previous termination of core was improper.
[03:10:13] - Going to use standard loops.
[03:10:13] - Files status OK
[03:10:13] - Expanded 1772217 -> 1975105 (decompressed 111.4 percent)
[03:10:13] Called DecompressByteArray: compressed_data_size=1772217 data_size=1975105, decompressed_data_size=1975105 diff=0
[03:10:13] - Digital signature verified
[03:10:13] 
[03:10:13] Project: 6021 (Run 0, Clone 135, Gen 68)
[03:10:13] 
[03:10:13] Entering M.D.
[03:10:19] Using Gromacs checkpoints
[03:10:20] Resuming from checkpoint
[03:10:21] Verified work/wudata_02.log
[03:10:22] Verified work/wudata_02.trr
[03:10:22] Verified work/wudata_02.edr
[03:10:22] Completed 287296 out of 500000 steps  (57%)
[03:16:05] Completed 290000 out of 500000 steps  (58%)
[03:26:33] Completed 295000 out of 500000 steps  (59%)
[03:37:03] Completed 300000 out of 500000 steps  (60%)
[03:47:36] Completed 305000 out of 500000 steps  (61%)
[03:58:08] Completed 310000 out of 500000 steps  (62%)
[04:08:42] Completed 315000 out of 500000 steps  (63%)
[04:19:14] Completed 320000 out of 500000 steps  (64%)
[04:29:46] Completed 325000 out of 500000 steps  (65%)
[04:40:18] Completed 330000 out of 500000 steps  (66%)
[04:50:50] Completed 335000 out of 500000 steps  (67%)
[05:01:22] Completed 340000 out of 500000 steps  (68%)
[05:11:55] Completed 345000 out of 500000 steps  (69%)
[05:22:28] Completed 350000 out of 500000 steps  (70%)
[05:33:01] Completed 355000 out of 500000 steps  (71%)
[05:43:35] Completed 360000 out of 500000 steps  (72%)
[05:54:07] Completed 365000 out of 500000 steps  (73%)
[06:04:38] Completed 370000 out of 500000 steps  (74%)
[06:15:10] Completed 375000 out of 500000 steps  (75%)
[06:25:42] Completed 380000 out of 500000 steps  (76%)
[06:36:13] Completed 385000 out of 500000 steps  (77%)
[06:46:46] Completed 390000 out of 500000 steps  (78%)
[06:57:18] Completed 395000 out of 500000 steps  (79%)
[07:07:51] Completed 400000 out of 500000 steps  (80%)
[07:17:53] Gromacs cannot continue further.
[07:17:53] Going to send back what have done -- stepsTotalG=500000
[07:17:53] Work fraction=0.8095 steps=500000.
[07:17:57] logfile size=43805 infoLength=43805 edr=0 trr=23
[07:17:57] logfile size: 43805 info=43805 bed=0 hdr=23
[07:17:57] - Writing 44341 bytes of core data to disk...
[07:17:58]   ... Done.
[07:17:58] 
[07:17:58] Folding@home Core Shutdown: EARLY_UNIT_END
[07:18:01] CoreStatus = 72 (114)
[07:18:01] Sending work to server
[07:18:01] Project: 6021 (Run 0, Clone 135, Gen 68)


[07:18:01] + Attempting to send results [March 30 07:18:01 UTC]
[07:18:03] + Results successfully sent
[07:18:03] Thank you for your contribution to Folding@Home.
Fresh install of Win XP-SP3 on an E5200 @ 3.6GHz. It has already completed more than 50 a3's without any problem.

Re: Various Project: 601x problems

Posted: Tue Mar 30, 2010 7:56 pm
by bruce
ikerekes wrote:I don't know if this qualifies as the same problem, but:

Code:

[03:10:13] Project: 6021 (Run 0, Clone 135, Gen 68)


[07:07:51] Completed 400000 out of 500000 steps  (80%)
[07:17:53] Gromacs cannot continue further.
[07:17:53] Going to send back what have done -- stepsTotalG=500000

[07:17:58] Folding@home Core Shutdown: EARLY_UNIT_END
[07:18:01] CoreStatus = 72 (114)
[07:18:01] Sending work to server
[07:18:01] Project: 6021 (Run 0, Clone 135, Gen 68)
[07:18:01] + Attempting to send results [March 30 07:18:01 UTC]
[07:18:03] + Results successfully sent
[07:18:03] Thank you for your contribution to Folding@Home.
Possibly . . . or maybe not.
1) Your incomplete result was returned to Stanford for partial credit. Others have not been.
2) The WU was reassigned and someone else completed it successfully.

Re: Various Project: 601x problems

Posted: Thu Apr 01, 2010 3:43 am
by Brian Redoutey
I'm having similar issues as Aardvark; I posted in another thread before finding this one. Something I just noticed/remembered: I *think* I have only been able to trip this bug when I resume the work unit after a cold system boot. Going to leave it on straight until the current unit finishes and see what happens. If I remember correctly, it's happening inside of five minutes or so from firing up the client and resuming from the previous checkpoint. I'll leave it going all night and day and see if I can get it to crash out w/o a reboot or relaunch thrown in anywhere.

Re: Various Project: 601x problems

Posted: Thu Apr 01, 2010 6:22 am
by curby.net
Me too. I fold on a variety of hardware, including two MacBook Pros. While they can run StressCPU just fine for over an hour, they give up the ghost on 601x WUs within the first 1-2%. I can't always be looking at the client, so when I come back I've found a series of failed units, including EUE, UNSTABLE CLIENT, etc. By that time, qfix followed by -send all doesn't work (nothing is returned).

I realize that MBPs aren't intended as scientific computing platforms so their cooling may be marginal, but then I wonder why StressCPU doesn't report any issues. Does the folding client do things that StressCPU doesn't? Or is this indicative of issues with the code rather than the hardware?

I know this is a forum dedicated to specifics, but when enough issues are raised it may be time to look beyond the specifics and consider the forest instead of the trees. Is there an easy way to throttle down the client so it doesn't tax the system as hard? For the good of the project, it would be better to have systems folding at 80-90% rather than not at all.

Re: Various Project: 601x problems

Posted: Thu Apr 01, 2010 6:54 am
by bruce
curby.net wrote:Is there an easy way to throttle down the client so it doesn't tax the system as hard? For the good of the project, it would be better to have systems folding at 80-90% rather than not at all.
For those of you with MacOS/Linux clients, consider the third-party tool fahlimit, which is supposed to be able to reduce the CPU load caused by the Folding@home core. It was originally designed back before the SMP cores came out, so I'm not sure how well it works with the A1/A2/A3 cores. Maybe somebody else can comment on that -- or you can try it and see what happens.
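For anyone curious how this kind of throttling works: tools like fahlimit alternately stop and resume the core's process to enforce a duty cycle. The sketch below is a minimal illustration of that idea, not fahlimit's actual code; the duty cycle, period, and cycle count are placeholder values.

```python
import os
import signal
import time

def throttle(pid: int, duty: float, period: float = 1.0, cycles: int = 5) -> None:
    """Duty-cycle CPU limiting: run `pid` for duty*period seconds,
    then pause it for the rest of each period. duty=0.8 roughly
    approximates '80% CPU' for a compute-bound process.
    """
    for _ in range(cycles):
        os.kill(pid, signal.SIGCONT)      # let the core run
        time.sleep(duty * period)
        os.kill(pid, signal.SIGSTOP)      # pause it for the remainder
        time.sleep((1.0 - duty) * period)
    os.kill(pid, signal.SIGCONT)          # leave the process running
```

This is Unix-only (SIGSTOP/SIGCONT), which fits the MacOS/Linux clients being discussed.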

Project: 6015 (Run 1, Clone 94, Gen 87)

Posted: Thu Apr 01, 2010 2:30 pm
by Aardvark
Another failed a3core WU. (See Subject above)

This unit failed at <5%. The Work file was written cleanly and the failure was identified as an EUE.

Remains were returned to Stanford.

Immediately before this WU I had a P6013R0C92G102 WU that folded cleanly and was returned.

Log file for the failed WU follows:

Code:

[10:37:42] + Processing work unit
[10:37:42] Core required: FahCore_a3.exe
[10:37:42] Core found.
[10:37:42] Working on queue slot 05 [April 1 10:37:42 UTC]
[10:37:42] + Working ...
[10:37:42] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 05 -np 2 -checkpoint 15 -verbose -lifeline 1590 -version 629'

[10:37:42] 
[10:37:42] *------------------------------*
[10:37:42] Folding@Home Gromacs SMP Core
[10:37:42] Version 2.17 (Mar 7 2010)
[10:37:42] 
[10:37:42] Preparing to commence simulation
[10:37:42] - Looking at optimizations...
[10:37:42] - Created dyn
[10:37:42] - Files status OK
[10:37:42] - Expanded 1797750 -> 2392545 (decompressed 133.0 percent)
[10:37:42] Called DecompressByteArray: compressed_data_size=1797750 data_size=2392545, decompressed_data_size=2392545 diff=0
[10:37:42] - Digital signature verified
[10:37:42] 
[10:37:42] Project: 6015 (Run 1, Clone 94, Gen 87)
[10:37:42] 
[10:37:42] Assembly optimizations on if available.
[10:37:42] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=0, HOSTNAME=thread #0
NNODES=2, MYRANK=1, HOSTNAME=thread #1
Reading file work/wudata_05.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
44000004 steps,  88000.0 ps (continuing from step 43500004,  87000.0 ps).
[10:37:49] Completed 0 out of 500000 steps  (0%)
[11:00:47] Completed 5000 out of 500000 steps  (1%)
[11:22:25] Completed 10000 out of 500000 steps  (2%)
[11:44:00] Completed 15000 out of 500000 steps  (3%)
[12:05:38] Completed 20000 out of 500000 steps  (4%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
3 particles communicated to PME node 0 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[12:08:41] mdrun returned 255
[12:08:41] Going to send back what have done -- stepsTotalG=500000
[12:08:41] Work fraction=0.0414 steps=500000.
[12:08:45] logfile size=14209 infoLength=14209 edr=0 trr=25
[12:08:45] logfile size: 14209 info=14209 bed=0 hdr=25
[12:08:45] - Writing 14747 bytes of core data to disk...
[12:08:45]   ... Done.
[12:08:45] 
[12:08:45] Folding@home Core Shutdown: EARLY_UNIT_END
[12:08:45] CoreStatus = 72 (114)
[12:08:45] Sending work to server
[12:08:45] Project: 6015 (Run 1, Clone 94, Gen 87)


[12:08:45] + Attempting to send results [April 1 12:08:45 UTC]
[12:08:45] - Reading file work/wuresults_05.dat from core
[12:08:45]   (Read 14747 bytes from disk)
[12:08:46] > Press "c" to connect to the server to upload results

Ciao....

Re: Various Project: 601x problems

Posted: Thu Apr 01, 2010 3:37 pm
by curby.net
Thanks Bruce, fahlimit works to reduce the system load and keep temps down, but it doesn't solve the problems entirely. I'm now running with an 80% duty cycle, and my first WU just had an unstable-machine exit after about 4 minutes on 6015/R0/C117/G102. The logs indicate that it sent something back to Stanford. It's now working on 6015/R0/C187/G80. If I can't finish WUs consistently I'll try a 60% duty cycle, but I really don't think cooling is an issue even at 80%: the temps now are lower than when I'm folding _a1/_a2 WUs, which invariably finish successfully, and much lower than when I'm running StressCPU, which also runs without error.

Here's a wrinkle. The problems started using the March 9, 2010 core (v 2.17). Before that, I'd been using v2.13 without issue, completing and returning 4 out of 4 WUs. After downloading v2.17 I only ever successfully returned one WU (and failed on over ten).

The 187/80 WU mentioned above failed after 20 minutes. It logged the same "3 particles communicated to PME node 0" error to console (but not FAHlog.txt) that others have mentioned. Is it odd that my client reports that as "unstable machine" but others are getting "EUE" for the same error? Is there any way to get 2.13 back? It worked just fine for me. =)

Another wrinkle: The two machines that are failing are running Snow Leopard and are on newer hardware. I've got another machine folding _a3 WUs in Leopard just fine (with the v2.17 core). Summary:

(system_profiler SPHardwareDataType)  (OSX)    (FahCore_a3 version)  (success)
MacBookPro3,1                         10.5.8   2.17                  100% successful
MacBookPro5,1                         10.6.2   2.17                  all fail
MacBookPro5,3                         10.6.2   2.13                  100% successful
MacBookPro5,3                         10.6.2   2.17                  one finished, others all fail in under 5% completion

Re: Various Project: 601x problems

Posted: Thu Apr 01, 2010 4:22 pm
by curby.net
I'm still getting errors at 60% fahlimit and relatively frigid CPU temps. This really doesn't seem like simple overheating issues anymore. I'm giving up and turning advmethods off for these clients for now. Back to the wild west of _a1 folding. =P

(I edited my previous post a few times. Please review for possible clues.)

Project: 6012 (Run 0, Clone 325, Gen 99)

Posted: Thu Apr 01, 2010 6:18 pm
by Aardvark
Another failed WU, for the list. This one failed at <6%, and the Client assigned UNSTABLE_MACHINE as the failure category.

I too do not understand how the Client is making the choice between EARLY_UNIT_END and UNSTABLE_MACHINE. The situations seem to be identical when the failures occur.
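As a data point, whatever the selection logic is, the CoreStatus codes paired with each shutdown reason have been consistent across the logs posted in this thread. Collected here as observed pairings only, not an official table:

```python
# CoreStatus values observed in this thread, paired with the shutdown
# reason the client logged alongside them.
OBSERVED_CORESTATUS = {
    0x72: "EARLY_UNIT_END",    # 114 -- partial result written and sent
    0x7A: "UNSTABLE_MACHINE",  # 122 -- partial result written and sent
    0xC0000005: "Client-core communications error",  # access violation; WU deleted
}

print(OBSERVED_CORESTATUS[0x7A])  # UNSTABLE_MACHINE
```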

Log file for failed WU follows:

Code:

[14:25:28] + Processing work unit
[14:25:28] Core required: FahCore_a3.exe
[14:25:28] Core found.
[14:25:28] Working on queue slot 06 [April 1 14:25:28 UTC]
[14:25:28] + Working ...
[14:25:28] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 06 -np 2 -checkpoint 15 -verbose -lifeline 1590 -version 629'

[14:25:28] 
[14:25:28] *------------------------------*
[14:25:28] Folding@Home Gromacs SMP Core
[14:25:28] Version 2.17 (Mar 7 2010)
[14:25:28] 
[14:25:28] Preparing to commence simulation
[14:25:28] - Looking at optimizations...
[14:25:28] - Created dyn
[14:25:28] - Files status OK
[14:25:28] - Expanded 1796874 -> 2078149 (decompressed 115.6 percent)
[14:25:28] Called DecompressByteArray: compressed_data_size=1796874 data_size=2078149, decompressed_data_size=2078149 diff=0
[14:25:28] - Digital signature verified
[14:25:28] 
[14:25:28] Project: 6012 (Run 0, Clone 325, Gen 99)
[14:25:28] 
[14:25:28] Assembly optimizations on if available.
[14:25:28] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_06.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
50000004 steps, 100000.0 ps (continuing from step 49500004,  99000.0 ps).
[14:25:35] Completed 0 out of 500000 steps  (0%)
[14:48:45] Completed 5000 out of 500000 steps  (1%)
[15:10:02] Completed 10000 out of 500000 steps  (2%)
[15:31:21] Completed 15000 out of 500000 steps  (3%)
[15:52:40] Completed 20000 out of 500000 steps  (4%)
[16:14:02] Completed 25000 out of 500000 steps  (5%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
3 particles communicated to PME node 0 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[16:28:46] mdrun returned 255
[16:28:46] Going to send back what have done -- stepsTotalG=500000
[16:28:46] Work fraction=0.0569 steps=500000.
[16:28:50] logfile size=14303 infoLength=14303 edr=0 trr=25
[16:28:50] logfile size: 14303 info=14303 bed=0 hdr=25
[16:28:50] - Writing 14841 bytes of core data to disk...
[16:28:50]   ... Done.
[16:28:50] 
[16:28:50] Folding@home Core Shutdown: UNSTABLE_MACHINE
[16:28:50] CoreStatus = 7A (122)
[16:28:50] Sending work to server
[16:28:50] Project: 6012 (Run 0, Clone 325, Gen 99)


[16:28:50] + Attempting to send results [April 1 16:28:50 UTC]
[16:28:50] - Reading file work/wuresults_06.dat from core
[16:28:50]   (Read 14841 bytes from disk)
[16:28:51] > Press "c" to connect to the server to upload results