Various Project: 601x problems

Moderators: Site Moderators, FAHC Science Team

Aardvark
Posts: 143
Joined: Sat Jul 12, 2008 4:22 pm
Location: Team MacResource

Project: 6012 (Run 0, Clone 390, Gen 99)

Post by Aardvark »

Another failed WU. Did not even make it to 1%.

Work file is again not adequate for return to Stanford.

This is starting to get a little stale....

Log file follows:

Code:

[16:33:30] + Connections closed: You may now disconnect
[16:33:35] 
[16:33:35] + Processing work unit
[16:33:35] Core required: FahCore_a3.exe
[16:33:35] Core found.
[16:33:35] Working on queue slot 01 [March 30 16:33:35 UTC]
[16:33:35] + Working ...
[16:33:35] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 2 -checkpoint 15 -verbose -lifeline 7949 -version 629'

[16:33:35] 
[16:33:35] *------------------------------*
[16:33:35] Folding@Home Gromacs SMP Core
[16:33:35] Version 2.17 (Mar 7 2010)
[16:33:35] 
[16:33:35] Preparing to commence simulation
[16:33:35] - Ensuring status. Please wait.
[16:33:45] - Looking at optimizations...
[16:33:45] - Working with standard loops on this execution.
[16:33:45] - Created dyn
[16:33:45] - Files status OK
[16:33:45] - Expanded 1796995 -> 2078149 (decompressed 115.6 percent)
[16:33:45] Called DecompressByteArray: compressed_data_size=1796995 data_size=2078149, decompressed_data_size=2078149 diff=0
[16:33:45] - Digital signature verified
[16:33:45] 
[16:33:45] Project: 6012 (Run 0, Clone 390, Gen 99)
[16:33:45] 
[16:33:45] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=0, HOSTNAME=thread #0
NNODES=2, MYRANK=1, HOSTNAME=thread #1
Reading file work/wudata_01.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
50000004 steps, 100000.0 ps (continuing from step 49500004,  99000.0 ps).
[16:33:52] Completed 0 out of 500000 steps  (0%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
8 particles communicated to PME node 1 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[16:35:04] mdrun returned 255
[16:35:04] Going to send back what have done -- stepsTotalG=500000
[16:35:04] Work fraction=0.0005 steps=500000.
[16:35:05] CoreStatus = 0 (0)
[16:35:05] Sending work to server
[16:35:05] Project: 6012 (Run 0, Clone 390, Gen 99)
[16:35:05] - Error: Could not get length of results file work/wuresults_01.dat
[16:35:05] - Error: Could not read unit 01 file. Removing from queue.
[16:35:05] Trying to send all finished work units
[16:35:05] + No unsent completed units remaining.
[16:35:05] - Preparing to get new work unit...
[16:35:06] > Press "c" to connect to the server to download unit

What is past is prologue!
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Various Project: 601x problems

Post by bruce »

Your client is misbehaving and I'm sure that's frustrating, but that doesn't mean you need to abandon FAH. For reliable performance, I suggest you switch to N classic clients (one per physical core) until the issue can be resolved.

As a reminder, please re-read what it says about beta clients on the download page.
AlanH
Posts: 57
Joined: Mon Dec 03, 2007 9:54 pm

Re: Various Project: 601x problems

Post by AlanH »

bruce wrote:As a matter of fact, the main developer is working on the problem. Apparently you didn't see his response to your previous report.
viewtopic.php?f=19&t=13980&p=137199#p137199

Two topics on the same subject merged.
Sure, I saw it.
Kasson indicated that the log info was useful, and that's why I reposted my data here.
Folding for TeamCFC
- Mac Pro Dual 2.66GHz Xeon, 4 GBytes running Mac SMP2 client
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4, 4x512MB RAM, GTX 460 video card, Windows 7 X32

I am currently folding just on the 5x GTX 460's for approx. 70K PPD
Location: Salem, OR USA

Re: Various Project: 601x problems

Post by P5-133XL »

bruce wrote:Your client is misbehaving and I'm sure that's frustrating, but that doesn't mean you need to abandon FAH. For reliable performance, I suggest you switch to N classic clients (one per physical core) until the issue can be resolved.

As a reminder, please re-read what it says about beta clients on the download page.
Or even just drop the -advmethods flag and switch from A3's to A1/A2's. You'll get more points than n uniprocessor clients, but not as many as with A3's.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Various Project: 601x problems

Post by bruce »

Just so you know you're not alone . . . .

Code:

[16:51:24] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 02 -np 4 -nocpulock -checkpoint 15 -forceasm -verbose -lifeline 1004 -version 629'

[16:51:24] 
[16:51:24] *------------------------------*
[16:51:24] Folding@Home Gromacs SMP Core
[16:51:24] Version 2.17 (Mar 12, 2010)
[16:51:24] 
[16:51:24] Preparing to commence simulation
[16:51:24] - Assembly optimizations manually forced on.
[16:51:24] - Not checking prior termination.
[16:51:25] - Expanded 1799234 -> 2396877 (decompressed 133.2 percent)
[16:51:25] Called DecompressByteArray: compressed_data_size=1799234 data_size=2396877, decompressed_data_size=2396877 diff=0
[16:51:25] - Digital signature verified
[16:51:25] 
[16:51:25] Project: 6014 (Run 0, Clone 29, Gen 121)
[16:51:25] 
[16:51:25] Assembly optimizations on if available.
[16:51:25] Entering M.D.
[16:51:32] Completed 0 out of 500000 steps  (0%)
[17:18:48] Completed 5000 out of 500000 steps  (1%)
[17:20:30] - Autosending finished units... [March 29 17:20:30 UTC]
[17:20:30] Trying to send all finished work units
[17:20:30] + No unsent completed units remaining.
[17:20:30] - Autosend completed
[17:45:40] Completed 10000 out of 500000 steps  (2%)
[18:12:11] Completed 15000 out of 500000 steps  (3%)
[18:38:43] Completed 20000 out of 500000 steps  (4%)
[19:05:15] Completed 25000 out of 500000 steps  (5%)
[19:32:02] Completed 30000 out of 500000 steps  (6%)
[19:58:48] Completed 35000 out of 500000 steps  (7%)
[20:25:53] Completed 40000 out of 500000 steps  (8%)
[20:52:50] Completed 45000 out of 500000 steps  (9%)
[21:19:45] Completed 50000 out of 500000 steps  (10%)
[21:46:44] Completed 55000 out of 500000 steps  (11%)
[22:13:42] Completed 60000 out of 500000 steps  (12%)
[22:40:34] Completed 65000 out of 500000 steps  (13%)
[23:07:42] Completed 70000 out of 500000 steps  (14%)
[23:20:29] - Autosending finished units... [March 29 23:20:29 UTC]
[23:20:29] Trying to send all finished work units
[23:20:29] + No unsent completed units remaining.
[23:20:29] - Autosend completed
[23:34:49] Completed 75000 out of 500000 steps  (15%)
[00:01:54] Completed 80000 out of 500000 steps  (16%)
[00:29:18] Completed 85000 out of 500000 steps  (17%)
[00:49:57] Gromacs cannot continue further.
[00:49:57] Going to send back what have done -- stepsTotalG=500000
[00:49:57] Work fraction=-1.#IND steps=500000.
--------- Vista popup on the screen demanded attention and work suspended until I responded ----14+ hrs later -----
[05:20:27] - Autosending finished units... [March 30 05:20:27 UTC]
[05:20:27] Trying to send all finished work units
[05:20:27] + No unsent completed units remaining.
[05:20:27] - Autosend completed
[11:20:26] - Autosending finished units... [March 30 11:20:26 UTC]
[11:20:26] Trying to send all finished work units
[11:20:26] + No unsent completed units remaining.
[11:20:26] - Autosend completed
[15:31:15] CoreStatus = C0000005 (-1073741819)
[15:31:15] Client-core communications error: ERROR 0xc0000005
[15:31:15] Deleting current work unit & continuing...
[15:31:29] Trying to send all finished work units
[15:31:29] + No unsent completed units remaining.
[15:31:29] - Preparing to get new work unit...
P5-133XL wrote:Or even just drop the -advmethods flag and switch from A3's to A1/A2's. You'll get more points than n uniprocessor clients but not as many as A3's
Great idea. It's always a hassle to get all four clients to shut down more or less simultaneously (i.e., the same day) using -oneunit and then restart SMP.
Aardvark
Posts: 143
Joined: Sat Jul 12, 2008 4:22 pm
Location: Team MacResource

Project: 6014 (Run 1, Clone 36, Gen 95)

Post by Aardvark »

I tried to follow P5-133XL's suggestion and did not specify -advmethods. However, I received another a3-core WU (see Subject above). I have to assume that the -smp argument is now "hard-wired" into a3 WUs. Is that the way things are supposed to be?

Anyway, it failed at <1%.

Log File follows:

Code:

[18:58:37] + Processing work unit
[18:58:37] Core required: FahCore_a3.exe
[18:58:37] Core found.
[18:58:37] Working on queue slot 02 [March 30 18:58:37 UTC]
[18:58:37] + Working ...
[18:58:37] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 02 -np 2 -checkpoint 15 -verbose -lifeline 695 -version 629'

[18:58:37] 
[18:58:37] *------------------------------*
[18:58:37] Folding@Home Gromacs SMP Core
[18:58:37] Version 2.17 (Mar 7 2010)
[18:58:37] 
[18:58:37] Preparing to commence simulation
[18:58:37] - Ensuring status. Please wait.
[18:58:46] - Looking at optimizations...
[18:58:46] - Working with standard loops on this execution.
[18:58:46] - Created dyn
[18:58:46] - Files status OK
[18:58:47] - Expanded 1798615 -> 2396877 (decompressed 133.2 percent)
[18:58:47] Called DecompressByteArray: compressed_data_size=1798615 data_size=2396877, decompressed_data_size=2396877 diff=0
[18:58:47] - Digital signature verified
[18:58:47] 
[18:58:47] Project: 6014 (Run 1, Clone 36, Gen 95)
[18:58:47] 
[18:58:47] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=0, HOSTNAME=thread #0
NNODES=2, MYRANK=1, HOSTNAME=thread #1
Reading file work/wudata_02.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
48000004 steps,  96000.0 ps (continuing from step 47500004,  95000.0 ps).
[18:58:54] Completed 0 out of 500000 steps  (0%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
3 particles communicated to PME node 1 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[18:59:45] mdrun returned 255
[18:59:45] Going to send back what have done -- stepsTotalG=500000
[18:59:45] Work fraction=0.0004 steps=500000.
[18:59:49] logfile size=11879 infoLength=11879 edr=0 trr=25
[18:59:49] logfile size: 11879 info=11879 bed=0 hdr=25
[18:59:49] - Writing 12417 bytes of core data to disk...
[18:59:49]   ... Done.
[18:59:50] 
[18:59:50] Folding@home Core Shutdown: UNSTABLE_MACHINE
[18:59:50] CoreStatus = 7A (122)
[18:59:50] Sending work to server
[18:59:50] Project: 6014 (Run 1, Clone 36, Gen 95)


[18:59:50] + Attempting to send results [March 30 18:59:50 UTC]
[18:59:50] - Reading file work/wuresults_02.dat from core
[18:59:50]   (Read 12417 bytes from disk)
[18:59:51] > Press "c" to connect to the server to upload results
At least it wrote a good Work file and the info should now have made it to Stanford.....

I think I am going to take a break from Folding for a while. Will be checking for that "action plan".
Last edited by Aardvark on Tue Mar 30, 2010 7:29 pm, edited 1 time in total.
What is past is prologue!
ikerekes
Posts: 94
Joined: Thu Nov 13, 2008 4:18 pm
Hardware configuration: q6600 @ 3.3Ghz windows xp-sp3 one SMP2 (2.15 core) + 1 9800GT native GPU2
Athlon x2 6000+ @ 3.0Ghz ubuntu 8.04 smp + asus 9600GSO gpu2 in wine wrapper
5600X2 @ 3.19Ghz ubuntu 8.04 smp + asus 9600GSO gpu2 in wine wrapper
E5200 @ 3.7Ghz ubuntu 8.04 smp2 + asus 9600GT silent gpu2 in wine wrapper
E5200 @ 3.65Ghz ubuntu 8.04 smp2 + asus 9600GSO gpu2 in wine wrapper
E6550 vmware ubuntu 8.4.1
q8400 @ 3.3Ghz windows xp-sp3 one SMP2 (2.15 core) + 1 9800GT native GPU2
Athlon II 620 @ 2.6 Ghz windows xp-sp3 one SMP2 (2.15 core) + 1 9800GT native GPU2
Location: Calgary, Canada

Re: Various Project: 601x problems

Post by ikerekes »

I don't know if this qualifies as the same problem, but:

Code:

[03:10:03] + Processing work unit
[03:10:03] Core required: FahCore_a3.exe
[03:10:03] Core found.
[03:10:03] Working on queue slot 02 [March 30 03:10:03 UTC]
[03:10:03] + Working ...
[03:10:03] 
[03:10:03] *------------------------------*
[03:10:03] Folding@Home Gromacs SMP Core
[03:10:03] Version 2.17 (Mar 12, 2010)
[03:10:03] 
[03:10:03] Preparing to commence simulation
[03:10:03] - Ensuring status. Please wait.
[03:10:13] - Looking at optimizations...
[03:10:13] - Working with standard loops on this execution.
[03:10:13] - Previous termination of core was improper.
[03:10:13] - Going to use standard loops.
[03:10:13] - Files status OK
[03:10:13] - Expanded 1772217 -> 1975105 (decompressed 111.4 percent)
[03:10:13] Called DecompressByteArray: compressed_data_size=1772217 data_size=1975105, decompressed_data_size=1975105 diff=0
[03:10:13] - Digital signature verified
[03:10:13] 
[03:10:13] Project: 6021 (Run 0, Clone 135, Gen 68)
[03:10:13] 
[03:10:13] Entering M.D.
[03:10:19] Using Gromacs checkpoints
[03:10:20] Resuming from checkpoint
[03:10:21] Verified work/wudata_02.log
[03:10:22] Verified work/wudata_02.trr
[03:10:22] Verified work/wudata_02.edr
[03:10:22] Completed 287296 out of 500000 steps  (57%)
[03:16:05] Completed 290000 out of 500000 steps  (58%)
[03:26:33] Completed 295000 out of 500000 steps  (59%)
[03:37:03] Completed 300000 out of 500000 steps  (60%)
[03:47:36] Completed 305000 out of 500000 steps  (61%)
[03:58:08] Completed 310000 out of 500000 steps  (62%)
[04:08:42] Completed 315000 out of 500000 steps  (63%)
[04:19:14] Completed 320000 out of 500000 steps  (64%)
[04:29:46] Completed 325000 out of 500000 steps  (65%)
[04:40:18] Completed 330000 out of 500000 steps  (66%)
[04:50:50] Completed 335000 out of 500000 steps  (67%)
[05:01:22] Completed 340000 out of 500000 steps  (68%)
[05:11:55] Completed 345000 out of 500000 steps  (69%)
[05:22:28] Completed 350000 out of 500000 steps  (70%)
[05:33:01] Completed 355000 out of 500000 steps  (71%)
[05:43:35] Completed 360000 out of 500000 steps  (72%)
[05:54:07] Completed 365000 out of 500000 steps  (73%)
[06:04:38] Completed 370000 out of 500000 steps  (74%)
[06:15:10] Completed 375000 out of 500000 steps  (75%)
[06:25:42] Completed 380000 out of 500000 steps  (76%)
[06:36:13] Completed 385000 out of 500000 steps  (77%)
[06:46:46] Completed 390000 out of 500000 steps  (78%)
[06:57:18] Completed 395000 out of 500000 steps  (79%)
[07:07:51] Completed 400000 out of 500000 steps  (80%)
[07:17:53] Gromacs cannot continue further.
[07:17:53] Going to send back what have done -- stepsTotalG=500000
[07:17:53] Work fraction=0.8095 steps=500000.
[07:17:57] logfile size=43805 infoLength=43805 edr=0 trr=23
[07:17:57] logfile size: 43805 info=43805 bed=0 hdr=23
[07:17:57] - Writing 44341 bytes of core data to disk...
[07:17:58]   ... Done.
[07:17:58] 
[07:17:58] Folding@home Core Shutdown: EARLY_UNIT_END
[07:18:01] CoreStatus = 72 (114)
[07:18:01] Sending work to server
[07:18:01] Project: 6021 (Run 0, Clone 135, Gen 68)


[07:18:01] + Attempting to send results [March 30 07:18:01 UTC]
[07:18:03] + Results successfully sent
[07:18:03] Thank you for your contribution to Folding@Home.
Fresh install of Win XP-SP3 on an E5200 @ 3.6GHz. It has already completed more than 50 a3's without any problem.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Various Project: 601x problems

Post by bruce »

ikerekes wrote:I don't know if this qualifies as the same problem, but:

Code:

[03:10:13] Project: 6021 (Run 0, Clone 135, Gen 68)


[07:07:51] Completed 400000 out of 500000 steps  (80%)
[07:17:53] Gromacs cannot continue further.
[07:17:53] Going to send back what have done -- stepsTotalG=500000

[07:17:58] Folding@home Core Shutdown: EARLY_UNIT_END
[07:18:01] CoreStatus = 72 (114)
[07:18:01] Sending work to server
[07:18:01] Project: 6021 (Run 0, Clone 135, Gen 68)
[07:18:01] + Attempting to send results [March 30 07:18:01 UTC]
[07:18:03] + Results successfully sent
[07:18:03] Thank you for your contribution to Folding@Home.
Possibly . . . or maybe not.
1) Your incomplete result was returned to Stanford for partial credit. Others have not been.
2) The WU was reassigned and someone else completed it successfully.
Brian Redoutey
Posts: 12
Joined: Fri Dec 14, 2007 5:09 pm
Hardware configuration: Dual Processor 2Ghz G5 Rev. A (video production system)
AthlonXP 2800 Asus A7n8X-X (3D modeling system)
only a poor artist blames their tools.
Location: Michigan

Re: Various Project: 601x problems

Post by Brian Redoutey »

I'm having similar issues to Aardvark; I posted in another thread before finding this one. Something I just noticed/remembered: I *think* I have only been able to trip this bug when I resume the work unit after a cold system boot. Going to leave it on straight until the current unit finishes and see what happens. If I remember correctly, it happens within five minutes or so of firing up the client and resuming from the previous checkpoint. I'll leave it going all night and day and see if I can get it to crash out w/o a reboot or relaunch thrown in anywhere.
I forgot what I had in here last time.
curby.net
Posts: 4
Joined: Tue Apr 01, 2008 2:51 pm
Location: Team Mac OS X

Re: Various Project: 601x problems

Post by curby.net »

Me too. I fold on a variety of hardware, including two MacBook Pros. While they can run StressCPU just fine for over an hour, they give up the ghost on 601x WUs within the first 1-2%. I can't always be watching the client, so when I come back I find a series of failed units, including EUE, UNSTABLE_MACHINE, etc. By that time, qfix followed by -send all doesn't work (nothing is returned).

I realize that MBPs aren't intended as scientific computing platforms so their cooling may be marginal, but then I wonder why StressCPU doesn't report any issues. Does the folding client do things that StressCPU doesn't? Or is this indicative of issues with the code rather than the hardware?

I know this is a forum dedicated to specifics, but when enough issues are raised it may be time to look beyond the specifics and consider the forest instead of the trees. Is there an easy way to throttle down the client so it doesn't tax the system as hard? For the good of the project, it would be better to have systems folding at 80-90% rather than not at all.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Various Project: 601x problems

Post by bruce »

curby.net wrote:Is there an easy way to throttle down the client so it doesn't tax the system as hard? For the good of the project, it would be better to have systems folding at 80-90% rather than not at all.
For those of you with MacOS/Linux clients, consider the third-party tool fahlimit, which is supposed to be able to reduce the CPU load caused by the Folding@home core. It was originally designed before the SMP cores came out, so I'm not sure how well it works with the A1/A2/A3 cores. Maybe somebody else can comment on that, or you can try it and see what happens.
Aardvark
Posts: 143
Joined: Sat Jul 12, 2008 4:22 pm
Location: Team MacResource

Project: 6015 (Run 1, Clone 94, Gen 87)

Post by Aardvark »

Another failed a3core WU. (See Subject above)

This unit failed at <5%. Work file was written cleanly and effort was identified as an EUE.

Remains were returned to Stanford.

Immediately before this WU I had a P6013R0C92G102 WU that folded cleanly and was returned.

Log file for the failed WU follows:

Code:

[10:37:42] + Processing work unit
[10:37:42] Core required: FahCore_a3.exe
[10:37:42] Core found.
[10:37:42] Working on queue slot 05 [April 1 10:37:42 UTC]
[10:37:42] + Working ...
[10:37:42] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 05 -np 2 -checkpoint 15 -verbose -lifeline 1590 -version 629'

[10:37:42] 
[10:37:42] *------------------------------*
[10:37:42] Folding@Home Gromacs SMP Core
[10:37:42] Version 2.17 (Mar 7 2010)
[10:37:42] 
[10:37:42] Preparing to commence simulation
[10:37:42] - Looking at optimizations...
[10:37:42] - Created dyn
[10:37:42] - Files status OK
[10:37:42] - Expanded 1797750 -> 2392545 (decompressed 133.0 percent)
[10:37:42] Called DecompressByteArray: compressed_data_size=1797750 data_size=2392545, decompressed_data_size=2392545 diff=0
[10:37:42] - Digital signature verified
[10:37:42] 
[10:37:42] Project: 6015 (Run 1, Clone 94, Gen 87)
[10:37:42] 
[10:37:42] Assembly optimizations on if available.
[10:37:42] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=0, HOSTNAME=thread #0
NNODES=2, MYRANK=1, HOSTNAME=thread #1
Reading file work/wudata_05.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
44000004 steps,  88000.0 ps (continuing from step 43500004,  87000.0 ps).
[10:37:49] Completed 0 out of 500000 steps  (0%)
[11:00:47] Completed 5000 out of 500000 steps  (1%)
[11:22:25] Completed 10000 out of 500000 steps  (2%)
[11:44:00] Completed 15000 out of 500000 steps  (3%)
[12:05:38] Completed 20000 out of 500000 steps  (4%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
3 particles communicated to PME node 0 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[12:08:41] mdrun returned 255
[12:08:41] Going to send back what have done -- stepsTotalG=500000
[12:08:41] Work fraction=0.0414 steps=500000.
[12:08:45] logfile size=14209 infoLength=14209 edr=0 trr=25
[12:08:45] logfile size: 14209 info=14209 bed=0 hdr=25
[12:08:45] - Writing 14747 bytes of core data to disk...
[12:08:45]   ... Done.
[12:08:45] 
[12:08:45] Folding@home Core Shutdown: EARLY_UNIT_END
[12:08:45] CoreStatus = 72 (114)
[12:08:45] Sending work to server
[12:08:45] Project: 6015 (Run 1, Clone 94, Gen 87)


[12:08:45] + Attempting to send results [April 1 12:08:45 UTC]
[12:08:45] - Reading file work/wuresults_05.dat from core
[12:08:45]   (Read 14747 bytes from disk)
[12:08:46] > Press "c" to connect to the server to upload results

Ciao....
What is past is prologue!
curby.net
Posts: 4
Joined: Tue Apr 01, 2008 2:51 pm
Location: Team Mac OS X

Re: Various Project: 601x problems

Post by curby.net »

Thanks Bruce, fahlimit works to reduce the system load and keep temps down, but it doesn't solve the problems entirely. I'm now running with an 80% duty cycle and my first WU just had an unstable machine exit after about 4 minutes on 6015/R0/C117/G102. The logs indicate that it sent something back to Stanford. It's now working on 6015/R0/C187/G80. If I can't finish WUs consistently I'll try using a 60% duty cycle, but I really don't think that cooling is an issue even at 80%; the temps now are lower than when I'm folding _a1/_a2 WUs, which invariably finish successfully, and much lower than when I'm running StressCPU, which also runs without error.

Here's a wrinkle. The problems started using the March 9, 2010 core (v 2.17). Before that, I'd been using v2.13 without issue, completing and returning 4 out of 4 WUs. After downloading v2.17 I only ever successfully returned one WU (and failed on over ten).

The 187/80 WU mentioned above failed after 20 minutes. It logged the same "3 particles communicated to PME node 0" error to console (but not FAHlog.txt) that others have mentioned. Is it odd that my client reports that as "unstable machine" but others are getting "EUE" for the same error? Is there any way to get 2.13 back? It worked just fine for me. =)

Another wrinkle: The two machines that are failing are running Snow Leopard and are on newer hardware. I've got another machine folding _a3 WUs in Leopard just fine (with the v2.17 core). Summary:

Model (system_profiler SPHardwareDataType)   OS X     FahCore_a3   Result
MacBookPro3,1                                10.5.8   2.17         100% successful
MacBookPro5,1                                10.6.2   2.17         all fail
MacBookPro5,3                                10.6.2   2.13         100% successful
MacBookPro5,3                                10.6.2   2.17         one finished, others all fail under 5% completion
curby.net
Posts: 4
Joined: Tue Apr 01, 2008 2:51 pm
Location: Team Mac OS X

Re: Various Project: 601x problems

Post by curby.net »

I'm still getting errors at 60% fahlimit and relatively frigid CPU temps. This really doesn't seem like a simple overheating issue anymore. I'm giving up and turning advmethods off for these clients for now. Back to the wild west of _a1 folding. =P

(I edited my previous post a few times. Please review for possible clues.)
Aardvark
Posts: 143
Joined: Sat Jul 12, 2008 4:22 pm
Location: Team MacResource

Project: 6012 (Run 0, Clone 325, Gen 99)

Post by Aardvark »

Another failed WU, for the list. This one failed at <6%, and the Client assigned UNSTABLE_MACHINE as the failure category.

I too do not understand how the Client is making the choice between EARLY_UNIT_END and UNSTABLE_MACHINE. The situations seem to be identical when the failures occur.

Log file for failed WU follows:

Code:

[14:25:28] + Processing work unit
[14:25:28] Core required: FahCore_a3.exe
[14:25:28] Core found.
[14:25:28] Working on queue slot 06 [April 1 14:25:28 UTC]
[14:25:28] + Working ...
[14:25:28] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 06 -np 2 -checkpoint 15 -verbose -lifeline 1590 -version 629'

[14:25:28] 
[14:25:28] *------------------------------*
[14:25:28] Folding@Home Gromacs SMP Core
[14:25:28] Version 2.17 (Mar 7 2010)
[14:25:28] 
[14:25:28] Preparing to commence simulation
[14:25:28] - Looking at optimizations...
[14:25:28] - Created dyn
[14:25:28] - Files status OK
[14:25:28] - Expanded 1796874 -> 2078149 (decompressed 115.6 percent)
[14:25:28] Called DecompressByteArray: compressed_data_size=1796874 data_size=2078149, decompressed_data_size=2078149 diff=0
[14:25:28] - Digital signature verified
[14:25:28] 
[14:25:28] Project: 6012 (Run 0, Clone 325, Gen 99)
[14:25:28] 
[14:25:28] Assembly optimizations on if available.
[14:25:28] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=1, HOSTNAME=thread #1
NNODES=2, MYRANK=0, HOSTNAME=thread #0
Reading file work/wudata_06.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
50000004 steps, 100000.0 ps (continuing from step 49500004,  99000.0 ps).
[14:25:35] Completed 0 out of 500000 steps  (0%)
[14:48:45] Completed 5000 out of 500000 steps  (1%)
[15:10:02] Completed 10000 out of 500000 steps  (2%)
[15:31:21] Completed 15000 out of 500000 steps  (3%)
[15:52:40] Completed 20000 out of 500000 steps  (4%)
[16:14:02] Completed 25000 out of 500000 steps  (5%)

-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563

Fatal error:
3 particles communicated to PME node 0 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

[16:28:46] mdrun returned 255
[16:28:46] Going to send back what have done -- stepsTotalG=500000
[16:28:46] Work fraction=0.0569 steps=500000.
[16:28:50] logfile size=14303 infoLength=14303 edr=0 trr=25
[16:28:50] logfile size: 14303 info=14303 bed=0 hdr=25
[16:28:50] - Writing 14841 bytes of core data to disk...
[16:28:50]   ... Done.
[16:28:50] 
[16:28:50] Folding@home Core Shutdown: UNSTABLE_MACHINE
[16:28:50] CoreStatus = 7A (122)
[16:28:50] Sending work to server
[16:28:50] Project: 6012 (Run 0, Clone 325, Gen 99)


[16:28:50] + Attempting to send results [April 1 16:28:50 UTC]
[16:28:50] - Reading file work/wuresults_06.dat from core
[16:28:50]   (Read 14841 bytes from disk)
[16:28:51] > Press "c" to connect to the server to upload results
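For what it's worth, a quick way to count how often each shutdown reason shows up across a FAHlog.txt is a one-liner like this (the filename and the "Core Shutdown:" pattern are just assumptions based on the log excerpts in this thread; adjust the path to your client directory):

```shell
# Tally Folding@home core shutdown reasons (EARLY_UNIT_END, UNSTABLE_MACHINE, ...)
# by extracting the status word from each "Core Shutdown:" line and counting.
grep -o 'Core Shutdown: [A-Z_]*' FAHlog.txt | sort | uniq -c | sort -rn
```

That at least makes it easy to see whether one failure category dominates before reporting.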
What is past is prologue!