
Re: List of SMP WUs with the "1 core usage" issue

Posted: Tue Sep 08, 2009 12:43 pm
by martiou
I received several WUs which crashed immediately after startup ("CoreStatus = FF (255)").

With core 2.07 :
- 20/08 :
.Project: 2669 (Run 9, Clone 20, Gen 198) - compressed_data_size=1518487
- 26/08 :
.Project: 2677 (Run 11, Clone 88, Gen 31) - compressed_data_size=1490678
- 02/09 :
.Project: 2677 (Run 3, Clone 20, Gen 41) - compressed_data_size=1503302

With core 2.10 :
- 06/09 :
.Project: 2677 (Run 30, Clone 77, Gen 33) - compressed_data_size=1495164
- today :
.Project: 2677 (Run 12, Clone 88, Gen 38) - compressed_data_size=1508891
.Project: 2677 (Run 24, Clone 57, Gen 32) - compressed_data_size=1498655

Bad Units, lots of them ...

Posted: Tue Sep 08, 2009 1:36 pm
by SpockLogic
Over the last few days I've had the following bad units (Project, Run, Clone, Gen):

2669, 11, 148, 181
2671, 10, 3, 86
2671, 12, 40, 89
2671, 32, 41, 88
2671, 37, 79, 78
2671, 49, 98, 84
2671, 50, 97, 91
2671, 52, 43, 82
2677, 1, 30, 31
2677, 30, 77, 33

Each has had the following in the FAHlog:

Code:

[23:18:40] 
[23:18:40] *------------------------------*
[23:18:40] Folding@Home Gromacs SMP Core
[23:18:40] Version 2.11 (Fri Sep 4 09:50:46 PDT 2009)
[23:18:40] 
[23:18:40] Preparing to commence simulation
[23:18:40] - Assembly optimizations manually forced on.
[23:18:40] - Not checking prior termination.
[23:18:41] - Expanded 1495164 -> 24042365 (decompressed 1608.0 percent)
[23:18:41] Called DecompressByteArray: compressed_data_size=1495164 data_size=24042365, decompressed_data_size=24042365 diff=0
[23:18:42] - Digital signature verified
[23:18:42] 
[23:18:42] Project: 2677 (Run 30, Clone 77, Gen 33)
[23:18:42] 
[23:18:42] Assembly optimizations on if available.
[23:18:42] Entering M.D.
[23:19:32] Completed 0 out of 250001 steps  (0%)
[23:19:37] CoreStatus = FF (255)
[23:19:37] Client-core communications error: ERROR 0xff
[23:19:37] This is a sign of more serious problems, shutting down.
Some units have been issued more than once, so is there any chance that we can get them deleted from the queue?

[rant]This is taking a lot more babysitting than before and I'm finding it tedious.[/rant]

MacPro 2 x 2.26 Quad Core, OS 10.5.8

Re: List of SMP WUs with the "1 core usage" issue

Posted: Tue Sep 08, 2009 2:54 pm
by shdbcamping
Ditto :(
After reading this first, I see no use in looking up the specific WU info :(

Just curious, as I was out of folding for the holiday weekend: has any progress been made on the crux of the problem?

An update would be helpful :wink: None of mine will make it either when a WU is over 1 hour per % :( These slow WU's are a lot of wasted electricity and hardware stress for nothing.

Please advise, as I think that only one or two WU's finished over the weekend on 4 running VM's.
Sean

Re: List of SMP WUs with the "1 core usage" issue

Posted: Tue Sep 08, 2009 3:13 pm
by ikerekes
Project: 2677 (Run 27, Clone 19, Gen 30)

Do we still have to report bad WU's?
Apparently nobody cares to delete them; they keep reappearing :(

Project: 2671 (Run 12, Clone 40, Gen 89)

Posted: Wed Sep 09, 2009 12:15 pm
by SpockLogic

Code:

# Mac OS X SMP Console Edition ################################################
###############################################################################

                       Folding@Home Client Version 6.20

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/xxxxx/Library/InCrease/unit4
Executable: /Users/tonys/Library/InCrease/unit4/fah6
Arguments: -local -advmethods -forceasm -verbosity 9 -smp 

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[11:29:14] - Ask before connecting: No
[11:29:14] - User name: SpockLogic (Team 1971)
[11:29:14] - User ID: 278B714036433438
[11:29:14] - Machine ID: 4
[11:29:14] 
[11:29:14] Loaded queue successfully.
[11:29:14] 
[11:29:14] - Autosending finished units... [11:29:14]
[11:29:14] + Processing work unit
[11:29:14] Trying to send all finished work units
[11:29:14] Core required: FahCore_a2.exe
[11:29:14] + No unsent completed units remaining.
[11:29:14] - Autosend completed
[11:29:14] Core found.
[11:29:14] - Using generic ./mpiexec
[11:29:14] Working on queue slot 02 [September 9 11:29:14 UTC]
[11:29:14] + Working ...
[11:29:14] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -forceasm -verbose -lifeline 18232 -version 620'

[11:29:14] 
[11:29:14] *------------------------------*
[11:29:14] Folding@Home Gromacs SMP Core
[11:29:14] Version 2.11 (Fri Sep 4 09:50:46 PDT 2009)
[11:29:14] 
[11:29:14] Preparing to commence simulation
[11:29:14] - Ensuring status. Please wait.
[11:29:24] - Assembly optimizations manually forced on.
[11:29:24] - Not checking prior termination.
[11:29:25] - Expanded 1506827 -> 24012993 (decompressed 1593.6 percent)
[11:29:25] Called DecompressByteArray: compressed_data_size=1506827 data_size=24012993, decompressed_data_size=24012993 diff=0
[11:29:25] - Digital signature verified
[11:29:25] 
[11:29:25] Project: 2671 (Run 12, Clone 40, Gen 89)
[11:29:25] 
[11:29:25] Assembly optimizations on if available.
[11:29:25] Entering M.D.
[11:30:14] Completed 0 out of 250000 steps  (0%)
[11:30:19] CoreStatus = FF (255)
[11:30:19] Client-core communications error: ERROR 0xff
[11:30:19] This is a sign of more serious problems, shutting down.

Re: List of SMP WUs with the "1 core usage" issue

Posted: Wed Sep 09, 2009 5:14 pm
by Anglik666
Project: 2671 (Run 24, Clone 41, Gen 91)

Code:

[13:44:20] 
[13:44:20] *------------------------------*
[13:44:20] Folding@Home Gromacs SMP Core
[13:44:20] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[13:44:20] 
[13:44:20] Preparing to commence simulation
[13:44:20] - Ensuring status. Please wait.
[13:44:21] Called DecompressByteArray: compressed_data_size=1492108 data_size=24057197, decompressed_data_size=24057197 diff=0
[13:44:21] - Digital signature verified
[13:44:21] 
[13:44:21] Project: 2671 (Run 24, Clone 41, Gen 91)
[13:44:21] 
[13:44:21] Assembly optimizations on if available.
[13:44:21] Entering M.D.
[13:44:35] Run 24, Clone 41, Gen 91)
[13:44:35] 
[13:44:35] Entering M.D.
[13:45:06] lding@home Core Shutdown: INTERRUPTED
[13:45:10] CoreStatus = FF (255)
[13:45:10] Sending work to server
[13:45:10] Project: 2671 (Run 24, Clone 41, Gen 91)
[13:45:10] - Error: Could not get length of results file work/wuresults_09.dat
[13:45:10] - Error: Could not read unit 09 file. Removing from queue.
[13:45:10] Trying to send all finished work units
[13:45:10] + No unsent completed units remaining.
[13:45:10] - Preparing to get new work unit...
[13:45:10] Cleaning up work directory
[13:45:10] + Attempting to get work packet
[13:45:10] - Will indicate memory of 519 MB
[13:45:10] - Connecting to assignment server
[13:45:10] Connecting to http://assign.stanford.edu:8080/
[13:45:17] Posted data.
[13:45:17] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[13:45:17] + News From Folding@Home: Welcome to Folding@Home
[13:45:17] Loaded queue successfully.
[13:45:17] Connecting to http://171.67.108.24:8080/
[13:45:25] Posted data.
[13:45:25] Initial: 0000; - Receiving payload (expected size: 1492620)
[13:45:33] - Downloaded at ~182 kB/s
[13:45:33] - Averaged speed for that direction ~219 kB/s
[13:45:33] + Received work.
[13:45:33] Trying to send all finished work units
[13:45:33] + No unsent completed units remaining.
[13:45:33] + Closed connections
[13:45:38] 
[13:45:38] + Processing work unit
[13:45:38] At least 4 processors must be requested; read 1.
[13:45:38] Core required: FahCore_a2.exe
[13:45:38] Core found.
[13:45:38] Working on queue slot 00 [September 8 13:45:38 UTC]
[13:45:38] + Working ...
[13:45:38] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 00 -checkpoint 15 -verbose -lifeline 4758 -version 624'

[13:45:39] 
[13:45:39] *------------------------------*
[13:45:39] Folding@Home Gromacs SMP Core
[13:45:39] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[13:45:39] 
[13:45:39] Preparing to commence simulation
[13:45:39] - Ensuring status. Please wait.
[13:45:48] - Looking at optimizations...
[13:45:48] - Working with standard loops on this execution.
[13:45:48] - Files status OK
[13:45:49] - Expanded 1492108 -> 24057197 (decompressed 1612.2 percent)
[13:45:49] Called DecompressByteArray: compressed_data_size=1492108 data_size=24057197, decompressed_data_size=24057197 diff=0
[13:45:50] - Digital signature verified
[13:45:50] 
[13:45:50] Project: 2671 (Run 24, Clone 41, Gen 91)
[13:45:50] 
[13:45:50] Entering M.D.
[13:46:20] Completed 0 out of 250000 steps  (0%)
[13:46:25] CoreStatus = FF (255)
[13:46:25] Sending work to server
[13:46:25] Project: 2671 (Run 24, Clone 41, Gen 91)
[13:46:25] - Error: Could not get length of results file work/wuresults_00.dat
[13:46:25] - Error: Could not read unit 00 file. Removing from queue.
[13:46:25] Trying to send all finished work units
[13:46:25] + No unsent completed units remaining.
[13:46:25] - Preparing to get new work unit...
[13:46:25] Cleaning up work directory
[13:46:35] + Attempting to get work packet
[13:46:35] - Will indicate memory of 519 MB
[13:46:35] - Connecting to assignment server
[13:46:35] Connecting to http://assign.stanford.edu:8080/
[13:46:42] Posted data.
[13:46:42] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[13:46:42] + News From Folding@Home: Welcome to Folding@Home
[13:46:42] Loaded queue successfully.
[13:46:42] Connecting to http://171.67.108.24:8080/
[13:46:51] Posted data.
[13:46:51] Initial: 0000; - Receiving payload (expected size: 1492620)
[13:46:57] - Downloaded at ~242 kB/s
[13:46:57] - Averaged speed for that direction ~224 kB/s
[13:46:57] + Received work.
[13:46:57] Trying to send all finished work units
[13:46:57] + No unsent completed units remaining.
[13:46:57] + Closed connections
[13:47:02] 
[13:47:02] + Processing work unit
[13:47:02] At least 4 processors must be requested; read 1.
[13:47:02] Core required: FahCore_a2.exe
[13:47:02] Core found.
[13:47:02] Working on queue slot 01 [September 8 13:47:02 UTC]
[13:47:02] + Working ...
[13:47:02] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 01 -checkpoint 15 -verbose -lifeline 4758 -version 624'

[13:47:02] 
[13:47:02] *------------------------------*
[13:47:02] Folding@Home Gromacs SMP Core
[13:47:02] Version 2.10 (Sun Aug 30 03:43:28 CEST 2009)
[13:47:02] 
[13:47:02] Preparing to commence simulation
[13:47:02] - Ensuring status. Please wait.
[13:47:12] - Looking at optimizations...
[13:47:12] - Working with standard loops on this execution.
[13:47:12] - Files status OK
[13:47:12] - Expanded 1492108 -> 24057197 (decompressed 1612.2 percent)
[13:47:13] Called DecompressByteArray: compressed_data_size=1492108 data_size=24057197, decompressed_data_size=24057197 diff=0
[13:47:13] - Digital signature verified
[13:47:13] 
[13:47:13] Project: 2671 (Run 24, Clone 41, Gen 91)
[13:47:13] 
[13:47:13] Entering M.D.
[13:47:42] Completed 0 out of 250000 steps  (0%)
[13:47:47] CoreStatus = FF (255) 

Re: List of SMP WUs with the "1 core usage" issue

Posted: Wed Sep 09, 2009 6:13 pm
by GTron
This morning 2 of my current 3 SMP folders were served the following WUs with this problem (and both hung deleting them):

Project: 2671 (Run 50, Clone 97, Gen 91)
Project: 2671 (Run 51, Clone 50, Gen 89)

Greg

Project: 2669 (Run 13, Clone 71, Gen 168) Core Status = FF

Posted: Thu Sep 10, 2009 3:12 am
by Foxbat
This WU got loaded on my Mac Mini (1.83 GHz C2D, Leopard 10.5.8). It did not seem to want to start, and something weird was happening when it tried to run. It may be one of the runs with only one FahCore process, as my CPU meter showed one core about 50% busy and the other about 5% busy. I had recently received the V2.10 FahCore_a2, so I tried again after deleting V2.10 and letting the client download the V2.11 core:

Code:

--- Opening Log file [September 10 02:39:35 UTC] 


# Mac OS X SMP Console Edition ################################################
###############################################################################

                       Folding@Home Client Version 6.24R1

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /Users/Shared/FAH
Executable: /Users/Shared/FAH/fah6
Arguments: -local -advmethods -forceasm -verbosity 9 -smp 

[02:39:35] - Ask before connecting: No
[02:39:35] - User name: Foxbat (Team 55236)
[02:39:35] - User ID: 4AC0C6BC16A9E14A
[02:39:35] - Machine ID: 1
[02:39:35] 
[02:39:35] Work directory not found. Creating...
[02:39:35] Could not open work queue, generating new queue...
[02:39:35] - Preparing to get new work unit...
[02:39:35] Cleaning up work directory
[02:39:35] + Attempting to get work packet
[02:39:35] - Autosending finished units... [02:39:35]
[02:39:35] Trying to send all finished work units
[02:39:35] - Will indicate memory of 1000 MB
[02:39:35] + No unsent completed units remaining.
[02:39:35] - Detect CPU.[02:39:35] - Autosend completed
 Vendor: GenuineIntel, Family: 6, Model: 15, Stepping: 2
[02:39:35] - Connecting to assignment server
[02:39:35] Connecting to http://assign.stanford.edu:8080/
[02:39:35] Posted data.
[02:39:35] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[02:39:35] + News From Folding@Home: Welcome to Folding@Home
[02:39:35] Loaded queue successfully.
[02:39:35] Connecting to http://171.64.65.56:8080/
[02:39:41] Posted data.
[02:39:41] Initial: 0000; - Receiving payload (expected size: 1507343)
[02:39:45] - Downloaded at ~368 kB/s
[02:39:45] - Averaged speed for that direction ~368 kB/s
[02:39:45] + Received work.
[02:39:45] + Closed connections
[02:39:45] 
[02:39:45] + Processing work unit
[02:39:45] At least 4 processors must be requested; read 1.
[02:39:45] Core required: FahCore_a2.exe
[02:39:45] Core not found.
[02:39:45] - Core is not present or corrupted.
[02:39:45] - Attempting to download new core...
[02:39:45] + Downloading new core: FahCore_a2.exe
[02:39:45] Downloading core (/~pande/OSX/x86/Core_a2.fah from www.stanford.edu)
[02:39:45] Initial: AFDE; + 10240 bytes downloaded
[02:39:46] Initial: BD61; + 20480 bytes downloaded
[02:39:46] Initial: 4F92; + 30720 bytes downloaded
<snip>
[02:39:48] Initial: A09E; + 1546240 bytes downloaded
[02:39:49] Initial: 65FC; + 1550485 bytes downloaded
[02:39:49] Verifying core Core_a2.fah...
[02:39:49] Signature is VALID
[02:39:49] 
[02:39:49] Trying to unzip core FahCore_a2.exe
[02:39:49] Decompressed FahCore_a2.exe (4746488 bytes) successfully
[02:39:49] + Core successfully engaged
[02:39:54] 
[02:39:54] + Processing work unit
[02:39:54] At least 4 processors must be requested; read 1.
[02:39:54] Core required: FahCore_a2.exe
[02:39:54] Core found.
[02:39:54] - Using generic ./mpiexec
[02:39:54] Working on queue slot 01 [September 10 02:39:54 UTC]
[02:39:54] + Working ...
[02:39:54] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 8 -forceasm -verbose -lifeline 227 -version 624'

[02:39:54] 
[02:39:54] *------------------------------*
[02:39:54] Folding@Home Gromacs SMP Core
[02:39:54] Version 2.11 (Fri Sep 4 09:50:46 PDT 2009)
[02:39:54] 
[02:39:54] Preparing to commence simulation
[02:39:54] - Ensuring status. Please wait.
[02:39:55] Called DecompressByteArray: compressed_data_size=1506831 data_size=23973757, decompressed_data_size=23973757 diff=0
[02:39:56] - Digital signature verified
[02:39:56] 
[02:39:56] Project: 2669 (Run 13, Clone 71, Gen 168)
[02:39:56] 
[02:39:56] Assembly optimizations on if available.
[02:39:56] Entering M.D.
[02:40:06]  on if available.
[02:40:06] Entering M.D.
[02:40:54]  (0%)
[02:40:59] CoreStatus = FF (255)
[02:40:59] Sending work to server
[02:40:59] Project: 2669 (Run 13, Clone 71, Gen 168)
[02:40:59] - Error: Could not get length of results file work/wuresults_01.dat
[02:40:59] - Error: Could not read unit 01 file. Removing from queue.
[02:40:59] Trying to send all finished work units
[02:40:59] + No unsent completed units remaining.
[02:40:59] - Preparing to get new work unit...
[02:40:59] Cleaning up work directory
[02:40:59] + Attempting to get work packet
[02:40:59] - Will indicate memory of 1000 MB
[02:40:59] - Connecting to assignment server
[02:40:59] Connecting to http://assign.stanford.edu:8080/
[02:40:59] Posted data.
[02:40:59] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[02:40:59] + News From Folding@Home: Welcome to Folding@Home
[02:40:59] Loaded queue successfully.
[02:40:59] Connecting to http://171.64.65.56:8080/
[02:41:04] Posted data.
[02:41:04] Initial: 0000; - Receiving payload (expected size: 1507343)
[02:41:07] - Downloaded at ~490 kB/s
[02:41:07] - Averaged speed for that direction ~429 kB/s
[02:41:07] + Received work.
[02:41:07] Trying to send all finished work units
[02:41:07] + No unsent completed units remaining.
[02:41:07] + Closed connections
[02:41:12] 
[02:41:12] + Processing work unit
[02:41:12] At least 4 processors must be requested; read 1.
[02:41:12] Core required: FahCore_a2.exe
[02:41:12] Core found.
[02:41:12] - Using generic ./mpiexec
[02:41:12] Working on queue slot 02 [September 10 02:41:12 UTC]
[02:41:12] + Working ...
[02:41:12] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 8 -forceasm -verbose -lifeline 227 -version 624'

[02:41:12] 
[02:41:12] *------------------------------*
[02:41:12] Folding@Home Gromacs SMP Core
[02:41:12] Version 2.11 (Fri Sep 4 09:50:46 PDT 2009)
[02:41:12] 
[02:41:12] Preparing to commence simulation
[02:41:12] - Ensuring status. Please wait.
[02:41:22] - Assembly optimizations manually forced on.
[02:41:22] - Not checking prior termination.
[02:41:24] - Expanded 1506831 -> 23973757 (decompressed 1591.0 percent)
[02:41:24] Called DecompressByteArray: compressed_data_size=1506831 data_size=23973757, decompressed_data_size=23973757 diff=0
[02:41:25] - Digital signature verified
[02:41:25] 
[02:41:25] Project: 2669 (Run 13, Clone 71, Gen 168)
[02:41:25] 
[02:41:25] Assembly optimizations on if available.
[02:41:25] Entering M.D.
[02:42:23] Completed 0 out of 250000 steps  (0%)
[02:42:24] 
[02:42:24] Folding@home Core Shutdown: INTERRUPTED
[02:42:28] CoreStatus = FF (255)
[02:42:28] Sending work to server
[02:42:28] Project: 2669 (Run 13, Clone 71, Gen 168)
[02:42:28] - Error: Could not get length of results file work/wuresults_02.dat
[02:42:28] - Error: Could not read unit 02 file. Removing from queue.
[02:42:28] Trying to send all finished work units
[02:42:28] + No unsent completed units remaining.
[02:42:28] - Preparing to get new work unit...
[02:42:28] Cleaning up work directory
[02:42:28] + Attempting to get work packet
[02:42:28] - Will indicate memory of 1000 MB
[02:42:28] - Connecting to assignment server
[02:42:28] Connecting to http://assign.stanford.edu:8080/
[02:42:28] Posted data.
[02:42:28] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[02:42:28] + News From Folding@Home: Welcome to Folding@Home
[02:42:28] Loaded queue successfully.
[02:42:28] Connecting to http://171.64.65.56:8080/
[02:42:29] Posted data.
[02:42:29] Initial: 0000; - Error: Bad packet type from server, expected work assignment
[02:42:30] - Attempt #1  to get work failed, and no other work to do.
Waiting before retry.
[02:42:47] + Attempting to get work packet
[02:42:47] - Will indicate memory of 1000 MB
[02:42:47] - Connecting to assignment server
[02:42:47] Connecting to http://assign.stanford.edu:8080/
[02:42:48] Posted data.
[02:42:48] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[02:42:48] + News From Folding@Home: Welcome to Folding@Home
[02:42:48] Loaded queue successfully.
[02:42:48] Connecting to http://171.64.65.56:8080/
[02:42:55] Posted data.
[02:42:55] Initial: 0000; - Receiving payload (expected size: 1503468)
[02:42:58] - Downloaded at ~489 kB/s
[02:42:58] - Averaged speed for that direction ~449 kB/s
[02:42:58] + Received work.
[02:42:58] Trying to send all finished work units
[02:42:58] + No unsent completed units remaining.
[02:42:58] + Closed connections
[02:43:03] 
[02:43:03] + Processing work unit
[02:43:03] At least 4 processors must be requested; read 1.
[02:43:03] Core required: FahCore_a2.exe
[02:43:03] Core found.
[02:43:03] - Using generic ./mpiexec
[02:43:03] Working on queue slot 03 [September 10 02:43:03 UTC]
[02:43:03] + Working ...
[02:43:03] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 03 -checkpoint 8 -forceasm -verbose -lifeline 227 -version 624'

[02:43:03] 
[02:43:03] *------------------------------*
[02:43:03] Folding@Home Gromacs SMP Core
[02:43:03] Version 2.11 (Fri Sep 4 09:50:46 PDT 2009)
[02:43:03] 
[02:43:03] Preparing to commence simulation
[02:43:03] - Ensuring status. Please wait.
[02:43:04] Called DecompressByteArray: compressed_data_size=1502956 data_size=24031357, decompressed_data_size=24031357 diff=0
[02:43:05] - Digital signature verified
[02:43:05] 
[02:43:05] Project: 2677 (Run 19, Clone 97, Gen 30)
[02:43:05] 
[02:43:05] Assembly optimizations on if available.
[02:43:05] Entering M.D.
[02:43:15]  on if available.
[02:43:15] Entering M.D.
[02:44:03] Completed 0 out of 250000 steps  (0%)
[02:44:07] CoreStatus = FF (255)
[02:44:07] Sending work to server
[02:44:07] Project: 2677 (Run 19, Clone 97, Gen 30)
[02:44:07] - Error: Could not get length of results file work/wuresults_03.dat
[02:44:07] - Error: Could not read unit 03 file. Removing from queue.
[02:44:07] Trying to send all finished work units
[02:44:07] + No unsent completed units remaining.
[02:44:07] - Preparing to get new work unit...
[02:44:07] Cleaning up work directory
[02:44:07] + Attempting to get work packet
[02:44:07] - Will indicate memory of 1000 MB
[02:44:07] - Connecting to assignment server
[02:44:07] Connecting to http://assign.stanford.edu:8080/
[02:44:08] Posted data.
[02:44:08] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[02:44:08] + News From Folding@Home: Welcome to Folding@Home
[02:44:08] Loaded queue successfully.
[02:44:08] Connecting to http://171.64.65.56:8080/
[02:44:15] Posted data.
[02:44:15] Initial: 0000; - Receiving payload (expected size: 1503468)
[02:44:18] - Downloaded at ~489 kB/s
[02:44:18] - Averaged speed for that direction ~459 kB/s
[02:44:18] + Received work.
[02:44:18] Trying to send all finished work units
[02:44:18] + No unsent completed units remaining.
[02:44:18] + Closed connections
[02:44:23] 
[02:44:23] + Processing work unit
[02:44:23] At least 4 processors must be requested; read 1.
[02:44:23] Core required: FahCore_a2.exe
[02:44:23] Core found.
[02:44:23] - Using generic ./mpiexec
[02:44:23] Working on queue slot 04 [September 10 02:44:23 UTC]
[02:44:23] + Working ...
[02:44:23] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 04 -checkpoint 8 -forceasm -verbose -lifeline 227 -version 624'

[02:44:23] 
[02:44:23] *------------------------------*
[02:44:23] Folding@Home Gromacs SMP Core
[02:44:23] Version 2.11 (Fri Sep 4 09:50:46 PDT 2009)
[02:44:23] 
[02:44:23] Preparing to commence simulation
[02:44:23] - Ensuring status. Please wait.
[02:44:32] - Assembly optimizations manually forced on.
[02:44:32] - Not checking prior termination.
[02:44:34] - Expanded 1502956 -> 24031357 (decompressed 1598.9 percent)
[02:44:35] Called DecompressByteArray: compressed_data_size=1502956 data_size=24031357, decompressed_data_size=24031357 diff=0
[02:44:35] - Digital signature verified
[02:44:35] 
[02:44:35] Project: 2677 (Run 19, Clone 97, Gen 30)
[02:44:35] 
[02:44:35] Assembly optimizations on if available.
[02:44:35] Entering M.D.
[02:45:23] Completed 0 out of 250000 steps  (0%)
[02:45:28] CoreStatus = FF (255)
[02:45:28] Sending work to server
[02:45:28] Project: 2677 (Run 19, Clone 97, Gen 30)
[02:45:28] - Error: Could not get length of results file work/wuresults_04.dat
[02:45:28] - Error: Could not read unit 04 file. Removing from queue.
[02:45:28] Trying to send all finished work units
[02:45:28] + No unsent completed units remaining.
[02:45:28] - Preparing to get new work unit...
[02:45:28] Cleaning up work directory
[02:45:28] + Attempting to get work packet
[02:45:28] - Will indicate memory of 1000 MB
[02:45:28] - Connecting to assignment server
[02:45:28] Connecting to http://assign.stanford.edu:8080/
[02:45:28] Posted data.
[02:45:28] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[02:45:28] + News From Folding@Home: Welcome to Folding@Home
[02:45:28] Loaded queue successfully.
[02:45:28] Connecting to http://171.64.65.56:8080/
[02:45:36] Posted data.
[02:45:36] Initial: 0000; - Receiving payload (expected size: 1503468)
[02:45:40] - Downloaded at ~367 kB/s
[02:45:40] - Averaged speed for that direction ~440 kB/s
[02:45:40] + Received work.
[02:45:40] Trying to send all finished work units
[02:45:40] + No unsent completed units remaining.
[02:45:40] + Closed connections
[02:45:45] 
[02:45:45] + Processing work unit
[02:45:45] At least 4 processors must be requested; read 1.
[02:45:45] Core required: FahCore_a2.exe
[02:45:45] Core found.
[02:45:45] - Using generic ./mpiexec
[02:45:45] Working on queue slot 05 [September 10 02:45:45 UTC]
[02:45:45] + Working ...
[02:45:45] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 05 -checkpoint 8 -forceasm -verbose -lifeline 227 -version 624'

[02:45:45] 
[02:45:45] *------------------------------*
[02:45:45] Folding@Home Gromacs SMP Core
[02:45:45] Version 2.11 (Fri Sep 4 09:50:46 PDT 2009)
[02:45:45] 
[02:45:45] Preparing to commence simulation
[02:45:45] - Ensuring status. Please wait.
[02:45:55] - Assembly optimizations manually forced on.
[02:45:55] - Not checking prior termination.
[02:45:57] - Expanded 1502956 -> 24031357 (decompressed 1598.9 percent)
[02:45:57] Called DecompressByteArray: compressed_data_size=1502956 data_size=24031357, decompressed_data_size=24031357 diff=0
[02:45:58] - Digital signature verified
[02:45:58] 
[02:45:58] Project: 2677 (Run 19, Clone 97, Gen 30)
[02:45:58] 
[02:45:58] Assembly optimizations on if available.
[02:45:58] Entering M.D.
[02:46:46] Completed 0 out of 250000 steps  (0%)
[02:46:50] CoreStatus = FF (255)
[02:46:50] Sending work to server
[02:46:50] Project: 2677 (Run 19, Clone 97, Gen 30)
[02:46:50] - Error: Could not get length of results file work/wuresults_05.dat
[02:46:50] - Error: Could not read unit 05 file. Removing from queue.
[02:46:50] Trying to send all finished work units
[02:46:50] + No unsent completed units remaining.
[02:46:50] - Preparing to get new work unit...
[02:46:50] Cleaning up work directory
[02:46:50] + Attempting to get work packet
[02:46:50] - Will indicate memory of 1000 MB
[02:46:50] - Connecting to assignment server
[02:46:50] Connecting to http://assign.stanford.edu:8080/
[02:46:51] Posted data.
[02:46:51] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[02:46:51] + News From Folding@Home: Welcome to Folding@Home
[02:46:51] Loaded queue successfully.
[02:46:51] Connecting to http://171.64.65.56:8080/
[02:46:51] Posted data.
[02:46:51] Initial: 0000; - Error: Bad packet type from server, expected work assignment
[02:46:52] - Attempt #1  to get work failed, and no other work to do.
Waiting before retry.
[02:47:09] + Attempting to get work packet
[02:47:09] - Will indicate memory of 1000 MB
[02:47:09] - Connecting to assignment server
[02:47:09] Connecting to http://assign.stanford.edu:8080/
[02:47:09] Posted data.
[02:47:09] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[02:47:09] + News From Folding@Home: Welcome to Folding@Home
[02:47:09] Loaded queue successfully.
[02:47:09] Connecting to http://171.64.65.56:8080/
[02:47:14] Posted data.
[02:47:14] Initial: 0000; - Receiving payload (expected size: 4838155)
[02:47:23] - Downloaded at ~524 kB/s
[02:47:23] - Averaged speed for that direction ~457 kB/s
[02:47:23] + Received work.
[02:47:23] Trying to send all finished work units
[02:47:23] + No unsent completed units remaining.
[02:47:23] + Closed connections
[02:47:28] 
[02:47:28] + Processing work unit
[02:47:28] At least 4 processors must be requested; read 1.
[02:47:28] Core required: FahCore_a2.exe
[02:47:28] Core found.
[02:47:28] - Using generic ./mpiexec
[02:47:28] Working on queue slot 06 [September 10 02:47:28 UTC]
[02:47:28] + Working ...
[02:47:28] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 06 -checkpoint 8 -forceasm -verbose -lifeline 227 -version 624'

[02:47:28] 
[02:47:28] *------------------------------*
[02:47:28] Folding@Home Gromacs SMP Core
[02:47:28] Version 2.11 (Fri Sep 4 09:50:46 PDT 2009)
[02:47:28] 
[02:47:28] Preparing to commence simulation
[02:47:28] - Ensuring status. Please wait.
[02:47:38] - Assembly optimizations manually forced on.
[02:47:38] - Not checking prior termination.
[02:47:41] - Expanded 4837643 -> 24034797 (decompressed 496.8 percent)
[02:47:41] Called DecompressByteArray: compressed_data_size=4837643 data_size=24034797, decompressed_data_size=24034797 diff=0
[02:47:42] - Digital signature verified
[02:47:42] 
[02:47:42] Project: 2677 (Run 29, Clone 14, Gen 47)
[02:47:42] 
[02:47:42] Assembly optimizations on if available.
[02:47:42] Entering M.D.
[02:47:54] Completed 0 out of 250000 steps  (0%)
And so we're processing normally now. I notice that another WU terminated after I ran through my P2669:R13:C71:G168: Project: 2677 (Run 19, Clone 97, Gen 30) also exited with CoreStatus = FF.

I searched this Forum but didn't see either of these Work Units reported.
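
For anyone who wants to pull the failed units out of a long FAHlog instead of reading it by eye, here is a minimal sketch in Python. It assumes the log format shown in the excerpts in this thread: a "Project: ... (Run ..., Clone ..., Gen ...)" line when the core starts, followed later by "CoreStatus = FF". The function name `failed_units` is made up for illustration and is not part of any Folding@Home tool.

```python
import re
from collections import Counter

# Matches lines such as: "[02:39:56] Project: 2669 (Run 13, Clone 71, Gen 168)"
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def failed_units(log_text):
    """Count CoreStatus = FF failures per (project, run, clone, gen).

    Assumption: each failure is attributed to the most recent Project
    line seen before the "CoreStatus = FF" line.
    """
    failures = Counter()
    current = None
    for line in log_text.splitlines():
        match = PROJECT_RE.search(line)
        if match:
            current = tuple(int(group) for group in match.groups())
        elif "CoreStatus = FF" in line and current is not None:
            failures[current] += 1
            current = None  # wait for the next Project line
    return failures


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python failed_units.py FAHlog.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for unit, count in sorted(failed_units(handle.read()).items()):
            print("Project: %d (Run %d, Clone %d, Gen %d) failed %d time(s)" % (*unit, count))
```

The count per unit also shows how many times the same WU was reassigned to the same machine before the client moved on.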

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 10, 2009 12:04 pm
by ikerekes
Project: 2669 (Run 5, Clone 92, Gen 175)

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 10, 2009 3:41 pm
by HaloJones
Hung deleting:

Project: 2677 (Run 38, Clone 44, Gen 31)

Twice :(

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 10, 2009 6:01 pm
by GTron
I received the WU below twice more, and it hung deleting both times. Hope I'm finally done with it!
Project: 2671 (Run 50, Clone 97, Gen 91)
Greg

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 10, 2009 8:24 pm
by ikerekes
Project: 2669 (Run 2, Clone 6, Gen 133): got it 4 times before it was deleted, then went to
Project: 2677 (Run 30, Clone 77, Gen 33) right after (3 times) before I finally got a good one :)

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 10, 2009 8:27 pm
by JackOfAll
Can kasson or someone from Pande please comment? "Bad" WU's are being reported. Why can they not be removed from the servers?

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 10, 2009 9:25 pm
by VijayPande
We're looking into what's going on here. We could try to do "whack a mole" on bad WU's as people report them, but it's better for us to get to the root of the issue, especially since others are likely having trouble and not reporting. The server will eventually abandon WUs that are bad, but it takes more than one computer to make the report to ensure that it is indeed bad. Being too liberal here with banning WUs leads to both no science done and WU shortages. I or Kasson will give an update hopefully in a day or two if we have some news. Sorry this is taking so long.
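
The "more than one computer" rule described above can be pictured with a small sketch. This is only an illustration of the idea, not the actual work server logic, which is not shown anywhere in this thread. The class name `BadUnitTracker`, the `min_reporters` threshold of 2, and the use of a machine identifier are all assumptions.

```python
from collections import defaultdict


class BadUnitTracker:
    """Hypothetical model: abandon a WU once enough distinct machines report it bad."""

    def __init__(self, min_reporters=2):
        # Assumed threshold; the real server value is not stated in the thread.
        self.min_reporters = min_reporters
        self._reporters = defaultdict(set)

    def report_failure(self, unit, machine_id):
        """Record a failure report and return True if the unit is now abandoned."""
        # A set means repeated reports from one machine count only once.
        self._reporters[unit].add(machine_id)
        return self.is_abandoned(unit)

    def is_abandoned(self, unit):
        return len(self._reporters.get(unit, ())) >= self.min_reporters
```

Under these assumptions, one client failing the same WU three or four times in a row would not be enough to retire it; a second machine would have to fail it as well.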

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Sep 10, 2009 11:08 pm
by JackOfAll
VijayPande wrote:We're looking into what's going on here. We could try to do "whack a mole" on bad WU's as people report them, but it's better for us to get to the root of the issue, especially since others are likely having trouble and not reporting. The server will eventually abandon WUs that are bad, but it takes more than one computer to make the report to ensure that it is indeed bad. Being too liberal here with banning WUs leads to both no science done and WU shortages. I or Kasson will give an update hopefully in a day or two if we have some news. Sorry this is taking so long.
Vijay,

Thank you very much for responding. You're right, this is taking too long to sort out. ;)

Firstly, with regard to bad WU's, you do not need to be a programmer to spot these a mile away by the size of the WU. We're not asking you to "whack a mole", just to remove the obviously bad units from the server. That, along with making sure that no undersized bad units are generated or added to the server in the future, is the fix I would suggest.

Issuing Core_a2 v2.10 to detect these units on the client side and error out with an 'FF' status does not fix the issue. It is still a complete waste of resources for a single client to download a bad unit 3 or 4 times before moving on to the next. It wouldn't be so bad if the client actually communicated the "it will only ever run 1 core" badness of the unit to the server, so that it is never issued again, either to the reporting client or to any other. This might be an improvement on the previous v2.08 core behaviour with these bad units (i.e. running on a single core), were it not for the fact that the client seems to hang for some people while locally deleting the bad unit. Not to mention that it is pretty obvious by now that v2.10 has issues when running on fewer than 4 cores.

I'd even suggest that leaving the v2.10 core in the wild, rather than reverting to v2.08 until a better solution can be found, is a mistake, judging from the number of people running 2 cores in a VM. I hope these issues are resolved quickly. The v2.10 core pretty much ensures less science is being done.
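
To illustrate the "spot them by size" point: in the logs posted in this thread, the units that died with CoreStatus = FF all had a compressed size of roughly 1.5 MB, while the one unit that ran normally was about 4.8 MB. A rough sketch of a check based on that pattern follows. The function name `undersized_units` and the 2,000,000-byte cutoff are assumptions drawn only from those few samples, so treat the result as a hint rather than proof that a unit is bad.

```python
import re

# Matches lines such as: "- Expanded 1495164 -> 24042365 (decompressed 1608.0 percent)"
EXPANDED_RE = re.compile(r"Expanded (\d+) -> (\d+)")


def undersized_units(log_text, min_compressed_bytes=2_000_000):
    """Return (compressed, expanded) byte sizes for suspiciously small units.

    The default cutoff is a guess based on the sizes quoted in this
    thread, not a documented limit.
    """
    flagged = []
    for line in log_text.splitlines():
        match = EXPANDED_RE.search(line)
        if match is None:
            continue
        compressed, expanded = int(match.group(1)), int(match.group(2))
        if compressed < min_compressed_bytes:
            flagged.append((compressed, expanded))
    return flagged
```

Feeding it a FAHlog that contains the "Expanded" lines quoted earlier would flag the 1495164-byte unit and leave the 4837643-byte one alone.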