Re: List of SMP WUs with the "1 core usage" issue

Posted: Wed Aug 19, 2009 1:36 pm
by toTOW
List edited ... it's beginning to be a long list :?

Re: List of SMP WUs with the "1 core usage" issue

Posted: Wed Aug 19, 2009 1:37 pm
by BrokenWolf
Got one this morning as well: 2677 R3/C78/G28. Client is 6.24beta; FahCore_a2.exe was 2.07.

Code:

Reading file work/wudata_06.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun 'IBX in water'
7250000 steps,  14500.0 ps (continuing from step 7000000,  14000.0 ps).

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483269. It should have been within [ 0 .. 9464 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483611. It should have been within [ 0 .. 256 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
Halting parallel program mdrun on CPU 0 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
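
For anyone collecting these failures from many machines, here is a minimal sketch (not part of the FAH client or GROMACS) that pulls the "Variable ... has value ..." lines out of a pasted log and reports whether the value is inside the stated bounds. The function name `find_range_errors` and the returned dictionary keys are made up for illustration; the regular expression only assumes the message wording shown in the log above.

```python
import re

# Matches the wording seen in the log above, e.g.
# "Variable ci has value -2147483269. It should have been within [ 0 .. 9464 ]"
RANGE_ERROR = re.compile(
    r"Variable (\w+) has value (-?\d+)\. "
    r"It should have been within \[ (-?\d+) \.\. (-?\d+) \]"
)


def find_range_errors(log_text):
    """Return one dict per range-check message found in log_text.

    Hypothetical helper: the key names are illustrative only.
    """
    errors = []
    for match in RANGE_ERROR.finditer(log_text):
        name = match.group(1)
        value, low, high = (int(g) for g in match.groups()[1:])
        errors.append({
            "name": name,
            "value": value,
            "low": low,
            "high": high,
            "in_range": low <= value <= high,
        })
    return errors
```

Run against the log above, it should return two entries (one per failing node), both with `in_range` set to `False`.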

Re: List of SMP WUs with the "1 core usage" issue

Posted: Wed Aug 19, 2009 4:27 pm
by ChasR
Project: 2677 (Run 3, Clone 79, Gen 33)
Project: 2677 (Run 23, Clone 74, Gen 28)
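
Reports in this thread use several notations for the same thing ("2677 R3/C78/G28", "Project: 2677 (Run 3, Clone 79, Gen 33)", and the "P2671R37C79G78" tag form), which makes it hard to tell whether a WU is already on the list. Below is a rough sketch of normalizing them to one (project, run, clone, gen) tuple. The names `WorkUnit` and `parse_wu` are hypothetical, and the patterns only cover the notations that appear in this thread.

```python
import re
from typing import NamedTuple, Optional


class WorkUnit(NamedTuple):
    project: int
    run: int
    clone: int
    gen: int


# "Project: 2677 (Run 3, Clone 79, Gen 33)"
_LONG = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s*(\d+),\s*Clone\s*(\d+),\s*Gen\s*(\d+)\)"
)
# "2677 R3/C78/G28", "p2669 R13/C29/G178", "P2671R37C79G78"
_SHORT = re.compile(r"[Pp]?(\d+)\s*R(\d+)/?C(\d+)/?G(\d+)")


def parse_wu(text: str) -> Optional[WorkUnit]:
    """Return the first work unit identifier found in text, or None."""
    for pattern in (_LONG, _SHORT):
        match = pattern.search(text)
        if match:
            return WorkUnit(*(int(g) for g in match.groups()))
    return None


def new_reports(lines, already_listed):
    """Return parsed work units from lines that are not in already_listed."""
    seen = set(already_listed)
    fresh = []
    for line in lines:
        wu = parse_wu(line)
        if wu is not None and wu not in seen:
            seen.add(wu)
            fresh.append(wu)
    return fresh
```

Because `WorkUnit` is a tuple, the results can go straight into a set, so a repeat report of the same WU in a different notation is skipped.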

Re: List of SMP WUs with the "1 core usage" issue

Posted: Wed Aug 19, 2009 8:13 pm
by ChasR
I'm also experiencing a rash of A2 core WUs failing to proceed after "Entering M.D." on core 2.07, as BrokenWolf posted above. I have yet to find a WU running on one core on FAH core 2.07.

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Aug 20, 2009 3:33 am
by BrokenWolf
Got another one. 2677 R5/C21/G30. It did not appear to be on the list yet.

@ChasR> I think the 2.07 core cannot handle the busted WU at all, so it barfs. It appears that 2.08 handles WU funniness better: it tries to go but can only process the WU on one core. At least that is my take on it.

BW

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Aug 20, 2009 11:26 am
by ChasR
I've been doing some rudimentary checking to see if the same WUs that run on one core on 2.08 fail to proceed on 2.07. I haven't found a duplicate yet, but that might not mean much since on every 2.07 machine that had a bad WU, I deleted core 2.07 and got 2.08. I deleted the core on the problem 2.08 WUs as well, so most of the logs with the hung 2.07 WUs have been overwritten and are gone. While you are probably correct about the relationship of the failures seen on core 2.07 and 2.08, strictly speaking we don't know they are related. If you find a WU that hangs on 2.07 and runs on one core on 2.08, then you will have convinced me. I'll continue to look, though my core 2.07 machine count is way down. If they do turn out to be related, the problem WUs have been out there for some time.

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Aug 20, 2009 1:39 pm
by ChasR
Project: 2669 (Run 0, Clone 32, Gen 188)

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Aug 20, 2009 6:36 pm
by parkut
First time for this series of WUs, and on a quad-core machine. All prior instances (for me) were on Conroe Core 2s.

Project: 2671 (Run 37, Clone 79, Gen 78) 1920.00 pts (17.678 pt/hr)

compressed_data_size=1513330

Code:

quad8.parkut.com
 14:24:01 up 11 days, 11:25,  0 users,  load average: 1.00, 1.00, 1.00
20077 99.6 20077 S ?        01:25:05 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -checkpoint 15 -verbose -lifeline 3086 -version 624
20080  0.3 20080 S ?        00:00:18 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -checkpoint 15 -verbose -lifeline 3086 -version 624
20078  0.0 20078 S ?        00:00:04 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -checkpoint 15 -verbose -lifeline 3086 -version 624
20079  0.0 20079 S ?        00:00:04 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -checkpoint 15 -verbose -lifeline 3086 -version 624
...
model name	: Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
cpu MHz		: 3006.932
cache size	: 4096 KB
Memory: 1.96 GB physical, 1.94 GB virtual
...
Client Version 6.24R3  
Core: FahCore_a2.exe
Core Version 2.08 (Mon May 18 14:47:42 PDT 2009)
Current Work Unit
-----------------
Name: p2671_IBX in water
Tag: P2671R37C79G78
Download time: August 20 16:58:37
Due time: August 23 16:58:37
Progress: 1%  [__________]
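
The process listing above shows the symptom clearly: one FahCore_a2.exe process near 100% CPU and the other three near zero. Here is a small sketch of checking for that automatically. It assumes the second whitespace-separated column of each line is the CPU percentage, as it appears to be in the listing above; the function names and the 90% / 5% thresholds are arbitrary choices for illustration, not anything defined by the client.

```python
def fahcore_cpu_shares(ps_text, core_name="FahCore_a2"):
    """Collect the CPU percentage of every line mentioning core_name.

    Assumes the second column is %CPU, as in the listing above.
    """
    shares = []
    for line in ps_text.splitlines():
        if core_name not in line:
            continue
        fields = line.split()
        try:
            shares.append(float(fields[1]))
        except (IndexError, ValueError):
            # Skip lines that do not have a numeric second column.
            continue
    return shares


def looks_single_core(shares, busy=90.0, idle=5.0):
    """True when exactly one process is busy and all others are idle.

    The busy/idle thresholds are assumptions chosen for this sketch.
    """
    if len(shares) < 2:
        return False
    ordered = sorted(shares, reverse=True)
    return ordered[0] >= busy and all(s <= idle for s in ordered[1:])
```

With the four lines from the listing above, `fahcore_cpu_shares` gives 99.6, 0.3, 0.0 and 0.0, and `looks_single_core` flags that as a one-core WU.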

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Aug 20, 2009 9:05 pm
by Oldhat
Just got another one. :)

Project: 2669 (Run 7, Clone 51, Gen 110)

Cheers

Re: List of SMP WUs with the "1 core usage" issue

Posted: Thu Aug 20, 2009 10:15 pm
by toTOW
List updated. Thanks for your reports.

Re: List of SMP WUs with the "1 core usage" issue

Posted: Fri Aug 21, 2009 12:46 am
by ^w^ing
ChasR wrote:... If you find a WU that hangs on 2.07 and runs on one core on 2.08, then you will have convinced me.
It's true: when one of these WUs broke my client, which was running 2.07, I deleted the core, and the client downloaded 2.08 and restarted the same WU. It did start, but it ran on only one core.

Re: List of SMP WUs with the "1 core usage" issue

Posted: Fri Aug 21, 2009 2:42 am
by BrokenWolf
Got a repeat here: p2669 R13/C29/G178. Are these not being marked as bad so that they do not get sent out again?

Code:

[02:02:56] Connecting to http://171.64.65.56:8080/
[02:03:03] Posted data.
[02:03:03] Initial: 0000; - Receiving payload (expected size: 1508832)
[02:03:04] - Downloaded at ~1473 kB/s
[02:03:04] - Averaged speed for that direction ~1230 kB/s
[02:03:04] + Received work.
[02:03:04] Trying to send all finished work units
[02:03:04] + No unsent completed units remaining.
[02:03:04] + Closed connections
[02:03:04] 
[02:03:04] + Processing work unit
[02:03:04] At least 4 processors must be requested.Core required: FahCore_a2.exe
[02:03:04] Core found.
[02:03:05] Working on queue slot 02 [August 21 02:03:05 UTC]
[02:03:05] + Working ...
[02:03:05] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -priority 96 -checkpoint 10 -verbose -lifeline 28432 -version 624'

[02:03:05] 
[02:03:05] *------------------------------*
[02:03:05] Folding@Home Gromacs SMP Core
[02:03:05] Version 2.08 (Mon May 18 14:47:42 PDT 2009)
[02:03:05] 
[02:03:05] Preparing to commence simulation
[02:03:05] - Ensuring status. Please wait.
[02:03:06] Called DecompressByteArray: compressed_data_size=1508320 data_size=23973757, decompressed_data_size=23973757 diff=0
[02:03:06] - Digital signature verified
[02:03:06] 
[02:03:06] Project: 2669 (Run 13, Clone 29, Gen 178)
[02:03:06] 
[02:03:06] Assembly optimizations on if available.
[02:03:06] Entering M.D.
[02:03:16] un 13, Clone 29, Gen 178)
[02:03:16] 
[02:03:16] Entering M.D.
[02:03:53] Completed 0 out of 250000 steps  (0%)

Re: List of SMP WUs with the "1 core usage" issue

Posted: Fri Aug 21, 2009 1:24 pm
by BrokenWolf
And another one this morning. 2677 R35/C76/G35

BW

Re: List of SMP WUs with the "1 core usage" issue

Posted: Fri Aug 21, 2009 5:34 pm
by ChasR
Project: 2675 (Run 3, Clone 182, Gen 153)

Re: List of SMP WUs with the "1 core usage" issue

Posted: Fri Aug 21, 2009 11:02 pm
by Gary480six
Project: 2677 (Run 35, Clone 54, Gen 25)

How do I make it go away so I can get different work? I deleted the work folder and queue.dat and got the same project again.