List of SMP WUs with the "1 core usage" issue
Moderators: Site Moderators, FAHC Science Team
-
- Site Moderator
- Posts: 6349
- Joined: Sun Dec 02, 2007 10:38 am
- Location: Bordeaux, France
Re: List of SMP WUs with the "1 core usage" issue
List edited ... it's beginning to be a long list
-
- Posts: 126
- Joined: Sat Aug 02, 2008 3:08 am
Re: List of SMP WUs with the "1 core usage" issue
Got one this morning as well: 2677 R3/C78/G28. Client is 6.24beta; FahCore_a2.exe was 2.07.
Code:
Reading file work/wudata_06.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64
NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp
Making 1D domain decomposition 1 x 1 x 4
starting mdrun 'IBX in water'
7250000 steps, 14500.0 ps (continuing from step 7000000, 14000.0 ps).
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357
Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.
Variable ci has value -2147483269. It should have been within [ 0 .. 9464 ]
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 0, will try to stop all the nodes
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357
Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.
Variable ci has value -2147483611. It should have been within [ 0 .. 256 ]
For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4
gcq#0: Thanx for Using GROMACS - Have a Nice Day
[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
Halting parallel program mdrun on CPU 0 out of 4
gcq#0: Thanx for Using GROMACS - Have a Nice Day
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
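For anyone collecting these failures, here is a minimal Python sketch that pulls the out-of-range variable and its allowed interval out of a pasted log. The function name `find_range_errors` and the regular expression are illustrative assumptions based only on the message wording shown above, not part of GROMACS or the FAH client.

```python
import re

# Matches lines such as:
#   Variable ci has value -2147483269. It should have been within [ 0 .. 9464 ]
# The wording is taken from the log above; other GROMACS versions may differ.
RANGE_ERROR = re.compile(
    r"Variable (\w+) has value (-?\d+)\. "
    r"It should have been within \[ (-?\d+) \.\. (-?\d+) \]"
)

def find_range_errors(log_text):
    """Return (name, value, low, high) for each range-check failure found."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))
        for m in RANGE_ERROR.finditer(log_text)
    ]

sample = """\
Variable ci has value -2147483269. It should have been within [ 0 .. 9464 ]
Variable ci has value -2147483611. It should have been within [ 0 .. 256 ]
"""

errors = find_range_errors(sample)
for name, value, low, high in errors:
    print(f"{name}={value} is outside [{low}, {high}]")
```

Both reported values sit within a few hundred of the 32-bit signed minimum (-2147483648), which at least fits the error text's description of coordinates that could not be placed on the grid.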
Re: List of SMP WUs with the "1 core usage" issue
Project: 2677 (Run 3, Clone 79, Gen 33)
Project: 2677 (Run 23, Clone 74, Gen 28)
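Reports in this thread use three different notations for the same Project/Run/Clone/Gen identifier, which makes it easy to miss a WU that is already on the list. A rough Python sketch for normalizing them follows; `parse_wu` and the three patterns are hypothetical helpers written for the notations seen in this thread only.

```python
import re

# Notations seen in this thread (assumed to be the only ones in use):
#   "Project: 2677 (Run 3, Clone 79, Gen 33)"
#   "2677 R3/C78/G28" or "p2669 R13/C29/G178"
#   "P2671R37C79G78"
PATTERNS = [
    re.compile(r"Project:\s*(\d+)\s*\(Run\s*(\d+),\s*Clone\s*(\d+),\s*Gen\s*(\d+)\)"),
    re.compile(r"\bp?(\d+)\s+R(\d+)/C(\d+)/G(\d+)", re.IGNORECASE),
    re.compile(r"\bP(\d+)R(\d+)C(\d+)G(\d+)\b"),
]

def parse_wu(text):
    """Return (project, run, clone, gen) as ints, or None if nothing matches."""
    for pattern in PATTERNS:
        m = pattern.search(text)
        if m:
            return tuple(int(g) for g in m.groups())
    return None

reports = [
    "Project: 2671 (Run 37, Clone 79, Gen 78)",
    "Tag: P2671R37C79G78",  # same WU as the line above, different notation
    "Got another one. 2677 R5/C21/G30.",
]

# A set removes WUs that were reported more than once.
unique_wus = sorted({parse_wu(r) for r in reports})
print(unique_wus)
```

Comparing tuples rather than raw strings means a repeat report is caught regardless of which notation the poster used.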
Re: List of SMP WUs with the "1 core usage" issue
I'm also experiencing a rash of A2 core WUs failing to proceed after Entering M.D. on core 2.07, as BrokenWolf posted above. I have yet to find a WU running on one core on FAH core 2.07.
-
- Posts: 126
- Joined: Sat Aug 02, 2008 3:08 am
Re: List of SMP WUs with the "1 core usage" issue
Got another one. 2677 R5/C21/G30. It did not appear to be on the list yet.
@ChasR> I think that the 2.07 core cannot handle the busted WU at all, so it barfs. It appears that 2.08 handles WU funniness better: it tries to go but can only process the WU on one core. At least that is my take on it.
BW
Re: List of SMP WUs with the "1 core usage" issue
I've been doing some rudimentary checking to see if the same WUs that run on one core on 2.08 fail to proceed on 2.07. I haven't found a duplicate yet, but that might not mean much since on every 2.07 machine that had a bad WU, I deleted core 2.07 and got 2.08. I deleted the core on the problem 2.08 WUs as well, so most of the logs with the hung 2.07 WUs have been overwritten and are gone. While you are probably correct about the relationship of the failures seen on core 2.07 and 2.08, strictly speaking we don't know they are related. If you find a WU that hangs on 2.07 and runs on one core on 2.08, then you will have convinced me. I'll continue to look, though my core 2.07 machine count is way down. If they do turn out to be related, the problem WUs have been out there for some time.
-
- Posts: 363
- Joined: Tue Feb 12, 2008 7:33 am
- Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
- Location: SE Michigan, USA
Re: List of SMP WUs with the "1 core usage" issue
First time for this WU series, and on a quad-core machine. All prior instances (for me) were on Conroe Core 2s.
Project: 2671 (Run 37, Clone 79, Gen 78) 1920.00 pts (17.678 pt/hr)
compressed_data_size=1513330
Code:
quad8.parkut.com
14:24:01 up 11 days, 11:25, 0 users, load average: 1.00, 1.00, 1.00
20077 99.6 20077 S ? 01:25:05 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -checkpoint 15 -verbose -lifeline 3086 -version 624
20080 0.3 20080 S ? 00:00:18 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -checkpoint 15 -verbose -lifeline 3086 -version 624
20078 0.0 20078 S ? 00:00:04 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -checkpoint 15 -verbose -lifeline 3086 -version 624
20079 0.0 20079 S ? 00:00:04 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 06 -checkpoint 15 -verbose -lifeline 3086 -version 624
...
model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
cpu MHz : 3006.932
cache size : 4096 KB
Memory: 1.96 GB physical, 1.94 GB virtual
...
Client Version 6.24R3
Core: FahCore_a2.exe
Core Version 2.08 (Mon May 18 14:47:42 PDT 2009)
Current Work Unit
-----------------
Name: p2671_IBX in water
Tag: P2671R37C79G78
Download time: August 20 16:58:37
Due time: August 23 16:58:37
Progress: 1% [__________]
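The process listing above is the signature of the problem: one FahCore_a2.exe process near 100% CPU and the other three idle. A small Python sketch of that check follows. It assumes the first two columns of each line are PID and %CPU, as they appear to be in the paste; the function names and the 90%/5% thresholds are arbitrary illustrative choices.

```python
def cpu_by_process(ps_lines, command="FahCore_a2.exe"):
    """Map PID -> %CPU for lines that mention the given command."""
    usage = {}
    for line in ps_lines:
        fields = line.split()
        if len(fields) < 2 or command not in line:
            continue
        try:
            usage[int(fields[0])] = float(fields[1])
        except ValueError:
            continue  # header or otherwise unparseable line
    return usage

def looks_single_core(usage, busy=90.0, idle=5.0):
    """True when exactly one process is busy and all the others are idle."""
    busy_count = sum(1 for pct in usage.values() if pct >= busy)
    idle_count = sum(1 for pct in usage.values() if pct <= idle)
    return len(usage) > 1 and busy_count == 1 and idle_count == len(usage) - 1

# Lines shortened from the listing above (command arguments trimmed).
ps_lines = [
    "20077 99.6 20077 S ? 01:25:05 ./FahCore_a2.exe -dir work/ -nice 19",
    "20080  0.3 20080 S ? 00:00:18 ./FahCore_a2.exe -dir work/ -nice 19",
    "20078  0.0 20078 S ? 00:00:04 ./FahCore_a2.exe -dir work/ -nice 19",
    "20079  0.0 20079 S ? 00:00:04 ./FahCore_a2.exe -dir work/ -nice 19",
]

usage = cpu_by_process(ps_lines)
print(usage, looks_single_core(usage))
```

On a healthy four-process SMP run all four processes would be expected to show substantial CPU use, so the check should return False there.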
Re: List of SMP WUs with the "1 core usage" issue
Just got another one.
Project: 2669 (Run 7, Clone 51, Gen 110)
Cheers
-
- Site Moderator
- Posts: 6349
- Joined: Sun Dec 02, 2007 10:38 am
- Location: Bordeaux, France
Re: List of SMP WUs with the "1 core usage" issue
List updated. Thanks for your reports.
-
- Posts: 136
- Joined: Fri Mar 07, 2008 7:29 pm
- Hardware configuration: C2D E6400 2.13 GHz @ 3.2 GHz
Asus EN8800GTS 640 (G80) @ 660/792/1700 running the 6.23 w/ core11 v1.19
forceware 260.89
Asus P5N-E SLi
2GB 800MHz DDRII (2xCorsair TwinX 512MB)
WinXP 32 SP3
- Location: Prague
Re: List of SMP WUs with the "1 core usage" issue
ChasR wrote: ... If you find a WU that hangs on 2.07 and runs on one core on 2.08, then you will have convinced me.
It's true. When one of these WUs broke my client, which ran 2.07, I deleted the core; the client downloaded 2.08 and restarted the same WU. It did start, but ran on only one core.
-
- Posts: 126
- Joined: Sat Aug 02, 2008 3:08 am
Re: List of SMP WUs with the "1 core usage" issue
Got a repeat here. p2669 R13/C29/G178. Are these not being marked as bad so they do not get sent back out?
Code:
[02:02:56] Connecting to http://171.64.65.56:8080/
[02:03:03] Posted data.
[02:03:03] Initial: 0000; - Receiving payload (expected size: 1508832)
[02:03:04] - Downloaded at ~1473 kB/s
[02:03:04] - Averaged speed for that direction ~1230 kB/s
[02:03:04] + Received work.
[02:03:04] Trying to send all finished work units
[02:03:04] + No unsent completed units remaining.
[02:03:04] + Closed connections
[02:03:04]
[02:03:04] + Processing work unit
[02:03:04] At least 4 processors must be requested.Core required: FahCore_a2.exe
[02:03:04] Core found.
[02:03:05] Working on queue slot 02 [August 21 02:03:05 UTC]
[02:03:05] + Working ...
[02:03:05] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -priority 96 -checkpoint 10 -verbose -lifeline 28432 -version 624'
[02:03:05]
[02:03:05] *------------------------------*
[02:03:05] Folding@Home Gromacs SMP Core
[02:03:05] Version 2.08 (Mon May 18 14:47:42 PDT 2009)
[02:03:05]
[02:03:05] Preparing to commence simulation
[02:03:05] - Ensuring status. Please wait.
[02:03:06] Called DecompressByteArray: compressed_data_size=1508320 data_size=23973757, decompressed_data_size=23973757 diff=0
[02:03:06] - Digital signature verified
[02:03:06]
[02:03:06] Project: 2669 (Run 13, Clone 29, Gen 178)
[02:03:06]
[02:03:06] Assembly optimizations on if available.
[02:03:06] Entering M.D.
[02:03:16] un 13, Clone 29, Gen 178)
[02:03:16]
[02:03:16] Entering M.D.
[02:03:53] Completed 0 out of 250000 steps (0%)
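To make reports like this easier to compare, here is a rough Python sketch that reduces a client log to the few fields people keep quoting: core version, project identifier, how often "Entering M.D." appears, and the last progress line. The `summarize` helper and its patterns are assumptions derived from the log format above; whether a repeated "Entering M.D." line is a reliable marker of a bad WU is not established in this thread.

```python
import re

TIMESTAMP = re.compile(r"^\[\d{2}:\d{2}:\d{2}\]\s?")
PROJECT = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
CORE_VERSION = re.compile(r"^Version (\d+\.\d+)")
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")

def summarize(log_lines):
    """Collect core version, project tuple, M.D. entry count and last progress."""
    summary = {"core_version": None, "project": None,
               "md_entries": 0, "progress": None}
    for raw in log_lines:
        line = TIMESTAMP.sub("", raw).strip()  # drop the [hh:mm:ss] prefix
        if (m := CORE_VERSION.match(line)):
            summary["core_version"] = m.group(1)
        elif (m := PROJECT.search(line)):
            summary["project"] = tuple(int(g) for g in m.groups())
        elif line == "Entering M.D.":
            summary["md_entries"] += 1
        elif (m := PROGRESS.search(line)):
            summary["progress"] = (int(m.group(1)), int(m.group(2)))
    return summary

# Lines copied from the log above.
log = [
    "[02:03:05] Version 2.08 (Mon May 18 14:47:42 PDT 2009)",
    "[02:03:06] Project: 2669 (Run 13, Clone 29, Gen 178)",
    "[02:03:06] Entering M.D.",
    "[02:03:16] un 13, Clone 29, Gen 178)",
    "[02:03:16] Entering M.D.",
    "[02:03:53] Completed 0 out of 250000 steps (0%)",
]

summary = summarize(log)
print(summary)
```

The truncated "un 13, Clone 29, Gen 178)" line is deliberately ignored, since it does not match the full project pattern.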
-
- Posts: 126
- Joined: Sat Aug 02, 2008 3:08 am
Re: List of SMP WUs with the "1 core usage" issue
And another one this morning. 2677 R35/C76/G35
BW
-
- Posts: 93
- Joined: Mon Jan 21, 2008 6:42 pm
Re: List of SMP WUs with the "1 core usage" issue
Project: 2677 (Run 35, Clone 54, Gen 25)
How do I make it go away so I can get different work? I deleted the work folder and queue.dat and got the same project again.