project 7520 being reactivated--post if bad WU's
Moderators: Site Moderators, FAHC Science Team
I'm about to reactivate project 7520, which was pulled when we had a RAID failure that corrupted a bunch of WU's. I've done a restore from backup, so I'm hoping that the work units should be clean again. If you get repeat failures on a WU, however, please post here. If necessary, we'll pull the project again.
-
- Posts: 76
- Joined: Tue Apr 29, 2008 11:02 pm
- Hardware configuration: XP-32 Pro SP-3
Antec NSK-2480 with two Thermaltake 120mm Smart Fans
Gigabyte ga-ma78gm-s2h 780G IGP
BE-2350 with 10.5 x multiplier, 1.250V in BIOS, clock at 272 (2.856GHz)
EVGA 8800 GS
Ninja Mini CPU HS
GeIL 4GB (2 x 2GB) 240-Pin DDR2 SDRAM DDR2 800
Seagate 500GB SATA hard drive
ASUS 18X DVD±R DVD Burner PATA Model DRW-1814BL
Project: 7520 (Run 58, Clone 40, Gen 0)
Code:
*********************** Log Started 2015-06-10T12:41:51Z ***********************
12:41:51:************************* Folding@home Client *************************
12:41:51: Website: http://folding.stanford.edu/
12:41:51: Copyright: (c) 2009-2014 Stanford University
12:41:51: Author: Joseph Coffland <[email protected]>
12:41:51: Args: --child --lifeline 3031 /etc/fahclient/config.xml --run-as
12:41:51: fahclient --pid-file=/var/run/fahclient.pid --daemon
12:41:51: Config: /etc/fahclient/config.xml
12:41:51:******************************** Build ********************************
12:41:51: Version: 7.4.4
12:41:51: Date: Mar 4 2014
12:41:51: Time: 12:02:38
12:41:51: SVN Rev: 4130
12:41:51: Branch: fah/trunk/client
12:41:51: Compiler: GNU 4.4.7
12:41:51: Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
12:41:51: -fno-unsafe-math-optimizations -msse2
12:41:51: Platform: linux2 3.2.0-1-amd64
12:41:51: Bits: 64
12:41:51: Mode: Release
12:41:51:******************************* System ********************************
12:41:51: CPU: AMD Phenom(tm) II X6 1045T Processor
12:41:51: CPU ID: AuthenticAMD Family 16 Model 10 Stepping 0
12:41:51: CPUs: 6
12:41:51: Memory: 7.55GiB
12:41:51:Free Memory: 5.66GiB
12:41:51: Threads: POSIX_THREADS
12:41:51: OS Version: 3.13
12:41:51:Has Battery: false
12:41:51: On Battery: false
12:41:51: UTC Offset: -4
12:41:51: PID: 3119
12:41:51: CWD: /var/lib/fahclient
12:41:51: OS: Linux 3.13.0-32-generic x86_64
12:41:51: OS Arch: AMD64
12:41:51: GPUs: 1
12:41:51: GPU 0: UNSUPPORTED: RS880 [Radeon HD 4250]
12:41:51: CUDA: Not detected
12:41:51:***********************************************************************
12:41:51:<config>
12:41:51: <!-- Client Control -->
12:41:51: <fold-anon v='true'/>
12:41:51:
12:41:51: <!-- Folding Core -->
12:41:51: <checkpoint v='30'/>
12:41:51:
12:41:51: <!-- Folding Slot Configuration -->
12:41:51: <client-type v='advanced'/>
12:41:51:
12:41:51: <!-- HTTP Server -->
12:41:51: <allow v='127.0.0.1 192.168.1.0/24'/>
12:41:51:
12:41:51: <!-- Network -->
12:41:51: <proxy v=':8080'/>
12:41:51:
12:41:51: <!-- Remote Command Server -->
12:41:51: <command-allow-no-pass v='127.0.0.1 192.168.1.0/24'/>
12:41:51:
12:41:51: <!-- Slot Control -->
12:41:51: <power v='full'/>
12:41:51:
12:41:51: <!-- User Information -->
12:41:51: <passkey v='********************************'/>
12:41:51: <team v='31574'/>
12:41:51: <user v='SKeptical_Thinker'/>
12:41:51:
12:41:51: <!-- Work Unit Control -->
12:41:51: <next-unit-percentage v='100'/>
12:41:51:
12:41:51: <!-- Folding Slots -->
12:41:51: <slot id='0' type='CPU'/>
12:41:51:</config>
12:41:51:Switching to user fahclient
12:41:51:Trying to access database...
12:41:51:Successfully acquired database lock
12:41:51:Enabled folding slot 00: READY cpu:6
12:41:51:WU01:FS00:Starting
12:41:51:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 01 -suffix 01 -version 704 -lifeline 3119 -checkpoint 30 -np 6
12:41:51:WU01:FS00:Started FahCore on PID 3128
12:41:51:WU01:FS00:Core PID:3132
12:41:51:WU01:FS00:FahCore 0xa4 started
12:41:51:WU01:FS00:0xa4:
12:41:51:WU01:FS00:0xa4:*------------------------------*
12:41:51:WU01:FS00:0xa4:Folding@Home Gromacs GB Core
12:41:51:WU01:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
12:41:51:WU01:FS00:0xa4:
12:41:51:WU01:FS00:0xa4:Preparing to commence simulation
12:41:51:WU01:FS00:0xa4:- Ensuring status. Please wait.
12:42:00:WU01:FS00:0xa4:- Looking at optimizations...
12:42:00:WU01:FS00:0xa4:- Files status OK
12:42:00:WU01:FS00:0xa4:- Expanded 2021924 -> 3413504 (decompressed 168.8 percent)
12:42:00:WU01:FS00:0xa4:Called DecompressByteArray: compressed_data_size=2021924 data_size=3413504, decompressed_data_size=3413504 diff=0
12:42:00:WU01:FS00:0xa4:- Digital signature verified
12:42:00:WU01:FS00:0xa4:
12:42:00:WU01:FS00:0xa4:Project: 7520 (Run 58, Clone 40, Gen 0)
12:42:00:WU01:FS00:0xa4:
12:42:00:WU01:FS00:0xa4:Assembly optimizations on if available.
12:42:00:WU01:FS00:0xa4:Entering M.D.
I can supply an Ubuntu crash file if you need it.
I deleted the WU and have moved on to Project: 9011 (Run 450, Clone 4, Gen 17) without issue.
Re: Project: 7520 (Run 58, Clone 40, Gen 0)
So far, that WU has not been returned by anyone.
Project 7520, Run 58, Clone 40, Gen 0
No data back from query
The log you posted doesn't show the initialization of the FAHCore. Did it crash before messages of the form
Code:
14:57:55:WU00:FS00:0xa4:Entering M.D.
14:58:02:WU00:FS00:0xa4:Mapping NT from 4 to 4
14:58:05:WU00:FS00:0xa4:Completed 0 out of 200000 steps (0%)
Which drivers are you running?
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 1003
- Joined: Thu May 02, 2013 8:46 pm
- Hardware configuration: Full Time:
2x NVidia GTX 980
1x NVidia GTX 780 Ti
2x 3GHz Core i5 PC (Linux)
Retired:
3.2GHz Core i5 PC (Linux)
3.2GHz Core i5 iMac
2.8GHz Core i5 iMac
2.16GHz Core 2 Duo iMac
2GHz Core 2 Duo MacBook
1.6GHz Core 2 Duo Acer laptop
- Location: Near Oxford, United Kingdom
Re: project 7520 being reactivated--post if bad WU's
Code:
14:28:52:WU00:FS00:0xa4:Project: 7520 (Run 64, Clone 42, Gen 0)
14:28:52:WU00:FS00:0xa4:
14:28:52:WU00:FS00:0xa4:Assembly optimizations on if available.
14:28:52:WU00:FS00:0xa4:Entering M.D.
14:28:58:WU00:FS00:0xa4:Completed 0 out of 1000000 steps (0%)
14:43:48:WU00:FS00:0xa4:Completed 10000 out of 1000000 steps (1%)
14:59:09:WU00:FS00:0xa4:Completed 20000 out of 1000000 steps (2%)
15:14:08:WU00:FS00:0xa4:Completed 30000 out of 1000000 steps (3%)
TPF and PPD are consistent with this WU having twice the number of steps it should have (comparing in HFM against P7520 WUs from long ago). It's running OK so far, but should I dump it?
BTW, it's not listed in psummary.
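For anyone who wants to check a suspect WU the same way, here is a minimal sketch that works out the mean TPF (time per 1% frame) from the progress lines posted above. The regex and the function names are my own for illustration, not part of the FAH client or HFM; it assumes the log format shown in this thread and that the log spans less than 24 hours.

```python
import re
from datetime import datetime, timedelta

# Progress lines copied from the log posted above.
LOG = """\
14:28:58:WU00:FS00:0xa4:Completed 0 out of 1000000 steps (0%)
14:43:48:WU00:FS00:0xa4:Completed 10000 out of 1000000 steps (1%)
14:59:09:WU00:FS00:0xa4:Completed 20000 out of 1000000 steps (2%)
15:14:08:WU00:FS00:0xa4:Completed 30000 out of 1000000 steps (3%)
"""

# Illustrative pattern for "Completed N out of M steps (P%)" lines.
PROGRESS = re.compile(
    r"^(\d\d:\d\d:\d\d):WU\d+:FS\d+:0x\w+:"
    r"Completed (\d+) out of (\d+) steps \((\d+)%\)"
)

def frames(log_text):
    """Return (timestamp, percent, total_steps) for each progress line."""
    found = []
    for line in log_text.splitlines():
        m = PROGRESS.match(line)
        if m:
            stamp = datetime.strptime(m.group(1), "%H:%M:%S")
            found.append((stamp, int(m.group(4)), int(m.group(3))))
    return found

def mean_tpf(log_text):
    """Average time per 1% frame between the first and last progress lines."""
    pts = frames(log_text)
    (t0, p0, _), (t1, p1, _) = pts[0], pts[-1]
    elapsed = t1 - t0
    if elapsed < timedelta(0):  # log crossed midnight
        elapsed += timedelta(days=1)
    return elapsed / (p1 - p0)

tpf = mean_tpf(LOG)
total_steps = frames(LOG)[0][2]
print("total steps:", total_steps)
print("mean TPF:", tpf)
print("projected run time for 100 frames:", tpf * 100)
```

On the lines above this gives a TPF of roughly 15 minutes, so about 25 hours for the whole WU. Comparing that against a TPF recorded for a known-good 500,000-step 7520 WU on the same hardware is what shows the doubling.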
-
- Site Admin
- Posts: 7938
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
- Location: W. MA
Re: project 7520 being reactivated--post if bad WU's
Looking at an old log with a 7520 WU: yes, you remember the step count correctly. It should be 500,000.
As for showing up in psummary, this is a bit odd. It does show up on the old psummary page, and also does on the new psummaryC page, but not the new psummary.
iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Re: project 7520 being reactivated--post if bad WU's
OK, I've dumped it (deleted the slot, so it may get reassigned).
-
- Posts: 363
- Joined: Tue Feb 12, 2008 7:33 am
- Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
- Location: SE Michigan, USA
Re: project 7520 being reactivated--post if bad WU's
I have 12 machines chewing on 7520's. In the log files, I see that two of them have 500,000 steps and the rest have 1,000,000 steps.
1,000,000 steps
Machine 033: Project: 7520 (Run 49, Clone 23, Gen 0)
Machine f1b1: Project: 7520 (Run 1, Clone 16, Gen 0)
Machine 163: Project: 7520 (Run 32, Clone 44, Gen 0)
Machine 105: Project: 7520 (Run 78, Clone 39, Gen 0)
Machine 139: Project: 7520 (Run 85, Clone 42, Gen 0)
Machine 148: Project: 7520 (Run 65, Clone 44, Gen 0)
Machine 093: Project: 7520 (Run 122, Clone 32, Gen 0)
Machine 071: Project: 7520 (Run 25, Clone 10, Gen 0)
Machine 081: Project: 7520 (Run 17, Clone 36, Gen 0)
Machine 091: Project: 7520 (Run 46, Clone 33, Gen 0)
500,000 steps
Machine 145: Project: 7520 (Run 73, Clone 21, Gen 1)
Machine 133: Project: 7520 (Run 120, Clone 5, Gen 130)
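The pattern in that list is easier to see when the WUs are grouped by step count and only the Gen numbers are kept. A minimal sketch using the twelve entries above; the regex and variable names are my own for illustration and it only assumes the "Project: P (Run R, Clone C, Gen G)" format the client prints.

```python
import re

# The twelve WUs reported above, keyed by the step count seen in each log.
REPORT = {
    1_000_000: """\
Machine 033: Project: 7520 (Run 49, Clone 23, Gen 0)
Machine f1b1: Project: 7520 (Run 1, Clone 16, Gen 0)
Machine 163: Project: 7520 (Run 32, Clone 44, Gen 0)
Machine 105: Project: 7520 (Run 78, Clone 39, Gen 0)
Machine 139: Project: 7520 (Run 85, Clone 42, Gen 0)
Machine 148: Project: 7520 (Run 65, Clone 44, Gen 0)
Machine 093: Project: 7520 (Run 122, Clone 32, Gen 0)
Machine 071: Project: 7520 (Run 25, Clone 10, Gen 0)
Machine 081: Project: 7520 (Run 17, Clone 36, Gen 0)
Machine 091: Project: 7520 (Run 46, Clone 33, Gen 0)
""",
    500_000: """\
Machine 145: Project: 7520 (Run 73, Clone 21, Gen 1)
Machine 133: Project: 7520 (Run 120, Clone 5, Gen 130)
""",
}

# Illustrative pattern for the PRCG string.
PRCG = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")

def parse(text):
    """Return a list of (project, run, clone, gen) tuples found in text."""
    return [tuple(int(g) for g in m.groups()) for m in PRCG.finditer(text)]

# Which Gen numbers occur for each step count?
gens_by_steps = {
    steps: sorted({wu[3] for wu in parse(text)})
    for steps, text in REPORT.items()
}
print(gens_by_steps)  # {1000000: [0], 500000: [1, 130]}
```

Every 1,000,000-step WU in this sample is Gen 0 and every 500,000-step WU is a later Gen. Twelve WUs is a small sample, so this only suggests that the doubling is tied to Gen 0.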
Re: project 7520 being reactivated--post if bad WU's
Oh, there was an old problem here IIRC where gen 0 had 2x the number of steps. Let me check that.
Re: Project: 7520 (Run 58, Clone 40, Gen 0)
bruce wrote:
So far, that WU has not been returned by anyone.
Project 7520, Run 58, Clone 40, Gen 0
No data back from query
The log you posted doesn't show the initialization of the FAHCore. Did it crash before messages of the form
Code:
14:57:55:WU00:FS00:0xa4:Entering M.D.
14:58:02:WU00:FS00:0xa4:Mapping NT from 4 to 4
14:58:05:WU00:FS00:0xa4:Completed 0 out of 200000 steps (0%)
Which drivers are you running?
The log that I posted is complete. That was all of the output in the log up to the time of the crash.
Which drivers are you talking about?
Moderator, please move this thread to: project 7520 being reactivated--post if bad WU's
Mod edit: Topics merged.
thanks
Re: Project: 7520 (Run 58, Clone 40, Gen 0)
The WU was corrupted and has been regenerated. Thanks.
Re: Project: 7520 (Run 58, Clone 40, Gen 0)
We further regenerated all gen0 clones in Run 58 (and corrected their number of steps).
Re: project 7520 being reactivated--post if bad WU's
An observation: the million-step WUs are reporting only about one third of the PPD that the half-million-step WUs report on identical hardware.
And they take much longer to process.
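One possible explanation for "about one third" rather than "one half", offered as a sketch and not as a confirmed diagnosis: if the doubled WUs keep the same base points and deadline, and the quick return bonus applies, then doubling the run time cuts the bonus factor by sqrt(2) as well as spreading the points over twice the time. The bonus formula below is the one in the published FAH points documentation; the base points, k factor and deadline are made-up values, since nothing in this thread gives the real ones for 7520, and only the ratio matters.

```python
from math import sqrt

def ppd(base_points, k, deadline_days, elapsed_days):
    """Points per day using the published quick-return-bonus formula:
    points = base * max(1, sqrt(k * deadline / elapsed))."""
    bonus = max(1.0, sqrt(k * deadline_days / elapsed_days))
    return base_points * bonus / elapsed_days

# Hypothetical project parameters (NOT the real values for P7520).
BASE, K, DEADLINE = 1000.0, 0.75, 10.0

normal = ppd(BASE, K, DEADLINE, elapsed_days=0.5)   # 500,000-step WU
doubled = ppd(BASE, K, DEADLINE, elapsed_days=1.0)  # same WU with 2x steps

ratio = doubled / normal
print(f"PPD ratio, doubled vs normal: {ratio:.3f}")
```

Under those assumptions the ratio is 1/(2*sqrt(2)), about 0.35, whatever the base points or k factor are, as long as the bonus factor stays above 1 in both cases. That is in line with the "nearly one third" reported above, and it also assumes run time scales linearly with the step count.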
Re: project 7520 being reactivated--post if bad WU's
Project: 7520 (81, 38, 0) has 1000000 steps. Processing time 4+ days.
Normal WU turnaround time on this machine 8-13 hours.
Re: project 7520 being reactivated--post if bad WU's
We're working on this and have cleaned up a large number of WUs. The tricky part is that we don't want to change the number of steps on a WU that's already assigned; that can screw up the gen=n+1 WU.
Re: project 7520 being reactivated--post if bad WU's
Is there a way to detect if a WU has been assigned but not yet returned? Apparently if Gen 0 has been returned, that trajectory will continue normally.
Of course there's no guarantee that a WU that has been assigned will be returned, so that complicates the issue. What happens if you suspend assignments of trajectories for which Gen 0 has NOT been returned and then just wait until all of those WUs have either expired or been returned? Can you do that?
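To make the suggestion concrete, here is a sketch of the per-trajectory decision it implies. Everything in it is hypothetical: the `Gen0Status` fields and the `Action` names are invented for illustration and do not describe how the work server actually tracks assignments.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    CONTINUE = "Gen 0 is back; let the trajectory continue normally"
    HOLD = "Gen 0 is out with a donor; suspend assignment and wait"
    FIX_STEPS = "nobody is working on Gen 0; safe to correct the step count"

@dataclass
class Gen0Status:
    """Hypothetical view of one trajectory's Gen 0 WU."""
    returned: bool   # a result for Gen 0 has been uploaded
    assigned: bool   # Gen 0 has been handed to a client
    expired: bool    # the assignment has passed its deadline

def decide(status: Gen0Status) -> Action:
    if status.returned:
        return Action.CONTINUE
    if status.assigned and not status.expired:
        # Changing the step count now could break the gen=n+1 WU.
        return Action.HOLD
    return Action.FIX_STEPS

print(decide(Gen0Status(returned=False, assigned=True, expired=False)).name)
```

A trajectory on HOLD would be re-checked later and moves to CONTINUE or FIX_STEPS once its Gen 0 WU is either returned or expires.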