project 7520 being reactivated--post if bad WU's

kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Post by kasson »

I'm about to reactivate project 7520, which was pulled when we had a RAID failure that corrupted a bunch of WUs. I've done a restore from backup, so I'm hoping that the work units should be clean again. If you get repeated failures on a WU, however, please post here. If necessary, we'll pull the project again.
SKeptical_Thinker
Posts: 76
Joined: Tue Apr 29, 2008 11:02 pm
Hardware configuration: XP-32 Pro SP-3
Antec NSK-2480 with two Thermaltake 120mm Smart Fans
Gigabyte ga-ma78gm-s2h 780G IGP
BE-2350 with 10.5 x multiplier, 1.250V in BIOS, clock at 272 (2.856GHz)
EVGA 8800 GS
Ninja Mini CPU HS
GeIL 4GB (2 x 2GB) 240-Pin DDR2 SDRAM DDR2 800
Seagate 500GB SATA hard drive
ASUS 18X DVD±R DVD Burner PATA Model DRW-1814BL

Project: 7520 (Run 58, Clone 40, Gen 0)

Post by SKeptical_Thinker »

Code:

*********************** Log Started 2015-06-10T12:41:51Z ***********************
12:41:51:************************* Folding@home Client *************************
12:41:51:    Website: http://folding.stanford.edu/
12:41:51:  Copyright: (c) 2009-2014 Stanford University
12:41:51:     Author: Joseph Coffland <[email protected]>
12:41:51:       Args: --child --lifeline 3031 /etc/fahclient/config.xml --run-as
12:41:51:             fahclient --pid-file=/var/run/fahclient.pid --daemon
12:41:51:     Config: /etc/fahclient/config.xml
12:41:51:******************************** Build ********************************
12:41:51:    Version: 7.4.4
12:41:51:       Date: Mar 4 2014
12:41:51:       Time: 12:02:38
12:41:51:    SVN Rev: 4130
12:41:51:     Branch: fah/trunk/client
12:41:51:   Compiler: GNU 4.4.7
12:41:51:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
12:41:51:             -fno-unsafe-math-optimizations -msse2
12:41:51:   Platform: linux2 3.2.0-1-amd64
12:41:51:       Bits: 64
12:41:51:       Mode: Release
12:41:51:******************************* System ********************************
12:41:51:        CPU: AMD Phenom(tm) II X6 1045T Processor
12:41:51:     CPU ID: AuthenticAMD Family 16 Model 10 Stepping 0
12:41:51:       CPUs: 6
12:41:51:     Memory: 7.55GiB
12:41:51:Free Memory: 5.66GiB
12:41:51:    Threads: POSIX_THREADS
12:41:51: OS Version: 3.13
12:41:51:Has Battery: false
12:41:51: On Battery: false
12:41:51: UTC Offset: -4
12:41:51:        PID: 3119
12:41:51:        CWD: /var/lib/fahclient
12:41:51:         OS: Linux 3.13.0-32-generic x86_64
12:41:51:    OS Arch: AMD64
12:41:51:       GPUs: 1
12:41:51:      GPU 0: UNSUPPORTED: RS880 [Radeon HD 4250]
12:41:51:       CUDA: Not detected
12:41:51:***********************************************************************
12:41:51:<config>
12:41:51:  <!-- Client Control -->
12:41:51:  <fold-anon v='true'/>
12:41:51:
12:41:51:  <!-- Folding Core -->
12:41:51:  <checkpoint v='30'/>
12:41:51:
12:41:51:  <!-- Folding Slot Configuration -->
12:41:51:  <client-type v='advanced'/>
12:41:51:
12:41:51:  <!-- HTTP Server -->
12:41:51:  <allow v='127.0.0.1 192.168.1.0/24'/>
12:41:51:
12:41:51:  <!-- Network -->
12:41:51:  <proxy v=':8080'/>
12:41:51:
12:41:51:  <!-- Remote Command Server -->
12:41:51:  <command-allow-no-pass v='127.0.0.1 192.168.1.0/24'/>
12:41:51:
12:41:51:  <!-- Slot Control -->
12:41:51:  <power v='full'/>
12:41:51:
12:41:51:  <!-- User Information -->
12:41:51:  <passkey v='********************************'/>
12:41:51:  <team v='31574'/>
12:41:51:  <user v='SKeptical_Thinker'/>
12:41:51:
12:41:51:  <!-- Work Unit Control -->
12:41:51:  <next-unit-percentage v='100'/>
12:41:51:
12:41:51:  <!-- Folding Slots -->
12:41:51:  <slot id='0' type='CPU'/>
12:41:51:</config>
12:41:51:Switching to user fahclient
12:41:51:Trying to access database...
12:41:51:Successfully acquired database lock
12:41:51:Enabled folding slot 00: READY cpu:6
12:41:51:WU01:FS00:Starting
12:41:51:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 01 -suffix 01 -version 704 -lifeline 3119 -checkpoint 30 -np 6
12:41:51:WU01:FS00:Started FahCore on PID 3128
12:41:51:WU01:FS00:Core PID:3132
12:41:51:WU01:FS00:FahCore 0xa4 started
12:41:51:WU01:FS00:0xa4:
12:41:51:WU01:FS00:0xa4:*------------------------------*
12:41:51:WU01:FS00:0xa4:Folding@Home Gromacs GB Core
12:41:51:WU01:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
12:41:51:WU01:FS00:0xa4:
12:41:51:WU01:FS00:0xa4:Preparing to commence simulation
12:41:51:WU01:FS00:0xa4:- Ensuring status. Please wait.
12:42:00:WU01:FS00:0xa4:- Looking at optimizations...
12:42:00:WU01:FS00:0xa4:- Files status OK
12:42:00:WU01:FS00:0xa4:- Expanded 2021924 -> 3413504 (decompressed 168.8 percent)
12:42:00:WU01:FS00:0xa4:Called DecompressByteArray: compressed_data_size=2021924 data_size=3413504, decompressed_data_size=3413504 diff=0
12:42:00:WU01:FS00:0xa4:- Digital signature verified
12:42:00:WU01:FS00:0xa4:
12:42:00:WU01:FS00:0xa4:Project: 7520 (Run 58, Clone 40, Gen 0)
12:42:00:WU01:FS00:0xa4:
12:42:00:WU01:FS00:0xa4:Assembly optimizations on if available.
12:42:00:WU01:FS00:0xa4:Entering M.D.
This WU starts, grows to over 40GB of virtual memory, and then crashes.

I can supply an Ubuntu crash file if you need it.

I deleted the WU and have moved on to Project: 9011 (Run 450, Clone 4, Gen 17) without issue.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 7520 (Run 58, Clone 40, Gen 0)

Post by bruce »

So far, that WU has not been returned by anyone.

Project 7520, Run 58, Clone 40, Gen 0
No data back from query

The log you posted doesn't show the initialization of the FAHCore. Did it crash before messages of the form

Code:

14:57:55:WU00:FS00:0xa4:Entering M.D.
14:58:02:WU00:FS00:0xa4:Mapping NT from 4 to 4 
14:58:05:WU00:FS00:0xa4:Completed 0 out of 200000 steps (0%)
Which drivers are you running?
billford
Posts: 1003
Joined: Thu May 02, 2013 8:46 pm
Hardware configuration: Full Time:

2x NVidia GTX 980
1x NVidia GTX 780 Ti
2x 3GHz Core i5 PC (Linux)

Retired:

3.2GHz Core i5 PC (Linux)
3.2GHz Core i5 iMac
2.8GHz Core i5 iMac
2.16GHz Core 2 Duo iMac
2GHz Core 2 Duo MacBook
1.6GHz Core 2 Duo Acer laptop
Location: Near Oxford, United Kingdom

Re: project 7520 being reactivated--post if bad WU's

Post by billford »

Code:

14:28:52:WU00:FS00:0xa4:Project: 7520 (Run 64, Clone 42, Gen 0)
14:28:52:WU00:FS00:0xa4:
14:28:52:WU00:FS00:0xa4:Assembly optimizations on if available.
14:28:52:WU00:FS00:0xa4:Entering M.D.
14:28:58:WU00:FS00:0xa4:Completed 0 out of 1000000 steps  (0%)
14:43:48:WU00:FS00:0xa4:Completed 10000 out of 1000000 steps  (1%)
14:59:09:WU00:FS00:0xa4:Completed 20000 out of 1000000 steps  (2%)
15:14:08:WU00:FS00:0xa4:Completed 30000 out of 1000000 steps  (3%)
If I remember correctly it should only have 500,000 steps?

TPF and PPD are consistent with it having twice the number of steps it should have (looking in HFM at P7520s from long ago). It's running OK so far, but should I dump it?
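As a rough check on that timing, the progress lines in the log excerpt above can be turned into a per-step rate and a projected run time. This is only a minimal sketch: the function names are made up for illustration, and it assumes all timestamps fall on the same day.

```python
import re
from datetime import datetime

# Progress lines copied from the log excerpt in this post.
LOG = """\
14:28:58:WU00:FS00:0xa4:Completed 0 out of 1000000 steps  (0%)
14:43:48:WU00:FS00:0xa4:Completed 10000 out of 1000000 steps  (1%)
14:59:09:WU00:FS00:0xa4:Completed 20000 out of 1000000 steps  (2%)
15:14:08:WU00:FS00:0xa4:Completed 30000 out of 1000000 steps  (3%)
"""

PROGRESS_RE = re.compile(r"^(\d\d:\d\d:\d\d):.*Completed (\d+) out of (\d+) steps")


def parse_progress(log):
    """Return (timestamp, steps_done, steps_total) for each progress line."""
    rows = []
    for line in log.splitlines():
        match = PROGRESS_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            rows.append((stamp, int(match.group(2)), int(match.group(3))))
    return rows


def seconds_per_step(rows):
    """Average wall-clock seconds per step between first and last line.

    Assumes the log does not cross midnight (timestamps carry no date).
    """
    (t0, s0, _), (t1, s1, _) = rows[0], rows[-1]
    return (t1 - t0).total_seconds() / (s1 - s0)


rows = parse_progress(LOG)
rate = seconds_per_step(rows)
total_steps = rows[-1][2]

for steps in (500_000, total_steps):
    hours = rate * steps / 3600
    print(f"{steps} steps: about {hours:.1f} hours")
```

On these four lines the rate works out to roughly 0.09 seconds per step, so 1,000,000 steps would take about 25 hours on this machine, against about 12.5 hours for 500,000.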

BTW, it's not listed in psummary.
Joe_H
Site Admin
Posts: 7937
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: project 7520 being reactivated--post if bad WU's

Post by Joe_H »

Looking at an old log with a 7520 WU, yes you remember that count for steps correctly - should be 500,000.

As for showing up in psummary, this is a bit odd. It does show up on the old psummary page, and also does on the new psummaryC page, but not the new psummary.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
billford
Posts: 1003
Joined: Thu May 02, 2013 8:46 pm

Re: project 7520 being reactivated--post if bad WU's

Post by billford »

OK, I've dumped it (deleted the slot, so it may get reassigned).
parkut
Posts: 363
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Re: project 7520 being reactivated--post if bad WU's

Post by parkut »

I have 12 machines chewing on 7520s. In the log files, I see that two of them have 500,000 steps; the rest have 1,000,000 steps.

1,000,000 steps
Machine 033: Project: 7520 (Run 49, Clone 23, Gen 0)
Machine f1b1: Project: 7520 (Run 1, Clone 16, Gen 0)
Machine 163: Project: 7520 (Run 32, Clone 44, Gen 0)
Machine 105: Project: 7520 (Run 78, Clone 39, Gen 0)
Machine 139: Project: 7520 (Run 85, Clone 42, Gen 0)
Machine 148: Project: 7520 (Run 65, Clone 44, Gen 0)
Machine 093: Project: 7520 (Run 122, Clone 32, Gen 0)
Machine 071: Project: 7520 (Run 25, Clone 10, Gen 0)
Machine 081: Project: 7520 (Run 17, Clone 36, Gen 0)
Machine 091: Project: 7520 (Run 46, Clone 33, Gen 0)

500,000 steps
Machine 145: Project: 7520 (Run 73, Clone 21, Gen 1)
Machine 133: Project: 7520 (Run 120, Clone 5, Gen 130)
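The pattern in the two lists above can be checked mechanically by grouping the reported step counts by generation. The following is a minimal sketch using only the twelve work units listed here; the helper name `steps_by_generation` is made up for illustration.

```python
import re
from collections import defaultdict

# (steps reported in the log, work-unit line) pairs copied from the lists above.
REPORTS = [
    (1_000_000, "Project: 7520 (Run 49, Clone 23, Gen 0)"),
    (1_000_000, "Project: 7520 (Run 1, Clone 16, Gen 0)"),
    (1_000_000, "Project: 7520 (Run 32, Clone 44, Gen 0)"),
    (1_000_000, "Project: 7520 (Run 78, Clone 39, Gen 0)"),
    (1_000_000, "Project: 7520 (Run 85, Clone 42, Gen 0)"),
    (1_000_000, "Project: 7520 (Run 65, Clone 44, Gen 0)"),
    (1_000_000, "Project: 7520 (Run 122, Clone 32, Gen 0)"),
    (1_000_000, "Project: 7520 (Run 25, Clone 10, Gen 0)"),
    (1_000_000, "Project: 7520 (Run 17, Clone 36, Gen 0)"),
    (1_000_000, "Project: 7520 (Run 46, Clone 33, Gen 0)"),
    (500_000, "Project: 7520 (Run 73, Clone 21, Gen 1)"),
    (500_000, "Project: 7520 (Run 120, Clone 5, Gen 130)"),
]

WU_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def steps_by_generation(reports):
    """Map 'gen 0' / 'gen > 0' to the set of step counts seen for that group."""
    seen = defaultdict(set)
    for steps, line in reports:
        match = WU_RE.search(line)
        if match is None:
            continue  # skip anything that is not a work-unit line
        gen = int(match.group(4))
        seen["gen 0" if gen == 0 else "gen > 0"].add(steps)
    return dict(seen)


summary = steps_by_generation(REPORTS)
print(summary)
```

In this sample every Gen 0 unit has 1,000,000 steps and every later generation has 500,000, although twelve work units is a small sample.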
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: project 7520 being reactivated--post if bad WU's

Post by kasson »

Oh, there was an old problem here IIRC where gen 0 had 2x the number of steps. Let me check that.
SKeptical_Thinker
Posts: 76
Joined: Tue Apr 29, 2008 11:02 pm

Re: Project: 7520 (Run 58, Clone 40, Gen 0)

Post by SKeptical_Thinker »

bruce wrote: So far, that WU has not been returned by anyone.

Project 7520, Run 58, Clone 40, Gen 0
No data back from query

The log you posted doesn't show the initialization of the FAHCore. Did it crash before messages of the form

Code:

14:57:55:WU00:FS00:0xa4:Entering M.D.
14:58:02:WU00:FS00:0xa4:Mapping NT from 4 to 4 
14:58:05:WU00:FS00:0xa4:Completed 0 out of 200000 steps (0%)
Which drivers are you running?
The log that I posted is complete. That was all of the output in the log up to the time of the crash.

Which drivers are you talking about?

Moderator, please move this thread to: project 7520 being reactivated--post if bad WU's
Mod edit: Topics merged.

thanks
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: Project: 7520 (Run 58, Clone 40, Gen 0)

Post by kasson »

The WU was corrupted and has been regenerated. Thanks.
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: Project: 7520 (Run 58, Clone 40, Gen 0)

Post by kasson »

We further regenerated all gen0 clones in Run 58 (and corrected their number of steps).
parkut
Posts: 363
Joined: Tue Feb 12, 2008 7:33 am

Re: project 7520 being reactivated--post if bad WU's

Post by parkut »

An observation: the million-step WUs are reporting nearly one third of the PPD that the half-million-step WUs report on identical hardware.

They also take much longer to process.
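One possible explanation for a figure near one third, offered only as a sketch under stated assumptions: if the double-length WUs carry the same base credit and deadline as the normal ones, and credit follows the quick-return bonus formula as I understand it from the Folding@home points documentation (base points times the square root of k times deadline over elapsed time, with a floor of 1), then doubling the run time cuts PPD by a factor of 2 x sqrt(2). The base points, k factor, deadline and run time below are hypothetical values, not project 7520's actual settings.

```python
import math


def ppd(base_points, k, deadline_days, elapsed_days):
    """Points per day for one WU under the assumed quick-return bonus formula."""
    bonus = max(1.0, math.sqrt(k * deadline_days / elapsed_days))
    return base_points * bonus / elapsed_days


# Hypothetical values chosen only to illustrate the ratio.
BASE_POINTS = 1000.0
K_FACTOR = 2.0
DEADLINE_DAYS = 6.0
NORMAL_DAYS = 0.5  # run time of a normal-length WU

normal = ppd(BASE_POINTS, K_FACTOR, DEADLINE_DAYS, NORMAL_DAYS)
doubled = ppd(BASE_POINTS, K_FACTOR, DEADLINE_DAYS, 2 * NORMAL_DAYS)
ratio = doubled / normal
print(f"PPD ratio (double-length / normal): {ratio:.3f}")
```

As long as the bonus stays above its floor of 1 in both cases, the ratio is 1 / (2 x sqrt(2)), about 0.354, whatever base points, k and deadline are used.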
rewron
Posts: 12
Joined: Fri Nov 04, 2011 8:25 pm

Re: project 7520 being reactivated--post if bad WU's

Post by rewron »

Project: 7520 (Run 81, Clone 38, Gen 0) has 1,000,000 steps. Processing time is 4+ days.

Normal WU turnaround time on this machine is 8-13 hours.
kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: project 7520 being reactivated--post if bad WU's

Post by kasson »

We're working on this and have cleaned up a large number of WUs. The tricky part is that we don't want to change the number of steps on a WU that's already assigned; that can screw up the gen=n+1 WU.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: project 7520 being reactivated--post if bad WU's

Post by bruce »

Is there a way to detect if a WU has been assigned but not yet returned? Apparently if Gen 0 has been returned, that trajectory will continue normally.

Of course there's no guarantee that a WU that has been assigned will be returned, so that complicates the issue. What happens if you suspend assignments of trajectories for which Gen 0 has NOT been returned and then just wait until all of those WUs have either expired or been returned? Can you do that?
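To make the question concrete, the idea can be sketched as a per-trajectory decision. Everything here is hypothetical: the state names, the `Trajectory` record and the action strings are invented for illustration and do not reflect how the work server actually stores or handles assignments.

```python
from dataclasses import dataclass
from enum import Enum


class Gen0State(Enum):
    """Hypothetical states of a trajectory's Gen 0 work unit."""
    UNASSIGNED = "unassigned"
    ASSIGNED = "assigned"    # sent to a client, not yet returned or expired
    RETURNED = "returned"
    EXPIRED = "expired"


@dataclass
class Trajectory:
    run: int
    clone: int
    gen0_state: Gen0State


def next_action(traj):
    """Decide what to do with one trajectory under the proposed scheme."""
    if traj.gen0_state is Gen0State.RETURNED:
        return "continue"    # Gen 1 and later proceed normally
    if traj.gen0_state is Gen0State.ASSIGNED:
        return "hold"        # do not touch the step count; wait for return or expiry
    # Unassigned or expired: nobody is working on it, so it can be fixed safely.
    return "regenerate"


examples = [
    Trajectory(run=73, clone=21, gen0_state=Gen0State.RETURNED),
    Trajectory(run=81, clone=38, gen0_state=Gen0State.ASSIGNED),
    Trajectory(run=58, clone=40, gen0_state=Gen0State.EXPIRED),
]
for traj in examples:
    print(traj.run, traj.clone, next_action(traj))
```

The "hold" branch is the part that depends on the server being able to tell an assigned WU from a returned one.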