Project: 9408 (Run 628, Clone 0, Gen 0) [RESOLVED]

Posted: Fri Apr 18, 2014 10:40 am
by billford
Got to the machine this morning and found this:

Code:

23:00:05:WU01:FS01:0x17:Project: 9408 (Run 628, Clone 0, Gen 0)
23:00:05:WU01:FS01:0x17:Unit: 0x000000000a3b1e5c5342df4fb1e621a8
23:00:05:WU01:FS01:0x17:CPU: 0x00000000000000000000000000000000
23:00:05:WU01:FS01:0x17:Machine: 1
23:00:05:WU01:FS01:0x17:Reading tar file system.xml
23:00:05:WU01:FS01:0x17:Reading tar file integrator.xml
23:00:05:WU01:FS01:0x17:Reading tar file state.xml
23:00:05:WU01:FS01:0x17:Reading tar file core.xml
23:00:05:WU01:FS01:0x17:Digital signatures verified
23:01:36:WU01:FS01:0x17:Completed 0 out of 5000000 steps (0%)
23:07:01:WU01:FS01:0x17:Completed 50000 out of 5000000 steps (1%)
23:12:23:WU01:FS01:0x17:Completed 100000 out of 5000000 steps (2%)
23:17:46:WU01:FS01:0x17:Completed 150000 out of 5000000 steps (3%)
23:23:08:WU01:FS01:0x17:Completed 200000 out of 5000000 steps (4%)
23:28:31:WU01:FS01:0x17:Completed 250000 out of 5000000 steps (5%)
23:33:53:WU01:FS01:0x17:Completed 300000 out of 5000000 steps (6%)
23:39:15:WU01:FS01:0x17:Completed 350000 out of 5000000 steps (7%)
23:44:37:WU01:FS01:0x17:Completed 400000 out of 5000000 steps (8%)
23:50:00:WU01:FS01:0x17:Completed 450000 out of 5000000 steps (9%)
23:55:22:WU01:FS01:0x17:Completed 500000 out of 5000000 steps (10%)
******************************* Date: 2014-04-18 *******************************
00:00:45:WU01:FS01:0x17:Completed 550000 out of 5000000 steps (11%)
00:06:07:WU01:FS01:0x17:Completed 600000 out of 5000000 steps (12%)
00:11:29:WU01:FS01:0x17:Completed 650000 out of 5000000 steps (13%)
00:16:51:WU01:FS01:0x17:Completed 700000 out of 5000000 steps (14%)
00:22:14:WU01:FS01:0x17:Completed 750000 out of 5000000 steps (15%)
00:27:36:WU01:FS01:0x17:Completed 800000 out of 5000000 steps (16%)
00:32:58:WU01:FS01:0x17:Completed 850000 out of 5000000 steps (17%)
00:38:21:WU01:FS01:0x17:Completed 900000 out of 5000000 steps (18%)
00:43:43:WU01:FS01:0x17:Completed 950000 out of 5000000 steps (19%)
00:49:05:WU01:FS01:0x17:Completed 1000000 out of 5000000 steps (20%)
00:54:28:WU01:FS01:0x17:Completed 1050000 out of 5000000 steps (21%)
00:59:50:WU01:FS01:0x17:Completed 1100000 out of 5000000 steps (22%)
01:05:13:WU01:FS01:0x17:Completed 1150000 out of 5000000 steps (23%)
01:10:35:WU01:FS01:0x17:Completed 1200000 out of 5000000 steps (24%)
01:15:57:WU01:FS01:0x17:Completed 1250000 out of 5000000 steps (25%)
01:21:20:WU01:FS01:0x17:Completed 1300000 out of 5000000 steps (26%)
01:26:42:WU01:FS01:0x17:Completed 1350000 out of 5000000 steps (27%)
01:32:05:WU01:FS01:0x17:Completed 1400000 out of 5000000 steps (28%)
01:37:28:WU01:FS01:0x17:Completed 1450000 out of 5000000 steps (29%)
01:42:50:WU01:FS01:0x17:Completed 1500000 out of 5000000 steps (30%)
01:50:12:WU01:FS01:0x17:Completed 1550000 out of 5000000 steps (31%)
01:50:18:WU01:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint
02:00:58:WU01:FS01:0x17:Completed 1600000 out of 5000000 steps (32%)
02:06:20:WU01:FS01:0x17:Completed 1650000 out of 5000000 steps (33%)
02:11:42:WU01:FS01:0x17:Completed 1700000 out of 5000000 steps (34%)
02:17:05:WU01:FS01:0x17:Completed 1750000 out of 5000000 steps (35%)
02:22:27:WU01:FS01:0x17:Completed 1800000 out of 5000000 steps (36%)
02:27:49:WU01:FS01:0x17:Completed 1850000 out of 5000000 steps (37%)
02:33:11:WU01:FS01:0x17:Completed 1900000 out of 5000000 steps (38%)
02:38:33:WU01:FS01:0x17:Completed 1950000 out of 5000000 steps (39%)
02:43:56:WU01:FS01:0x17:Completed 2000000 out of 5000000 steps (40%)
02:49:18:WU01:FS01:0x17:Completed 2050000 out of 5000000 steps (41%)
02:54:40:WU01:FS01:0x17:Completed 2100000 out of 5000000 steps (42%)
03:00:02:WU01:FS01:0x17:Completed 2150000 out of 5000000 steps (43%)
03:05:24:WU01:FS01:0x17:Completed 2200000 out of 5000000 steps (44%)
03:24:13:WU01:FS01:0x17:Completed 2250000 out of 5000000 steps (45%)
03:24:19:WU01:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint
03:34:58:WU01:FS01:0x17:Completed 2300000 out of 5000000 steps (46%)
03:40:21:WU01:FS01:0x17:Completed 2350000 out of 5000000 steps (47%)
03:45:43:WU01:FS01:0x17:Completed 2400000 out of 5000000 steps (48%)
03:51:05:WU01:FS01:0x17:Completed 2450000 out of 5000000 steps (49%)
03:56:28:WU01:FS01:0x17:Completed 2500000 out of 5000000 steps (50%)
04:01:50:WU01:FS01:0x17:Completed 2550000 out of 5000000 steps (51%)
04:07:12:WU01:FS01:0x17:Completed 2600000 out of 5000000 steps (52%)
04:12:34:WU01:FS01:0x17:Completed 2650000 out of 5000000 steps (53%)
04:17:56:WU01:FS01:0x17:Completed 2700000 out of 5000000 steps (54%)
04:23:18:WU01:FS01:0x17:Completed 2750000 out of 5000000 steps (55%)
04:28:40:WU01:FS01:0x17:Completed 2800000 out of 5000000 steps (56%)
04:34:03:WU01:FS01:0x17:Completed 2850000 out of 5000000 steps (57%)
04:39:25:WU01:FS01:0x17:Completed 2900000 out of 5000000 steps (58%)
04:44:47:WU01:FS01:0x17:Completed 2950000 out of 5000000 steps (59%)
04:50:09:WU01:FS01:0x17:Completed 3000000 out of 5000000 steps (60%)
04:55:32:WU01:FS01:0x17:Completed 3050000 out of 5000000 steps (61%)
05:00:54:WU01:FS01:0x17:Completed 3100000 out of 5000000 steps (62%)
05:06:17:WU01:FS01:0x17:Completed 3150000 out of 5000000 steps (63%)
05:11:39:WU01:FS01:0x17:Completed 3200000 out of 5000000 steps (64%)
05:17:01:WU01:FS01:0x17:Completed 3250000 out of 5000000 steps (65%)
05:22:24:WU01:FS01:0x17:Completed 3300000 out of 5000000 steps (66%)
05:27:46:WU01:FS01:0x17:Completed 3350000 out of 5000000 steps (67%)
05:33:08:WU01:FS01:0x17:Completed 3400000 out of 5000000 steps (68%)
05:38:30:WU01:FS01:0x17:Completed 3450000 out of 5000000 steps (69%)
05:43:52:WU01:FS01:0x17:Completed 3500000 out of 5000000 steps (70%)
05:49:14:WU01:FS01:0x17:Completed 3550000 out of 5000000 steps (71%)
05:54:37:WU01:FS01:0x17:Completed 3600000 out of 5000000 steps (72%)
05:59:59:WU01:FS01:0x17:Completed 3650000 out of 5000000 steps (73%)
******************************* Date: 2014-04-18 *******************************
06:05:21:WU01:FS01:0x17:Completed 3700000 out of 5000000 steps (74%)
06:10:43:WU01:FS01:0x17:Completed 3750000 out of 5000000 steps (75%)
06:16:05:WU01:FS01:0x17:Completed 3800000 out of 5000000 steps (76%)
06:21:27:WU01:FS01:0x17:Completed 3850000 out of 5000000 steps (77%)
06:26:50:WU01:FS01:0x17:Completed 3900000 out of 5000000 steps (78%)
06:32:12:WU01:FS01:0x17:Completed 3950000 out of 5000000 steps (79%)
06:37:34:WU01:FS01:0x17:Completed 4000000 out of 5000000 steps (80%)
06:42:56:WU01:FS01:0x17:Completed 4050000 out of 5000000 steps (81%)
06:46:07:WU01:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
06:46:08:WU01:FS01:Starting
06:46:08:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/www.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_17.fah/FahCore_17 -dir 01 -suffix 01 -version 703 -lifeline 1300 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
06:46:08:WU01:FS01:Started FahCore on PID 8095
06:46:08:WU01:FS01:Core PID:8099
06:46:08:WU01:FS01:FahCore 0x17 started
06:46:08:WU01:FS01:0x17:*********************** Log Started 2014-04-18T06:46:08Z ***********************
06:46:08:WU01:FS01:0x17:Project: 9408 (Run 628, Clone 0, Gen 0)
06:46:08:WU01:FS01:0x17:Unit: 0x000000000a3b1e5c5342df4fb1e621a8
06:46:08:WU01:FS01:0x17:CPU: 0x00000000000000000000000000000000
06:46:08:WU01:FS01:0x17:Machine: 1
06:46:08:WU01:FS01:0x17:Digital signatures verified
06:46:08:WU01:FS01:0x17:  Found a checkpoint file
06:47:34:WU01:FS01:0x17:Completed 4050000 out of 5000000 steps (81%)
06:53:14:WU01:FS01:0x17:Completed 4100000 out of 5000000 steps (82%)
06:58:50:WU01:FS01:0x17:Completed 4150000 out of 5000000 steps (83%)
07:31:26:WU01:FS01:0x17:Completed 4200000 out of 5000000 steps (84%)
07:31:31:WU01:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint
08:09:38:WU01:FS01:0x17:Completed 4250000 out of 5000000 steps (85%)
08:09:44:WU01:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint
08:20:58:WU01:FS01:0x17:Completed 4300000 out of 5000000 steps (86%)
08:26:33:WU01:FS01:0x17:Completed 4350000 out of 5000000 steps (87%)
08:32:09:WU01:FS01:0x17:Completed 4400000 out of 5000000 steps (88%)
08:37:45:WU01:FS01:0x17:Completed 4450000 out of 5000000 steps (89%)
08:43:21:WU01:FS01:0x17:Completed 4500000 out of 5000000 steps (90%)
08:48:57:WU01:FS01:0x17:Completed 4550000 out of 5000000 steps (91%)
08:54:33:WU01:FS01:0x17:Completed 4600000 out of 5000000 steps (92%)
09:00:08:WU01:FS01:0x17:Completed 4650000 out of 5000000 steps (93%)
09:05:44:WU01:FS01:0x17:Completed 4700000 out of 5000000 steps (94%)
09:38:15:WU01:FS01:0x17:Completed 4750000 out of 5000000 steps (95%)
09:38:21:WU01:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint
09:38:21:WU01:FS01:0x17:Max number of retries reached. Aborting.
09:38:21:WU01:FS01:0x17:ERROR:exception: Max Retries Reached
The "Bad State detected" messages seem to be preceded by a very long frame time … at one point (just after 08:00) I thought it was stuck and was about to dump it, but it picked up again just in time so I let it run. Even the horrible PPD it was returning (only ~70% of typical) was better than zero from a dumped WU!

But then it did it once too often and errored out.
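The link between long frames and "Bad State detected" can be checked against the log rather than by eye. The following is only a rough sketch, assuming the line layout shown in the log above; the names `frame_times` and `to_seconds` are made up for illustration and are not part of the Folding@home client. It reports the seconds between successive "Completed" lines and marks each frame that is immediately followed by a Bad State message.

```python
import re

# Line layouts copied from the log above, e.g.
# "23:07:01:WU01:FS01:0x17:Completed 50000 out of 5000000 steps (1%)"
PROGRESS = re.compile(
    r"^(\d\d:\d\d:\d\d):WU\d+:FS\d+:0x[0-9a-fA-F]+:Completed (\d+) out of \d+ steps"
)
BAD_STATE = re.compile(
    r"^\d\d:\d\d:\d\d:WU\d+:FS\d+:0x[0-9a-fA-F]+:Bad State detected"
)


def to_seconds(hms):
    """Convert an HH:MM:SS log timestamp to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def frame_times(lines):
    """Return (steps, seconds_since_previous_frame, bad_state_followed) tuples.

    Hypothetical helper: the elapsed time is taken between consecutive
    "Completed" lines, so a frame reported after a rollback also includes
    the time spent redoing work from the checkpoint.
    """
    frames = []
    previous = None
    for raw in lines:
        line = raw.strip()
        progress = PROGRESS.match(line)
        if progress:
            now = to_seconds(progress.group(1))
            if previous is not None:
                # Modulo keeps the difference positive across midnight.
                elapsed = (now - previous) % 86400
                frames.append([int(progress.group(2)), elapsed, False])
            previous = now
        elif BAD_STATE.match(line) and frames:
            # Mark the frame reported just before the Bad State message.
            frames[-1][2] = True
    return [tuple(frame) for frame in frames]
```

Fed the lines from 01:37:28 to 02:00:58 above, this gives 322 s for a normal frame, 442 s for the frame at 1550000 steps (the one followed by the Bad State message), and 646 s for the next frame, which includes the rollback.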

The GPU temperature varied quite widely during the folding, from ~63°C up to its normal 82-83°C; not sure if that's useful.

After that I'm a bit confused as various parts of the log interleaved, but it got a p9406 which was immediately rejected:

Code:

09:39:40:WU00:FS01:0x17:Project: 9406 (Run 184, Clone 0, Gen 4)
09:39:40:WU00:FS01:0x17:Unit: 0x000000040a3b1e5c533deab006a95130
09:39:40:WU00:FS01:0x17:CPU: 0x00000000000000000000000000000000
09:39:40:WU00:FS01:0x17:Machine: 1
09:39:40:WU00:FS01:0x17:Reading tar file state.xml
09:39:41:WU00:FS01:0x17:Reading tar file system.xml
09:39:41:WU00:FS01:0x17:Reading tar file integrator.xml
09:39:41:WU00:FS01:0x17:Reading tar file core.xml
09:39:41:WU00:FS01:0x17:Digital signatures verified
09:43:01:WU00:FS01:0x17:ERROR:exception: Potential energy error of 10.2812, threshold of 10
09:43:01:WU00:FS01:0x17:ERROR:Reference Potential Energy: -1.08118e+06 | Given Potential Energy: -1.08119e+06
(It had earlier completed a P9406 (Run 177, Clone 0, Gen 6) without a hitch)
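To see how narrowly that p9406 was rejected, the two error lines can be pulled apart with a few lines of Python. This is a sketch only: the function names are invented for illustration, and it assumes (the thread does not confirm it) that the core compares the difference between the reference and the recomputed potential energy against the threshold.

```python
import re

ERROR_LINE = re.compile(r"Potential energy error of ([\d.]+), threshold of ([\d.]+)")
ENERGY_LINE = re.compile(
    r"Reference Potential Energy: (\S+) \| Given Potential Energy: (\S+)"
)


def parse_energy_error(line):
    """Return (error, threshold, excess over threshold), or None if no match."""
    match = ERROR_LINE.search(line)
    if match is None:
        return None
    error = float(match.group(1))
    threshold = float(match.group(2))
    return error, threshold, error - threshold


def printed_difference(line):
    """Absolute difference between the two energies as printed in the log.

    The log prints only six significant digits, so this is much coarser
    than the error value reported on the exception line.
    """
    match = ENERGY_LINE.search(line)
    if match is None:
        return None
    reference = float(match.group(1))
    given = float(match.group(2))
    return abs(reference - given)
```

On the two lines above, the reported error exceeds the threshold by about 0.28, and the printed energies differ by roughly 10, which is as fine as their six significant digits can resolve.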

It then got another P9408 (Run 496, Clone 0, Gen 3) which started like the first one (very low PPD) and I'm afraid I'd had enough at that point- I dumped it and removed the advanced flag. It's now, happily so far, crunching a P13000.

My main question: the 780 Ti is from Gigabyte and slightly factory-overclocked. Is this likely to be the problem?

If it is, can I reduce/remove the overclock (it's in a Linux box) and if so how?

Otherwise I'll have to try to get it swapped for a stock speed one… which would be a pity :(




System info:

Code:

11:54:09:************************* Folding@home Client *************************
11:54:09:    Website: http://folding.stanford.edu/
11:54:09:  Copyright: (c) 2009-2013 Stanford University
11:54:09:     Author: Joseph Coffland <[email protected]>
11:54:09:       Args: --child --lifeline 1087 /etc/fahclient/config.xml --run-as
11:54:09:             fahclient --pid-file=/var/run/fahclient.pid --daemon
11:54:09:     Config: /etc/fahclient/config.xml
11:54:09:******************************** Build ********************************
11:54:09:    Version: 7.3.6
11:54:09:       Date: Feb 18 2013
11:54:09:       Time: 07:24:08
11:54:09:    SVN Rev: 3923
11:54:09:     Branch: fah/trunk/client
11:54:09:   Compiler: GNU 4.4.7
11:54:09:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
11:54:09:             -fno-unsafe-math-optimizations -msse2
11:54:09:   Platform: linux2 3.2.0-1-amd64
11:54:09:       Bits: 64
11:54:09:       Mode: Release
11:54:09:******************************* System ********************************
11:54:09:        CPU: Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz
11:54:09:     CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
11:54:09:       CPUs: 4
11:54:09:     Memory: 3.82GiB
11:54:09:Free Memory: 3.59GiB
11:54:09:    Threads: POSIX_THREADS
11:54:09:Has Battery: false
11:54:09: On Battery: false
11:54:09: UTC offset: 1
11:54:09:        PID: 1300
11:54:09:        CWD: /var/lib/fahclient
11:54:09:         OS: Linux 3.11.0-12-generic x86_64
11:54:09:    OS Arch: AMD64
11:54:09:       GPUs: 1
11:54:09:      GPU 0: NVIDIA:3 GK110 [GeForce GTX 780 Ti]
11:54:09:       CUDA: 3.5
11:54:09:CUDA Driver: 5050
11:54:09:***********************************************************************
Config:

Code:

12:30:03:<config>
12:30:03:  <!-- Client Control -->
12:30:03:  <fold-anon v='true'/>
12:30:03:
12:30:03:  <!-- Folding Slot Configuration -->
12:30:03:  <power v='full'/>
12:30:03:
12:30:03:  <!-- HTTP Server -->
12:30:03:  <allow v='127.0.0.1 192.168.1.0/24'/>
12:30:03:
12:30:03:  <!-- Network -->
12:30:03:  <proxy v=':8080'/>
12:30:03:
12:30:03:  <!-- Remote Command Server -->
12:30:03:  <command-allow-no-pass v='127.0.0.1 192.168.1.0/24'/>
12:30:03:
12:30:03:  <!-- User Information -->
12:30:03:  <passkey v='********************************'/>
12:30:03:  <user v='[removed]'/>
12:30:03:
12:30:03:  <!-- Folding Slots -->
12:30:03:  <slot id='0' type='CPU'>
12:30:03:    <client-type v='advanced'/>
12:30:03:    <cpus v='3'/>
12:30:03:    <next-unit-percentage v='100'/>
12:30:03:    <pause-on-start v='yes'/>
12:30:03:  </slot>
12:30:03:  <slot id='1' type='GPU'>
12:30:03:    <client-type v='advanced'/>
12:30:03:    <next-unit-percentage v='100'/>
12:30:03:    <pause-on-start v='yes'/>
12:30:03:  </slot>
12:30:03:</config>

Re: Project: 9408 (Run 628, Clone 0, Gen 0)

Posted: Fri Apr 18, 2014 11:21 am
by Kurtis200200
It's entirely possible that manufacturer overclocking is responsible for instability leading to failed work units, but not necessarily so (I have two EVGA 660SCs, also in a Linux box, and no problems).
As for downclocking though, I'm afraid I haven't the foggiest /:

Re: Project: 9408 (Run 628, Clone 0, Gen 0)

Posted: Fri Apr 18, 2014 11:34 am
by billford
I've also got a Gigabyte GTX 650 Ti in an essentially identical box (same OS- Linux Mint 16, same drivers- 319.32) but it hasn't yet been given a P9408.

It handles P13000/1 and P9406 without any bother, as this one seems to, but I only built this system yesterday so possible comparisons are limited...

Re: Project: 9408 (Run 628, Clone 0, Gen 0)

Posted: Fri Apr 18, 2014 4:05 pm
by davidcoton
It does seem a bit strange to get a run of Bad WUs unless the card is too fast for its own good. My factory overclocked 780Ti (different manufacturer) seems stable under Linux, but I too have failed to find a clock setting utility for Linux.

It will be interesting to know if any of your problem WUs get completed by others -- especially if the alternative is to get the card changed.

David

Re: Project: 9408 (Run 628, Clone 0, Gen 0)

Posted: Fri Apr 18, 2014 4:40 pm
by billford
I think I'm going to have to put it back to advanced and see what happens… and also wait for the GTX 650Ti to get a P9408 to see how that card handles it.

So far both seem happy with P13000/1 and P9406 and, while they were around, the 650Ti thrived on P8900 and P9401, but there aren't many ZETA projects around at the moment to get much information from, and the 780Ti hasn't even been running for 36 hours yet!

I can't do anything until Tuesday at least, so I might as well let it try as many projects as it can to see whether it's the card or if I've just been unlucky with WUs.

Re: Project: 9408 (Run 628, Clone 0, Gen 0)

Posted: Fri Apr 18, 2014 7:29 pm
by cxh
For what it's worth, project 9408 is exactly the same as 9401. So if you ran 9401 just fine, you shouldn't be having any issues with this one.

Re: Project: 9408 (Run 628, Clone 0, Gen 0)

Posted: Fri Apr 18, 2014 7:36 pm
by billford
Yes, I had a feeling it might be… but I've only run 9401s on the other GPU, not this one.

I'll just have to keep my fingers crossed that it was a bad WU. I'll put it back to advanced tomorrow when I can keep an eye on what it downloads.

Re: Project: 9408 (Run 628, Clone 0, Gen 0) [RESOLVED]

Posted: Sat Apr 19, 2014 6:42 am
by billford
It would seem to be the card… picked up a p9101 and:

Code:

05:06:38:WU00:FS01:0x17:Completed 25000 out of 2500000 steps (1%)
.
05:24:28:WU00:FS01:0x17:Completed 100000 out of 2500000 steps (4%)
05:24:32:WU00:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint
05:26:22:WU00:FS01:0x17:Completed 75000 out of 2500000 steps (3%)
05:35:12:WU00:FS01:0x17:Completed 100000 out of 2500000 steps (4%)
05:35:16:WU00:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint
So it's back to normal mode and p13000/1 until I get it sorted.
davidcoton wrote: but I too have failed to find a clock setting utility for Linux. … especially if the alternative is to get the card changed.
I don't really want to change the card: it's a lot of hassle, and unless I go for a stock (i.e. significantly slower :( ) one there's no guarantee it'll work. I also don't think I'd have much luck swapping a perfectly good card on a second occasion!

There is another possibility- swallow some principles and install Windows.

If I understand correctly what's on the driver DVD, there's a tweaking utility included, and it may only need a very small clock reduction. It'll be a pretty much dedicated folding machine, so I won't have to get too familiar with the OS. The biggest problem would likely be the learning curve: the last version of Windows I used was Vista!


edit- seems I can still pick up legal copies of Vista for a lot less than W7/8, that might be a better route.