13001 WU failure

Posted: Sat Oct 04, 2014 4:22 pm
by bfromcolo
This system is running Mint 17 with a GTX 750 Ti and NVIDIA driver 343.22, all at stock clocks. It has been running 9201 WUs fine; this is the first 13001 I have seen, and it failed with:

15:49:07:WU00:FS01:0x17:ERROR:exception: Force RMSE error of 447.223 with threshold of 5

What does this error mean?



Code: Select all

*********************** Log Started 2014-10-04T15:45:26Z ***********************
15:45:26:************************* Folding@home Client *************************
15:45:26:    Website: http://folding.stanford.edu/
15:45:26:  Copyright: (c) 2009-2014 Stanford University
15:45:26:     Author: Joseph Coffland <[email protected]>
15:45:26:       Args: --child --lifeline 2647 /etc/fahclient/config.xml --run-as
15:45:26:             fahclient --pid-file=/var/run/fahclient.pid --daemon
15:45:26:     Config: /etc/fahclient/config.xml
15:45:26:******************************** Build ********************************
15:45:26:    Version: 7.4.4
15:45:26:       Date: Mar 4 2014
15:45:26:       Time: 12:02:38
15:45:26:    SVN Rev: 4130
15:45:26:     Branch: fah/trunk/client
15:45:26:   Compiler: GNU 4.4.7
15:45:26:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
15:45:26:             -fno-unsafe-math-optimizations -msse2
15:45:26:   Platform: linux2 3.2.0-1-amd64
15:45:26:       Bits: 64
15:45:26:       Mode: Release
15:45:26:******************************* System ********************************
15:45:26:        CPU: AMD Phenom(tm) II X6 1045T Processor
15:45:26:     CPU ID: AuthenticAMD Family 16 Model 10 Stepping 0
15:45:26:       CPUs: 6
15:45:26:     Memory: 7.80GiB
15:45:26:Free Memory: 6.92GiB
15:45:26:    Threads: POSIX_THREADS
15:45:26: OS Version: 3.13
15:45:26:Has Battery: false
15:45:26: On Battery: false
15:45:26: UTC Offset: -6
15:45:26:        PID: 2649
15:45:26:        CWD: /var/lib/fahclient
15:45:26:         OS: Linux 3.13.0-24-generic x86_64
15:45:26:    OS Arch: AMD64
15:45:26:       GPUs: 1
15:45:26:      GPU 0: NVIDIA:4 GM107 [GeForce GTX 750 Ti]
15:45:26:       CUDA: 5.0
15:45:26:CUDA Driver: 6050
15:45:26:***********************************************************************
15:45:26:<config>
15:45:26:  <!-- Client Control -->
15:45:26:  <fold-anon v='true'/>
15:45:26:
15:45:26:  <!-- Network -->
15:45:26:  <proxy v=':8080'/>
15:45:26:
15:45:26:  <!-- Slot Control -->
15:45:26:  <power v='full'/>
15:45:26:
15:45:26:  <!-- User Information -->
15:45:26:  <passkey v='********************************'/>
15:45:26:  <team v='37726'/>
15:45:26:  <user v='bfromcolo'/>
15:45:26:
15:45:26:  <!-- Folding Slots -->
15:45:26:  <slot id='1' type='GPU'/>
15:45:26:</config>
15:45:26:Switching to user fahclient
15:45:26:Trying to access database...
15:45:27:Successfully acquired database lock
15:45:27:Enabled folding slot 01: READY gpu:0:GM107 [GeForce GTX 750 Ti]
15:45:27:WU00:FS01:Connecting to 171.67.108.201:80
15:45:28:WU00:FS01:Assigned to work server 140.163.4.231
15:45:28:WU00:FS01:Requesting new work unit for slot 01: READY gpu:0:GM107 [GeForce GTX 750 Ti] from 140.163.4.231
15:45:28:WU00:FS01:Connecting to 140.163.4.231:8080
15:45:29:WU00:FS01:Downloading 4.84MiB
15:45:35:WU00:FS01:Download 71.05%
15:45:37:WU00:FS01:Download complete
15:45:37:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13001 run:378 clone:1 gen:68 core:0x17 unit:0x00000096538b3db75328bad892c4b6cd
15:45:38:WU00:FS01:Starting
15:45:38:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_17.fah/FahCore_17 -dir 00 -suffix 01 -version 704 -lifeline 2649 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
15:45:38:WU00:FS01:Started FahCore on PID 2667
15:45:38:WU00:FS01:Core PID:2671
15:45:38:WU00:FS01:FahCore 0x17 started
15:45:38:WU00:FS01:0x17:*********************** Log Started 2014-10-04T15:45:38Z ***********************
15:45:38:WU00:FS01:0x17:Project: 13001 (Run 378, Clone 1, Gen 68)
15:45:38:WU00:FS01:0x17:Unit: 0x00000096538b3db75328bad892c4b6cd
15:45:38:WU00:FS01:0x17:CPU: 0x00000000000000000000000000000000
15:45:38:WU00:FS01:0x17:Machine: 1
15:45:38:WU00:FS01:0x17:Reading tar file state.xml
15:45:39:WU00:FS01:0x17:Reading tar file system.xml
15:45:39:WU00:FS01:0x17:Reading tar file integrator.xml
15:45:39:WU00:FS01:0x17:Reading tar file core.xml
15:45:39:WU00:FS01:0x17:Digital signatures verified
15:49:07:WU00:FS01:0x17:ERROR:exception: Force RMSE error of 447.223 with threshold of 5
15:49:07:WU00:FS01:0x17:Saving result file logfile_01.txt
15:49:07:WU00:FS01:0x17:Saving result file badStateCheckpoint_57114166
15:49:08:WU00:FS01:0x17:Saving result file badStateForceGroup0_57114166Core.xml
15:49:11:WU00:FS01:0x17:Saving result file badStateForceGroup0_57114166Ref.xml
15:49:14:WU00:FS01:0x17:Saving result file badStateForceGroup1_57114166Core.xml
15:49:16:WU00:FS01:0x17:Saving result file badStateForceGroup1_57114166Ref.xml
15:49:19:WU00:FS01:0x17:Saving result file badStateForceGroup2_57114166Core.xml
15:49:21:WU00:FS01:0x17:Saving result file badStateForceGroup2_57114166Ref.xml
15:49:23:WU00:FS01:0x17:Saving result file log.txt
15:49:23:WU00:FS01:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
15:49:24:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
15:49:24:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13001 run:378 clone:1 gen:68 core:0x17 unit:0x00000096538b3db75328bad892c4b6cd
15:49:24:WU00:FS01:Uploading 24.64MiB to 140.163.4.231
15:49:24:WU00:FS01:Connecting to 140.163.4.231:8080
Mod edit: Please use Code tags instead of Quote tags around log files

Re: 13001 WU failure

Posted: Sat Oct 04, 2014 5:15 pm
by Joe_H
The error indicates that you may have received a bad WU. So far no one has completed this WU, though one person did get about 25% of the way through it.
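
For readers wondering what the number in the message measures: the saved result files come in `...Core.xml` / `...Ref.xml` pairs, which suggests the core compares its GPU-computed forces against a reference computation and aborts when the root-mean-square error (RMSE) exceeds the threshold. The sketch below only illustrates that idea. The function names, the per-atom normalization, and the force layout are assumptions for illustration, not FahCore_17's actual code.

```python
import math

# Threshold taken from the log message "with threshold of 5".
THRESHOLD = 5.0


def force_rmse(core_forces, ref_forces):
    """RMSE between two equal-length lists of (fx, fy, fz) force vectors.

    Assumption: the squared error is averaged per atom; the real core
    may normalize differently.
    """
    if len(core_forces) != len(ref_forces) or not core_forces:
        raise ValueError("force lists must be non-empty and the same length")
    total = 0.0
    for (cx, cy, cz), (rx, ry, rz) in zip(core_forces, ref_forces):
        total += (cx - rx) ** 2 + (cy - ry) ** 2 + (cz - rz) ** 2
    return math.sqrt(total / len(core_forces))


def check_forces(core_forces, ref_forces, threshold=THRESHOLD):
    """Return the RMSE, or raise if it exceeds the threshold (hypothetical check)."""
    rmse = force_rmse(core_forces, ref_forces)
    if rmse > threshold:
        raise RuntimeError(
            f"Force RMSE error of {rmse:g} with threshold of {threshold:g}"
        )
    return rmse
```

Under this reading, a value near 450 against a threshold of 5 means the two force calculations disagree wildly, not marginally.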

Re: 13001 WU failure

Posted: Sat Oct 04, 2014 7:07 pm
by Breach
I think this is a more general problem following some changes done today to the AS. You have a Maxwell like me and after the change we're being given Core 17 WUs which error out (or even crash the core) - see here:
viewtopic.php?f=18&t=26807&start=15

I don't know whether this is the case with all Core 17 WUs and Maxwells or just some projects. From what I understand it's an old problem which emerged again with the new AS and the recent changes. After failing all WUs I have received I stopped GPU folding for now (at least with Core 15 WUs we could do something ;-)

Re: 13001 WU failure

Posted: Sat Oct 04, 2014 10:31 pm
by bruce
Breach wrote:I think this is a more general problem following some changes done today to the AS. You have a Maxwell like me and after the change we're being given Core 17 WUs which error out (or even crash the core) - see here:
viewtopic.php?f=18&t=26807&start=15

I don't know whether this is the case with all Core 17 WUs and Maxwells or just some projects. From what I understand it's an old problem which emerged again with the new AS and the recent changes. After failing all WUs I have received I stopped GPU folding for now (at least with Core 15 WUs we could do something ;-)
Maxwell GPUs are most definitely more reliable with the latest drivers than with older versions. I'm not sure if that's significant for FahCore_17 but it's worth considering.

While changes to the AS code have altered the assignment probabilities for specific projects, actual changes may not match with our perception of how particular projects behave.

Re: 13001 WU failure

Posted: Sat Oct 04, 2014 10:44 pm
by Kjetil
The latest Short Lived Branch version is 343.22, so he has the latest drivers for Linux. I have the same problems on Windows. Is it the AS and not the drivers?

Re: 13001 WU failure

Posted: Sat Oct 04, 2014 10:44 pm
by Breach
bruce, right now all Core 17 WUs assigned to Maxwells seem to fail (with the latest drivers) - in my case about 10 out of 10. I posted here because I don't think this is an isolated incident.

Re: 13001 WU failure

Posted: Sun Oct 05, 2014 2:33 pm
by bfromcolo
My system runs 9201 fine, but overnight it stopped processing after 10 consecutive 13001 failures. Is there any flag that would make 9201 WUs more likely?

Code: Select all

23:33:52:WU00:FS01:0x17:ERROR:exception: Force RMSE error of 454.735 with threshold of 5
23:34:09:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
23:38:00:WU01:FS01:0x17:ERROR:exception: Force RMSE error of 453.528 with threshold of 5
23:38:18:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
23:42:10:WU02:FS01:0x17:ERROR:exception: Force RMSE error of 446.944 with threshold of 5
23:42:28:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
23:46:29:WU00:FS01:0x17:ERROR:exception: Force RMSE error of 453.412 with threshold of 5
23:46:47:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
23:50:41:WU01:FS01:0x17:ERROR:exception: Force RMSE error of 451.321 with threshold of 5
23:50:59:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
23:54:50:WU02:FS01:0x17:ERROR:exception: Force RMSE error of 452.633 with threshold of 5
23:55:07:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
23:59:01:WU03:FS01:0x17:ERROR:exception: Force RMSE error of 455.484 with threshold of 5
23:59:17:WARNING:WU03:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
00:03:25:WU00:FS01:0x17:ERROR:exception: Force RMSE error of 456.956 with threshold of 5
00:03:42:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
00:07:31:WU01:FS01:0x17:ERROR:exception: Force RMSE error of 450.132 with threshold of 5
00:07:48:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
00:11:39:WU02:FS01:0x17:ERROR:exception: Force RMSE error of 452.811 with threshold of 5
00:11:56:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
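
A filtered excerpt like the one above can be pulled out of a full log with a small script. This is a minimal sketch that assumes the line layout shown in this thread (`HH:MM:SS:WUnn:FSnn:0x17:ERROR:exception: Force RMSE error of ...`). The function name `find_rmse_failures` is made up for illustration and is not part of the FAH client.

```python
import re

# Matches core log lines of the form seen in this thread.
RMSE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?P<wu>WU\d+):(?P<slot>FS\d+):0x[0-9A-Fa-f]+:"
    r"ERROR:exception: Force RMSE error of (?P<rmse>\d+(?:\.\d+)?) "
    r"with threshold of (?P<threshold>\d+(?:\.\d+)?)"
)


def find_rmse_failures(log_text):
    """Return a list of (time, work-unit slot, rmse) for each RMSE failure."""
    failures = []
    for line in log_text.splitlines():
        match = RMSE_RE.match(line.strip())
        if match:
            failures.append(
                (match.group("time"), match.group("wu"), float(match.group("rmse")))
            )
    return failures


# Sample lines copied from the log excerpt in this post.
SAMPLE_LOG = """\
23:33:52:WU00:FS01:0x17:ERROR:exception: Force RMSE error of 454.735 with threshold of 5
23:34:09:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
23:38:00:WU01:FS01:0x17:ERROR:exception: Force RMSE error of 453.528 with threshold of 5
23:38:18:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
"""

failures = find_rmse_failures(SAMPLE_LOG)
for time, wu, rmse in failures:
    print(time, wu, rmse)
```

Counting the returned entries gives the number of consecutive failures, which is handy when reporting the problem.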

Re: 13001 WU failure

Posted: Sun Oct 05, 2014 7:36 pm
by snapshot
I've just had the same problem:

Code: Select all

18:57:26:WU02:FS00:0x17:ERROR:exception: Force RMSE error of 455.059 with threshold of 5
18:57:26:WU02:FS00:0x17:Saving result file logfile_01.txt
18:57:26:WU02:FS00:0x17:Saving result file log.txt
18:57:26:WU02:FS00:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
18:57:26:WARNING:WU02:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:57:26:WU02:FS00:Sending unit results: id:02 state:SEND error:FAULTY project:13001 run:62 clone:3 gen:11 core:0x17 unit:0x0000001e538b3db753286153604b81f0
18:57:26:WU02:FS00:Uploading 2.30KiB to 140.163.4.231
18:57:26:WU02:FS00:Connecting to 140.163.4.231:8080
18:57:26:WU02:FS00:Upload complete
18:57:26:WU02:FS00:Server responded WORK_ACK (400)
18:57:26:WU02:FS00:Cleaning up
Nvidia drivers 340.52 under W7 Pro 64. I'll try the 344.11 drivers on my test box but I wasn't using them because they were so poor on 9201s.

Re: 13001 WU failure

Posted: Sun Oct 05, 2014 8:06 pm
by 7im
What version of fahcore?

On what kind of hardware? We need more info to help you.

Re: 13001 WU failure

Posted: Sun Oct 05, 2014 8:23 pm
by snapshot
FAHcore is version 52. Hardware is i7-3770, 16GB RAM, GTX750ti.

Just had another one:

Code: Select all

20:06:01:WU02:FS00:0x17:ERROR:exception: Force RMSE error of 450.68 with threshold of 5
20:06:01:WU02:FS00:0x17:Saving result file logfile_01.txt
20:06:01:WU02:FS00:0x17:Saving result file log.txt
20:06:01:WU02:FS00:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
20:06:01:WARNING:WU02:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
20:06:01:WU02:FS00:Sending unit results: id:02 state:SEND error:FAULTY project:13000 run:129 clone:0 gen:50 core:0x17 unit:0x00000066538b3db7530fc0694e857c15
20:06:01:WU02:FS00:Uploading 2.31KiB to 140.163.4.231
20:06:01:WU02:FS00:Connecting to 140.163.4.231:8080
20:06:02:WU02:FS00:Upload complete
20:06:02:WU02:FS00:Server responded WORK_ACK (400)
20:06:02:WU02:FS00:Cleaning up
This is a system that was 100% stable with 9201, 8108 and 762x WUs and has not had any hardware changes or any extra software installed other than MS updates as I've been away from home for the last four days.
This is preventing me folding with my GPU and, if I can only use the CPU, then I'm just not going to bother.
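
When reporting failures like these, the project/run/clone/gen (PRCG) values identify the exact WU. The sketch below extracts them from the `Sending unit results` lines, assuming the `project:N run:N clone:N gen:N` layout seen in the logs above. `parse_prcg` is a hypothetical helper name, not a client feature.

```python
import re

# Field layout as it appears in the "Sending unit results" lines above.
PRCG_RE = re.compile(
    r"project:(?P<project>\d+) run:(?P<run>\d+) "
    r"clone:(?P<clone>\d+) gen:(?P<gen>\d+)"
)


def parse_prcg(line):
    """Return (project, run, clone, gen) as ints, or None if the line has no PRCG."""
    match = PRCG_RE.search(line)
    if match is None:
        return None
    return tuple(int(match.group(key)) for key in ("project", "run", "clone", "gen"))


# The two "Sending unit results" lines from the posts above.
first = (
    "18:57:26:WU02:FS00:Sending unit results: id:02 state:SEND error:FAULTY "
    "project:13001 run:62 clone:3 gen:11 core:0x17 "
    "unit:0x0000001e538b3db753286153604b81f0"
)
second = (
    "20:06:01:WU02:FS00:Sending unit results: id:02 state:SEND error:FAULTY "
    "project:13000 run:129 clone:0 gen:50 core:0x17 "
    "unit:0x00000066538b3db7530fc0694e857c15"
)

print(parse_prcg(first))
print(parse_prcg(second))
```

Run against these two lines, it shows that the second failure came from project 13000 rather than 13001, so the problem is not limited to a single project.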

Re: 13001 WU failure

Posted: Mon Oct 06, 2014 12:57 am
by gwildperson
snapshot wrote: Nvidia drivers 340.52 under W7 Pro 64. I'll try the 344.11 drivers on my test box but I wasn't using them because they were so poor on 9201s.
Why 344.11, when 344.16 was released 5 days later?

Re: 13001 WU failure

Posted: Mon Oct 06, 2014 1:07 am
by Kjetil
344.16 is ONLY for the 970 and 980.

Re: 13001 WU failure

Posted: Mon Oct 06, 2014 1:44 am
by Razzaa
I am having the exact same issues. I have tried numerous things to fix it but now my GPU won't fold at all.

Re: 13001 WU failure

Posted: Mon Oct 06, 2014 2:54 am
by bruce
Razzaa wrote:I am having the exact same issues. I have tried numerous things to fix it but now my GPU wont fold at all.
Please report which GPU you have and which drivers you are running.

Re: 13001 WU failure

Posted: Mon Oct 06, 2014 3:41 am
by Barryfla
I am having the same problem as others have stated. My GTX 750 Ti won't fold. Driver version 334.89, Windows 7, AMD FX-6350 (6 cores) and 16 GB of RAM.

14:18:11:WU01:FS01:Connecting to 171.67.108.201:80
14:18:12:WU01:FS01:Assigned to work server 140.163.4.231
14:18:12:WU01:FS01:Requesting new work unit for slot 01: READY gpu:0:GM107 [GeForce GTX 750 Ti] from 140.163.4.231
14:18:12:WU01:FS01:Connecting to 140.163.4.231:8080
14:18:12:WU01:FS01:Downloading 4.84MiB
14:18:17:WU01:FS01:Download complete
14:18:17:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13001 run:48 clone:4 gen:34 core:0x17 unit:0x00000048538b3db753285d6453ddcf7a
14:18:17:WU01:FS01:Starting
14:18:17:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Barry/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_17.fah/FahCore_17.exe -dir 01 -suffix 01 -version 704 -lifeline 13732 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
14:18:17:WU01:FS01:Started FahCore on PID 12816
14:18:17:WU01:FS01:Core PID:5296
14:18:17:WU01:FS01:FahCore 0x17 started
14:18:18:WU01:FS01:0x17:*********************** Log Started 2014-10-06T14:18:18Z ***********************
14:18:18:WU01:FS01:0x17:Project: 13001 (Run 48, Clone 4, Gen 34)
14:18:18:WU01:FS01:0x17:Unit: 0x00000048538b3db753285d6453ddcf7a
14:18:18:WU01:FS01:0x17:CPU: 0x00000000000000000000000000000000
14:18:18:WU01:FS01:0x17:Machine: 1
14:18:18:WU01:FS01:0x17:Reading tar file state.xml
14:18:19:WU01:FS01:0x17:Reading tar file system.xml
14:18:20:WU01:FS01:0x17:Reading tar file integrator.xml
14:18:20:WU01:FS01:0x17:Reading tar file core.xml
14:18:20:WU01:FS01:0x17:Digital signatures verified
14:18:21:WU01:FS01:0x17:Folding@home GPU core17
14:18:21:WU01:FS01:0x17:Version 0.0.52
14:22:20:WU01:FS01:0x17:ERROR:exception: Force RMSE error of 455.674 with threshold of 5
14:22:20:WU01:FS01:0x17:Saving result file logfile_01.txt
14:22:20:WU01:FS01:0x17:Saving result file log.txt
14:22:20:WU01:FS01:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
14:22:21:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
14:22:21:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13001 run:48 clone:4 gen:34 core:0x17 unit:0x00000048538b3db753285d6453ddcf7a