Bad Work Unit

Post by **Nert** » Fri Aug 28, 2015 12:56 pm

I just received a bad work unit message from FAH. Here's the timeline preceding this event:

1) System is Windows 10 with a GTX 750 Ti, folding on both CPU and GPU 24x7.
2) Processing was working just fine prior to the following sequence.
3) Windows 10 informed me that a reboot was needed to finish Windows Update.
4) I paused FAH and rebooted the system.
5) System restarted and FAH started fine.
6) Within a minute or two after the restart, Windows flashed a message that the GeForce driver had stopped working, but that it had recovered.
7) I checked in the FAH log and noticed the Bad Work Unit Error message.
8) FAH recovered and downloaded another work unit.

Following is the log:

*********************** Log Started 2015-08-28T12:38:28Z ***********************
12:38:28:WU00:FS01:Starting
12:38:28:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/roger/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 00 -suffix 01 -version 704 -lifeline 3632 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
12:38:28:WU00:FS01:Started FahCore on PID 5516
12:38:28:WU00:FS01:Core PID:5280
12:38:28:WU00:FS01:FahCore 0x21 started
12:38:30:WU00:FS01:0x21:*********************** Log Started 2015-08-28T12:38:29Z ***********************
12:38:30:WU00:FS01:0x21:Project: 9704 (Run 19, Clone 2, Gen 12)
12:38:30:WU00:FS01:0x21:Unit: 0x0000000eab404162553ebea1756de681
12:38:30:WU00:FS01:0x21:CPU: 0x00000000000000000000000000000000
12:38:30:WU00:FS01:0x21:Machine: 1
12:38:30:WU00:FS01:0x21:Digital signatures verified
12:38:30:WU00:FS01:0x21:Folding@home GPU Core21 Folding@home Core
12:38:30:WU00:FS01:0x21:Version 0.0.11
12:40:07:WU00:FS01:0x21:ERROR:exception: Error downloading array interactionCount: clEnqueueReadBuffer (-5)
12:40:07:WU00:FS01:0x21:Saving result file logfile_01.txt
12:40:07:WU00:FS01:0x21:Saving result file log.txt
12:40:07:WU00:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
12:40:10:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
12:40:10:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:9704 run:19 clone:2 gen:12 core:0x21 unit:0x0000000eab404162553ebea1756de681
12:40:10:WU00:FS01:Uploading 2.96KiB to 171.64.65.98
12:40:10:WU00:FS01:Connecting to 171.64.65.98:8080
12:40:11:WU00:FS01:Upload complete
12:40:11:WU01:FS01:Connecting to 171.67.108.200:80
12:40:11:WU00:FS01:Server responded WORK_ACK (400)
12:40:11:WU00:FS01:Cleaning up
12:40:11:WU01:FS01:Assigned to work server 171.64.65.58
12:40:11:WU01:FS01:Requesting new work unit for slot 01: READY gpu:0:GM107 [GeForce GTX 750 Ti] from 171.64.65.58
12:40:11:WU01:FS01:Connecting to 171.64.65.58:8080
12:40:13:WU01:FS01:Downloading 894.10KiB
12:40:13:WU01:FS01:Download complete
12:40:13:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:9413 run:5 clone:7 gen:152 core:0x18 unit:0x000000a2ab40413a55410cdc5f487be9
12:40:13:WU01:FS01:Starting
12:40:13:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/roger/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_18.fah/FahCore_18.exe -dir 01 -suffix 01 -version 704 -lifeline 3632 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
12:40:13:WU01:FS01:Started FahCore on PID 1128
12:40:13:WU01:FS01:Core PID:1812
12:40:13:WU01:FS01:FahCore 0x18 started
12:40:14:WU01:FS01:0x18:*********************** Log Started 2015-08-28T12:40:14Z ***********************
12:40:14:WU01:FS01:0x18:Project: 9413 (Run 5, Clone 7, Gen 152)
12:40:14:WU01:FS01:0x18:Unit: 0x000000a2ab40413a55410cdc5f487be9
12:40:14:WU01:FS01:0x18:CPU: 0x00000000000000000000000000000000
12:40:14:WU01:FS01:0x18:Machine: 1
12:40:14:WU01:FS01:0x18:Reading tar file state.xml
12:40:14:WU01:FS01:0x18:Reading tar file system.xml
12:40:14:WU01:FS01:0x18:Reading tar file integrator.xml
12:40:14:WU01:FS01:0x18:Reading tar file core.xml
12:40:14:WU01:FS01:0x18:Digital signatures verified
12:40:14:WU01:FS01:0x18:Folding@home GPU core18
12:40:14:WU01:FS01:0x18:Version 0.0.4
12:40:26:WU01:FS01:0x18:Completed 0 out of 16000000 steps (0%)
12:40:26:WU01:FS01:0x18:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900

A failed work unit right after a Windows 10 update smells fishy, and I suspect this is more than a bad work unit. Let me know if there is any other information I can provide that might help diagnose this problem.

Post by **bruce** » Fri Aug 28, 2015 4:03 pm

The error report will trigger a reassignment of the same WU to someone else, so FAH's science will recover (unless it really is a bad WU). That doesn't explain YOUR error, though.

I share your suspicions about a Windows problem with an update. Needless to say, Windows 10 has some new features, some of which still contain bugs. We've seen a number of reports of problems with its policy of automatic updates [what used to be called "Windows Update"], particularly with GPU drivers, which are not plug-and-play and therefore require a reboot. It's a joint issue between Windows 10 and nVidia's requirements for driver installation.

Your best course of action is to open the Event Viewer, find whatever Windows recorded when the GeForce driver stopped working, and make sure it has been reported. (That's likely to be reported to Microsoft automatically, and probably gets forwarded to nVidia as well.)

You may find some additional discussion here: https://forums.geforce.com/default/board/33/

Although FAH is at the mercy of Microsoft and nVidia for this problem, thanks for your report.
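
For anyone reading the log above: the number in `clEnqueueReadBuffer (-5)` is an OpenCL status code, and -5 is `CL_OUT_OF_RESOURCES` in the Khronos `cl.h` header, which would be consistent with the driver reset Windows reported. Below is a minimal sketch for decoding such lines. The function name and the regular expression are my own illustration, not part of FAH, and the table covers only a handful of codes.

```python
import re

# Subset of status codes from the Khronos OpenCL header (cl.h).
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

# Assumed line shape: an OpenCL call name followed by "(code)",
# e.g. "clEnqueueReadBuffer (-5)" as seen in the FahCore error line.
CL_CALL = re.compile(r"(cl[A-Z]\w+) \((-?\d+)\)")


def decode_opencl_error(log_line):
    """Return (call, code, name) for the first OpenCL call/code pair, or None."""
    match = CL_CALL.search(log_line)
    if match is None:
        return None
    code = int(match.group(2))
    return match.group(1), code, OPENCL_ERRORS.get(code, "unknown")


line = ("12:40:07:WU00:FS01:0x21:ERROR:exception: Error downloading array "
        "interactionCount: clEnqueueReadBuffer (-5)")
print(decode_opencl_error(line))
```

The code only names the status; it does not say why the GPU ran out of resources, so the Event Viewer entry is still the place to look for the cause.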

Post by **suprleg** » Sun Oct 18, 2015 2:45 am

I just installed a new GPU, an Nvidia GTX 960 2 GB, on Ubuntu 12.04.5 (3.13.0-65-generic kernel) with the latest Nvidia Linux drivers (352.41), and I'm getting the same error:

*********************** Log Started 2015-10-16T20:19:58Z ***********************
21:22:27:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
23:17:52:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
******************************* Date: 2015-10-17 *******************************
******************************* Date: 2015-10-17 *******************************
******************************* Date: 2015-10-17 *******************************
******************************* Date: 2015-10-17 *******************************
00:42:24:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
01:18:07:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)

Two snippets:

21:22:23:WU01:FS01:0x21:*********************** Log Started 2015-10-16T21:22:22Z ***********************
21:22:23:WU01:FS01:0x21:Project: 9641 (Run 0, Clone 26, Gen 6)
21:22:23:WU01:FS01:0x21:Unit: 0x00000008ab436c9b5609bee484adeb06
21:22:23:WU01:FS01:0x21:CPU: 0x00000000000000000000000000000000
21:22:23:WU01:FS01:0x21:Machine: 1
21:22:23:WU01:FS01:0x21:Digital signatures verified
21:22:23:WU01:FS01:0x21:Folding@home GPU Core21 Folding@home Core
21:22:23:WU01:FS01:0x21:Version 0.0.11
21:22:23:WU01:FS01:0x21:  Found a checkpoint file
21:22:27:WU01:FS01:0x21:ERROR:Guru Meditation #73787103beab613f.e5bb3224e631a8e4 (15403700.15408703) '01/01/checkpointState.xml'
21:22:27:WU01:FS01:0x21:WARNING:Unexpected exit() call
21:22:27:WU01:FS01:0x21:WARNING:Unexpected exit from science code
21:22:27:WU01:FS01:0x21:Saving result file logfile_01.txt
21:22:27:WU01:FS01:0x21:Saving result file checkpt.crc
21:22:27:WU01:FS01:0x21:Saving result file log.txt
21:22:27:WU01:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
21:22:27:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
21:22:27:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:9641 run:0 clone:26 gen:6 core:0x21 unit:0x00000008ab436c9b5609bee484adeb06
21:22:27:WU01:FS01:Uploading 26.00KiB to 171.67.108.155
21:22:27:WU01:FS01:Connecting to 171.67.108.155:8080
21:22:27:WU01:FS01:Upload complete
21:22:27:WU01:FS01:Server responded WORK_ACK (400)
21:22:27:WU01:FS01:Cleaning up
21:22:28:WU01:FS01:Connecting to assign-GPU.stanford.edu:80
21:22:28:WU01:FS01:News: 
21:22:28:WU01:FS01:Assigned to work server 171.64.65.98
21:22:28:WU01:FS01:Requesting new work unit for slot 01: READY gpu:0:GM206 [GeForce GTX 960] from 171.64.65.98
21:22:28:WU01:FS01:Connecting to 171.64.65.98:8080
21:22:28:WU01:FS01:Downloading 7.55MiB
21:22:29:WU01:FS01:Download complete
21:22:29:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:9712 run:8 clone:10 gen:74 core:0x21 unit:0x0000012aab40416255b9a770b7e7800e
21:22:29:WU01:FS01:Starting
21:22:29:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 01 -suffix 01 -version 703 -lifeline 1423 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
21:22:29:WU01:FS01:Started FahCore on PID 5326
21:22:29:WU01:FS01:Core PID:5330
21:22:29:WU01:FS01:FahCore 0x21 started

00:42:26:WU02:FS01:0x18:*********************** Log Started 2015-10-18T00:42:26Z ***********************
00:42:26:WU02:FS01:0x18:Project: 9119 (Run 4, Clone 6, Gen 214)
00:42:26:WU02:FS01:0x18:Unit: 0x000000fb0a3b1e78553e7edee4f4687a
00:42:26:WU02:FS01:0x18:CPU: 0x00000000000000000000000000000000
00:42:26:WU02:FS01:0x18:Machine: 1
00:42:26:WU02:FS01:0x18:Reading tar file state.xml
00:42:26:WU02:FS01:0x18:Reading tar file system.xml
00:42:26:WU02:FS01:0x18:Reading tar file integrator.xml
00:42:26:WU02:FS01:0x18:Reading tar file core.xml
00:42:26:WU02:FS01:0x18:Digital signatures verified
00:42:26:WU02:FS01:0x18:Folding@home GPU core18
00:42:26:WU02:FS01:0x18:Version 0.0.4
00:42:36:WU02:FS01:0x18:Completed 0 out of 2500000 steps (0%)
00:42:36:WU02:FS01:0x18:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
00:45:35:WU02:FS01:0x18:Completed 25000 out of 2500000 steps (1%)
00:48:31:WU02:FS01:0x18:Completed 50000 out of 2500000 steps (2%)
00:51:27:WU02:FS01:0x18:Completed 75000 out of 2500000 steps (3%)
00:54:24:WU02:FS01:0x18:Completed 100000 out of 2500000 steps (4%)
00:57:23:WU02:FS01:0x18:Completed 125000 out of 2500000 steps (5%)
01:00:19:WU02:FS01:0x18:Completed 150000 out of 2500000 steps (6%)
01:03:15:WU02:FS01:0x18:Completed 175000 out of 2500000 steps (7%)
01:06:12:WU02:FS01:0x18:Completed 200000 out of 2500000 steps (8%)
01:09:11:WU02:FS01:0x18:Completed 225000 out of 2500000 steps (9%)
01:12:07:WU02:FS01:0x18:Completed 250000 out of 2500000 steps (10%)
01:15:03:WU02:FS01:0x18:Completed 275000 out of 2500000 steps (11%)
01:17:59:WU02:FS01:0x18:Completed 300000 out of 2500000 steps (12%)
01:18:04:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
01:18:04:WU02:FS01:Starting
01:18:04:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_18.fah/FahCore_18 -dir 02 -suffix 01 -version 703 -lifeline 1423 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
01:18:04:WU02:FS01:Started FahCore on PID 14078
01:18:04:WU02:FS01:Core PID:14082
01:18:04:WU02:FS01:FahCore 0x18 started
01:18:05:WU02:FS01:0x18:*********************** Log Started 2015-10-18T01:18:04Z ***********************
01:18:05:WU02:FS01:0x18:Project: 9119 (Run 4, Clone 6, Gen 214)
01:18:05:WU02:FS01:0x18:Unit: 0x000000fb0a3b1e78553e7edee4f4687a
01:18:05:WU02:FS01:0x18:CPU: 0x00000000000000000000000000000000
01:18:05:WU02:FS01:0x18:Machine: 1
01:18:05:WU02:FS01:0x18:Digital signatures verified
01:18:05:WU02:FS01:0x18:Folding@home GPU core18
01:18:05:WU02:FS01:0x18:Version 0.0.4
01:18:05:WU02:FS01:0x18:  Found a checkpoint file
01:18:06:WU02:FS01:0x18:ERROR:Guru Meditation #79f9c925585e2e42.587e032b520da89e (11139700.11140870) '02/01/checkpointState.xml'
01:18:06:WU02:FS01:0x18:WARNING:Unexpected exit() call
01:18:06:WU02:FS01:0x18:WARNING:Unexpected exit from science code
01:18:06:WU02:FS01:0x18:Saving result file logfile_01.txt
01:18:06:WU02:FS01:0x18:Saving result file checkpt.crc
01:18:06:WU02:FS01:0x18:Saving result file log.txt
01:18:06:WU02:FS01:0x18:Folding@home Core Shutdown: BAD_WORK_UNIT
01:18:07:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
01:18:07:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9119 run:4 clone:6 gen:214 core:0x18 unit:0x000000fb0a3b1e78553e7edee4f4687a
01:18:07:WU02:FS01:Uploading 3.74KiB to 171.64.65.84
01:18:07:WU02:FS01:Connecting to 171.64.65.84:8080
01:18:07:WU00:FS01:Connecting to assign-GPU.stanford.edu:80
01:18:07:WU00:FS01:News: 
01:18:07:WU00:FS01:Assigned to work server 171.64.65.84
01:18:07:WU00:FS01:Requesting new work unit for slot 01: READY gpu:0:GM206 [GeForce GTX 960] from 171.64.65.84
01:18:07:WU00:FS01:Connecting to 171.64.65.84:8080
01:18:13:WU02:FS01:Upload complete
01:18:13:WU02:FS01:Server responded WORK_ACK (400)
01:18:13:WU02:FS01:Cleaning up
01:18:13:WU00:FS01:Downloading 3.83MiB
01:18:14:WU00:FS01:Download complete
01:18:14:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:9120 run:32 clone:1 gen:52 core:0x18 unit:0x0000003f0a3b1e78553ea17fb489cbe6
01:18:14:WU00:FS01:Starting
01:18:14:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_18.fah/FahCore_18 -dir 00 -suffix 01 -version 703 -lifeline 1423 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
01:18:14:WU00:FS01:Started FahCore on PID 14085
01:18:14:WU00:FS01:Core PID:14089
01:18:14:WU00:FS01:FahCore 0x18 started

Not all work units fail, but as you can see from the first code block I posted, far too many are failing. Ideas?
I am still running FAHClient version 7.3.6 due to old team reporting scripts, but unless you're sure that's the issue, please don't just jump on that. Thanks.
*Sorry, it looks like I posted in the wrong forum area; I found this thread through an online search for this problem. Please move this to the appropriate place if I messed up, thanks. :-/
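
To put a number on "far too many", the `FahCore returned:` lines can be tallied with a few lines of Python. This is only a sketch: the regular expression is an assumption based on the lines quoted in this thread, `tally_core_results` is a made-up helper name, and the sample is copied from the snippets above rather than read from a real log file.

```python
import re
from collections import Counter

# Assumed shape of the client's summary line, e.g.
# "21:22:27:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)"
RETURNED = re.compile(r"FahCore returned: (\w+) \((\d+) = 0x[0-9a-fA-F]+\)")


def tally_core_results(lines):
    """Count how often each FahCore return status appears in the given lines."""
    counts = Counter()
    for line in lines:
        match = RETURNED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "21:22:27:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    "23:17:52:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    "00:42:24:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    "01:18:04:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)",
    "01:18:07:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    "01:18:13:WU02:FS01:Upload complete",
]

for status, count in tally_core_results(sample).most_common():
    print(status, count)
```

To run it against a real log, pass an open file object instead of `sample`; the log location depends on the install and is not shown in this thread.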

Post by **toTOW** » Sun Oct 18, 2015 11:26 am

It's a checkpoint corruption error... How do you stop your client? Do you give it a few seconds to flush everything to disk before rebooting or turning off the system?
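
The two kinds of failure in this thread can be told apart from the core's last `ERROR:` line before each `BAD_WORK_UNIT`: the Linux failures name `checkpointState.xml`, while the Windows one names an OpenCL call. Here is a rough sketch of that check. The patterns and the `classify_bad_units` helper are assumptions drawn only from the log lines posted here, so other error texts will come out as "unknown".

```python
import re

# Assumed client line layout: HH:MM:SS:[WARNING:|ERROR:]WUnn:FSnn:rest
PREFIX = re.compile(r"^\d\d:\d\d:\d\d:(?:WARNING:|ERROR:)?(WU\d+):FS\d+:(.*)$")
# Core output is prefixed with the core id, e.g. "0x21:ERROR:..."
CORE_ERROR = re.compile(r"^0x[0-9a-fA-F]+:ERROR:(.*)$")
OPENCL_CALL = re.compile(r"\bcl[A-Z]\w+ \(-?\d+\)")


def classify_bad_units(lines):
    """Return a list of (work_unit, cause) for each BAD_WORK_UNIT result."""
    last_error = {}  # work unit id -> most recent core ERROR text
    results = []
    for line in lines:
        match = PREFIX.match(line)
        if not match:
            continue
        wu, rest = match.groups()
        core_error = CORE_ERROR.match(rest)
        if core_error:
            last_error[wu] = core_error.group(1)
        elif rest.startswith("FahCore returned: BAD_WORK_UNIT"):
            err = last_error.pop(wu, "")
            if "checkpointState.xml" in err:
                cause = "checkpoint"
            elif OPENCL_CALL.search(err):
                cause = "opencl"
            else:
                cause = "unknown"
            results.append((wu, cause))
    return results


sample = [
    "21:22:23:WU01:FS01:0x21:  Found a checkpoint file",
    "21:22:27:WU01:FS01:0x21:ERROR:Guru Meditation #73787103beab613f.e5bb3224e631a8e4 (15403700.15408703) '01/01/checkpointState.xml'",
    "21:22:27:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    "12:40:07:WU00:FS01:0x21:ERROR:exception: Error downloading array interactionCount: clEnqueueReadBuffer (-5)",
    "12:40:10:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
]

for wu, cause in classify_bad_units(sample):
    print(wu, cause)
```

If most failures on a machine come out as "checkpoint", that points toward how the client is being stopped rather than toward the work units themselves.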
