Repeated Download Failure

Posted: Mon Aug 12, 2013 4:34 pm
by N0OA
Can anyone provide insight into a problem I've been seeing the last few days on a GTX Titan? The slot seems to go into a cycle of download, run for a bit, fail and then download again... Rinse and repeat ;-)

The log seems to indicate a faulty project... Any ideas?

N0OA

Code:

16:25:54:WU00:FS03:0xa3:Completed 454650 out of 500000 steps  (90%)
16:26:13:WU02:FS02:0x17:Completed 0 out of 2000000 steps (0%)
16:26:23:WU00:FS03:0xa3:Completed 455000 out of 500000 steps  (91%)
16:26:24:WU03:FS00:0x17:Completed 1700000 out of 2000000 steps (85%)
16:26:25:WU01:FS01:0x17:Completed 1150000 out of 2000000 steps (57%)
16:27:17:WU02:FS02:0x17:Completed 20000 out of 2000000 steps (1%)
16:27:25:WU02:FS02:0x17:ERROR:exception: Error downloading array energyBuffer: clEnqueueReadBuffer (-36)
16:27:25:WU02:FS02:0x17:Saving result file logfile_01.txt
16:27:25:WU02:FS02:0x17:Saving result file log.txt
16:27:25:WU02:FS02:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
16:27:25:WARNING:WU02:FS02:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
16:27:25:WU02:FS02:Sending unit results: id:02 state:SEND error:FAULTY project:7811 run:0 clone:185 gen:136 core:0x17 unit:0x0000008f0a3b1e8651db47d527d80df1
16:27:25:WU02:FS02:Uploading 2.45KiB to 171.64.65.98
16:27:25:WU02:FS02:Connecting to 171.64.65.98:8080
16:27:25:WU04:FS02:Connecting to assign-GPU.stanford.edu:80
16:27:25:WU02:FS02:Upload complete
16:27:25:WU02:FS02:Server responded WORK_ACK (400)
16:27:25:WU02:FS02:Cleaning up
16:27:26:WU04:FS02:News: Welcome to Folding@Home
16:27:26:WU04:FS02:Assigned to work server 171.64.65.98
16:27:26:WU04:FS02:Requesting new work unit for slot 02: READY gpu:2:GK110 [GeForce GTX Titan] from 171.64.65.98
16:27:26:WU04:FS02:Connecting to 171.64.65.98:8080
16:27:26:WU04:FS02:Downloading 2.09MiB
16:27:28:WU04:FS02:Download complete
16:27:28:WU04:FS02:Received Unit: id:04 state:DOWNLOAD error:NO_ERROR project:7810 run:0 clone:523 gen:94 core:0x17 unit:0x000000650a3b1e8651d34b92198db5ad
16:27:28:WU04:FS02:Starting
16:27:28:WU04:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/jrice/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/beta/Core_17.fah/FahCore_17.exe -dir 04 -suffix 01 -version 703 -lifeline 3736 -checkpoint 10 -gpu 2 -gpu-vendor nvidia
16:27:28:WU04:FS02:Started FahCore on PID 1636
16:27:28:WU04:FS02:Core PID:2568
16:27:28:WU04:FS02:FahCore 0x17 started
16:27:28:WU04:FS02:0x17:*********************** Log Started 2013-08-12T16:27:28Z ***********************
16:27:28:WU04:FS02:0x17:Project: 7810 (Run 0, Clone 523, Gen 94)
16:27:28:WU04:FS02:0x17:Unit: 0x000000650a3b1e8651d34b92198db5ad
16:27:28:WU04:FS02:0x17:CPU: 0x00000000000000000000000000000000
16:27:28:WU04:FS02:0x17:Machine: 2
16:27:28:WU04:FS02:0x17:Reading tar file state.xml
16:27:29:WU04:FS02:0x17:Reading tar file system.xml
16:27:29:WU04:FS02:0x17:Reading tar file integrator.xml
16:27:29:WU04:FS02:0x17:Reading tar file core.xml
16:27:29:WU04:FS02:0x17:Digital signatures verified
16:27:37:WU01:FS01:0x17:Completed 1160000 out of 2000000 steps (58%)
16:28:16:WU04:FS02:0x17:Completed 0 out of 2000000 steps (0%)
16:28:20:WU03:FS00:0x17:Completed 1720000 out of 2000000 steps (86%)
16:28:26:WARNING:WU04:FS02:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
16:28:26:WU04:FS02:Sending unit results: id:04 state:SEND error:FAULTY project:7810 run:0 clone:523 gen:94 core:0x17 unit:0x000000650a3b1e8651d34b92198db5ad
16:28:26:WU04:FS02:Uploading 2.51KiB to 171.64.65.98
16:28:26:WU04:FS02:Connecting to 171.64.65.98:8080
16:28:27:WU04:FS02:Upload complete
16:28:27:WU04:FS02:Server responded WORK_ACK (400)
16:28:27:WU04:FS02:Cleaning up
16:28:27:WU02:FS02:Connecting to assign-GPU.stanford.edu:80
16:28:27:WU02:FS02:News: Welcome to Folding@Home
16:28:27:WU02:FS02:Assigned to work server 171.64.65.98
16:28:27:WU02:FS02:Requesting new work unit for slot 02: READY gpu:2:GK110 [GeForce GTX Titan] from 171.64.65.98
16:28:27:WU02:FS02:Connecting to 171.64.65.98:8080
16:28:28:WU02:FS02:Downloading 2.07MiB
16:28:29:WU02:FS02:Download complete
16:28:29:WU02:FS02:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:7810 run:0 clone:802 gen:5 core:0x17 unit:0x000000070a3b1e8651d34ebb80a17d5b
16:28:29:WU02:FS02:Starting
16:28:29:WU02:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/jrice/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/beta/Core_17.fah/FahCore_17.exe -dir 02 -suffix 01 -version 703 -lifeline 3736 -checkpoint 10 -gpu 2 -gpu-vendor nvidia
16:28:29:WU02:FS02:Started FahCore on PID 3916
16:28:29:WU02:FS02:Core PID:5116
16:28:29:WU02:FS02:FahCore 0x17 started
16:28:30:WU02:FS02:0x17:*********************** Log Started 2013-08-12T16:28:30Z ***********************
16:28:30:WU02:FS02:0x17:Project: 7810 (Run 0, Clone 802, Gen 5)
16:28:30:WU02:FS02:0x17:Unit: 0x000000070a3b1e8651d34ebb80a17d5b
16:28:30:WU02:FS02:0x17:CPU: 0x00000000000000000000000000000000
16:28:30:WU02:FS02:0x17:Machine: 2
16:28:30:WU02:FS02:0x17:Reading tar file state.xml
16:28:30:WU02:FS02:0x17:Reading tar file system.xml
16:28:30:WU02:FS02:0x17:Reading tar file integrator.xml
16:28:30:WU02:FS02:0x17:Reading tar file core.xml
16:28:30:WU02:FS02:0x17:Digital signatures verified
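
To confirm that the failures are confined to one slot, the "FahCore returned" lines in a log like the one above can be tallied per folding slot. The following is a minimal sketch, not part of the FAHClient tooling: the function name `count_core_returns` and the regular expression are assumptions based only on the line format visible in this log, and other client versions may format these lines differently.

```python
import re
from collections import Counter

# Assumed line format, taken from the log above, e.g.:
# 16:27:25:WARNING:WU02:FS02:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
RETURN_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:(?:WARNING:)?WU\d+:(FS\d+):"
    r"FahCore returned: (\w+) \((\d+) = (0x[0-9a-fA-F]+)\)"
)


def count_core_returns(lines):
    """Count FahCore return statuses per folding slot (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = RETURN_RE.match(line.strip())
        if match:
            slot, status = match.group(1), match.group(2)
            counts[(slot, status)] += 1
    return counts


# Three lines copied from the log above; only two are core return lines.
sample = [
    "16:27:25:WARNING:WU02:FS02:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    "16:28:20:WU03:FS00:0x17:Completed 1720000 out of 2000000 steps (86%)",
    "16:28:26:WARNING:WU04:FS02:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
]

for (slot, status), n in sorted(count_core_returns(sample).items()):
    print(f"{slot}: {status} x{n}")
```

On the sample lines this reports two BAD_WORK_UNIT returns, both on slot FS02, while FS00, FS01 and FS03 keep making progress. That pattern points at one slot (and so one GPU) rather than at a single faulty project.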

Re: Repeated Download Failure

Posted: Mon Aug 12, 2013 4:40 pm
by ChristianVirtual
Which driver are you using, and do you restart every 24 to 36 hours? There is still a driver bug affecting GTX 7xx GPUs; it might also affect the Titan.
Is any overclock applied? How are the temperatures with multiple Titans?

Re: Repeated Download Failure

Posted: Mon Aug 12, 2013 4:56 pm
by bollix47
Status code 0x72 (BAD_WORK_UNIT) could indicate a memory problem. If the GPU memory is overclocked, reduce it to the default. A higher memory speed gives little to no advantage for folding, but it can make folding unstable if it's set too high.

You can also run MemtestCL to determine if the memory is flaky.

Re: Repeated Download Failure

Posted: Tue Aug 13, 2013 6:49 am
by N0OA
The machine had been running fine for several months. It was set to 75°C and under-clocked to avoid heat issues, since there are three Titans in this box... Since I had another Titan, I did a little round robin until I figured out which piece of hardware was causing the issue. The problem is now gone - I guess I have a failing card. I requested an RMA and EVGA will be shipping me a replacement. I've never had a card with such a soft failure. The card worked in all other regards and tested just fine with a few of the GPU test programs I have. But when I looked more closely, it was completing the test while running at only about 70% of the speed of my other Titans... Something is flaky.

I will put this one in the "no longer an issue" category since it's going back to EVGA. The customer support at EVGA has been great to work with... Maybe that's part of the reason I have so much of their hardware. :-)

N0OA

Re: Repeated Download Failure

Posted: Tue Aug 13, 2013 1:23 pm
by 7im
In your case, a 33% failure rate doesn't seem so good.