I am having rampant failures with the newest NVIDIA drivers

It seems that a lot of GPU problems revolve around specific versions of drivers. Though NVIDIA has its own support structure, you can often learn from information reported by others who fold.

Moderators: Site Moderators, FAHC Science Team

Turbo_T
Posts: 26
Joined: Mon Mar 11, 2013 1:46 am

Re: I am having rampant failures with the newest NVIDIA driv

Post by Turbo_T »

It was folding 15's fine, and has completed a couple of 17's as well, but it has been unable to get beyond the first few steps in the last few days. I tried setting all clocks to the lowest settings (stock), manually deleting all Nvidia drivers and performing a clean install of the GPU drivers just now, and I bumped the voltage on the PCIe IOH and ICH to see if the dual-card setup was drawing the voltage too low. I have a GTX 650 Ti in the second slot. I will try folding that card again, but given the specific nature of the errors I thought someone might have seen this before and knew of a fix. Thanks,
Turbo_T
Posts: 26
Joined: Mon Mar 11, 2013 1:46 am

Re: I am having rampant failures with the newest NVIDIA driv

Post by Turbo_T »

Still getting the same failures after reloading the drivers:

Code:

*********************** Log Started 2014-02-15T17:02:43Z ***********************
17:02:44:WU00:FS00:Connecting to assign-GPU.stanford.edu:80
17:02:45:WU00:FS00:News: Welcome to Folding@Home
17:02:45:WU00:FS00:Assigned to work server 171.64.65.69
17:02:45:WU00:FS00:Requesting new work unit for slot 00: READY gpu:0:GF100 [GeForce GTX 480] from 171.64.65.69
17:02:45:WU00:FS00:Connecting to 171.64.65.69:8080
17:02:45:WU00:FS00:Downloading 4.18MiB
17:02:49:WU00:FS00:Download complete
17:02:49:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:8900 run:874 clone:0 gen:83 core:0x17 unit:0x00000095028c126651a6e90a37885474
17:02:49:WU00:FS00:Starting
17:02:49:WU00:FS00:Running FahCore: "E:\Stanford FAH\FAHClient/FAHCoreWrapper.exe" "E:/Stanford FAH/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_17.fah/FahCore_17.exe" -dir 00 -suffix 01 -version 703 -lifeline 5588 -checkpoint 30 -gpu 0 -gpu-vendor nvidia
17:02:49:WU00:FS00:Started FahCore on PID 368
17:02:49:WU00:FS00:Core PID:4280
17:02:49:WU00:FS00:FahCore 0x17 started
17:02:49:WU00:FS00:0x17:*********************** Log Started 2014-02-15T17:02:49Z ***********************
17:02:49:WU00:FS00:0x17:Project: 8900 (Run 874, Clone 0, Gen 83)
17:02:49:WU00:FS00:0x17:Unit: 0x00000095028c126651a6e90a37885474
17:02:49:WU00:FS00:0x17:CPU: 0x00000000000000000000000000000000
17:02:49:WU00:FS00:0x17:Machine: 0
17:02:49:WU00:FS00:0x17:Reading tar file state.xml
17:02:50:WU00:FS00:0x17:Reading tar file system.xml
17:02:50:WU00:FS00:0x17:Reading tar file integrator.xml
17:02:50:WU00:FS00:0x17:Reading tar file core.xml
17:02:50:WU00:FS00:0x17:Digital signatures verified
17:02:50:WU00:FS00:0x17:Folding@home GPU core17
17:02:50:WU00:FS00:0x17:Version 0.0.52
17:06:26:WU00:FS00:0x17:Completed 0 out of 2500000 steps (0%)
17:06:26:WU00:FS00:0x17:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
17:22:33:WU00:FS00:0x17:ERROR:exception: The periodic box size has decreased to less than twice the nonbonded cutoff.
17:22:33:WU00:FS00:0x17:Saving result file logfile_01.txt
17:22:33:WU00:FS00:0x17:Saving result file log.txt
17:22:33:WU00:FS00:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
17:22:33:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
17:22:33:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:8900 run:874 clone:0 gen:83 core:0x17 unit:0x00000095028c126651a6e90a37885474
17:22:33:WU00:FS00:Uploading 2.47KiB to 171.64.65.69
17:22:33:WU00:FS00:Connecting to 171.64.65.69:8080
17:22:34:WU01:FS00:Connecting to assign-GPU.stanford.edu:80
17:22:34:WU00:FS00:Upload complete
17:22:34:WU00:FS00:Server responded WORK_ACK (400)
17:22:34:WU00:FS00:Cleaning up
17:22:34:WU01:FS00:News: Welcome to Folding@Home
17:22:34:WU01:FS00:Assigned to work server 171.64.65.69
17:22:34:WU01:FS00:Requesting new work unit for slot 00: READY gpu:0:GF100 [GeForce GTX 480] from 171.64.65.69
17:22:34:WU01:FS00:Connecting to 171.64.65.69:8080
17:22:35:WU01:FS00:Downloading 4.18MiB
17:22:38:WU01:FS00:Download complete
17:22:38:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8900 run:264 clone:3 gen:73 core:0x17 unit:0x0000006c028c126651a6615b1a5badfe
17:22:38:WU01:FS00:Starting
17:22:38:WU01:FS00:Running FahCore: "E:\Stanford FAH\FAHClient/FAHCoreWrapper.exe" "E:/Stanford FAH/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_17.fah/FahCore_17.exe" -dir 01 -suffix 01 -version 703 -lifeline 5588 -checkpoint 30 -gpu 0 -gpu-vendor nvidia
17:22:38:WU01:FS00:Started FahCore on PID 1256
17:22:38:WU01:FS00:Core PID:4112
17:22:38:WU01:FS00:FahCore 0x17 started
17:22:38:WU01:FS00:0x17:*********************** Log Started 2014-02-15T17:22:38Z ***********************
17:22:38:WU01:FS00:0x17:Project: 8900 (Run 264, Clone 3, Gen 73)
17:22:38:WU01:FS00:0x17:Unit: 0x0000006c028c126651a6615b1a5badfe
17:22:38:WU01:FS00:0x17:CPU: 0x00000000000000000000000000000000
17:22:38:WU01:FS00:0x17:Machine: 0
17:22:38:WU01:FS00:0x17:Reading tar file state.xml
17:22:39:WU01:FS00:0x17:Reading tar file system.xml
17:22:40:WU01:FS00:0x17:Reading tar file integrator.xml
17:22:40:WU01:FS00:0x17:Reading tar file core.xml
17:22:40:WU01:FS00:0x17:Digital signatures verified
17:22:40:WU01:FS00:0x17:Folding@home GPU core17
17:22:40:WU01:FS00:0x17:Version 0.0.52
17:25:48:WU01:FS00:0x17:Completed 0 out of 2500000 steps (0%)
17:25:48:WU01:FS00:0x17:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
17:39:50:FS00:Finishing
17:47:31:WU01:FS00:0x17:Completed 25000 out of 2500000 steps (1%)
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm case fans with silicone fan gaskets and silicone mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: I am having rampant failures with the newest NVIDIA driv

Post by 7im »

Maybe a mod could check some of those WUs to see if others completed or also failed them.
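If you want to hand a mod a clean list, a rough sketch like the one below (Python; the log path is only an example) will scrape the FAULTY results out of the FAHClient v7 log and tally them by Project/Run/Clone/Gen, matching the "Sending unit results: ... error:FAULTY ..." lines quoted earlier in the thread.

Code:

# Rough sketch: tally failed (FAULTY) work units from a FAHClient v7 log.
# The path below is only an example; point it at your own data directory.
import re
from collections import Counter

LOG_PATH = r"E:\Stanford FAH\FAHClient\log.txt"  # example path

FAULTY = re.compile(
    r"error:FAULTY\s+project:(\d+)\s+run:(\d+)\s+clone:(\d+)\s+gen:(\d+)"
)

counts = Counter()
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        match = FAULTY.search(line)
        if match:
            counts[match.groups()] += 1

for (project, run, clone, gen), n in counts.most_common():
    print(f"Project: {project} (Run {run}, Clone {clone}, Gen {gen}) -- {n} failure(s)")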
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine #3:

Dell Dimension 8400, 3.2GHz P4, 4x512MB RAM, GTX 460 video card, Windows 7 X32

I am currently folding just on the 5x GTX 460's for approx. 70K PPD
Location: Salem, OR USA

Re: I am having rampant failures with the newest NVIDIA driv

Post by P5-133XL »

Project: 8900 (Run 446, Clone 3, Gen 67) Completed by someone else
Project: 8900 (Run 294, Clone 3, Gen 43) Completed by someone else
Project: 8900 (Run 302, Clone 0, Gen 296) Completed by someone else
Project: 8900 (Run 649, Clone 4, Gen 83) Has yet to be completed -- 2 failures
Project: 8900 (Run 627, Clone 1, Gen 286) Has yet to be completed -- 2 failures

Removed OC'ing

Project: 8900 (Run 874, Clone 0, Gen 83) Has yet to be completed -- 2 failures and a successful completion.
Project: 8900 (Run 264, Clone 3, Gen 73) Has yet to be completed -- 4 failures
Project: 8900 (Run 264, Clone 3, Gen 73) Has yet to be completed -- 4 failures and a successful completion.

Since you removed the OC'ing, none of your failed WUs has been completed by someone else, so it is still possible that you are just being unlucky with the WUs you've been assigned rather than there being a flaw in your setup.
Turbo_T
Posts: 26
Joined: Mon Mar 11, 2013 1:46 am

Re: I am having rampant failures with the newest NVIDIA driv

Post by Turbo_T »

Well, add one more failure to Project: 8900 (Run 264, Clone 3, Gen 73). I got to 1% and it failed. I think this makes 8 in a row with no results. Should I continue to let this GPU try to run, or stop it? Is there a switch I can activate in the Advanced or Expert tabs to constrain this GPU to x15 work and see if it will complete one of those? It seems I am not alone in having trouble with these WUs, but I don't want to clutter the servers and not produce anything useful. Thanks for the analysis on the WUs,
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Location: Salem, OR USA

Re: I am having rampant failures with the newest NVIDIA driv

Post by P5-133XL »

Project: 8900 (Run 264, Clone 3, Gen 73) has 5 failures and no successes.

There is no switch in v7 that will limit you to Core_15. If you revert to v6, you can guarantee running Core_15, because Core_17 won't run on v6.

Even if you are just failing WUs, as long as no one else is finishing them it is still useful to list them here, because if enough people try and fail, I'll manually kill the individual WU so it won't be assigned to anyone else.

Every moderator has the capability of manually killing a WU permanently. My normal threshold is 6+ failures before I lock out a specific WU. No one has ever told me what the threshold is supposed to be, and other moderators may use a different one. The main reason my number is so high is that lots of people OC, and that is the main cause of WU failure. I don't want to permanently destroy a good WU, so I deliberately keep my threshold high.
calxalot
Site Moderator
Posts: 1140
Joined: Sat Dec 08, 2007 1:33 am
Location: San Francisco, CA

Re: I am having rampant failures with the newest NVIDIA driv

Post by calxalot »

Some WUs also seem to be bad only on a particular platform.
For example, I've had bunches of 7610 WUs repeatedly fail immediately on start on OSX, but the same PRCG gets completed by others.
Turbo_T
Posts: 26
Joined: Mon Mar 11, 2013 1:46 am

Re: I am having rampant failures with the newest NVIDIA driv

Post by Turbo_T »

Well, my luck with the 8900 series continues to be all bad. I am also encountering a new issue that may be related, but I am not certain what is causing it.

*********************** Log Started 2014-02-16T04:14:51Z ***********************
04:25:27:WARNING:Exception: 8:127.0.0.1: Send error: 10053: An established connection was aborted by the software in your host machine.

Does anyone know what may be causing this? I am not aware of any firewall or antivirus errors or alerts that correspond to the connection being denied or stopped for any reason.
Jim Saunders
Posts: 45
Joined: Fri Jan 03, 2014 4:53 am
Hardware configuration: A: i5 + 2 GTX 660
B: i5 + 2 GTX 670
C: i7 + GTX670

Re: I am having rampant failures with the newest NVIDIA driv

Post by Jim Saunders »

One thing I've noticed is that one computer out of four here has the latest batch of Windows updates, and GPU folding has come up unstable on two different cards, both known to be reliable; that isn't the be-all and end-all, but it is distinct.

Jim
Good science and heat for my basement you say?
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Location: Salem, OR USA

Re: I am having rampant failures with the newest NVIDIA driv

Post by P5-133XL »

Turbo_T wrote: *********************** Log Started 2014-02-16T04:14:51Z ***********************
04:25:27:WARNING:Exception: 8:127.0.0.1: Send error: 10053: An established connection was aborted by the software in your host machine.

Does anyone know what may be causing this? I am not aware of any firewall or Virus program errors or alerts that correspond to the connection being denied or stopped for any reason.
Web client shutting down will cause that warning.
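If you want to confirm the client itself is still alive when that warning shows up (i.e. it was only the browser/Web Control connection that dropped), a quick probe of the client's local command port works. This is just a sketch, assuming the default command port 36330 and that connections from 127.0.0.1 are allowed.

Code:

# Check whether FAHClient is still listening on its local command port.
# 36330 is the v7 default; adjust if you changed command-port in the config.
import socket

def fahclient_alive(host="127.0.0.1", port=36330, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(4096)  # the telnet interface sends a greeting
            return bool(banner)
    except OSError:
        return False

print("FAHClient reachable" if fahclient_alive() else "FAHClient not reachable")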
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: I am having rampant failures with the newest NVIDIA driv

Post by PantherX »

Turbo_T wrote:...Does anyone know what may be causing this?...
Do note that in the latest Public Beta (V7.4.2), this message is now hidden by default since it was being mistaken as an error (https://fah.stanford.edu/projects/FAHClient/ticket/1054).
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
Turbo_T
Posts: 26
Joined: Mon Mar 11, 2013 1:46 am

Re: I am having rampant failures with the newest NVIDIA driv

Post by Turbo_T »

Correct, the second GPU is failing in all folding operations, and the primary does also if SLI is enabled. When SLI is disabled and the SLI connectors are disconnected, the primary GPU is able to fold all cores fine. The second GPU passes all memory and CUDA benchmark tests but will not fold successfully. The errors quoted above repeat and the unit fails.
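Since FahCore_17 runs through OpenCL rather than CUDA, it may also be worth checking that the second card is actually exposed as an OpenCL device in each SLI configuration. Here is a minimal sketch using the third-party pyopencl package (purely illustrative, not part of the client).

Code:

# Enumerate OpenCL platforms/devices to confirm the second GPU is visible
# to OpenCL (what FahCore_17 uses), not just to CUDA benchmarks.
import pyopencl as cl

for p_index, platform in enumerate(cl.get_platforms()):
    print(f"Platform {p_index}: {platform.name} ({platform.vendor})")
    for d_index, device in enumerate(platform.get_devices()):
        mem_mib = device.global_mem_size // (1024 * 1024)
        print(f"  Device {d_index}: {device.name.strip()}, {mem_mib} MiB global memory")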
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: I am having rampant failures with the newest NVIDIA driv

Post by bruce »

calxalot wrote:Some WUs also seem to be bad only on a particular platform.
For example, I've had bunches of 7610 WUs repeatedly fail immediately on start on OSX, but the same PRCG gets completed by others.
Has this been documented and reported as a bug associated with (each) particular core and WU?

If the PI is presented with the facts, they'll decide if it's worth restricting the project. Each project can be restricted to/from particular platforms until somebody fixes the core and/or the drivers, but only if they know about it. Things like that are supposed to be caught in beta testing, but might have been missed at the time.
Joe_H
Site Admin
Posts: 7939
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: I am having rampant failures with the newest NVIDIA driv

Post by Joe_H »

bruce wrote:
calxalot wrote:Some WUs also seem to be bad only on a particular platform.
For example, I've had bunches of 7610 WUs repeatedly fail immediately on start on OSX, but the same PRCG gets completed by others.
Has this been documented and reported as a bug associated with (each) particular core and WU?

If the PI is presented with the facts, they'll decide if it's worth restricting the project. Each project can be restricted to/from particular platforms until somebody fixes the core and/or the drivers, but only if they know about it. Things like that are supposed to be caught in beta testing, but might have been missed at the time.
Back when I was getting many 7610/7611 WUs, I noticed the same pattern with some that would fail. I made a number of postings in the forum for problem WUs. Overall, about half to two thirds of the WUs that immediately failed for me on OS X also failed for others, but the rest did work for other folders. Adding to the confusion, there were a number of "bad" WUs from those projects that would almost fail in some way and then continue to fold at a much higher TPF than normal. Since it has been at least a year since the last problem WU from those projects, the details are dimming in my memory.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: I am having rampant failures with the newest NVIDIA driv

Post by bruce »

Unfortunately, when your mods check the Mod DB, it may show one or more failures plus a success (or in some cases two), but we don't know that all of the failures were on, say, OSX and all the successes were on, say, Windows. During beta testing that information is often reported, so someone may put together enough information to propose a theory that it's related to the platform, and the Pande Group can do additional research, potentially uncovering a bug and leading to a fix.

We encourage anyone who runs the beta flag to report their suspicions in the beta forum, which means joining the forum and keeping a close watch on what's happening while the project is still in beta. The project will always be getting extra scrutiny from the PG at that time. It's harder to get their attention once the project has moved out of beta, though it's still possible as long as we can put together enough data to make the case that it's specifically related to platform differences.

That's also why we ask beta testers to agree to pay extra attention to the WUs they're assigned and why we DISCOURAGE people who aren't willing to make that sort of commitment from running with the beta flag.