I am having rampant failures with the newest NVIDIA drivers

It seems that a lot of GPU problems revolve around specific versions of drivers. Though NVIDIA has its own support structure, you can often learn from information reported by others who fold.

Moderators: Site Moderators, FAHC Science Team

Turbo_T
Posts: 26
Joined: Mon Mar 11, 2013 1:46 am

Re: I am having rampant failures with the newest NVIDIA driv

Post by Turbo_T »

It was folding 15's fine, and has completed a couple of 17's as well, but it has been unable to get beyond the first few steps in the last few days. I tried setting all clocks to the lowest settings (stock), manually deleting all Nvidia drivers and performing a clean install of the GPU drivers just now, and I bumped the voltage on the PCIe IOH and ICH to see if the dual-card setup was drawing the voltage too low. I have a GTX 650 Ti in the second slot. I will try folding that card again, but given the specific nature of the errors I thought someone might have seen this before and knew of a fix. Thanks,
Turbo_T
Posts: 26
Joined: Mon Mar 11, 2013 1:46 am

Re: I am having rampant failures with the newest NVIDIA driv

Post by Turbo_T »

Still getting the same failures after reloading the drivers:

Code:

*********************** Log Started 2014-02-15T17:02:43Z ***********************
17:02:44:WU00:FS00:Connecting to assign-GPU.stanford.edu:80
17:02:45:WU00:FS00:News: Welcome to Folding@Home
17:02:45:WU00:FS00:Assigned to work server 171.64.65.69
17:02:45:WU00:FS00:Requesting new work unit for slot 00: READY gpu:0:GF100 [GeForce GTX 480] from 171.64.65.69
17:02:45:WU00:FS00:Connecting to 171.64.65.69:8080
17:02:45:WU00:FS00:Downloading 4.18MiB
17:02:49:WU00:FS00:Download complete
17:02:49:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:8900 run:874 clone:0 gen:83 core:0x17 unit:0x00000095028c126651a6e90a37885474
17:02:49:WU00:FS00:Starting
17:02:49:WU00:FS00:Running FahCore: "E:\Stanford FAH\FAHClient/FAHCoreWrapper.exe" "E:/Stanford FAH/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_17.fah/FahCore_17.exe" -dir 00 -suffix 01 -version 703 -lifeline 5588 -checkpoint 30 -gpu 0 -gpu-vendor nvidia
17:02:49:WU00:FS00:Started FahCore on PID 368
17:02:49:WU00:FS00:Core PID:4280
17:02:49:WU00:FS00:FahCore 0x17 started
17:02:49:WU00:FS00:0x17:*********************** Log Started 2014-02-15T17:02:49Z ***********************
17:02:49:WU00:FS00:0x17:Project: 8900 (Run 874, Clone 0, Gen 83)
17:02:49:WU00:FS00:0x17:Unit: 0x00000095028c126651a6e90a37885474
17:02:49:WU00:FS00:0x17:CPU: 0x00000000000000000000000000000000
17:02:49:WU00:FS00:0x17:Machine: 0
17:02:49:WU00:FS00:0x17:Reading tar file state.xml
17:02:50:WU00:FS00:0x17:Reading tar file system.xml
17:02:50:WU00:FS00:0x17:Reading tar file integrator.xml
17:02:50:WU00:FS00:0x17:Reading tar file core.xml
17:02:50:WU00:FS00:0x17:Digital signatures verified
17:02:50:WU00:FS00:0x17:Folding@home GPU core17
17:02:50:WU00:FS00:0x17:Version 0.0.52
17:06:26:WU00:FS00:0x17:Completed 0 out of 2500000 steps (0%)
17:06:26:WU00:FS00:0x17:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
17:22:33:WU00:FS00:0x17:ERROR:exception: The periodic box size has decreased to less than twice the nonbonded cutoff.
17:22:33:WU00:FS00:0x17:Saving result file logfile_01.txt
17:22:33:WU00:FS00:0x17:Saving result file log.txt
17:22:33:WU00:FS00:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
17:22:33:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
17:22:33:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:8900 run:874 clone:0 gen:83 core:0x17 unit:0x00000095028c126651a6e90a37885474
17:22:33:WU00:FS00:Uploading 2.47KiB to 171.64.65.69
17:22:33:WU00:FS00:Connecting to 171.64.65.69:8080
17:22:34:WU01:FS00:Connecting to assign-GPU.stanford.edu:80
17:22:34:WU00:FS00:Upload complete
17:22:34:WU00:FS00:Server responded WORK_ACK (400)
17:22:34:WU00:FS00:Cleaning up
17:22:34:WU01:FS00:News: Welcome to Folding@Home
17:22:34:WU01:FS00:Assigned to work server 171.64.65.69
17:22:34:WU01:FS00:Requesting new work unit for slot 00: READY gpu:0:GF100 [GeForce GTX 480] from 171.64.65.69
17:22:34:WU01:FS00:Connecting to 171.64.65.69:8080
17:22:35:WU01:FS00:Downloading 4.18MiB
17:22:38:WU01:FS00:Download complete
17:22:38:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8900 run:264 clone:3 gen:73 core:0x17 unit:0x0000006c028c126651a6615b1a5badfe
17:22:38:WU01:FS00:Starting
17:22:38:WU01:FS00:Running FahCore: "E:\Stanford FAH\FAHClient/FAHCoreWrapper.exe" "E:/Stanford FAH/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_17.fah/FahCore_17.exe" -dir 01 -suffix 01 -version 703 -lifeline 5588 -checkpoint 30 -gpu 0 -gpu-vendor nvidia
17:22:38:WU01:FS00:Started FahCore on PID 1256
17:22:38:WU01:FS00:Core PID:4112
17:22:38:WU01:FS00:FahCore 0x17 started
17:22:38:WU01:FS00:0x17:*********************** Log Started 2014-02-15T17:22:38Z ***********************
17:22:38:WU01:FS00:0x17:Project: 8900 (Run 264, Clone 3, Gen 73)
17:22:38:WU01:FS00:0x17:Unit: 0x0000006c028c126651a6615b1a5badfe
17:22:38:WU01:FS00:0x17:CPU: 0x00000000000000000000000000000000
17:22:38:WU01:FS00:0x17:Machine: 0
17:22:38:WU01:FS00:0x17:Reading tar file state.xml
17:22:39:WU01:FS00:0x17:Reading tar file system.xml
17:22:40:WU01:FS00:0x17:Reading tar file integrator.xml
17:22:40:WU01:FS00:0x17:Reading tar file core.xml
17:22:40:WU01:FS00:0x17:Digital signatures verified
17:22:40:WU01:FS00:0x17:Folding@home GPU core17
17:22:40:WU01:FS00:0x17:Version 0.0.52
17:25:48:WU01:FS00:0x17:Completed 0 out of 2500000 steps (0%)
17:25:48:WU01:FS00:0x17:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
17:39:50:FS00:Finishing
17:47:31:WU01:FS00:0x17:Completed 25000 out of 2500000 steps (1%)
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm case fans with silicone fan gaskets and silicone mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: I am having rampant failures with the newest NVIDIA driv

Post by 7im »

Maybe a mod could check some of those WUs to see if others completed or also failed them.
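If you want to hand a mod a clean list, a rough sketch like the one below (Python; the log path is only an example) will scrape the FAULTY results out of the FAHClient v7 log and tally them by Project/Run/Clone/Gen, matching the "Sending unit results: ... error:FAULTY ..." lines quoted earlier in the thread.

Code:

# Rough sketch: tally failed (FAULTY) work units from a FAHClient v7 log.
# The path below is only an example; point it at your own data directory.
import re
from collections import Counter

LOG_PATH = r"E:\Stanford FAH\FAHClient\log.txt"  # example path

FAULTY = re.compile(
    r"error:FAULTY\s+project:(\d+)\s+run:(\d+)\s+clone:(\d+)\s+gen:(\d+)"
)

counts = Counter()
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        match = FAULTY.search(line)
        if match:
            counts[match.groups()] += 1

for (project, run, clone, gen), n in counts.most_common():
    print(f"Project: {project} (Run {run}, Clone {clone}, Gen {gen}) -- {n} failure(s)")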
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine #3:

Dell Dimension 8400, 3.2GHz P4, 4x512MB RAM, GTX 460 video card, Windows 7 X32

I am currently folding just on the 5x GTX 460's for approx. 70K PPD
Location: Salem, OR USA

Re: I am having rampant failures with the newest NVIDIA driv

Post by P5-133XL »

Project: 8900 (Run 446, Clone 3, Gen 67) Completed by someone else
Project: 8900 (Run 294, Clone 3, Gen 43) Completed by someone else
Project: 8900 (Run 302, Clone 0, Gen 296) Completed by someone else
Project: 8900 (Run 649, Clone 4, Gen 83) Has yet to be completed -- 2 failures
Project: 8900 (Run 627, Clone 1, Gen 286) Has yet to be completed -- 2 failures

Removed OC'ing

Project: 8900 (Run 874, Clone 0, Gen 83) Has yet to be completed -- 2 failures and a successful completion.
Project: 8900 (Run 264, Clone 3, Gen 73) Has yet to be completed -- 4 failures
Project: 8900 (Run 264, Clone 3, Gen 73) Has yet to be completed -- 4 failures and a successful completion.

Since you removed the OC'ing, none of your failed WUs has been completed by someone else, so it is still possible that you are just being unlucky with the WUs you've been assigned rather than there being a flaw in your setup.
Turbo_T
Posts: 26
Joined: Mon Mar 11, 2013 1:46 am

Re: I am having rampant failures with the newest NVIDIA driv

Post by Turbo_T »

Well, add one more failure to Project: 8900 (Run 264, Clone 3, Gen 73). I got to 1% and it failed. I think this makes 8 in a row with no results. Should I continue to let this GPU try to run, or stop it? Is there a switch I can activate in the Advanced or Expert tabs to constrain this GPU to x15 work and see if it will complete one of those? It seems I am not alone in having trouble with these WUs, but I don't want to clutter the servers and not produce anything useful. Thanks for the analysis on the WUs,
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Location: Salem, OR USA

Re: I am having rampant failures with the newest NVIDIA driv

Post by P5-133XL »

Project: 8900 (Run 264, Clone 3, Gen 73) has 5 failures and no successes.

There is no switch in v7 that will limit you to Core_15. If you revert to v6, you can guarantee running Core_15, because Core_17 won't run on v6.

Even if you are just failing WUs, as long as no one else is finishing them it is still useful to list them here, because if enough people try and fail, I'll manually kill the individual WU so it won't be assigned to anyone else.

Every moderator has the capability of manually killing a WU permanently. My normal threshold is 6+ failures before I lock out a specific WU. No one has ever told me what the threshold is supposed to be, and other moderators may use a different one. The main reason my number is so high is that lots of people OC, and that is the main cause of WU failure. I don't want to permanently destroy a good WU, so I deliberately keep my threshold high.
calxalot
Site Moderator
Posts: 1140
Joined: Sat Dec 08, 2007 1:33 am
Location: San Francisco, CA

Re: I am having rampant failures with the newest NVIDIA driv

Post by calxalot »

Some WUs also seem to be bad only on a particular platform.
For example, I've had bunches of 7610 WUs repeatedly fail immediately on start on OSX, but the same PRCG gets completed by others.
Turbo_T
Posts: 26
Joined: Mon Mar 11, 2013 1:46 am

Re: I am having rampant failures with the newest NVIDIA driv

Post by Turbo_T »

Well, my luck with the 8900 series continues to be all bad. I am also encountering a new issue that may be related, but I am not certain what is causing it.

*********************** Log Started 2014-02-16T04:14:51Z ***********************
04:25:27:WARNING:Exception: 8:127.0.0.1: Send error: 10053: An established connection was aborted by the software in your host machine.

Does anyone know what may be causing this? I am not aware of any firewall or antivirus errors or alerts that correspond to the connection being denied or stopped for any reason.
Jim Saunders
Posts: 45
Joined: Fri Jan 03, 2014 4:53 am
Hardware configuration: A: i5 + 2 GTX 660
B: i5 + 2 GTX 670
C: i7 + GTX670

Re: I am having rampant failures with the newest NVIDIA driv

Post by Jim Saunders »

One thing I've noticed is that one computer out of four here has the latest batch of Windows updates, and GPU folding has come up unstable on two different cards, both known to be reliable; that isn't the be-all and end-all, but it is distinct.

Jim
Good science and heat for my basement you say?
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Location: Salem, OR USA

Re: I am having rampant failures with the newest NVIDIA driv

Post by P5-133XL »

Turbo_T wrote: *********************** Log Started 2014-02-16T04:14:51Z ***********************
04:25:27:WARNING:Exception: 8:127.0.0.1: Send error: 10053: An established connection was aborted by the software in your host machine.

Does anyone know what may be causing this? I am not aware of any firewall or Virus program errors or alerts that correspond to the connection being denied or stopped for any reason.
Web client shutting down will cause that warning.
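If you want to confirm the client itself is still alive when that warning shows up (i.e. it was only the browser/Web Control connection that dropped), a quick probe of the client's local command port works. This is just a sketch, assuming the default command port 36330 and that connections from 127.0.0.1 are allowed.

Code:

# Check whether FAHClient is still listening on its local command port.
# 36330 is the v7 default; adjust if you changed command-port in the config.
import socket

def fahclient_alive(host="127.0.0.1", port=36330, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(4096)  # the telnet interface sends a greeting
            return bool(banner)
    except OSError:
        return False

print("FAHClient reachable" if fahclient_alive() else "FAHClient not reachable")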
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: I am having rampant failures with the newest NVIDIA driv

Post by PantherX »

Turbo_T wrote:...Does anyone know what may be causing this?...
Do note that in the latest Public Beta (V7.4.2), this message is now hidden by default since it was being mistaken as an error (https://fah.stanford.edu/projects/FAHClient/ticket/1054).
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
Turbo_T
Posts: 26
Joined: Mon Mar 11, 2013 1:46 am

Re: I am having rampant failures with the newest NVIDIA driv

Post by Turbo_T »

Correct, the second GPU is failing in all folding operations, and the primary does also if SLI is enabled. When SLI is disabled and the SLI connectors are disconnected, the primary GPU is able to fold all cores fine. The second GPU passes all memory and CUDA benchmark tests but will not fold successfully. The errors quoted above repeat and the unit fails.
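Since FahCore_17 runs through OpenCL rather than CUDA, it may also be worth checking that the second card is actually exposed as an OpenCL device in each SLI configuration. Here is a minimal sketch using the third-party pyopencl package (purely illustrative, not part of the client).

Code:

# Enumerate OpenCL platforms/devices to confirm the second GPU is visible
# to OpenCL (what FahCore_17 uses), not just to CUDA benchmarks.
import pyopencl as cl

for p_index, platform in enumerate(cl.get_platforms()):
    print(f"Platform {p_index}: {platform.name} ({platform.vendor})")
    for d_index, device in enumerate(platform.get_devices()):
        mem_mib = device.global_mem_size // (1024 * 1024)
        print(f"  Device {d_index}: {device.name.strip()}, {mem_mib} MiB global memory")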
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: I am having rampant failures with the newest NVIDIA driv

Post by bruce »

calxalot wrote:Some WUs also seem to be bad only on a particular platform.
For example, I've had bunches of 7610 WUs repeatedly fail immediately on start on OSX, but the same PRCG gets completed by others.
Has this been documented and reported as a bug associated with (each) particular core and WU?

If the PI is presented with the facts, they'll decide if it's worth restricting the project. Each project can be restricted to/from particular platforms until somebody fixes the core and/or the drivers, but only if they know about it. Things like that are supposed to be caught in beta testing, but might have been missed at the time.
Joe_H
Site Admin
Posts: 7939
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: I am having rampant failures with the newest NVIDIA driv

Post by Joe_H »

bruce wrote:
calxalot wrote:Some WUs also seem to be bad only on a particular platform.
For example, I've had bunches of 7610 WUs repeatedly fail immediately on start on OSX, but the same PRCG gets completed by others.
Has this been documented and reported as a bug associated with (each) particular core and WU?

If the PI is presented with the facts, they'll decide if it's worth restricting the project. Each project can be restricted to/from particular platforms until somebody fixes the core and/or the drivers, but only if they know about it. Things like that are supposed to be caught in beta testing, but might have been missed at the time.
Back when I was getting many 7610/7611 WUs, I noticed the same pattern with some that would fail. I made a number of postings in the forum for problem WUs. Overall, about half to two thirds of the WUs that immediately failed for me on OS X also failed for others, but the rest did work for other folders. Adding to the confusion, there were a number of "bad" WUs from those projects that would almost fail in some way and then continue to fold at a much higher TPF than normal. Since it has been at least a year since the last problem WU from those projects, the details are dimming in my memory.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: I am having rampant failures with the newest NVIDIA driv

Post by bruce »

Unfortunately, when your mods check the Mod DB, it may show one or more failures plus a success (or in some cases two), but we don't know that all of the failures were on, say, OSX and all the successes were on, say, Windows. During beta testing that information is often reported, so someone may put together enough information to propose a theory that it's related to the platform, and the Pande Group can do additional research, potentially uncovering a bug and leading to a fix.

We encourage anyone who runs the beta flag to report their suspicions in the beta forum, which means joining the forum and keeping a close watch on what's happening while the project is still in beta. The project will always be getting extra scrutiny from the PG at that time. It's harder to get their attention once the project has moved out of beta, though it's still possible as long as we can put together enough data to make the case that it's specifically related to platform differences.

That's also why we ask beta testers to agree to pay extra attention to the WUs they're assigned and why we DISCOURAGE people who aren't willing to make that sort of commitment from running with the beta flag.