New Assignment Server feedback/problem
Moderators: Site Moderators, FAHC Science Team
Re: New Assignment Server feedback/problem
As the cores themselves didn't change overnight, I am guessing that the new AS server is now giving out projects/WUs to Maxwell cards which it shouldn't. Your GPUs are not trying to complete anything - the process has hung. You either get a bad unit failure or a core hang.
Windows 11 x64 / 5800X@5Ghz / 32GB DDR4 3800 CL14 / 4090 FE / Creative Titanium HD / Sennheiser 650 / PSU Corsair AX1200i
-
- Posts: 47
- Joined: Sun May 16, 2010 1:44 am
Re: New Assignment Server feedback/problem
I have had the 2nd pair fail/not complete, and have been assigned 2 more, unfortunately. Is there any way to get these to stop showing up until this is resolved (probably Monday at this point, right)?
Re: New Assignment Server feedback/problem
I've had nothing but Bad Work Units for a few hours from projects 9406, 13000 and 13001.
Re: New Assignment Server feedback/problem
Yes, you are not the only one. 6 Maxwell cards doing nada: 3x 750 Ti and 3x 980.
-
- Posts: 177
- Joined: Tue Aug 26, 2014 9:48 pm
- Hardware configuration: 10 SMP folding slots on Intel Phi "Knights Landing" system, configured as 24 CPUs/slot
9 AMD GPU folding slots
31 Nvidia GPU folding slots
50 total folding slots
Average PPD/slot = 459,500 - Location: Dallas, TX
Re: New Assignment Server feedback/problem
Similar issues here: I got a series of 9406 WUs on a new 980 Maxwell GPU, and they all failed immediately. I've disabled the slot until a fix is apparent. The same 980 completed about two dozen Core 15 work units before the slot started failing on Core 17 WUs. Same failure mode every time:
22:46:33:WU02:FS01:0x17:ERROR:exception: Force RMSE error of 617.919 with threshold of 5
22:46:33:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
980 GPU is under-clocked by 5% from stock speeds.
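For anyone who wants to count these failures in a long log rather than eyeball them, here is a minimal Python sketch. It assumes only the two line shapes quoted above (timestamp, optional `WARNING:`, `WUnn`, `FSnn`, then the message); the function name `summarize_failures` and the dictionary keys are made up for illustration and are not part of the FAH client.

```python
import re

# Field layout inferred from the two log lines quoted above; other
# log lines may be shaped differently and are simply skipped.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):"   # e.g. 22:46:33
    r"(?:WARNING:)?"                    # present on the FahCore return line
    r"(?P<wu>WU\d+):(?P<slot>FS\d+):"   # e.g. WU02:FS01
    r"(?P<rest>.*)$"
)
RMSE_RE = re.compile(
    r"Force RMSE error of (?P<err>[\d.]+) with threshold of (?P<thr>[\d.]+)"
)
RETURN_RE = re.compile(
    r"FahCore returned: (?P<status>\w+) \((?P<code>\d+) = 0x[0-9a-fA-F]+\)"
)


def summarize_failures(lines):
    """Return one dict per Force RMSE or FahCore-return line found."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        base = {"time": m["time"], "wu": m["wu"], "slot": m["slot"]}
        rmse = RMSE_RE.search(m["rest"])
        ret = RETURN_RE.search(m["rest"])
        if rmse:
            events.append({**base, "kind": "rmse",
                           "error": float(rmse["err"]),
                           "threshold": float(rmse["thr"])})
        elif ret:
            events.append({**base, "kind": "return",
                           "status": ret["status"],
                           "code": int(ret["code"])})
    return events


if __name__ == "__main__":
    sample = [
        "22:46:33:WU02:FS01:0x17:ERROR:exception: Force RMSE error of 617.919 with threshold of 5",
        "22:46:33:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    ]
    for event in summarize_failures(sample):
        print(event)
```

Feeding it the lines of a saved log file (for example `summarize_failures(open(path))`, with `path` pointing at wherever your log is stored) gives a list you can group by slot or by return status.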
Hardware config viewtopic.php?f=66&t=17997&p=277235#p277235
-
- Posts: 47
- Joined: Sun May 16, 2010 1:44 am
Re: New Assignment Server feedback/problem
Curious, why are you underclocked?
-
- Posts: 177
- Joined: Tue Aug 26, 2014 9:48 pm
- Hardware configuration: 10 SMP folding slots on Intel Phi "Knights Landing" system, configured as 24 CPUs/slot
9 AMD GPU folding slots
31 Nvidia GPU folding slots
50 total folding slots
Average PPD/slot = 459,500 - Location: Dallas, TX
Re: New Assignment Server feedback/problem
I under-clock all my GPUs.
AMD-based GPUs run more stably when under-clocked: they process WUs and upload them to the collection server more reliably. The under-clock also improves their longevity by reducing heat. I pack a lot of AMD GPUs into each box (6 GPUs in a highly modified 4U server case, with two systems configured this way), so managing heat is an important issue at that density. They all run under 80 deg C in those enclosures when slightly under-clocked. Also, even though the power supply is spec'd at 1500 watts, the under-clock keeps consumed wattage at 1,200, which also helps preserve the life of the PS.
For Nvidia, the primary reasons are heat and longevity, since they tend to process and upload more reliably than the AMD GPUs (at least for me).
I have a planned life-cycle of 3 years for each GPU, at which point they are retired. Everyone has different priorities, but for me that balances a decent life-cycle investment with the time and energy it takes to maintain older hardware. After 3 years, it's time to give the GPU away or trash it and move on to newer hardware. Under-clocking virtually guarantees they make the 3 year time horizon for replacement.
That policy is also in effect for motherboards. I just decommissioned 2 AMD FX8350-based motherboards. A PCIe slot failed in one, and neither could support the PCIe 3.0 spec that's needed for optimal performance (on a PPD basis) with the highest-end Nvidia and AMD GPUs (both MBs were replaced with i7-4790K-based MBs). Next weekend, I'm decommissioning a 3rd AMD FX8350 motherboard and replacing it with an i7-5960X-based motherboard, retiring two AMD 7970s and replacing them with two Maxwell GTX 980s. With the EVGA Classified X99 MB (socket 2011-v3), I will be able to add a third 980 to that configuration at a later date.
I've wasted way too many hours trying to keep older hardware running reliably. Admittedly, it's a challenge to see if you can extract the last bit of life from a piece of ancient hardware, and even more interesting is how much I've learned about troubleshooting hardware issues. But I only have so much time to spend on my "FAH hobby", so I have to optimize around known-good, contemporary hardware.
Hardware config viewtopic.php?f=66&t=17997&p=277235#p277235
-
- Posts: 47
- Joined: Sun May 16, 2010 1:44 am
Re: New Assignment Server feedback/problem
That makes sense, I was curious. I still wish the 9406s would stop coming...until they are fixed or such...
-
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: New Assignment Server feedback/problem
We've been asked by donors to give Maxwells Core17 & Core18 WUs, and we've done so via the adv setting (please see the previous blog post). You can opt out by removing the adv setting.
Sounds like Maxwells, even with the latest drivers, aren't ready and/or we need to see what we can do on the core side.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
-
- Posts: 47
- Joined: Sun May 16, 2010 1:44 am
Re: New Assignment Server feedback/problem
Will it help at all to downgrade the drivers?
-
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: New Assignment Server feedback/problem
We had reports that the newest drivers worked (and they were looking good in our testing as well). We're going to:
1) revert to not giving Maxwells the latest cores
2) Yutong (aka Proteineer) has a plan for upgrading the cores to work around the driver issue and will get on that on Monday, if not sooner.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
-
- Posts: 47
- Joined: Sun May 16, 2010 1:44 am
Re: New Assignment Server feedback/problem
Sounds great, thanks for the update.
Re: New Assignment Server feedback/problem
I had ten 9406 WUs fail on my brand new 970, all with the Force RMSE error. No OC, 344.16 driver, latest client (7.4.4). I removed FAH and did a fresh install, but the first 9406 failed as well.
Re: New Assignment Server feedback/problem
@VijayPande, thanks, waiting for the revert. My experience: at the moment fah, advanced and beta all get 0x17 WUs, which all fail, and from what I saw last week, some (but not all) 0x18 WUs insta-fail too.
What I don't understand is how a few months ago Maxwell was apparently folding Core 17 WUs and now it's not:
viewtopic.php?f=80&t=25887&start=120
Windows 11 x64 / 5800X@5Ghz / 32GB DDR4 3800 CL14 / 4090 FE / Creative Titanium HD / Sennheiser 650 / PSU Corsair AX1200i
Re: New Assignment Server feedback/problem
I don't understand the need for the "advanced settings". I was getting Core 17 fine on all six of my GTX 750 Ti's (without the advanced tag) until a couple of days ago, when we were asked to use it. So I did, and have been getting only failures ever since.
My latest log is in this post.
viewtopic.php?f=80&t=25887&start=135