16600 consistently crashing on AMD Radeon VII
Moderators: Site Moderators, FAHC Science Team
-
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
- Location: UK
Re: 16600 consistently crashing on AMD Radeon VII
@muziqaz ... so are you confirming that specific (guessing AMD) cards are running failure rates in the 80-90% range across a range of Projects? ... If so, can we change the thread topic to something more relevant than the current one, which implies the discussion/issue is about a single project ... the two recent sets of failure rates posted in this thread have very different failure-rate profiles across projects - one is specific to this project for the most part, the other appears to be failing on potentially all projects - albeit there may be some projects it is actually working on that have not been posted - this might indicate two different types of scenario?
Maybe even move the thread out of the "Issues with a specific WU" forum if it is a much wider issue, as from what you are confirming it appears not to be specific to one WU.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
-
- Posts: 59
- Joined: Tue Apr 07, 2020 8:53 pm
Re: 16600 consistently crashing on AMD Radeon VII
At least for me, this thread's focus and main finding, as in the title, is p16600 - which now has been disabled for everything AMD before Navi (5000 series), yeah! Thanks, muziqaz!
The "waters" are muddied if you just look at the plain failure rates per project reported in this thread, because at least the p134XX series is also showing up as failures (I guess on many cards, but also on the pre-Navi AMD cards which are affected by the totally unrelated p16600). But those p134XX failures seem to be far less critical, because they almost always happen in the first 9-17 seconds. And if they do not fail right at the beginning, those projects usually seem to complete without any further hiccup.
This thread is a very good example of how several effects can overlap and make the data hard to interpret, which made "the usually correct" explanation of just overclocked hardware etc. appear very convincing.
Even I was convinced my card had a hardware, clock or driver issue, and spent almost a week fiddling with drivers & underclocking, and now even more posters are coming forward reporting going through the same motions... almost like a #p16600metoo
-
- Posts: 946
- Joined: Sun Dec 16, 2007 6:22 pm
- Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
- Location: London
Re: 16600 consistently crashing on AMD Radeon VII
The goal of this thread was finally achieved by bringing this issue to the project owner's attention. The project is now being checked out in more detail and has been excluded for problematic hardware (for now).
FAH Omega tester
-
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
- Location: UK
Re: 16600 consistently crashing on AMD Radeon VII
Please could you get the Moonshot 134xx series Project owners to assess the impact of rapid failures from these types of cards ... whilst from the folder's perspective having the vast majority fail within the first 20 seconds costs little folding time, I really can't see that it makes sense ... If a number of these cards are failing WUs at speed, there is a chance that WUs which could be valid and folded without issue on, say, nvidia cards get 5 failures from these cards and get labelled as bad when they aren't necessarily so, which makes little sense to me.
The speed at which these cards fail potentially OK WUs means that it would not take very many of them to raise the statistical chance of a WU being hit by 5 of these card-related failures to non-negligible levels ... Moonshot WUs are quick results - but potentially throwing away valid results due to a group of "rapid fail doesn't matter because it doesn't impact our usefulness" GPUs seems madness to me
Perhaps someone could look at all WUs that have had five failed returns and check that they are not all quick failures from these types of GPU?
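The worry above can be put into rough numbers. The following is a minimal sketch, not anything taken from the FAH servers: it assumes each reassignment of a WU is independent, that a WU is retired after 5 failed returns (as described above), and that fast-failing GPUs pick up a share of assignments proportional to how quickly they come back for more work. All function names and the example figures (number of GPUs, WUs per hour, failure rates) are hypothetical.

```python
def assignment_share(n_fast_fail, n_healthy, wu_per_hour_fast_fail, wu_per_hour_healthy):
    """Fraction of all WU assignments that land on fast-failing GPUs.

    Assumption: a GPU's share of assignments is proportional to how many
    WUs per hour it requests, so a card that fails in seconds asks for
    far more work than one that folds each WU to completion.
    """
    fast = n_fast_fail * wu_per_hour_fast_fail
    healthy = n_healthy * wu_per_hour_healthy
    return fast / (fast + healthy)


def prob_wu_retired(share, fail_rate_fast, fail_rate_healthy, max_failures=5):
    """Probability that a perfectly good WU collects max_failures failures in a row.

    Assumption: every attempt is independent and lands on a fast-failing
    GPU with probability `share`.
    """
    per_attempt_failure = share * fail_rate_fast + (1.0 - share) * fail_rate_healthy
    return per_attempt_failure ** max_failures


def prob_all_failures_from_fast_gpus(share, fail_rate_fast, max_failures=5):
    """Probability that all max_failures failures came from fast-failing GPUs only."""
    return (share * fail_rate_fast) ** max_failures


if __name__ == "__main__":
    # Hypothetical inputs: 10 fast-failing GPUs returning a WU every 20 s
    # (180 per hour) among 10000 healthy GPUs finishing 4 WUs per hour.
    share = assignment_share(10, 10000, 180.0, 4.0)
    print(f"share of assignments on fast-failing GPUs: {share:.4f}")
    print(f"P(good WU retired): {prob_wu_retired(share, 0.8, 0.03):.3e}")
    print(f"P(all 5 failures from fast-failing GPUs): "
          f"{prob_all_failures_from_fast_gpus(share, 0.8):.3e}")
```

Changing the hypothetical inputs shows how the share of assignments grows as the number of fast-failing GPUs or their turnaround rate rises; whether the result is negligible depends entirely on the real figures, which only the project owners have.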
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
-
- Posts: 946
- Joined: Sun Dec 16, 2007 6:22 pm
- Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
- Location: London
Re: 16600 consistently crashing on AMD Radeon VII
The Moonshot project owner knows about the failed WUs, and has accepted it as is. They are not failing as much as 16600 or 16448.
Moonshot has a similar failure rate on all different GPUs, not just AMD. This is due to the nature of the simulation being done. The owner of the Moonshot project is also one of the lead devs for fahcore_22, so while other project owners are researchers using fahcore_22, the Moonshot owner actually developed and updated fahcore_22 to be able to do Moonshot type simulations. In that update process a lot of other bugs and issues have been fixed.
P.S. Out of 77 13422s my 3 different GPUs received, only 2 failed: one on Navi, one on either Radeon VII or Vega64.
FAH Omega tester
-
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
- Location: UK
Re: 16600 consistently crashing on AMD Radeon VII
... but on another machine we have been told of failure rates of 30 out of 37 for p13421 and 8 out of 9 WUs for p13423: 9 WUs !!!! ... and those errors are being assigned to the Project WUs, and that level of errors is seemingly being accepted as normal, with everyone happy for the machine to just keep fast-erroring WUs en masse.
Heck, if everyone is fine with machines failing at this rate, as declared earlier in this thread - and continuing to do so - when others such as yours or mine have minimal failure rates, then fine - I've tried to make the point about potentially overlooking perfectly good WUs due to this issue ... I guess if the Project Owner is actually happy to have this level of failures from a single machine (at least 80%) then far be it from me to argue.
I'll wind my neck in and simply ignore the absurdity of this scenario.
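For reference, the counts quoted in this thread can be turned into failure rates with a rough uncertainty range. This is a sketch only: the Wilson score interval is a standard textbook method swapped in here for illustration, not something FAH is known to use, and the function names are made up for the example. The counts are the ones posted above (30 of 37 for p13421, 8 of 9 for p13423, 2 of 77 for p13422).

```python
import math


def failure_rate(failed, total):
    """Plain failure fraction, e.g. 30 failed out of 37 returned WUs."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


def wilson_interval(failed, total, z=1.96):
    """Wilson score interval for a failure fraction (z=1.96 is roughly 95%).

    It behaves better than the naive +/- formula for small samples
    such as 9 WUs.
    """
    p = failure_rate(failed, total)
    z2 = z * z
    denom = 1.0 + z2 / total
    centre = (p + z2 / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z2 / (4.0 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Counts as reported by posters in this thread.
    reported = {"p13421": (30, 37), "p13423": (8, 9), "p13422": (2, 77)}
    for project, (failed, total) in reported.items():
        low, high = wilson_interval(failed, total)
        print(f"{project}: {failure_rate(failed, total):.1%} failed "
              f"(approx. range {low:.1%} to {high:.1%})")
```

With these counts the range for the 2-of-77 machine sits well below the ranges for the failing rig, even allowing for the small sample of 9 WUs.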
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
-
- Posts: 946
- Joined: Sun Dec 16, 2007 6:22 pm
- Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
- Location: London
Re: 16600 consistently crashing on AMD Radeon VII
A single AMD GPU is 0.00001% of the horsepower in the sea of nVidia GPUs. If you have one machine which fails constantly but 10000 other machines which fold the same project stably, would you go out of your way to halt the project (which needs to be finished as soon as possible)? Now if we had AMD actively involved to contribute in developing fahcore_22 and debugging most likely their own OpenCL mess, that would be insanely helpful and less time consuming. Fah devs are an extremely limited resource with their own priorities
FAH Omega tester
-
- Posts: 79
- Joined: Fri May 29, 2020 4:10 pm
Re: 16600 consistently crashing on AMD Radeon VII
muziqaz wrote: Now if we had AMD actively involved to contribute in developing fahcore_22 and debugging most likely their own OpenCL mess, that would be insanely helpful and less time consuming. Fah devs are an extremely limited resource with their own priorities
Of course I'm repeating myself, but as FAH devs are a very valuable and limited resource, they should mostly stick to the core development (in the sense of science) and should leave at least part of the debugging to others - which implies providing the source. Could AMD even be actively involved in debugging? Would they get access to the source to find out what triggers failures, even if they are caused by weaknesses in their OpenCL stack? Who else could get the source and help in debugging? I'm aware that JohnChodera is really listening and active to get things solved, but third party help would probably speed things up. In a closed company project you can say "We limit ourselves to this and that hardware to reduce compatibility issues", but in a project relying on the contribution of volunteers spread around the world the problems of contributors need to be taken seriously. If you start to say "We don't care, it's only a small number of contributors, so not worthwhile to deal with" it will hurt the project as a whole.
Re: 16600 consistently crashing on AMD Radeon VII
Neil-B wrote: ... and high rates of failure on 13421 (30 of 37 failed) and 13423 (7 of 8 failed) on the same rig ... that doesn't just feel like an issue with the 16600 project as far as that rig is concerned ... yes the 34 of 38 failures on 16600 may be down to an issue with the project but with the wider failures it feels like a rig issue or possibly an incompatible core to rig issue
That might be interesting. 13423 WUs never failed on my machine. For the last 2-3 days it was the only project I got and 100% of them completed successfully.
Dedicated machine, Win7 64bit, AMD RX570 4gb (not overclocked), Adrenalin 20.2.2, Client ver. 7.6.9, OpenCL 2.0 AMD-APP Driver 3004.8
Re: 16600 consistently crashing on AMD Radeon VII
ThWuensche wrote: Who else could get the source and help in debugging?
The folding cores (the part that runs on the GPU) already come from open source projects, and if I understand correctly they're planning to make the whole client open source. But if AMD or Apple were interested in helping out by improving their drivers or even making a folding core for Metal, then closed source isn't a hindrance to that.
Online: GTX 1660 Super + occasional CPU folding in the cold.
Offline: Radeon HD 7770, GTX 1050 Ti 4G OC, RX580
-
- Posts: 946
- Joined: Sun Dec 16, 2007 6:22 pm
- Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
- Location: London
Re: 16600 consistently crashing on AMD Radeon VII
ThWuensche wrote:
muziqaz wrote: Now if we had AMD actively involved to contribute in developing fahcore_22 and debugging most likely their own OpenCL mess, that would be insanely helpful and less time consuming. Fah devs are an extremely limited resource with their own priorities
Of course I'm repeating myself, but as FAH devs are a very valuable and limited resource, they should mostly stick to the core development (in the sense of science) and should leave at least part of the debugging to others - which implies providing the source. Could AMD even be actively involved in debugging? Would they get access to the source to find out what triggers failures, even if they are caused by weaknesses in their OpenCL stack? Who else could get the source and help in debugging? I'm aware that JohnChodera is really listening and active to get things solved, but third party help would probably speed things up. In a closed company project you can say "We limit ourselves to this and that hardware to reduce compatibility issues", but in a project relying on the contribution of volunteers spread around the world the problems of contributors need to be taken seriously. If you start to say "We don't care, it's only a small number of contributors, so not worthwhile to deal with" it will hurt the project as a whole.
nVidia is doing that.
FAH dev creates fahcore > nVidia rep takes that core and runs it through their hardware in their lab with all their driver profilers and tools > driver team either optimises the drivers for the fahcore, or they give suggestions/submit patches of code to FAH devs to improve fahcore.
A hardware vendor does not need to have the source code in order to optimise for the code.
I know how much nVidia is involved, and I just don't see the same involvement from AMD, not even close, which is a shame, as their hardware was always very strong in pure compute tasks.
Also, FAH devs mainly have nVidia hardware, as far as I know. I do not believe there are any AMD GPUs in their possession. At least we can be content that AMD CPUs punched through the Intel wall when it comes to FAH.
FAH Omega tester
-
- Posts: 124
- Joined: Sat Apr 18, 2020 1:50 pm
Re: 16600 consistently crashing on AMD Radeon VII
muziqaz wrote: Project has been disabled on all AMD cards but Navi. Please let us know if you still receive new p16600 WU on AMD GPU
Yep, still getting 16600 on Radeon VII (2nd Gen Vega) as of today (all times UTC):
Machine 1:
- 19.08.2020 - 20:55
- 19.08.2020 - 22:43
- 20.08.2020 - 10:2x
Machine 2:
- 19.08.2020 - 13:06
- 19.08.2020 - 13:11
- 19.08.2020 - 13:27
-
- Posts: 946
- Joined: Sun Dec 16, 2007 6:22 pm
- Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
- Location: London
Re: 16600 consistently crashing on AMD Radeon VII
NormalDiffusion wrote:
muziqaz wrote: Project has been disabled on all AMD cards but Navi. Please let us know if you still receive new p16600 WU on AMD GPU
Yep, still getting 16600 on Radeon VII (2nd Gen Vega) as of today (all times UTC):
Machine 1:
- 19.08.2020 - 20:55
- 19.08.2020 - 22:43
- 20.08.2020 - 10:2x
Machine 2:
- 19.08.2020 - 13:06
- 19.08.2020 - 13:11
- 19.08.2020 - 13:27
Thanks, we'll try other means
FAH Omega tester
-
- Posts: 124
- Joined: Sat Apr 18, 2020 1:50 pm
Re: 16600 consistently crashing on AMD Radeon VII
muziqaz wrote: Thanks, we'll try other means
But it's a lot less than before!
-
- Posts: 946
- Joined: Sun Dec 16, 2007 6:22 pm
- Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
- Location: London
Re: 16600 consistently crashing on AMD Radeon VII
NormalDiffusion wrote:
muziqaz wrote: Thanks, we'll try other means
But it's a lot less than before!
That's not good enough. It was set to exclude everything but Navi. Apparently the setting failed.
FAH Omega tester