Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION


ntsarb
Posts: 10
Joined: Fri Sep 13, 2013 6:38 pm

Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by ntsarb »

Hello,

I've used F@H on a system with Quad 1080Ti + ASRock X99 WS + i7 6850K, for several months without any major issue.

I recently upgraded motherboard+CPU to ASUS X299 SAGE WS + i9 7900X and I can't run F@H on all GPUs, as in this case the system freezes for 1-2 minutes before it crashes to a BSOD with error code: DPC_WATCHDOG_VIOLATION. OS is Windows 10 Pro 64bit.

Last year, I experienced the same problem with an ASUS X99-E WS motherboard, using same GPUs and system components. Notably, both ASUS motherboards feature a PLX chip, i.e. PCI-E switch.

RAM has passed latest MemTest86 and F@H doesn't have any issues with any two GPUs on the motherboard. Blender3D doesn't have any issues rendering with all four GPUs, using CUDA, either.

Anyone else having similar issues with the particular motherboard in quad GPUs configuration?

Is this something that needs reporting to F@H developers as a potential bug? Could be an NVIDIA driver issue, too, but I presume F@H developers would be in a much better position to communicate the issue with NVIDIA's Tech Support?
foldy
Posts: 2040
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by foldy »

FAH may be the trigger but something else could be the reason.

Do you have the latest BIOS for the mainboard, and are the drivers up to date?

The Windows event log should show what caused the DPC_WATCHDOG_VIOLATION.

Here are some general solutions for the problem https://thewindowsplus.org/dpc_watchdog_violation/

Does your system freeze and crash instantly when you start FAH on all GPUs or does it first run for some time without problem?
toTOW
Site Moderator
Posts: 6349
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France
Contact:

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by toTOW »

Something may be wrong with one of your GPUs or your PSU ...

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
ntsarb
Posts: 10
Joined: Fri Sep 13, 2013 6:38 pm

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by ntsarb »

Hi foldy, toTOW,

Thanks for your responses. Here's the info you asked for:

- Already using the latest UEFI/BIOS firmware and drivers from ASUS's and NVIDIA's web sites.
- I've already gone through the common problems and solutions for dpc_watchdog_violation. None of these appear to be relevant to or help with the particular configuration.
- The system can freeze for several seconds, at random times. When the freeze lasts long enough (about 1-2 minutes), the dpc_watchdog_violation BSOD is triggered.
- The PSU is an EVGA 1600P2. The same PSU and the same other components (GPUs, CPU, RAM, SSD) have been used without any problem at all on an ASRock X99 WS motherboard; that was rock solid!

Worth noting:

Facts:
* The problem has been confirmed on two brand new ASUS X99-E WS motherboards and another brand new ASUS X299 SAGE motherboard. These motherboards employ Broadcom PEX8747 PCI-E switches to provide x16 PCI-E 3.0 lanes on each slot (which is useful for Deep Learning applications).

Opinion:
I suspect there's an incompatibility between the NVIDIA GPUs or Kernel Driver and the PLX PEX8747 switch... but I can't confirm, debug or resolve this. Only NVIDIA could do this, as they have the source code for their Kernel Driver.

Facts:
* All kernel memory minidump files (multiple BSOD minidumps investigated with WinDbg) indicate "Probably caused by : nvlddmkm.sys ( nvlddmkm+1c8301 )". Temperatures of GPUs, PCH (chipset) and CPU are within limits, most often far below the limits (e.g. 50-60 degrees C).
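For anyone who wants to check their own dumps the same way, here is a minimal Python sketch that tallies the "Probably caused by" line across several saved WinDbg analysis outputs. It assumes you have pasted each analysis into a text string; the function name `tally_causes` and the sample list are made up for illustration, and only the summary-line wording comes from the dumps quoted above.

```python
import re
from collections import Counter

# Matches the summary line WinDbg prints, e.g.
# "Probably caused by : nvlddmkm.sys ( nvlddmkm+1c8301 )"
CAUSE_RE = re.compile(r"Probably caused by\s*:\s*(\S+)\s*\(\s*([^)]*?)\s*\)")


def tally_causes(reports):
    """Count (module, location) pairs found across WinDbg analysis texts."""
    counts = Counter()
    for text in reports:
        for module, location in CAUSE_RE.findall(text):
            counts[(module, location)] += 1
    return counts


# Illustrative input only: the summary line is the one quoted in this post,
# repeated twice to stand in for two separate minidump analyses.
sample_reports = [
    "Probably caused by : nvlddmkm.sys ( nvlddmkm+1c8301 )",
    "Probably caused by : nvlddmkm.sys ( nvlddmkm+1c8301 )",
]

for (module, location), n in tally_causes(sample_reports).most_common():
    print(f"{n}x {module} at {location}")
```

If every dump points at the same module and offset, that is at least consistent with a single driver code path being involved, though it does not prove the driver is the root cause.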

Opinion:
I suspect the NVIDIA Kernel Driver loses communication with one of the GPUs. Maybe the PLX chip freezes. Whatever happens, it's always related to NVIDIA's Kernel Driver, on a PLX-based motherboard, which is why I suspect an incompatibility with the PLX switch.

Facts:
* Each GPU has been tested on its own on the ASUS motherboards, no problem at all.
* Each pair of GPUs (all combinations) was tested on the ASUS motherboards, with no problem at all either.
* Add a third GPU or a fourth GPU and the system becomes unstable.
* I've tested all 4 combinations of 3 GPUs, hoping to single out the one that might cause instability. All of them lead to an unstable system.
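The test matrix above can be written down explicitly. This is only a sketch of how the subsets can be enumerated in Python; the GPU labels and the helper name `subset_plan` are hypothetical and not tied to any real device IDs.

```python
from itertools import combinations

# Hypothetical labels for the four cards; substitute your own identifiers.
gpus = ["GPU0", "GPU1", "GPU2", "GPU3"]


def subset_plan(devices):
    """Map each subset size to the list of device combinations of that size."""
    return {
        size: list(combinations(devices, size))
        for size in range(1, len(devices) + 1)
    }


plan = subset_plan(gpus)
for size, subsets in plan.items():
    print(f"{size} GPU(s): {len(subsets)} combinations")
    for subset in subsets:
        print("   ", ", ".join(subset))
```

For four cards this gives 4 single-GPU runs, 6 pairs, 4 triples and 1 quad, which matches the tests listed above.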

Opinions:
One could theorise that one of the GPUs has a hardware defect that prevents it from working well in a triple or quad GPU setup, but this should be exhibited on the ASRock motherboard, too, where 3 and 4 GPUs were working perfectly fine for about 6 months (prior to upgrading to the ASUS X299 Sage).

Hence, I'm quite confident the GPUs are good. Both ASUS motherboards have exactly the same issue and they were tested with X99 CPU (same i7 6850K that was used on the ASRock motherboard) and X299 CPU (i9 7900X).

I've reported these to ASUS UK, which doesn't reply to my support requests (doesn't pick up the phone and doesn't respond to web form tickets), and to NVIDIA's Tech Support. NVIDIA's tech support are still asking me to reinstall drivers and try other basics, which have been performed numerous times.

If there are other users of the same hardware configuration who don't experience this issue, I'd like to hear from them, so as to better understand if it's a more general problem. From another forum, of 3D Rendering professionals, I've so far only found people with the same setup who suffer from the same problem.

I'm hopeful this is a software driver issue that can be fixed but I don't know how to persuade NVIDIA and/or ASUS to look into this. If there's a F@H developer or an NVIDIA or ASUS employee herein, who can help towards this direction, I'd be very grateful.

Regards
foldy
Posts: 2040
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by foldy »

"From another forum, of 3D Rendering professionals, I've so far only found people with the same setup who suffer from the same problem."

"If there's a F@H developer or an NVIDIA or ASUS employee herein, who can help towards this direction, I'd be very grateful."
There is none of them here. I guess we cannot solve this issue.

Last ideas: FAH has very high PCIe bus usage. Maybe there are BIOS settings that change how PCIe operates, e.g. reducing it to PCIe gen 2 or changing the clock speed. Or the PLX chip still gets too hot; try to find it on the mobo and feel its temperature with your finger.

(You did put in the 2x extra 8pin plugs for the mainboard?)

Or, for a roughly similar bluescreen where the GPUs freeze: increase TdrDelay from 2 to 10.
https://docs.microsoft.com/de-de/window ... d-recovery
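As a rough sketch of what that change looks like: TdrDelay is a REG_DWORD (in seconds, default 2) under the GraphicsDrivers key. The Python below only generates the text of a .reg file so you can review it before importing it; the helper name `tdr_delay_reg` is made up, and nothing here touches the registry by itself. A reboot is needed after importing for the new value to take effect.

```python
# Registry key that holds the TDR settings on Windows.
GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)


def tdr_delay_reg(seconds):
    """Return .reg file text that sets TdrDelay to the given number of seconds."""
    if not isinstance(seconds, int) or not 0 < seconds <= 0xFFFFFFFF:
        raise ValueError("seconds must be a positive 32-bit integer")
    return "\n".join(
        [
            "Windows Registry Editor Version 5.00",
            "",
            f"[{GRAPHICS_DRIVERS_KEY}]",
            # .reg files store DWORD values as 8 hexadecimal digits
            f'"TdrDelay"=dword:{seconds:08x}',
            "",
        ]
    )


# Print the file contents for a 10 second delay; save as e.g. tdrdelay.reg
print(tdr_delay_reg(10))
```

Generating the file rather than writing the registry directly keeps the change easy to inspect and easy to undo by importing a second file with the old value.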

And I found some registry values for DPC_WATCHDOG_VIOLATION from Windows 2012. Increasing the timeout may or may not help.
https://support.symantec.com/en_US/arti ... 36958.html
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by bruce »

ntsarb wrote:Is this something that needs reporting to F@H developers as a potential bug? Could be an NVIDIA driver issue, too, but I presume F@H developers would be in a much better position to communicate the issue with NVIDIA's Tech Support?
It's highly unlikely that FAH developers can do anything about this problem. Each FAHCore runs independently of the others and on a different GPU. It's up to your hardware and to the OS to supply the necessary resources to all of them.

The WATCHDOG VIOLATION is a generic Windows problem indicating that one of your drivers is hogging resources, but it's not smart enough to give a meaningful diagnosis of which one. (In fact, early watchdog violations were due to an SSD driver that Windows included in its list of approved drivers -- but it can be any one of your drivers.) FAH is not the problem! While we will cooperate with you in getting it fixed, ultimately it's a Windows problem that will be best solved by a site with more experience in diagnosing general driver problems.

Foldy has already suggested that it's probably a problem in distributing PCIe resources when there's a lot of contention ... and that sounds reasonable to me. FAH can't do anything about that.
ntsarb
Posts: 10
Joined: Fri Sep 13, 2013 6:38 pm

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by ntsarb »

Foldy, bruce, thanks for your constructive feedback, very much appreciated.

- Regarding thermals:
* There are 2 PLX chips ( PEX8747 ) under the same heatsink as the PCH (X299 chipset), which means the temperature of the PLX chips cannot deviate much from the heatsink's, which is measured in real time.
* The PCH temperature doesn't exceed 65C and the PLX chips are meant to operate up to 100C.
* There are 2x industrial-grade 14mm NOCTUA fans (up to 1800rpm) located in the front of the computer case, blowing cool air (ambient temperature 20-21C) over the motherboard. 3x NOCTUA fans (same type) are used for exhaust, one at the back and two more at the top.

- TDRDelay. Indeed, I'm aware of this setting and how it affects the operation of the computer. I've seen it in action and I don't think TDR is triggered.

- Regarding contention of PCI-E resources: I expect F@H to be exchanging small amounts of data with the GPUs. Furthermore, each GPU completes its work at a different time. There shouldn't be an issue there, except if there is a defect, which can't be true for 3 brand new motherboards.

- More Testing - turning things around!

Last night, I installed Ubuntu Linux 17.10.1 with NVIDIA's closed source drivers. I loaded all 4x GPUs with workload from Folding@Home for 6+ hours and Linux did not "panic" at all, i.e. no kernel panic (the equivalent of Windows' BSOD). This is great news, but I need to run lots more tests before I'm confident.
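For longer soak tests, a small logger helps spot a GPU dropping off the bus before the machine locks up. The following is a sketch only, assuming `nvidia-smi` is on the PATH and using its CSV query output; the helper names and the sample text are made up for illustration and are not measurements from my system.

```python
import csv
import io
import shutil
import subprocess

# nvidia-smi query that prints one "index, temperature, utilization" row per GPU
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,utilization.gpu",
    "--format=csv,noheader,nounits",
]

# Made-up sample output (not real measurements): GPU 2 is absent on purpose
SAMPLE = "0, 55, 98\n1, 57, 99\n3, 52, 97\n"


def parse_gpu_csv(text):
    """Parse nvidia-smi CSV rows into dicts, skipping malformed lines."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 3:
            continue
        try:
            index, temp_c, util_pct = (int(f) for f in fields)
        except ValueError:
            continue  # e.g. a field reported as not available
        rows.append({"index": index, "temp_c": temp_c, "util_pct": util_pct})
    return rows


def missing_gpus(rows, expected=4):
    """Return the indices in 0..expected-1 that did not report."""
    seen = {row["index"] for row in rows}
    return sorted(set(range(expected)) - seen)


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
        rows = parse_gpu_csv(result.stdout)
    else:
        rows = parse_gpu_csv(SAMPLE)  # fall back to the illustrative sample
    print(rows)
    print("missing:", missing_gpus(rows))
```

Running this from a loop or a cron job and appending the output to a file gives a timeline to compare against any freeze.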

If more tests in Linux pass successfully, that would mean the issue affects Windows in particular. In this case, it could be one or more of the following:

- Microsoft Windows 10 kernel
- Intel Chipset driver
- NVIDIA driver

Unfortunately, ASUS Tech UK does not respond to calls or support requests (tickets) that I submitted. As a matter of fact, their drivers are actually Intel's (chipset) drivers, but they should still respond, talk to Intel about the problem and let me know if/when a solution can be provided. NVIDIA is still looking for possible common causes, Microsoft blamed ASUS as incompatible with Windows 10 Pro Creator's Update (but they still allow ASUS to advertise its motherboard as being compatible with Windows 10) and closed the ticket. I think it's time to open a ticket with Intel, too.

Engineers from these companies need to talk to each other, otherwise there's little hope for this issue to be resolved.
foldy
Posts: 2040
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by foldy »

So Linux is the solution. It often also has better FAH performance in PPD. Do you need to switch back to Windows?
Jimboc
Posts: 68
Joined: Sun Feb 12, 2012 11:43 am
Hardware configuration: Corsair Obsidian 750D Windows Airflow Edition

Intel Core i9 Extreme 7980XE @ 2.6 GHz (18 cores, 36 threads)

64 GB Corsair Dominator Platinum DDR4 RAM

2x Nvidia Titan RTX (NVLink Enabled) (Nvidia 526.98 Studio Driver)

Asus Rampage VI APEX (BIOS 1401)(Intel X299 Chipset)

Corsair AX1600i Titanium Plus Power Supply

Corsair Neutron NX500 800GB SSD (System Drive)

Seagate SkyHawk 10 TB , 256 MB Cache (Data Drive)

Creative Sound BlasterX AE-5 Plus Pure Edition

Dell UP3218K and Dell U2711

Windows 11 Pro for Workstations 64 Bit (Version 22H2)
Location: Ireland

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by Jimboc »

ntsarb wrote:Foldy, bruce, thanks for your constructive feedback, very much appreciated.

Last night, I installed Ubuntu Linux 17.10.1 with NVIDIA's closed source drivers. I loaded all 4x GPUs with workload from Folding@Home for 6+ hours and Linux did not "panic" at all, i.e. no kernel panic (the equivalent of Windows' BSOD). This is great news, but I need to run lots more tests before I'm confident.
Hi ntsarb,

I was sorry to learn of the difficulties you faced with this issue especially when your hardware is so high-end. I know that frustration only too well myself.

I will be upgrading from Windows 8.1 to Windows 10 and using an Asus X299 motherboard in the next 2 to 3 months. Since I will have 2 GPUs, I probably won’t experience this but please do let us know how the testing on Linux you mentioned works out.

Many thanks for this information.
Kuno
Posts: 31
Joined: Sat Sep 23, 2017 4:59 pm

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by Kuno »

You've stated that you are testing in Linux and having no issues, which would mean it's more than likely an issue with the drivers in windows or the way that Windows is handling the data on the bus. You should stick with Linux anyways as you will be able to get more work done, and have less overhead on your folding rig. Windows should not be used for folding as there is seriously just too much overhead and you end up losing about 15%-20% of the performance of your cards.
networkingdude
Posts: 1
Joined: Thu Apr 05, 2018 12:07 am

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by networkingdude »

I am experiencing the exact same error with the same motherboard. I have two GTX 1080 Tis and an Intel X520 10Gb card. The two GPUs are installed in slots 1 and 3 and the 10Gb card is in slot 7.

The BIOS is up to date, and drivers have been updated to the latest versions with SDI.

The crash occurs when benching both GPUs at the same time.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by bruce »

Kuno wrote:Windows should not be used for folding as there is seriously just too much overhead and you end up losing about 15%-20% of the performance of your cards.
It's true that Windows has more overhead than Linux, but folding with whatever you have is much more important than not folding just because you have a Windows system. Continuing to fold with Windows is a no-brainer. Replacing a Windows system with a Linux system may be easy for those with a good computer background, or those with a desire to learn something new, but for those with moderate (or less) computer skills, it can become a non-trivial system upgrade.
foldy
Posts: 2040
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by foldy »

Does the Broadcom/Avago/PLX PEX 8747 chip have any drivers listed in Windows Device Manager => System devices?

Try editing Windows Power Options => Change plan settings and set PCI Express Link State Power Management to Off.

I found this Windows SDK package which is for developers only but maybe it has some magic to fix the FAH issue?
https://docs.broadcom.com/docs/SDK-Complete-Package
ilxli
Posts: 3
Joined: Tue May 08, 2018 10:43 am

ASUS X299-e WS Multi GPU problem solved

Post by ilxli »

Hey guys, I solved the multi GPU problem at my end.

My system is:
Win 10
ASUS X299-E WS
i7 5930K
2 x 1080 TI
2 x Titan X
64 GB Kingston memory.
All Sata ports in use.

I had the same problem DPC_WATCHDOG_VIOLATION every 5 minutes.
The problem started after the Win10 January update.

What I did to solve it:
Updated Win10 to the most recent version (this took a long while, with loads of WATCHDOGs in between).
When that was finally done, my computer was still very unstable, with a DPC_WATCHDOG_VIOLATION every few minutes.
Then I uninstalled all Nvidia drivers and installed the following drivers from Nvidia: 382.53-desktop-win10-64bit-international-whql

And that did the trick for me : )
It's a week later now and it's still running smoothly without any crash!

Here a link to the drivers:

http://www.nvidia.com/download/driverResults.aspx/119914/en-us
For me this is the only driver version that is stable.

This is my first post ever; it was too big a problem for me to let other people suffer from it.

Hope it helps some of you : )
windbeutel
Posts: 1
Joined: Thu May 17, 2018 7:43 am

Re: Quad 1080Ti + ASUS X299 + F@H > DPC_WATCHDOG_VIOLATION

Post by windbeutel »

Hello guys,

I do 3D rendering on GPUs (Octane and Redshift). My machine is the same ASUS x99-e ws with 4x1070.
And I experience the same problems on Win10 with the latest drivers. Random crashes/freezes and BSOD Watchdog violation.

I can confirm, the latest stable driver version is 382.53.
Unfortunately I need to install the latest drivers (supporting CUDA 9) in order to fully use functionality in the newest render builds.
Linux is no option for me since there is no Linux version of my 3D program (C4D).

There are many other people having this issue with PLX mainboards.
Everyone should contact NVIDIA support and open a ticket to push them to fix the driver.
I did this a month ago. But I have the feeling that since no more than two GPUs are supported for gaming any longer,
NVIDIA thinks folks like us (professional 3D rendering, folding, deep learning etc.) should go with "professional" cards like the Quadro or Tesla series.