Faulty RTX 3080?
Moderator: Site Moderators
Forum rules
Please read the forum rules before posting.
-
- Posts: 511
- Joined: Mon May 21, 2018 4:12 pm
- Hardware configuration: Ubuntu 22.04.2 LTS; NVidia 525.60.11; 2 x 4070ti; 4070; 4060ti; 3x 3080; 3070ti; 3070
- Location: Great White North
Faulty RTX 3080?
TLDR - I suspect I have a 3080 with faulty hardware and want to RMA it to EVGA but would like some advice on how to best communicate the issue to them.
I purchased a few B-Stock RTX 3080s from EVGA when they had them on fire sale a few months ago: the last blast of decent 2-slot-width Ampere GPUs.
I've been noticing that one of them was throwing WU errors at a rate of around 1 to 2 in 100 WUs (a 1%-2% error rate), so I've been keeping an eye on it. Over the last 3 days that rate jumped, so I trawled the logs to get a better idea of what is going on.
HfM shows 26 Failed WUs out of 1388 since the card was installed October 10th so a 1.87% Failure rate.
Over the current log period (about 60 hours) I saw 26 WUs complete without errors; 16 complete with 1-3 "Particle coordinate is NaN" errors that were recovered from; and 6 that had multiple "Particle coordinate is NaN" errors, ending with the dreaded:
"Max number of attempts to resume from last checkpoint (2) reached. Aborting."
So a 45% error rate and a 12.5% failure rate.
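For anyone who wants to tally these the same way, here's a rough sketch of the count I did by eye, as a script (an assumption-laden sketch only: the message strings are taken from my logs, and WU ids like WU00 are per-slot and get reused over time, so it's only reliable over a short log window):

```python
# Rough sketch: tally "Particle coordinate is NaN" errors and aborts per WU
# from a FAHClient log. Message strings match the log excerpts in this thread;
# WU ids (WU00, WU01, ...) are reused per slot, so only trust this over a
# short window.
from collections import defaultdict

def tally(lines):
    nan_errors = defaultdict(int)  # WU id -> count of NaN messages
    aborted = set()                # WUs that hit the checkpoint-resume limit
    for line in lines:
        # Log lines look like "03:14:38:WU00:FS01:0x22:<message>"
        wu = next((p for p in line.split(":") if p.startswith("WU")), None)
        if wu is None:
            continue
        if "Particle coordinate is NaN" in line:
            nan_errors[wu] += 1
        if "Max number of attempts to resume from last checkpoint" in line:
            aborted.add(wu)
    return dict(nan_errors), aborted
```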
Looking at the results I see the same Projects completing with no errors sometimes, completing with errors sometimes and failing sometimes and I see this for multiple distinct Projects. (i.e. likely not a Bad Project nor a small vs. large atom size dependency)
I have no cause preference set nor do I have the Beta flag set so I'm getting a good mix of bog standard WUs.
My systems are all running Ubuntu LTS 22.04.1 and Driver 525.60.11 and they all run a dual axial 30[7-8]0(ti) in the lower slot and a 2070 Super Hybrid in the Upper Slot.
The problem has persisted even after loading current patches and upgrading the GPU driver. The other two systems with 3080s are similarly configured and show no errors (i.e. not a single NaN error let alone any Aborts). The RTX 2070Super in this system shows no errors.
This system is my lone Intel system with a 9900K; the rest are X470 and X570 systems with a mix of R9 39[5|0]0X and 59[5|0]0X CPUs and current BIOSes.
I've run multiple GPUs in this same system for a few years with no other issues.
I run my GPUs with the graphics clocks capped at a maximum of 1500MHz for improved efficiency. The systems all draw between 350-450W depending on the mix of GPUs, and they have Corsair 750W 80+ Gold power supplies and individual UPSes.
I'm fairly sure the above excludes a platform or power issue, and I'm almost convinced it is a hardware issue with the GPU. As a last sanity check, once the electricity rates double during the on-peak period on Monday, I'm going to swap the 3080 with one of the others (identical model) in another system; if the errors move with the GPU then I will open an RMA with EVGA.
I suspect there is an intermittent electrical fault in the card, bad solder joint, bad memory ...
Are there any well-regarded GPU test programs like memtest for RAM that might be used? I suppose I could run the Core21 version of FAHBench for a while.
Any suggestions on how to best communicate the issue to EVGA?
-
- Posts: 511
- Joined: Mon May 21, 2018 4:12 pm
- Hardware configuration: Ubuntu 22.04.2 LTS; NVidia 525.60.11; 2 x 4070ti; 4070; 4060ti; 3x 3080; 3070ti; 3070
- Location: Great White North
Re: Faulty RTX 3080?
Swapped the 3080s between two systems. After about a week I'm seeing: 75 WUs; 14 completing with a single Error; 14 completing with 2 or more Errors; 5 failing; 42 Completing without Error.
Success Rate: 93.3%; Failure Rate: 6.7%; Error Rate: 44%
The conclusion appears to be that as the Errors moved with the GPU it is the GPU at fault not the underlying system.
I'm going to try running FAHBench tomorrow in the hope it will throw errors that are reproducible.
-
- Posts: 520
- Joined: Fri Apr 03, 2020 2:22 pm
- Hardware configuration: ASRock X370M PRO4
Ryzen 2400G APU
16 GB DDR4-3200
MSI GTX 1660 Super Gaming X
Re: Faulty RTX 3080?
You stated that you had failures across a number of work units. Have you checked to see if any of the failures were consecutive, with a trend of starting with a certain work unit? In your case, you might want to also check to see what the other (2070 supers) were running or doing at the time.
I had an issue with the 18601 work units that I never entirely figured out. It seems that once any error on an 18601 WU was tripped, the GPU was more likely to have more errors over time. On a couple of occasions it failed those WUs and caused the ones that came after them to fail as well. This also happened once before that GPU was in the system, and I thought it had to do with the driver issues for my iGPU. Not much later I picked up an Nvidia GPU and had the issue a couple of times, only ever triggered by 18601.
In my case I noticed that it was only when using my GPU power limited at lowest settings (53% in this case) that I had the issues. Bumping the power limit to 60% or higher ended the failed units, but still I would get an error on say at least 1 of each 5 of project 18601. The trend seemed to be that as power limits grew, the number of errors went down.
As for testing, I couldn't find anything that would make the system even blink except for folding. I did a number of tests a number of ways and they always indicated everything was fine.
Fold them if you get them!
-
- Posts: 511
- Joined: Mon May 21, 2018 4:12 pm
- Hardware configuration: Ubuntu 22.04.2 LTS; NVidia 525.60.11; 2 x 4070ti; 4070; 4060ti; 3x 3080; 3070ti; 3070
- Location: Great White North
Re: Faulty RTX 3080?
Thanks for the suggestion Bob. I’ll try running it at a higher maximum clock this weekend and see if there’s any change.
-
- Posts: 520
- Joined: Fri Apr 03, 2020 2:22 pm
- Hardware configuration: ASRock X370M PRO4
Ryzen 2400G APU
16 GB DDR4-3200
MSI GTX 1660 Super Gaming X
Re: Faulty RTX 3080?
Gordon,
One thought that I forgot above was the power the project used. Using GPU-Z, the project that gave me errors at lower power limits was the first and only project that I noticed would be power capped by the GPU when running at 100% power. Not all (PRCG) of them would do this, but some would. Since the stability increased when I power limited to higher settings, I never thought to try to run more of that project at 100% settings or higher. But it did seem that the particular project was more capable of drawing higher power from my GPU.
I did try investigating other possible causes in the GPU alone, but none other than the power draw and limits I set seemed to matter. Fan curves, overclock/undervolt, CPU resources feeding it, memory speed reduction, etc.... the only one that mattered in stability was the power limit.
And though I don't know that it had an impact, I also noticed on that project that, when viewing in MSI Afterburner, it seemed to take longer for the memory and GPU core clocks to bump up to a stable range. With most projects you might get a quick spike or two, but with 18601 it was several spikes with more delay between them. Almost as if it was taking more time for the GPU to saturate or something. Once running, the usual speed bumps up and down seemed in line with any other project.
Fold them if you get them!
-
- Posts: 511
- Joined: Mon May 21, 2018 4:12 pm
- Hardware configuration: Ubuntu 22.04.2 LTS; NVidia 525.60.11; 2 x 4070ti; 4070; 4060ti; 3x 3080; 3070ti; 3070
- Location: Great White North
Re: Faulty RTX 3080?
Bob,
Thanks for the suggestions.
The GPU fails independently of any project. Sometimes it works on a specific WU, sometimes it errors but succeeds on the same WU, sometimes it fails. There is no rhyme or reason to the WUs it succeeds at, errors on or fails at.
I did try raising the clock speed to the mid-1700MHz range and it still exhibited the same behaviour.
I'm convinced at this point the hardware is bad and am going to open an RMA with EVGA.
-
- Posts: 520
- Joined: Fri Apr 03, 2020 2:22 pm
- Hardware configuration: ASRock X370M PRO4
Ryzen 2400G APU
16 GB DDR4-3200
MSI GTX 1660 Super Gaming X
Re: Faulty RTX 3080?
Seeing as this got overlooked...
I'd just give it to them straight. You can prove that other hardware successfully completed each failed work unit, and that your card's flaws are creating the errors. Since you swapped machines and the errors followed the GPU, you have ruled out the rest of the hardware and any OS-related angle. It seems you already have the numbers on hand to work out the final failure and error rates, as well as what might be considered "normal" for a GPU folding... which is essentially an error rate of next to nothing and a failure rate closer to nothing.
I've never had to deal with EVGA on returns, but the absolute worst that can happen is that they fight it. Since they are getting out of the GPU game, you could remind them that a decision not to honor a return might easily influence you, and others you might tell about it, as to whether they are an honest and reputable company to purchase from in the future.
But hopefully they will just allow a return.
Fold them if you get them!
-
- Posts: 65
- Joined: Sat May 09, 2020 2:13 pm
- Hardware configuration: Intel Xeon E3/E5, various generations from Westmere to Skylake. AMD Radeon RX5x00 and nVidia RTX 2080 Super.
- Location: Boston
- Contact:
Re: Faulty RTX 3080?
I got the EVGA RTX 3080 Ti in Aug 2021. At some point it started to hang periodically: it would complete a WU but never complete the upload or start the next WU. None of my other cards do this. It might be correlated to a Windows update requiring a restart, but not always.
I have had an RTX 2080 Ti fail completely, possibly also bringing down the motherboard and CPU. Note: I did not try replacing the CPU, or trying the CPU in another motherboard, because this was an Intel 10th gen and the 12th gen was already out, so I did not want to buy more 10th-gen parts.
I also had a system with an RTX 3060 fail; moving the CPU to a new motherboard worked. I have not tried the GPU in a different system because trying the failed 2080 in different systems seemed to cause problems.
Because of this, I am hesitant to buy RTX 40 parts.
-
- Posts: 65
- Joined: Sat May 09, 2020 2:13 pm
- Hardware configuration: Intel Xeon E3/E5, various generations from Westmere to Skylake. AMD Radeon RX5x00 and nVidia RTX 2080 Super.
- Location: Boston
- Contact:
Re: Faulty RTX 3080?
Below is the log of a successful completion:
03:14:38:WU00:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
03:14:38:WU00:FS01:0x22:Average performance: 216 ns/day
03:14:39:WU00:FS01:0x22:Checkpoint completed at step 1000000
03:14:40:WU00:FS01:0x22:Saving result file ..\logfile_01.txt
03:14:40:WU00:FS01:0x22:Saving result file checkpointIntegrator.xml
03:14:40:WU00:FS01:0x22:Saving result file checkpointState.xml
03:14:42:WU00:FS01:0x22:Saving result file positions.xtc
03:14:42:WU00:FS01:0x22:Saving result file science.log
03:14:42:WU00:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
03:14:43:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
03:14:43:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:14941 run:4 clone:41 gen:74 core:0x22 unit:0x000000290000004a00003a5d00000004
03:14:43:WU00:FS01:Uploading 23.80MiB to 128.174.73.78
03:14:43:WU00:FS01:Connecting to 128.174.73.78:8080
03:14:43:WU01:FS01:Starting
Below is the log of a stall:
03:41:45:WU01:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
03:41:45:WU01:FS01:0x22:Average performance: 216 ns/day
03:41:46:WU01:FS01:0x22:Checkpoint completed at step 1000000
03:41:48:WU01:FS01:0x22:Saving result file ..\logfile_01.txt
03:41:48:WU01:FS01:0x22:Saving result file checkpointIntegrator.xml
03:41:48:WU01:FS01:0x22:Saving result file checkpointState.xml
03:41:49:WU01:FS01:0x22:Saving result file positions.xtc
03:41:49:WU01:FS01:0x22:Saving result file science.log
03:41:49:WU01:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
Then it sits here forever; an OS restart seems to restart folding.
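If it helps anyone else spot these, here's a rough way to scan a log window for this kind of stall: a WU whose core shut down with FINISHED_UNIT but where the client never logged a matching "FahCore returned" line. A sketch only; it assumes the exact message strings above, and that WU ids aren't reused within the window being scanned:

```python
# Rough sketch: flag WUs whose core finished (FINISHED_UNIT shutdown) but
# where the client never logged "FahCore returned", i.e. the upload/next-WU
# handoff stalled. WU ids (WU00/WU01) are per-slot and reused over time, so
# this only works within one short log window.
def find_stalled(lines):
    shutdown = []   # WUs whose core finished, in log order
    returned = set()
    for line in lines:
        # Log lines look like "03:41:49:WU01:FS01:0x22:<message>"
        wu = next((p for p in line.split(":") if p.startswith("WU")), None)
        if wu is None:
            continue
        if "Folding@home Core Shutdown: FINISHED_UNIT" in line:
            shutdown.append(wu)
        elif "FahCore returned: FINISHED_UNIT" in line:
            returned.add(wu)
    return [wu for wu in shutdown if wu not in returned]
```

Running it over the two excerpts above would flag only the stalled WU, since the successful one has both the shutdown and the "FahCore returned" line.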
-
- Posts: 511
- Joined: Mon May 21, 2018 4:12 pm
- Hardware configuration: Ubuntu 22.04.2 LTS; NVidia 525.60.11; 2 x 4070ti; 4070; 4060ti; 3x 3080; 3070ti; 3070
- Location: Great White North
Re: Faulty RTX 3080?
Update.
I opened an RMA with EVGA and linked this post to add to the description. It went smoothly. They authorized the RMA, and I had to ship the GPU back to them at my expense, which was about $100 Cdn for UPS Ground with signature and insurance. It took about 6 days to get to California, and the next day they shipped a replacement back UPS Expedited. I received it just now and it's in a test system working on a WU. I'll leave it to run for a couple of days at stock settings, and if all looks good I'll move it back into one of my production systems and lock the graphics clock to 1440MHz for peak efficiency.
All in all it was a pleasure to deal with EVGA, and it is a pity they had to get out of the business, as I've heard some real horror stories about other people's experiences trying to RMA GPUs to some of the other vendors. I haven't settled on a new preferred brand yet. I have an Asus TUF 3070ti which is a little too chunky for my liking, but I'm also impressed with a Zotac Trinity 4070ti that I picked up recently.
-
- Posts: 520
- Joined: Fri Apr 03, 2020 2:22 pm
- Hardware configuration: ASRock X370M PRO4
Ryzen 2400G APU
16 GB DDR4-3200
MSI GTX 1660 Super Gaming X
Re: Faulty RTX 3080?
Good to hear that they gave you a prompt replacement, though it sucks that shipping is that bad these days. I've heard plenty of horror stories with returns on PC gear, and I'm glad to see EVGA did you right. Even more so since I purchased a power supply from them last fall.
Fold them if you get them!