Temperature/GPU Usage unstable; drops align with checkpoints

PRP_R148H
Posts: 8
Joined: Sat Feb 27, 2021 7:14 am

Post by PRP_R148H »

OS: Kubuntu 18.04 LTS
FAHClient: 7.6.21
Nvidia Driver: 460.39
Folding for a few hours on my 3070 yields the following trend in the GPU temperature:
[Image: GPU temperature graph]

And GPU utilisation:
[Image: GPU utilisation graph]

Each dip corresponds to a checkpoint. Even though I have set the checkpoint interval to 30 minutes, my client seems to be checkpointing every 5-6 minutes. At first I thought this might be due to write latency on my HDD (USB 3.1), but I have seen other work units on the card run much more steadily, e.g. the work unit on the far left of this graph from a separate GPU (another 3070):
[Image: graph from the second 3070]

No error logs in the FAHClient console. Possibly still working through my first 10 WUs. Can provide nvidia log dump if needed or can provide a snippet if directed.

Thoughts?
JimboPalmer
Posts: 2522
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Post by JimboPalmer »

Welcome to Folding@Home!

I would comment that the GPU checkpoint code runs on the CPU, not the GPU, so a drop in GPU utilization while checkpointing makes sense.

My understanding is that the GPU checkpoint interval is set by the developer, not by the user as it is for CPU checkpoints.
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
PRP_R148H
Post by PRP_R148H »

Thanks for the welcome Jimbo, and for the tip about the checkpoint flag.

I'm currently folding on a 980ti as well. That card folds with both temperature and GPU utilisation as a steady line. Should I be concerned about this fluctuation on the 3070s?
JimboPalmer
Post by JimboPalmer »

[None of my GPUs are as fast as your slower GPU, let alone your faster one, so this is theory]

I think your 3070 is so fast that the CPU is 'slow' by comparison, while your 980ti is so 'slow' that the CPU gets its stuff done quickly.
Joe_H
Site Admin
Posts: 7939
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Post by Joe_H »

As mentioned, the checkpoint for a GPU WU is set by the researcher. When it happens is printed in the log file.

Utilization of a card like your 3070 will depend on the size of the WU in atoms. WUs with many atoms will have a higher utilization percentage, but the checkpoint, which is done on your CPU and includes a sanity-check calculation, may take longer than for a WU with fewer atoms.
iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Post by Neil-B »

I am not seeing anything like the same level of fluctuation, tbh. I am running an Asus Strix RTX 3070 OC on Win 10. At checkpoint I see maybe a 2C drop in temperature, from 55C to 53C, on the current WU (monitored by HWMonitor and confirmed by GPU Tweak II). If your GPU has time to cool as much as it does, and shows the utilisation drop you are seeing, I have to ask: is your CPU loaded up as well? For a variety of reasons I am just GPU folding at the moment, with the GPU supported by an i9-10850K, 64GB RAM and an NVMe drive, so the kit is doing nothing else but making sure the GPU folds effectively. I am wondering whether the GPU WU checkpointing has to queue for resources if the CPU is loaded up. That might mean the GPU has to wait much longer, giving it a chance to cool significantly and show significant drops in utilisation (mine only dropped from 100% to 96% before it ramped up again).
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
PRP_R148H
Post by PRP_R148H »

Thanks Neil, I've checked my CPU and RAM usage (CPU: Ryzen 5 3500X, 2 cores at 100% for the GPU threads, other 4 cores idle. RAM: 3.3GB of 7.7GB used, no fluctuation). The only thing I can think of is that my "HDD" is a USB 3.1 stick. My System Monitor shows that during checkpointing the CPU thread drops to zero and goes into `disk sleep` for 3-4 seconds. This could explain the dip: the GPU is waiting with nothing to do while the write completes. Maybe I should go and pick up a cheap SSD and re-install the system.
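
One way to test the slow-disk theory before buying hardware is to time a synced write on the drive in question. The sketch below is only an illustration: the 8 MiB payload is an assumed stand-in for a checkpoint file (the real size depends on the project), `time_synced_write` is a made-up helper name, and the target directory should be changed to wherever FAHClient keeps its work folder on your system.

```python
import os
import tempfile
import time


def time_synced_write(directory, size_bytes=8 * 1024 * 1024):
    """Write size_bytes to a temp file in directory, fsync it, return elapsed seconds."""
    payload = os.urandom(size_bytes)  # assumed payload size, not a real checkpoint
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # block until the device reports the data written
        return time.perf_counter() - start
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Hypothetical target: replace with the directory FAHClient writes to.
    target = tempfile.gettempdir()
    samples = [time_synced_write(target) for _ in range(3)]
    print(f"{target}: worst {max(samples):.3f}s, best {min(samples):.3f}s")
```

If the worst case on the USB stick is in the range of seconds while an internal drive finishes in a fraction of that, it would support the idea that the checkpoint write is what stalls the GPU.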
Neil-B
Post by Neil-B »

That sounds as if you have identified the issue. If the GPU is waiting on the CPU, which is waiting on the USB reads/writes for that long, then your GPU probably thinks it is on holiday ;) ... Hope you get it sorted :) At least SSDs/NVMe drives are not artificially inflated in price at the moment; the lunacy of GPU prices at the moment is scary :(
PRP_R148H
Post by PRP_R148H »

Well, if we can't go on holiday right now, at least our GPUs can! I'll see how the SSD fares. Yes it's quite a farce what's happened to the GPU market. I managed to grab a pair of fairly (?) well priced 3070s from a store and as soon as I bought them, they raised the price $200 for the next batch. Wow.

Also thanks Joe_H for that explanation of the WU:atom business and how that affects checkpointing.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Post by bruce »

As has already been said, the checkpoint frequency for GPUs is defined by the researcher. So project M may be quite different than project N.

The client's checkpoint frequency setting does apply to CPU projects. The software in the FAHCores (GROMACS vs. OpenMM) has been developed by different teams, so there are significant differences. Closer to the users are FAHControl and FAHClient, which have to support whatever is available at the interface with the FAHCore.

Question: Is your CPU loaded doing the calculation of another WU or is it idle ... free to accept the workload of doing the sanity check of the GPU's assignment? The shapes of the dips may depend on that answer.
ipkh
Posts: 173
Joined: Thu Jul 16, 2015 2:03 pm

Post by ipkh »

What are your complete system specs? A larger amount of RAM could help, as Linux will use it as a buffer for HDD writes. Making sure there's always a free CPU core for checkpoints would also help. USB and spinning disks aren't great for random access, but writes should be cached in RAM or in the hard disk's onboard cache.
If your concern is the temp/utilization spikes, you can use the Nvidia X-Server Settings applet to prefer maximum performance, and the card won't enter idle clocks.
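
To see how much write data Linux is holding in RAM at a given moment, the `Dirty` and `Writeback` fields of `/proc/meminfo` can be read while a checkpoint is in progress. This is a rough sketch: `parse_meminfo_kb` is a made-up helper, and the `SAMPLE` text holds invented numbers used only as a fallback on systems without `/proc/meminfo`.

```python
def parse_meminfo_kb(text, fields=("Dirty", "Writeback")):
    """Return {field: kB} for the requested fields of /proc/meminfo-style text."""
    wanted = set(fields)
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in wanted:
            values[name] = int(rest.split()[0])  # first token is the size in kB
    return values


# Illustrative sample only; the numbers are invented.
SAMPLE = """MemTotal:        8074000 kB
Dirty:             51200 kB
Writeback:          1024 kB
WritebackTmp:          0 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            text = handle.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on non-Linux systems
    print(parse_meminfo_kb(text))
```

A `Dirty` value that climbs at each checkpoint and then drains slowly would suggest that the writes are being buffered but the device is slow to accept them.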
Neil-B
Post by Neil-B »

The OP has posted that they have probably tracked down the issue (see middle of thread). Using a USB disk is causing the CPU to hang for a few seconds on checkpointing (so no hard disk onboard cache to worry about), which appears to be the cause of the significant drop in temperature and utilisation of the GPU. It probably isn't worth finessing any possible solutions until this part of the equation has been resolved?
PRP_R148H
Post by PRP_R148H »

ipkh wrote:What are your complete system specs?
Here; but no HDD - just a Gen 3.1 USB stick running kubuntu:
me wrote:I've checked my CPU and RAM usage (CPU: Ryzen 5 3500X. 2 cores at 100% for the GPU threads, other 4 cores are idle. RAM: 3.3GB of 7.7GB Used [During folding], no fluctuation).
ipkh wrote:If your concern is the temp/utilization spikes, you can use the Nvidia X-Server Settings applet to prefer maximum performance and it won't enter idle clocks.
Good thinking. I'm currently running a soft power limit with persistence mode enabled, and a moderate overclock. But the problem persists regardless of which PowerMizer state or manual clock I put it in.
Neil-B wrote:Probably isn't worth finessing any possible solutions until this part of the equation has been resolved?
Right :) I'm buying an SSD as we speak and I'll report back later this week when I...

when I...

..reinstall linux again. And battle with the nvidia xorg configuration files to enable coolbits correctly.

Ohno.
MeeLee
Posts: 1339
Joined: Tue Feb 19, 2019 10:16 pm

Post by MeeLee »

PRP_R148H wrote:
..reinstall linux again. And battle with the nvidia xorg configuration files to enable coolbits correctly.

Ohno.

sudo nvidia-smi -i 0 -lgc 1835,1935

This locks the GPU clock between 1835 MHz (minimum) and 1935 MHz (maximum).
It forces the GPU clock to stay high, so the temperature drops less during checkpoints.
Setting the maximum too high won't damage the GPU, as it will only go as high as the driver allows.
The same applies to setting the minimum too high.
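
For anyone scripting this across several cards, here is a small sketch that builds the lock command and the matching reset command (`-rgc`, which returns the clocks to their defaults). The helper names are made up for illustration, and the script only prints the commands rather than running them.

```python
def lock_clocks_command(gpu_index, min_mhz, max_mhz):
    """Build the argv list that locks GPU clocks to [min_mhz, max_mhz] via nvidia-smi."""
    if min_mhz <= 0 or max_mhz < min_mhz:
        raise ValueError("expected 0 < min_mhz <= max_mhz")
    return ["sudo", "nvidia-smi", "-i", str(gpu_index), "-lgc", f"{min_mhz},{max_mhz}"]


def reset_clocks_command(gpu_index):
    """Build the argv list that undoes a clock lock on one GPU."""
    return ["sudo", "nvidia-smi", "-i", str(gpu_index), "-rgc"]


if __name__ == "__main__":
    # Printed only; pass the list to subprocess.run() to actually apply it.
    print(" ".join(lock_clocks_command(0, 1835, 1935)))
    print(" ".join(reset_clocks_command(0)))
```

The clock values are the ones from the post above; suitable numbers differ from card to card.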
bruce
Post by bruce »

Data transfer speeds are an important factor in determining the WIDTH of the dip, but so is the speed of your CPU. Offloading some processing to the CPU is generally a good practice.

What project was running, and what was the checkpoint interval?

In this example project, that's every 2% (every 4 minutes) though it will depend on the GPU speed.

19:37:08:WU00:FS01:0x22: Checkpoint write interval: 25000 steps (2%) [50 total]
19:37:08:WU00:FS01:0x22: JSON viewer frame write interval: 12500 steps (1%) [100 total]
19:37:08:WU00:FS01:0x22: XTC frame write interval: 10000 steps (0.8%) [125 total]
19:37:08:WU00:FS01:0x22: Global context and integrator variables write interval: disabled

20:44:56:WU00:FS01:0x22:Checkpoint completed at step 25000
21:16:21:WU00:FS01:0x22:Completed 37500 out of 1250000 steps (3%)
21:47:46:WU00:FS01:0x22:Completed 50000 out of 1250000 steps (4%)
21:48:45:WU00:FS01:0x22:Checkpoint completed at step 50000
22:20:09:WU00:FS01:0x22:Completed 62500 out of 1250000 steps (5%)
22:51:36:WU00:FS01:0x22:Completed 75000 out of 1250000 steps (6%)
22:52:34:WU00:FS01:0x22:Checkpoint completed at step 75000
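
To read the actual checkpoint interval off a log like this, the time between consecutive "Checkpoint completed" lines can be computed. The following is a minimal sketch that assumes the line format shown in the excerpt above; `checkpoint_gaps` is a made-up helper, and because the log carries no date, a negative gap is assumed to be a wrap past midnight.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "20:44:56:WU00:FS01:0x22:Checkpoint completed at step 25000"
CHECKPOINT_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}):.*Checkpoint completed at step (\d+)")


def checkpoint_gaps(log_lines):
    """Return the seconds between consecutive 'Checkpoint completed' lines."""
    stamps = []
    for line in log_lines:
        match = CHECKPOINT_RE.match(line.strip())
        if match:
            stamps.append(datetime.strptime(match.group(1), "%H:%M:%S"))
    gaps = []
    for earlier, later in zip(stamps, stamps[1:]):
        delta = later - earlier
        if delta < timedelta(0):
            delta += timedelta(days=1)  # assumption: the log rolled past midnight
        gaps.append(int(delta.total_seconds()))
    return gaps


LOG = """\
20:44:56:WU00:FS01:0x22:Checkpoint completed at step 25000
21:16:21:WU00:FS01:0x22:Completed 37500 out of 1250000 steps (3%)
21:48:45:WU00:FS01:0x22:Checkpoint completed at step 50000
22:52:34:WU00:FS01:0x22:Checkpoint completed at step 75000
""".splitlines()

print(checkpoint_gaps(LOG))  # [3829, 3829]
```

On the excerpt above both gaps come out at 3829 seconds, roughly 64 minutes per 25000-step checkpoint on that particular GPU. A faster card would show proportionally shorter gaps for the same project.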