Nvidia vs AMD CPU usage

arisu · Post by **arisu** » Tue Mar 04, 2025 4:09 am

I've heard that the CPU thread that feeds an Nvidia GPU is always 100% loaded no matter how much work the GPU is doing, but the CPU thread to feed an AMD GPU is proportional to the amount of processing being done on the GPU. Is this true? And why is this, on a technical level?

Unless I'm wrong, the purpose of the CPU thread for any GPU project is:
- Transferring data to and from the GPU and doing bookkeeping
- Performing occasional sanity checks (initially and during each checkpoint I think)
- Reconciling computed forces between independently-processed slices/ranks (or is that only a thing for GROMACS?)

But the work required to do any of that is proportional to the amount of work the GPU is doing. So why does the CPU thread that manages folding on an Nvidia GPU always use 100% but not on an AMD GPU? I guess it's got something to do with OpenCL vs CUDA?

Not complaining about high usage or anything, just curious.

Post by **Joe_H** » Tue Mar 04, 2025 4:40 am

From what I understand, the difference is in how Nvidia and AMD wrote their drivers. Nvidia's driver is doing a spin-wait looking for instructions to be processed and sent to the GPU. AMD from the explanations I have seen implemented this as an interrupt instead. As soon as something is handed off to the driver to process, it wakes up and takes CPU cycles to handle the request and then goes inactive until the next request. So the Nvidia driver process is always active, but the actual amount of work done by the CPU may be a fraction of the cycles available.

Your understanding of the CPU thread usage is similar to mine. Though I am not certain exactly how OpenMM handles the reconciliation between blocks of data from the WU sent to the GPU for processing. There are options as well to pass on other functions to the CPU in OpenMM that the GPU folding cores for F@h currently do not use. For example they could pass 64-bit calculations to the CPU and use GPUs without 64-bit support. But that would mostly only be needed for older or less powerful GPUs, and from testing in the past would also slow down processing on the rest of the GPUs.

arisu · Post by **arisu** » Tue Mar 04, 2025 5:58 am

It's shocking that the Nvidia drivers would be polling-driven instead of interrupt-driven. It does this for both Windows and Linux drivers?

When HIP is rolled out to FAH (pull request #328 in fah-client-bastet gives me hope), I pray it will increase AMD performance so that it is comparable to Nvidia and so the polling-driven wait loops can be avoided. CUDA source code can be transpiled into HIP source code, so in theory every CUDA project can immediately switch to HIP on AMD platforms.

Is there a high-level overview somewhere about how OpenMM handles slices/ranks and reconciling forces between them?

Post by **Joe_H** » Tue Mar 04, 2025 6:34 am

I don't know the details of the Nvidia driver implementation across OSs, but from the reports I have seen posted here under both there appears to be a CPU thread continuously active while folding whether it is on Windows or Linux. Some experimented a few years ago in multi GPU systems and it appeared a single CPU core could be enough to handle the driver overhead for two or more. But that was with somewhat less powerful GPUs.

Originally OpenCL was used on both Nvidia and AMD GPUs. But Nvidia always gave less than the best support for OpenCL and eventually the GPU core developers took on the extra programming overhead to support both CUDA and OpenCL. There are some limitations, to support the latest GPUs and older ones could require more core versions. At the moment the least common denominator CUDA library will support use of Maxwell to the newest cards. Kepler cards end up falling back to OpenCL. They are working on adding HIP support for AMD, but there have been some issues. Eventually it may be ready for release, but no idea when exactly.

I don't know if there is an overview available, but the site for OpenMM may have something - https://openmm.org.

arisu · Post by **arisu** » Thu Mar 06, 2025 7:05 am

Would be much utility to have a program that monitors GPU usage and uses cgroups to limit the CPU usage of the corresponding thread (actually limiting usage, not priority)? Slowly lowering the max CPU use of the thread until the GPU usage starts to decrease / increasing it until GPU usage stops rising. Cgroups CPU limiting works by refusing to return a timeslice to the thread if the thread has exceeded its CPU usage limit over a short period (like a millisecond), even if it means scheduling SCHED_IDLE or the idle kthread instead.

In theory, that would reduce the waste from polling.

muziqaz · Post by **muziqaz** » Thu Mar 06, 2025 7:20 am

Nvidia is so sensitive to CPU loads, that it even slows down if CPU is folding on free cores. It gains quite a lot of points per day by just not folding on the CPU at all

arisu · Post by **arisu** » Thu Mar 06, 2025 7:26 am

Wow! Maybe the scheduler jostling the thread around hurts performance. I bet a solution would be to bind the Nvidia's CPU thread to a specific core with taskset, and use cpusets to block that CPU thread (and its SMT sibling thread) from the scheduler.

muziqaz · Post by **muziqaz** » Thu Mar 06, 2025 7:35 am

arisu wrote: ↑Thu Mar 06, 2025 7:26 am Wow! Maybe the scheduler jostling the thread around hurts performance. I bet a solution would be to bind the Nvidia's CPU thread to a specific core with taskset, and use cpusets to block that CPU thread (and its SMT sibling thread) from the scheduler.

I think it was tried by Nvidians. No luck

arisu · Post by **arisu** » Sat Mar 08, 2025 7:19 am

After looking into CUDA programming a little, it seems there is a way to switch it from polling in a loop to interrupt-driven for synchronization. Because it's so easy to make it switch to interrupt-driven there is likely a good reason it hasn't been done. I'm guessing it's because the latency would increase and maybe the Nvidia driver has unacceptable latency when using the interrupt-driven approach.

muziqaz · Post by **muziqaz** » Sat Mar 08, 2025 8:18 am

The person who integrated CUDA into FAHcore was working for Nvidia at that time, so I'm pretty sure they knew what they were doing

Post by **toTOW** » Sun Mar 09, 2025 6:21 pm

The fun fact is that OpenCL uses passive polling and doesn't use CPU much while CUDA uses active polling and a full CPU thread ... and this all on nVidia (Windows).

arisu · Post by **arisu** » Mon Mar 10, 2025 1:08 am

Does CUDA transpiled to HIP use busy polling? It will be annoying if AMD gets a speedup at the expense of a CPU core at least on lower end devices where the CPU makes up a good fraction of the PPD.

muziqaz · Post by **muziqaz** » Mon Mar 10, 2025 8:22 am

No, there were no signs of that in initial hip testing.

arisu · Post by **arisu** » Tue Mar 11, 2025 12:35 am

When HIP is rolled out, will Nvidia systems use it instead of CUDA since hipify (supposedly) produces equally-performant kernels?

muziqaz · Post by **muziqaz** » Tue Mar 11, 2025 5:55 am

arisu wrote: ↑Tue Mar 11, 2025 12:35 am When HIP is rolled out, will Nvidia systems use it instead of CUDA since hipify (supposedly) produces equally-performant kernels?

No

Folding Forum

Nvidia vs AMD CPU usage

Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage

Re: Nvidia vs AMD CPU usage