Unable to run NVIDIA GPU with driver 535 [Solved]


AthanSpod
Posts: 11
Joined: Wed Mar 25, 2020 8:27 am

Post by AthanSpod »

Hmmm, except after shutting down and booting back up:
  1. The `nvidia_uvm` module is and was loaded,
  2. But fah-client again didn't think CUDA was supported. The usual `CUDA not supported: cuInit() returned 999` was logged.
  3. `systemctl restart fah-client.service` gets it working OK again.
Logging for module insertion and fah-client startup:

Code:

2024-12-06T08:26:46.967616+00:00 emilia systemd-modules-load[552]: Inserted module 'nvidia_uvm'
2024-12-06T08:26:58.410268+00:00 emilia systemd[1]: Started fah-client.service - Folding@home Client.

The module was inserted roughly 11 seconds before fah-client started, so you'd think the timing should have worked.

HackinDoge
Posts: 1
Joined: Wed Dec 18, 2024 12:03 am

Post by HackinDoge »

Not sure if valuable/applicable, but my workaround has been running /usr/bin/nvidia-smi right before starting up FAH. No insight as to how/why that works, but it does...

Code:

$ cat /sys/module/nvidia/version 
550.135

Code:

$ nvidia-smi
Tue Dec 17 16:08:57 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.135                Driver Version: 550.135        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1070 Ti     Off |   00000000:01:00.0 Off |                  N/A |
| 38%   69C    P2            156W /  180W |    1035MiB /   8192MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     30083      C   ...2009-64bit-release-8.1.4/FahCore_24       1032MiB |
+-----------------------------------------------------------------------------------------+

Code:

$ podman image ls
REPOSITORY                          TAG         IMAGE ID      CREATED        SIZE
lscr.io/linuxserver/foldingathome   latest      53f4ad7aec5a  21 hours ago   420 MB

Code:

$ podman container ls
CONTAINER ID  IMAGE                                     COMMAND     CREATED       STATUS       PORTS                             NAMES
9d734ec5fb0d  lscr.io/linuxserver/foldingathome:latest              17 hours ago  Up 17 hours  0.0.0.0:7396->7396/tcp            foldingathome
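
For anyone wanting to script the `nvidia-smi` workaround described in this post, here is a minimal sketch. The function name `warm_up_gpu` and the `NVIDIA_SMI` override variable are made up for illustration; only the `/usr/bin/nvidia-smi` path and the `fah-client.service` unit name come from this thread. In a container setup the command would presumably have to run on the host before the container starts, which is an assumption rather than something confirmed here.

```shell
#!/bin/sh
# Sketch of the "run nvidia-smi first" workaround.
# warm_up_gpu and NVIDIA_SMI are hypothetical names, not part of FAH or the driver.

warm_up_gpu() {
    # Allow the binary path to be overridden; default to the path from the post.
    smi="${NVIDIA_SMI:-/usr/bin/nvidia-smi}"
    if "$smi" >/dev/null 2>&1; then
        echo "nvidia-smi OK"
    else
        echo "nvidia-smi failed or is missing" >&2
        return 1
    fi
}

# Example (not run here): only start the client if the warm-up succeeded.
# warm_up_gpu && systemctl start fah-client.service
```

Why this helps is not established in the thread, so treat it as a workaround rather than a fix.
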
Marcos FRM
Posts: 28
Joined: Fri Feb 23, 2024 6:26 pm

Post by Marcos FRM »

Up until version 8.4.9, the fah-client service has NoNewPrivileges=yes set, meaning no process running as a normal user inside the service can escalate privileges, for example by running SUID root binaries. It's a crucial security measure, as nothing in fah-client needs root privileges.

Unfortunately, the Nvidia driver is buggy and, under certain circumstances, relies on the nvidia-modprobe binary (see https://manpages.ubuntu.com/manpages/or ... obe.1.html) to create device nodes and make other adjustments. That binary is SUID root, and it gets invoked (I'm not entirely sure how) by the process requesting CUDA: in other words, by a process inside the fah-client service, running as the fah-client user and restricted by NoNewPrivileges=yes. So it can't gain root and ends up failing.

For the next version, we're disabling this feature (reluctantly) to avoid this issue, hoping Nvidia fixes their driver someday.
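
To see whether a given machine has the combination described above, something like the following sketch may help. The helper names `nnp_enabled` and `is_setuid` are hypothetical, and `/usr/bin/nvidia-modprobe` is an assumed install path that may differ between distributions.

```shell
#!/bin/sh
# Sketch: detect "NoNewPrivileges=yes" together with a SUID helper binary.
# nnp_enabled and is_setuid are hypothetical helper names.

nnp_enabled() {
    # $1 is a line such as "NoNewPrivileges=yes", as printed by
    # `systemctl show -p NoNewPrivileges <unit>`.
    [ "${1#NoNewPrivileges=}" = "yes" ]
}

is_setuid() {
    # True if the file at $1 exists and has its set-user-ID bit set.
    [ -u "$1" ]
}

# Example (not run here); the binary path is an assumption:
# if nnp_enabled "$(systemctl show -p NoNewPrivileges fah-client.service)" \
#    && is_setuid /usr/bin/nvidia-modprobe; then
#     echo "nvidia-modprobe cannot gain root when started from fah-client"
# fi
```
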