debian; gpu sometimes vanishes

Posted: Mon Aug 03, 2020 5:17 pm
by Knish
Sunday was interesting: over the course of the day I found 3 VMs getting assigned to 192.0.2.1, always after coming back from what otherwise looked like a normal, stable preemption. The start of their logs was very telling...

Code:

*********************** Log Started 2020-08-02T17:17:36Z ***********************
17:17:36:Trying to access database...
17:17:36:Successfully acquired database lock
17:17:36:Read GPUs.txt
17:17:38:Enabled folding slot 01: READY gpu:0:TU104GL [Tesla T4] 8141
17:17:38:ERROR:No compute devices matched GPU #0 {
17:17:38:ERROR:  "vendor": 4318,
17:17:38:ERROR:  "device": 7864,
17:17:38:ERROR:  "type": 2,
17:17:38:ERROR:  "species": 6,
17:17:38:ERROR:  "description": "TU104GL [Tesla T4] 8141"
17:17:38:ERROR:}.  You may need to update your graphics drivers.
17:17:38:****************************** FAHClient ******************************
Fortunately the fix was very simple: just re-run the driver install commands.

Code:

sudo apt install linux-headers-$(uname -r)

sudo ./NVIDIA-<driver details blah>.run
I didn't even have to stop/start the FAH client; I just paused it if it had a CPU WU.

As far as I could tell, the only difference between before and after the error was the OS line in ****System****, where Linux 4.19.0-10-cloud-amd64 x86_64 used to be 4.19.0-9.
I wonder if this is related to the GnuTLS discovery? But if I create a new VM, it's still using 4.19.0-9.
Oh well, at least they're all working now.
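
If the kernel bump is the trigger, then presumably the driver module built for 4.19.0-9 just isn't there for 4.19.0-10 until the installer is run again. Here's a rough sketch for spotting that by comparing the OS line of two FAH logs. The helper names and the file names in the usage comment are made up for illustration, and it assumes the OS line looks like the ones in the logs in this post.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel release from the last "OS:" line
# of a FAHClient log, e.g. "4.19.0-10-cloud-amd64".
fah_log_kernel() {
    sed -n 's/^.*:[[:space:]]*OS: Linux \([^[:space:]]*\).*$/\1/p' "$1" | tail -n 1
}

# Hypothetical helper: exit status 0 if the two logs report different kernels.
fah_kernel_changed() {
    [ "$(fah_log_kernel "$1")" != "$(fah_log_kernel "$2")" ]
}

# Usage (file names are placeholders, not real FAH paths):
# if fah_kernel_changed before.log after.log; then
#     echo "kernel changed, re-run the headers + driver install"
# fi
```

It only tells you the kernel moved between two logs; it doesn't check the driver itself, so treat a "changed" result as a hint to re-run the two commands above.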



Here are both logs from the first VM if you're curious
Before:

Code:

***************** Log Started 2020-08-01T16:47:40Z ***************
16:47:40:Trying to access database...
16:47:40:Successfully acquired database lock
16:47:40:Read GPUs.txt
16:47:43:Enabled folding slot 01: READY gpu:0:TU104GL [Tesla T4] 8141
16:47:43:************************* FAHClient ***********************
16:47:43:        Version: 7.6.13
16:47:43:         Author: Joseph Coffland <[email protected]>
16:47:43:      Copyright: 2020 foldingathome.org
16:47:43:       Homepage: https://foldingathome.org/
16:47:43:           Date: Apr 28 2020
16:47:43:           Time: 04:20:16
16:47:43:       Revision: 5a652817f46116b6e135503af97f18e094414e3b
16:47:43:         Branch: master
16:47:43:       Compiler: GNU 8.3.0
16:47:43:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
16:47:43:                 -funroll-loops -fno-pie
16:47:43:       Platform: linux2 4.19.0-5-amd64
16:47:43:           Bits: 64
16:47:43:           Mode: Release
16:47:43:           Args: --child /etc/fahclient/config.xml --run-as fahclient
16:47:43:                 --pid-file=/var/run/fahclient.pid --daemon
16:47:43:         Config: /etc/fahclient/config.xml
16:47:43:************************** CBang **************************
16:47:43:           Date: Apr 25 2020
16:47:43:           Time: 00:07:53
16:47:43:       Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
16:47:43:         Branch: master
16:47:43:       Compiler: GNU 8.3.0
16:47:43:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
16:47:43:                 -funroll-loops -fno-pie -fPIC
16:47:43:       Platform: linux2 4.19.0-5-amd64
16:47:43:           Bits: 64
16:47:43:           Mode: Release
16:47:43:************************* System **************************
16:47:43:            CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
16:47:43:         CPU ID: GenuineIntel Family 6 Model 79 Stepping 0
16:47:43:           CPUs: 1
16:47:43:         Memory: 2.44GiB
16:47:43:    Free Memory: 2.18GiB
16:47:43:        Threads: POSIX_THREADS
16:47:43:     OS Version: 4.19
16:47:43:    Has Battery: false
16:47:43:     On Battery: false
16:47:43:     UTC Offset: 0
16:47:43:            PID: 450
16:47:43:            CWD: /var/lib/fahclient
16:47:43:             OS: Linux 4.19.0-9-cloud-amd64 x86_64
16:47:43:        OS Arch: AMD64
16:47:43:           GPUs: 1
16:47:43:          GPU 0: Bus:0 Slot:4 Func:0 NVIDIA:6 TU104GL [Tesla T4] 8141
16:47:43:  CUDA Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:7.5 Driver:10.0
16:47:43:OpenCL Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:1.2 Driver:410.104
16:47:43:************************** libFAH **************************
16:47:43:           Date: Apr 15 2020
16:47:43:           Time: 21:43:24
16:47:43:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
16:47:43:         Branch: master
16:47:43:       Compiler: GNU 8.3.0
16:47:43:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
16:47:43:                 -funroll-loops -fno-pie
16:47:43:       Platform: linux2 4.19.0-5-amd64
16:47:43:           Bits: 64
16:47:43:           Mode: Release
After:

Code:

*************** Log Started 2020-08-02T17:17:36Z ***********************
17:17:36:Trying to access database...
17:17:36:Successfully acquired database lock
17:17:36:Read GPUs.txt
17:17:38:Enabled folding slot 01: READY gpu:0:TU104GL [Tesla T4] 8141
17:17:38:ERROR:No compute devices matched GPU #0 {
17:17:38:ERROR:  "vendor": 4318,
17:17:38:ERROR:  "device": 7864,
17:17:38:ERROR:  "type": 2,
17:17:38:ERROR:  "species": 6,
17:17:38:ERROR:  "description": "TU104GL [Tesla T4] 8141"
17:17:38:ERROR:}.  You may need to update your graphics drivers.
17:17:38:****************************** FAHClient ******************************
17:17:38:    Version: 7.6.13
17:17:38:     Author: Joseph Coffland <[email protected]>
17:17:38:  Copyright: 2020 foldingathome.org
17:17:38:   Homepage: https://foldingathome.org/
17:17:38:       Date: Apr 28 2020
17:17:38:       Time: 04:20:16
17:17:38:   Revision: 5a652817f46116b6e135503af97f18e094414e3b
17:17:38:     Branch: master
17:17:38:   Compiler: GNU 8.3.0
17:17:38:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
17:17:38:             -fno-pie
17:17:38:   Platform: linux2 4.19.0-5-amd64
17:17:38:       Bits: 64
17:17:38:       Mode: Release
17:17:38:       Args: --child /etc/fahclient/config.xml --run-as fahclient
17:17:38:             --pid-file=/var/run/fahclient.pid --daemon
17:17:38:     Config: /etc/fahclient/config.xml
17:17:38:******************************** CBang ********************************
17:17:38:       Date: Apr 25 2020
17:17:38:       Time: 00:07:53
17:17:38:   Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
17:17:38:     Branch: master
17:17:38:   Compiler: GNU 8.3.0
17:17:38:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
17:17:38:             -fno-pie -fPIC
17:17:38:   Platform: linux2 4.19.0-5-amd64
17:17:38:       Bits: 64
17:17:38:       Mode: Release
17:17:38:******************************* System ********************************
17:17:38:        CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
17:17:38:     CPU ID: GenuineIntel Family 6 Model 79 Stepping 0
17:17:38:       CPUs: 1
17:17:38:     Memory: 2.44GiB
17:17:38:Free Memory: 2.24GiB
17:17:38:    Threads: POSIX_THREADS
17:17:38: OS Version: 4.19
17:17:38:Has Battery: false
17:17:38: On Battery: false
17:17:38: UTC Offset: 0
17:17:38:        PID: 413
17:17:38:        CWD: /var/lib/fahclient
17:17:38:         OS: Linux 4.19.0-10-cloud-amd64 x86_64
17:17:38:    OS Arch: AMD64
17:17:38:       GPUs: 1
17:17:38:      GPU 0: Bus:0 Slot:4 Func:0 NVIDIA:6 TU104GL [Tesla T4] 8141
17:17:38:       CUDA: Not detected: cuInit() returned 100
17:17:38:     OpenCL: Not detected: clGetPlatformIDs() returned -1001
17:17:38:******************************* libFAH ********************************
17:17:38:       Date: Apr 15 2020
17:17:38:       Time: 21:43:24
17:17:38:   Revision: 216968bc7025029c841ed6e36e81a03a316890d3
17:17:38:     Branch: master
17:17:38:   Compiler: GNU 8.3.0
17:17:38:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
17:17:38:             -fno-pie
17:17:38:   Platform: linux2 4.19.0-5-amd64
17:17:38:       Bits: 64
17:17:38:       Mode: Release

Re: debian; gpu sometimes vanishes

Posted: Mon Aug 03, 2020 7:35 pm
by toTOW
There are other solutions in this long thread about this issue: viewtopic.php?f=18&t=35906