Debian; GPU sometimes vanishes
Posted: Mon Aug 03, 2020 5:17 pm
Sunday was interesting: over the course of the day I found 3 VMs getting assigned to 192.0.2.1, always after coming back from what otherwise looked like a normal, stable preemption. The start of their logs was very telling...
Code:
*********************** Log Started 2020-08-02T17:17:36Z ***********************
17:17:36:Trying to access database...
17:17:36:Successfully acquired database lock
17:17:36:Read GPUs.txt
17:17:38:Enabled folding slot 01: READY gpu:0:TU104GL [Tesla T4] 8141
17:17:38:ERROR:No compute devices matched GPU #0 {
17:17:38:ERROR: "vendor": 4318,
17:17:38:ERROR: "device": 7864,
17:17:38:ERROR: "type": 2,
17:17:38:ERROR: "species": 6,
17:17:38:ERROR: "description": "TU104GL [Tesla T4] 8141"
17:17:38:ERROR:}. You may need to update your graphics drivers.
17:17:38:****************************** FAHClient ******************************
Fortunately the fix was very simple, just a re-do of the driver install commands:
Code:
sudo apt install linux-headers-$(uname -r)
sudo ./NVIDIA-<driver details blah>.run
I didn't even have to stop/start the FAH client; I just paused it if it had a CPU WU.
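The two commands above rebuild the driver against whatever kernel is currently running. A minimal sketch of how that check might be automated after a preemption follows. It assumes the cause is a kernel upgrade that left no NVIDIA module for the new kernel, which the logs suggest but do not prove. The function names and the `NVIDIA_INSTALLER` variable are made up for illustration and are not part of FAH or the NVIDIA installer.

```shell
#!/bin/sh
# Sketch only: function names and NVIDIA_INSTALLER are hypothetical.

# Succeeds if an nvidia kernel module file exists for kernel release $1.
# $2 is the modules root (default /lib/modules); it is a parameter only so
# the check can be tried against a scratch directory.
nvidia_module_present() {
    kver=$1
    root=${2:-/lib/modules}
    [ -d "$root/$kver" ] || return 1
    # grep -q . succeeds only if find printed at least one path
    find "$root/$kver" -name 'nvidia.ko*' 2>/dev/null | grep -q .
}

# Re-runs the two commands from the post, but only when no module is found
# for the running kernel. NVIDIA_INSTALLER must point at the same .run file
# that was used for the original install.
reinstall_driver_if_needed() {
    running=$(uname -r)
    if nvidia_module_present "$running"; then
        echo "nvidia module present for $running, nothing to do"
        return 0
    fi
    echo "no nvidia module for $running, reinstalling"
    sudo apt install "linux-headers-$running" &&
        sudo "${NVIDIA_INSTALLER:?set NVIDIA_INSTALLER to the .run file}"
}
```

Calling `reinstall_driver_if_needed` by hand after a VM comes back would be the simplest use. Running it unattended from a startup script would probably need non-interactive options for both apt and the installer, which this sketch does not attempt.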
As far as I could tell, the only difference between before and after the error was the OS line in the ****System**** section, which now reads Linux 4.19.0-10-cloud-amd64 x86_64 where it used to be 4.19.0-9-cloud-amd64.
I wonder if this is related to the GnuTLS discovery? But if I create a new VM, it's still using -9.
Oh well, at least they're all working now.
Here are both logs from the first VM if you're curious.
Before:
Code:
***************** Log Started 2020-08-01T16:47:40Z ***************
16:47:40:Trying to access database...
16:47:40:Successfully acquired database lock
16:47:40:Read GPUs.txt
16:47:43:Enabled folding slot 01: READY gpu:0:TU104GL [Tesla T4] 8141
16:47:43:************************* FAHClient ***********************
16:47:43: Version: 7.6.13
16:47:43: Author: Joseph Coffland <[email protected]>
16:47:43: Copyright: 2020 foldingathome.org
16:47:43: Homepage: https://foldingathome.org/
16:47:43: Date: Apr 28 2020
16:47:43: Time: 04:20:16
16:47:43: Revision: 5a652817f46116b6e135503af97f18e094414e3b
16:47:43: Branch: master
16:47:43: Compiler: GNU 8.3.0
16:47:43: Options: -std=c++11 -ffunction-sections -fdata-sections -O3
16:47:43: -funroll-loops -fno-pie
16:47:43: Platform: linux2 4.19.0-5-amd64
16:47:43: Bits: 64
16:47:43: Mode: Release
16:47:43: Args: --child /etc/fahclient/config.xml --run-as fahclient
16:47:43: --pid-file=/var/run/fahclient.pid --daemon
16:47:43: Config: /etc/fahclient/config.xml
16:47:43:************************** CBang **************************
16:47:43: Date: Apr 25 2020
16:47:43: Time: 00:07:53
16:47:43: Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
16:47:43: Branch: master
16:47:43: Compiler: GNU 8.3.0
16:47:43: Options: -std=c++11 -ffunction-sections -fdata-sections -O3
16:47:43: -funroll-loops -fno-pie -fPIC
16:47:43: Platform: linux2 4.19.0-5-amd64
16:47:43: Bits: 64
16:47:43: Mode: Release
16:47:43:************************* System **************************
16:47:43: CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
16:47:43: CPU ID: GenuineIntel Family 6 Model 79 Stepping 0
16:47:43: CPUs: 1
16:47:43: Memory: 2.44GiB
16:47:43: Free Memory: 2.18GiB
16:47:43: Threads: POSIX_THREADS
16:47:43: OS Version: 4.19
16:47:43: Has Battery: false
16:47:43: On Battery: false
16:47:43: UTC Offset: 0
16:47:43: PID: 450
16:47:43: CWD: /var/lib/fahclient
16:47:43: OS: Linux 4.19.0-9-cloud-amd64 x86_64
16:47:43: OS Arch: AMD64
16:47:43: GPUs: 1
16:47:43: GPU 0: Bus:0 Slot:4 Func:0 NVIDIA:6 TU104GL [Tesla T4] 8141
16:47:43: CUDA Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:7.5 Driver:10.0
16:47:43:OpenCL Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:1.2 Driver:410.104
16:47:43:************************** libFAH **************************
16:47:43: Date: Apr 15 2020
16:47:43: Time: 21:43:24
16:47:43: Revision: 216968bc7025029c841ed6e36e81a03a316890d3
16:47:43: Branch: master
16:47:43: Compiler: GNU 8.3.0
16:47:43: Options: -std=c++11 -ffunction-sections -fdata-sections -O3
16:47:43: -funroll-loops -fno-pie
16:47:43: Platform: linux2 4.19.0-5-amd64
16:47:43: Bits: 64
16:47:43: Mode: Release
After:
Code:
*************** Log Started 2020-08-02T17:17:36Z ***********************
17:17:36:Trying to access database...
17:17:36:Successfully acquired database lock
17:17:36:Read GPUs.txt
17:17:38:Enabled folding slot 01: READY gpu:0:TU104GL [Tesla T4] 8141
17:17:38:ERROR:No compute devices matched GPU #0 {
17:17:38:ERROR: "vendor": 4318,
17:17:38:ERROR: "device": 7864,
17:17:38:ERROR: "type": 2,
17:17:38:ERROR: "species": 6,
17:17:38:ERROR: "description": "TU104GL [Tesla T4] 8141"
17:17:38:ERROR:}. You may need to update your graphics drivers.
17:17:38:****************************** FAHClient ******************************
17:17:38: Version: 7.6.13
17:17:38: Author: Joseph Coffland <[email protected]>
17:17:38: Copyright: 2020 foldingathome.org
17:17:38: Homepage: https://foldingathome.org/
17:17:38: Date: Apr 28 2020
17:17:38: Time: 04:20:16
17:17:38: Revision: 5a652817f46116b6e135503af97f18e094414e3b
17:17:38: Branch: master
17:17:38: Compiler: GNU 8.3.0
17:17:38: Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
17:17:38: -fno-pie
17:17:38: Platform: linux2 4.19.0-5-amd64
17:17:38: Bits: 64
17:17:38: Mode: Release
17:17:38: Args: --child /etc/fahclient/config.xml --run-as fahclient
17:17:38: --pid-file=/var/run/fahclient.pid --daemon
17:17:38: Config: /etc/fahclient/config.xml
17:17:38:******************************** CBang ********************************
17:17:38: Date: Apr 25 2020
17:17:38: Time: 00:07:53
17:17:38: Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
17:17:38: Branch: master
17:17:38: Compiler: GNU 8.3.0
17:17:38: Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
17:17:38: -fno-pie -fPIC
17:17:38: Platform: linux2 4.19.0-5-amd64
17:17:38: Bits: 64
17:17:38: Mode: Release
17:17:38:******************************* System ********************************
17:17:38: CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
17:17:38: CPU ID: GenuineIntel Family 6 Model 79 Stepping 0
17:17:38: CPUs: 1
17:17:38: Memory: 2.44GiB
17:17:38:Free Memory: 2.24GiB
17:17:38: Threads: POSIX_THREADS
17:17:38: OS Version: 4.19
17:17:38:Has Battery: false
17:17:38: On Battery: false
17:17:38: UTC Offset: 0
17:17:38: PID: 413
17:17:38: CWD: /var/lib/fahclient
17:17:38: OS: Linux 4.19.0-10-cloud-amd64 x86_64
17:17:38: OS Arch: AMD64
17:17:38: GPUs: 1
17:17:38: GPU 0: Bus:0 Slot:4 Func:0 NVIDIA:6 TU104GL [Tesla T4] 8141
17:17:38: CUDA: Not detected: cuInit() returned 100
17:17:38: OpenCL: Not detected: clGetPlatformIDs() returned -1001
17:17:38:******************************* libFAH ********************************
17:17:38: Date: Apr 15 2020
17:17:38: Time: 21:43:24
17:17:38: Revision: 216968bc7025029c841ed6e36e81a03a316890d3
17:17:38: Branch: master
17:17:38: Compiler: GNU 8.3.0
17:17:38: Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
17:17:38: -fno-pie
17:17:38: Platform: linux2 4.19.0-5-amd64
17:17:38: Bits: 64
17:17:38: Mode: Release
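Spotting the kernel change by eye means diffing two long logs, so here is a rough sketch of pulling out just the relevant bits. It assumes the log format shown above (timestamp, then `OS: Linux <release> ...`). The helper names are hypothetical, and so are the file names in the usage note.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of FAHClient.

# Print the kernel release from the "OS:" line of a FAHClient log on stdin.
# "OS Version:" and "OS Arch:" lines do not match the pattern.
# If the log contains several startups, the last one wins.
fah_kernel() {
    sed -n 's/^[0-9:]* *OS: *Linux \([^ ]*\).*/\1/p' | tail -n 1
}

# Succeeds if the log on stdin contains the GPU matching error.
fah_gpu_lost() {
    grep -q 'ERROR:No compute devices matched GPU'
}

# Compare an older log ($1) with a newer one ($2) and print a short summary.
compare_fah_logs() {
    old=$(fah_kernel < "$1")
    new=$(fah_kernel < "$2")
    if [ "$old" != "$new" ]; then
        echo "kernel changed: $old -> $new"
    else
        echo "kernel unchanged: $new"
    fi
    if fah_gpu_lost < "$2"; then
        echo "GPU not matched in $2"
    fi
}
```

With the two logs above saved as `before.txt` and `after.txt`, `compare_fah_logs before.txt after.txt` should report the move from 4.19.0-9-cloud-amd64 to 4.19.0-10-cloud-amd64 and flag the GPU error in the second file.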