Lost ability to run two GPUs
Posted: Tue Feb 23, 2021 2:43 am
I have a two-GPU configuration using a GeForce GTX 1080 Ti that was added (back around November) to the GTX 1660 Ti that was in the box. I got them both working and was getting really nice crunching on FaH. (Mobo/cpu is nothing really special, so with these two GPUs I don't even bother with CPU crunching.) FaH version 7.6.13. Nothing fancy in the config.xml file -- just a second "<slot id='2' type='GPU'/>".
About two weeks ago, the dreaded "something happened" happened. The 1080's DVI port that the monitor was plugged into stopped displaying anything. Stripping the system down to debug this led me to conclude that yes, something happened to that one port and the 1080's DisplayPort outputs are fine. Great: I moved the monitor to a DisplayPort output, booted up with just the 1080 Ti, and FaH happily crunched using CUDA. (Although attempting to get a task for the other "device" -- both were still in config.xml -- squawked about not having a default for that device. Fine, that was expected.)
I then put the 1660 Ti back in, restoring the configuration that both GPUs had happily used for crunching since November. The system booted up and lspci showed both GPUs, but FaH was unhappy. (Also, "psensor" only showed the sensors for the 1080 Ti.) The log in this configuration starts with:
No compute devices matched GPU #0 {
"vendor": 4318,
"device": 6918,
"type": 2,
"species": 8,
"description": "GP102 [GeForce GTX 1080 Ti] 11380"
}. You may need to update your graphics drivers.
No compute devices matched GPU #1 {
"vendor": 4318,
"device": 8578,
"type": 2,
"species": 7,
"description": "TU116 [GeForce GTX 1660 Ti]"
}. You may need to update your graphics drivers.
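For anyone cross-checking these blocks against lspci: the "vendor" and "device" values are printed in decimal, while `lspci -nn` shows PCI IDs in hex. A minimal sketch of the conversion follows; the function name `pci_ids` is made up for illustration, and the sample text is just the two ID pairs from the log above.

```python
import re

def pci_ids(log_text):
    """Return (vendor_hex, device_hex) pairs for each GPU block in the log text."""
    vendors = re.findall(r'"vendor":\s*(\d+)', log_text)
    devices = re.findall(r'"device":\s*(\d+)', log_text)
    # Format as 4-digit lowercase hex, the way lspci -nn prints IDs.
    return [(format(int(v), "04x"), format(int(d), "04x"))
            for v, d in zip(vendors, devices)]

# The two ID pairs copied from the log excerpt above.
log = '''
"vendor": 4318,
"device": 6918,
"vendor": 4318,
"device": 8578,
'''

for vendor, device in pci_ids(log):
    print(f"{vendor}:{device}")
# prints:
# 10de:1b06
# 10de:2182
```

Those hex pairs can then be compared with the bracketed IDs in `lspci -nn` output to confirm that FaH and the PCI bus are describing the same two cards.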
FaH was trying to allocate tasks but kept cycling on this in the log:
WU01:FS02:Assigned to work server 192.0.2.1
WU01:FS02:Requesting new work unit for slot 02: READY gpu:1:TU116 [GeForce GTX 1660 Ti] from 192.0.2.1
WU01:FS02:Connecting to 192.0.2.1:8080
WU00:FS01:Starting
ERROR:WU00:FS01:Failed to start core: OpenCL device matching slot 1 not found, make sure the OpenCL driver is installed or try setting 'opencl-index' manually
WARNING:WU01:FS02:WorkServer connection failed on port 8080 trying 80
lather, rinse, repeat. OpenCL?!? Both cards normally run CUDA. (Having "<slot id='2' type='GPU'/>" enabled or disabled in my config did what one would expect.)
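Since lspci sees both cards but psensor only sees one, one possible next check is whether the NVIDIA driver itself lists both GPUs. The sketch below assumes the proprietary driver and its `nvidia-smi` tool are installed; the helper names are hypothetical, and it only counts the "GPU n: ..." lines that `nvidia-smi -L` prints.

```python
import shutil
import subprocess

def count_listed_gpus(listing):
    """Count lines that start with 'GPU ' in `nvidia-smi -L` style output."""
    return sum(1 for line in listing.splitlines()
               if line.strip().startswith("GPU "))

def driver_gpu_listing():
    """Return the output of `nvidia-smi -L`, or None if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi", "-L"],
                            capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else None

if __name__ == "__main__":
    listing = driver_gpu_listing()
    if listing is None:
        print("nvidia-smi not found, or it reported an error")
    else:
        print(listing.strip())
        print(f"driver reports {count_listed_gpus(listing)} GPU(s)")
```

If the driver reports fewer GPUs than lspci does, that would suggest the problem sits below FaH (driver or hardware) rather than in config.xml.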
Ok, thinking that the 1660 Ti had a problem, I took the 1080 Ti out and ran with just the 1660 Ti (hmm, I realize now it was in the same pci-e slot that the 1080 Ti was in - hint?), system rebooted and FaH ran happily using CUDA on the 1660 Ti.
<pause>
Given the realization I had while typing the last paragraph (that in the earlier tests the 1660 Ti was running alone in the same slot that the 1080 Ti happily runs in alone), I just reconfigured the system to use only the 1660 Ti, but now in the second slot. It worked fine with FaH running CUDA on it. (Well, "fine" except for having to dump the current task that was about 1/3rd done. I hate doing that.)
I'm at a loss. I can usually figure out configuration issues, but this one now has me stumped and for most of the past two weeks, I've had a nice 1660 Ti sitting idle instead of FaH crunching. What do I try next? Rebuild the OS and FaH-supporting software from scratch?
Thanks!