
Lost ability to run two GPUs

Posted: Tue Feb 23, 2021 2:43 am
by craig110
I have a two-GPU configuration: a GeForce GTX 1080 Ti that was added (back around November) alongside the GTX 1660 Ti that was already in the box. I got them both working and was getting really nice crunching on FaH. (The mobo/CPU is nothing special, so with these two GPUs I don't even bother with CPU crunching.) FaH version is 7.6.13. Nothing fancy in the config.xml file -- just a second "<slot id='2' type='GPU'/>".
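
For anyone who wants the shape of it, the GPU portion boils down to something like this minimal sketch; which physical card FAHClient maps to which slot id is whatever the client decided on its own, so the comments here are my guesses:

Code: Select all

<config>
  <!-- sketch only: user/team/passkey options omitted, slot ids as assigned by FAHClient -->
  <slot id='1' type='GPU'/>  <!-- GTX 1080 Ti (guess) -->
  <slot id='2' type='GPU'/>  <!-- GTX 1660 Ti (guess) -->
</config>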

About two weeks ago, the dreaded "something happened" happened: the 1080's DVI port that the monitor was plugged into stopped displaying anything. Stripping the system down to debug this led me to conclude that yes, something happened to that one port, but the 1080's DisplayPorts are fine. Great: I moved the monitor to one of those, booted up with just the 1080 Ti, and FaH happily crunched using CUDA. (Although attempting to get a task for the other "device" -- both were still in config.xml -- squawked about not having a default for that device. Fine, that was expected.)

I then reconnected the 1660 Ti back into the configuration that both GPUs had happily crunched in since November, and while the system booted up and lspci showed both GPUs, FaH was unhappy. (Also, "psensor" only showed the sensors for the 1080 Ti.) The log in this configuration starts with:

No compute devices matched GPU #0 {
"vendor": 4318,
"device": 6918,
"type": 2,
"species": 8,
"description": "GP102 [GeForce GTX 1080 Ti] 11380"
}. You may need to update your graphics drivers.
No compute devices matched GPU #1 {
"vendor": 4318,
"device": 8578,
"type": 2,
"species": 7,
"description": "TU116 [GeForce GTX 1660 Ti]"
}. You may need to update your graphics drivers.

FaH was trying to allocate tasks but kept cycling on this in the log:

WU01:FS02:Assigned to work server 192.0.2.1
WU01:FS02:Requesting new work unit for slot 02: READY gpu:1:TU116 [GeForce GTX 1660 Ti] from 192.0.2.1
WU01:FS02:Connecting to 192.0.2.1:8080
WU00:FS01:Starting
ERROR:WU00:FS01:Failed to start core: OpenCL device matching slot 1 not found, make sure the OpenCL driver is installed or try setting 'opencl-index' manually
WARNING:WU01:FS02:WorkServer connection failed on port 8080 trying 80

Lather, rinse, repeat. OpenCL?!? Both cards normally run CUDA. (Having "<slot id='2' type='GPU'/>" enabled or disabled in my config did what one would expect.)
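
(If it comes to forcing things, I assume the 'opencl-index' the error mentions would go into the slot definition as a per-slot option, something like the sketch below, with the index matching however the OpenCL driver enumerates the cards; but I'd rather understand why no compute devices are matched at all.)

Code: Select all

<slot id='2' type='GPU'>
  <!-- sketch only: a guess at the syntax and value, not taken from a working config -->
  <opencl-index v='1'/>
</slot>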

Ok, thinking that the 1660 Ti had a problem, I took the 1080 Ti out and ran with just the 1660 Ti (hmm, I realize now it was in the same PCIe slot that the 1080 Ti had been in - hint?). The system rebooted and FaH ran happily using CUDA on the 1660 Ti.

<pause>

Given the realization I had while typing the last paragraph, that in those earlier tests the 1660 Ti was running alone in the same slot the 1080 Ti happily runs in alone, I just reconfigured the system to use only the 1660 Ti, but now in the second slot. It worked fine with FaH running CUDA on it. (Well, "fine" except for having to dump the current task that was about a third done. I hate doing that.)

I'm at a loss. I can usually figure out configuration issues, but this one has me stumped, and for most of the past two weeks I've had a nice 1660 Ti sitting idle instead of crunching for FaH. :( What do I try next? Rebuild the OS and the FaH-supporting software from scratch?

Thanks!

Re: Lost ability to run two GPUs

Posted: Tue Feb 23, 2021 6:08 am
by ajm
It looks like a driver problem. You have that line:

Code: Select all

You may need to update your graphics drivers.
The fact that OpenCL was not found also points to a driver issue. FAH needs to find OpenCL first, even if it will eventually use CUDA.

Also, generally, it is safer to uninstall FAH (incl. data) before a change in the hardware configuration, and reinstall after the change.
I would uninstall FAH (incl. data), then update or replace the GPU drivers, and then reinstall FAH.
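
If you are on Ubuntu (lspci and psensor suggest Linux), a rough sketch of that sequence would be something like the following; the package names are from memory, so check them against your install:

Code: Select all

# see what the driver stack currently exposes (both tools should list both cards)
nvidia-smi
clinfo | grep -i 'device name'

# remove the client (its data directory is typically /var/lib/fahclient on a .deb install)
sudo dpkg --purge fahclient

# refresh the NVIDIA driver and the OpenCL loader, then reboot
sudo ubuntu-drivers autoinstall
sudo apt install ocl-icd-libopencl1 clinfo
sudo reboot

# afterwards, reinstall the fahclient .deb from foldingathome.org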

If it doesn't work, please post your log: viewtopic.php?p=327412&f=24#p327412

Re: Lost ability to run two GPUs

Posted: Tue Feb 23, 2021 6:49 pm
by bruce
Adding GPUs or moving them to a different slot has generally required reinstalling drivers and restarting FAHClient. I have not tested that extensively and I can't tell you why reinstalling drivers so often seems to be required. I'm quite certain that installing a different device re-customizes the driver package to support any new features, and that process probably uninstalls support for devices that are no longer connected. I can't think of a good reason why moving a device from one slot to another would matter, but I can't guarantee whether it matters or not.

Does the driver package change when you move a monitor from a DVI port to an HDMI port or to a DisplayPort? I have no idea, but that's a possibility, too.

Bottom line: If you always reinstall drivers when you change hardware in any way and reboot, it'll probably work. Is that always required? Possibly not.

Re: Lost ability to run two GPUs

Posted: Tue Mar 02, 2021 7:24 pm
by craig110
Oh, this is totally embarrassing, but I found the problem and might as well 'fess up in case it helps anyone else to avoid this.

To recap the problem: my system has two video cards. It folded for several months in this config, yet one day the monitor (connected to one of the two cards) showed nothing. Simplifying the config to isolate the problem (moving the cards around, using just one at a time, etc.), I found that the one port the monitor was connected to doesn't work anymore, but the others do. Moving everything back to the original config (albeit using a different video port ;-) ), I could not get FAH to fold on both cards again. Back to debugging: each card folded just fine when it was the only card in use, but with two, at best the main one would fold, and often neither would, with various FAH complaints about not finding CUDA or OpenCL.

I finally had time last evening to take the drastic step of rebuilding Ubuntu from scratch and reloading all the drivers. Nope, still no good. And by now I'm getting really puzzled why psensor isn't showing any of the sensors on the external video card while lspci shows that the device is present. Note that word "external." To keep the GeForce 1080 Ti and 1660 Ti from cooking each other due to how close the mobo's 16x PCIe slots are, the 1660 Ti is mounted outside the case via a PCIe riser. (Strange, yes, but I've separately tied it into the case for grounding and, ahem, my cats have learned to stay away from it due to the spinning fans.) When I was debugging this one card at a time for simplicity, one of the things I did was eliminate the riser and only use the mobo's first 16x slot. Great, that worked, as I said above. After finishing the rebuild of the system and drivers today with no luck, I looked at the syslog file (duh, why didn't I do this earlier?) and the "external" 1660 Ti was complaining that it didn't have enough power!

Silly me. Since this card is external, early on I had to get a dual-4-pin-to-8-pin PCIe power adapter so the power line was long enough to reach outside the case. With all the hardware moving around, I had dislodged one of the two 4-pin connectors, so the 8-pin was giving the 1660 Ti enough power to run the fans and participate on the bus enough to be recognized, but it knew it didn't have enough power to do any real work. Thus lspci showed it, but psensor and FaH didn't see it. For whatever reason, FaH knowing that the device was there but not being able to use it caused it to report errors about not finding CUDA or OpenCL!
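
For anyone else chasing something similar, this is roughly the sort of log digging that finally tipped me off (the exact messages will vary by driver version):

Code: Select all

# look for the kernel / NVIDIA driver complaining about a card
dmesg | grep -iE 'nvrm|power|pcie'
grep -iE '1660|NVRM' /var/log/syslog | tail -n 50

# nvidia-smi can also report each board's power draw and limits
nvidia-smi -q -d POWER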

The system is now back to happily crunching on both video cards. Moral of the story: Double-check sufficient power. :-(

Re: Lost ability to run two GPUs

Posted: Tue Mar 02, 2021 10:33 pm
by bruce
Thanks for the report.

I have a couple of low-power GPUs which might be usable with a 1x riser cable, provided the vacant 1x slot can drive it; it would have to be external. Does anybody have any experience with that sort of setup?

Re: Lost ability to run two GPUs

Posted: Wed Mar 03, 2021 12:13 am
by craig110
bruce wrote:Thanks for the report.

I have a couple of low-power GPUs which might be usable with a 1x riser cable, provided the vacant 1x slot can drive it; it would have to be external. Does anybody have any experience with that sort of setup?
One thing that I immediately noticed when I moved the original 1660 Ti into its secondary position behind the 1080 Ti is a change in how much of the PCIe bandwidth it consumed. (I don't know which OS you use, but Ubuntu's psensor output will show me what percentage of the PCIe bandwidth is being consumed by each card.) When my 1660 was alone in a 16x slot, it averaged around 30-35% of the PCIe. When I put the 1080 Ti into that first 16x slot, it gets around the same 30-35%, but the 1660 went up to 100% minus whatever percentage the 1080 was getting. I found that the slot I had externally connected the 1660 to wasn't a 16x slot, so I moved the riser to the other 16x slot. What do I see now? The 1660 still takes about twice the PCIe as the 1080!

After researching this, I found that the actual maximum number of PCIe lanes that can be used (or at least used simultaneously) is controlled by the motherboard's southbridge chip and the CPU itself (with the minimum amount "winning", of course). Essentially, even though my 1660 is in a "16x" slot, it is only getting 8x of bandwidth while both it and the 1080 are busy communicating over the PCIe bus. Now, I've not seen a major decrease in the total throughput of the 1660, which speaks well to the amount of FAH computing given to the card per PCIe transfer, but at some point slowing down the data transfer is going to impact the card's compute throughput.
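
If you want to see what each card has actually negotiated rather than what the slot is labelled, this is the sort of check I'd run (the lspci detail usually needs root):

Code: Select all

# negotiated PCIe link width per GPU, as reported by the driver
nvidia-smi --query-gpu=name,pcie.link.width.current,pcie.link.width.max --format=csv

# the same information from the bus side: 10de is NVIDIA's PCI vendor id,
# and the "LnkSta" line is the width each card actually negotiated
sudo lspci -vv -d 10de: | grep -E '^[0-9a-f]|LnkSta'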

The reason 1x risers are used so commonly in mining applications is that their ratio of computes-per-transfer is so high that doing the transfers at 1x speed doesn't have much of a real impact. I don't know where FAH sits in the computes-per-transfer ratio, but it appears to be a lot lower than the mining software's, and is thus more sensitive to the lane constraints of a particular system's configuration.

This is a long-winded, round-about way of saying "hmm, depends on your system." ;-) If your motherboard / CPU / in-use adapter combination leaves PCIe lanes unused (which is different from just having a physical slot that is unused), putting an older card on a 1x riser should give you more FAH power (though my gut says that at 1x you might see less from that one card than expected). The good news is that 1x risers are cheap, so give it a try! What would be an interesting test is to take a single-card system with a known PPD, the card in a 16x slot, and plug that card instead into a 1x riser to see how the PPD changes; that would be the true test of how badly being in a 1x slot hurts FAH. If the hit isn't bad, great, add more cards via risers. If the hit at 1x is pretty bad, though, you might want to weigh the value of running the extra card against the cost of its power consumption.

Do be careful to watch your total power consumption, though. I have a rather beefy power supply in the system running the two video cards.

Re: Lost ability to run two GPUs

Posted: Wed Mar 03, 2021 12:04 pm
by gunnarre
With PCIe v4, you can also transfer more data per lane than with PCIe v3, due to the higher bandwidth. But presumably risers will be harder to make reliable for PCIe v4? At the moment there is very little benefit to PCIe v4 for most users, but in this application of multiple GPUs in one machine it might already make sense.

Re: Lost ability to run two GPUs

Posted: Thu Mar 04, 2021 3:25 am
by MeeLee
gunnarre wrote:With PCIe v4, you can also transfer more data per lane than with PCIe v3, due to the higher bandwidth. But presumably risers will be harder to make reliable for PCIe v4? At the moment there is very little benefit to PCIe v4 for most users, but in this application of multiple GPUs in one machine it might already make sense.
It'll be hard to find compatible BIOSes that will allow more than 2 or 3 GPUs.
PCIe 4.0 will possibly be a must on RTX 4000/5000 series GPUs, as the 3000-series GPUs easily fill PCIe 3.0 x8 speeds.

Re: Lost ability to run two GPUs

Posted: Sat Apr 03, 2021 3:09 pm
by craig110
Earlier in this thread there was the question about how much performance is lost moving to narrower PCIe 3.0 slots. Yesterday I moved my GeForce 1660 Ti (reminder: this is a dual GPU system; GeForce 1080 Ti and 1660 Ti) from a slot operating at x8 speed to a slot operating at x4 speed. The result was that Psensor showed that the graphics utilization of the 1660 dropped from 95-100% to consistently being in the 77-78% range now. Running in this configuration overnight to see my total points-per-day also reflects that I lost about a quarter of the 1660's computing power.

This doesn't mean that everyone would lose a quarter of the GPU going from x8 to x4, of course, as the resulting performance is dependent upon how fast the GPU is. In general, the faster the GPU the more of a hit will be taken moving it into a slower slot. (Keep in mind that the physical slot size doesn't mean that is the speed it will be transferring data at. My 1080 Ti is in the first x16 slot and is actually getting 16 PCIe lanes so it is running at full speed. My motherboard and CPU don't support enough lanes to feed 16 to all three of the "x16" physical slots, so the first is assigned 16 lanes, the second is assigned 8 and the last "x16" slot gets 4, so these three "x16" physical slots run at, respectively, x16, x8, and x4.)

So, the original question about using x1 risers (like they do in mining boxes) to move cards externally to ease their cooling is probably not a good idea with folding@home. The mining applications do far more computing per unit of data transferred so they can get away with an x1 speed, but not folding@home.

Re: Lost ability to run two GPUs

Posted: Sat Apr 03, 2021 5:07 pm
by MeeLee
craig110 wrote:Earlier in this thread there was the question about how much performance is lost moving to narrower PCIe 3.0 slots. Yesterday I moved my GeForce 1660 Ti (reminder: this is a dual GPU system; GeForce 1080 Ti and 1660 Ti) from a slot operating at x8 speed to a slot operating at x4 speed. The result was that Psensor showed that the graphics utilization of the 1660 dropped from 95-100% to consistently being in the 77-78% range now. Running in this configuration overnight to see my total points-per-day also reflects that I lost about a quarter of the 1660's computing power.

This doesn't mean that everyone would lose a quarter of the GPU going from x8 to x4, of course, as the resulting performance is dependent upon how fast the GPU is. In general, the faster the GPU the more of a hit will be taken moving it into a slower slot. (Keep in mind that the physical slot size doesn't mean that is the speed it will be transferring data at. My 1080 Ti is in the first x16 slot and is actually getting 16 PCIe lanes so it is running at full speed. My motherboard and CPU don't support enough lanes to feed 16 to all three of the "x16" physical slots, so the first is assigned 16 lanes, the second is assigned 8 and the last "x16" slot gets 4, so these three "x16" physical slots run at, respectively, x16, x8, and x4.)

So, the original question about using x1 risers (like they do in mining boxes) to move cards externally to ease their cooling is probably not a good idea with folding@home. The mining applications do far more computing per unit of data transferred so they can get away with an x1 speed, but not folding@home.
Try running Linux instead for folding.
An x4 slot should still easily feed a 2070 or less.

Re: Lost ability to run two GPUs

Posted: Sat Apr 03, 2021 5:20 pm
by bruce
craig110 wrote:This doesn't mean that everyone would lose a quarter of the GPU going from x8 to x4, of course, as the resulting performance is dependent upon how fast the GPU is. In general, the faster the GPU the more of a hit will be taken moving it into a slower slot.
Yes, this is true, but it also depends on the size of the protein. Moving the GPUs to new slots will change their speed, but your testing won't be valid unless you're still running the same size protein.

Since the BIOS generally allocates more lanes to the first slots, the general rule is: Put your fastest GPU in the first slot, and so forth.

Re: Lost ability to run two GPUs

Posted: Sat Apr 03, 2021 5:49 pm
by gunnarre
On some motherboards, the first PCIe slot is connected directly to the CPU without even going through the motherboard chipset. Some server style boards have even more PCIe slots which are directly connected to the CPU.

Re: Lost ability to run two GPUs

Posted: Sat Apr 03, 2021 8:08 pm
by craig110
Bruce: Absolutely, it depends upon the protein. Between letting it run overnight and spot-checking it again, it has covered probably ~5 proteins and is holding the 77-78% graphics usage pretty consistently, so I'm happy with the validity of that number. Not that it really matters, though, as the point was to answer the initial question about whether running F@H on x1 risers would be problematic. I think you'd agree it would be, unless the GPU was a really slow one. I do totally agree with putting the fastest card in the first slot, and yes, it tends to get the max lanes for that slot.

Gunnarre: Oh yes, I'm starting to look at higher-end / server motherboards that can pump x16 speeds to multiple slots, especially since I was hoping to add a third GPU towards the end of this year. Too bad that also requires going to server-level CPUs so that the CPU also supports enough lanes to keep them busy. At that point it almost becomes easier to just have multiple conventional systems with one x16 and one x8 slot used in each. :-(