Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 14, 2020 7:32 pm
by ThWuensche
muziqaz wrote:I believe we might make a decision to ban AMD folding on Linux
fahbenching will not help anything on Linux. It is clear as day that AMD on Linux is like winning a lottery. Does more harm than good.
On Linux there should be a better chance of finding people who can analyze code and debug failures or bottlenecks than on Windows, where a lot of the contribution probably comes from gamers, most of whom don't have that insight or experience. For example, I own a company with about 15 employees that develops embedded electronics, much of it running embedded variants of Linux, for which we develop software and sometimes create our own ports of Linux.
I have been running Linux since my time at university, so since about 1992. My first contribution at that time was a driver for the SCSI adapter I had in my system. My company uses Linux in embedded devices because in case of trouble we can debug it ourselves, while for Windows we depend on whether Microsoft intends to do something and when it will be done. That is not acceptable for us and our clients in industrial automation; our clients expect that systems run and that fixes are provided when problems occur.
The problem is that, to help with debugging, the system has to be open and accessible, which unfortunately is not the case for FAH, since it is closed source. And even the parts which are open source (like OpenMM) don't help if some problematic WUs cannot (effectively, in a documented way) be run on the open source parts for debugging. The new compute stack ROCm from AMD is completely open source, so debugging problems with FAH would help both FAH and the AMD software. But without debug possibilities on the FAH side, the only thing left is complaining.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 14, 2020 8:39 pm
by UofM.MartinK
@ThWuensche: My background seems to be very similar to yours, although most of the embedded systems I work with are much smaller and can't even run Linux.
F@h wants to become fully open source, at least that's one of their goals, but it ain't easy. I think the main problem right now is not licensing, but that it would make the creation of cheating clients pretty easy. This can be addressed, but it needs some redesign which would disrupt the current folding efforts (as far as I understand).
And currently, I am still fighting with AMD's closed source drivers. I just succeeded in upgrading from amdgpu-pro-20.10-1048554-ubuntu-18.04 to amdgpu-pro-20.30-1109583-ubuntu-20.04, but it took far more time and effort than I anticipated.
@muziqaz: AMD could be more helpful, yes, and their driver installers have gotten better, though with much room still to improve. But it would also help if F@h or the community did a better job of consolidating the knowledge about how to get F@h working on Linux; it's often confusing to find the relevant and most recent information in the forum. The HOWTOs in "Problems with AMD/ATI drivers" (viewforum.php?f=81) are a great start, but often the information is outdated, and solutions include unnecessary steps, have typos in them, ...
I wonder if we, as a community, could at least maintain a handful of common LinuxFlavor+GPU combinations as "pinned posts" and try to keep the first post fresh?
Back on topic:
Still using amdgpu-pro-20.10-1048554-ubuntu-18.04 and switching to "Power Saving" profile, the card's behavior changed in terms of temperature & CPU scaling, but it still used the highest clock speed occasionally, and still failed most WUs. I "underclocked" by forbidding the highest SCLK frequencies, which reduced temperatures further, but not to the level the card sometimes shows for hours (without having changed anything!) when it is folding stable.
So now I upgraded to amdgpu-pro-20.30-1109583-ubuntu-20.04 to make all further tests more relevant. It is apparent that the frequency scaling behavior and "temperature profile" changed somewhat by default, but out-of-the-box, it's still failing most units when the card displays the "higher temperature" behavior.
I will keep it running for now before fiddling with the settings and hope to be able to take a snapshot of the amdgpu driver state when it's running "cool and reliable" for hours.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 14, 2020 9:06 pm
by ThWuensche
UofM.MartinK wrote:@ThWuensche: My background seems to be very similar to yours, although most of the embedded systems I work with are much smaller and can't even run Linux.
F@h wants to become fully open source, at least that's one of their goals, but it ain't easy. I think the main problem right now is not licensing, but that it would make the creation of cheating clients pretty easy. This can be addressed, but it needs some redesign which would disrupt the current folding efforts (as far as I understand).
And currently, I am still fighting with AMD's closed source drivers. I just succeeded in upgrading from amdgpu-pro-20.10-1048554-ubuntu-18.04 to amdgpu-pro-20.30-1109583-ubuntu-20.04, but it took far more time and effort than I anticipated.
Most of the systems we build are smaller, too. No more 8032 8-bitters, but a lot of ARM7 and Cortex-M devices. Back in 2000 I ported Linux to an ARM7 platform; the devices are still in production.
As for drivers, I'm running the ROCm stack on the standard kernel driver. However, as I understand it, ROCm supports only a limited set of GPU types at the moment, as it is developed for compute center use. It is a Linux-only solution, since that is what is run in compute centers (mostly; MS will probably complain about that statement). So it might not support your GPUs.
As for a non-cheatable solution, FAH could provide the source but accept results only from a signed binary. Results from non-signed binaries could be judged, but not counted. This would allow supporters to diagnose, debug, run test versions and provide patches, while still guaranteeing the integrity of the results. And contributors would run modified versions only for debugging if the results are not credited.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 14, 2020 9:58 pm
by bruce
I fully understand the perspective of both of you. FAH is transitioning from fully closed source to fully open source, but that's not something that happens easily or quickly, especially when the "non-cheatable" label must be applied. First, science must accept only results from certified code. Second, the competition for points (even though they cannot be exchanged for things of value) has always been a strong incentive for hacking. BOINC wastes half of the donor resources by requiring every result to be validated by another Donor. Third, we've been going through a very rapid expansion, adding a massive number of Donors and many, many servers to accommodate them. We're also porting the client to other platforms (ARM, for at least two) and iGPUs for another.
Do check out our GitHub site and offer suggested solutions for things that can be fixed in the code that's already open source -- and remember it has to run on Windows/macOS/RPi/etc. as well as Linux.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sat Aug 15, 2020 6:30 am
by Nuitari
I've been checking the GPU temperatures and mine are around 60C +/- 2C at all times, so I don't think that's the cause of the problem. The default profile from AMD allows for much higher (~90C) temperatures.
Plus when they start churning faulty WUs, they are cooler.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sat Aug 15, 2020 8:41 am
by muziqaz
The problem with How To's on Linux for AMD is, again, that it's a hit-or-miss lottery. Everything is hand crafted, and even the smallest difference makes or breaks the system. So whatever works for one will most likely not work for others.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sat Aug 15, 2020 1:51 pm
by UofM.MartinK
muziqaz wrote:The problem with How To's on Linux for AMD is, again, that it's a hit-or-miss lottery. Everything is hand crafted, and even the smallest difference makes or breaks the system. So whatever works for one will most likely not work for others.
For existing installations, yes, especially if a lot of special stuff was already done with a system.
But for fresh installs, I usually only have to record two or three "variant" steps in between for different hardware scenarios, and then the step-by-step instructions work pretty well.
Admittedly, my experience is mainly limited to mainline Debian and Ubuntu since around 2000, but the track record is pretty good.
My main challenge is usually when I am out of my comfort zone/desperate and start running scripts and changing config files "just because" it helped somebody else - which often turns out to have A) had nothing to do with the solution, and/or B) changed the system to such a degree that it is no longer in a well-defined state.
Nuitari wrote:I've been checking the GPU temperatures and mine are around 60C +/- 2C at all times, so I don't think that's the cause of the problem. The default profile from AMD allows for much higher (~90C) temperatures.
Plus when they start churning faulty WUs, they are cooler.
I also start to think it's not really the temperatures on these two projects. Somewhere I read 16600 and 13421 use features of the core which were not (commonly) used before.
I can report that upgrading to amdgpu-pro-20.30-1109583-ubuntu-20.04 didn't improve anything on my RX580, and I am pretty much at the end of useful underclocking, at a GPU temperature of 45C.
I will try a little more to (re)produce that "magic" state my RX580 had reached by itself (before I started fiddling with anything), sometimes for 6-12 hours at a time (see the munin graph some posts before this one), the state where temperatures were rock-steady around 45C and almost all WUs completed.
What I haven't done yet: fixing the SCLK and MCLK to a single frequency; I was just blacklisting all the higher ones. Perhaps that's what the card/driver got itself into when it "just worked"?
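For anyone who wants to try the same experiment, here is a minimal sketch of restricting or pinning the clock levels through the amdgpu sysfs interface. It relies on the `power_dpm_force_performance_level`, `pp_dpm_sclk` and `pp_dpm_mclk` files that the amdgpu kernel driver exposes; the function name, the `card0` path and the level indices in the usage note are assumptions for illustration, not the settings used in this thread.

```python
from pathlib import Path


def restrict_dpm_levels(device_dir, sclk_levels, mclk_levels):
    """Limit an amdgpu card to the given SCLK/MCLK level indices.

    device_dir is the card's sysfs device directory, for example
    /sys/class/drm/card0/device (card0 is an assumption; check which
    card is the GPU on your system). Writing there requires root.
    """
    device = Path(device_dir)
    # Level masks are only honoured while the driver is in "manual" mode.
    (device / "power_dpm_force_performance_level").write_text("manual\n")
    # One index pins the clock to a single level; several indices only
    # restrict the range the driver may choose from.
    (device / "pp_dpm_sclk").write_text(" ".join(map(str, sclk_levels)) + "\n")
    (device / "pp_dpm_mclk").write_text(" ".join(map(str, mclk_levels)) + "\n")
```

Reading `pp_dpm_sclk` first shows which level indices exist on a given card. A call such as `restrict_dpm_levels("/sys/class/drm/card0/device", [4], [1])` would then pin both clocks to one level each, while passing several indices reproduces the "blacklist the higher ones" approach. Writing `auto` back to `power_dpm_force_performance_level` returns control to the driver.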
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sat Aug 15, 2020 3:04 pm
by ViTe
We all have different cards, different platforms, different drivers, different clients and different OSes, but we are all getting the same problem with a couple of specific projects while other projects are running without any issues. Do you really think that temperatures or boost clocks are the cause of the problem? How about the moon phase?
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sat Aug 15, 2020 4:01 pm
by UofM.MartinK
Don't know if it was here or on discord, I already blamed the moon phase
No, I don't think it's temperatures - (variations in) temperatures might just be an indication/hint towards what really is going on.
Actually, I think it's the driver - and perhaps something about power management. Because whatever variant we have, we have AMD architecture GPUs and a flavor of AMD drivers.
And some AMD cards of the same type do well, while (most?) AMD cards of the same type have troubles.
Now, as mentioned, I noticed the two personalities of my card - folding successfully for hours, up to 6-8 16600 and 13241's in a row, then for (many more) hours almost never completing any.
And in this "binary behavior", I see a significantly different power+temperature pattern. Not higher or lower, but _constant_ (when it works)
This might be a red herring, but somehow I believe I am on to something - unless the WU's, in phases, were configured differently. But both 16600 and 13241 simultaneously?
Perhaps somebody involved with the creation of the WUs could comment whether the internals of these two projects changed at any point over the last two weeks.
If so, I am hunting a red herring and will stop wasting my efforts.
If not, the "binary" nature of my card is well worth exploring further, I believe.
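One way to put a number on "constant" versus "not constant" is to look at the spread of the temperature readings per time window. The following is only a rough sketch: the function name, the window length and the 1.0 degree threshold are arbitrary assumptions, and the readings would have to come from whatever is already being logged (the munin data, for instance).

```python
from statistics import pstdev


def steady_windows(samples, window=6, max_stdev=1.0):
    """Yield (start_index, is_steady) for consecutive, non-overlapping windows.

    samples: temperature readings in degrees C, in time order.
    window and max_stdev are arbitrary illustrative defaults.
    """
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        # A small standard deviation means the readings barely moved.
        yield start, pstdev(chunk) <= max_stdev
```

Lining up the steady and unsteady windows against the times at which WUs completed or failed would show whether the two "personalities" really coincide with success and failure.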
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sat Aug 15, 2020 4:03 pm
by ajm
For what it's worth, I fold with 4 PCs and 7 GPUs and were it not for AMD products and Linux (which I don't use for folding any more) I would be hard pressed to mention any serious issue with any project these last months -- apart from larger than usual ranges of PPD for some of the newest ones, but which doesn't really make a difference on the whole.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sat Aug 15, 2020 5:11 pm
by ThWuensche
bruce wrote:I fully understand the perspective of both of you. FAH is transitioning from fully closed source to fully open source, but that's not something that happens easily or quickly, especially when the "non-cheatable" label must be applied. First, science must accept only results from certified code. Second, the competition for points (even though they cannot be exchanged for things of value) has always been a strong incentive for hacking. BOINC wastes half of the donor resources by requiring every result to be validated by another Donor. Third, we've been going through a very rapid expansion, adding a massive number of Donors and many, many servers to accommodate them. We're also porting the client to other platforms (ARM, for at least two) and iGPUs for another.
Thank you for your answer, bruce. This wave of support (donations) is just the point where existing resources may reach their limit and where additional contributions (to the software infrastructure) in an open environment could help. Of course the wave of additional contributors could hardly have been foreseen, and the pressure to provide scientific results and improve the infrastructure all at once is enormous.
You have surely already considered different options, and my idea may just be nonsense: as far as I understand, the results sent back to the servers are signed anyhow. Wouldn't it be manageable to do this signing with a key available only to a signed version of the binaries, or valid only in combination with the signature? In that case the code (for example the cores) could be made open source without problem, since only the results of the signed binaries would be credited by the servers (non-cheatable). External developers/testers could possibly be given other keys, which lead to the results being tested by the servers but not credited. That way the correctness of modifications by external developers could be tested empirically before they are accepted for inclusion into the main code base and quality control.
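To make the accounting part of this idea concrete, here is a small sketch of a server that treats results differently depending on which key signed them. Everything in it is hypothetical: the key names, the key values and the function names are invented, and HMAC is used only as a simple stand-in for whatever signature scheme FAH actually uses. It does not address the hard part, namely keeping a key secret inside a binary that is distributed to donors.

```python
import hashlib
import hmac

# Hypothetical key table: the key a result was signed with decides how
# the server treats it. Real keys would never be hard-coded like this.
KEYS = {
    "release": b"release-key-known-only-to-signed-binaries",
    "tester": b"test-key-handed-to-external-developers",
}


def sign_result(payload: bytes, key: bytes) -> str:
    """Client side: return a hex HMAC-SHA256 tag for a result payload."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def classify_result(payload: bytes, tag: str) -> str:
    """Server side: decide what to do with a submitted result."""
    for role, key in KEYS.items():
        expected = sign_result(payload, key)
        # compare_digest avoids leaking information through timing.
        if hmac.compare_digest(expected.encode(), tag.encode()):
            # Release builds earn credit; tester builds are only checked.
            return "credited" if role == "release" else "checked-only"
    return "rejected"
```

A result tagged with the tester key comes back as `checked-only`, so a developer could see whether a modified core produces results the server accepts without earning points for them, and anything with an unknown or mismatched tag is rejected.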
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sat Aug 15, 2020 6:20 pm
by ViTe
UofM.MartinK wrote:
Actually, I think it's the driver - and perhaps something about power management. Because whatever variant we have, we have AMD architecture GPUs and a flavor of AMD drivers.
And 16600 or 13241 projects. All other projects are fine, right? Why these projects? What's so special about these two? Maybe some changes in data, generation mechanics, CPU involvement or something else make them kind of incompatible with AMD devices/drivers. But we don't know any details of project creation. The people who are responsible for creation/testing should work on it.
Some time ago I saw a few WUs in a row complete successfully on my system, and after that single WUs sometimes completed successfully. It means that an AMD GPU can do the job if it gets the right data or a properly generated WU. I don't see any connection to the driver/temperatures or anything else on my side. I haven't touched my system for months. No changes. What is the change? The WU only.
About your system: from the graph you provided before I don't see any unusual power/temp events for your GPU. GPU temp (temp1) is pretty constant under load; CPU temp looks much more suspicious. What is the "edge" temperature? Looks like something related to the motherboard.
And some AMD cards of the same type do well, while (most?) AMD cards of the same type have troubles.
BTW, people reporting that their cards do the job didn't provide any details about their systems. So we cannot try the same conditions for our cards.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sat Aug 15, 2020 9:20 pm
by UofM.MartinK
ViTe wrote:UofM.MartinK wrote:
Actually, I think it's the driver - and perhaps something about power management. Because whatever variant we have, we have AMD architecture GPUs and a flavor of AMD drivers.
And 16600 or 13241 projects. All other projects are fine, right? Why these projects? What's so special about these two? Maybe some changes in data, generation mechanics, CPU involvement or something else make them kind of incompatible with AMD devices/drivers. But we don't know any details of project creation. The people who are responsible for creation/testing should work on it.
Yes, I heard that these two projects use newer features of FahCore_22, but I don't know the details of these features - I would assume a larger instruction set is used.
ViTe wrote:
Some time ago I saw a few projects in a row complete successfully on my system, and even after that some rare WUs were completed. It means that an AMD GPU can do the job if it gets the right data or a properly generated WU. I don't see any connection to the driver/temperatures or anything else on my side. I haven't touched my system for months. No changes. What is the change? The WU only.
Do you still have the logs, and would you be willing to share them with me? If you don't want to share the whole logs, I am most curious about the SEND (and potentially receive) lines for these two projects, so the output of the following command would work well:
Code:
grep -h -e '^\*' -e 'project:16600' -e 'project:13421' logs/* log.txt
If you got the "rows of working WUs" in the same time window as I did (roughly August 3rd and August 13th), then these WUs were most likely different internally, a strong indication that it's "exclusively" WU related and my efforts are in vain.
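On systems without grep (Windows, for example), roughly the same filter can be sketched in Python. The three patterns are taken from the grep command above; the function names are made up for illustration, and the sketch assumes the logs are plain text files.

```python
import re
from pathlib import Path

# Same three patterns as the grep command: lines starting with '*'
# and lines mentioning either of the two projects.
PATTERNS = [re.compile(p) for p in (r"^\*", r"project:16600", r"project:13421")]


def filter_fah_log(lines):
    """Return the lines that match at least one of the patterns."""
    return [line for line in lines if any(p.search(line) for p in PATTERNS)]


def filter_fah_logs(paths):
    """Filter several log files in turn, like grep -h (no file name prefix)."""
    matches = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        matches.extend(filter_fah_log(text.splitlines()))
    return matches
```

Run from the directory that holds `log.txt` and the `logs` folder, `filter_fah_logs(sorted(Path("logs").glob("*")) + [Path("log.txt")])` returns the matching lines in file order, ready to be printed or pasted into a post.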
ViTe wrote:
About your system: from the graph you provided before I don't see any unusual power/temp events for your GPU. GPU temp (temp1) is pretty constant under load; CPU temp looks much more suspicious. What is the "edge" temperature? Looks like something related to the motherboard.
The blue "Edge" is the GPU temperature. The green "Temp1" is another temperature reading related to the CPU. I didn't get around to assigning good names to these, and obviously was tired enough to write the wrong name in the original post - corrected it now, sorry for the confusion. I might later annotate these plots with the timestamps of WU-project start, NaN exception and successful completion (if it makes sense at all to put more effort into this).
ViTe wrote:
And some AMD cards of the same type do well, while (most?) AMD cards of the same type have troubles.
BTW, people reporting that their cards do the job didn't provide any details about their systems. So we cannot try the same conditions for our cards.
That's good to know! I also didn't find any of the WUs failed by my card being completed by another AMD card yet.
Could anybody share logs and/or system details from Linux systems with AMD GPUs which supposedly complete most 16600 and 13421 projects? Thanks!
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sun Aug 16, 2020 5:39 am
by bruce
I don't have access to those error reporting logs, myself. The project owner does and he and his staff are digesting the data. I do hear them talking about future plans but I have the same access that you do unless I ask a specific question.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sun Aug 16, 2020 5:41 pm
by ViTe
UofM.MartinK wrote:
Do you still have the logs, and would you be willing to share them with me? If you don't want to share the whole logs, I am most curious about the SEND (and potentially receive) lines for these two projects, so the output of the following command would work well:
Code:
grep -h -e '^\*' -e 'project:16600' -e 'project:13421' logs/* log.txt
I don't have those logs anymore, and I'm running Windows.