Re: Bad State detected on GPU (AMD)
Posted: Mon May 18, 2020 3:24 pm
by Neil-B
Different project WUs work the cores and your GPU in different ways, so whether it is a hardware/driver/whatever issue, it is quite possible to have some WUs fold and others not.
You appear to be getting a similar type of error from a wide variety of Project WUs, at quite high failure rates, on WUs that other folders are completing - regrettably that probably means an issue with your kit or the way it is configured, not with the Project WUs ... Tracking down these types of issues can be tricky - but it is worth persevering.
Couple of questions ... Have you always had these issues or have they started to happen more recently? ... What GPU are you running (couldn't spot this config in any of the posted logs)?
You can check your bonus status with the bonus status app
https://apps.foldingathome.org/bonus
Re: Bad State detected on GPU (AMD)
Posted: Mon May 18, 2020 7:15 pm
by 4n0n
Neil-B wrote:You appear to be getting a similar type of error from a wide variety of Project WUs, at quite high failure rates, on WUs that other folders are completing - regrettably that probably means an issue with your kit or the way it is configured, not with the Project WUs ... Tracking down these types of issues can be tricky - but it is worth persevering.
I'm willing to help track down these issues, but I'll need advice on what to do/try.
For what it is worth: I also think that the PPD on my GPU does not reflect the performance I would expect from such relatively new hardware. I get about 140k-200k GPU-PPD, while older NVIDIA cards are said to make 1.2M to 2.4M GPU-PPD.
Neil-B wrote:Couple of questions ... Have you always had these issues or have they started to happen more recently? ... What GPU are you running (couldn't spot this config in any of the posted logs)?
I've been folding for 13 years now: a long time, with many platforms and different hardware, but all without a GPU, and also without trouble. Some years ago I stopped folding, but I reactivated my machines in March 2020 to fight COVID-19. This was the first time I had a GPU-equipped machine in my hands, and I wanted to use it for folding too. The NaN errors mentioned in this thread were present from the very beginning of folding on that particular GPU machine (March 2020). Decide for yourself whether this counts as "always" or "more recently".
The machine in question runs Linux Mint 19.3 on an AMD Ryzen 5 3600 with an AMD Radeon RX 5500 XT and the original proprietary GPU drivers from the official AMD site.
Are you looking for these lines?
Code:
19:05:43:****************************** FAHClient ******************************
19:05:43: Version: 7.6.9
19:05:43: Author: Joseph Coffland <[email protected]>
19:05:43: Copyright: 2020 foldingathome.org
19:05:43: Homepage: https://foldingathome.org/
19:05:43: Date: Apr 17 2020
19:05:43: Time: 18:11:26
19:05:43: Revision: 398c2b17fa535e0cc6c9d10856b2154c32771646
19:05:43: Branch: master
19:05:43: Compiler: GNU 8.3.0
19:05:43: Options: -std=c++11 -ffunction-sections -fdata-sections -O3
19:05:43: -funroll-loops -fno-pie
19:05:43: Platform: linux2 4.19.0-5-amd64
19:05:43: Bits: 64
19:05:43: Mode: Release
19:05:43: Args: --child /etc/fahclient/config.xml
19:05:43: --pid-file=/var/run/fahclient/fahclient.pid --daemon
19:05:43: Config: /etc/fahclient/config.xml
19:05:43:******************************** CBang ********************************
19:05:43: Date: Apr 17 2020
19:05:43: Time: 18:10:13
19:05:43:Started thread 1 on PID 26178
19:05:43: Revision: 2fb0be7809c5e45287a122ca5fbc15b5ae859a3b
19:05:43: Branch: master
19:05:43: Compiler: GNU 8.3.0
19:05:43: Options: -std=c++11 -ffunction-sections -fdata-sections -O3
19:05:43: -funroll-loops -fno-pie -fPIC
19:05:43: Platform: linux2 4.19.0-5-amd64
19:05:43: Bits: 64
19:05:43: Mode: Release
19:05:43:******************************* System ********************************
19:05:43: CPU: AMD Ryzen 5 3600 6-Core Processor
19:05:43: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
19:05:43: CPUs: 12
19:05:43: Memory: 31.37GiB
19:05:43: Free Memory: 6.98GiB
19:05:43: Threads: POSIX_THREADS
19:05:43: OS Version: 5.6
19:05:43: Has Battery: false
19:05:43: On Battery: false
19:05:43: UTC Offset: 2
19:05:43: PID: 26178
19:05:43: CWD: /var/lib/fahclient
19:05:43: OS: Linux 5.6.6-050606-generic x86_64
19:05:43: OS Arch: AMD64
19:05:43: GPUs: 1
19:05:43: GPU 0: Bus:40 Slot:0 Func:0 AMD:6 Navi 14 [Radeon RX 5500/5500M / Pro
19:05:43: 5500M]
19:05:43: CUDA: Not detected: Failed to open dynamic library 'libcuda.so':
19:05:43: libcuda.so: cannot open shared object file: No such file or
19:05:43: directory
19:05:43:OpenCL Device 0: Platform:0 Device:0 Bus:40 Slot:0 Compute:2.0 Driver:3075.10
19:05:43:******************************* libFAH ********************************
19:05:43: Date: Apr 15 2020
19:05:43: Time: 21:43:24
19:05:43: Revision: 216968bc7025029c841ed6e36e81a03a316890d3
19:05:43: Branch: master
19:05:43: Compiler: GNU 8.3.0
19:05:43: Options: -std=c++11 -ffunction-sections -fdata-sections -O3
19:05:43: -funroll-loops -fno-pie
19:05:43: Platform: linux2 4.19.0-5-amd64
19:05:43: Bits: 64
19:05:43: Mode: Release
19:05:43:***********************************************************************
19:05:43:<config>
19:05:43: <!-- Client Control -->
19:05:43: <client-threads v='6'/>
19:05:43: <cycle-rate v='4'/>
19:05:43: <cycles v='-1'/>
19:05:43: <disable-sleep-when-active v='true'/>
19:05:43: <exit-when-done v='false'/>
19:05:43: <fold-anon v='true'/>
19:05:43: <idle-seconds v='300'/>
19:05:43: <open-web-control v='false'/>
19:05:43:
19:05:43: <!-- Configuration -->
19:05:43: <config-rotate v='true'/>
19:05:43: <config-rotate-dir v='configs'/>
19:05:43: <config-rotate-max v='16'/>
19:05:43:
19:05:43: <!-- Debugging -->
19:05:43: <assignment-servers>
19:05:43: assign1.foldingathome.org assign2.foldingathome.org assign3.foldingathome.org assign4.foldingathome.org
19:05:43: </assignment-servers>
19:05:43: <auth-as v='true'/>
19:05:43: <capture-directory v='capture'/>
19:05:43: <capture-on-error v='false'/>
19:05:43: <capture-packets v='false'/>
19:05:43: <capture-requests v='false'/>
19:05:43: <capture-responses v='false'/>
19:05:43: <capture-sockets v='false'/>
19:05:43: <debug-sockets v='false'/>
19:05:43: <exception-locations v='true'/>
19:05:43: <stack-traces v='false'/>
19:05:43:
19:05:43: <!-- Error Handling -->
19:05:43: <max-slot-errors v='10'/>
19:05:43: <max-unit-errors v='5'/>
19:05:43:
19:05:43: <!-- Folding Core -->
19:05:43: <checkpoint v='15'/>
19:05:43: <core-priority v='idle'/>
19:05:43: <cpu-usage v='100'/>
19:05:43: <gpu-usage v='100'/>
19:05:43: <no-assembly v='false'/>
19:05:43:
19:05:43: <!-- Folding Slot Configuration -->
19:05:43: <cause v='COVID_19'/>
19:05:43: <client-subtype v='LINUX'/>
19:05:43: <client-type v='normal'/>
19:05:43: <cpu-species v='X86_AMD'/>
19:05:43: <cpu-type v='AMD64'/>
19:05:43: <cpus v='-1'/>
19:05:43: <disable-viz v='false'/>
19:05:43: <gpu v='true'/>
19:05:43: <max-packet-size v='normal'/>
19:05:43: <os-species v='UNKNOWN'/>
19:05:43: <os-type v='LINUX'/>
19:05:43: <project-key v='0'/>
19:05:43: <smp v='true'/>
19:05:43:
19:05:43: <!-- GUI -->
19:05:43: <gui-enabled v='true'/>
19:05:43:
19:05:43: <!-- HTTP Server -->
19:05:43: <allow v='127.0.0.1 192.168.10.0/24'/>
19:05:43: <connection-timeout v='60'/>
19:05:43: <deny v='0/0'/>
19:05:43: <http-addresses v='0:7396'/>
19:05:43: <https-addresses v=''/>
19:05:43: <max-connect-time v='900'/>
19:05:43: <max-connections v='800'/>
19:05:43: <max-request-length v='52428800'/>
19:05:43: <min-connect-time v='300'/>
19:05:43:
19:05:43: <!-- Logging -->
19:05:43: <log v='log.txt'/>
19:05:43: <log-color v='true'/>
19:05:43: <log-crlf v='false'/>
19:05:43: <log-date v='false'/>
19:05:43: <log-date-periodically v='21600'/>
19:05:43: <log-domain v='false'/>
19:05:43: <log-header v='true'/>
19:05:43: <log-level v='true'/>
19:05:43: <log-no-info-header v='true'/>
19:05:43: <log-redirect v='false'/>
19:05:43: <log-rotate v='true'/>
19:05:43: <log-rotate-dir v='logs'/>
19:05:43: <log-rotate-max v='16'/>
19:05:43: <log-short-level v='false'/>
19:05:43: <log-simple-domains v='true'/>
19:05:43: <log-thread-id v='false'/>
19:05:43: <log-thread-prefix v='true'/>
19:05:43: <log-time v='true'/>
19:05:43: <log-to-screen v='true'/>
19:05:43: <log-truncate v='false'/>
19:05:43: <verbosity v='5'/>
19:05:43:
19:05:43: <!-- Process Control -->
19:05:43: <child v='true'/>
19:05:43: <daemon v='true'/>
19:05:43: <fork v='false'/>
19:05:43: <pid v='false'/>
19:05:43: <pid-file v='/var/run/fahclient/fahclient.pid'/>
19:05:43: <respawn v='false'/>
19:05:43: <service v='false'/>
19:05:43:
19:05:43: <!-- Slot Control -->
19:05:43: <idle v='false'/>
19:05:43: <max-shutdown-wait v='60'/>
19:05:43: <pause-on-battery v='true'/>
19:05:43: <pause-on-start v='false'/>
19:05:43: <paused v='false'/>
19:05:43: <power v='medium'/>
19:05:43:
19:05:43: <!-- Work Unit Control -->
19:05:43: <dump-after-deadline v='true'/>
19:05:43: <max-queue v='16'/>
19:05:43: <max-units v='0'/>
19:05:43: <next-unit-percentage v='99'/>
19:05:43: <stall-detection-enabled v='false'/>
19:05:43: <stall-percent v='5'/>
19:05:43: <stall-timeout v='1800'/>
19:05:43:
19:05:43: <!-- Folding Slots -->
19:05:43: <slot id='0' type='CPU'>
19:05:43: <cpus v='12'/>
19:05:43: <paused v='true'/>
19:05:43: </slot>
19:05:43: <slot id='1' type='GPU'>
19:05:43: <paused v='true'/>
19:05:43: </slot>
19:05:43:</config>
19:05:43:Trying to access database...
19:05:43:Successfully acquired database lock
19:05:43:Enabled folding slot 00: PAUSED cpu:12 (by user)
19:05:43:Enabled folding slot 01: PAUSED gpu:0:Navi 14 [Radeon RX 5500/5500M / Pro 5500M] (by user)
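For anyone asked the same "which GPU are you running?" question, the relevant header lines can be pulled out of a FAHClient log with a few lines of Python. This is only a minimal sketch: the `hardware_lines` helper and its regular expression are illustrative assumptions based on the header format shown above, not part of FAHClient, and wrapped continuation lines (such as the second half of the GPU 0 entry) are not captured.

```python
import re

# Matches the hardware lines of a FAHClient 7.x log header, for example
# "19:05:43: GPU 0: Bus:40 ..." or "19:05:43:OpenCL Device 0: ...".
# The pattern is an assumption derived from the header posted in this thread.
HARDWARE_LINE = re.compile(r"^\d{2}:\d{2}:\d{2}:\s*(GPUs?\b|CUDA\b|OpenCL Device\b)")


def hardware_lines(log_text):
    """Return the log lines that describe detected GPUs and compute platforms."""
    return [line for line in log_text.splitlines() if HARDWARE_LINE.match(line)]


# Short excerpt of the header above, used here as sample input.
sample_log = """19:05:43: CPU: AMD Ryzen 5 3600 6-Core Processor
19:05:43: GPUs: 1
19:05:43: GPU 0: Bus:40 Slot:0 Func:0 AMD:6 Navi 14 [Radeon RX 5500/5500M / Pro
19:05:43:OpenCL Device 0: Platform:0 Device:0 Bus:40 Slot:0 Compute:2.0 Driver:3075.10
19:05:43: <gpu v='true'/>"""

for line in hardware_lines(sample_log):
    print(line)
```

To run it against a real log, read the file into a string first. Judging by the CWD and `<log v='log.txt'/>` entries in the header, the log on this machine would presumably be /var/lib/fahclient/log.txt, but that path is an inference, not something the log states directly.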
I just tried FAHBench-cmd, with the following result. It also shows an error:
Code:
FAHBench Simulation
-------------------
Plugin directory: "/usr/lib/openmm"
Work unit: dhfr
WU Name: Dihydrofolate reductase
WU Description: A common system for benchmarking molecular dynamics
System XML: /usr/share/fahbench/workunits/dhfr/system.xml
Integrator XML: /usr/share/fahbench/workunits/dhfr/integrator.xml
State XML: /usr/share/fahbench/workunits/dhfr/state.xml
Step chunk: 40
Device ID 0; Platform OpenCL; Platform ID 0
Run length: 60s
Loading plugins from plugin directory
Number of registered plugins: 3
Deserializing input files: system
Deserializing input files: state
Deserializing input files: integrator
Creating context (may take several minutes)
Checking accuracy against reference code
Creating reference context (may take several minutes)
Comparing forces and energy
Something went wrong:
Force RMSE error of 27153.7 with threshold of 5
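To give a feel for what the failed check measures: the benchmark compares the forces computed on the OpenCL device with those from the reference code and rejects the run when the root-mean-square error exceeds the threshold. The sketch below shows one plausible way to compute such a value. The function names, the per-atom normalisation and the inclusive comparison are assumptions for illustration; FAHBench's actual formula and units may differ.

```python
import math


def force_rmse(test_forces, reference_forces):
    """Root-mean-square error between two lists of (fx, fy, fz) force vectors.

    Assumption: the squared difference is summed over all components and
    averaged per atom before taking the square root.
    """
    if len(test_forces) != len(reference_forces) or not test_forces:
        raise ValueError("need two non-empty force lists of equal length")
    squared = 0.0
    for (tx, ty, tz), (rx, ry, rz) in zip(test_forces, reference_forces):
        squared += (tx - rx) ** 2 + (ty - ry) ** 2 + (tz - rz) ** 2
    return math.sqrt(squared / len(test_forces))


def passes_accuracy_check(test_forces, reference_forces, threshold=5.0):
    """True when the RMSE stays at or below the threshold (5 in the output above)."""
    return force_rmse(test_forces, reference_forces) <= threshold
```

Against a threshold of 5, a reported error of 27153.7 is off by several orders of magnitude, so the device is returning badly wrong forces rather than slightly imprecise ones.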
4n0n wrote:Are the returned WUs considered "successfully returned" in my case?
Thanks for the link. My bonus stats imply that WUs are only classified as "returned in time" or "timed out". None of my failed WUs seem to be counted as "timed out", so they must have been either counted as "returned in time" or not counted at all. Either way they have no impact on my bonus stats, and I am fine at 99.xy percent.
Re: Bad State detected on GPU (AMD)
Posted: Mon May 18, 2020 7:30 pm
by Neil-B
I've CPU folded over the years too, and only recently started a minor foray into GPU folding, so I know how you feel about troubleshooting GPUs ... I really hope one of the GPU gurus latches onto this topic.
I asked about "always" to see whether this might be a config/driver issue that has been with the system for a while, as opposed to something that changed recently, since that can speed up the diagnosis. I've seen this type of question asked by others diagnosing similar issues, so I asked up front so that the information is there when the GPU folders read the thread ... your PPD observations may well help them too.
Hope someone can get this sorted for you soon!
Re: Bad State detected on GPU (AMD)
Posted: Mon May 18, 2020 7:41 pm
by bruce
As far as FAH's credibility for GPUs is concerned, it will be restored when a new version of FAHCore_22 is released. A great deal has been learned from the collection of these error reports at the cost of some temporary setbacks. Ordinarily FAH attempts to collect and fix errors as a result of beta testing and then make a second pass at the remaining ones in Advanced testing but as it turns out, the Donor population for Beta and Advanced is smaller than usual so it has been necessary to distribute a percentage of the WUs to full FAH.
We'll all be pleased when the next version of FAHCore_22 is ready for release.
I think I read somewhere that the bonus score is only added if more than 80 percent of the WUs were returned successfully.
I don't remember reading that. From whom/where did you get that information?
My information is that only WUs which are completed and successfully uploaded can be used to generate the trajectory's next Gen. The token points are simply a reward for your effort.
Re: Bad State detected on GPU (AMD)
Posted: Mon May 18, 2020 7:45 pm
by Neil-B
It is one of the qualifications stated for the QRB on the website ... sorry for the brevity - on phone
Re: Bad State detected on GPU (AMD)
Posted: Mon May 18, 2020 7:47 pm
by bruce
Aha. Errors don't get bonuses and successful returns only get them as long as you maintain an overall 80% success rate.
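The 80% rule as stated here can be written down as a small check. This is a minimal sketch under stated assumptions: the function names are hypothetical, the boundary is treated as inclusive (the thread says both "more than 80 percent" and "an overall 80% success rate"), and which returns the servers actually count as failures is not settled in this thread.

```python
def success_rate(successful, failed):
    """Fraction of returned WUs that completed successfully."""
    total = successful + failed
    if total == 0:
        raise ValueError("no work units returned yet")
    return successful / total


def earns_bonus(successful, failed, minimum_rate=0.80):
    """True when the overall success rate meets the assumed 80% minimum."""
    return success_rate(successful, failed) >= minimum_rate


print(earns_bonus(99, 1))  # 99% success rate
print(earns_bonus(7, 3))   # 70% success rate
```

The official FAQ entry linked later in the thread lists the other QRB qualifications, which this sketch does not model.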
Re: Bad State detected on GPU (AMD)
Posted: Mon May 18, 2020 7:58 pm
by 4n0n
bruce wrote:We'll all be pleased when the next version of FAHCore_22 is ready for release.
Where did you get your information? Is there any public release plan or an estimate of the timing?
bruce wrote:I don't remember reading that. From whom/where did you get that information?
In addition to Neil-B's answer, here is the source:
https://foldingathome.org/support/faq/p ... or-the-qrb
Re: Bad State detected on GPU (AMD)
Posted: Mon May 18, 2020 8:47 pm
by PantherX
4n0n wrote:...Where did you get your information? Is there any public release plan or an estimate of the timing?...
From the researcher:
JohnChodera wrote:We've had to checkpoint these WUs every 25% due to some limitations in the core, but we're working to remedy those ASAP in a forthcoming core release so we can checkpoint closer to 5%.
Thanks for bearing with us!
~ John Chodera // MSKCC
viewtopic.php?f=19&t=35175&p=333835#p333835
Please note that there's no ETA or timeline. It will be released to the public after the Beta team has tested it, whenever it is made available to them.
Re: Bad State detected on GPU (AMD)
Posted: Tue May 19, 2020 2:14 am
by bruce
4n0n wrote:Is there any public release plan or an estimate of the timing?
FAH never pre-announces a release date. (We don't have a sales department that makes predictions of when new features will be available.) The only factual answers are "When it's ready" and, more commonly, "soon".
Re: Bad State detected on GPU (AMD)
Posted: Tue May 19, 2020 5:02 am
by PantherX
FYI, the timeline that the F@H Project uses is this:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time
You can vaguely map it to something like this:
Public ↞ Beta ↔ Internal ↔ In Development ↔ Thinking/Planning ↠ Backlog
Hence, "soon" maps to "In Development", which means it is only 3 stages away from Full release.