Neil-B wrote: You appear to be getting a similar type of error from a wide variety of Project WUs at quite high failure rates which other folders are completing - regrettably that probably means an issue with your kit/the way it is configured, not with the Project WUs ... Tracking down these types of issues can be tricky - but it is worth persevering.
I'm willing to help track down these issues, but I'll need advice on what to do or try.
For what it is worth: I also think that the PPD on my GPU does not reflect the performance I would expect from such relatively new hardware. I make about 140k-200k GPU PPD, while older NVIDIA cards are said to make 1.2M to 2.4M GPU PPD.
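To put that gap into numbers, here is a quick back-of-the-envelope calculation using only the PPD figures quoted above. The variable names are just for illustration, and the NVIDIA range is hearsay, not something I measured.

```python
# PPD figures as quoted in the post (points per day).
my_ppd = (140_000, 200_000)                # my RX 5500 XT, low and high
older_nvidia_ppd = (1_200_000, 2_400_000)  # "are said to make", not measured

# Smallest gap: their low end against my high end.
smallest_gap = older_nvidia_ppd[0] / my_ppd[1]
# Largest gap: their high end against my low end.
largest_gap = older_nvidia_ppd[1] / my_ppd[0]

print(f"Gap: {smallest_gap:.1f}x to {largest_gap:.1f}x")
```

So even in the most favourable comparison the card would be several times slower than those figures.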
Neil-B wrote: Couple of questions ... Have you always had these issues or have they started to happen more recently? ... What GPU are you running (couldn't spot this config in any of the posted logs)?
I have been folding for 13 years now. That is a long time, with many platforms and different hardware, but all without a GPU, and also without trouble. Some years ago I stopped folding, but I reactivated my machines in March 2020 to fight COVID-19. This was the first time I had a GPU-equipped machine in my hands, and I wanted to use it for folding too. The NaN errors mentioned in this thread were present from the very beginning of folding on that particular GPU machine (March 2020). Decide for yourself whether this is "always" or "more recently".
The machine in question runs Linux Mint 19.3 with an AMD Ryzen 5 3600, an AMD Radeon RX 5500 XT, and the original proprietary GPU drivers from the official AMD site.
Are you looking for these lines?
Code:
19:05:43:****************************** FAHClient ******************************
19:05:43: Version: 7.6.9
19:05:43: Author: Joseph Coffland <[email protected]>
19:05:43: Copyright: 2020 foldingathome.org
19:05:43: Homepage: https://foldingathome.org/
19:05:43: Date: Apr 17 2020
19:05:43: Time: 18:11:26
19:05:43: Revision: 398c2b17fa535e0cc6c9d10856b2154c32771646
19:05:43: Branch: master
19:05:43: Compiler: GNU 8.3.0
19:05:43: Options: -std=c++11 -ffunction-sections -fdata-sections -O3
19:05:43: -funroll-loops -fno-pie
19:05:43: Platform: linux2 4.19.0-5-amd64
19:05:43: Bits: 64
19:05:43: Mode: Release
19:05:43: Args: --child /etc/fahclient/config.xml
19:05:43: --pid-file=/var/run/fahclient/fahclient.pid --daemon
19:05:43: Config: /etc/fahclient/config.xml
19:05:43:******************************** CBang ********************************
19:05:43: Date: Apr 17 2020
19:05:43: Time: 18:10:13
19:05:43:Started thread 1 on PID 26178
19:05:43: Revision: 2fb0be7809c5e45287a122ca5fbc15b5ae859a3b
19:05:43: Branch: master
19:05:43: Compiler: GNU 8.3.0
19:05:43: Options: -std=c++11 -ffunction-sections -fdata-sections -O3
19:05:43: -funroll-loops -fno-pie -fPIC
19:05:43: Platform: linux2 4.19.0-5-amd64
19:05:43: Bits: 64
19:05:43: Mode: Release
19:05:43:******************************* System ********************************
19:05:43: CPU: AMD Ryzen 5 3600 6-Core Processor
19:05:43: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
19:05:43: CPUs: 12
19:05:43: Memory: 31.37GiB
19:05:43: Free Memory: 6.98GiB
19:05:43: Threads: POSIX_THREADS
19:05:43: OS Version: 5.6
19:05:43: Has Battery: false
19:05:43: On Battery: false
19:05:43: UTC Offset: 2
19:05:43: PID: 26178
19:05:43: CWD: /var/lib/fahclient
19:05:43: OS: Linux 5.6.6-050606-generic x86_64
19:05:43: OS Arch: AMD64
19:05:43: GPUs: 1
19:05:43: GPU 0: Bus:40 Slot:0 Func:0 AMD:6 Navi 14 [Radeon RX 5500/5500M / Pro
19:05:43: 5500M]
19:05:43: CUDA: Not detected: Failed to open dynamic library 'libcuda.so':
19:05:43: libcuda.so: cannot open shared object file: No such file or
19:05:43: directory
19:05:43:OpenCL Device 0: Platform:0 Device:0 Bus:40 Slot:0 Compute:2.0 Driver:3075.10
19:05:43:******************************* libFAH ********************************
19:05:43: Date: Apr 15 2020
19:05:43: Time: 21:43:24
19:05:43: Revision: 216968bc7025029c841ed6e36e81a03a316890d3
19:05:43: Branch: master
19:05:43: Compiler: GNU 8.3.0
19:05:43: Options: -std=c++11 -ffunction-sections -fdata-sections -O3
19:05:43: -funroll-loops -fno-pie
19:05:43: Platform: linux2 4.19.0-5-amd64
19:05:43: Bits: 64
19:05:43: Mode: Release
19:05:43:***********************************************************************
19:05:43:<config>
19:05:43: <!-- Client Control -->
19:05:43: <client-threads v='6'/>
19:05:43: <cycle-rate v='4'/>
19:05:43: <cycles v='-1'/>
19:05:43: <disable-sleep-when-active v='true'/>
19:05:43: <exit-when-done v='false'/>
19:05:43: <fold-anon v='true'/>
19:05:43: <idle-seconds v='300'/>
19:05:43: <open-web-control v='false'/>
19:05:43:
19:05:43: <!-- Configuration -->
19:05:43: <config-rotate v='true'/>
19:05:43: <config-rotate-dir v='configs'/>
19:05:43: <config-rotate-max v='16'/>
19:05:43:
19:05:43: <!-- Debugging -->
19:05:43: <assignment-servers>
19:05:43: assign1.foldingathome.org assign2.foldingathome.org assign3.foldingathome.org assign4.foldingathome.org
19:05:43: </assignment-servers>
19:05:43: <auth-as v='true'/>
19:05:43: <capture-directory v='capture'/>
19:05:43: <capture-on-error v='false'/>
19:05:43: <capture-packets v='false'/>
19:05:43: <capture-requests v='false'/>
19:05:43: <capture-responses v='false'/>
19:05:43: <capture-sockets v='false'/>
19:05:43: <debug-sockets v='false'/>
19:05:43: <exception-locations v='true'/>
19:05:43: <stack-traces v='false'/>
19:05:43:
19:05:43: <!-- Error Handling -->
19:05:43: <max-slot-errors v='10'/>
19:05:43: <max-unit-errors v='5'/>
19:05:43:
19:05:43: <!-- Folding Core -->
19:05:43: <checkpoint v='15'/>
19:05:43: <core-priority v='idle'/>
19:05:43: <cpu-usage v='100'/>
19:05:43: <gpu-usage v='100'/>
19:05:43: <no-assembly v='false'/>
19:05:43:
19:05:43: <!-- Folding Slot Configuration -->
19:05:43: <cause v='COVID_19'/>
19:05:43: <client-subtype v='LINUX'/>
19:05:43: <client-type v='normal'/>
19:05:43: <cpu-species v='X86_AMD'/>
19:05:43: <cpu-type v='AMD64'/>
19:05:43: <cpus v='-1'/>
19:05:43: <disable-viz v='false'/>
19:05:43: <gpu v='true'/>
19:05:43: <max-packet-size v='normal'/>
19:05:43: <os-species v='UNKNOWN'/>
19:05:43: <os-type v='LINUX'/>
19:05:43: <project-key v='0'/>
19:05:43: <smp v='true'/>
19:05:43:
19:05:43: <!-- GUI -->
19:05:43: <gui-enabled v='true'/>
19:05:43:
19:05:43: <!-- HTTP Server -->
19:05:43: <allow v='127.0.0.1 192.168.10.0/24'/>
19:05:43: <connection-timeout v='60'/>
19:05:43: <deny v='0/0'/>
19:05:43: <http-addresses v='0:7396'/>
19:05:43: <https-addresses v=''/>
19:05:43: <max-connect-time v='900'/>
19:05:43: <max-connections v='800'/>
19:05:43: <max-request-length v='52428800'/>
19:05:43: <min-connect-time v='300'/>
19:05:43:
19:05:43: <!-- Logging -->
19:05:43: <log v='log.txt'/>
19:05:43: <log-color v='true'/>
19:05:43: <log-crlf v='false'/>
19:05:43: <log-date v='false'/>
19:05:43: <log-date-periodically v='21600'/>
19:05:43: <log-domain v='false'/>
19:05:43: <log-header v='true'/>
19:05:43: <log-level v='true'/>
19:05:43: <log-no-info-header v='true'/>
19:05:43: <log-redirect v='false'/>
19:05:43: <log-rotate v='true'/>
19:05:43: <log-rotate-dir v='logs'/>
19:05:43: <log-rotate-max v='16'/>
19:05:43: <log-short-level v='false'/>
19:05:43: <log-simple-domains v='true'/>
19:05:43: <log-thread-id v='false'/>
19:05:43: <log-thread-prefix v='true'/>
19:05:43: <log-time v='true'/>
19:05:43: <log-to-screen v='true'/>
19:05:43: <log-truncate v='false'/>
19:05:43: <verbosity v='5'/>
19:05:43:
19:05:43: <!-- Process Control -->
19:05:43: <child v='true'/>
19:05:43: <daemon v='true'/>
19:05:43: <fork v='false'/>
19:05:43: <pid v='false'/>
19:05:43: <pid-file v='/var/run/fahclient/fahclient.pid'/>
19:05:43: <respawn v='false'/>
19:05:43: <service v='false'/>
19:05:43:
19:05:43: <!-- Slot Control -->
19:05:43: <idle v='false'/>
19:05:43: <max-shutdown-wait v='60'/>
19:05:43: <pause-on-battery v='true'/>
19:05:43: <pause-on-start v='false'/>
19:05:43: <paused v='false'/>
19:05:43: <power v='medium'/>
19:05:43:
19:05:43: <!-- Work Unit Control -->
19:05:43: <dump-after-deadline v='true'/>
19:05:43: <max-queue v='16'/>
19:05:43: <max-units v='0'/>
19:05:43: <next-unit-percentage v='99'/>
19:05:43: <stall-detection-enabled v='false'/>
19:05:43: <stall-percent v='5'/>
19:05:43: <stall-timeout v='1800'/>
19:05:43:
19:05:43: <!-- Folding Slots -->
19:05:43: <slot id='0' type='CPU'>
19:05:43: <cpus v='12'/>
19:05:43: <paused v='true'/>
19:05:43: </slot>
19:05:43: <slot id='1' type='GPU'>
19:05:43: <paused v='true'/>
19:05:43: </slot>
19:05:43:</config>
19:05:43:Trying to access database...
19:05:43:Successfully acquired database lock
19:05:43:Enabled folding slot 00: PAUSED cpu:12 (by user)
19:05:43:Enabled folding slot 01: PAUSED gpu:0:Navi 14 [Radeon RX 5500/5500M / Pro 5500M] (by user)
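If it helps anyone looking through a log like this, the GPU-related lines can be picked out with a few lines of Python. This is only a rough sketch: the regular expression is my own guess based on the one "OpenCL Device" line above, and `summarize_gpu_lines` is a made-up helper name, not part of FAHClient.

```python
import re

# Pattern modelled on the single "OpenCL Device" line in the log above.
# Other client versions may format this line differently (assumption).
OPENCL_RE = re.compile(
    r"OpenCL Device (\d+): Platform:(\d+) Device:(\d+) Bus:(\d+) Slot:(\d+) "
    r"Compute:(\S+) Driver:(\S+)"
)

def summarize_gpu_lines(log_text):
    """Return the OpenCL devices found and whether CUDA was reported missing."""
    opencl_devices = [m.groups() for m in OPENCL_RE.finditer(log_text)]
    cuda_missing = "CUDA: Not detected" in log_text
    return {"opencl_devices": opencl_devices, "cuda_missing": cuda_missing}

# Two lines copied from the log above.
sample = (
    "19:05:43: CUDA: Not detected: Failed to open dynamic library 'libcuda.so':\n"
    "19:05:43:OpenCL Device 0: Platform:0 Device:0 Bus:40 Slot:0 "
    "Compute:2.0 Driver:3075.10\n"
)
info = summarize_gpu_lines(sample)
print(info)
```

On an AMD card the missing CUDA library is expected; the OpenCL line is the one that matters here.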
I just tried FAHBench-cmd, with the following result. It also shows an error:
Code:
FAHBench Simulation
-------------------
Plugin directory: "/usr/lib/openmm"
Work unit: dhfr
WU Name: Dihydrofolate reductase
WU Description: A common system for benchmarking molecular dynamics
System XML: /usr/share/fahbench/workunits/dhfr/system.xml
Integrator XML: /usr/share/fahbench/workunits/dhfr/integrator.xml
State XML: /usr/share/fahbench/workunits/dhfr/state.xml
Step chunk: 40
Device ID 0; Platform OpenCL; Platform ID 0
Run length: 60s
Loading plugins from plugin directory
Number of registered plugins: 3
Deserializing input files: system
Deserializing input files: state
Deserializing input files: integrator
Creating context (may take several minutes)
Checking accuracy against reference code
Creating reference context (may take several minutes)
Comparing forces and energy
Something went wrong:
Force RMSE error of 27153.7 with threshold of 5
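For readers wondering what a "force RMSE" check of this kind means, here is a minimal sketch of the general idea: compare the per-atom forces from the GPU platform with those from a reference calculation and take the root-mean-square error. I do not know how FAHBench computes it internally, so the function `force_rmse` and the sample numbers are purely illustrative; only the threshold of 5 comes from the output above.

```python
import math

def force_rmse(forces_a, forces_b):
    """Root-mean-square error between two lists of (x, y, z) force vectors."""
    if not forces_a or len(forces_a) != len(forces_b):
        raise ValueError("need two non-empty force lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(forces_a, forces_b):
        # Squared length of the difference vector for this atom.
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(forces_a))

THRESHOLD = 5.0  # threshold reported in the FAHBench output above

# Made-up forces for two atoms, for illustration only.
reference = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
gpu_good = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
gpu_bad = [(31.0, 40.0, 0.0), (0.0, 2.0, 0.0)]

for name, forces in (("good", gpu_good), ("bad", gpu_bad)):
    err = force_rmse(reference, forces)
    status = "ok" if err <= THRESHOLD else "exceeds threshold"
    print(f"{name}: RMSE {err:.2f} ({status})")
```

Under that reading, an error of 27153.7 against a threshold of 5 would mean the GPU forces are far from the reference values, not just slightly off.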
4n0n wrote: Are the returned WUs considered "successfully returned" in my case?
Thanks for the link. My bonus stats imply that WUs are only classified as either "returned in time" or "timed out". None of my failed WUs seem to be counted as "timed out", so they must have been either counted as "returned in time" or not counted at all. Either way, they have no impact on my bonus stats, and I am fine with 99.xy percent.