Re: 16600 consistently crashing on AMD Radeon VII
Posted: Mon Aug 17, 2020 1:38 am
by UofM.MartinK
OK, now I have the data of the last 85 WUs processed by my RX580.
TL;DR: the card does not have "two states". The failures are intrinsic to the WUs, as other posters already mentioned, and driver & clock speeds have no significant effect.
My original suspicion - that the card has a "working" and a "non-working" state - was a typical "data artifact" caused by too small an N plus some clustering. With p16600 WUs at the "fringes" of a string of working p13421 WUs, it looked like some randomly successful p16600 might have run under the same GPU "fingerprint" (Power/Temp/Vdd/Clocks) as the p13421. When I matched the times precisely, this turned out to be false.
Still, the clusters are very unlikely to be random; the round-robin behavior of assignments (and re-assignments of faulty units) can be blamed for that.
Here is the basic breakdown:
p16600: 38 WUs, 4 completed, the other 34 failed. All 38 had NaN exceptions (distributed statistically over the processing time); median time to fail: 3489 seconds. The 4 that completed did so purely by probability ("luck") after 4-5 hours without hitting the project-internal retry limit, but they too had at least two NaN exceptions each and resumed from a checkpoint.
p13421: 37 WUs, 7 completed. The other 30 were discarded after 9-17 seconds due to '0x22:ERROR:NaNs detected in forces. 0 0'.
p13423: 9 WUs, 1 completed. The other 8 were discarded as above, after 9-17 seconds, due to '0x22:ERROR:NaNs detected in forces. 0 0'.
p16920: 1 WU, successfully completed.
Summary: In the last 48 hours alone, my RX580 (and most similar AMD cards, I assume) spent ~14 hours on useful computations, or ~18 hours if the one "completed" p16600 in this time window is actually useful (which I doubt; it might actually weaken the project results). That's 60-70% wasted time and energy. And this has been going on since at least August 3rd.
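For reference, the 60-70% figure follows directly from the hours quoted above. A minimal Python sketch of that arithmetic (the function and constant names are illustrative only, not part of any FAH tool):

```python
# Figures taken from the post: a 48-hour window with ~14 h of useful work,
# or ~18 h if the one completed p16600 WU is counted as useful.
WINDOW_HOURS = 48


def wasted_fraction(useful_hours, window_hours=WINDOW_HOURS):
    """Return the share of the window that produced no useful result."""
    return (window_hours - useful_hours) / window_hours


if __name__ == "__main__":
    for useful in (14, 18):
        print(f"{useful} h useful -> {wasted_fraction(useful):.1%} wasted")
```

With 14 useful hours this gives about 70.8% wasted, with 18 useful hours 62.5%.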
Update: I now run a script which tracks the log in real time, and if a "0x22:An exception occurred at step XXX: Particle coordinate is nan" is found, it dumps that WU. (Since FAHClient --dump doesn't seem to be able to communicate with a running client, the script instead pauses the slot, deletes the corresponding work folder, and then un-pauses the slot.)
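The script itself was not posted. As a rough sketch of just the detection step, something like the following Python would do, assuming log lines carry the "HH:MM:SS:WUxx:FSxx:0x22:" prefix seen in FAHClient logs. The names `find_nan_exception`, `watch` and the `on_nan` callback are hypothetical; the pause / delete work folder / un-pause actions are deliberately left to the callback because the exact commands are not given here.

```python
import re

# Matches e.g.
# "20:54:07:WU01:FS01:0x22:An exception occurred at step 140057: Particle coordinate is nan"
NAN_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WU(?P<wu>\d+):FS(?P<slot>\d+):0x22:"
    r"An exception occurred at step (?P<step>\d+): Particle coordinate is nan"
)


def find_nan_exception(line):
    """Return WU id, slot id and step for a NaN exception line, else None."""
    match = NAN_LINE.match(line)
    if match is None:
        return None
    return {"wu": match["wu"], "slot": match["slot"], "step": int(match["step"])}


def watch(lines, on_nan):
    """Feed log lines (any iterable, e.g. an open file) to a callback on each hit.

    on_nan is a hypothetical hook where the slot would be paused,
    the work folder removed, and the slot un-paused.
    """
    for line in lines:
        hit = find_nan_exception(line.rstrip("\n"))
        if hit is not None:
            on_nan(hit)
```

Passing an open log file as `lines` processes what is already in the file; following a growing log in real time would need an extra tail-style loop, which is omitted here.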
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Mon Aug 17, 2020 5:43 am
by bruce
I'm surprised that --dump <n> doesn't work, but I don't use it myself. You have to know which WU needs to be dumped (if you run more than one slot). Please explain how you tested it and what happened, along with the client version number.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Mon Aug 17, 2020 6:14 am
by UofM.MartinK
I couldn't figure out how to give a running client the --dump command with 7.6.13 under Linux. It doesn't behave like --send-pause, --send-unpause etc., and preceding it with --send-command didn't help either. Using "help" while directly connected to the client via "nc localhost 36330" didn't reveal any dump command either.
It only seems to work if the client is stopped, then FAHClient --dump <WU> is run exclusively, and then the client re-started.
This disturbs the other slots, and re-starts the log file etc, so I chose the other solution for now.
I would prefer a variant which properly communicates back to the server that the WU was dumped, though.
A related feature request:
https://github.com/FoldingAtHome/fah-issues/issues/1547
Although, reading the "sibling" bug report:
https://github.com/FoldingAtHome/fah-issues/issues/1549
It seems that whichever way I currently dump, it's a problem: either it wrongly counts as a "faulty" WU (which it would become an hour later anyway, so it would still be the better way if it were available as a command to a running client), or the WU has to wait for its timeout to be reassigned. A classic lose-lose situation.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Tue Aug 18, 2020 11:18 pm
by UofM.MartinK
I just noticed that the latest WU, project:16600 run:0 clone:1430 gen:235, allows more than 3 restarts: it has done 8 so far and is about to complete on my RX580!
I gather that this means a WU finished this way is still useful, after all?
If that can be confirmed, I will not dump 16600 anymore.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Wed Aug 19, 2020 3:47 am
by bruce
If the WU is crashing because of an overclocked GPU, there's nothing FAH can do about it except to ask folks not to overclock. If it's crashing because of defective hardware, we can admonish you to RMA the hardware. If it's crashing because of a defect in the driver, we can ask you to convince the manufacturer to build good drivers.
If there's a defect in the FAHCore or in the construction of the WU, there's an ongoing project to collect the error reports returned from errors like yours and fix the associated problem(s). FAH does pay attention to those error reports and through them, science can do a better job in the future although I can't promise the fix will be rolled out soon enough to satisfy the folks making the reports.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Wed Aug 19, 2020 4:55 am
by UofM.MartinK
Bruce, I appreciate your very valuable contributions, but in this case it's clear that many AMD models trash 16600 WUs. It has nothing to do with overclocking or individual hardware issues, and probably not even the driver (version) is at fault - it happens across all drivers and operating systems.
It's actually confirmed that there is something like an "incompatibility" of this project with many AMD models. According to SlinkyDolphinClock on Discord the day before yesterday, it might have slipped through Beta & Advanced testing because it was tested on an old FAH core - that's at least one hypothesis very actively discussed on Slack with the lead developers, and some patch is in the works.
Back to business: my previous post stated the observation that I have now encountered at least one p16600 WU with a significantly higher internal limit before "Max number of attempts to resume from last checkpoint reached." (usually 3, but project:16600 run:0 clone:1430 gen:235 resumed 8 times and finally completed). Later p16600 WUs had the old internal restart limit of 3 again and were thus sent back "faulty", because they only made it to 15%, 32% or 42% before the "resumes" were used up.
bruce wrote:
If there's a defect in the FAHCore or in the construction of the WU, there's an ongoing project to collect the error reports returned from errors like yours and fix the associated problem(s). FAH does pay attention to those error reports and through them, science can do a better job in the future although I can't promise the fix will be rolled out soon enough to satisfy the folks making the reports.
Well, that seems to apply in this case. But this has been going on since August 3rd, so yes, not very satisfying - and "soon enough" is in the eye of the beholder.
Now all I want to know is whether that was a deliberate change to let these "problematic" AMD GPU models complete p16600 WUs (perhaps because they serve some sort of purpose after all?) or if this was just a fluke and there is no value in processing them with an AMD card.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Wed Aug 19, 2020 5:18 am
by Nuitari
UofM.MartinK wrote:
Code:
grep -h -e '^\*' -e 'project:16600' -e 'project:13421' logs/* log.txt
I did the grep. The forum has a limit of 60000 characters, so I put it in this gist
https://gist.github.com/Nuitari/1306a2a ... 6ecaad7f6e
Rig 1 (5x RX570, 1x RX560, 1x Carrizo APU)
Rig 2 (1x RX560 "OC version", 3x RX570)
Rig 3 (1x NVIDIA 660)
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Wed Aug 19, 2020 6:06 am
by n_w95482
Another one here having issues with 16600 on an AMD GPU. I'm running a Sapphire Pulse RX 580 8 GB in my home theater PC, underclocked to RX 480 levels (-7%). Here's a tally of the WUs it's worked on in the last two weeks:
13421: 72 finished, 1 failed
13423: 13 finished, 0 failed
16600: 10 finished, 66 failed
16920: 1 finished, 0 failed
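To put that tally into failure rates, here is a small Python sketch using only the numbers listed above (the dictionary layout and function name are illustrative):

```python
# (finished, failed) per project, exactly as tallied in the post
tally = {
    "13421": (72, 1),
    "13423": (13, 0),
    "16600": (10, 66),
    "16920": (1, 0),
}


def failure_rate(finished, failed):
    """Fraction of WUs that failed; 0.0 when there were no WUs at all."""
    total = finished + failed
    return failed / total if total else 0.0


if __name__ == "__main__":
    for project, (finished, failed) in tally.items():
        rate = failure_rate(finished, failed)
        print(f"p{project}: {failed} of {finished + failed} failed ({rate:.1%})")
```

For 16600 that is 66 of 76 WUs, roughly 87% failed, against 1 of 73 (about 1.4%) for 13421.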
Here's the log of one that failed this afternoon:
Code:
20:12:44:WU01:FS01:Starting
20:12:44:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Nick\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 9488 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
20:12:44:WU01:FS01:Started FahCore on PID 11412
20:12:44:WU01:FS01:Core PID:10468
20:12:44:WU01:FS01:FahCore 0x22 started
20:12:45:WU01:FS01:0x22:*********************** Log Started 2020-08-18T20:12:44Z ***********************
20:12:45:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
20:12:45:WU01:FS01:0x22: Core: Core22
20:12:45:WU01:FS01:0x22: Type: 0x22
20:12:45:WU01:FS01:0x22: Version: 0.0.11
20:12:45:WU01:FS01:0x22: Author: Joseph Coffland <[email protected]>
20:12:45:WU01:FS01:0x22: Copyright: 2020 foldingathome.org
20:12:45:WU01:FS01:0x22: Homepage: https://foldingathome.org/
20:12:45:WU01:FS01:0x22: Date: Jun 26 2020
20:12:45:WU01:FS01:0x22: Time: 19:49:16
20:12:45:WU01:FS01:0x22: Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
20:12:45:WU01:FS01:0x22: Branch: core22-0.0.11
20:12:45:WU01:FS01:0x22: Compiler: Visual C++ 2015
20:12:45:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
20:12:45:WU01:FS01:0x22: Platform: win32 10
20:12:45:WU01:FS01:0x22: Bits: 64
20:12:45:WU01:FS01:0x22: Mode: Release
20:12:45:WU01:FS01:0x22:Maintainers: John Chodera <[email protected]> and Peter Eastman
20:12:45:WU01:FS01:0x22: <[email protected]>
20:12:45:WU01:FS01:0x22: Args: -dir 01 -suffix 01 -version 705 -lifeline 11412 -checkpoint 15
20:12:45:WU01:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
20:12:45:WU01:FS01:0x22:************************************ libFAH ************************************
20:12:45:WU01:FS01:0x22: Date: Jun 26 2020
20:12:45:WU01:FS01:0x22: Time: 19:47:12
20:12:45:WU01:FS01:0x22: Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
20:12:45:WU01:FS01:0x22: Branch: HEAD
20:12:45:WU01:FS01:0x22: Compiler: Visual C++ 2015
20:12:45:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
20:12:45:WU01:FS01:0x22: Platform: win32 10
20:12:45:WU01:FS01:0x22: Bits: 64
20:12:45:WU01:FS01:0x22: Mode: Release
20:12:45:WU01:FS01:0x22:************************************ CBang *************************************
20:12:45:WU01:FS01:0x22: Date: Jun 26 2020
20:12:45:WU01:FS01:0x22: Time: 19:46:11
20:12:45:WU01:FS01:0x22: Revision: f8529962055b0e7bde23e429f5072ff758089dee
20:12:45:WU01:FS01:0x22: Branch: master
20:12:45:WU01:FS01:0x22: Compiler: Visual C++ 2015
20:12:45:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
20:12:45:WU01:FS01:0x22: Platform: win32 10
20:12:45:WU01:FS01:0x22: Bits: 64
20:12:45:WU01:FS01:0x22: Mode: Release
20:12:45:WU01:FS01:0x22:************************************ System ************************************
20:12:45:WU01:FS01:0x22: CPU: AMD Ryzen 5 3600 6-Core Processor
20:12:45:WU01:FS01:0x22: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
20:12:45:WU01:FS01:0x22: CPUs: 12
20:12:45:WU01:FS01:0x22: Memory: 15.95GiB
20:12:45:WU01:FS01:0x22:Free Memory: 13.43GiB
20:12:45:WU01:FS01:0x22: Threads: WINDOWS_THREADS
20:12:45:WU01:FS01:0x22: OS Version: 6.2
20:12:45:WU01:FS01:0x22:Has Battery: false
20:12:45:WU01:FS01:0x22: On Battery: false
20:12:45:WU01:FS01:0x22: UTC Offset: -7
20:12:45:WU01:FS01:0x22: PID: 10468
20:12:45:WU01:FS01:0x22: CWD: C:\Users\Nick\AppData\Roaming\FAHClient\work
20:12:45:WU01:FS01:0x22:********************************************************************************
20:12:45:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 933, Gen 384)
20:12:45:WU01:FS01:0x22:Unit: 0x000001b08f59f36f5ec36911c061f769
20:12:45:WU01:FS01:0x22:Reading tar file core.xml
20:12:45:WU01:FS01:0x22:Reading tar file integrator.xml
20:12:45:WU01:FS01:0x22:Reading tar file state.xml
20:12:46:WU01:FS01:0x22:Reading tar file system.xml
20:12:47:WU01:FS01:0x22:Digital signatures verified
20:12:47:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
20:12:47:WU01:FS01:0x22:Version 0.0.11
20:12:47:WU01:FS01:0x22: Checkpoint write interval: 25000 steps (5%) [20 total]
20:12:47:WU01:FS01:0x22: JSON viewer frame write interval: 5000 steps (1%) [100 total]
20:12:47:WU01:FS01:0x22: XTC frame write interval: 20000 steps (4%) [25 total]
20:12:47:WU01:FS01:0x22: Global context and integrator variables write interval: disabled
20:13:05:WU01:FS01:0x22:Completed 0 out of 500000 steps (0%)
20:14:33:WU01:FS01:0x22:Completed 5000 out of 500000 steps (1%)
20:15:59:WU01:FS01:0x22:Completed 10000 out of 500000 steps (2%)
20:17:25:WU01:FS01:0x22:Completed 15000 out of 500000 steps (3%)
20:18:51:WU01:FS01:0x22:Completed 20000 out of 500000 steps (4%)
20:20:17:WU01:FS01:0x22:Completed 25000 out of 500000 steps (5%)
20:21:45:WU01:FS01:0x22:Completed 30000 out of 500000 steps (6%)
20:23:12:WU01:FS01:0x22:Completed 35000 out of 500000 steps (7%)
20:24:38:WU01:FS01:0x22:Completed 40000 out of 500000 steps (8%)
20:26:05:WU01:FS01:0x22:Completed 45000 out of 500000 steps (9%)
20:27:31:WU01:FS01:0x22:Completed 50000 out of 500000 steps (10%)
20:28:58:WU01:FS01:0x22:Completed 55000 out of 500000 steps (11%)
20:30:25:WU01:FS01:0x22:Completed 60000 out of 500000 steps (12%)
20:31:51:WU01:FS01:0x22:Completed 65000 out of 500000 steps (13%)
20:33:18:WU01:FS01:0x22:Completed 70000 out of 500000 steps (14%)
20:34:44:WU01:FS01:0x22:Completed 75000 out of 500000 steps (15%)
20:36:12:WU01:FS01:0x22:Completed 80000 out of 500000 steps (16%)
20:37:38:WU01:FS01:0x22:Completed 85000 out of 500000 steps (17%)
20:39:04:WU01:FS01:0x22:Completed 90000 out of 500000 steps (18%)
20:40:31:WU01:FS01:0x22:Completed 95000 out of 500000 steps (19%)
20:41:57:WU01:FS01:0x22:Completed 100000 out of 500000 steps (20%)
20:43:25:WU01:FS01:0x22:Completed 105000 out of 500000 steps (21%)
20:44:52:WU01:FS01:0x22:Completed 110000 out of 500000 steps (22%)
20:46:18:WU01:FS01:0x22:Completed 115000 out of 500000 steps (23%)
20:47:44:WU01:FS01:0x22:Completed 120000 out of 500000 steps (24%)
20:49:11:WU01:FS01:0x22:Completed 125000 out of 500000 steps (25%)
20:50:39:WU01:FS01:0x22:Completed 130000 out of 500000 steps (26%)
20:52:05:WU01:FS01:0x22:Completed 135000 out of 500000 steps (27%)
20:53:56:WU01:FS01:0x22:Completed 140000 out of 500000 steps (28%)
20:54:07:WU01:FS01:0x22:An exception occurred at step 140057: Particle coordinate is nan
20:54:07:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
20:54:07:WU01:FS01:0x22:Folding@home Core Shutdown: CORE_RESTART
20:54:07:WARNING:WU01:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
20:54:08:WU01:FS01:Starting
20:54:08:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Nick\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 9488 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
20:54:08:WU01:FS01:Started FahCore on PID 3516
20:54:08:WU01:FS01:Core PID:8688
20:54:08:WU01:FS01:FahCore 0x22 started
20:54:08:WU01:FS01:0x22:*********************** Log Started 2020-08-18T20:54:08Z ***********************
20:54:08:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
20:54:08:WU01:FS01:0x22: Core: Core22
20:54:08:WU01:FS01:0x22: Type: 0x22
20:54:08:WU01:FS01:0x22: Version: 0.0.11
20:54:08:WU01:FS01:0x22: Author: Joseph Coffland <[email protected]>
20:54:08:WU01:FS01:0x22: Copyright: 2020 foldingathome.org
20:54:08:WU01:FS01:0x22: Homepage: https://foldingathome.org/
20:54:08:WU01:FS01:0x22: Date: Jun 26 2020
20:54:08:WU01:FS01:0x22: Time: 19:49:16
20:54:08:WU01:FS01:0x22: Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
20:54:08:WU01:FS01:0x22: Branch: core22-0.0.11
20:54:08:WU01:FS01:0x22: Compiler: Visual C++ 2015
20:54:08:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
20:54:08:WU01:FS01:0x22: Platform: win32 10
20:54:08:WU01:FS01:0x22: Bits: 64
20:54:08:WU01:FS01:0x22: Mode: Release
20:54:08:WU01:FS01:0x22:Maintainers: John Chodera <[email protected]> and Peter Eastman
20:54:08:WU01:FS01:0x22: <[email protected]>
20:54:08:WU01:FS01:0x22: Args: -dir 01 -suffix 01 -version 705 -lifeline 3516 -checkpoint 15
20:54:08:WU01:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
20:54:08:WU01:FS01:0x22:************************************ libFAH ************************************
20:54:08:WU01:FS01:0x22: Date: Jun 26 2020
20:54:08:WU01:FS01:0x22: Time: 19:47:12
20:54:08:WU01:FS01:0x22: Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
20:54:08:WU01:FS01:0x22: Branch: HEAD
20:54:08:WU01:FS01:0x22: Compiler: Visual C++ 2015
20:54:08:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
20:54:08:WU01:FS01:0x22: Platform: win32 10
20:54:08:WU01:FS01:0x22: Bits: 64
20:54:08:WU01:FS01:0x22: Mode: Release
20:54:08:WU01:FS01:0x22:************************************ CBang *************************************
20:54:08:WU01:FS01:0x22: Date: Jun 26 2020
20:54:08:WU01:FS01:0x22: Time: 19:46:11
20:54:08:WU01:FS01:0x22: Revision: f8529962055b0e7bde23e429f5072ff758089dee
20:54:08:WU01:FS01:0x22: Branch: master
20:54:08:WU01:FS01:0x22: Compiler: Visual C++ 2015
20:54:08:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
20:54:08:WU01:FS01:0x22: Platform: win32 10
20:54:08:WU01:FS01:0x22: Bits: 64
20:54:08:WU01:FS01:0x22: Mode: Release
20:54:08:WU01:FS01:0x22:************************************ System ************************************
20:54:08:WU01:FS01:0x22: CPU: AMD Ryzen 5 3600 6-Core Processor
20:54:08:WU01:FS01:0x22: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
20:54:08:WU01:FS01:0x22: CPUs: 12
20:54:08:WU01:FS01:0x22: Memory: 15.95GiB
20:54:08:WU01:FS01:0x22:Free Memory: 13.43GiB
20:54:08:WU01:FS01:0x22: Threads: WINDOWS_THREADS
20:54:08:WU01:FS01:0x22: OS Version: 6.2
20:54:08:WU01:FS01:0x22:Has Battery: false
20:54:08:WU01:FS01:0x22: On Battery: false
20:54:08:WU01:FS01:0x22: UTC Offset: -7
20:54:08:WU01:FS01:0x22: PID: 8688
20:54:08:WU01:FS01:0x22: CWD: C:\Users\Nick\AppData\Roaming\FAHClient\work
20:54:08:WU01:FS01:0x22:********************************************************************************
20:54:08:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 933, Gen 384)
20:54:08:WU01:FS01:0x22:Unit: 0x000001b08f59f36f5ec36911c061f769
20:54:08:WU01:FS01:0x22:Digital signatures verified
20:54:08:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
20:54:08:WU01:FS01:0x22:Version 0.0.11
20:54:08:WU01:FS01:0x22: Checkpoint write interval: 25000 steps (5%) [20 total]
20:54:08:WU01:FS01:0x22: JSON viewer frame write interval: 5000 steps (1%) [100 total]
20:54:08:WU01:FS01:0x22: XTC frame write interval: 20000 steps (4%) [25 total]
20:54:08:WU01:FS01:0x22: Global context and integrator variables write interval: disabled
20:54:27:WU01:FS01:0x22:Completed 125000 out of 500000 steps (25%)
20:55:53:WU01:FS01:0x22:Completed 130000 out of 500000 steps (26%)
20:57:20:WU01:FS01:0x22:Completed 135000 out of 500000 steps (27%)
20:58:47:WU01:FS01:0x22:Completed 140000 out of 500000 steps (28%)
21:00:13:WU01:FS01:0x22:Completed 145000 out of 500000 steps (29%)
21:01:39:WU01:FS01:0x22:Completed 150000 out of 500000 steps (30%)
21:03:07:WU01:FS01:0x22:Completed 155000 out of 500000 steps (31%)
21:04:06:WU01:FS01:0x22:An exception occurred at step 157627: Particle coordinate is nan
21:04:06:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
21:04:06:WU01:FS01:0x22:Folding@home Core Shutdown: CORE_RESTART
21:04:07:WARNING:WU01:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
21:04:07:WU01:FS01:Starting
21:04:07:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Nick\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 9488 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
21:04:07:WU01:FS01:Started FahCore on PID 9160
21:04:07:WU01:FS01:Core PID:5532
21:04:07:WU01:FS01:FahCore 0x22 started
21:04:08:WU01:FS01:0x22:*********************** Log Started 2020-08-18T21:04:07Z ***********************
21:04:08:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
21:04:08:WU01:FS01:0x22: Core: Core22
21:04:08:WU01:FS01:0x22: Type: 0x22
21:04:08:WU01:FS01:0x22: Version: 0.0.11
21:04:08:WU01:FS01:0x22: Author: Joseph Coffland <[email protected]>
21:04:08:WU01:FS01:0x22: Copyright: 2020 foldingathome.org
21:04:08:WU01:FS01:0x22: Homepage: https://foldingathome.org/
21:04:08:WU01:FS01:0x22: Date: Jun 26 2020
21:04:08:WU01:FS01:0x22: Time: 19:49:16
21:04:08:WU01:FS01:0x22: Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
21:04:08:WU01:FS01:0x22: Branch: core22-0.0.11
21:04:08:WU01:FS01:0x22: Compiler: Visual C++ 2015
21:04:08:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
21:04:08:WU01:FS01:0x22: Platform: win32 10
21:04:08:WU01:FS01:0x22: Bits: 64
21:04:08:WU01:FS01:0x22: Mode: Release
21:04:08:WU01:FS01:0x22:Maintainers: John Chodera <[email protected]> and Peter Eastman
21:04:08:WU01:FS01:0x22: <[email protected]>
21:04:08:WU01:FS01:0x22: Args: -dir 01 -suffix 01 -version 705 -lifeline 9160 -checkpoint 15
21:04:08:WU01:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
21:04:08:WU01:FS01:0x22:************************************ libFAH ************************************
21:04:08:WU01:FS01:0x22: Date: Jun 26 2020
21:04:08:WU01:FS01:0x22: Time: 19:47:12
21:04:08:WU01:FS01:0x22: Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
21:04:08:WU01:FS01:0x22: Branch: HEAD
21:04:08:WU01:FS01:0x22: Compiler: Visual C++ 2015
21:04:08:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
21:04:08:WU01:FS01:0x22: Platform: win32 10
21:04:08:WU01:FS01:0x22: Bits: 64
21:04:08:WU01:FS01:0x22: Mode: Release
21:04:08:WU01:FS01:0x22:************************************ CBang *************************************
21:04:08:WU01:FS01:0x22: Date: Jun 26 2020
21:04:08:WU01:FS01:0x22: Time: 19:46:11
21:04:08:WU01:FS01:0x22: Revision: f8529962055b0e7bde23e429f5072ff758089dee
21:04:08:WU01:FS01:0x22: Branch: master
21:04:08:WU01:FS01:0x22: Compiler: Visual C++ 2015
21:04:08:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
21:04:08:WU01:FS01:0x22: Platform: win32 10
21:04:08:WU01:FS01:0x22: Bits: 64
21:04:08:WU01:FS01:0x22: Mode: Release
21:04:08:WU01:FS01:0x22:************************************ System ************************************
21:04:08:WU01:FS01:0x22: CPU: AMD Ryzen 5 3600 6-Core Processor
21:04:08:WU01:FS01:0x22: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
21:04:08:WU01:FS01:0x22: CPUs: 12
21:04:08:WU01:FS01:0x22: Memory: 15.95GiB
21:04:08:WU01:FS01:0x22:Free Memory: 13.43GiB
21:04:08:WU01:FS01:0x22: Threads: WINDOWS_THREADS
21:04:08:WU01:FS01:0x22: OS Version: 6.2
21:04:08:WU01:FS01:0x22:Has Battery: false
21:04:08:WU01:FS01:0x22: On Battery: false
21:04:08:WU01:FS01:0x22: UTC Offset: -7
21:04:08:WU01:FS01:0x22: PID: 5532
21:04:08:WU01:FS01:0x22: CWD: C:\Users\Nick\AppData\Roaming\FAHClient\work
21:04:08:WU01:FS01:0x22:********************************************************************************
21:04:08:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 933, Gen 384)
21:04:08:WU01:FS01:0x22:Unit: 0x000001b08f59f36f5ec36911c061f769
21:04:08:WU01:FS01:0x22:Digital signatures verified
21:04:08:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
21:04:08:WU01:FS01:0x22:Version 0.0.11
21:04:08:WU01:FS01:0x22: Checkpoint write interval: 25000 steps (5%) [20 total]
21:04:08:WU01:FS01:0x22: JSON viewer frame write interval: 5000 steps (1%) [100 total]
21:04:08:WU01:FS01:0x22: XTC frame write interval: 20000 steps (4%) [25 total]
21:04:08:WU01:FS01:0x22: Global context and integrator variables write interval: disabled
21:04:26:WU01:FS01:0x22:Completed 150000 out of 500000 steps (30%)
21:05:53:WU01:FS01:0x22:Completed 155000 out of 500000 steps (31%)
21:07:04:WU01:FS01:0x22:An exception occurred at step 156623: Particle coordinate is nan
21:07:04:WU01:FS01:0x22:Max number of attempts to resume from last checkpoint (2) reached. Aborting.
21:07:04:WU01:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
21:07:04:WU01:FS01:0x22:Saving result file ..\logfile_01.txt
21:07:04:WU01:FS01:0x22:Saving result file science.log
21:07:04:WU01:FS01:0x22:Saving result file state.xml
21:07:07:WU01:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
21:07:07:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
21:07:07:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16600 run:0 clone:933 gen:384 core:0x22 unit:0x000001b08f59f36f5ec36911c061f769
21:07:07:WU01:FS01:Uploading 19.64MiB to 143.89.243.111
21:07:07:WU01:FS01:Connecting to 143.89.243.111:8080
After that, it worked on and successfully finished five 13423's in a row. Right now, it's working on another 16600, cranking away at 582k PPD. It's almost halfway done and has restarted the core once:
Code:
04:24:03:WU01:FS01:Starting
04:24:03:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Nick\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 9488 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
04:24:03:WU01:FS01:Started FahCore on PID 10084
04:24:03:WU01:FS01:Core PID:11064
04:24:03:WU01:FS01:FahCore 0x22 started
04:24:04:WU01:FS01:0x22:*********************** Log Started 2020-08-19T04:24:03Z ***********************
04:24:04:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
04:24:04:WU01:FS01:0x22: Core: Core22
04:24:04:WU01:FS01:0x22: Type: 0x22
04:24:04:WU01:FS01:0x22: Version: 0.0.11
04:24:04:WU01:FS01:0x22: Author: Joseph Coffland <[email protected]>
04:24:04:WU01:FS01:0x22: Copyright: 2020 foldingathome.org
04:24:04:WU01:FS01:0x22: Homepage: https://foldingathome.org/
04:24:04:WU01:FS01:0x22: Date: Jun 26 2020
04:24:04:WU01:FS01:0x22: Time: 19:49:16
04:24:04:WU01:FS01:0x22: Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
04:24:04:WU01:FS01:0x22: Branch: core22-0.0.11
04:24:04:WU01:FS01:0x22: Compiler: Visual C++ 2015
04:24:04:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
04:24:04:WU01:FS01:0x22: Platform: win32 10
04:24:04:WU01:FS01:0x22: Bits: 64
04:24:04:WU01:FS01:0x22: Mode: Release
04:24:04:WU01:FS01:0x22:Maintainers: John Chodera <[email protected]> and Peter Eastman
04:24:04:WU01:FS01:0x22: <[email protected]>
04:24:04:WU01:FS01:0x22: Args: -dir 01 -suffix 01 -version 705 -lifeline 10084 -checkpoint 15
04:24:04:WU01:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
04:24:04:WU01:FS01:0x22:************************************ libFAH ************************************
04:24:04:WU01:FS01:0x22: Date: Jun 26 2020
04:24:04:WU01:FS01:0x22: Time: 19:47:12
04:24:04:WU01:FS01:0x22: Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
04:24:04:WU01:FS01:0x22: Branch: HEAD
04:24:04:WU01:FS01:0x22: Compiler: Visual C++ 2015
04:24:04:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
04:24:04:WU01:FS01:0x22: Platform: win32 10
04:24:04:WU01:FS01:0x22: Bits: 64
04:24:04:WU01:FS01:0x22: Mode: Release
04:24:04:WU01:FS01:0x22:************************************ CBang *************************************
04:24:04:WU01:FS01:0x22: Date: Jun 26 2020
04:24:04:WU01:FS01:0x22: Time: 19:46:11
04:24:04:WU01:FS01:0x22: Revision: f8529962055b0e7bde23e429f5072ff758089dee
04:24:04:WU01:FS01:0x22: Branch: master
04:24:04:WU01:FS01:0x22: Compiler: Visual C++ 2015
04:24:04:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
04:24:04:WU01:FS01:0x22: Platform: win32 10
04:24:04:WU01:FS01:0x22: Bits: 64
04:24:04:WU01:FS01:0x22: Mode: Release
04:24:04:WU01:FS01:0x22:************************************ System ************************************
04:24:04:WU01:FS01:0x22: CPU: AMD Ryzen 5 3600 6-Core Processor
04:24:04:WU01:FS01:0x22: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
04:24:04:WU01:FS01:0x22: CPUs: 12
04:24:04:WU01:FS01:0x22: Memory: 15.95GiB
04:24:04:WU01:FS01:0x22:Free Memory: 13.25GiB
04:24:04:WU01:FS01:0x22: Threads: WINDOWS_THREADS
04:24:04:WU01:FS01:0x22: OS Version: 6.2
04:24:04:WU01:FS01:0x22:Has Battery: false
04:24:04:WU01:FS01:0x22: On Battery: false
04:24:04:WU01:FS01:0x22: UTC Offset: -7
04:24:04:WU01:FS01:0x22: PID: 11064
04:24:04:WU01:FS01:0x22: CWD: C:\Users\Nick\AppData\Roaming\FAHClient\work
04:24:04:WU01:FS01:0x22:********************************************************************************
04:24:04:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 1566, Gen 116)
04:24:04:WU01:FS01:0x22:Unit: 0x000000898f59f36f5ec36910c82d72db
04:24:04:WU01:FS01:0x22:Reading tar file core.xml
04:24:04:WU01:FS01:0x22:Reading tar file integrator.xml
04:24:04:WU01:FS01:0x22:Reading tar file state.xml
04:24:05:WU01:FS01:0x22:Reading tar file system.xml
04:24:06:WU01:FS01:0x22:Digital signatures verified
04:24:06:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
04:24:06:WU01:FS01:0x22:Version 0.0.11
04:24:06:WU01:FS01:0x22: Checkpoint write interval: 25000 steps (5%) [20 total]
04:24:06:WU01:FS01:0x22: JSON viewer frame write interval: 5000 steps (1%) [100 total]
04:24:06:WU01:FS01:0x22: XTC frame write interval: 20000 steps (4%) [25 total]
04:24:06:WU01:FS01:0x22: Global context and integrator variables write interval: disabled
04:24:24:WU01:FS01:0x22:Completed 0 out of 500000 steps (0%)
04:25:51:WU01:FS01:0x22:Completed 5000 out of 500000 steps (1%)
04:27:17:WU01:FS01:0x22:Completed 10000 out of 500000 steps (2%)
04:28:42:WU01:FS01:0x22:Completed 15000 out of 500000 steps (3%)
04:30:08:WU01:FS01:0x22:Completed 20000 out of 500000 steps (4%)
04:31:35:WU01:FS01:0x22:Completed 25000 out of 500000 steps (5%)
04:33:03:WU01:FS01:0x22:Completed 30000 out of 500000 steps (6%)
04:34:30:WU01:FS01:0x22:Completed 35000 out of 500000 steps (7%)
04:35:56:WU01:FS01:0x22:Completed 40000 out of 500000 steps (8%)
04:37:23:WU01:FS01:0x22:Completed 45000 out of 500000 steps (9%)
04:38:50:WU01:FS01:0x22:Completed 50000 out of 500000 steps (10%)
04:40:19:WU01:FS01:0x22:Completed 55000 out of 500000 steps (11%)
04:41:45:WU01:FS01:0x22:Completed 60000 out of 500000 steps (12%)
04:43:12:WU01:FS01:0x22:Completed 65000 out of 500000 steps (13%)
04:44:39:WU01:FS01:0x22:Completed 70000 out of 500000 steps (14%)
04:45:19:WU01:FS01:0x22:An exception occurred at step 72036: Particle coordinate is nan
04:45:19:WU01:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
04:45:19:WU01:FS01:0x22:Folding@home Core Shutdown: CORE_RESTART
04:45:20:WARNING:WU01:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
04:45:20:WU01:FS01:Starting
04:45:20:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Nick\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 9488 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
04:45:20:WU01:FS01:Started FahCore on PID 6468
04:45:20:WU01:FS01:Core PID:5160
04:45:20:WU01:FS01:FahCore 0x22 started
04:45:21:WU01:FS01:0x22:*********************** Log Started 2020-08-19T04:45:20Z ***********************
04:45:21:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
04:45:21:WU01:FS01:0x22: Core: Core22
04:45:21:WU01:FS01:0x22: Type: 0x22
04:45:21:WU01:FS01:0x22: Version: 0.0.11
04:45:21:WU01:FS01:0x22: Author: Joseph Coffland <[email protected]>
04:45:21:WU01:FS01:0x22: Copyright: 2020 foldingathome.org
04:45:21:WU01:FS01:0x22: Homepage: https://foldingathome.org/
04:45:21:WU01:FS01:0x22: Date: Jun 26 2020
04:45:21:WU01:FS01:0x22: Time: 19:49:16
04:45:21:WU01:FS01:0x22: Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
04:45:21:WU01:FS01:0x22: Branch: core22-0.0.11
04:45:21:WU01:FS01:0x22: Compiler: Visual C++ 2015
04:45:21:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
04:45:21:WU01:FS01:0x22: Platform: win32 10
04:45:21:WU01:FS01:0x22: Bits: 64
04:45:21:WU01:FS01:0x22: Mode: Release
04:45:21:WU01:FS01:0x22:Maintainers: John Chodera <[email protected]> and Peter Eastman
04:45:21:WU01:FS01:0x22: <[email protected]>
04:45:21:WU01:FS01:0x22: Args: -dir 01 -suffix 01 -version 705 -lifeline 6468 -checkpoint 15
04:45:21:WU01:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
04:45:21:WU01:FS01:0x22:************************************ libFAH ************************************
04:45:21:WU01:FS01:0x22: Date: Jun 26 2020
04:45:21:WU01:FS01:0x22: Time: 19:47:12
04:45:21:WU01:FS01:0x22: Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
04:45:21:WU01:FS01:0x22: Branch: HEAD
04:45:21:WU01:FS01:0x22: Compiler: Visual C++ 2015
04:45:21:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
04:45:21:WU01:FS01:0x22: Platform: win32 10
04:45:21:WU01:FS01:0x22: Bits: 64
04:45:21:WU01:FS01:0x22: Mode: Release
04:45:21:WU01:FS01:0x22:************************************ CBang *************************************
04:45:21:WU01:FS01:0x22: Date: Jun 26 2020
04:45:21:WU01:FS01:0x22: Time: 19:46:11
04:45:21:WU01:FS01:0x22: Revision: f8529962055b0e7bde23e429f5072ff758089dee
04:45:21:WU01:FS01:0x22: Branch: master
04:45:21:WU01:FS01:0x22: Compiler: Visual C++ 2015
04:45:21:WU01:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
04:45:21:WU01:FS01:0x22: Platform: win32 10
04:45:21:WU01:FS01:0x22: Bits: 64
04:45:21:WU01:FS01:0x22: Mode: Release
04:45:21:WU01:FS01:0x22:************************************ System ************************************
04:45:21:WU01:FS01:0x22: CPU: AMD Ryzen 5 3600 6-Core Processor
04:45:21:WU01:FS01:0x22: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
04:45:21:WU01:FS01:0x22: CPUs: 12
04:45:21:WU01:FS01:0x22: Memory: 15.95GiB
04:45:21:WU01:FS01:0x22:Free Memory: 13.38GiB
04:45:21:WU01:FS01:0x22: Threads: WINDOWS_THREADS
04:45:21:WU01:FS01:0x22: OS Version: 6.2
04:45:21:WU01:FS01:0x22:Has Battery: false
04:45:21:WU01:FS01:0x22: On Battery: false
04:45:21:WU01:FS01:0x22: UTC Offset: -7
04:45:21:WU01:FS01:0x22: PID: 5160
04:45:21:WU01:FS01:0x22: CWD: C:\Users\Nick\AppData\Roaming\FAHClient\work
04:45:21:WU01:FS01:0x22:********************************************************************************
04:45:21:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 1566, Gen 116)
04:45:21:WU01:FS01:0x22:Unit: 0x000000898f59f36f5ec36910c82d72db
04:45:21:WU01:FS01:0x22:Digital signatures verified
04:45:21:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
04:45:21:WU01:FS01:0x22:Version 0.0.11
04:45:21:WU01:FS01:0x22: Checkpoint write interval: 25000 steps (5%) [20 total]
04:45:21:WU01:FS01:0x22: JSON viewer frame write interval: 5000 steps (1%) [100 total]
04:45:21:WU01:FS01:0x22: XTC frame write interval: 20000 steps (4%) [25 total]
04:45:21:WU01:FS01:0x22: Global context and integrator variables write interval: disabled
04:45:39:WU01:FS01:0x22:Completed 50000 out of 500000 steps (10%)
04:47:06:WU01:FS01:0x22:Completed 55000 out of 500000 steps (11%)
04:48:32:WU01:FS01:0x22:Completed 60000 out of 500000 steps (12%)
04:49:59:WU01:FS01:0x22:Completed 65000 out of 500000 steps (13%)
04:51:25:WU01:FS01:0x22:Completed 70000 out of 500000 steps (14%)
04:52:51:WU01:FS01:0x22:Completed 75000 out of 500000 steps (15%)
04:54:19:WU01:FS01:0x22:Completed 80000 out of 500000 steps (16%)
04:55:45:WU01:FS01:0x22:Completed 85000 out of 500000 steps (17%)
04:57:12:WU01:FS01:0x22:Completed 90000 out of 500000 steps (18%)
04:58:38:WU01:FS01:0x22:Completed 95000 out of 500000 steps (19%)
05:00:05:WU01:FS01:0x22:Completed 100000 out of 500000 steps (20%)
05:01:33:WU01:FS01:0x22:Completed 105000 out of 500000 steps (21%)
05:02:59:WU01:FS01:0x22:Completed 110000 out of 500000 steps (22%)
05:04:26:WU01:FS01:0x22:Completed 115000 out of 500000 steps (23%)
05:05:52:WU01:FS01:0x22:Completed 120000 out of 500000 steps (24%)
05:07:19:WU01:FS01:0x22:Completed 125000 out of 500000 steps (25%)
05:08:47:WU01:FS01:0x22:Completed 130000 out of 500000 steps (26%)
05:10:14:WU01:FS01:0x22:Completed 135000 out of 500000 steps (27%)
05:11:40:WU01:FS01:0x22:Completed 140000 out of 500000 steps (28%)
05:13:07:WU01:FS01:0x22:Completed 145000 out of 500000 steps (29%)
05:14:34:WU01:FS01:0x22:Completed 150000 out of 500000 steps (30%)
05:16:03:WU01:FS01:0x22:Completed 155000 out of 500000 steps (31%)
05:17:29:WU01:FS01:0x22:Completed 160000 out of 500000 steps (32%)
05:18:57:WU01:FS01:0x22:Completed 165000 out of 500000 steps (33%)
05:20:24:WU01:FS01:0x22:Completed 170000 out of 500000 steps (34%)
05:21:50:WU01:FS01:0x22:Completed 175000 out of 500000 steps (35%)
05:23:19:WU01:FS01:0x22:Completed 180000 out of 500000 steps (36%)
05:24:45:WU01:FS01:0x22:Completed 185000 out of 500000 steps (37%)
05:26:12:WU01:FS01:0x22:Completed 190000 out of 500000 steps (38%)
05:27:38:WU01:FS01:0x22:Completed 195000 out of 500000 steps (39%)
05:29:04:WU01:FS01:0x22:Completed 200000 out of 500000 steps (40%)
05:30:32:WU01:FS01:0x22:Completed 205000 out of 500000 steps (41%)
05:31:58:WU01:FS01:0x22:Completed 210000 out of 500000 steps (42%)
05:33:26:WU01:FS01:0x22:Completed 215000 out of 500000 steps (43%)
05:34:53:WU01:FS01:0x22:Completed 220000 out of 500000 steps (44%)
05:36:18:WU01:FS01:0x22:Completed 225000 out of 500000 steps (45%)
05:37:45:WU01:FS01:0x22:Completed 230000 out of 500000 steps (46%)
05:39:11:WU01:FS01:0x22:Completed 235000 out of 500000 steps (47%)
05:40:36:WU01:FS01:0x22:Completed 240000 out of 500000 steps (48%)
Things I've tried so far over this last weekend - 18 failures since then:
- Update drivers from 20.7.2 to 20.8.2
- Undo my SoC undervolt (0.975v) - manually set to 1.1v, auto went to 1.25v, yikes!
- Set VDDP and VDDG voltages to auto - the former went down, the latter went up
- Stop CPU folding
- Update BIOS to latest version w/AGESA 1.0.0.6
- Run Memtest86 for 20+ hours - 0 errors
Until I looked in HFM's work unit history today and later saw this thread, I was strongly suspecting a hardware issue, hence most of the above troubleshooting steps.
My main PC's GTX 1080 Ti has had no issue processing 16600 WUs - 63 so far this month. It hasn't failed a WU since early July, even with it running slightly above 2 GHz.
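The restart in the log above can be quantified with a short script. This is only a rough sketch in Python, not anything FAHClient ships: the function name `lost_steps_per_restart` and the regular expressions are assumptions based purely on the wording of the log lines posted here. It pairs each "Particle coordinate is nan" line with the first progress line printed after the core restarts and reports how many steps had to be redone.

```python
import re

# Patterns assumed from the log lines quoted in this post (not an official format spec).
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")
NAN_ERROR = re.compile(r"An exception occurred at step (\d+): Particle coordinate is nan")


def lost_steps_per_restart(lines):
    """Return a list of (failed_step, resumed_step, lost_steps) tuples.

    A NaN exception is matched with the next progress line, which is assumed
    to be the checkpoint the restarted core resumed from.
    """
    results = []
    pending_failure = None  # step of the most recent unmatched NaN exception
    for line in lines:
        match = NAN_ERROR.search(line)
        if match:
            pending_failure = int(match.group(1))
            continue
        match = PROGRESS.search(line)
        if match and pending_failure is not None:
            resumed = int(match.group(1))
            results.append((pending_failure, resumed, pending_failure - resumed))
            pending_failure = None
    return results


# Three lines copied from the log excerpt above.
sample = [
    "04:44:39:WU01:FS01:0x22:Completed 70000 out of 500000 steps (14%)",
    "04:45:19:WU01:FS01:0x22:An exception occurred at step 72036: Particle coordinate is nan",
    "04:45:39:WU01:FS01:0x22:Completed 50000 out of 500000 steps (10%)",
]
print(lost_steps_per_restart(sample))
```

On the excerpt above this reports a failure at step 72036 and a resume at step 50000, so roughly 22,000 steps were recomputed after that one exception. If a WU is dumped and a new one starts from step 0, the sketch would count the whole failed run as lost, which may or may not be what you want.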
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Wed Aug 19, 2020 9:13 am
by PantherX
UofM.MartinK wrote:...Now all I want to know is whether that was a deliberate change to let these "problematic" AMD GPU models complete p16600 WUs (perhaps because they serve some sort of purpose after all?) or if this was just a fluke and there is no value in processing them with an AMD card.
Changes were made in FahCore_22 to fix the AMD issue, and early testing showed promising results. Because researchers have access to only a limited range of hardware, their tests succeeded. Beta testing didn't surface these issues either; they only appeared after the release to Full. Reports like yours on this forum therefore help surface such issues: testing on every single GPU is not feasible, so having the F@H community identify problems and work together to solve them is really valuable.
If the WU can fold successfully, that's counted towards science. If it can't, the failures can still be seen by the researcher, so the work isn't in vain.
Please note that I know the researcher is aware of this issue on AMD GPUs and is allocating dedicated resources to investigate it further.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Wed Aug 19, 2020 9:24 am
by muziqaz
The project has been disabled on all AMD cards except Navi. Please let us know if you still receive new p16600 WUs on an AMD GPU.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Wed Aug 19, 2020 9:26 am
by Neil-B
OK, so can I point out that we have one poster whose data shows an issue with 16600 only, and another whose data shows issues across the board, including 16600. So there may be an issue with 16600 in some fashion, but there may also be an issue with the setup of one of the rigs, or a wider incompatibility between the current core and that rig. If it were failing simply on 16600 then yes, look to the project, but it isn't, so looking to the rig or the core may need to be considered, even if that is unpalatable. Blaming one project for failures across the board seems odd?
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Wed Aug 19, 2020 9:27 am
by muziqaz
The failure rate of 16600 is 32%, which is very high.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Wed Aug 19, 2020 9:42 am
by Neil-B
... and high rates of failure on 13421 (30 of 37 failed) and 13423 (8 of 9 failed) on the same rig ... that doesn't just feel like an issue with the 16600 project as far as that rig is concerned ... yes, the 34 of 38 failures on 16600 may be down to an issue with the project, but with the wider failures it feels like a rig issue or possibly a core-to-rig incompatibility.
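For anyone who wants to set those counts next to the 32% project-wide figure, here is a quick sketch in Python. The counts are taken from UofM.MartinK's breakdown earlier in this thread (34 of 38 for p16600, 30 of 37 for p13421, 8 of 9 for p13423); the dictionary layout and the `failure_rate` helper are just illustrative assumptions.

```python
# (failed, total) per project, copied from the breakdown posted earlier in this thread.
counts = {
    "p16600": (34, 38),
    "p13421": (30, 37),
    "p13423": (8, 9),
}


def failure_rate(failed, total):
    """Failure rate as a percentage of all WUs received for a project."""
    return 100.0 * failed / total


rates = {project: round(failure_rate(failed, total), 1)
         for project, (failed, total) in counts.items()}

for project, rate in rates.items():
    print(f"{project}: {rate}% failed")
```

That works out to about 89.5% for p16600, 81.1% for p13421 and 88.9% for p13423 on that one card. With only 9 WUs for p13423 the last number is a very small sample, so it should not be read too precisely.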
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Wed Aug 19, 2020 9:52 am
by muziqaz
This is not limited to this particular machine.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Wed Aug 19, 2020 9:53 am
by PantherX
Regarding Projects 13421 and 13423: since they are highly experimental, they have higher-than-normal failure rates. John is aware of this and is keeping a close eye on the failures. As long as data is successfully uploaded, that's still valuable work being done.
The next version of FahCore_22 (version 0.0.12 or higher) plans to take care of this by running some automated tests upon failure, so that cases which can't be reproduced in the researchers' labs or on their available hardware can be addressed.