Error handling in FAH
Posted: Thu Jul 16, 2020 12:14 pm
The FAH client seems to need better error handling and recovery. Since upgrading from a 4-core/4-thread CPU to a 4-core/8-thread one, I have started to see the following rather frequently: the WU reaches 100%, the core hits an error during shutdown, and the client restarts the same WU from 0% in the same directory, even though that directory already holds the 100 finished JSON files and a log file saying the unit is finished. The log below shows one such case.
I tried switching to new RAM modules, since the error code (0xc0000374) looks like heap corruption, and I also checked for overclocking issues, but neither made any difference. The same thing still happens every now and then.
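For anyone who wants to check the code themselves, here is a minimal Python sketch of how the negative number in the log maps to a Windows status code. The function names and the small lookup table are my own, for illustration only; the three status values come from the Windows headers.

```python
# Windows reports a crashed process's exit code as an NTSTATUS value.
# The client logs it as a signed 32-bit integer, so it shows up negative.

# A few well-known NTSTATUS values (illustrative subset, not a full list).
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000374: "STATUS_HEAP_CORRUPTION",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def to_ntstatus(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


def describe(exit_code: int) -> str:
    """Return the hex form of the exit code plus its name, if known."""
    status = to_ntstatus(exit_code)
    name = KNOWN_STATUS.get(status, "unknown")
    return f"0x{status:08x} {name}"


print(describe(-1073740940))  # the value from the log in this post
```

With the value from my log this prints `0xc0000374 STATUS_HEAP_CORRUPTION`, which is why I suspected the RAM first.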
As an added issue, the FAH client downloads a new WU into a new directory, but that WU never gets started because the client just restarts the old one from 0% again. You end up with a downloaded WU sitting idle, waiting for the previous one to finish at some point.
I tried setting “next-unit-percentage” to “100”, hoping that no new download would happen until the previous WU was fully completed, but the client then simply downloads at 100% instead of 99%. I would suggest adding another value for this configuration option, “-1” or something similar, meaning that a new WU is not downloaded until the FahCore returns a correct finishing code, “FahCore returned: FINISHED_UNIT (100 = 0x64)”, so that no new WU is fetched until the previous one has actually been handled.
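To make the suggestion concrete, here is a rough Python sketch of the decision I have in mind. This is not how the FAH client is implemented, and the function name, the parameter names, and the “-1” sentinel are all hypothetical; only the exit code 100 for FINISHED_UNIT is taken from the client's own log message.

```python
from typing import Optional

FINISHED_UNIT = 100    # from the log message "FINISHED_UNIT (100 = 0x64)"
WAIT_FOR_FINISH = -1   # hypothetical sentinel value proposed in this post


def should_fetch_next(next_unit_percentage: int,
                      progress_pct: int,
                      core_exit_code: Optional[int] = None) -> bool:
    """Decide whether the next WU may be downloaded.

    With the proposed -1 setting, wait until the core has exited with
    FINISHED_UNIT. Otherwise fall back to a plain progress threshold,
    which is how I understand the existing option to behave.
    """
    if next_unit_percentage == WAIT_FOR_FINISH:
        return core_exit_code == FINISHED_UNIT
    return progress_pct >= next_unit_percentage


# Existing behaviour as I observed it: at 100% the download starts regardless.
print(should_fetch_next(100, 100))
# Proposed behaviour: a crash code at shutdown blocks the download.
print(should_fetch_next(WAIT_FOR_FINISH, 100, -1073740940))
# Proposed behaviour: a clean FINISHED_UNIT allows it.
print(should_fetch_next(WAIT_FOR_FINISH, 100, FINISHED_UNIT))
```

The point of the sentinel is that progress alone says nothing about whether the core shut down cleanly, so the exit code is the only reliable signal.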
09:49:31:WU00:FS03:0x22:Completed 980000 out of 1000000 steps (98%)
09:55:10:WU00:FS03:0x22:Completed 990000 out of 1000000 steps (99%)
10:00:48:WU00:FS03:0x22:Completed 1000000 out of 1000000 steps (100%)
10:00:48:WU00:FS03:0x22:Average performance: 51.1545 ns/day
10:00:53:WU00:FS03:0x22:Saving result file ..\logfile_01.txt
10:00:53:WU00:FS03:0x22:Saving result file checkpointState.xml.bz2
10:00:53:WU00:FS03:0x22:Saving result file globals.csv
10:00:53:WU00:FS03:0x22:Saving result file positions.xtc
10:00:53:WU00:FS03:0x22:Saving result file science.log
10:00:53:WU00:FS03:0x22:Folding@home Core Shutdown: FINISHED_UNIT
10:00:55:WARNING:WU00:FS03:FahCore returned an unknown error code which probably indicates that it crashed
10:00:55:WARNING:WU00:FS03:FahCore returned: UNKNOWN_ENUM (-1073740940 = 0xc0000374)
10:00:55:WU00:FS03:Starting
10:00:55:WU00:FS03:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Admin\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 7224 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
10:00:55:WU00:FS03:Started FahCore on PID 7204
10:00:55:WU00:FS03:Core PID:4560
10:00:55:WU00:FS03:FahCore 0x22 started
10:00:56:WU00:FS03:0x22:*********************** Log Started 2020-07-16T10:00:55Z ***********************
10:00:56:WU00:FS03:0x22:*************************** Core22 Folding@home Core ***************************
10:00:56:WU00:FS03:0x22: Core: Core22
10:00:56:WU00:FS03:0x22: Type: 0x22
10:00:56:WU00:FS03:0x22: Version: 0.0.11
10:00:56:WU00:FS03:0x22: Author: Joseph Coffland <[email protected]>
10:00:56:WU00:FS03:0x22: Copyright: 2020 foldingathome.org
10:00:56:WU00:FS03:0x22: Homepage: https://foldingathome.org/
10:00:56:WU00:FS03:0x22: Date: Jun 26 2020
10:00:56:WU00:FS03:0x22: Time: 19:49:16
10:00:56:WU00:FS03:0x22: Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
10:00:56:WU00:FS03:0x22: Branch: core22-0.0.11
10:00:56:WU00:FS03:0x22: Compiler: Visual C++ 2015
10:00:56:WU00:FS03:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
10:00:56:WU00:FS03:0x22: Platform: win32 10
10:00:56:WU00:FS03:0x22: Bits: 64
10:00:56:WU00:FS03:0x22: Mode: Release
10:00:56:WU00:FS03:0x22:Maintainers: John Chodera <[email protected]> and Peter Eastman
10:00:56:WU00:FS03:0x22: <[email protected]>
10:00:56:WU00:FS03:0x22: Args: -dir 00 -suffix 01 -version 706 -lifeline 7204 -checkpoint 15
10:00:56:WU00:FS03:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
10:00:56:WU00:FS03:0x22:************************************ libFAH ************************************
10:00:56:WU00:FS03:0x22: Date: Jun 26 2020
10:00:56:WU00:FS03:0x22: Time: 19:47:12
10:00:56:WU00:FS03:0x22: Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
10:00:56:WU00:FS03:0x22: Branch: HEAD
10:00:56:WU00:FS03:0x22: Compiler: Visual C++ 2015
10:00:56:WU00:FS03:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
10:00:56:WU00:FS03:0x22: Platform: win32 10
10:00:56:WU00:FS03:0x22: Bits: 64
10:00:56:WU00:FS03:0x22: Mode: Release
10:00:56:WU00:FS03:0x22:************************************ CBang *************************************
10:00:56:WU00:FS03:0x22: Date: Jun 26 2020
10:00:56:WU00:FS03:0x22: Time: 19:46:11
10:00:56:WU00:FS03:0x22: Revision: f8529962055b0e7bde23e429f5072ff758089dee
10:00:56:WU00:FS03:0x22: Branch: master
10:00:56:WU00:FS03:0x22: Compiler: Visual C++ 2015
10:00:56:WU00:FS03:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
10:00:56:WU00:FS03:0x22: Platform: win32 10
10:00:56:WU00:FS03:0x22: Bits: 64
10:00:56:WU00:FS03:0x22: Mode: Release
10:00:56:WU00:FS03:0x22:************************************ System ************************************
10:00:56:WU00:FS03:0x22: CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
10:00:56:WU00:FS03:0x22: CPU ID: GenuineIntel Family 6 Model 94 Stepping 3
10:00:56:WU00:FS03:0x22: CPUs: 8
10:00:56:WU00:FS03:0x22: Memory: 7.69GiB
10:00:56:WU00:FS03:0x22:Free Memory: 3.40GiB
10:00:56:WU00:FS03:0x22: Threads: WINDOWS_THREADS
10:00:56:WU00:FS03:0x22: OS Version: 6.2
10:00:56:WU00:FS03:0x22:Has Battery: false
10:00:56:WU00:FS03:0x22: On Battery: false
10:00:56:WU00:FS03:0x22: UTC Offset: 2
10:00:56:WU00:FS03:0x22: PID: 4560
10:00:56:WU00:FS03:0x22: CWD: C:\Users\Admin\AppData\Roaming\FAHClient\work
10:00:56:WU00:FS03:0x22:********************************************************************************
10:00:56:WU00:FS03:0x22:Project: 13416 (Run 1040, Clone 205, Gen 0)
10:00:56:WU00:FS03:0x22:Unit: 0x0000000012bc7d9a5f0f8f4a3fff7242
10:00:56:WU00:FS03:0x22:Reading tar file core.xml
10:00:56:WU00:FS03:0x22:Reading tar file integrator.xml
10:00:56:WU00:FS03:0x22:Reading tar file state.xml.bz2
10:00:56:WU00:FS03:0x22:Reading tar file system.xml.bz2
10:00:56:WU00:FS03:0x22:Digital signatures verified
10:00:56:WU00:FS03:0x22:Folding@home GPU Core22 Folding@home Core
10:00:56:WU00:FS03:0x22:Version 0.0.11
10:00:59:WU00:FS03:0x22: Checkpoint write interval: 50000 steps (5%) [20 total]
10:00:59:WU00:FS03:0x22: JSON viewer frame write interval: 10000 steps (1%) [100 total]
10:00:59:WU00:FS03:0x22: XTC frame write interval: 250000 steps (25%) [4 total]
10:00:59:WU00:FS03:0x22: Global context and integrator variables write interval: 2500 steps (0.25%) [400 total]
10:01:17:WU00:FS03:0x22:Completed 0 out of 1000000 steps (0%)
10:06:56:WU00:FS03:0x22:Completed 10000 out of 1000000 steps (1%)
10:12:33:WU00:FS03:0x22:Completed 20000 out of 1000000 steps (2%)
10:18:12:WU00:FS03:0x22:Completed 30000 out of 1000000 steps (3%)
10:23:50:WU00:FS03:0x22:Completed 40000 out of 1000000 steps (4%)
10:29:28:WU00:FS03:0x22:Completed 50000 out of 1000000 steps (5%)
10:35:06:WU00:FS03:0x22:Completed 60000 out of 1000000 steps (6%)
10:40:42:WU00:FS03:0x22:Completed 70000 out of 1000000 steps (7%)
10:46:20:WU00:FS03:0x22:Completed 80000 out of 1000000 steps (8%)
10:51:59:WU00:FS03:0x22:Completed 90000 out of 1000000 steps (9%)
10:57:36:WU00:FS03:0x22:Completed 100000 out of 1000000 steps (10%)
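If others want to check their own logs for the same pattern, a small Python sketch like the following could flag it. It assumes the log line layout shown above; the regular expressions and the function name are my own and have only been matched against these lines, so treat it as a starting point rather than a complete parser.

```python
import re

# Core-side line: the core itself says it finished the unit.
SHUTDOWN_OK = re.compile(
    r"^(\d\d:\d\d:\d\d):(WU\d+):(FS\d+):0x[0-9a-fA-F]+:"
    r"Folding@home Core Shutdown: FINISHED_UNIT"
)
# Client-side line: the exit code the client actually received.
RETURNED = re.compile(
    r"^(\d\d:\d\d:\d\d):(?:WARNING:)?(WU\d+):(FS\d+):"
    r"FahCore returned: (\w+) \((-?\d+) = 0x[0-9a-fA-F]+\)"
)


def find_lost_units(lines):
    """Return (time, wu, slot, exit_code) for each case where the core logged
    FINISHED_UNIT at shutdown but the client then saw a different return code."""
    finished = set()
    hits = []
    for line in lines:
        m = SHUTDOWN_OK.match(line)
        if m:
            finished.add((m.group(2), m.group(3)))
            continue
        m = RETURNED.match(line)
        if m:
            key = (m.group(2), m.group(3))
            if key in finished and m.group(4) != "FINISHED_UNIT":
                hits.append((m.group(1), key[0], key[1], int(m.group(5))))
            finished.discard(key)
    return hits


sample = [
    "10:00:53:WU00:FS03:0x22:Folding@home Core Shutdown: FINISHED_UNIT",
    "10:00:55:WARNING:WU00:FS03:FahCore returned an unknown error code "
    "which probably indicates that it crashed",
    "10:00:55:WARNING:WU00:FS03:FahCore returned: UNKNOWN_ENUM "
    "(-1073740940 = 0xc0000374)",
]
print(find_lost_units(sample))
```

Run against the three relevant lines from my log, it reports one hit for WU00 on FS03 with exit code -1073740940.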