Page 1 of 2

Detected clock skew[...], I/O delay, laptop hibenation or...

Posted: Wed Dec 30, 2020 4:54 pm
by Foliant
Im currently experimenting with a Raspberry Pi.

After i shut down the Pi (without pause the client) for a short time it started up normally and F@h also restarted to work.
The Client managed to fold another 2% (overall percentage 17%) but than the progress bar went up about 1% per second until reaching 99.99% to stuck there.

I paused, then unpaused.
The client retrieves from 17%, folds to 18% and is ramping up the progress bar again.

I paused, then unpaused.
The client retrieves from 18%, folds to 19% and is running up the progress bar again.

The mentioned clock skew of 8 mins 41 secs should the right amount of time the Pi was off duty.
Down is my log, if I can provide more info just ask with an hint where to find.

Code: Select all

*********************** Log Started 2020-12-30T15:11:46Z ***********************
15:11:46:******************************* libFAH ********************************
15:11:46:       Date: Oct 20 2020
15:11:46:       Time: 20:36:48
15:11:46:   Revision: 5ca109d295a6245e2a2f590b3d0085ad5e567aeb
15:11:46:     Branch: master
15:11:46:   Compiler: GNU 8.3.0
15:11:46:    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
15:11:46:             -fdata-sections -O3 -funroll-loops -fno-pie
15:11:46:   Platform: linux2 4.19.0-9-arm64
15:11:46:       Bits: 64
15:11:46:       Mode: Release
15:11:46:****************************** FAHClient ******************************
15:11:46:    Version: 7.6.21
15:11:46:     Author: Joseph Coffland <[email protected]>
15:11:46:  Copyright: 2020 foldingathome.org
15:11:46:   Homepage: https://foldingathome.org/
15:11:46:       Date: Oct 20 2020
15:11:46:       Time: 20:39:10
15:11:46:   Revision: 6efbf0e138e22d3963e6a291f78dcb9c6422a278
15:11:46:     Branch: master
15:11:46:   Compiler: GNU 8.3.0
15:11:46:    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
15:11:46:             -fdata-sections -O3 -funroll-loops -fno-pie
15:11:46:   Platform: linux2 4.19.0-9-arm64
15:11:46:       Bits: 64
15:11:46:       Mode: Release
15:11:46:       Args: --config /home/pi/config.xml
15:11:46:     Config: /home/pi/config.xml
15:11:46:******************************** CBang ********************************
15:11:46:       Date: Oct 20 2020
15:11:46:       Time: 18:38:03
15:11:46:   Revision: 7e4ce85225d7eaeb775e87c31740181ca603de60
15:11:46:     Branch: master
15:11:46:   Compiler: GNU 8.3.0
15:11:46:    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
15:11:46:             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
15:11:46:   Platform: linux2 4.19.0-9-arm64
15:11:46:       Bits: 64
15:11:46:       Mode: Release
15:11:46:******************************* System ********************************
15:11:46:        CPU: Unknown
15:11:46:     CPU ID: 
15:11:46:       CPUs: 4
15:11:46:     Memory: 912.63MiB
15:11:46:Free Memory: 682.78MiB
15:11:46:    Threads: POSIX_THREADS
15:11:46: OS Version: 5.4
15:11:46:Has Battery: false
15:11:46: On Battery: false
15:11:46: UTC Offset: 1
15:11:46:        PID: 533
15:11:46:        CWD: /
15:11:46:         OS: Linux 5.4.79-v8+ aarch64
15:11:46:    OS Arch: ARM64
15:11:46:       GPUs: 0
15:11:46:       CUDA: Not detected: Failed to open dynamic library 'libcuda.so':
15:11:46:             libcuda.so: cannot open shared object file: No such file or
15:11:46:             directory
15:11:46:     OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
15:11:46:             libOpenCL.so: cannot open shared object file: No such file or
15:11:46:             directory
15:11:46:***********************************************************************
15:11:46:<config>
15:11:46:  <!-- Folding Slot Configuration -->
15:11:46:  <cause v='CANCER'/>
15:11:46:  <gpu v='false'/>
15:11:46:
15:11:46:  <!-- HTTP Server -->
15:11:46:  <allow v='*****'/>
15:11:46:
15:11:46:  <!-- Network -->
15:11:46:  <proxy v=':8080'/>
15:11:46:
15:11:46:  <!-- Remote Command Server -->
15:11:46:  <command-allow-no-pass v='127.0.0.1 *****'/>
15:11:46:  <password v='*****'/>
15:11:46:
15:11:46:  <!-- User Information -->
15:11:46:  <passkey v='*****'/>
15:11:46:  <team v='*****'/>
15:11:46:  <user v='*****'/>
15:11:46:
15:11:46:  <!-- Folding Slots -->
15:11:46:  <slot id='0' type='CPU'>
15:11:46:    <cpus v='3'/>
15:11:46:  </slot>
15:11:46:</config>
15:11:46:Trying to access database...
15:11:46:Successfully acquired database lock
15:11:46:FS00:Initialized folding slot 00: cpu:3
15:11:46:WU00:FS00:Starting
15:11:46:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper //cores/cores.foldingathome.org/lin/64bit-aarch64/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 00 -suffix 01 -version 706 -lifeline 533 -checkpoint 15 -np 3
15:11:46:WU00:FS00:Started FahCore on PID 551
15:11:47:WU00:FS00:Core PID:556
15:11:47:WU00:FS00:FahCore 0xa8 started
15:11:50:WU00:FS00:0xa8:*********************** Log Started 2020-12-30T15:11:50Z ***********************
15:11:50:WU00:FS00:0xa8:************************** Gromacs Folding@home Core ***************************
15:11:50:WU00:FS00:0xa8:       Core: Gromacs
15:11:50:WU00:FS00:0xa8:       Type: 0xa8
15:11:50:WU00:FS00:0xa8:    Version: 0.0.9
15:11:50:WU00:FS00:0xa8:     Author: Joseph Coffland <[email protected]>
15:11:50:WU00:FS00:0xa8:  Copyright: 2020 foldingathome.org
15:11:50:WU00:FS00:0xa8:   Homepage: https://foldingathome.org/
15:11:50:WU00:FS00:0xa8:       Date: Oct 28 2020
15:11:50:WU00:FS00:0xa8:       Time: 22:19:53
15:11:50:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
15:11:50:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
15:11:50:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie
15:11:50:WU00:FS00:0xa8:   Platform: linux2 4.15.0-108-generic
15:11:50:WU00:FS00:0xa8:       Bits: 64
15:11:50:WU00:FS00:0xa8:       Mode: Release
15:11:50:WU00:FS00:0xa8:       SIMD: arm_neon_asimd
15:11:50:WU00:FS00:0xa8:     OpenMP: ON
15:11:50:WU00:FS00:0xa8:       CUDA: OFF
15:11:50:WU00:FS00:0xa8:       Args: -dir 00 -suffix 01 -version 706 -lifeline 551 -checkpoint 15 -np 3
15:11:50:WU00:FS00:0xa8:************************************ libFAH ************************************
15:11:50:WU00:FS00:0xa8:       Date: Oct 28 2020
15:11:50:WU00:FS00:0xa8:       Time: 22:19:24
15:11:50:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
15:11:50:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
15:11:50:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie
15:11:50:WU00:FS00:0xa8:   Platform: linux2 4.15.0-108-generic
15:11:50:WU00:FS00:0xa8:       Bits: 64
15:11:50:WU00:FS00:0xa8:       Mode: Release
15:11:50:WU00:FS00:0xa8:************************************ CBang *************************************
15:11:50:WU00:FS00:0xa8:       Date: Oct 28 2020
15:11:50:WU00:FS00:0xa8:       Time: 22:19:08
15:11:50:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
15:11:50:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
15:11:50:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
15:11:50:WU00:FS00:0xa8:   Platform: linux2 4.15.0-108-generic
15:11:50:WU00:FS00:0xa8:       Bits: 64
15:11:50:WU00:FS00:0xa8:       Mode: Release
15:11:50:WU00:FS00:0xa8:************************************ System ************************************
15:11:50:WU00:FS00:0xa8:        CPU: Cortex-A
15:11:50:WU00:FS00:0xa8:     CPU ID: Arm Family 8 Model 53 Stepping 4
15:11:50:WU00:FS00:0xa8:       CPUs: 4
15:11:50:WU00:FS00:0xa8:     Memory: 912.63MiB
15:11:50:WU00:FS00:0xa8:Free Memory: 661.14MiB
15:11:50:WU00:FS00:0xa8:    Threads: POSIX_THREADS
15:11:50:WU00:FS00:0xa8: OS Version: 5.4
15:11:50:WU00:FS00:0xa8:Has Battery: false
15:11:50:WU00:FS00:0xa8: On Battery: false
15:11:50:WU00:FS00:0xa8: UTC Offset: 1
15:11:50:WU00:FS00:0xa8:        PID: 556
15:11:50:WU00:FS00:0xa8:        CWD: /work
15:11:50:WU00:FS00:0xa8:********************************************************************************
15:11:50:WU00:FS00:0xa8:Project: 16933 (Run 57, Clone 74, Gen 46)
15:11:50:WU00:FS00:0xa8:Unit: 0x00000000000000000000000000000000
15:11:50:WU00:FS00:0xa8:Digital signatures verified
15:11:50:WU00:FS00:0xa8:Calling: mdrun -c frame46.gro -s frame46.tpr -x frame46.xtc -cpi state.cpt -cpt 15 -nt 3 -ntmpi 1
15:11:52:WU00:FS00:0xa8:Steps: first=23000000 total=23500000
15:20:51:WARNING:WU00:FS00:Detected clock skew (8 mins 41 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
15:21:25:WU00:FS00:0xa8:Completed 81922 out of 500000 steps (16%)
15:39:17:WU00:FS00:0xa8:Completed 85000 out of 500000 steps (17%)
15:44:39:FS00:Paused
15:44:39:FS00:Shutting core down
15:44:39:WU00:FS00:0xa8:Caught signal SIGINT(2) on PID 556
15:44:39:WU00:FS00:0xa8:Exiting, please wait. . .
15:44:41:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:44:59:FS00:Unpaused
15:44:59:WU00:FS00:Starting
15:44:59:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper //cores/cores.foldingathome.org/lin/64bit-aarch64/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 00 -suffix 01 -version 706 -lifeline 533 -checkpoint 15 -np 3
15:44:59:WU00:FS00:Started FahCore on PID 21705
15:44:59:WU00:FS00:Core PID:21709
15:44:59:WU00:FS00:FahCore 0xa8 started
15:45:00:WU00:FS00:0xa8:*********************** Log Started 2020-12-30T15:44:59Z ***********************
15:45:00:WU00:FS00:0xa8:************************** Gromacs Folding@home Core ***************************
15:45:00:WU00:FS00:0xa8:       Core: Gromacs
15:45:00:WU00:FS00:0xa8:       Type: 0xa8
15:45:00:WU00:FS00:0xa8:    Version: 0.0.9
15:45:00:WU00:FS00:0xa8:     Author: Joseph Coffland <[email protected]>
15:45:00:WU00:FS00:0xa8:  Copyright: 2020 foldingathome.org
15:45:00:WU00:FS00:0xa8:   Homepage: https://foldingathome.org/
15:45:00:WU00:FS00:0xa8:       Date: Oct 28 2020
15:45:00:WU00:FS00:0xa8:       Time: 22:19:53
15:45:00:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
15:45:00:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
15:45:00:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie
15:45:00:WU00:FS00:0xa8:   Platform: linux2 4.15.0-108-generic
15:45:00:WU00:FS00:0xa8:       Bits: 64
15:45:00:WU00:FS00:0xa8:       Mode: Release
15:45:00:WU00:FS00:0xa8:       SIMD: arm_neon_asimd
15:45:00:WU00:FS00:0xa8:     OpenMP: ON
15:45:00:WU00:FS00:0xa8:       CUDA: OFF
15:45:00:WU00:FS00:0xa8:       Args: -dir 00 -suffix 01 -version 706 -lifeline 21705 -checkpoint 15 -np
15:45:00:WU00:FS00:0xa8:             3
15:45:00:WU00:FS00:0xa8:************************************ libFAH ************************************
15:45:00:WU00:FS00:0xa8:       Date: Oct 28 2020
15:45:00:WU00:FS00:0xa8:       Time: 22:19:24
15:45:00:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
15:45:00:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
15:45:00:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie
15:45:00:WU00:FS00:0xa8:   Platform: linux2 4.15.0-108-generic
15:45:00:WU00:FS00:0xa8:       Bits: 64
15:45:00:WU00:FS00:0xa8:       Mode: Release
15:45:00:WU00:FS00:0xa8:************************************ CBang *************************************
15:45:00:WU00:FS00:0xa8:       Date: Oct 28 2020
15:45:00:WU00:FS00:0xa8:       Time: 22:19:08
15:45:00:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
15:45:00:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
15:45:00:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
15:45:00:WU00:FS00:0xa8:   Platform: linux2 4.15.0-108-generic
15:45:00:WU00:FS00:0xa8:       Bits: 64
15:45:00:WU00:FS00:0xa8:       Mode: Release
15:45:00:WU00:FS00:0xa8:************************************ System ************************************
15:45:00:WU00:FS00:0xa8:        CPU: Cortex-A
15:45:00:WU00:FS00:0xa8:     CPU ID: Arm Family 8 Model 53 Stepping 4
15:45:00:WU00:FS00:0xa8:       CPUs: 4
15:45:00:WU00:FS00:0xa8:     Memory: 912.63MiB
15:45:00:WU00:FS00:0xa8:Free Memory: 399.00MiB
15:45:00:WU00:FS00:0xa8:    Threads: POSIX_THREADS
15:45:00:WU00:FS00:0xa8: OS Version: 5.4
15:45:00:WU00:FS00:0xa8:Has Battery: false
15:45:00:WU00:FS00:0xa8: On Battery: false
15:45:00:WU00:FS00:0xa8: UTC Offset: 1
15:45:00:WU00:FS00:0xa8:        PID: 21709
15:45:00:WU00:FS00:0xa8:        CWD: /work
15:45:00:WU00:FS00:0xa8:********************************************************************************
15:45:00:WU00:FS00:0xa8:Project: 16933 (Run 57, Clone 74, Gen 46)
15:45:00:WU00:FS00:0xa8:Unit: 0x00000000000000000000000000000000
15:45:00:WU00:FS00:0xa8:Digital signatures verified
15:45:00:WU00:FS00:0xa8:Calling: mdrun -c frame46.gro -s frame46.tpr -x frame46.xtc -cpi state.cpt -cpt 15 -nt 3 -ntmpi 1
15:45:00:WU00:FS00:0xa8:Steps: first=23000000 total=23500000
15:45:51:WU00:FS00:0xa8:Completed 85899 out of 500000 steps (17%)
16:08:36:WU00:FS00:0xa8:Completed 90000 out of 500000 steps (18%)
16:27:42:FS00:Paused
16:27:42:FS00:Shutting core down
16:27:42:WU00:FS00:0xa8:Caught signal SIGINT(2) on PID 21709
16:27:42:WU00:FS00:0xa8:Exiting, please wait. . .
16:27:44:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
16:27:51:FS00:Unpaused
16:27:51:WU00:FS00:Starting
16:27:51:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper //cores/cores.foldingathome.org/lin/64bit-aarch64/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 00 -suffix 01 -version 706 -lifeline 533 -checkpoint 15 -np 3
16:27:51:WU00:FS00:Started FahCore on PID 13203
16:27:51:WU00:FS00:Core PID:13210
16:27:51:WU00:FS00:FahCore 0xa8 started
16:27:51:WU00:FS00:0xa8:*********************** Log Started 2020-12-30T16:27:51Z ***********************
16:27:51:WU00:FS00:0xa8:************************** Gromacs Folding@home Core ***************************
16:27:51:WU00:FS00:0xa8:       Core: Gromacs
16:27:51:WU00:FS00:0xa8:       Type: 0xa8
16:27:51:WU00:FS00:0xa8:    Version: 0.0.9
16:27:51:WU00:FS00:0xa8:     Author: Joseph Coffland <[email protected]>
16:27:51:WU00:FS00:0xa8:  Copyright: 2020 foldingathome.org
16:27:51:WU00:FS00:0xa8:   Homepage: https://foldingathome.org/
16:27:51:WU00:FS00:0xa8:       Date: Oct 28 2020
16:27:51:WU00:FS00:0xa8:       Time: 22:19:53
16:27:51:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
16:27:51:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
16:27:51:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie
16:27:51:WU00:FS00:0xa8:   Platform: linux2 4.15.0-108-generic
16:27:51:WU00:FS00:0xa8:       Bits: 64
16:27:51:WU00:FS00:0xa8:       Mode: Release
16:27:51:WU00:FS00:0xa8:       SIMD: arm_neon_asimd
16:27:51:WU00:FS00:0xa8:     OpenMP: ON
16:27:51:WU00:FS00:0xa8:       CUDA: OFF
16:27:51:WU00:FS00:0xa8:       Args: -dir 00 -suffix 01 -version 706 -lifeline 13203 -checkpoint 15 -np
16:27:51:WU00:FS00:0xa8:             3
16:27:51:WU00:FS00:0xa8:************************************ libFAH ************************************
16:27:51:WU00:FS00:0xa8:       Date: Oct 28 2020
16:27:51:WU00:FS00:0xa8:       Time: 22:19:24
16:27:51:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
16:27:51:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
16:27:51:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie
16:27:51:WU00:FS00:0xa8:   Platform: linux2 4.15.0-108-generic
16:27:51:WU00:FS00:0xa8:       Bits: 64
16:27:51:WU00:FS00:0xa8:       Mode: Release
16:27:51:WU00:FS00:0xa8:************************************ CBang *************************************
16:27:51:WU00:FS00:0xa8:       Date: Oct 28 2020
16:27:51:WU00:FS00:0xa8:       Time: 22:19:08
16:27:51:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
16:27:51:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
16:27:51:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
16:27:51:WU00:FS00:0xa8:   Platform: linux2 4.15.0-108-generic
16:27:51:WU00:FS00:0xa8:       Bits: 64
16:27:51:WU00:FS00:0xa8:       Mode: Release
16:27:51:WU00:FS00:0xa8:************************************ System ************************************
16:27:51:WU00:FS00:0xa8:        CPU: Cortex-A
16:27:51:WU00:FS00:0xa8:     CPU ID: Arm Family 8 Model 53 Stepping 4
16:27:51:WU00:FS00:0xa8:       CPUs: 4
16:27:51:WU00:FS00:0xa8:     Memory: 912.63MiB
16:27:51:WU00:FS00:0xa8:Free Memory: 360.57MiB
16:27:51:WU00:FS00:0xa8:    Threads: POSIX_THREADS
16:27:51:WU00:FS00:0xa8: OS Version: 5.4
16:27:51:WU00:FS00:0xa8:Has Battery: false
16:27:51:WU00:FS00:0xa8: On Battery: false
16:27:51:WU00:FS00:0xa8: UTC Offset: 1
16:27:51:WU00:FS00:0xa8:        PID: 13210
16:27:51:WU00:FS00:0xa8:        CWD: /work
16:27:51:WU00:FS00:0xa8:********************************************************************************
16:27:51:WU00:FS00:0xa8:Project: 16933 (Run 57, Clone 74, Gen 46)
16:27:51:WU00:FS00:0xa8:Unit: 0x00000000000000000000000000000000
16:27:51:WU00:FS00:0xa8:Digital signatures verified
16:27:51:WU00:FS00:0xa8:Calling: mdrun -c frame46.gro -s frame46.tpr -x frame46.xtc -cpi state.cpt -cpt 15 -nt 3 -ntmpi 1
16:27:51:WU00:FS00:0xa8:Steps: first=23000000 total=23500000
16:28:44:WU00:FS00:0xa8:Completed 93622 out of 500000 steps (18%)
16:36:49:WU00:FS00:0xa8:Completed 95000 out of 500000 steps (19%)

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Wed Dec 30, 2020 5:39 pm
by Knish
I have not experienced the ramping up of progress %, but I have experienced a few steps in progress before hitting the clock skew which caused me to get locked out of remote monitoring for that VM. I have found it to be a lack of RAM. A lot of WUs happened to be ok with 1.75GB of RAM until I hit a couple of the biggest atom counts. 2GB of RAM seems to have fixed that problem for me. How much ram do u have?
{edit} caveat - i was working with gpu WUs, and just realized ur probably just on cpu WUs, so ur min RAM amount will probably differ.
{edit 2} how I narrowed it down was as soon as I restarted folding, I'd watch my RAM with top and I'd see only 50MB free, which would not be good for certain steps that required more ram

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Wed Dec 30, 2020 5:56 pm
by Foliant
Good point, RAM indeed is very low. I dont see it fully used but maybe you point the right direction.

I found the old logs from before the shutdown. Seems like a normal exit.

Code: Select all

15:00:15:WU00:FS00:0xa8:Completed 80000 out of 500000 steps (16%)
15:11:28:Caught signal SIGTERM(15) on PID 525
15:11:29:Exiting, please wait. . .
15:11:30:WU00:FS00:0xa8:Caught signal SIGTERM(15) on PID 5446
15:11:30:WU00:FS00:0xa8:Exiting, please wait. . .
15:11:31:FS00:Shutting core down
15:11:33:Clean exit
What catches my eyes is the logs timestamp. The Pi definitively was off grid some minutes!

But 15:11:33:Clean exit and Log Started 2020-12-30T15:11:46Z dont correlate.

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Wed Dec 30, 2020 8:15 pm
by PantherX
This is a known issue (https://github.com/FoldingAtHome/fah-issues/issues/1240) but I have asked around so let's see what happens :)

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Wed Dec 30, 2020 9:03 pm
by Foliant
Whatever is causing this might be limited to the Interface that provides the data we can read.

I noticed via htop the CPU was working very hard even if the progress bar shows 99,99%.
So I decided to let it go not doing the pause/unpause.
After quite while the log starts filling again with comprehensible data. Despite the progress bar keeps its 99,99.

I have not stared at the bar all the time but I think one time i came through the door the progress was climbing from 96...98...99,99 without the pause/unpause-thing.

Now after about another hour i did not monitor whats going on, not only the log is back to normal, even the progress bar is showing what the log tells. :shock:

Code: Select all

18:39:41:FS00:Unpaused
18:39:41:WU00:FS00:Starting
18:39:41:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper //cores/cores.foldingathome.org/lin/64bit-aarch64/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 00 -suffix 01 -version 706 -lifeline 527 -checkpoint 15 -np 3
18:39:41:WU00:FS00:Started FahCore on PID 8067

[Blahblah]

18:39:55:</config>
18:40:31:WU00:FS00:0xa8:Completed 108916 out of 500000 steps (21%)
18:46:39:WU00:FS00:0xa8:Completed 110000 out of 500000 steps (22%)
19:11:53:WU00:FS00:0xa8:Completed 115000 out of 500000 steps (23%)
19:36:09:WU00:FS00:0xa8:Completed 120000 out of 500000 steps (24%)
20:00:27:WU00:FS00:0xa8:Completed 125000 out of 500000 steps (25%)
20:26:24:WU00:FS00:0xa8:Completed 130000 out of 500000 steps (26%)
Best thing is I can still meet the timeout: Base 419; Estimated 545


Technical question that comes up when I read
FahCore: /usr/bin/FAHCoreWrapper //cores/cores.foldingathome.org/lin/64bit-aarch64/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 00 -suffix 01 -version 706 -lifeline 527 -checkpoint 15 -np 3
Is it somehow possible to force start at a specific checkpoint?


Regards,
Patrick

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Thu Dec 31, 2020 3:46 am
by PantherX
Foliant wrote:...Technical question that comes up when I read
FahCore: /usr/bin/FAHCoreWrapper //cores/cores.foldingathome.org/lin/64bit-aarch64/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 00 -suffix 01 -version 706 -lifeline 527 -checkpoint 15 -np 3
Is it somehow possible to force start at a specific checkpoint?...
By default, FahCore would use the most recent checkpoint to resume folding. I do think it maintains 1 or 2 checkpoint that gets reused so there's no history of checkpoints. I do believe that there's an enhancement request to ensure that multiple checkpoints exist to provide a better fallback option: https://github.com/FoldingAtHome/fah-issues/issues/1195

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Thu Dec 31, 2020 11:55 am
by foldy
Foliant wrote: Now after about another hour i did not monitor whats going on, not only the log is back to normal, even the progress bar is showing what the log tells. :shock:
The 99% really means the UI lost connection to the Fahcore and just increases % every minute or so speculative.
After the UI got connection to Fahcore back then it shows the real %.

The only question is why Fahcore gets blocked in between? Did you try to run only 1 CPU core to test if it is maybe heat issue?

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Thu Dec 31, 2020 12:24 pm
by Foliant
It did not increase on a minute basis, it increases every second. The PPD estimate also rised up at that moment.
I know the "normal" behaviour of increasing over long periods of time - this was different.

I had two UIs open.
First on the PI, showing "Status".
Second remote, showing the "Log".

The Log restored before the Status.

Heat should not be the problem, a 60mm Fan is cooling the entire Pi. My highest CPU Temp reading is at 62°C

Im now at 64,91% of that WU and all seems fine since then.

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Thu Dec 31, 2020 4:44 pm
by bruce
-checkpoint 15

That is the time between checkpoints, not the time at which the checkpoint occurred.

FAH only keeps the most recent checkpoint. If it is corrupted, it can only restart from that point or from 0%. If you run out of disk space or run out of virtual memory, you can produce a truncated checkpoint which is useless. Do not restrict the resources that FAH needs to run.

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Thu Dec 31, 2020 5:16 pm
by Joe_H
PantherX wrote:
Foliant wrote:...Technical question that comes up when I read
FahCore: /usr/bin/FAHCoreWrapper //cores/cores.foldingathome.org/lin/64bit-aarch64/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 00 -suffix 01 -version 706 -lifeline 527 -checkpoint 15 -np 3
Is it somehow possible to force start at a specific checkpoint?...
By default, FahCore would use the most recent checkpoint to resume folding. I do think it maintains 1 or 2 checkpoint that gets reused so there's no history of checkpoints. I do believe that there's an enhancement request to ensure that multiple checkpoints exist to provide a better fallback option: https://github.com/FoldingAtHome/fah-issues/issues/1195
I have not been running any GPU folding for a while, so will only comment on CPU folding in relation to checkpoints.

The Gromacs code being used for creating the CPU folding cores, A7 & A8, supports at least two generations of checkpoints. However the folding client and the cores have never implemented support for using the prior checkpoint. As mentioned by PantherX, there is an enhancement request to change this.

If you want to look, for example in the work folder for an A8 WU currently processing on my system has a state.cpt file and a state_prev.cpt file. What is missing at the least is a checkpt.crc file for the prior checkpoint file, just one for the current checkpoint exists. I don't know what else would be needed, and I have been involved with computers and programming long enough to know that "it is just a simple matter of programming" is rarely simple or easy.

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Thu Dec 31, 2020 7:07 pm
by bruce
While I do support getting an enhancement to FAHCore* to use the previous checkpoint if the primary one is corrupt, I (almost) never run into a corrupt checkpoint. Your focus should be on whatever is causing FAH to crash while the checkpoint is still cached in RAM.

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Sat Jan 02, 2021 2:40 pm
by PantherX
If you have a bit of time, could you potentially poke around to see if you can reliably re-create the issue? This has been an elusive issue so would appreciate if you can do it to allow us to investigate this further and fix it :)

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Sat Jan 02, 2021 5:34 pm
by Foliant
I downloaded a new WU, let it run to 1%+ and was shutting the RPi down.
I disconnected DC in and let it alone for dinner (about half an hour).

Now I put the DC plug in again, RPi was booting up as normal.
And the log is showing me:

Code: Select all

16:36:12:WARNING:WU00:FS00:Detected clock skew (27 mins 13 secs), I/O delay, laptop hibernation or other slowdown noted, adjusting time estimates
The progress-bar first showed 0,00%, than 1,09% (thats where i stopped the RPi), after that it falled back to 1,00% and starts folding.
Reaching 2,00% the progress-bar is counting fast forward - but its way slower than it was with the 8 minute clock skew. (Now about 1% every 50sec)

As it rises the estimated TPF is 53sec
I searched my old screenshot from the faster and it somehow correlates: 3sec

Folding is going on in background and as the log is showing 3% the progress-bar jumps back to 0, increasing again.
Its now even slower and estimated TPF shows 2min 36sec.

I let it just go and wait what will happen next.

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Sat Jan 02, 2021 7:02 pm
by Foliant
Part 2

As the log was hitting 4% all went back to normal.
It is weird but seems not to corrupt anything.


In the next couple of days I will report back doing some experimenting.

Re: Detected clock skew[...], I/O delay, laptop hibenation o

Posted: Sat Jan 02, 2021 7:05 pm
by Joe_H
Sounds a lot like the checkpoint had been made just a bit before completing 1.0%. When restarted the time to complete that little bit was used to estimate a very fast TPF value. It then took the client a few samples over the next few percent to push that out of the estimate and get back to normal.