
FAH runs fine on one server, dies on another.

Posted: Thu May 06, 2021 11:54 pm
by Instance
I've just configured two servers to run FAH. They're in remote data centres, so all that's running is FAHClient, which I monitor from my desktop.

The older one (8 cores, slower everything) is running fine at 17.5K PPD.

The newer one has 12 cores and a faster bus, RAM, CPU, etc. F@h starts, reports about 2.3K PPD, and then an hour or so later just stops. I tried turning the number of cores down, with the same result.

Snippets from the top and bottom of the logs:

Code:

*********************** Log Started 2021-05-06T13:42:36Z ***********************
13:42:36:************************* Folding@home Client *************************
13:42:36:    Website: http://folding.stanford.edu/
13:42:36:  Copyright: (c) 2009-2014 Stanford University
13:42:36:     Author: Joseph Coffland <[email protected]>
13:42:36:       Args: --child --lifeline 3680 /etc/fahclient/config.xml --run-as
13:42:36:             fahclient --pid-file=/var/run/fahclient.pid --daemon
13:42:36:     Config: /etc/fahclient/config.xml
13:42:36:******************************** Build ********************************
13:42:36:    Version: 7.4.4
13:42:36:       Date: Mar 4 2014
13:42:36:       Time: 12:01:17
13:42:36:    SVN Rev: 4130
13:42:36:     Branch: fah/trunk/client
13:42:36:   Compiler: GNU 4.1.2 20080704 (Red Hat 4.1.2-46)
13:42:36:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
13:42:36:             -fno-unsafe-math-optimizations -msse2
13:42:36:   Platform: linux2 2.6.18-164.11.1.el5
13:42:36:       Bits: 64
13:42:36:       Mode: Release
13:42:36:******************************* System ********************************
13:42:36:        CPU: Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz
13:42:36:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
13:42:36:       CPUs: 12
13:42:36:     Memory: 31.25GiB
13:42:36:Free Memory: 23.98GiB
13:42:36:    Threads: POSIX_THREADS
13:42:36: OS Version: 4.19
13:42:36:Has Battery: false
13:42:36: On Battery: false
13:42:36: UTC Offset: -4
13:42:36:        PID: 3682
13:42:36:        CWD: /var/lib/fahclient
13:42:36:         OS: Linux 4.19.62-mod-std-ipv6-64-rescue x86_64
13:42:36:    OS Arch: AMD64
13:42:36:       GPUs: 0
13:42:36:       CUDA: Not detected
13:42:36:***********************************************************************
13:42:36:<config>
13:42:36:  <!-- Folding Core -->
13:42:36:  <core-priority v='low'/>
13:42:36:
13:42:36:  <!-- Folding Slot Configuration -->
13:42:36:  <gpu v='false'/>
13:42:36:
13:42:36:  <!-- HTTP Server -->
13:42:36:  <allow v='127.0.0.1 148.170.166.209'/>
13:42:36:
13:42:36:  <!-- Network -->
13:42:36:  <proxy v=':8080'/>
13:42:36:
13:42:36:  <!-- Remote Command Server -->
13:42:36:  <command-allow-no-pass v='127.0.0.1 148.170.166.209'/>
13:42:36:  <password v='********************'/>
13:42:36:
13:42:36:  <!-- Slot Control -->
13:42:36:  <power v='full'/>
13:42:36:
13:42:36:  <!-- User Information -->
13:42:36:  <user v='instance'/>
13:42:36:
13:42:36:  <!-- Folding Slots -->
13:42:36:  <slot id='0' type='CPU'>
13:42:36:    <cpus v='10'/>
13:42:36:  </slot>
13:42:36:</config>
13:42:36:Switching to user fahclient
13:42:36:Trying to access database...
13:42:36:Successfully acquired database lock
13:42:36:Enabled folding slot 00: READY cpu:10
13:42:36:WU00:FS00:Starting
13:42:36:WU00:FS00:Removing old file './work/00/logfile_01-20210506-064720.txt'
13:42:36:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 00 -suffix 01 -version 704 -lifeline 3682 -checkpoint 15 -np 10
13:42:36:WU00:FS00:Started FahCore on PID 3691

Code:

14:39:46:WU00:FS00:0xa8:Calling: mdrun -c frame34.gro -s frame34.tpr -x frame34.xtc -cpi state.cpt -cpt 15 -nt 10 -ntmpi 1
14:39:46:WU00:FS00:0xa8:Steps: first=85000000 total=87500000
14:39:46:WU00:FS00:0xa8:Completed 32002 out of 2500000 steps (1%)
14:40:44:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
14:40:45:WU00:FS00:Starting
14:40:45:WU00:FS00:Removing old file './work/00/logfile_01-20210506-140843.txt'
14:40:45:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 00 -suffix 01 -version 704 -lifeline 3682 -checkpoint 15 -np 10
14:40:45:WU00:FS00:Started FahCore on PID 21360
14:40:45:WU00:FS00:Core PID:21364
14:40:45:WU00:FS00:FahCore 0xa8 started
14:40:46:WU00:FS00:0xa8:*********************** Log Started 2021-05-06T14:40:45Z ***********************
14:40:46:WU00:FS00:0xa8:************************** Gromacs Folding@home Core ***************************
14:40:46:WU00:FS00:0xa8:       Core: Gromacs
14:40:46:WU00:FS00:0xa8:       Type: 0xa8
14:40:46:WU00:FS00:0xa8:    Version: 0.0.12
14:40:46:WU00:FS00:0xa8:     Author: Joseph Coffland <[email protected]>
14:40:46:WU00:FS00:0xa8:  Copyright: 2020 foldingathome.org
14:40:46:WU00:FS00:0xa8:   Homepage: https://foldingathome.org/
14:40:46:WU00:FS00:0xa8:       Date: Jan 16 2021
14:40:46:WU00:FS00:0xa8:       Time: 19:23:19
14:40:46:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
14:40:46:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
14:40:46:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie
14:40:46:WU00:FS00:0xa8:   Platform: linux2 4.15.0-128-generic
14:40:46:WU00:FS00:0xa8:       Bits: 64
14:40:46:WU00:FS00:0xa8:       Mode: Release
14:40:46:WU00:FS00:0xa8:       SIMD: avx_256
14:40:46:WU00:FS00:0xa8:     OpenMP: ON
14:40:46:WU00:FS00:0xa8:       CUDA: OFF
14:40:46:WU00:FS00:0xa8:       Args: -dir 00 -suffix 01 -version 704 -lifeline 21360 -checkpoint 15 -np
14:40:46:WU00:FS00:0xa8:             10
14:40:46:WU00:FS00:0xa8:************************************ libFAH ************************************
14:40:46:WU00:FS00:0xa8:       Date: Jan 16 2021
14:40:46:WU00:FS00:0xa8:       Time: 19:21:38
14:40:46:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
14:40:46:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
14:40:46:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie
14:40:46:WU00:FS00:0xa8:   Platform: linux2 4.15.0-128-generic
14:40:46:WU00:FS00:0xa8:       Bits: 64
14:40:46:WU00:FS00:0xa8:       Mode: Release
14:40:46:WU00:FS00:0xa8:************************************ CBang *************************************
14:40:46:WU00:FS00:0xa8:       Date: Jan 16 2021
14:40:46:WU00:FS00:0xa8:       Time: 19:21:24
14:40:46:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
14:40:46:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
14:40:46:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
14:40:46:WU00:FS00:0xa8:   Platform: linux2 4.15.0-128-generic
14:40:46:WU00:FS00:0xa8:       Bits: 64
14:40:46:WU00:FS00:0xa8:       Mode: Release
14:40:46:WU00:FS00:0xa8:************************************ System ************************************
14:40:46:WU00:FS00:0xa8:        CPU: Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz
14:40:46:WU00:FS00:0xa8:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
14:40:46:WU00:FS00:0xa8:       CPUs: 12
14:40:46:WU00:FS00:0xa8:     Memory: 31.25GiB
14:40:46:WU00:FS00:0xa8:Free Memory: 23.97GiB
14:40:46:WU00:FS00:0xa8:    Threads: POSIX_THREADS
14:40:46:WU00:FS00:0xa8: OS Version: 4.19
14:40:46:WU00:FS00:0xa8:Has Battery: false
14:40:46:WU00:FS00:0xa8: On Battery: false
14:40:46:WU00:FS00:0xa8: UTC Offset: -4
14:40:46:WU00:FS00:0xa8:        PID: 21364
14:40:46:WU00:FS00:0xa8:        CWD: /var/lib/fahclient/work
14:40:46:WU00:FS00:0xa8:********************************************************************************
14:40:46:WU00:FS00:0xa8:Project: 16959 (Run 31, Clone 401, Gen 34)
14:40:46:WU00:FS00:0xa8:Unit: 0x00000000000000000000000000000000
14:40:46:WU00:FS00:0xa8:Digital signatures verified
14:40:46:WU00:FS00:0xa8:Calling: mdrun -c frame34.gro -s frame34.tpr -x frame34.xtc -cpi state.cpt -cpt 15 -nt 10 -ntmpi 1
14:40:46:WU00:FS00:0xa8:Steps: first=85000000 total=87500000
14:40:46:WU00:FS00:0xa8:Completed 32002 out of 2500000 steps (1%)
14:41:45:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
14:41:45:WU00:FS00:Starting
14:41:45:WU00:FS00:Removing old file './work/00/logfile_01-20210506-140944.txt'
14:41:45:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 00 -suffix 01 -version 704 -lifeline 3682 -checkpoint 15 -np 10
14:41:45:WU00:FS00:Started FahCore on PID 21439
14:41:45:WU00:FS00:Core PID:21443
14:41:45:WU00:FS00:FahCore 0xa8 started
14:41:46:WU00:FS00:0xa8:*********************** Log Started 2021-05-06T14:41:45Z ***********************
14:41:46:WU00:FS00:0xa8:************************** Gromacs Folding@home Core ***************************
14:41:46:WU00:FS00:0xa8:       Core: Gromacs
14:41:46:WU00:FS00:0xa8:       Type: 0xa8
14:41:46:WU00:FS00:0xa8:    Version: 0.0.12
14:41:46:WU00:FS00:0xa8:     Author: Joseph Coffland <[email protected]>
14:41:46:WU00:FS00:0xa8:  Copyright: 2020 foldingathome.org
14:41:46:WU00:FS00:0xa8:   Homepage: https://foldingathome.org/
14:41:46:WU00:FS00:0xa8:       Date: Jan 16 2021
14:41:46:WU00:FS00:0xa8:       Time: 19:23:19
14:41:46:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
14:41:46:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
14:41:46:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie
14:41:46:WU00:FS00:0xa8:   Platform: linux2 4.15.0-128-generic
14:41:46:WU00:FS00:0xa8:       Bits: 64
14:41:46:WU00:FS00:0xa8:       Mode: Release
14:41:46:WU00:FS00:0xa8:       SIMD: avx_256
14:41:46:WU00:FS00:0xa8:     OpenMP: ON
14:41:46:WU00:FS00:0xa8:       CUDA: OFF
14:41:46:WU00:FS00:0xa8:       Args: -dir 00 -suffix 01 -version 704 -lifeline 21439 -checkpoint 15 -np
14:41:46:WU00:FS00:0xa8:             10
14:41:46:WU00:FS00:0xa8:************************************ libFAH ************************************
14:41:46:WU00:FS00:0xa8:       Date: Jan 16 2021
14:41:46:WU00:FS00:0xa8:       Time: 19:21:38
14:41:46:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
14:41:46:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
14:41:46:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie
14:41:46:WU00:FS00:0xa8:   Platform: linux2 4.15.0-128-generic
14:41:46:WU00:FS00:0xa8:       Bits: 64
14:41:46:WU00:FS00:0xa8:       Mode: Release
14:41:46:WU00:FS00:0xa8:************************************ CBang *************************************
14:41:46:WU00:FS00:0xa8:       Date: Jan 16 2021
14:41:46:WU00:FS00:0xa8:       Time: 19:21:24
14:41:46:WU00:FS00:0xa8:   Compiler: GNU 8.3.0
14:41:46:WU00:FS00:0xa8:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
14:41:46:WU00:FS00:0xa8:             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
14:41:46:WU00:FS00:0xa8:   Platform: linux2 4.15.0-128-generic
14:41:46:WU00:FS00:0xa8:       Bits: 64
14:41:46:WU00:FS00:0xa8:       Mode: Release
14:41:46:WU00:FS00:0xa8:************************************ System ************************************
14:41:46:WU00:FS00:0xa8:        CPU: Intel(R) Xeon(R) E-2136 CPU @ 3.30GHz
14:41:46:WU00:FS00:0xa8:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
14:41:46:WU00:FS00:0xa8:       CPUs: 12
14:41:46:WU00:FS00:0xa8:     Memory: 31.25GiB
14:41:46:WU00:FS00:0xa8:Free Memory: 23.97GiB
14:41:46:WU00:FS00:0xa8:    Threads: POSIX_THREADS
14:41:46:WU00:FS00:0xa8: OS Version: 4.19
14:41:46:WU00:FS00:0xa8:Has Battery: false
14:41:46:WU00:FS00:0xa8: On Battery: false
14:41:46:WU00:FS00:0xa8: UTC Offset: -4
14:41:46:WU00:FS00:0xa8:        PID: 21443
14:41:46:WU00:FS00:0xa8:        CWD: /var/lib/fahclient/work
14:41:46:WU00:FS00:0xa8:********************************************************************************
14:41:46:WU00:FS00:0xa8:Project: 16959 (Run 31, Clone 401, Gen 34)
14:41:46:WU00:FS00:0xa8:Unit: 0x00000000000000000000000000000000
14:41:46:WU00:FS00:0xa8:Digital signatures verified
14:41:46:WU00:FS00:0xa8:Calling: mdrun -c frame34.gro -s frame34.tpr -x frame34.xtc -cpi state.cpt -cpt 15 -nt 10 -ntmpi 1
14:41:46:WU00:FS00:0xa8:Steps: first=85000000 total=87500000
14:41:46:WU00:FS00:0xa8:Completed 32002 out of 2500000 steps (1%)

All other processes are stable.

I'm paying for this beast until the end of the month and would like to put it to good use...

Re: FAH runs fine on one server, dies on another.

Posted: Fri May 07, 2021 4:41 am
by Joe_H
There are occasionally issues with CPU thread counts that are multiples of the prime number 5. This is usually identified during internal and beta testing, but may not show up then. Try setting the CPU thread count to 8 or 9 after pausing the folding slot.
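
As a rough illustration of the suggestion above, here is a minimal Python sketch that lists thread counts that are not multiples of 5 on a 12-thread machine. The function name `candidate_thread_counts` is made up for this example, and it applies only the multiple-of-5 rule mentioned in this post; other counts may still misbehave with a given project.

```python
def candidate_thread_counts(total_cpus, avoid_factor=5):
    """Return thread counts from total_cpus down to 1, skipping multiples of avoid_factor.

    Hypothetical helper: it encodes only the "avoid multiples of 5" rule of thumb
    from the post, not any official FAH policy.
    """
    return [n for n in range(total_cpus, 0, -1) if n % avoid_factor != 0]


# The 12-thread Xeon from the log, currently configured with <cpus v='10'/>.
print(candidate_thread_counts(12))
```

The chosen value would then go into the slot's `cpus` option in place of 10.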

Re: FAH runs fine on one server, dies on another.

Posted: Fri May 07, 2021 4:40 pm
by bruce
Instance wrote: ... and then an hour or so later just stops.

All other processes are stable.
That suggests that you are overclocking (yes, you might not be). FAH tends to produce more heat than the traditional overclocking benchmarking programs, so one of the FAHCores is probably unstable. Check the temperatures!
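
For a quick temperature check on a remote Linux box, something like the following Python sketch may help. It assumes the kernel exposes sensors as `thermal_zone*` directories under `/sys/class/thermal`, with `temp` reported in millidegrees Celsius; many servers expose temperatures only through IPMI or lm-sensors instead, in which case this prints nothing. The function name `read_thermal_zones` is made up for this example.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for each readable thermal_zone* under base.

    Assumption: the standard Linux sysfs thermal layout, where each zone has a
    'temp' file holding an integer in millidegrees Celsius.
    """
    base_path = Path(base)
    if not base_path.is_dir():
        return {}
    readings = {}
    for zone in sorted(base_path.glob("thermal_zone*")):
        try:
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone is unreadable or not reporting a number; skip it
        readings[zone.name] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for name, celsius in read_thermal_zones().items():
        print(f"{name}: {celsius:.1f} C")
```

Running it once while the slot is paused and again under full load gives a simple before/after comparison.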

Re: FAH runs fine on one server, dies on another.

Posted: Tue May 11, 2021 2:09 pm
by Instance
bruce wrote: That suggests that you are overclocking (yes, you might not be). FAH tends to produce more heat than the traditional overclocking benchmarking programs, so one of the FAHCores is probably unstable. Check the temperatures!
We have a winner! Not overclocking AFAIK, but since this is a production server (now retired) I've got lots of monitors on it and sure enough the CPU temp is off the charts. I'll decrement the number of cores until it can handle the load.

Thanks!

Re: FAH runs fine on one server, dies on another.

Posted: Tue May 11, 2021 3:11 pm
by gunnarre
Have you checked if some of the server fans have died? A server should be able to keep its CPUs under the panic limit even at 100% utilization. It might be a good idea to replace the heat pad/paste as well.

[solved] Re: FAH runs fine on one server, dies on another.

Posted: Tue May 11, 2021 3:34 pm
by Instance
gunnarre wrote: Have you checked if some of the server fans have died? A server should be able to keep its CPUs under the panic limit even at 100% utilization. It might be a good idea to replace the heat pad/paste as well.
At 6 cores it wasn't hitting max temp, but the process was still dying an hour in. That's the clue... exactly an hour. A hung-process monitor was killing it!
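
For anyone chasing a similar problem, timing each FahCore run from the client log can show whether the core is being killed on a fixed schedule. Below is a minimal Python sketch, assuming the log line formats seen in the first post ("Started FahCore on PID ..." and "FahCore returned: ..."). The function name `core_run_lengths` and the regular expression are made up for this example and are not part of FAHClient.

```python
import re
from datetime import datetime, timedelta

# Matches the two client log lines of interest, e.g.
#   14:40:45:WU00:FS00:Started FahCore on PID 21360
#   14:41:45:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
LINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):WU\d+:FS\d+:"
    r"(Started FahCore on PID \d+|FahCore returned: (\w+))"
)


def core_run_lengths(log_lines):
    """Return a list of (exit_status, seconds) for each start/return pair."""
    results = []
    started = None
    for line in log_lines:
        match = LINE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        status = match.group(3)
        if status is None:
            started = stamp  # a "Started FahCore" line
        elif started is not None:
            elapsed = stamp - started
            if elapsed < timedelta(0):
                elapsed += timedelta(days=1)  # run crossed midnight
            results.append((status, int(elapsed.total_seconds())))
            started = None
    return results


# Lines copied from the log excerpt in the first post.
sample = [
    "14:40:45:WU00:FS00:Started FahCore on PID 21360",
    "14:40:46:WU00:FS00:0xa8:Completed 32002 out of 2500000 steps (1%)",
    "14:41:45:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
    "14:41:45:WU00:FS00:Started FahCore on PID 21439",
]
print(core_run_lengths(sample))  # [('INTERRUPTED', 60)]
```

Run lengths that come out suspiciously regular may point at an external watchdog or scheduler rather than heat, which tends to fail at less predictable times.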