Project: 12419 (Run 77, Clone 0, Gen 257) crashes and restarts endlessly
Posted: Tue Aug 27, 2024 4:00 pm
Hello there,
I noticed an issue with the Project: 12419 (Run 77, Clone 0, Gen 257) WU.
The WU crashes (signal 11/SEGV) and restarts endlessly (possibly a similar error to the one mentioned in this thread).
According to the WU status page, this WU has already failed on another system.
Excerpt from the system log:
Code:
Aug 27 17:33:10 ****** systemd-coredump[36046]: Process 36039 (FahCore_a8) of user 62464 terminated abnormally with signal 11/SEGV, processing...
Aug 27 17:33:10 ****** systemd[1]: Started Process Core Dump (PID 36046/UID 0).
Aug 27 17:33:10 ****** FAHClient[2416]: 15:33:10:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
Aug 27 17:33:11 ****** systemd-coredump[36047]: Process 36039 (FahCore_a8) of user 62464 dumped core.
Stack trace of thread 36044:
#0 0x000000000071b329 n/a (/var/lib/private/fah/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 + 0x31b329)
ELF object binary architecture: AMD x86-64
Aug 27 17:34:10 ****** systemd-coredump[36130]: Process 36122 (FahCore_a8) of user 62464 terminated abnormally with signal 11/SEGV, processing...
Aug 27 17:34:10 ****** systemd[1]: Started Process Core Dump (PID 36130/UID 0).
Aug 27 17:34:10 ****** FAHClient[2416]: 15:34:10:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
Aug 27 17:34:11 ****** systemd-coredump[36131]: Process 36122 (FahCore_a8) of user 62464 dumped core.
Stack trace of thread 36128:
#0 0x000000000071b329 n/a (/var/lib/private/fah/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 + 0x31b329)
ELF object binary architecture: AMD x86-64
Aug 27 17:35:10 ****** systemd-coredump[36200]: Process 36192 (FahCore_a8) of user 62464 terminated abnormally with signal 11/SEGV, processing...
Aug 27 17:35:10 ****** systemd[1]: Started Process Core Dump (PID 36200/UID 0).
Aug 27 17:35:10 ****** FAHClient[2416]: 15:35:10:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
Aug 27 17:35:11 ****** systemd-coredump[36201]: Process 36192 (FahCore_a8) of user 62464 dumped core.
Stack trace of thread 36197:
#0 0x000000000071b329 n/a (/var/lib/private/fah/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 + 0x31b329)
ELF object binary architecture: AMD x86-64
Aug 27 17:36:10 ****** systemd-coredump[36282]: Process 36274 (FahCore_a8) of user 62464 terminated abnormally with signal 11/SEGV, processing...
Aug 27 17:36:10 ****** systemd[1]: Started Process Core Dump (PID 36282/UID 0).
Aug 27 17:36:11 ****** FAHClient[2416]: 15:36:11:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
Aug 27 17:36:11 ****** systemd-coredump[36283]: Process 36274 (FahCore_a8) of user 62464 dumped core.
Stack trace of thread 36280:
#0 0x000000000071b329 n/a (/var/lib/private/fah/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 + 0x31b329)
ELF object binary architecture: AMD x86-64
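All four coredump entries above fault in the same top stack frame, at offset 0x31b329 inside FahCore_a8. As a rough check, a short Python sketch like the one below can tally the crashed PIDs and the top-frame offsets. It uses a hypothetical helper, `summarize_crashes`, run on the relevant journal lines pasted in as a string. Its regular expressions assume exactly the line format shown above.

```python
import re
from collections import Counter

# Crash and top-frame lines copied from the journal excerpt above
# ("Started Process Core Dump", "dumped core" and FAHClient lines omitted).
JOURNAL = """\
Aug 27 17:33:10 ****** systemd-coredump[36046]: Process 36039 (FahCore_a8) of user 62464 terminated abnormally with signal 11/SEGV, processing...
#0 0x000000000071b329 n/a (/var/lib/private/fah/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 + 0x31b329)
Aug 27 17:34:10 ****** systemd-coredump[36130]: Process 36122 (FahCore_a8) of user 62464 terminated abnormally with signal 11/SEGV, processing...
#0 0x000000000071b329 n/a (/var/lib/private/fah/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 + 0x31b329)
Aug 27 17:35:10 ****** systemd-coredump[36200]: Process 36192 (FahCore_a8) of user 62464 terminated abnormally with signal 11/SEGV, processing...
#0 0x000000000071b329 n/a (/var/lib/private/fah/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 + 0x31b329)
Aug 27 17:36:10 ****** systemd-coredump[36282]: Process 36274 (FahCore_a8) of user 62464 terminated abnormally with signal 11/SEGV, processing...
#0 0x000000000071b329 n/a (/var/lib/private/fah/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 + 0x31b329)
"""

# systemd-coredump line announcing a crashed process: PID, name, signal.
CRASH_RE = re.compile(
    r"Process (\d+) \((\S+)\) of user \d+ terminated abnormally "
    r"with signal (\d+)/(\w+)"
)
# Top stack frame; captures the offset relative to the start of the binary.
FRAME_RE = re.compile(r"^#0 0x[0-9a-f]+ .*\+ (0x[0-9a-f]+)\)$")


def summarize_crashes(journal_text):
    """Return the list of crashed PIDs and a Counter of top-frame offsets."""
    pids = []
    offsets = Counter()
    for line in journal_text.splitlines():
        line = line.strip()
        crash = CRASH_RE.search(line)
        if crash:
            pids.append(int(crash.group(1)))
            continue
        frame = FRAME_RE.match(line)
        if frame:
            offsets[frame.group(1)] += 1
    return pids, offsets


pids, offsets = summarize_crashes(JOURNAL)
print(len(pids), "crashes; top-frame offsets:", dict(offsets))
# -> 4 crashes; top-frame offsets: {'0x31b329': 4}
```

Every crash landing on the same offset suggests a deterministic fault triggered by this WU's data rather than random hardware instability. That would also fit the WU having already failed on another system. The core is built with -fno-pie, so the absolute address 0x71b329 is presumably just the conventional x86-64 non-PIE load base 0x400000 plus that offset.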
Excerpt from the FAHClient log:
Code:
15:32:10:WU04:FS04:0xa8:Project: 12419 (Run 77, Clone 0, Gen 257)
15:32:10:WU04:FS04:0xa8:Unit: 0x00000000000000000000000000000000
15:32:10:WU04:FS04:0xa8:Digital signatures verified
15:32:10:WU04:FS04:0xa8:Calling: mdrun -c frame257.gro -s frame257.tpr -x frame257.xtc -cpt 3 -nt 4 -ntmpi 1
15:32:10:WU04:FS04:0xa8:Steps: first=1280000000 total=1285000000
15:32:10:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
15:32:11:WU04:FS04:FahCore returned: INTERRUPTED (102 = 0x66)
15:33:09:WU04:FS04:Starting
15:33:09:WU04:FS04:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/private/fah/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 04 -suffix 01 -version 706 -lifeline 2416 -checkpoint 3 -np 4
15:33:09:WU04:FS04:Started FahCore on PID 36035
15:33:09:WU04:FS04:Core PID:36039
15:33:09:WU04:FS04:FahCore 0xa8 started
15:33:10:WU04:FS04:0xa8:*********************** Log Started 2024-08-27T15:33:09Z ***********************
15:33:10:WU04:FS04:0xa8:************************** Gromacs Folding@home Core ***************************
15:33:10:WU04:FS04:0xa8: Core: Gromacs
15:33:10:WU04:FS04:0xa8: Type: 0xa8
15:33:10:WU04:FS04:0xa8: Version: 0.0.12
15:33:10:WU04:FS04:0xa8: Author: Joseph Coffland <[email protected]>
15:33:10:WU04:FS04:0xa8: Copyright: 2020 foldingathome.org
15:33:10:WU04:FS04:0xa8: Homepage: https://foldingathome.org/
15:33:10:WU04:FS04:0xa8: Date: Jan 16 2021
15:33:10:WU04:FS04:0xa8: Time: 19:24:44
15:33:10:WU04:FS04:0xa8: Compiler: GNU 8.3.0
15:33:10:WU04:FS04:0xa8: Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
15:33:10:WU04:FS04:0xa8: -fdata-sections -O3 -funroll-loops -fno-pie
15:33:10:WU04:FS04:0xa8: Platform: linux2 4.15.0-128-generic
15:33:10:WU04:FS04:0xa8: Bits: 64
15:33:10:WU04:FS04:0xa8: Mode: Release
15:33:10:WU04:FS04:0xa8: SIMD: avx2_256
15:33:10:WU04:FS04:0xa8: OpenMP: ON
15:33:10:WU04:FS04:0xa8: CUDA: OFF
15:33:10:WU04:FS04:0xa8: Args: -dir 04 -suffix 01 -version 706 -lifeline 36035 -checkpoint 3 -np 4
15:33:10:WU04:FS04:0xa8:************************************ libFAH ************************************
15:33:10:WU04:FS04:0xa8: Date: Jan 16 2021
15:33:10:WU04:FS04:0xa8: Time: 19:21:38
15:33:10:WU04:FS04:0xa8: Compiler: GNU 8.3.0
15:33:10:WU04:FS04:0xa8: Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
15:33:10:WU04:FS04:0xa8: -fdata-sections -O3 -funroll-loops -fno-pie
15:33:10:WU04:FS04:0xa8: Platform: linux2 4.15.0-128-generic
15:33:10:WU04:FS04:0xa8: Bits: 64
15:33:10:WU04:FS04:0xa8: Mode: Release
15:33:10:WU04:FS04:0xa8:************************************ CBang *************************************
15:33:10:WU04:FS04:0xa8: Date: Jan 16 2021
15:33:10:WU04:FS04:0xa8: Time: 19:21:24
15:33:10:WU04:FS04:0xa8: Compiler: GNU 8.3.0
15:33:10:WU04:FS04:0xa8: Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
15:33:10:WU04:FS04:0xa8: -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
15:33:10:WU04:FS04:0xa8: Platform: linux2 4.15.0-128-generic
15:33:10:WU04:FS04:0xa8: Bits: 64
15:33:10:WU04:FS04:0xa8: Mode: Release
15:33:10:WU04:FS04:0xa8:************************************ System ************************************
15:33:10:WU04:FS04:0xa8: CPU: AMD Ryzen 9 5950X 16-Core Processor
15:33:10:WU04:FS04:0xa8: CPU ID: AuthenticAMD Family 25 Model 33 Stepping 2
15:33:10:WU04:FS04:0xa8: CPUs: 32
15:33:10:WU04:FS04:0xa8: Memory: 31.26GiB
15:33:10:WU04:FS04:0xa8:Free Memory: 20.67GiB
15:33:10:WU04:FS04:0xa8: Threads: POSIX_THREADS
15:33:10:WU04:FS04:0xa8: OS Version: 6.10
15:33:10:WU04:FS04:0xa8:Has Battery: false
15:33:10:WU04:FS04:0xa8: On Battery: false
15:33:10:WU04:FS04:0xa8: UTC Offset: 2
15:33:10:WU04:FS04:0xa8: PID: 36039
15:33:10:WU04:FS04:0xa8: CWD: /var/lib/private/fah/work
15:33:10:WU04:FS04:0xa8:********************************************************************************
15:33:10:WU04:FS04:0xa8:Project: 12419 (Run 77, Clone 0, Gen 257)
15:33:10:WU04:FS04:0xa8:Unit: 0x00000000000000000000000000000000
15:33:10:WU04:FS04:0xa8:Digital signatures verified
15:33:10:WU04:FS04:0xa8:Calling: mdrun -c frame257.gro -s frame257.tpr -x frame257.xtc -cpt 3 -nt 4 -ntmpi 1
15:33:10:WU04:FS04:0xa8:Steps: first=1280000000 total=1285000000
15:33:10:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
15:33:11:WU04:FS04:FahCore returned: INTERRUPTED (102 = 0x66)
15:34:09:WU04:FS04:Starting
15:34:09:WU04:FS04:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/private/fah/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 04 -suffix 01 -version 706 -lifeline 2416 -checkpoint 3 -np 4
15:34:09:WU04:FS04:Started FahCore on PID 36118
15:34:09:WU04:FS04:Core PID:36122
15:34:09:WU04:FS04:FahCore 0xa8 started
[... log started message ...]
15:34:10:WU04:FS04:0xa8:Project: 12419 (Run 77, Clone 0, Gen 257)
15:34:10:WU04:FS04:0xa8:Unit: 0x00000000000000000000000000000000
15:34:10:WU04:FS04:0xa8:Digital signatures verified
15:34:10:WU04:FS04:0xa8:Calling: mdrun -c frame257.gro -s frame257.tpr -x frame257.xtc -cpt 3 -nt 4 -ntmpi 1
15:34:10:WU04:FS04:0xa8:Steps: first=1280000000 total=1285000000
15:34:10:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
15:34:11:WU04:FS04:FahCore returned: INTERRUPTED (102 = 0x66)
15:35:09:WU04:FS04:Starting
15:35:09:WU04:FS04:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/private/fah/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 04 -suffix 01 -version 706 -lifeline 2416 -checkpoint 3 -np 4
15:35:09:WU04:FS04:Started FahCore on PID 36188
15:35:09:WU04:FS04:Core PID:36192
15:35:09:WU04:FS04:FahCore 0xa8 started
[... log started message ...]
15:35:10:WU04:FS04:0xa8:Project: 12419 (Run 77, Clone 0, Gen 257)
15:35:10:WU04:FS04:0xa8:Unit: 0x00000000000000000000000000000000
15:35:10:WU04:FS04:0xa8:Digital signatures verified
15:35:10:WU04:FS04:0xa8:Calling: mdrun -c frame257.gro -s frame257.tpr -x frame257.xtc -cpt 3 -nt 4 -ntmpi 1
15:35:10:WU04:FS04:0xa8:Steps: first=1280000000 total=1285000000
15:35:10:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
15:35:11:WU04:FS04:FahCore returned: INTERRUPTED (102 = 0x66)
15:36:09:WU04:FS04:Starting
15:36:09:WU04:FS04:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/private/fah/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 04 -suffix 01 -version 706 -lifeline 2416 -checkpoint 3 -np 4
15:36:09:WU04:FS04:Started FahCore on PID 36270
15:36:09:WU04:FS04:Core PID:36274
15:36:09:WU04:FS04:FahCore 0xa8 started
[... log started message ...]
15:36:10:WU04:FS04:0xa8:Project: 12419 (Run 77, Clone 0, Gen 257)
15:36:10:WU04:FS04:0xa8:Unit: 0x00000000000000000000000000000000
15:36:10:WU04:FS04:0xa8:Digital signatures verified
15:36:10:WU04:FS04:0xa8:Calling: mdrun -c frame257.gro -s frame257.tpr -x frame257.xtc -cpt 3 -nt 4 -ntmpi 1
15:36:10:WU04:FS04:0xa8:Steps: first=1280000000 total=1285000000
15:36:11:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
15:36:11:WU04:FS04:FahCore returned: INTERRUPTED (102 = 0x66)
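The FAHClient log shows the same cycle every time. The core starts, reports "Completed 1 out of 5000000 steps" and returns INTERRUPTED (102 = 0x66), after which the client starts it again. A similar sketch, again using a hypothetical helper (`summarize_restarts`) fed the relevant lines pasted from the log above, extracts three things:

- the exit status
- the time between consecutive exits
- the highest step reached

```python
import re
from datetime import datetime

# Progress and exit lines copied from the FAHClient log above.
LOG = """\
15:32:10:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
15:32:11:WU04:FS04:FahCore returned: INTERRUPTED (102 = 0x66)
15:33:10:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
15:33:11:WU04:FS04:FahCore returned: INTERRUPTED (102 = 0x66)
15:34:10:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
15:34:11:WU04:FS04:FahCore returned: INTERRUPTED (102 = 0x66)
15:35:10:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
15:35:11:WU04:FS04:FahCore returned: INTERRUPTED (102 = 0x66)
15:36:11:WU04:FS04:0xa8:Completed 1 out of 5000000 steps (0%)
15:36:11:WU04:FS04:FahCore returned: INTERRUPTED (102 = 0x66)
"""

# "HH:MM:SS:WUxx:FSxx:FahCore returned: NAME (decimal = hex)"
RETURN_RE = re.compile(
    r"^(\d\d:\d\d:\d\d):WU\d+:FS\d+:FahCore returned: "
    r"(\w+) \((\d+) = (0x[0-9a-fA-F]+)\)"
)
PROGRESS_RE = re.compile(r":Completed (\d+) out of (\d+) steps")


def summarize_restarts(log_text):
    """Return distinct exit statuses, seconds between exits, and max step."""
    exits = []  # (time of day, status name, decimal code)
    max_step = 0
    for line in log_text.splitlines():
        ret = RETURN_RE.match(line)
        if ret:
            name, dec, hexa = ret.group(2), int(ret.group(3)), ret.group(4)
            if dec != int(hexa, 16):  # "102 = 0x66" should agree
                raise ValueError(f"inconsistent exit code in: {line}")
            ts = datetime.strptime(ret.group(1), "%H:%M:%S")
            exits.append((ts, name, dec))
            continue
        prog = PROGRESS_RE.search(line)
        if prog:
            max_step = max(max_step, int(prog.group(1)))
    # Times of day are compared directly, so this assumes no midnight rollover.
    gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(exits, exits[1:])]
    codes = {(name, dec) for _, name, dec in exits}
    return codes, gaps, max_step


codes, gaps, max_step = summarize_restarts(LOG)
print(codes, gaps, max_step)
# -> {('INTERRUPTED', 102)} [60.0, 60.0, 60.0, 60.0] 1
```

In this excerpt the core exits once per minute and never gets past step 1, so the WU makes no progress at all. The 5000000 steps also match total minus first on the "Steps:" line (1285000000 - 1280000000).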
best regards,
YInMn