Corrupted / bad job 18237/1069/0/71 (failing for all users)

Moderators: Site Moderators, FAHC Science Team

PaulTV
Posts: 207
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by PaulTV »

Hi,

Job https://apps.foldingathome.org/wu#proje ... e=0&gen=71 keeps failing on different systems. Please pull it.

Ryzen 5800X / RTX 4090 / Windows 11
Ryzen 5600X / RTX 3070 Ti / Ubuntu 22.04
Ryzen 5600 / RTX 3060 Ti / Windows 11
Nicolas_orleans
Posts: 111
Joined: Wed Aug 08, 2012 3:08 am

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by Nicolas_orleans »

Hi Paul,
Of the 17 different GPU projects assigned to my system since October, P18237 is the only one that fails regularly (though not for 100% of WUs), with Force RMSE errors.
Could the issue be wider than your particular WU?
Best regards
Nicolas
PaulTV
Posts: 207
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by PaulTV »

Hello,

So far I've done 120 jobs from P18237 successfully; this particular one is the first to fail.

I don't know the science behind the cores and the jobs. If this project fails more often on your machine, while other projects all run fine, it makes me wonder what it's doing that's so special...

Ryzen 5800X / RTX 4090 / Windows 11
Ryzen 5600X / RTX 3070 Ti / Ubuntu 22.04
Ryzen 5600 / RTX 3060 Ti / Windows 11
PaulTV
Posts: 207
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by PaulTV »

See below the full log for this particular job. It appears the potential energy is off at the starting point for this job.

Code: Select all

12:30:27:WU00:FS01:Connecting to assign1.foldingathome.org:80
12:30:27:WU00:FS01:Assigned to work server 158.130.118.23
12:30:27:WU00:FS01:Requesting new work unit for slot 01: gpu:7:0 AD102 [GeForce RTX 4090] from 158.130.118.23
12:30:27:WU00:FS01:Connecting to 158.130.118.23:8080
12:30:28:WU00:FS01:Downloading 10.83MiB
12:30:29:WU00:FS01:Download complete
12:30:29:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:18237 run:1069 clone:0 gen:71 core:0x24 unit:0x00000000000000470000473d0000042d
12:30:59:WU00:FS01:Starting
12:30:59:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/openmm-core-24/windows-10-64bit/release/0x24-8.1.4/Core_24.fah/FahCore_24.exe -dir 00 -suffix 01 -version 706 -lifeline 21196 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
12:30:59:WU00:FS01:Started FahCore on PID 8148
12:30:59:WU00:FS01:Core PID:9244
12:30:59:WU00:FS01:FahCore 0x24 started
12:30:59:WU00:FS01:0x24:*********************** Log Started 2024-11-05T12:30:59Z ***********************
12:30:59:WU00:FS01:0x24:*************************** Core24 Folding@home Core ***************************
12:30:59:WU00:FS01:0x24:       Core: Core24
12:30:59:WU00:FS01:0x24:       Type: 0x24
12:30:59:WU00:FS01:0x24:    Version: 8.1.4
12:30:59:WU00:FS01:0x24:     Author: Joseph Coffland <[email protected]>
12:30:59:WU00:FS01:0x24:  Copyright: 2022 foldingathome.org
12:30:59:WU00:FS01:0x24:   Homepage: https://foldingathome.org/
12:30:59:WU00:FS01:0x24:       Date: Jul 25 2024
12:30:59:WU00:FS01:0x24:       Time: 05:42:49
12:30:59:WU00:FS01:0x24:   Revision: cf9f0139862b8945a2091772770e4631aac37792
12:30:59:WU00:FS01:0x24:     Branch: HEAD
12:30:59:WU00:FS01:0x24:   Compiler: Visual C++
12:30:59:WU00:FS01:0x24:    Options: $( /TP $) /std:c++14 /nologo /EHa /wd4297 /wd4103 /O2
12:30:59:WU00:FS01:0x24:             /Zc:throwingNew /MT -DOPENMM_VERSION="\"8.1.1\"" /Ox /std:c++14
12:30:59:WU00:FS01:0x24:   Platform: win32 10
12:30:59:WU00:FS01:0x24:       Bits: 64
12:30:59:WU00:FS01:0x24:       Mode: Release
12:30:59:WU00:FS01:0x24:Maintainers: John Chodera <[email protected]> and Peter Eastman
12:30:59:WU00:FS01:0x24:             <[email protected]>
12:30:59:WU00:FS01:0x24:       Args: -dir 00 -suffix 01 -version 706 -lifeline 8148 -checkpoint 15
12:30:59:WU00:FS01:0x24:             -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor
12:30:59:WU00:FS01:0x24:             nvidia -gpu 0 -gpu-usage 100
12:30:59:WU00:FS01:0x24:************************************ libFAH ************************************
12:30:59:WU00:FS01:0x24:       Date: Jul 25 2024
12:30:59:WU00:FS01:0x24:       Time: 05:23:50
12:30:59:WU00:FS01:0x24:   Revision: c7d2824a47eb025fa8cda8968c7a5e971585d90c
12:30:59:WU00:FS01:0x24:     Branch: HEAD
12:30:59:WU00:FS01:0x24:   Compiler: Visual C++
12:30:59:WU00:FS01:0x24:    Options: $( /TP $) /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
12:30:59:WU00:FS01:0x24:   Platform: win32 10
12:30:59:WU00:FS01:0x24:       Bits: 64
12:30:59:WU00:FS01:0x24:       Mode: Release
12:30:59:WU00:FS01:0x24:************************************ CBang *************************************
12:30:59:WU00:FS01:0x24:    Version: 1.7.2
12:30:59:WU00:FS01:0x24:     Author: Joseph Coffland <[email protected]>
12:30:59:WU00:FS01:0x24:        Org: Cauldron Development LLC
12:30:59:WU00:FS01:0x24:  Copyright: Cauldron Development LLC, 2003-2024
12:30:59:WU00:FS01:0x24:   Homepage: https://cauldrondevelopment.com/
12:30:59:WU00:FS01:0x24:    License: LGPL-2.1-or-later
12:30:59:WU00:FS01:0x24:       Date: Jul 25 2024
12:30:59:WU00:FS01:0x24:       Time: 05:22:43
12:30:59:WU00:FS01:0x24:   Revision: f1cd4c791e8c40a35dcfeab3ab85d910949cc0cb
12:30:59:WU00:FS01:0x24:     Branch: HEAD
12:30:59:WU00:FS01:0x24:   Compiler: Visual C++
12:30:59:WU00:FS01:0x24:    Options: $( /TP $) /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
12:30:59:WU00:FS01:0x24:   Platform: win32 10
12:30:59:WU00:FS01:0x24:       Bits: 64
12:30:59:WU00:FS01:0x24:       Mode: Release
12:30:59:WU00:FS01:0x24:************************************ System ************************************
12:30:59:WU00:FS01:0x24:        CPU: AMD Ryzen 7 5800X 8-Core Processor
12:30:59:WU00:FS01:0x24:     CPU ID: AuthenticAMD Family 25 Model 33 Stepping 0
12:30:59:WU00:FS01:0x24:       CPUs: 16
12:30:59:WU00:FS01:0x24:     Memory: 31.89GiB
12:30:59:WU00:FS01:0x24:Free Memory: 25.69GiB
12:30:59:WU00:FS01:0x24: OS Version: 10.0
12:30:59:WU00:FS01:0x24:Has Battery: false
12:30:59:WU00:FS01:0x24: On Battery: false
12:30:59:WU00:FS01:0x24:   Hostname: Desktop
12:30:59:WU00:FS01:0x24: UTC Offset: 1
12:30:59:WU00:FS01:0x24:        PID: 9244
12:30:59:WU00:FS01:0x24:        CWD: C:\ProgramData\FAHClient\work
12:30:59:WU00:FS01:0x24:       Exec: C:\ProgramData\FAHClient\cores\cores.foldingathome.org\openmm-core-24\windows-10-64bit\release\0x24-8.1.4\Core_24.fah\FahCore_24.exe
12:30:59:WU00:FS01:0x24:************************************ OpenMM ************************************
12:30:59:WU00:FS01:0x24:    Version: 8.1.1
12:30:59:WU00:FS01:0x24:********************************************************************************
12:30:59:WU00:FS01:0x24:Project: 18237 (Run 1069, Clone 0, Gen 71)
12:30:59:WU00:FS01:0x24:Reading tar file core.xml
12:30:59:WU00:FS01:0x24:Reading tar file integrator.xml
12:30:59:WU00:FS01:0x24:Reading tar file state.xml.bz2
12:30:59:WU00:FS01:0x24:Reading tar file system.xml.bz2
12:30:59:WU00:FS01:0x24:Digital signatures verified
12:30:59:WU00:FS01:0x24:Folding@home GPU Core24 Folding@home Core
12:30:59:WU00:FS01:0x24:Version 8.1.4
12:30:59:WU00:FS01:0x24:  Checkpoint write interval: 50000 steps (2%) [50 total]
12:30:59:WU00:FS01:0x24:  JSON viewer frame write interval: 25000 steps (1%) [100 total]
12:30:59:WU00:FS01:0x24:  XTC frame write interval: 10000 steps (0.4%) [250 total]
12:30:59:WU00:FS01:0x24:  TRR frame write interval: disabled
12:30:59:WU00:FS01:0x24:  Global context and integrator variables write interval: disabled
12:30:59:WU00:FS01:0x24:There are 4 platforms available.
12:30:59:WU00:FS01:0x24:Platform 0: Reference
12:30:59:WU00:FS01:0x24:Platform 1: CPU
12:30:59:WU00:FS01:0x24:Platform 2: OpenCL
12:30:59:WU00:FS01:0x24:  opencl-device 0 specified
12:30:59:WU00:FS01:0x24:Platform 3: CUDA
12:30:59:WU00:FS01:0x24:  cuda-device 0 specified
12:31:07:WU00:FS01:0x24:Attempting to create CUDA context:
12:31:07:WU00:FS01:0x24:  Configuring platform CUDA
12:31:09:WU00:FS01:0x24:ERROR:Potential energy error of 296.63, threshold of 20
12:31:09:WU00:FS01:0x24:ERROR:Reference Potential Energy: -1.94858e+06 | Given Potential Energy: -1.94887e+06
12:31:09:WU00:FS01:0x24:Saving result file ..\logfile_01.txt
12:31:10:WU00:FS01:0x24:Saving result file science.log
12:31:10:WU00:FS01:0x24:Saving result file state.xml.bz2
12:31:10:WU00:FS01:0x24:Folding@home Core Shutdown: BAD_WORK_UNIT
12:31:10:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
12:31:10:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:18237 run:1069 clone:0 gen:71 core:0x24 unit:0x00000000000000470000473d0000042d
12:31:10:WU00:FS01:Uploading 9.68MiB to 158.130.118.23
12:31:10:WU00:FS01:Connecting to 158.130.118.23:8080
12:31:12:WU00:FS01:Upload complete
12:31:12:WU00:FS01:Server responded WORK_ACK (400)
12:31:12:WU00:FS01:Cleaning up
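For readers wondering what the two ERROR lines in the log above are comparing, here is a minimal sketch of that kind of startup sanity check. It assumes the core simply compares the absolute difference between the two energies against a fixed threshold; the function name is made up for illustration and the actual Core24 logic may well differ. The energies in the log are printed to six significant figures, so the difference computed from them (290) only approximates the reported 296.63.

```python
def energy_check(reference: float, given: float, threshold: float = 20.0):
    """Return (error, ok) for a hypothetical potential energy sanity check.

    Assumption: the error is the absolute difference between both energies.
    """
    error = abs(reference - given)
    return error, error <= threshold


# Values copied from the log above (already rounded in the log output).
error, ok = energy_check(-1.94858e+06, -1.94887e+06)
print(f"error ~ {error:.0f}, threshold 20, ok = {ok}")
```

Either way, the difference is more than ten times the threshold, which fits the job being rejected before the first step is run.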

Ryzen 5800X / RTX 4090 / Windows 11
Ryzen 5600X / RTX 3070 Ti / Ubuntu 22.04
Ryzen 5600 / RTX 3060 Ti / Windows 11
PaulTV
Posts: 207
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by PaulTV »

Oh wow... this is a coincidence. A similar issue with 16780/17/0/107, and I'm not the first who encountered this either: https://apps.foldingathome.org/wu#proje ... =0&gen=107. I have literally run thousands of jobs on this machine since the last time jobs blew up (and that was not my setup's fault either).

Code: Select all

20:07:48:WU01:FS01:Connecting to assign1.foldingathome.org:80
20:07:48:WU01:FS01:Assigned to work server 128.104.69.82
20:07:48:WU01:FS01:Requesting new work unit for slot 01: gpu:7:0 AD102 [GeForce RTX 4090] from 128.104.69.82
20:07:48:WU01:FS01:Connecting to 128.104.69.82:8080
20:08:00:WU01:FS01:Downloading 50.07MiB
20:08:05:WU01:FS01:Download complete
20:08:05:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16780 run:17 clone:0 gen:107 core:0x23 unit:0x6b00000000000000110000008c410000
20:08:21:WU01:FS01:Starting
20:08:21:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/openmm-core-23/windows-10-64bit/release/0x23-8.0.3/Core_23.fah/FahCore_23.exe -dir 01 -suffix 01 -version 706 -lifeline 21196 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
20:08:21:WU01:FS01:Started FahCore on PID 26128
20:08:21:WU01:FS01:Core PID:26220
20:08:21:WU01:FS01:FahCore 0x23 started
20:08:21:WU01:FS01:0x23:*********************** Log Started 2024-11-06T20:08:21Z ***********************
20:08:21:WU01:FS01:0x23:*************************** Core23 Folding@home Core ***************************
20:08:21:WU01:FS01:0x23:       Core: Core23
20:08:21:WU01:FS01:0x23:       Type: 0x23
20:08:21:WU01:FS01:0x23:    Version: 8.0.3
20:08:21:WU01:FS01:0x23:     Author: Joseph Coffland <[email protected]>
20:08:21:WU01:FS01:0x23:  Copyright: 2022 foldingathome.org
20:08:21:WU01:FS01:0x23:   Homepage: https://foldingathome.org/
20:08:21:WU01:FS01:0x23:       Date: Aug 3 2023
20:08:21:WU01:FS01:0x23:       Time: 08:39:06
20:08:21:WU01:FS01:0x23:   Compiler: Visual C++
20:08:21:WU01:FS01:0x23:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
20:08:21:WU01:FS01:0x23:             -DOPENMM_VERSION="\"8.0.0\""
20:08:21:WU01:FS01:0x23:   Platform: win32 10
20:08:21:WU01:FS01:0x23:       Bits: 64
20:08:21:WU01:FS01:0x23:       Mode: Release
20:08:21:WU01:FS01:0x23:Maintainers: John Chodera <[email protected]> and Peter Eastman
20:08:21:WU01:FS01:0x23:             <[email protected]>
20:08:21:WU01:FS01:0x23:       Args: -dir 01 -suffix 01 -version 706 -lifeline 26128 -checkpoint 15
20:08:21:WU01:FS01:0x23:             -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor
20:08:21:WU01:FS01:0x23:             nvidia -gpu 0 -gpu-usage 100
20:08:21:WU01:FS01:0x23:************************************ libFAH ************************************
20:08:21:WU01:FS01:0x23:       Date: Aug 3 2023
20:08:21:WU01:FS01:0x23:       Time: 08:37:55
20:08:21:WU01:FS01:0x23:   Compiler: Visual C++
20:08:21:WU01:FS01:0x23:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
20:08:21:WU01:FS01:0x23:   Platform: win32 10
20:08:21:WU01:FS01:0x23:       Bits: 64
20:08:21:WU01:FS01:0x23:       Mode: Release
20:08:21:WU01:FS01:0x23:************************************ CBang *************************************
20:08:21:WU01:FS01:0x23:    Version: 1.7.2
20:08:21:WU01:FS01:0x23:     Author: Joseph Coffland <[email protected]>
20:08:21:WU01:FS01:0x23:        Org: Cauldron Development LLC
20:08:21:WU01:FS01:0x23:  Copyright: Cauldron Development LLC, 2003-2023
20:08:21:WU01:FS01:0x23:   Homepage: https://cauldrondevelopment.com/
20:08:21:WU01:FS01:0x23:    License: GPL 2+
20:08:21:WU01:FS01:0x23:       Date: Aug 3 2023
20:08:21:WU01:FS01:0x23:       Time: 08:37:14
20:08:21:WU01:FS01:0x23:   Compiler: Visual C++
20:08:21:WU01:FS01:0x23:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
20:08:21:WU01:FS01:0x23:   Platform: win32 10
20:08:21:WU01:FS01:0x23:       Bits: 64
20:08:21:WU01:FS01:0x23:       Mode: Release
20:08:21:WU01:FS01:0x23:************************************ System ************************************
20:08:21:WU01:FS01:0x23:        CPU: AMD Ryzen 7 5800X 8-Core Processor
20:08:21:WU01:FS01:0x23:     CPU ID: AuthenticAMD Family 25 Model 33 Stepping 0
20:08:21:WU01:FS01:0x23:       CPUs: 16
20:08:21:WU01:FS01:0x23:     Memory: 31.89GiB
20:08:21:WU01:FS01:0x23:Free Memory: 23.85GiB
20:08:21:WU01:FS01:0x23:    Threads: WINDOWS_THREADS
20:08:21:WU01:FS01:0x23: OS Version: 6.2
20:08:21:WU01:FS01:0x23:Has Battery: false
20:08:21:WU01:FS01:0x23: On Battery: false
20:08:21:WU01:FS01:0x23: UTC Offset: 1
20:08:21:WU01:FS01:0x23:        PID: 26220
20:08:21:WU01:FS01:0x23:        CWD: C:\ProgramData\FAHClient\work
20:08:21:WU01:FS01:0x23:       Exec: C:\ProgramData\FAHClient\cores\cores.foldingathome.org\openmm-core-23\windows-10-64bit\release\0x23-8.0.3\Core_23.fah\FahCore_23.exe
20:08:21:WU01:FS01:0x23:************************************ OpenMM ************************************
20:08:21:WU01:FS01:0x23:    Version: 8.0.0
20:08:21:WU01:FS01:0x23:********************************************************************************
20:08:21:WU01:FS01:0x23:Project: 16780 (Run 17, Clone 0, Gen 107)
20:08:21:WU01:FS01:0x23:Reading tar file core.xml
20:08:21:WU01:FS01:0x23:Reading tar file integrator.xml
20:08:21:WU01:FS01:0x23:Reading tar file state.xml
20:08:22:WU01:FS01:0x23:Reading tar file system.xml
20:08:22:WU01:FS01:0x23:Digital signatures verified
20:08:22:WU01:FS01:0x23:Folding@home GPU Core23 Folding@home Core
20:08:22:WU01:FS01:0x23:Version 8.0.3
20:08:22:WU01:FS01:0x23:  Checkpoint write interval: 50000 steps (2%) [50 total]
20:08:22:WU01:FS01:0x23:  JSON viewer frame write interval: 25000 steps (1%) [100 total]
20:08:22:WU01:FS01:0x23:  XTC frame write interval: 25000 steps (1%) [100 total]
20:08:22:WU01:FS01:0x23:  Global context and integrator variables write interval: disabled
20:08:23:WU01:FS01:0x23:There are 4 platforms available.
20:08:23:WU01:FS01:0x23:Platform 0: Reference
20:08:23:WU01:FS01:0x23:Platform 1: CPU
20:08:23:WU01:FS01:0x23:Platform 2: OpenCL
20:08:23:WU01:FS01:0x23:  opencl-device 0 specified
20:08:23:WU01:FS01:0x23:Platform 3: CUDA
20:08:23:WU01:FS01:0x23:  cuda-device 0 specified
20:08:51:WU01:FS01:0x23:Attempting to create CUDA context:
20:08:51:WU01:FS01:0x23:  Configuring platform CUDA
20:08:56:WU01:FS01:0x23:ERROR:Discrepancy: Forces are blowing up! 132637 0
20:08:56:WU01:FS01:0x23:Saving result file ..\logfile_01.txt
20:08:56:WU01:FS01:0x23:Saving result file science.log
20:08:56:WU01:FS01:0x23:Saving result file state.xml
20:09:01:WU01:FS01:0x23:Folding@home Core Shutdown: BAD_WORK_UNIT
20:09:02:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
20:09:02:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16780 run:17 clone:0 gen:107 core:0x23 unit:0x6b00000000000000110000008c410000
20:09:02:WU01:FS01:Uploading 41.84MiB to 128.104.69.82
20:09:02:WU01:FS01:Connecting to 128.104.69.82:8080
20:09:07:WU01:FS01:Upload complete
20:09:07:WU01:FS01:Server responded WORK_ACK (400)
20:09:07:WU01:FS01:Cleaning up
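The shorthand used in this thread (e.g. 16780/17/0/107) is just the project, run, clone and gen fields from the client log. A small sketch for pulling it out of a log line when reporting a WU; the helper name `prcg` is made up here, and the pattern only assumes the `key:value` layout visible in the logs above.

```python
import re

# Matches the key:value fields the client prints for a work unit.
UNIT_FIELDS = re.compile(r"\b(project|run|clone|gen):(\d+)")


def prcg(line: str) -> str:
    """Build the project/run/clone/gen shorthand from one client log line."""
    fields = dict(UNIT_FIELDS.findall(line))
    return "/".join(fields[key] for key in ("project", "run", "clone", "gen"))


line = ("20:09:02:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY "
        "project:16780 run:17 clone:0 gen:107 core:0x23 "
        "unit:0x6b00000000000000110000008c410000")
print(prcg(line))
```

A line that lacks one of the four fields raises a `KeyError`, which is probably what you want when scanning logs by hand.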

Ryzen 5800X / RTX 4090 / Windows 11
Ryzen 5600X / RTX 3070 Ti / Ubuntu 22.04
Ryzen 5600 / RTX 3060 Ti / Windows 11
Joe_H
Site Admin
Posts: 7926
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by Joe_H »

I have reported the Project 18237 WU to the researcher. It should have stopped being assigned after multiple failures; it looks like that happened back in October, but it then started being assigned again 2 days ago. The 5 failures on the Project 16780 WU should be enough to automatically keep it from being reassigned. I will check on that in a day or so to see whether that happens.

As for what can differ between WUs of the same project: each Run starts with a different set of initial conditions, so the trajectory calculated from there will be different for each. The final results use statistical analysis to determine the most likely states and the pathways between them. Some trajectories do "blow up" and cannot proceed further.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Nicolas_orleans
Posts: 111
Joined: Wed Aug 08, 2012 3:08 am

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by Nicolas_orleans »

Hi Paul

I have browsed my logs and here is one sample of the first Force RMSE error I saw mid-October. I have dozens like this, only for this particular project.

Code: Select all

17:41:08:I3:WU40:Started FahCore on PID 8605
17:41:09:I1:WU40:*********************** Log Started 2024-10-18T17:41:09Z ***********************
17:41:09:I1:WU40:*************************** Core24 Folding@home Core ***************************
17:41:09:I1:WU40:       Core: Core24
17:41:09:I1:WU40:       Type: 0x24
17:41:09:I1:WU40:    Version: 8.1.4
17:41:09:I1:WU40:     Author: Joseph Coffland <[email protected]>
17:41:09:I1:WU40:  Copyright: 2022 foldingathome.org
17:41:09:I1:WU40:   Homepage: https://foldingathome.org/
17:41:09:I1:WU40:       Date: Jul 25 2024
17:41:09:I1:WU40:       Time: 05:19:51
17:41:09:I1:WU40:   Revision: cf9f0139862b8945a2091772770e4631aac37792
17:41:09:I1:WU40:     Branch: HEAD
17:41:09:I1:WU40:   Compiler: GNU 7.5.0
17:41:09:I1:WU40:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
17:41:09:I1:WU40:             -fdata-sections -O3 -funroll-loops -fno-pie
17:41:09:I1:WU40:             -DOPENMM_VERSION="\"8.1.1\""
17:41:09:I1:WU40:   Platform: linux 6.5.0-1024-azure
17:41:09:I1:WU40:       Bits: 64
17:41:09:I1:WU40:       Mode: Release
17:41:09:I1:WU40:Maintainers: John Chodera <[email protected]> and Peter Eastman
17:41:09:I1:WU40:             <[email protected]>
17:41:09:I1:WU40:       Args: -dir B0nhuCVFSLERWJi2TZDzOpMXnDN1YnKynaIiF7aX4OU -suffix 01
17:41:09:I1:WU40:             -version 8.3.18 -lifeline 1299 -gpu-vendor nvidia -opencl-platform
17:41:09:I1:WU40:             0 -opencl-device 0 -cuda-platform 0 -cuda-device 0 -gpu 0
17:41:09:I1:WU40:************************************ libFAH ************************************
17:41:09:I1:WU40:       Date: Jul 25 2024
17:41:09:I1:WU40:       Time: 05:13:14
17:41:09:I1:WU40:   Revision: c7d2824a47eb025fa8cda8968c7a5e971585d90c
17:41:09:I1:WU40:     Branch: HEAD
17:41:09:I1:WU40:   Compiler: GNU 7.5.0
17:41:09:I1:WU40:    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
17:41:09:I1:WU40:             -fdata-sections -O3 -funroll-loops -fno-pie
17:41:09:I1:WU40:   Platform: linux 6.5.0-1024-azure
17:41:09:I1:WU40:       Bits: 64
17:41:09:I1:WU40:       Mode: Release
17:41:09:I1:WU40:************************************ CBang *************************************
17:41:09:I1:WU40:    Version: 1.7.2
17:41:09:I1:WU40:     Author: Joseph Coffland <[email protected]>
17:41:09:I1:WU40:        Org: Cauldron Development LLC
17:41:09:I1:WU40:  Copyright: Cauldron Development LLC, 2003-2024
17:41:09:I1:WU40:   Homepage: https://cauldrondevelopment.com/
17:41:09:I1:WU40:    License: LGPL-2.1-or-later
17:41:09:I1:WU40:       Date: Jul 25 2024
17:41:09:I1:WU40:       Time: 05:12:47
17:41:09:I1:WU40:   Revision: f1cd4c791e8c40a35dcfeab3ab85d910949cc0cb
17:41:09:I1:WU40:     Branch: HEAD
17:41:09:I1:WU40:   Compiler: GNU 7.5.0
17:41:09:I1:WU40:    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
17:41:09:I1:WU40:             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
17:41:09:I1:WU40:   Platform: linux 6.5.0-1024-azure
17:41:09:I1:WU40:       Bits: 64
17:41:09:I1:WU40:       Mode: Release
17:41:09:I1:WU40:************************************ System ************************************
17:41:09:I1:WU40:        CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
17:41:09:I1:WU40:     CPU ID: GenuineIntel Family 6 Model 58 Stepping 9
17:41:09:I1:WU40:       CPUs: 4
17:41:09:I1:WU40:     Memory: 15.57GiB
17:41:09:I1:WU40:Free Memory: 10.00GiB
17:41:09:I1:WU40: OS Version: 6.8
17:41:09:I1:WU40:Has Battery: false
17:41:09:I1:WU40: On Battery: false
17:41:09:I1:WU40:   Hostname: amandine-MS-7751
17:41:09:I1:WU40: UTC Offset: 2
17:41:09:I1:WU40:        PID: 8605
17:41:09:I1:WU40:        CWD: /var/lib/fah-client/work
17:41:09:I1:WU40:       Exec: /var/lib/fah-client/cores/openmm-core-24/centos-7.9.2009-64bit/release/fahcore-24-centos-7.9.2009-64bit-release-8.1.4/FahCore_24
17:41:09:I1:WU40:************************************ OpenMM ************************************
17:41:09:I1:WU40:    Version: 8.1.1
17:41:09:I1:WU40:********************************************************************************
17:41:09:I1:WU40:Project: 18237 (Run 712, Clone 0, Gen 40)
17:41:09:I1:WU40:Reading tar file core.xml
17:41:09:I1:WU40:Reading tar file integrator.xml
17:41:09:I1:WU40:Reading tar file state.xml.bz2
17:41:09:I1:WU40:Reading tar file system.xml.bz2
17:41:09:I1:WU40:Digital signatures verified
17:41:09:I1:WU40:Folding@home GPU Core24 Folding@home Core
17:41:09:I1:WU40:Version 8.1.4
17:41:09:I1:WU40:  Checkpoint write interval: 50000 steps (2%) [50 total]
17:41:09:I1:WU40:  JSON viewer frame write interval: 25000 steps (1%) [100 total]
17:41:09:I1:WU40:  XTC frame write interval: 10000 steps (0.4%) [250 total]
17:41:09:I1:WU40:  TRR frame write interval: disabled
17:41:09:I1:WU40:  Global context and integrator variables write interval: disabled
17:41:09:I1:WU40:There are 4 platforms available.
17:41:09:I1:WU40:Platform 0: Reference
17:41:09:I1:WU40:Platform 1: CPU
17:41:09:I1:WU40:Platform 2: OpenCL
17:41:09:I1:WU40:  opencl-device 0 specified
17:41:09:I1:WU40:Platform 3: CUDA
17:41:09:I1:WU40:  cuda-device 0 specified
17:41:15:I1:WU40:Attempting to create CUDA context:
17:41:15:I1:WU40:  Configuring platform CUDA
17:41:21:I1:WU40:  Using CUDA on CUDA Platform and gpu 0
17:41:21:I1:WU40:  GPU info: Platform: CUDA
17:41:21:I1:WU40:  GPU info: PlatformIndex: 0
17:41:21:I1:WU40:  GPU info: Device: NVIDIA GeForce RTX 4080 SUPER
17:41:21:I1:WU40:  GPU info: DeviceIndex: 0
17:41:21:I1:WU40:  GPU info: Vendor: 0x10de
17:41:21:I1:WU40:  GPU info: PCI: 01:00:00
17:41:21:I1:WU40:  GPU info: Compute: 8.9
17:41:21:I1:WU40:  GPU info: Driver: 12.4
17:41:21:I1:WU40:  GPU info: GPU: true
17:41:21:I1:WU40:Completed 0 out of 2500000 steps (0%)
17:41:21:I1:WU40:Checkpoint completed at step 0
17:41:54:I1:WU40:Completed 25000 out of 2500000 steps (1%)
17:42:27:I1:WU40:Completed 50000 out of 2500000 steps (2%)
[…]
17:55:32:I1:WU40:Checkpoint completed at step 650000
17:56:04:I1:WU40:Completed 675000 out of 2500000 steps (27%)
17:56:37:I1:WU40:Completed 700000 out of 2500000 steps (28%)
17:56:37:I1:WU40:An exception occurred at step 700000: Force RMSE error of 11.7448 with threshold of 10
17:56:37:I1:WU40:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
17:56:37:I1:WU40:Folding@home Core Shutdown: CORE_RESTART
17:56:38:W :WU40:Core returned CORE_RESTART (98)
17:56:38:I1:Default:Added new work unit: cpus:1 gpus:gpu:01:00:00
17:56:38:I1:WU40:Sending dump report
17:56:38:I1:WU41:Requesting WU assignment for user Nicolas_orleans team 33
17:56:38:I1:OUT14:> POST https://highland1.seas.upenn.edu/api/results HTTP/1.1
17:56:38:I1:OUT15:> POST https://assign5.foldingathome.org/api/assign HTTP/1.1
17:56:38:I1:OUT14:< HTTP/1.1 200 HTTP_OK
17:56:38:I1:WU40:Dumped
17:56:38:I1:OUT15:< HTTP/1.1 200 HTTP_OK
We see here https://apps.foldingathome.org/wu#proje ... e=0&gen=40 that it failed 8 times before being completed. However, 6 of those failures had no runtime, so they could be related to the driver or CUDA 12 not being available, which means that at best it genuinely failed twice before being completed. I will look in the other logs...

I don't know why this happens "only" (for my machine) with this specific project.
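For anyone who wants to tally such failures per project in their own logs, a rough sketch follows. It only knows the three error messages quoted in this thread and assumes the "Project: ..." line appears before the error lines of the same WU, which holds for the logs shown here but may not hold when several WUs log at the same time. The function name is made up for illustration.

```python
import re
from collections import Counter

PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
ERROR_RE = re.compile(r"(Force RMSE error|Potential energy error|Forces are blowing up)")


def count_errors(lines):
    """Count known core error messages per project in client log lines."""
    counts = Counter()
    project = None
    for line in lines:
        match = PROJECT_RE.search(line)
        if match:
            project = match.group(1)  # remember which project the core is running
            continue
        error = ERROR_RE.search(line)
        if error and project is not None:
            counts[(project, error.group(1))] += 1
    return counts


sample = [
    "17:41:09:I1:WU40:Project: 18237 (Run 712, Clone 0, Gen 40)",
    "17:56:37:I1:WU40:An exception occurred at step 700000: Force RMSE error of 11.7448 with threshold of 10",
    "12:30:59:WU00:FS01:0x24:Project: 18237 (Run 1069, Clone 0, Gen 71)",
    "12:31:09:WU00:FS01:0x24:ERROR:Potential energy error of 296.63, threshold of 20",
]
counts = count_errors(sample)
for (project, message), n in counts.items():
    print(f"P{project}: {message} x{n}")
```

To run it on a real log, pass an open file instead of the sample list, e.g. `count_errors(open("log.txt", encoding="utf-8", errors="replace"))`, where `log.txt` stands for wherever your client writes its log.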
PaulTV
Posts: 207
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by PaulTV »

Hey Nicolas

If you frequently see 'attempting to restart from last good checkpoint' and then see the job continue, that may indicate that the rig needs some maintenance (e.g. cleaning) or has a possible hardware issue. I saw those messages now and then on a rig other than my main one, but after a folding pause in the summer and a thorough cleaning, it has been folding fine for the last couple of weeks.

If a job encounters an error too often this way, it'll be dropped (I don't know the threshold).
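To put a number on what one such restart costs, here is a small sketch using the figures from the P18237 log earlier in the thread (checkpoint every 50000 steps, exception at step 700000 of 2500000). It assumes no checkpoint is written at the step where the error is raised; the function name is made up for illustration.

```python
def restart_cost(failed_step: int, checkpoint_interval: int, total_steps: int):
    """Return (last_checkpoint, steps_to_redo, percent_of_wu) for a restart.

    Assumption: no checkpoint is written at the step where the error is raised,
    so the core falls back to the previous multiple of the interval.
    """
    previous = max(failed_step - 1, 0)
    last_checkpoint = (previous // checkpoint_interval) * checkpoint_interval
    steps_to_redo = failed_step - last_checkpoint
    return last_checkpoint, steps_to_redo, 100 * steps_to_redo / total_steps


# Numbers from the P18237 log earlier in the thread.
last, redo, percent = restart_cost(700_000, 50_000, 2_500_000)
print(last, redo, f"{percent:.0f}%")
```

That matches the log, where the last checkpoint shown before the exception is the one at step 650000.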

Hey Joe,

Thanks for that!

Ryzen 5800X / RTX 4090 / Windows 11
Ryzen 5600X / RTX 3070 Ti / Ubuntu 22.04
Ryzen 5600 / RTX 3060 Ti / Windows 11
jjmiller
Scientist
Posts: 136
Joined: Fri Apr 09, 2021 4:43 pm

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by jjmiller »

Hi all,

Thanks for the reports. We're in a bit of a pickle with core 0x24 projects at the moment. As folks have suggested above, after 5 failed attempts a WU will no longer be sent out. Currently on 0x24 projects we're seeing many WUs failing because there's what seems to be a mismatch between how the FAH Client and OpenMM talk to one another. In these cases, the WU fails as FAH Client attempts to initialize the WU, not because the WU itself is bad. The error codes are predominantly one of the following:
  • ERROR:125: Failed to create a GPU-enabled OpenMM context
  • ERROR:126: Neither CUDA nor OpenCL is available
These failures accumulate rapidly and tank otherwise stable projects before any data can be collected. Accordingly, I have been periodically resetting the error counts on my 0x24 projects to try and actually collect data on the WUs that are stable but fell victim to ERROR125/126s. Unfortunately, there are a few WUs that have legitimately reached problematic/unstable states (e.g. 18237/1069/0/71). At the moment, it's very hard on our end to discriminate between legitimate failures and failures that are due to ERROR125/126. We have both the FAH developer and the OpenMM core developers working to get a fix out on this, but it's proven a bit difficult.
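The distinction described above can be sketched roughly as follows. This is not how the work server actually keeps its counts, which this thread does not describe; it only illustrates the idea of leaving ERROR:125/126 initialization failures out of the 5-failure limit, and it assumes the failure message is available for each failed attempt. The function name is made up for illustration.

```python
import re

# Error codes listed above as client/OpenMM initialization failures.
INIT_ERROR_RE = re.compile(r"ERROR:(125|126):")
FAILURE_LIMIT = 5  # "after 5 failed attempts a WU will no longer be sent out"


def should_retire(failure_messages, limit=FAILURE_LIMIT):
    """Retire a WU only if enough failures are not initialization errors."""
    legitimate = [m for m in failure_messages if not INIT_ERROR_RE.search(m)]
    return len(legitimate) >= limit


init_failures = ["ERROR:125: Failed to create a GPU-enabled OpenMM context"] * 4 + [
    "ERROR:126: Neither CUDA nor OpenCL is available"
]
real_failures = ["ERROR:Potential energy error of 296.63, threshold of 20"] * 5

print(should_retire(init_failures))  # five failures, none of them the WU's fault
print(should_retire(real_failures))  # five failures caused by the WU itself
```

With a filter like this, the first WU would stay in circulation and the second would be pulled.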

I'll go in and manually pull 18237/1069/0/71. If folks see other instances of unstable states I'm happy to go in and manually pull them as well. Apologies for the problematic WUs and thanks for folding.
Last edited by jjmiller on Thu Nov 07, 2024 8:34 pm, edited 1 time in total.
Nicolas_orleans
Posts: 111
Joined: Wed Aug 08, 2012 3:08 am

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by Nicolas_orleans »

Hi Paul,

It's a brand new card and, again, the errors occur only with this particular project. I would suspect hardware more if it also happened with the 16 other projects I am currently being assigned, but it only happens with this one.

Regarding Core24, it runs without any error on all P18230 WUs received so far, but not with P18237 on my rig.

I don't want to hijack this thread, but here is a candid question: Core22 scales great on my 4080 Super (like 92-94% GPU utilization), Core23 scales fantastically (96-100% GPU utilization), but Core24 does not (like < 90% GPU utilization most of the time). Is there a reason for that? Is it a "regression" in recent OpenMM versions? Do you see it with your 4090 as well?