Project 13460 - two bad work units

Moderators: Site Moderators, FAHC Science Team

bikeaddict
Posts: 210
Joined: Sun May 03, 2020 1:20 am

Project 13460 - two bad work units

Post by bikeaddict »

Two WUs have failed like this so far this morning, which stops my client from downloading another WU:

09:54:55:WU01:FS01:0x22:ERROR:Force RMSE error of 500834 with threshold of 5
09:54:55:WU01:FS01:0x22:Saving result file ../logfile_01.txt
09:54:55:WU01:FS01:0x22:Saving result file science.log
09:54:55:WU01:FS01:0x22:Saving result file state.xml.bz2
09:54:55:WU01:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
09:54:56:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
09:54:56:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:674 clone:77 gen:0 core:0x22 unit:0x0000004d0000000000003494000002a2

16:14:21:WU01:FS01:0x22:ERROR:Force RMSE error of 3.36234e+06 with threshold of 5
16:14:21:WU01:FS01:0x22:Saving result file ../logfile_01.txt
16:14:21:WU01:FS01:0x22:Saving result file science.log
16:14:21:WU01:FS01:0x22:Saving result file state.xml.bz2
16:14:21:WU01:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
16:14:22:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
16:14:22:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:637 clone:96 gen:0 core:0x22 unit:0x0000006000000000000034940000027d

Edit: Forgot to add that they fail for all clients:

https://apps.foldingathome.org/wu#proje ... e=77&gen=0
https://apps.foldingathome.org/wu#proje ... e=96&gen=0
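For anyone wondering what the "Force RMSE" line means: Core22 sanity-checks the forces it computes against reference values packaged with the WU, and aborts with BAD_WORK_UNIT when the root-mean-square error exceeds a threshold. Here is a rough Python sketch of that kind of check; the function name and data layout are illustrative only, not the actual FahCore code:

```python
import math

# Illustrative only: the real check lives in the C++ FahCore, and the
# actual reference data and units are not visible in the client log.
def force_rmse(reference_forces, computed_forces):
    """Root-mean-square error between two sets of per-atom force vectors."""
    diffs = [r - c
             for ref, com in zip(reference_forces, computed_forces)
             for r, c in zip(ref, com)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

THRESHOLD = 5.0  # the cutoff shown in the log lines above

# One atom whose computed force is wildly off its reference value:
rmse = force_rmse([(0.0, 0.0, 0.0)], [(30.0, 40.0, 0.0)])
if rmse > THRESHOLD:
    print(f"Force RMSE error of {rmse:g} with threshold of {THRESHOLD:g}")
```

An RMSE six orders of magnitude over the threshold, as in the logs above, points at corrupted input data rather than ordinary numerical noise.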
XanderF
Posts: 42
Joined: Thu Aug 11, 2011 12:25 am

Re: Project 13460 - two bad work units

Post by XanderF »

Same here.

16:51:23:WU01:FS00:0x22:Folding@home GPU Core22 Folding@home Core
16:51:23:WU01:FS00:0x22:Version 0.0.18
16:51:23:WU01:FS00:0x22: Checkpoint write interval: 50000 steps (5%) [20 total]
16:51:23:WU01:FS00:0x22: JSON viewer frame write interval: 10000 steps (1%) [100 total]
16:51:23:WU01:FS00:0x22: XTC frame write interval: 250000 steps (25%) [4 total]
16:51:23:WU01:FS00:0x22: Global context and integrator variables write interval: 2500 steps (0.25%) [400 total]
16:51:23:WU01:FS00:0x22:There are 4 platforms available.
16:51:23:WU01:FS00:0x22:Platform 0: Reference
16:51:23:WU01:FS00:0x22:Platform 1: CPU
16:51:23:WU01:FS00:0x22:Platform 2: OpenCL
16:51:23:WU01:FS00:0x22: opencl-device 0 specified
16:51:23:WU01:FS00:0x22:Platform 3: CUDA
16:51:23:WU01:FS00:0x22: cuda-device 0 specified
16:51:23:WU00:FS00:Upload complete
16:51:23:WU00:FS00:Server responded WORK_ACK (400)
16:51:23:WU00:FS00:Cleaning up
16:51:33:WU01:FS00:0x22:Attempting to create CUDA context:
16:51:33:WU01:FS00:0x22: Configuring platform CUDA
16:51:37:WU01:FS00:0x22:ERROR:Force RMSE error of 556590 with threshold of 5
16:51:37:WU01:FS00:0x22:Saving result file ..\logfile_01.txt
16:51:37:WU01:FS00:0x22:Saving result file science.log
16:51:37:WU01:FS00:0x22:Saving result file state.xml.bz2
16:51:37:WU01:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
16:51:38:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
16:51:38:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:661 clone:68 gen:0 core:0x22 unit:0x00000044000000000000349400000295

Geforce 3060, nVidia driver 496.49
JimboPalmer
Posts: 2522
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: Project 13460 - two bad work units

Post by JimboPalmer »

There is a new Nvidia Driver available today.

Neither of you mentions updating to it or shows the configuration portion of your log, but it's a possible cause.
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
HaloJones
Posts: 906
Joined: Thu Jul 24, 2008 10:16 am

Re: Project 13460 - two bad work units

Post by HaloJones »

I'm getting failures on this project as well, including on Linux machines that do not update GeForce drivers automatically. Restarting the client/machine starts folding successfully again.
single 1070

pcwolf
Posts: 62
Joined: Fri Apr 03, 2020 4:49 pm
Hardware configuration: Manjaro Linux - AsRock B550 Taichi - Ryzen 5950X - NVidia RTX 4070ti
FAH v8-4.3
Location: Yorktown, Virginia, USA

Re: Project 13460 - two bad work units

Post by pcwolf »

I woke up to two WUs on separate Nvidia cards (Manjaro Linux, Nvidia drivers) showing FAILED in red.

Restarting the foldingathome service in systemd forced a download of new WUs, which began folding.
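On systemd distros that restart is a one-liner; the exact unit name varies by package and install, so the names below are assumptions to check rather than something taken from the thread:

```shell
# Find the client's service unit first; the name differs between packages.
systemctl list-units --type=service --all | grep -iE 'fah|folding'

# Restart it, substituting the unit name found above:
sudo systemctl restart foldingathome
```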
XanderF
Posts: 42
Joined: Thu Aug 11, 2011 12:25 am

Re: Project 13460 - two bad work units

Post by XanderF »

HaloJones wrote:I'm getting failures on this project as well including Linux machines that do not update Geforce drivers automatically. Restarting the client/machine starts folding successfully again.
Restarting did not resolve the issue for me on this project. I had my preferences set to COVID research, which this project is, which probably explains why I keep getting these WUs.

I then tried updating the Nvidia drivers to today's release (511.23). This SEEMS to be working - I got this project again, and I'm 3% in so far with no errors (previously it was crashing out immediately). So that's good. However... I dunno about these drivers. I've only had a couple of projects on them so far, but my PPD has more than halved. Could it just be the projects I'm getting? Not sure. Oddly, my GPU is showing full utilization, but its temperature is WAY lower than expected - my fans are barely running now. (Just an anecdotal note so far - will keep an eye on it over the next few days.)
bikeaddict
Posts: 210
Joined: Sun May 03, 2020 1:20 am

Re: Project 13460 - two bad work units

Post by bikeaddict »

My drivers have not changed, still 495.46 on Linux. After switching my cause preference away from COVID, the client downloaded an Alzheimer's project and stopped giving errors. Now I've switched back and have two project 13460 WUs processing OK.
gunnarre
Posts: 559
Joined: Sun May 24, 2020 7:23 pm
Location: Norway

Re: Project 13460 - two bad work units

Post by gunnarre »

Same problem on Linux Mint and Nvidia driver 470.86

Code: Select all

*********************** Log Started 2022-01-13T06:05:51Z ***********************
06:05:51:******************************* libFAH ********************************
06:05:51:           Date: Oct 20 2020
06:05:51:           Time: 20:36:39
06:05:51:       Revision: 5ca109d295a6245e2a2f590b3d0085ad5e567aeb
06:05:51:         Branch: master
06:05:51:       Compiler: GNU 8.3.0
06:05:51:        Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
06:05:51:                 -fdata-sections -O3 -funroll-loops -fno-pie
06:05:51:       Platform: linux2 5.8.0-1-amd64
06:05:51:           Bits: 64
06:05:51:           Mode: Release
06:05:51:****************************** FAHClient ******************************
06:05:51:        Version: 7.6.21
06:05:51:         Author: Joseph Coffland <[email protected]>
06:05:51:      Copyright: 2020 foldingathome.org
06:05:51:       Homepage: https://foldingathome.org/
06:05:51:           Date: Oct 20 2020
06:05:51:           Time: 20:39:00
06:05:51:       Revision: 6efbf0e138e22d3963e6a291f78dcb9c6422a278
06:05:51:         Branch: master
06:05:51:       Compiler: GNU 8.3.0
06:05:51:        Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
06:05:51:                 -fdata-sections -O3 -funroll-loops -fno-pie
06:05:51:       Platform: linux2 5.8.0-1-amd64
06:05:51:           Bits: 64
06:05:51:           Mode: Release
06:05:51:           Args: --child /etc/fahclient/config.xml
06:05:51:                 --pid-file=/var/run/fahclient/fahclient.pid --daemon
06:05:51:         Config: /etc/fahclient/config.xml
06:05:51:******************************** CBang ********************************
06:05:51:           Date: Oct 20 2020
06:05:51:           Time: 18:37:59
06:05:51:       Revision: 7e4ce85225d7eaeb775e87c31740181ca603de60
06:05:51:         Branch: master
06:05:51:       Compiler: GNU 8.3.0
06:05:51:        Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
06:05:51:                 -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
06:05:51:       Platform: linux2 5.8.0-1-amd64
06:05:51:           Bits: 64
06:05:51:           Mode: Release
06:05:51:******************************* System ********************************
06:05:51:            CPU: AMD Ryzen 5 3600 6-Core Processor
06:05:51:         CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
06:05:51:           CPUs: 12
06:05:51:         Memory: 15.54GiB
06:05:51:    Free Memory: 2.50GiB
06:05:51:        Threads: POSIX_THREADS
06:05:51:     OS Version: 5.11
06:05:51:    Has Battery: false
06:05:51:     On Battery: false
06:05:51:     UTC Offset: 1
06:05:51:            PID: 23403
06:05:51:            CWD: /var/lib/fahclient
06:05:51:             OS: Linux 5.11.0-46-generic x86_64
06:05:51:        OS Arch: AMD64
06:05:51:           GPUs: 1
06:05:51:          GPU 0: Bus:8 Slot:0 Func:0 NVIDIA:7 TU116 [GeForce GTX 1660 SUPER]
06:05:51:  CUDA Device 0: Platform:0 Device:0 Bus:8 Slot:0 Compute:7.5 Driver:11.4
06:05:51:OpenCL Device 0: Platform:0 Device:0 Bus:8 Slot:0 Compute:3.0 Driver:470.86
06:05:51:***********************************************************************
.....
18:12:23:WU01:FS01:0x22:************************************ OpenMM ************************************
18:12:23:WU01:FS01:0x22:    Version: 7.6.0
18:12:23:WU01:FS01:0x22:********************************************************************************
18:12:23:WU01:FS01:0x22:Project: 13460 (Run 677, Clone 30, Gen 0)
18:12:23:WU01:FS01:0x22:Unit: 0x00000000000000000000000000000000
18:12:23:WU01:FS01:0x22:Reading tar file core.xml
18:12:23:WU01:FS01:0x22:Reading tar file integrator.xml.bz2
18:12:23:WU01:FS01:0x22:Reading tar file state.xml.bz2
18:12:23:WU01:FS01:0x22:Reading tar file system.xml.bz2
18:12:23:WU01:FS01:0x22:Digital signatures verified
18:12:23:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
18:12:23:WU01:FS01:0x22:Version 0.0.18
18:12:23:WU01:FS01:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
18:12:23:WU01:FS01:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
18:12:23:WU01:FS01:0x22:  XTC frame write interval: 250000 steps (25%) [4 total]
18:12:23:WU01:FS01:0x22:  Global context and integrator variables write interval: 2500 steps (0.25%) [400 total]
18:12:23:WU01:FS01:0x22:There are 4 platforms available.
18:12:23:WU01:FS01:0x22:Platform 0: Reference
18:12:23:WU01:FS01:0x22:Platform 1: CPU
18:12:23:WU01:FS01:0x22:Platform 2: OpenCL
18:12:23:WU01:FS01:0x22:  opencl-device 0 specified
18:12:23:WU01:FS01:0x22:Platform 3: CUDA
18:12:23:WU01:FS01:0x22:  cuda-device 0 specified
18:12:27:WU01:FS01:0x22:Attempting to create CUDA context:
18:12:27:WU01:FS01:0x22:  Configuring platform CUDA
18:12:30:WU01:FS01:0x22:ERROR:Force RMSE error of 557533 with threshold of 5
18:12:30:WU01:FS01:0x22:Saving result file ../logfile_01.txt
18:12:30:WU01:FS01:0x22:Saving result file science.log
18:12:30:WU01:FS01:0x22:Saving result file state.xml.bz2
18:12:30:WU01:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
18:12:31:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:12:31:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:677 clone:30 gen:0 core:0x22 unit:0x0000001e0000000000003494000002a5
....
18:12:59:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:12:59:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13460 run:677 clone:88 gen:0 core:0x22 unit:0x000000580000000000003494000002a5
....
18:13:31:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:13:31:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13460 run:677 clone:11 gen:0 core:0x22 unit:0x0000000b0000000000003494000002a5
.....
18:13:57:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:13:57:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13460 run:677 clone:87 gen:0 core:0x22 unit:0x000000570000000000003494000002a5
18:14:24:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:14:25:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:676 clone:91 gen:0 core:0x22 unit:0x0000005b0000000000003494000002a4
.....
18:14:53:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:14:53:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13460 run:676 clone:77 gen:0 core:0x22 unit:0x0000004d0000000000003494000002a4
......
18:15:22:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:15:22:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13460 run:677 clone:92 gen:0 core:0x22 unit:0x0000005c0000000000003494000002a5
....
18:15:49:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:15:49:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13460 run:678 clone:57 gen:0 core:0x22 unit:0x000000390000000000003494000002a6
......
18:16:16:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:16:16:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:677 clone:99 gen:0 core:0x22 unit:0x000000630000000000003494000002a5
......
18:16:40:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:16:40:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13460 run:678 clone:9 gen:0 core:0x22 unit:0x000000090000000000003494000002a6

Restarting the client made it pick up a WU from project 13460 run 838 without issues.

Edit: These seem to fail on all GPUs, both AMD and Nvidia.
Online: GTX 1660 Super + occasional CPU folding in the cold.
Offline: Radeon HD 7770, GTX 1050 Ti 4G OC, RX580
HaloJones
Posts: 906
Joined: Thu Jul 24, 2008 10:16 am

Re: Project 13460 - two bad work units

Post by HaloJones »

Agreed, this is not a driver or OS issue. What's consistent is that a failure stops the client instead of it just getting a new work unit. Pausing doesn't work; the whole client needs to be restarted.
single 1070

toTOW
Site Moderator
Posts: 6359
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France
Contact:

Re: Project 13460 - two bad work units

Post by toTOW »

Yes, there's a batch of bad p13460 WUs ... :(

Until John Chodera can have a look at it, there are two things you can do:
- if you want to avoid this project, select a cause preference other than COVID
- if a slot is in the FAILED state, it will remain like that for 24 hours. You can Pause/Fold the slot to resume folding before the 24-hour delay expires.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: Project 13460 - two bad work units

Post by JohnChodera »

Folks: Huge apologies for the issues here.

An unexpected event appears to have corrupted some of the `system` or `state` XML files that are packaged up into WUs of some RUNs < 1000. These were not intended to be sent out, but WUs were sent out when I increased the number of CLONEs without realizing this had happened.

Fortunately, we had already collected most of the data we needed for these, so I have restricted the number of CLONEs back to 60. Only RUNs > 1000 should be going out now, and I am monitoring to see if these are corrupted---these should be OK.

Again, huge apologies for the disruptions here. We still managed to get a ton of useful data! I'll post the results of the dashboards this weekend.

~ John Chodera // MSKCC
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: Project 13460 - two bad work units

Post by JohnChodera »

Update: It looks like I can't prevent the corrupted RUNs from being sent out, so I am halting the project.

We still got a large amount of useful data, and I will set up and test Sprint 12 over the weekend so we're ready to launch Monday!

Enormous thanks, all!

~ John Chodera // MSKCC
gordonbb
Posts: 511
Joined: Mon May 21, 2018 4:12 pm
Hardware configuration: Ubuntu 22.04.2 LTS; NVidia 525.60.11; 2 x 4070ti; 4070; 4060ti; 3x 3080; 3070ti; 3070
Location: Great White North

Re: Project 13460 - two bad work units

Post by gordonbb »

JohnChodera wrote:Update: It looks like I can't halt the corrupted RUNs from being sent out, so I am halting the project.

We still got a large amount of useful data, and I will set up and test Sprint 12 over the weekend so we're ready to launch Monday!

Enormous thanks, all!

~ John Chodera // MSKCC
And an enormous thanks to you and your team. We eagerly await the opportunity to help, and I think I can say we can accept the odd hiccup along the way.