Project 13460 - two bad work units

Posted: Fri Jan 14, 2022 4:48 pm
by bikeaddict
Two WUs failed this morning with the errors below, and each failure stops my client from downloading another WU:

09:54:55:WU01:FS01:0x22:ERROR:Force RMSE error of 500834 with threshold of 5
09:54:55:WU01:FS01:0x22:Saving result file ../logfile_01.txt
09:54:55:WU01:FS01:0x22:Saving result file science.log
09:54:55:WU01:FS01:0x22:Saving result file state.xml.bz2
09:54:55:WU01:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
09:54:56:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
09:54:56:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:674 clone:77 gen:0 core:0x22 unit:0x0000004d0000000000003494000002a2

16:14:21:WU01:FS01:0x22:ERROR:Force RMSE error of 3.36234e+06 with threshold of 5
16:14:21:WU01:FS01:0x22:Saving result file ../logfile_01.txt
16:14:21:WU01:FS01:0x22:Saving result file science.log
16:14:21:WU01:FS01:0x22:Saving result file state.xml.bz2
16:14:21:WU01:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
16:14:22:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
16:14:22:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:637 clone:96 gen:0 core:0x22 unit:0x0000006000000000000034940000027d

Edit: Forgot to add that they fail for all clients:

https://apps.foldingathome.org/wu#proje ... e=77&gen=0
https://apps.foldingathome.org/wu#proje ... e=96&gen=0
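
For anyone comparing reports, the `Sending unit results` lines above carry the project, run, clone, and gen (PRCG) that identify each WU. The sketch below pulls those fields out with a regular expression. It also checks a pattern that holds for the log lines quoted in this thread but is only an assumption beyond them: the 32-digit `unit` value looks like four 8-digit hex fields holding clone, gen, project, and run, in that order. The names `parse_result_line` and `decode_unit` are made up for this example and are not part of the client.

```python
import re

# Matches the PRCG fields and unit ID in a "Sending unit results" log line.
RESULT_RE = re.compile(
    r"error:(?P<error>\w+) project:(?P<project>\d+) run:(?P<run>\d+) "
    r"clone:(?P<clone>\d+) gen:(?P<gen>\d+) core:\S+ "
    r"unit:0x(?P<unit>[0-9a-fA-F]{32})"
)

def parse_result_line(line):
    """Return a dict of fields from a result line, or None if it does not match."""
    m = RESULT_RE.search(line)
    if m is None:
        return None
    fields = {k: int(m.group(k)) for k in ("project", "run", "clone", "gen")}
    fields["error"] = m.group("error")
    fields["unit"] = m.group("unit")
    return fields

def decode_unit(unit_hex):
    """Split the unit ID into four 8-digit hex fields.

    Assumed layout (inferred only from the logs in this thread):
    clone, gen, project, run.
    """
    parts = [int(unit_hex[i:i + 8], 16) for i in range(0, 32, 8)]
    return dict(zip(("clone", "gen", "project", "run"), parts))

line = (
    "09:54:56:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY "
    "project:13460 run:674 clone:77 gen:0 core:0x22 "
    "unit:0x0000004d0000000000003494000002a2"
)
fields = parse_result_line(line)
decoded = decode_unit(fields["unit"])
print(fields)
print(decoded)
```

If the decoded values ever disagree with the PRCG printed on the same line, the assumed layout is wrong for that core or client version and the printed PRCG should be trusted instead.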

Re: Project 13460 - two bad work units

Posted: Fri Jan 14, 2022 4:54 pm
by XanderF
Same here.

16:51:23:WU01:FS00:0x22:Folding@home GPU Core22 Folding@home Core
16:51:23:WU01:FS00:0x22:Version 0.0.18
16:51:23:WU01:FS00:0x22: Checkpoint write interval: 50000 steps (5%) [20 total]
16:51:23:WU01:FS00:0x22: JSON viewer frame write interval: 10000 steps (1%) [100 total]
16:51:23:WU01:FS00:0x22: XTC frame write interval: 250000 steps (25%) [4 total]
16:51:23:WU01:FS00:0x22: Global context and integrator variables write interval: 2500 steps (0.25%) [400 total]
16:51:23:WU01:FS00:0x22:There are 4 platforms available.
16:51:23:WU01:FS00:0x22:Platform 0: Reference
16:51:23:WU01:FS00:0x22:Platform 1: CPU
16:51:23:WU01:FS00:0x22:Platform 2: OpenCL
16:51:23:WU01:FS00:0x22: opencl-device 0 specified
16:51:23:WU01:FS00:0x22:Platform 3: CUDA
16:51:23:WU01:FS00:0x22: cuda-device 0 specified
16:51:23:WU00:FS00:Upload complete
16:51:23:WU00:FS00:Server responded WORK_ACK (400)
16:51:23:WU00:FS00:Cleaning up
16:51:33:WU01:FS00:0x22:Attempting to create CUDA context:
16:51:33:WU01:FS00:0x22: Configuring platform CUDA
16:51:37:WU01:FS00:0x22:ERROR:Force RMSE error of 556590 with threshold of 5
16:51:37:WU01:FS00:0x22:Saving result file ..\logfile_01.txt
16:51:37:WU01:FS00:0x22:Saving result file science.log
16:51:37:WU01:FS00:0x22:Saving result file state.xml.bz2
16:51:37:WU01:FS00:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
16:51:38:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
16:51:38:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:661 clone:68 gen:0 core:0x22 unit:0x00000044000000000000349400000295

Geforce 3060, nVidia driver 496.49

Re: Project 13460 - two bad work units

Posted: Fri Jan 14, 2022 5:14 pm
by JimboPalmer
There is a new Nvidia Driver available today.

Neither of you mentions updating to it, and neither shows the configuration portion of the log, but the driver is a possible cause.

Re: Project 13460 - two bad work units

Posted: Fri Jan 14, 2022 7:15 pm
by HaloJones
I'm getting failures on this project as well, including on Linux machines that do not update GeForce drivers automatically. Restarting the client/machine starts folding successfully again.

Re: Project 13460 - two bad work units

Posted: Fri Jan 14, 2022 7:30 pm
by pcwolf
I woke up to two WUs, on separate Nvidia cards running Manjaro Linux with Nvidia drivers, showing FAILED in red.

Restarting the foldingathome service in systemd forced a download of new WUs, which began folding.

Re: Project 13460 - two bad work units

Posted: Fri Jan 14, 2022 8:38 pm
by XanderF
HaloJones wrote: I'm getting failures on this project as well, including on Linux machines that do not update GeForce drivers automatically. Restarting the client/machine starts folding successfully again.
Restarting did not resolve the issue for me on this project. I had my preference set to COVID research, which this project is, so that probably explains why I keep getting them.

I then tried updating the nVidia drivers to today's release (511.23). This SEEMS to be working - I got this project again, and so far I'm 3% into the WU with no errors (previously it was crashing out immediately). So that's good. However...I dunno about these drivers. I've only had a couple of projects on them so far, but my PPD has dropped by more than half. Could just be the projects I'm getting? Not sure. Oddly, my GPU is showing full utilization, but its temperature is WAY lower than expected - my fans are barely running now. (Just an anecdotal note so far - will keep an eye on it over the next few days.)

Re: Project 13460 - two bad work units

Posted: Fri Jan 14, 2022 8:53 pm
by bikeaddict
My drivers have not changed, still 495.46 on Linux. After I switched my cause preference away from COVID, the client downloaded an Alzheimer's project and stopped giving errors. Now I've switched back and have two project 13460 WUs processing OK.

Re: Project 13460 - two bad work units

Posted: Fri Jan 14, 2022 9:35 pm
by gunnarre
Same problem on Linux Mint and Nvidia driver 470.86

*********************** Log Started 2022-01-13T06:05:51Z ***********************
06:05:51:******************************* libFAH ********************************
06:05:51:           Date: Oct 20 2020
06:05:51:           Time: 20:36:39
06:05:51:       Revision: 5ca109d295a6245e2a2f590b3d0085ad5e567aeb
06:05:51:         Branch: master
06:05:51:       Compiler: GNU 8.3.0
06:05:51:        Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
06:05:51:                 -fdata-sections -O3 -funroll-loops -fno-pie
06:05:51:       Platform: linux2 5.8.0-1-amd64
06:05:51:           Bits: 64
06:05:51:           Mode: Release
06:05:51:****************************** FAHClient ******************************
06:05:51:        Version: 7.6.21
06:05:51:         Author: Joseph Coffland <[email protected]>
06:05:51:      Copyright: 2020 foldingathome.org
06:05:51:       Homepage: https://foldingathome.org/
06:05:51:           Date: Oct 20 2020
06:05:51:           Time: 20:39:00
06:05:51:       Revision: 6efbf0e138e22d3963e6a291f78dcb9c6422a278
06:05:51:         Branch: master
06:05:51:       Compiler: GNU 8.3.0
06:05:51:        Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
06:05:51:                 -fdata-sections -O3 -funroll-loops -fno-pie
06:05:51:       Platform: linux2 5.8.0-1-amd64
06:05:51:           Bits: 64
06:05:51:           Mode: Release
06:05:51:           Args: --child /etc/fahclient/config.xml
06:05:51:                 --pid-file=/var/run/fahclient/fahclient.pid --daemon
06:05:51:         Config: /etc/fahclient/config.xml
06:05:51:******************************** CBang ********************************
06:05:51:           Date: Oct 20 2020
06:05:51:           Time: 18:37:59
06:05:51:       Revision: 7e4ce85225d7eaeb775e87c31740181ca603de60
06:05:51:         Branch: master
06:05:51:       Compiler: GNU 8.3.0
06:05:51:        Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
06:05:51:                 -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
06:05:51:       Platform: linux2 5.8.0-1-amd64
06:05:51:           Bits: 64
06:05:51:           Mode: Release
06:05:51:******************************* System ********************************
06:05:51:            CPU: AMD Ryzen 5 3600 6-Core Processor
06:05:51:         CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
06:05:51:           CPUs: 12
06:05:51:         Memory: 15.54GiB
06:05:51:    Free Memory: 2.50GiB
06:05:51:        Threads: POSIX_THREADS
06:05:51:     OS Version: 5.11
06:05:51:    Has Battery: false
06:05:51:     On Battery: false
06:05:51:     UTC Offset: 1
06:05:51:            PID: 23403
06:05:51:            CWD: /var/lib/fahclient
06:05:51:             OS: Linux 5.11.0-46-generic x86_64
06:05:51:        OS Arch: AMD64
06:05:51:           GPUs: 1
06:05:51:          GPU 0: Bus:8 Slot:0 Func:0 NVIDIA:7 TU116 [GeForce GTX 1660 SUPER]
06:05:51:  CUDA Device 0: Platform:0 Device:0 Bus:8 Slot:0 Compute:7.5 Driver:11.4
06:05:51:OpenCL Device 0: Platform:0 Device:0 Bus:8 Slot:0 Compute:3.0 Driver:470.86
06:05:51:***********************************************************************
.....
18:12:23:WU01:FS01:0x22:************************************ OpenMM ************************************
18:12:23:WU01:FS01:0x22:    Version: 7.6.0
18:12:23:WU01:FS01:0x22:********************************************************************************
18:12:23:WU01:FS01:0x22:Project: 13460 (Run 677, Clone 30, Gen 0)
18:12:23:WU01:FS01:0x22:Unit: 0x00000000000000000000000000000000
18:12:23:WU01:FS01:0x22:Reading tar file core.xml
18:12:23:WU01:FS01:0x22:Reading tar file integrator.xml.bz2
18:12:23:WU01:FS01:0x22:Reading tar file state.xml.bz2
18:12:23:WU01:FS01:0x22:Reading tar file system.xml.bz2
18:12:23:WU01:FS01:0x22:Digital signatures verified
18:12:23:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
18:12:23:WU01:FS01:0x22:Version 0.0.18
18:12:23:WU01:FS01:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
18:12:23:WU01:FS01:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
18:12:23:WU01:FS01:0x22:  XTC frame write interval: 250000 steps (25%) [4 total]
18:12:23:WU01:FS01:0x22:  Global context and integrator variables write interval: 2500 steps (0.25%) [400 total]
18:12:23:WU01:FS01:0x22:There are 4 platforms available.
18:12:23:WU01:FS01:0x22:Platform 0: Reference
18:12:23:WU01:FS01:0x22:Platform 1: CPU
18:12:23:WU01:FS01:0x22:Platform 2: OpenCL
18:12:23:WU01:FS01:0x22:  opencl-device 0 specified
18:12:23:WU01:FS01:0x22:Platform 3: CUDA
18:12:23:WU01:FS01:0x22:  cuda-device 0 specified
18:12:27:WU01:FS01:0x22:Attempting to create CUDA context:
18:12:27:WU01:FS01:0x22:  Configuring platform CUDA
18:12:30:WU01:FS01:0x22:ERROR:Force RMSE error of 557533 with threshold of 5
18:12:30:WU01:FS01:0x22:Saving result file ../logfile_01.txt
18:12:30:WU01:FS01:0x22:Saving result file science.log
18:12:30:WU01:FS01:0x22:Saving result file state.xml.bz2
18:12:30:WU01:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
18:12:31:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:12:31:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:677 clone:30 gen:0 core:0x22 unit:0x0000001e0000000000003494000002a5
....
18:12:59:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:12:59:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13460 run:677 clone:88 gen:0 core:0x22 unit:0x000000580000000000003494000002a5
....
18:13:31:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:13:31:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13460 run:677 clone:11 gen:0 core:0x22 unit:0x0000000b0000000000003494000002a5
.....
18:13:57:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:13:57:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13460 run:677 clone:87 gen:0 core:0x22 unit:0x000000570000000000003494000002a5
18:14:24:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:14:25:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:676 clone:91 gen:0 core:0x22 unit:0x0000005b0000000000003494000002a4
.....
18:14:53:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:14:53:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13460 run:676 clone:77 gen:0 core:0x22 unit:0x0000004d0000000000003494000002a4
......
18:15:22:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:15:22:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13460 run:677 clone:92 gen:0 core:0x22 unit:0x0000005c0000000000003494000002a5
....
18:15:49:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:15:49:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13460 run:678 clone:57 gen:0 core:0x22 unit:0x000000390000000000003494000002a6
......
18:16:16:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:16:16:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:677 clone:99 gen:0 core:0x22 unit:0x000000630000000000003494000002a5
......
18:16:40:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:16:40:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:13460 run:678 clone:9 gen:0 core:0x22 unit:0x000000090000000000003494000002a6

Restarting the client made it pick up a WU from project 13460 run 838 without issues.

Edit: These seem to fail on all GPUs, both AMD and Nvidia.
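
With this many failures in a row, it can help to tally them by run before reporting. The following is a minimal sketch, assuming the log uses the same `error:FAULTY project:... run:... clone:...` wording as the excerpt above. The function name `count_faulty_by_run` is invented for illustration, and the sample text is four lines copied from that excerpt.

```python
import re
from collections import Counter

# Captures project, run, and clone from FAULTY result lines.
FAULTY_RE = re.compile(r"error:FAULTY project:(\d+) run:(\d+) clone:(\d+)")

def count_faulty_by_run(log_text):
    """Count FAULTY results per (project, run) pair found in the log text."""
    counts = Counter()
    for project, run, _clone in FAULTY_RE.findall(log_text):
        counts[(int(project), int(run))] += 1
    return counts

sample = """\
18:12:31:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:677 clone:30 gen:0 core:0x22 unit:0x0000001e0000000000003494000002a5
18:12:59:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13460 run:677 clone:88 gen:0 core:0x22 unit:0x000000580000000000003494000002a5
18:14:25:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13460 run:676 clone:91 gen:0 core:0x22 unit:0x0000005b0000000000003494000002a4
18:15:49:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13460 run:678 clone:57 gen:0 core:0x22 unit:0x000000390000000000003494000002a6
"""

counts = count_faulty_by_run(sample)
for (project, run), n in sorted(counts.items()):
    print(f"project {project} run {run}: {n} faulty")
```

To run it against a full log, pass the file's contents instead of `sample`. The log location depends on the install and is not assumed here.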

Re: Project 13460 - two bad work units

Posted: Fri Jan 14, 2022 10:01 pm
by HaloJones
Agreed, this is not a driver or OS issue. What's consistent is that a failure stops the client instead of it just getting a new work unit. Pausing doesn't work; the whole client needs to be restarted.

Re: Project 13460 - two bad work units

Posted: Fri Jan 14, 2022 10:36 pm
by toTOW
Yes, there's a batch of bad p13460 WUs ... :(

Until John Chodera can have a look at it, there are two things you can do:
- if you want to avoid this project, select something other than COVID for your cause preference
- if the slot is in the FAILED state, it will remain like this for 24 hours. You can Pause/Fold the slot to resume folding before the 24-hour delay expires.

Re: Project 13460 - two bad work units

Posted: Fri Jan 14, 2022 11:40 pm
by JohnChodera
Folks: Huge apologies for the issues here.

An unexpected event appears to have corrupted some of the `system` or `state` XML files that are packaged up into WUs of some RUNs < 1000. These were not intended to be sent out, but WUs were sent out when I increased the number of CLONEs without realizing this had happened.

Fortunately, we had already collected most of the data we needed for these, so I have restricted the number of CLONEs back to 60. Only RUNs > 1000 should be going out now, and I am monitoring to see if these are corrupted---these should be OK.

Again, huge apologies for the disruptions here. We still managed to get a ton of useful data! I'll post the results of the dashboards this weekend.

~ John Chodera // MSKCC

Re: Project 13460 - two bad work units

Posted: Sat Jan 15, 2022 12:06 am
by JohnChodera
Update: It looks like I can't halt the corrupted RUNs from being sent out, so I am halting the project.

We still got a large amount of useful data, and I will set up and test Sprint 12 over the weekend so we're ready to launch Monday!

Enormous thanks, all!

~ John Chodera // MSKCC

Re: Project 13460 - two bad work units

Posted: Sat Jan 15, 2022 4:15 am
by gordonbb
JohnChodera wrote: Update: It looks like I can't halt the corrupted RUNs from being sent out, so I am halting the project.

We still got a large amount of useful data, and I will set up and test Sprint 12 over the weekend so we're ready to launch Monday!

Enormous thanks, all!

~ John Chodera // MSKCC
And an enormous thanks to you and your team. We eagerly await the opportunity to help and I think I can say we can accept the odd hiccup along the path.