CPU Folding: BAD_FRAME_CHECKSUM on 6 cores

sgnsajgon
Posts: 3
Joined: Thu May 07, 2020 1:15 am

CPU Folding: BAD_FRAME_CHECKSUM on 6 cores

Post by sgnsajgon »

Hello,

I reported this issue on GitHub and was asked to post it here.

FAHClient 7.6.9

I'm folding on Google Cloud Platform, using preemptible virtual machines with persistent storage.
Such a VM can be killed at any time without a graceful shutdown, in which case FAHClient is forcibly killed. After some minutes the VM is spawned again, and folding resumes from the checkpoint.

In the past I used 2 VMs: the first with 1 vCPU for CPU folding, the second with 1 vCPU and a GPU for GPU folding. These VMs worked correctly: preemption, checkpoint restore and work resumption all worked without errors.

Now I'm using only one VM with 8 vCPUs and a GPU, folding on 2 slots:

1) CPU slot utilizing 6 vCPUs.
2) GPU slot.

The slots fold as intended, but after preemption and resumption the GPU slot resumes correctly, while the CPU slot always fails in the same way:

Code:

10:16:28: CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
10:16:28: CPU ID: GenuineIntel Family 6 Model 79 Stepping 0
10:16:28: CPUs: 8
10:16:28: Memory: 29.45GiB
10:16:28: Free Memory: 27.47GiB
10:16:28: Threads: POSIX_THREADS
10:16:28: OS Version: 4.19
10:16:28: Has Battery: false
10:16:28: On Battery: false
10:16:28: UTC Offset: 0
10:16:28: PID: 10
10:16:28: CWD: /var/lib/fahclient
10:16:28: OS: Linux 4.19.112+ x86_64
10:16:28: OS Arch: AMD64
10:16:28: GPUs: 1
10:16:28: GPU 0: Bus:0 Slot:4 Func:0 NVIDIA:7 TU104GL [Tesla T4]
10:16:28: CUDA Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:7.5 Driver:10.1
10:16:28:OpenCL Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:1.2 Driver:418.67
(...)
10:16:29:WU02:FS00:0xa7:ERROR:Guru Meditation #885995a80cc46232.818eb4166c330456 (5455872.5459308) '02/01/state.cpt'
10:16:29:WU02:FS00:0xa7:WARNING:Unexpected exit() call
10:16:29:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
10:16:29:WU02:FS00:0xa7:Saving result file ../logfile_01.txt
10:16:29:WU02:FS00:0xa7:Saving result file frame51.trr
10:16:29:WU02:FS00:0xa7:ERROR:Guru Meditation #0.d6a0f109730da89c (0.5457120) '02/01/frame51.trr'
10:16:34:WARNING:WU02:FS00:FahCore returned: BAD_FRAME_CHECKSUM (112 = 0x70)
10:16:34:WARNING:WU02:FS00:Fatal error, dumping
10:16:34:WU02:FS00:Sending unit results: id:02 state:SEND error:DUMPED project:14570 run:0 clone:1034 gen:51 core:0xa7 unit:0x00000040287234c95e7ee8a1d36c7740
10:16:34:WU02:FS00:Connecting to 40.114.52.201:8080
10:16:34:WU00:FS00:Connecting to 65.254.110.245:80
10:16:35:WU02:FS00:Server responded WORK_ACK (400)
10:16:35:WU02:FS00:Cleaning up
sgnsajgon
Posts: 3
Joined: Thu May 07, 2020 1:15 am

Re: CPU Folding: BAD_FRAME_CHECKSUM on 6 cores

Post by sgnsajgon »

I have reproduced this issue using a VM with 8 vCPUs and no GPU. All vCPUs work as a single folding slot.
uyaem
Posts: 219
Joined: Sat Mar 21, 2020 7:35 pm
Location: Esslingen, Germany

Re: CPU Folding: BAD_FRAME_CHECKSUM on 6 cores

Post by uyaem »

My guess is that if the VM is preempted while the client is writing a checkpoint, the work is lost (corrupt file).

I doubt that this is something that can be addressed by software, much like you cannot address someone pulling the power cable.
The only option might be to write each checkpoint to a separate file and keep a series of checkpoints that the cores can revert to, but that seems like an awful lot of overhead for a rare scenario.
CPU: Ryzen 9 3900X (1x21 CPUs) ~ GPU: nVidia GeForce GTX 1660 Super (Asus)
Rel25917
Posts: 303
Joined: Wed Aug 15, 2012 2:31 am

Re: CPU Folding: BAD_FRAME_CHECKSUM on 6 cores

Post by Rel25917 »

You could try setting the checkpoint frequency to 30 minutes to reduce the chance that the VM is shut off during a checkpoint. That setting will not affect GPU slots, and they don't like bad shutdowns either; it is just random chance when the VM gets killed. Also, how long do the VMs normally run between kills? If they aren't getting at least a few hours, I would say that's probably a bad way to run.
Joe_H
Site Admin
Posts: 7989
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
Location: W. MA

Re: CPU Folding: BAD_FRAME_CHECKSUM on 6 cores

Post by Joe_H »

The CPU Core_A7 writes a checkpoint at the interval set in FAHControl, and also attempts to write one when folding is stopped. That can take 30-60 seconds, depending on the size of the system in the WU and the speed of the computer and storage. Another factor is how the OS handles cached writes to storage, which varies between Linux, macOS and Windows and can also depend on the filesystem being used.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: CPU Folding: BAD_FRAME_CHECKSUM on 6 cores

Post by bruce »

There's a reason why operating systems have an orderly shutdown procedure. Programs need to be notified that a shutdown is imminent so that they can close the files and leave a coherent image on the permanent storage. I would certainly discuss it with their sales department as it does corrupt the science that you're doing. I don't see how FAH can really do anything about it inasmuch as the file-system caches disk writes in memory and if that memory is lost, the disk image is incomplete.

I may be incorrect, however. Suppose the FAH software was revised so as to keep TWO checkpoints. If it discovers that a checkpoint is corrupt (which is exactly what is happening), the previous checkpoint could be used. Let's assume a checkpoint is written every 5%. If the VM is interrupted when it happens to be at 43% and the checkpoint that was written at 40% is not recoverable, it should still be able to resume work from the checkpoint written at 35%. Then 8% of your work would have to be repeated which, to me, seems like a relatively poor return, but it's not as bad as redoing all 43%.

The GROMACS software is Open Source code. If somebody wants to code an enhancement and submit it to gromacs.org, I suppose you can do that and maybe someday it will be incorporated into a future version of the production code.
sgnsajgon
Posts: 3
Joined: Thu May 07, 2020 1:15 am

Re: CPU Folding: BAD_FRAME_CHECKSUM on 6 cores

Post by sgnsajgon »

Thank you for the responses.

Some additional details.

VM preemption occurs on average once per day, so only 1 of 5-8 WUs is interrupted and lost during ~24 hours of processing. Preemptible vCPUs and GPUs are roughly 3x cheaper than regular ones, so I think it is a good deal. As I said, there is no problem with GPU folding: checkpoints are always restored correctly after preemption, so it is possible to persist state transactionally when it is properly implemented.
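
To put rough numbers on the "good deal" claim, here is a small back-of-the-envelope calculation in Python. It assumes the interrupted WU is a total loss and that price is the only other difference between the two VM types; the function name is made up for this post.

```python
def relative_work_per_cost(wus_per_day: int, lost_per_day: int,
                           price_ratio: float) -> float:
    """Completed work per unit cost on preemptible VMs, relative to regular VMs."""
    completed_fraction = (wus_per_day - lost_per_day) / wus_per_day
    return completed_fraction * price_ratio

# 1 WU lost out of 5 or 8 per day, at roughly one third of the price
for wus in (5, 8):
    print(f"{wus} WUs/day: {relative_work_per_cost(wus, 1, 3.0):.3f}x")
```

Under those assumptions the preemptible setup still delivers roughly 2.4x to 2.6x the completed work per unit of cost.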

By default the CPU slot creates a checkpoint every 15 minutes, but the BAD_FRAME_CHECKSUM bug is 100% reproducible, so I'm sure this is not just an unlucky coincidence of a checkpoint being written at the moment the VM is killed.
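
A quick estimate supports this. Taking the 30-60 seconds per checkpoint write mentioned earlier in the thread, and assuming the kill time is uniformly random and independent of the checkpoint schedule (both are my assumptions), the chance that a single kill lands inside a write is simply write time divided by interval:

```python
def overlap_probability(write_seconds: float, interval_minutes: float) -> float:
    """Chance that a uniformly random kill lands inside a checkpoint write."""
    return write_seconds / (interval_minutes * 60.0)

for interval in (15, 30):
    for write in (30, 60):
        p = overlap_probability(write, interval)
        print(f"interval {interval} min, write {write} s: {p:.1%} per kill")

# Chance that five independent kills in a row all hit a write,
# using the worst case above (60 s write, 15 min interval).
print(f"five in a row: {overlap_probability(60, 15) ** 5:.2e}")
```

That is a few percent per kill at most, so a failure on every single preemption is very unlikely to be explained by timing alone.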

Reliable and durable data persistence has been a topic in the industry for decades, and nowadays almost all modern DB engines provide transactional persistence: relational databases have ACID transactions, and NoSQL databases guarantee transactional writes at least within the scope of a single data collection.
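
For a single file, the usual way to get this kind of all-or-nothing behaviour is to write to a temporary file, sync it, and then rename it over the old one. A minimal Python sketch, assuming a POSIX filesystem where rename is atomic; `atomic_write` is a name I made up, and I don't know whether FahCore does anything like this.

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Replace path so that it holds either the old or the new contents in full."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # new contents reach storage first
        os.replace(tmp, path)     # then swap the name in one step
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)        # do not leave a stray temp file behind
        raise
    # Sync the directory so the rename itself survives a sudden kill.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "state.cpt")  # hypothetical checkpoint name
    atomic_write(target, b"old state")
    atomic_write(target, b"new state")
    with open(target, "rb") as f:
        print(f.read())
```

If the VM is killed before the rename, the old checkpoint is still intact; if it is killed after, the new one is complete.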

I would like to investigate the problem on the source code level, but there are 2 issues:

1) The Gromacs fork used by F@H (hosted on GitHub at FoldingAtHome/gromacs) is based on a very old Gromacs release (version 5.0.4, six years old) and is 5252 commits behind the main upstream Gromacs source. There is a chance that a lot of bugs have already been fixed upstream, so I'm not sure which Gromacs code base I should investigate.

BTW, are there any plans to incorporate the newest version of Gromacs into FAHClient?

2) If I'm not wrong, the FAHClient persistence layer is implemented in either the FahCoreWrapper or the FahCore modules, but according to the home site they are closed source. I cannot find their source code on GitHub.
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: CPU Folding: BAD_FRAME_CHECKSUM on 6 cores

Post by PantherX »

Welcome to the F@H Forum sgnsajgon,

IMO, start with the GROMACS main fork. If the fix is there, then the issue will be fixed the next time FahCore is updated. If the fix is not present, you can contribute it and help out the scientific community that uses GROMACS.

The time to use the newer version of GROMACS is based upon the scientific features that F@H needs/uses/wants. If those features aren't of use, then there's no need to update as development resources are heavily constrained.