CPU Folding: BAD_FRAME_CHECKSUM on 6 cores

sgnsajgon · Post by **sgnsajgon** » Thu May 07, 2020 1:49 am

Hello,

I have reported this issue on Github, then I was asked to post it here.

FAHClient 7.6.9

I'm folding on Google Cloud Platform, using preemptible virtual machines with persistent storage.
Such a VM can be killed at any time without a graceful shutdown. In that case FAHClient is forcibly killed, the VM is spawned again after some minutes, and folding resumes from the checkpoint.

In the past I used 2 VMs: one with 1 vCPU for CPU folding, and one with 1 vCPU and a GPU for GPU folding. These VMs worked correctly: preemption, checkpoint restore and work resumption all happened without errors.

Now I'm using only one VM with 8 vCPUs and a GPU, folding on 2 slots:

1) CPU slot utilizing 6 vCPUs.
2) GPU slot.

The slots fold as intended, but after preemption and resumption the GPU slot resumes correctly, while the CPU slot always fails in the same way:

```
10:16:28: CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
10:16:28: CPU ID: GenuineIntel Family 6 Model 79 Stepping 0
10:16:28: CPUs: 8
10:16:28: Memory: 29.45GiB
10:16:28: Free Memory: 27.47GiB
10:16:28: Threads: POSIX_THREADS
10:16:28: OS Version: 4.19
10:16:28: Has Battery: false
10:16:28: On Battery: false
10:16:28: UTC Offset: 0
10:16:28: PID: 10
10:16:28: CWD: /var/lib/fahclient
10:16:28: OS: Linux 4.19.112+ x86_64
10:16:28: OS Arch: AMD64
10:16:28: GPUs: 1
10:16:28: GPU 0: Bus:0 Slot:4 Func:0 NVIDIA:7 TU104GL [Tesla T4]
10:16:28: CUDA Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:7.5 Driver:10.1
10:16:28:OpenCL Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:1.2 Driver:418.67
(...)
10:16:29:WU02:FS00:0xa7:ERROR:Guru Meditation #885995a80cc46232.818eb4166c330456 (5455872.5459308) '02/01/state.cpt'
10:16:29:WU02:FS00:0xa7:WARNING:Unexpected exit() call
10:16:29:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
10:16:29:WU02:FS00:0xa7:Saving result file ../logfile_01.txt
10:16:29:WU02:FS00:0xa7:Saving result file frame51.trr
10:16:29:WU02:FS00:0xa7:ERROR:Guru Meditation #0.d6a0f109730da89c (0.5457120) '02/01/frame51.trr'
10:16:34:WARNING:WU02:FS00:FahCore returned: BAD_FRAME_CHECKSUM (112 = 0x70)
10:16:34:WARNING:WU02:FS00:Fatal error, dumping
10:16:34:WU02:FS00:Sending unit results: id:02 state:SEND error:DUMPED project:14570 run:0 clone:1034 gen:51 core:0xa7 unit:0x00000040287234c95e7ee8a1d36c7740
10:16:34:WU02:FS00:Connecting to 40.114.52.201:8080
10:16:34:WU00:FS00:Connecting to 65.254.110.245:80
10:16:35:WU02:FS00:Server responded WORK_ACK (400)
10:16:35:WU02:FS00:Cleaning up
```

sgnsajgon · Post by **sgnsajgon** » Fri May 08, 2020 11:44 pm

I have reproduced this issue using a VM with 8 vCPUs and no GPU. All vCPUs work as a single folding slot.

uyaem · Post by **uyaem** » Sat May 09, 2020 12:28 pm

I guess that if the VM is preempted while the client is writing a checkpoint, the work will be lost (corrupt file).

I doubt that this is something that can be addressed by software, much like you cannot address someone pulling the power cable.
Other than maybe creating the checkpoint in a separate file, and keeping a series of checkpoints that the cores can revert to... but that seems like an awful lot of overhead for a rare scenario.
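To make the "checkpoint in a separate file" idea concrete, here is a minimal Python sketch of the usual write-then-rename pattern. It is only an illustration under assumptions: the function name `write_checkpoint_atomically` is hypothetical, and nothing here reflects how FahCore or GROMACS actually write `state.cpt`.

```python
import os
import tempfile


def write_checkpoint_atomically(path: str, data: bytes) -> None:
    """Hypothetical helper: replace `path` so it always holds a complete file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory (same filesystem),
    # so that the final rename is atomic on POSIX systems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the contents past the OS write cache
        os.replace(tmp_path, path)  # swap the new checkpoint into place
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself survives a crash (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If the machine dies during the write, only the temporary file is incomplete and the previous checkpoint is still intact. The cost is one extra file and two `fsync` calls per checkpoint.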

Rel25917 · Post by **Rel25917** » Sat May 09, 2020 6:36 pm

You could try setting the checkpoint frequency to 30 minutes to reduce the chance that the VM is shut off during a checkpoint. That setting will not affect GPU slots, and they don't like bad shutdowns either; it is just random chance when the VM gets killed. Also, how long do the VMs normally run between kills? If they aren't getting at least a few hours between kills, I would say that's probably a bad way to run.

Post by **Joe_H** » Sat May 09, 2020 7:28 pm

The CPU Core_A7 writes a checkpoint based on the interval set in FAHControl, and also attempts to write one when folding is stopped. That can take 30-60 seconds, depending on the size of the system in the WU and the speed of the computer and storage. One other factor is how the OS handles cached writes to storage; that varies between Linux, macOS and Windows, and can also vary depending on the filesystem being used.

Post by **bruce** » Sat May 09, 2020 11:19 pm

There's a reason why operating systems have an orderly shutdown procedure. Programs need to be notified that a shutdown is imminent so that they can close the files and leave a coherent image on the permanent storage. I would certainly discuss it with their sales department as it does corrupt the science that you're doing. I don't see how FAH can really do anything about it inasmuch as the file-system caches disk writes in memory and if that memory is lost, the disk image is incomplete.

I may be incorrect, however. Suppose the FAH software was revised so as to keep TWO checkpoints. If it discovers that a checkpoint is corrupt (which is exactly what is happening), the previous checkpoint could be used. Let's assume a checkpoint is written every 5%. If the VM is interrupted when it happens to be at 43%, and we assume that the checkpoint written at 40% is not recoverable, it should still be able to resume work from the checkpoint written at 35%. Then 8% of your work would have to be repeated which, to me, seems like a relatively poor return, but it's not as bad as redoing all 43%.
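The two-checkpoint idea can be sketched in a few lines of Python. This is a hedged illustration only: the function names, the `.prev` suffix and the SHA-256 header are assumptions made for the example, not the FahCore or GROMACS checkpoint format.

```python
import hashlib
import os
from typing import Optional


def save_checkpoint(path: str, payload: bytes) -> None:
    """Keep the previous checkpoint as `path + '.prev'`, then write a new one."""
    if os.path.exists(path):
        os.replace(path, path + ".prev")
    digest = hashlib.sha256(payload).digest()  # 32-byte checksum header
    with open(path, "wb") as f:
        f.write(digest + payload)
        f.flush()
        os.fsync(f.fileno())


def _read_valid(path: str) -> Optional[bytes]:
    """Return the payload if the file exists and its checksum matches."""
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except FileNotFoundError:
        return None
    digest, payload = blob[:32], blob[32:]
    if len(digest) == 32 and hashlib.sha256(payload).digest() == digest:
        return payload
    return None


def load_checkpoint(path: str) -> Optional[bytes]:
    """Prefer the newest checkpoint; fall back to the previous one if corrupt."""
    for candidate in (path, path + ".prev"):
        payload = _read_valid(candidate)
        if payload is not None:
            return payload
    return None
```

With the numbers above, a truncated 40% checkpoint fails its checksum, `load_checkpoint` returns the 35% one, and 43% - 35% = 8% of the work is repeated instead of all 43%.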

The GROMACS software is Open Source code. If somebody wants to code an enhancement and submit it to gromacs.org, I suppose you can do that and maybe someday it will be incorporated into a future version of the production code.

sgnsajgon · Post by **sgnsajgon** » Mon May 11, 2020 12:00 am

Thank you for the responses.

Some additional details.

VM preemption occurs on average once per day, so only 1 of 5-8 WUs is interrupted and lost during ~24 hours of processing. Preemptible vCPUs and GPUs are roughly 3x cheaper than regular ones, so I think it is a good deal. As I said, there is no problem with GPU folding: checkpoints are always restored correctly after preemption, so it is possible to persist state transactionally when it is properly implemented.

By default the CPU slot creates a checkpoint every 15 minutes, but the BAD_FRAME_CHECKSUM bug is 100% reproducible, so I doubt it is merely a coincidence of checkpoint creation and the VM being killed.

The topic of reliable and durable data persistence has been present in the industry for decades, and nowadays almost all modern DB engines provide transactional persistence: relational databases have ACID transactions, and NoSQL databases ensure transactional writes at least within the scope of a single data collection.
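A rough back-of-the-envelope check of the coincidence argument, as a Python sketch. It assumes the kill moment is uniformly random relative to the checkpoint schedule and borrows the 30-60 second write time mentioned earlier in the thread; the function name is made up for the example.

```python
def coincidence_probability(write_seconds: float, interval_minutes: float) -> float:
    """Chance that a random kill lands inside a checkpoint write (assumed uniform)."""
    return write_seconds / (interval_minutes * 60.0)


interval = 15  # default CPU checkpoint interval in minutes, per the post above
for write_s in (30, 60):
    p = coincidence_probability(write_s, interval)
    # p ** 5 is the chance that five independent kills all hit a write window.
    print(f"write {write_s}s: {p:.1%} per kill, {p ** 5:.2e} for 5 kills in a row")
```

Under these assumptions a single kill hits a checkpoint write roughly 3% to 7% of the time, so a failure on every preemption is very unlikely to be explained by timing alone.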

I would like to investigate the problem on the source code level, but there are 2 issues:

1) The Gromacs fork used by F@H (hosted on Github at FoldingAtHome/gromacs) is based on a very old Gromacs release (version 5.0.4, six years old), and is 5252 commits behind the main upstream Gromacs source. There is a chance that a lot of bugs have already been fixed upstream, so I'm not sure which Gromacs fork I should investigate.

BTW are there any plans to incorporate the newest version of Gromacs into Fahclient?

2) If I'm not wrong, the Fahclient persistence layer is implemented in either the FahCoreWrapper or the FahCore module, but they are closed source, as I read on the home site. I cannot find their source code on Github.

Post by **PantherX** » Mon May 11, 2020 7:03 am

Welcome to the F@H Forum sgnsajgon,

IMO, start with the GROMACS main fork. If the fix is there, then the next time FahCore is updated, this issue will be fixed. If the fix is not present, you can contribute it and help the scientific community that uses GROMACS.

The timing of a move to a newer version of GROMACS depends on the scientific features that F@H needs/uses/wants. If those features aren't of use, then there's no need to update, as development resources are heavily constrained.
