CPU Folding: BAD_FRAME_CHECKSUM on 6 cores
Posted: Thu May 07, 2020 1:49 am
Hello,
I reported this issue on GitHub and was asked to post it here.
FAHClient 7.6.9
I'm folding on Google Cloud Platform, using preemptible virtual machines with persistent storage.
Such a VM can be killed at any time without a graceful shutdown, in which case FAHClient is forcibly killed. After a few minutes the VM is spawned again and folding resumes from the checkpoint.
In the past I used 2 VMs: the first with 1 vCPU for CPU folding, the second with 1 vCPU and a GPU for GPU folding. These VMs worked correctly: preemption, checkpoint restore and work resumption all worked without errors.
Now I'm using only one VM with 8 vCPUs and a GPU, folding on 2 slots:
1) CPU slot utilizing 6 vCPUs.
2) GPU slot.
Both slots fold as intended, but after preemption and resumption the GPU slot resumes correctly while the CPU slot always fails in the same way:
10:16:28: CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
10:16:28: CPU ID: GenuineIntel Family 6 Model 79 Stepping 0
10:16:28: CPUs: 8
10:16:28: Memory: 29.45GiB
10:16:28: Free Memory: 27.47GiB
10:16:28: Threads: POSIX_THREADS
10:16:28: OS Version: 4.19
10:16:28: Has Battery: false
10:16:28: On Battery: false
10:16:28: UTC Offset: 0
10:16:28: PID: 10
10:16:28: CWD: /var/lib/fahclient
10:16:28: OS: Linux 4.19.112+ x86_64
10:16:28: OS Arch: AMD64
10:16:28: GPUs: 1
10:16:28: GPU 0: Bus:0 Slot:4 Func:0 NVIDIA:7 TU104GL [Tesla T4]
10:16:28: CUDA Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:7.5 Driver:10.1
10:16:28:OpenCL Device 0: Platform:0 Device:0 Bus:0 Slot:4 Compute:1.2 Driver:418.67
(...)
10:16:29:WU02:FS00:0xa7:ERROR:Guru Meditation #885995a80cc46232.818eb4166c330456 (5455872.5459308) '02/01/state.cpt'
10:16:29:WU02:FS00:0xa7:WARNING:Unexpected exit() call
10:16:29:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
10:16:29:WU02:FS00:0xa7:Saving result file ../logfile_01.txt
10:16:29:WU02:FS00:0xa7:Saving result file frame51.trr
10:16:29:WU02:FS00:0xa7:ERROR:Guru Meditation #0.d6a0f109730da89c (0.5457120) '02/01/frame51.trr'
10:16:34:WARNING:WU02:FS00:FahCore returned: BAD_FRAME_CHECKSUM (112 = 0x70)
10:16:34:WARNING:WU02:FS00:Fatal error, dumping
10:16:34:WU02:FS00:Sending unit results: id:02 state:SEND error:DUMPED project:14570 run:0 clone:1034 gen:51 core:0xa7 unit:0x00000040287234c95e7ee8a1d36c7740
10:16:34:WU02:FS00:Connecting to 40.114.52.201:8080
10:16:34:WU00:FS00:Connecting to 65.254.110.245:80
10:16:35:WU02:FS00:Server responded WORK_ACK (400)
10:16:35:WU02:FS00:Cleaning up
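To check whether every restart ends the same way, the "FahCore returned" lines can be pulled out of the client log and compared. The following is only a minimal sketch: the function name `fahcore_results` is made up for illustration, and the pattern assumes the log line format shown above; other client versions may format these lines differently.

```python
import re

# Assumed line format, taken from the log excerpt above:
# 10:16:34:WARNING:WU02:FS00:FahCore returned: BAD_FRAME_CHECKSUM (112 = 0x70)
PATTERN = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?:WARNING:)?"
    r"WU(?P<wu>\d+):FS(?P<slot>\d+):"
    r"FahCore returned: (?P<status>\w+) \((?P<code>\d+) = 0x[0-9a-fA-F]+\)"
)


def fahcore_results(lines):
    """Return (time, work unit, slot, status, code) for each FahCore exit line."""
    results = []
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:
            results.append((
                match["time"],
                int(match["wu"]),
                int(match["slot"]),
                match["status"],
                int(match["code"]),
            ))
    return results
```

It could be run against the client log, for example `fahcore_results(open("/var/lib/fahclient/log.txt"))`. That path is an assumption based on the CWD reported in the log above.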