Too many Core Dumped on GPU

arisu · Post by **arisu** » Sun Mar 16, 2025 12:08 pm

azhad wrote: ↑Sun Mar 16, 2025 11:39 am Issue resolved. FAH GPU workloads are different from my previous project workloads. The bump in voltage from 850 mV to 887 mV works, with power consumption going from 170 W to 210-220 W. Still underclocked but it is stable now.

But I still wonder why you can't have 2 checkpoint saves (say CKpoint48 at 48000 and CKpoint50 at 50000). An error occurs at 51000. Load CKpoint48 and resume work (as if it got paused). Once CKpoint50 is reached, check if the work result matches -> if so, resume the work, else dump. My GPU was able to work on another new work unit immediately after the crash - it should be good to resume at the previous checkpoint as well (unless the saves consume exorbitant space on SSD, etc).

There are a few reasons why that wouldn't be a good idea. Folding is not deterministic (although OpenMM has an optional deterministic mode), so you can't simply perform the same steps twice and always expect the same exact results. Additionally, there could have been a problem that occurred much earlier that didn't trigger an abort, so the checkpoint may already be "tainted". Better to start from scratch on an unrelated system.
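To illustrate the non-determinism point: one common source of run-to-run differences in parallel code is that floating-point addition is not associative, so summing the same contributions in a different order can change the last bits of the result. The sketch below is plain Python, not FAH or OpenMM code, and the `accumulate` helper is hypothetical; it only mimics an accumulation whose order varies between runs.

```python
import random

def accumulate(values, seed):
    """Sum the same values in a seed-dependent order.

    Hypothetical stand-in for a parallel reduction whose
    accumulation order is not fixed from run to run.
    """
    order = list(values)
    random.Random(seed).shuffle(order)
    total = 0.0
    for v in order:
        total += v
    return total

# Same three numbers, different grouping, different result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Same 1000 numbers, two different orders: the sums agree only approximately.
values = [random.Random(0).uniform(-1.0, 1.0) for _ in range(1000)]
run_a = accumulate(values, seed=1)
run_b = accumulate(values, seed=2)
print(abs(run_a - run_b))  # tiny, but not guaranteed to be zero
```

In a chaotic simulation such tiny differences grow over many steps, which is why replaying from an older checkpoint and comparing against a newer one bit for bit is not a reliable check.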

I agree that there should be better error checking since some types of errors can be clearly shown to be transient and not impact the reliability of the simulation, but the cores and client do not communicate detailed enough information for that to be possible at this time. There are only a small number of error codes that the core returns to the client (https://github.com/FoldingAtHome/fah-cl ... ExitCode.h), and some error codes are returned for both serious issues that mandate the WU being dumped as well as benign and harmless issues that indicate nothing more than a configuration problem.

muziqaz · Post by **muziqaz** » Sun Mar 16, 2025 1:54 pm

Frequent checkpoints with GPU projects are extremely expensive size-wise.

As for why sh*t doesn't have more redundancy, that is a question for the dev.

azhad · Post by **azhad** » Sun Mar 16, 2025 2:31 pm

@arisu Good points there. I understand a bit more now. Previously I was doing maths where the results would and should be the same.

@muziqaz The method I suggested does not increase the frequency of checkpoints - only the storage of checkpoints increases (two checkpoints are kept instead of one - and they would be generated anyway). It can be made an option in the program for those who want it and have the necessary disk space. (Anyway, like I said - issue resolved for now).
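A minimal sketch of the retention scheme described here, assuming checkpoints are written at the usual interval and only the number kept changes. `CheckpointRing` and its methods are hypothetical names for illustration; nothing here reflects how the FAH client or cores actually store checkpoints.

```python
from collections import deque

class CheckpointRing:
    """Keep only the most recent `keep` checkpoints (hypothetical helper)."""

    def __init__(self, keep=2):
        # A bounded deque drops the oldest entry automatically.
        self._slots = deque(maxlen=keep)

    def save(self, step, state):
        """Record a checkpoint; the write frequency is decided by the caller."""
        self._slots.append((step, state))

    def steps(self):
        """Steps of the checkpoints currently retained, oldest first."""
        return [step for step, _ in self._slots]

    def rollback_target(self):
        """Oldest retained checkpoint, i.e. the one to resume from after an error."""
        if not self._slots:
            raise LookupError("no checkpoint saved yet")
        return self._slots[0]

ring = CheckpointRing(keep=2)
for step in (46000, 48000, 50000):
    ring.save(step, f"state@{step}")

print(ring.steps())            # [48000, 50000]
print(ring.rollback_target())  # (48000, 'state@48000')
```

The number of checkpoints written is unchanged; only local retention doubles. The comparison step at 50000 is left out on purpose, since the replayed segment is not guaranteed to reproduce the earlier state exactly.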

muziqaz · Post by **muziqaz** » Sun Mar 16, 2025 2:45 pm

All the checkpoints are also sent to the server, and that is the main limit, not your hard drive space.

azhad · Post by **azhad** » Sun Mar 16, 2025 6:10 pm

@muziqaz Only 1% more storage is needed on the server - to store the duplicate checkpoint CKpoint50 in my example. You can omit that 1% too if you don't want to compare it or if the checkpoints will totally differ.

Let me make it simple for you - no more space is needed.

muziqaz · Post by **muziqaz** » Sun Mar 16, 2025 7:20 pm

azhad wrote: ↑Sun Mar 16, 2025 6:10 pm @muziqaz Only 1% more storage is needed on the server - to store the duplicate checkpoint CKpoint50 in my example. You can omit that 1% too if you don't want to compare it or if the checkpoints will totally differ. Let me make it simple for you - no more space is needed.

There are two reasons for checkpoints being the way they are for GPUs right now:
wasted compute resources and file size. It is that simple.

Currently we can choose to have checkpoints every 1%, 2%, etc. The default on the server is every 5%. 99.9% of researchers stay with 5%, for the reasons mentioned above. The only time we suggest more frequent checkpoints is when 5% means 1h+ of compute time on a large share of GPUs.
You can come up with elaborate plans on how to do error checking and checkpointing which suit you all you want. It will not change.
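The interval trade-off above comes down to simple arithmetic, sketched below. The function name and the 20-hour work unit are assumptions chosen for illustration; the only figures taken from the thread are the 5% default and the 1h+ threshold.

```python
def checkpoint_plan(wu_hours, interval_percent=5.0):
    """Return (checkpoints per work unit, hours between checkpoints).

    `wu_hours` is the total compute time of one work unit on a given GPU.
    The hours between checkpoints is also the most work a crash can lose.
    """
    if not 0 < interval_percent <= 100:
        raise ValueError("interval_percent must be in (0, 100]")
    count = int(100 // interval_percent)
    hours_between = wu_hours * interval_percent / 100.0
    return count, hours_between

# Hypothetical 20-hour work unit at the default 5% interval:
print(checkpoint_plan(20.0))       # (20, 1.0)

# Same work unit with 1% checkpoints: five times as many checkpoint writes.
print(checkpoint_plan(20.0, 1.0))  # (100, 0.2)
```

Under these assumptions, a 5% interval reaches the 1h+ mark once a work unit takes about 20 hours on the GPU in question.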

arisu · Post by **arisu** » Mon Mar 17, 2025 4:15 am

muziqaz wrote: ↑Sun Mar 16, 2025 2:45 pm All the checkpoints are also sent to the server, and that is the main limit, not your hard drive space

The OpenMM binary checkpoint file ("checkpoint") is never sent to the server because it's non-portable and device-dependent. It's converted into a portable XML format once at the very end and that is sent to the server ("checkpointState.xml.bz2").

XTC frames are also sent to the server but those are small and only taken half as often as checkpoints.

But that makes me curious. If network limits are an issue for servers, then why do the CPU cores send the redundant .gro file? It's by far the biggest file, yet it's 100% redundant and can be recalculated exactly from the other files that are sent. Or do the GPU servers just have less available bandwidth?

muziqaz · Post by **muziqaz** » Mon Mar 17, 2025 5:53 am

CPU projects tend to be smaller than GPU projects.
