There are a few reasons why that wouldn't be a good idea. Folding is not deterministic (although OpenMM has an optional deterministic mode), so you can't simply run the same steps twice and always expect exactly the same results. Additionally, a problem could have occurred much earlier without triggering an abort, so the checkpoint may already be "tainted". Better to start from scratch on an unrelated system.
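For anyone curious what that deterministic mode looks like, OpenMM's CUDA platform exposes a DeterministicForces property. Here is a minimal sketch, assuming the OpenMM Python API; the input structure and force-field files are placeholders rather than anything a FAH project actually ships:

[code]
# Minimal sketch: requesting deterministic force accumulation on OpenMM's CUDA platform.
# 'input.pdb' and the force-field files below are placeholders for illustration only.
from openmm import LangevinMiddleIntegrator, Platform, unit
from openmm.app import ForceField, PDBFile, PME, Simulation

pdb = PDBFile('input.pdb')                        # hypothetical starting structure
forcefield = ForceField('amber14-all.xml', 'amber14/tip3p.xml')
system = forcefield.createSystem(pdb.topology, nonbondedMethod=PME)

integrator = LangevinMiddleIntegrator(300 * unit.kelvin,
                                      1 / unit.picosecond,
                                      0.002 * unit.picoseconds)
integrator.setRandomNumberSeed(12345)             # fixed seed; the default of 0 lets OpenMM pick one

platform = Platform.getPlatformByName('CUDA')
properties = {'DeterministicForces': 'true'}      # request reproducible force summation

sim = Simulation(pdb.topology, system, integrator, platform, properties)
sim.context.setPositions(pdb.positions)
sim.step(1000)
[/code]

Even with that property set and a fixed integrator seed, identical trajectories are only expected on the same GPU, driver, and OpenMM build, which is part of why a resumed run can't simply be checked against an old checkpoint.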
azhad wrote: ↑Sun Mar 16, 2025 11:39 am
Issue resolved. FAH GPU workloads are different from my previous project workloads. The bump in voltage from 850 mV to 887 mV works, with power consumption going from 170 W to 210-220 W. Still underclocked, but it is stable now.
But I still wonder why you can't have two checkpoint saves (say CKpoint48 at step 48000 and CKpoint50 at step 50000). An error occurs at step 51000. Load CKpoint48 and resume work (as if it had been paused). Once step 50000 is reached, check whether the result matches CKpoint50: if so, resume the work; if not, dump it. My GPU was able to start another new work unit immediately after the crash, so it should be able to resume from the previous checkpoint as well (unless the saves consume exorbitant space on the SSD, etc.).
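Purely as a sketch of the scheme described in that quote (the sim object, GpuError, and the helpers are all assumed for illustration; this is not how the actual core or client is written), and with the caveat that the non-determinism described above is exactly what would make the comparison fail in practice:

[code]
# Hypothetical sketch of the two-checkpoint verify idea. Not the real FAH core;
# `sim`, GpuError, and advance/restore/state are assumed for illustration.
import hashlib
import pickle

class GpuError(Exception):
    """Assumed stand-in for a transient GPU fault mid-run."""

def state_hash(state) -> str:
    """Fingerprint a serializable simulation state (positions, velocities, step count)."""
    return hashlib.sha256(pickle.dumps(state)).hexdigest()

def run_with_verification(sim, interval=1000, total_steps=50000):
    ckpts = []                                    # keep only the last two: (step, state, hash)
    step = 0
    while step < total_steps:
        try:
            sim.advance(interval)                 # assumed: run `interval` MD steps
            step += interval
            snap = sim.state()                    # assumed: snapshot of the current state
            ckpts = (ckpts + [(step, snap, state_hash(snap))])[-2:]
        except GpuError:
            if len(ckpts) < 2:
                return "dump"                     # nothing to verify against yet
            (old_step, old_state, _), (new_step, _, new_hash) = ckpts
            sim.restore(old_state)                # reload the older save (the "CKpoint48")
            sim.advance(new_step - old_step)      # replay forward to the newer one ("CKpoint50")
            if state_hash(sim.state()) != new_hash:
                return "dump"                     # replay diverged, so the result can't be trusted
            step = new_step                       # replay matched, carry on as if paused
    return "finished"
[/code]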
I agree that there should be better error checking, since some types of errors can clearly be shown to be transient and not to impact the reliability of the simulation, but the cores and the client do not communicate detailed enough information for that to be possible at this time. There are only a small number of error codes that a core returns to the client (https://github.com/FoldingAtHome/fah-cl ... ExitCode.h), and some of those codes are returned both for serious issues that mandate dumping the WU and for benign, harmless issues that indicate nothing more than a configuration problem.
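To illustrate that last point with made-up values (these are not the real names or numbers from ExitCode.h): when one coarse failure code covers everything from a driver hiccup to a corrupted simulation, the only safe client policy is to dump.

[code]
# Purely hypothetical illustration; not the real codes from fah-client's ExitCode.h.
from enum import Enum

class CoreExit(Enum):
    OK = 0          # results look good, upload them
    FAILED = 1      # one code for many causes: bad WU data, driver crash,
                    # unstable overclock, missing runtime library, ...

def client_decision(code: CoreExit) -> str:
    if code is CoreExit.OK:
        return "upload results"
    # A transient driver hiccup and a genuinely corrupted simulation both come
    # back as the same FAILED code, so the client cannot tell them apart and
    # has to dump the WU and download a fresh one.
    return "dump work unit"
[/code]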