I'm noticing a waste of GPU resources on 18601, as it checkpoints every 25000 steps. On a 4090 that's under a minute; on a 3080 it's about every 2 minutes. Each checkpoint seems to take a few seconds, which surely adds up to a lot of wasted time over 24 hours.
Why have the option to set the checkpointing frequency if it's ignored?
Conversely, I'm not seeing any checkpointing in the logs at all for aarch64 WUs, although looking in the data folder they do seem to be written.
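To put a rough number on "adds up", here is a back-of-the-envelope sketch. The inputs are assumptions, not measurements: 60 s between checkpoints for the 4090 ("under a minute"), 120 s for the 3080, and a 3 s stall per checkpoint ("a few seconds"). The function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_checkpoint_overhead(interval_s, pause_s):
    """Return (checkpoints per day, minutes lost per day, fraction of the day lost).

    interval_s: assumed wall-clock time from one checkpoint to the next
    pause_s:    assumed GPU stall per checkpoint
    """
    checkpoints = SECONDS_PER_DAY / interval_s
    lost_s = checkpoints * pause_s
    return checkpoints, lost_s / 60, lost_s / SECONDS_PER_DAY

# Assumed figures: 60 s / 120 s between checkpoints, 3 s pause each.
for card, interval_s in (("4090", 60), ("3080", 120)):
    n, lost_min, frac = daily_checkpoint_overhead(interval_s, pause_s=3)
    print(f"{card}: {n:.0f} checkpoints/day, {lost_min:.0f} min lost ({frac:.1%})")
```

With those assumed inputs that works out to roughly 72 minutes a day (5%) on the 4090 and 36 minutes (2.5%) on the 3080. The real figures depend on the actual pause length, which I haven't timed.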
18601 checkpoints too often
-
- Site Admin
- Posts: 7926
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
- Location: W. MA
Re: 18601 checkpoints too often
The checkpoints on GPU projects are set by the researcher. They happen when important data is collected and retained for later analysis after the WU is returned. A checkpoint is also when a sanity check is run on the CPU against the data computed up to that point, to verify that the GPU is calculating properly. That was found to be necessary because unstable GPUs may not give any indication that there are errors in the processing of the WU data.
The algorithms used in the CPU processing cores based on GROMACS are different and can be interrupted on a timed basis. In the latest versions they also will attempt to write out a checkpoint when folding is paused. The OpenMM code used in the GPU folding core needs to be interrupted at certain points to be able to write out a usable checkpoint.
iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
-
- Posts: 40
- Joined: Mon Oct 24, 2022 4:32 am
Re: 18601 checkpoints too often
Thanks, that's obviously more important than getting it done a little faster.
-
- Site Moderator
- Posts: 6349
- Joined: Sun Dec 02, 2007 10:38 am
- Location: Bordeaux, France
Re: 18601 checkpoints too often
The checkpoints are usually set to not waste too much compute time when low end GPUs are interrupted ...
Re: 18601 checkpoints too often
Would be nice if:
- Checkpoints could be written without interrupting calculations (dunno how hard that would be if at all possible), or
- There would be something like 'if last checkpoint was within x minutes, skip this one', with default of 5 or 15m, configurable with an advanced setting - that way there are still checkpoints on whole percentages but it would auto adjust to the speed of the card
I know, most effort is put in building the new client, so this might end up somewhere on the backlog with lower priority
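A minimal sketch of the second idea, just to show the logic. This is not client or core code: `CheckpointThrottle`, `should_write`, and `min_interval_s` are hypothetical names, and the clock is injectable only so the demo can run on simulated time.

```python
import time

class CheckpointThrottle:
    """Skip a checkpoint if the previous one was written too recently."""

    def __init__(self, min_interval_s=300.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s  # hypothetical advanced setting
        self._clock = clock
        self._last_write = None               # time of the last checkpoint written

    def should_write(self):
        now = self._clock()
        too_soon = (
            self._last_write is not None
            and now - self._last_write < self.min_interval_s
        )
        if too_soon:
            return False
        self._last_write = now
        return True

def count_writes(seconds_per_percent, min_interval_s):
    """Simulate 100 whole-percent boundaries and count checkpoints written."""
    sim_time = [0.0]
    throttle = CheckpointThrottle(min_interval_s, clock=lambda: sim_time[0])
    written = 0
    for _ in range(100):
        sim_time[0] += seconds_per_percent    # card reaches the next percent
        if throttle.should_write():
            written += 1
    return written

fast_card_writes = count_writes(seconds_per_percent=60, min_interval_s=300)
slow_card_writes = count_writes(seconds_per_percent=600, min_interval_s=300)
print(fast_card_writes, slow_card_writes)
```

In this simulation a card that hits a percent boundary every 60 s writes 20 checkpoints instead of 100, while a card that needs 600 s per percent still writes all 100. One caveat: as explained earlier in the thread, the checkpoint is also when the sanity check runs, so skipping checkpoints this way would skip those checks too.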
Ryzen 5800X / RTX 4090 / Windows 11
Ryzen 5600X / RTX 3070 Ti / Ubuntu 22.04
Ryzen 5600 / RTX 3060 Ti / Windows 11
-
- Site Moderator
- Posts: 6349
- Joined: Sun Dec 02, 2007 10:38 am
- Location: Bordeaux, France
Re: 18601 checkpoints too often
Unfortunately, the OpenMM core used on GPUs doesn't support triggered checkpoints: it only works at a predefined frequency. OpenMM also performs checks (we call them sanity checks) between data computed on the GPU and data computed on the CPU before it writes a checkpoint, which explains why there are some interruptions in GPU load.
The GROMACS core used on CPUs is more flexible: you can set the checkpoint frequency, and it can write a checkpoint when the core is interrupted.
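To illustrate the difference between the two policies, here is a toy comparison. It is plain Python, not OpenMM or GROMACS code. The 25000-step interval comes from the first post; the WU length (2,500,000 steps), the card speeds, and the 15-minute timer are assumptions picked only to make the numbers easy to follow.

```python
def step_based_checkpoints(total_steps, every_n_steps):
    """Steps at which a fixed-frequency policy checkpoints (speed plays no role)."""
    return list(range(every_n_steps, total_steps + 1, every_n_steps))

def time_based_checkpoints(total_steps, steps_per_second, every_n_seconds):
    """Steps at which a timed policy checkpoints, for a given hardware speed."""
    steps_between = int(steps_per_second * every_n_seconds)
    return list(range(steps_between, total_steps + 1, steps_between))

TOTAL_STEPS = 2_500_000   # assumed WU length
FAST = 500                # assumed steps/s (25000 steps in 50 s)
SLOW = 200                # assumed steps/s (25000 steps in 125 s)

fixed = step_based_checkpoints(TOTAL_STEPS, 25_000)
timed_fast = time_based_checkpoints(TOTAL_STEPS, FAST, 15 * 60)
timed_slow = time_based_checkpoints(TOTAL_STEPS, SLOW, 15 * 60)

print(len(fixed), len(timed_fast), len(timed_slow))
```

With these assumed numbers the fixed-frequency policy writes 100 checkpoints on every card, however fast it is, while the timed policy writes 5 on the fast card and 13 on the slow one. That is why a fast GPU sees checkpoints so often on a project with a fixed step interval.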