
Confirmation before automatic dumping

Posted: Sat Mar 01, 2025 7:17 am
by arisu
If I switch virtual terminals while folding and an OpenGL context is open, the GPU driver resets itself. When folding is subsequently paused, the client sends SIGINT to the running cores. The GPU core does not respond in time (since the GPU was ripped out from under it, leaving it unresponsive), and the client then kills the core forcibly and dumps the WU. But nothing at all was wrong with the WU, and it could easily have continued from the last checkpoint.

How do I stop the client from automatically dumping the WU after a "Core did not shutdown gracefully" warning when the GPU resets? Asking for confirmation, or at least asking for confirmation with an automatic timeout, would help when the core dies and it's clearly not due to a bad WU. Is there any way to do this in the v8 client, or would I have to patch and recompile it? I've lost quite a few perfectly good WUs because of this.
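
Roughly the sequence I mean, as a self-contained sketch rather than the actual fah-client code: SIGINT first, a grace period, then a forced kill, with a decision point where the client currently dumps the WU. The 5 s grace period and the keep_wu_on_unresponsive_core flag are made up for illustration.

Code:

// Sketch only -- NOT the actual fah-client code. It mimics the sequence
// described above: send SIGINT, wait out a grace period, then force-kill.
// The 5 s grace period and the keep_wu_on_unresponsive_core flag are
// hypothetical; they mark where a "keep the WU" decision could be made.
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <ctime>

// Poll for up to grace_sec seconds; return true if the core process exited.
static bool wait_for_exit(pid_t pid, int grace_sec) {
  for (int i = 0; i < grace_sec * 10; ++i) {
    int status = 0;
    if (waitpid(pid, &status, WNOHANG) == pid) return true;  // core exited
    struct timespec ts = {0, 100 * 1000 * 1000};             // 100 ms
    nanosleep(&ts, nullptr);
  }
  return false;  // still running after the grace period
}

int main() {
  const bool keep_wu_on_unresponsive_core = true;  // hypothetical policy flag

  pid_t core = fork();
  if (core == 0) {
    // Stand-in for a hung GPU core: ignore SIGINT and block forever.
    signal(SIGINT, SIG_IGN);
    for (;;) pause();
  }

  sleep(1);            // folding is running, then a pause is requested
  kill(core, SIGINT);  // ask the core to shut down gracefully

  if (wait_for_exit(core, 5)) {
    puts("Core shut down gracefully");
  } else {
    puts("WARNING: Core did not shutdown gracefully");
    kill(core, SIGKILL);           // the forced kill has to happen either way
    waitpid(core, nullptr, 0);
    if (keep_wu_on_unresponsive_core)
      puts("Keeping WU; retry from last checkpoint once the GPU recovers");
    else
      puts("Dumping WU");          // what the client does today
  }
  return 0;
}

A real patch would have to hook that decision inside the client's own process management, but the shape of the change is the same.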

Re: Confirmation before automatic dumping

Posted: Sat Mar 01, 2025 7:57 am
by calxalot
I think this might be a good enhancement request.

Both client and web control would need to support this.

https://github.com/FoldingAtHome/fah-cl ... tet/issues

The workaround is, of course, to manually pause before doing something that resets the GPU.

Re: Confirmation before automatic dumping

Posted: Sat Mar 01, 2025 7:58 am
by calxalot
Hmm. Might require a reboot to get the GPU driver happy again, in some cases.

Re: Confirmation before automatic dumping

Posted: Sat Mar 01, 2025 8:14 am
by arisu
It's not always possible to know what will reset the GPU. When it does reset, a reboot isn't needed, but yeah, there are cases where it crashes and can't recover, and in those cases the WU would end up getting dumped pretty soon anyway.

Maybe I'll write and test a patch and if it works well, I'll submit a PR.

Re: Confirmation before automatic dumping

Posted: Sat Mar 01, 2025 8:39 am
by calxalot
The only thing I have heard is that Windows RDP can reset drivers.

Re: Confirmation before automatic dumping

Posted: Sat Mar 01, 2025 8:56 am
by arisu
Linux amdgpu can also reset and recover itself:

Code:

[533240.381281] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32770, for process FahCore_26 pid 479221 thread FahCore_26 pid 479221)
[533240.381345] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000760000000000 from client 10
[533240.381373] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00801A30
[533240.381397] amdgpu 0000:03:00.0: amdgpu: 	 Faulty UTCL2 client ID: SDMA0 (0xd)
[533240.381419] amdgpu 0000:03:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[533240.381437] amdgpu 0000:03:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[533240.381456] amdgpu 0000:03:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[533240.381475] amdgpu 0000:03:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[533240.381494] amdgpu 0000:03:00.0: amdgpu: 	 RW: 0x0
[533240.488899] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=2
[533240.489502] amdgpu: failed to add hardware queue to MES, doorbell=0x1202
[533240.489529] amdgpu: MES might be in unrecoverable state, issue a GPU reset
[533240.489556] amdgpu: Failed to restore queue 2
[533240.489575] amdgpu: Failed to restore process queues
[533240.489595] amdgpu: Failed to restore queues of pasid 0x8002
[533240.493583] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[533240.653995] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533240.654165] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533240.760268] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533240.760418] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533240.867906] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533240.868065] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533240.974299] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533240.974447] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533241.080514] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533241.080659] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533241.188208] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533241.188369] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533241.294664] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533241.294815] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533241.401317] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533241.401601] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533241.508607] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[533241.508873] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[533241.525954] amdgpu 0000:03:00.0: amdgpu: MODE2 reset
[533241.558631] amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
[533241.559697] [drm] PCIE GART of 512M enabled (table at 0x000000801FD00000).
[533241.559788] [drm] VRAM is lost due to GPU reset!
[533241.559794] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[533241.562017] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
[533241.562346] [drm] kiq ring mec 3 pipe 1 q 0
[533241.564223] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[533241.564592] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[533241.567010] [drm] DMUB hardware initialized: version=0x08004800
[533241.936214] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[533241.936228] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[533241.936235] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[533241.936241] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[533241.936246] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[533241.936251] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[533241.936257] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[533241.936262] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[533241.936267] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[533241.936273] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[533241.936279] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 1
[533241.936284] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 1
[533241.936290] amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[533242.205967] amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow start
[533242.205987] amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow done
[533242.206648] [drm] ring gfx_32776.1.1 was added
[533242.206972] [drm] ring compute_32776.2.2 was added
[533242.207338] [drm] ring sdma_32776.3.3 was added
[533242.207362] [drm] ring gfx_32776.1.1 test pass
[533242.207388] [drm] ring gfx_32776.1.1 ib test pass
[533242.207404] [drm] ring compute_32776.2.2 test pass
[533242.207432] [drm] ring compute_32776.2.2 ib test pass
[533242.207449] [drm] ring sdma_32776.3.3 test pass
[533242.207483] [drm] ring sdma_32776.3.3 ib test pass
[533242.208645] amdgpu 0000:03:00.0: amdgpu: GPU reset(9) succeeded!

Re: Confirmation before automatic dumping

Posted: Sat Mar 01, 2025 12:26 pm
by muziqaz
Recovery only brings back basic driver functionality; it never reloads all of the modules the GPU needs to keep working as it did before the crash. This has been witnessed with games (they run like crap on a recovered driver) and with FAH (it outright cannot fold on a recovered driver).
Driver recovery is there so that you can save your work (not FAH), and restart the PC gracefully.
Now, why can't FAH restart from the previous checkpoint and continue folding? It can't because the driver is in recovery mode, which is not fully functional, so folding stops on such a hardware/software combination.
Now, how about pausing when a driver crash happens and letting the user reboot and resume the WU? Who is to say that the driver crash did not introduce an error into the scientific simulation? But can't FAH check whether the previous bit of the simulation is correct?
I think Joe would just quit as a dev if we expected FAHClient to hold the user's hand for every possible move.
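
On the "can't FAH check" question, here is a standalone sketch of the kind of check that is even possible; the file names and the FNV-1a hash are assumptions for illustration, not how any FahCore actually stores its checkpoints. An integrity check like this can tell you the checkpoint file on disk is intact, but it cannot tell you that the numbers the GPU produced before the crash were correct.

Code:

// Sketch only: verify a checkpoint file against a sidecar hash before resuming.
// "checkpoint.dat", "checkpoint.dat.hash" and the FNV-1a hash are assumptions
// for illustration. This detects a corrupted checkpoint on disk; it cannot
// detect wrong results computed by a misbehaving GPU before the checkpoint.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

// FNV-1a 64-bit hash over a byte buffer.
static uint64_t fnv1a(const std::vector<char>& data) {
  uint64_t h = 1469598103934665603ull;
  for (char c : data) {
    h ^= static_cast<unsigned char>(c);
    h *= 1099511628211ull;
  }
  return h;
}

// Read a whole file into memory (empty vector if the file is missing).
static std::vector<char> read_all(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  return std::vector<char>(std::istreambuf_iterator<char>(in),
                           std::istreambuf_iterator<char>());
}

int main() {
  std::vector<char> ckpt = read_all("checkpoint.dat");
  std::vector<char> ref  = read_all("checkpoint.dat.hash");

  uint64_t expected = ref.empty() ? 0
                    : std::stoull(std::string(ref.begin(), ref.end()));

  if (!ckpt.empty() && !ref.empty() && fnv1a(ckpt) == expected)
    std::cout << "Checkpoint file is intact; resuming from it is possible\n";
  else
    std::cout << "Checkpoint missing or corrupted; nothing safe to resume\n";
  return 0;
}

So even with a check like that, resuming after a driver crash is still a judgement call about the simulation itself, not just about the file.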

Running 2 compute tasks at the same time is BAD. Run only one.
Running FAH on unstable hardware is bad. Please make sure your hardware is 101% stable before folding.
Running FAH on unstable drivers is bad. Make sure to fix the drivers before folding.

I still encourage filing the request on GitHub, but a word of caution: some of us are still waiting for fixes to more severe issues that make the client dump work left and right :)