Re: 16600 consistently crashing on AMD Radeon VII
Posted: Thu Aug 06, 2020 9:11 pm
by NormalDiffusion
muziqaz wrote:I have Vega64 liquid edition. Replaced stock fan with two noctua fans for push pull, and it is running out of its mind. temps are super low, and clock are insane
That's my plan! But not with the prices now... I could get another rvii for the same money...
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 07, 2020 1:00 am
by ViTe
I have the same issue. Dedicated machine, Win7 64bit, AMD RX570 4gb (not overclocked), Adrenalin 20.2.2, Client ver. 7.6.9, OpenCL 2.0 AMD-APP Driver 3004.8
16600 keeps crashing with the same error message "Particle coordinate is nan". Other projects that use Core 22 are fine.
12:50:36:WU00:FS01:0x22:Project: 16600 (Run 0, Clone 796, Gen 387)
12:50:36:WU00:FS01:0x22:Unit: 0x000001aa8f59f36f5ec369110b1585af
12:50:36:WU00:FS01:0x22:Digital signatures verified
12:50:36:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
12:50:36:WU00:FS01:0x22:Version 0.0.11
12:50:36:WU00:FS01:0x22: Checkpoint write interval: 25000 steps (5%) [20 total]
12:50:36:WU00:FS01:0x22: JSON viewer frame write interval: 5000 steps (1%) [100 total]
12:50:36:WU00:FS01:0x22: XTC frame write interval: 20000 steps (4%) [25 total]
12:50:36:WU00:FS01:0x22: Global context and integrator variables write interval: disabled
12:51:07:WU00:FS01:0x22:Completed 0 out of 500000 steps (0%)
12:52:56:WU00:FS01:0x22:Completed 5000 out of 500000 steps (1%)
12:54:44:WU00:FS01:0x22:Completed 10000 out of 500000 steps (2%)
12:56:32:WU00:FS01:0x22:Completed 15000 out of 500000 steps (3%)
12:58:21:WU00:FS01:0x22:An exception occurred at step 18071: Particle coordinate is nan
12:58:21:WU00:FS01:0x22:Max number of attempts to resume from last checkpoint (2) reached. Aborting.
12:58:21:WU00:FS01:0x22:ERROR:114: Max number of attempts to resume from last checkpoint reached.
12:58:21:WU00:FS01:0x22:Saving result file ..\logfile_01.txt
12:58:21:WU00:FS01:0x22:Saving result file science.log
12:58:21:WU00:FS01:0x22:Saving result file state.xml
12:58:25:WU00:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
12:58:26:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 07, 2020 1:02 am
by Neil-B
Just once or every time you get a 16600?
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 07, 2020 1:12 am
by ViTe
Not once for sure.
12:42:39:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 1483, Gen 38)
12:43:10:WU01:FS01:0x22:Completed 275000 out of 500000 steps (55%)
12:44:39:WU01:FS01:0x22:An exception occurred at step 278860: Particle coordinate is nan
.........
**********************************
10:57:09:WU00:FS01:0x22:Project: 16600 (Run 0, Clone 1579, Gen 29)
10:57:40:WU00:FS01:0x22:Completed 25000 out of 500000 steps (5%)
10:59:21:WU00:FS01:0x22:An exception occurred at step 28362: Particle coordinate is nan
**********************************
02:05:44:WU00:FS01:0x22:Project: 16600 (Run 0, Clone 1096, Gen 319)
02:19:01:WU00:FS01:0x22:Completed 35000 out of 500000 steps (7%)
02:19:58:WU00:FS01:0x22:An exception occurred at step 35641: Particle coordinate is nan
**********************************
01:53:02:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 1983, Gen 19)
02:00:42:WU01:FS01:0x22:Completed 320000 out of 500000 steps (64%)
02:01:41:WU01:FS01:0x22:An exception occurred at step 322534: Particle coordinate is nan
Only a few runs reached 100% successfully. Most of the 16600 WUs are faulty.
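For anyone who wants to tally these instead of eyeballing the log, here is a minimal sketch that pairs each "Particle coordinate is nan" exception with the project/run/clone/gen announced for the same WU and slot. The regular expressions are modelled only on the log lines pasted in this thread, and the function name `find_nan_failures` is made up for illustration; it is not part of the FAH client.

```python
import re

# Patterns modelled on the Core22 lines quoted above (an assumption, not an official format spec).
PROJECT_RE = re.compile(
    r"(WU\d+):(FS\d+):0x22:Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)"
)
NAN_RE = re.compile(
    r"(WU\d+):(FS\d+):0x22:An exception occurred at step (\d+): Particle coordinate is nan"
)


def find_nan_failures(log_text):
    """Return one dict per NaN exception, tagged with the work unit it belongs to."""
    current = {}   # (WU, FS) -> (project, run, clone, gen) most recently announced
    failures = []
    for line in log_text.splitlines():
        m = PROJECT_RE.search(line)
        if m:
            wu, fs, *prcg = m.groups()
            current[(wu, fs)] = tuple(int(x) for x in prcg)
            continue
        m = NAN_RE.search(line)
        if m:
            wu, fs, step = m.groups()
            prcg = current.get((wu, fs))
            if prcg is not None:
                failures.append({"prcg": prcg, "step": int(step)})
    return failures


sample = """\
12:42:39:WU01:FS01:0x22:Project: 16600 (Run 0, Clone 1483, Gen 38)
12:43:10:WU01:FS01:0x22:Completed 275000 out of 500000 steps (55%)
12:44:39:WU01:FS01:0x22:An exception occurred at step 278860: Particle coordinate is nan
10:57:09:WU00:FS01:0x22:Project: 16600 (Run 0, Clone 1579, Gen 29)
10:59:21:WU00:FS01:0x22:An exception occurred at step 28362: Particle coordinate is nan
"""

for failure in find_nan_failures(sample):
    print(failure)
```

If the core retries from a checkpoint and fails again, each exception line produces its own entry, so the same work unit can appear more than once.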
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 07, 2020 3:35 pm
by bruce
@muziqaz/neil-b/etc.
Maybe we should check with the owner of p16600 and see if they can establish a pattern. I'm guessing that there are a number of unidentified "bad WUs" floating around. I'm also guessing they don't work with specific drivers -- or maybe they don't work in Linux (presumably also a driver issue).
Many of these AMD devices have been poorly serviced by GPUs.txt if they're GPUs and many have been turned off out of frustration if they're CPUs. The total number of individual devices probably represents a pretty wide variety of binary codes and the population of individual members is probably low. How do we find a representative spectrum of useful devices and identify their collective problems?
Many people, including myself, have migrated to nV devices but I have several AMD GPU cards sitting on my workbench which just need (A) a M/B and (B) the time and energy to configure a kit that can run them.
How should we attack this problem? I think we need to systematically gather more data, but that's not the only thing that needs to be done. How many are off-line because of the 192.0.2.1 blacklisting process? Representative samples DO need to get data into project 17100.
See also
Folding is not fun right now - lots of trouble, no result and others.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 07, 2020 4:13 pm
by muziqaz
@Bruce, I did contact the owner of the project; they have yet to respond.
I received an answer from Sukrit regarding 16448. He says it has a very high failure rate, though overnight testing on 3 different AMD GPUs did not come up with any errors at all.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Fri Aug 07, 2020 11:35 pm
by gunnarre
The work units have a core log file (logfile_01.txt) in them. Does this file get uploaded to the WS/CS, even if the WU gets dumped? If so, this file could be parsed server-side for the errors in question. But it doesn't look like this log file has the required information about which GPU, OpenCL and CUDA drivers are installed - only the CPU and OS version.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sat Aug 08, 2020 1:22 am
by bruce
For support, we ask you to post the first ~100 lines of FAH's primary log file. It shows the installation configuration of the GPU(s). Mine looks like this. (Yours will be different, of course).
18:19:51: OS Arch: AMD64
18:19:51: GPUs: 1
18:19:51: GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:7 GP107 [GeForce GTX 1050 Ti] 2138
18:19:51: CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:6.1 Driver:11.0
18:19:51:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:446.14
My GTX 1050 Ti gpu is running with Driver:446.14 and with CUDA 11.0
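If you want to pull the driver details out of a posted log automatically, something like the following sketch could work. It assumes the GPU, CUDA and OpenCL lines look like the three in my excerpt above; the function name `parse_gpu_config` and the field names are invented for this example, and the number after the vendor name is stored as `code` because its meaning is not explained in this thread.

```python
import re

# Modelled on the three device lines quoted above (assumed layout).
GPU_RE = re.compile(r"GPU (\d+): Bus:(\d+) Slot:(\d+) Func:(\d+) (\w+):(\d+) (.+)")
API_RE = re.compile(r"(CUDA|OpenCL) Device (\d+): .*Compute:([\d.]+) Driver:([\d.]+)")


def parse_gpu_config(log_text):
    """Return (gpus, apis): detected GPUs and the CUDA/OpenCL devices with driver versions."""
    gpus, apis = [], []
    for line in log_text.splitlines():
        m = API_RE.search(line)
        if m:
            api, index, compute, driver = m.groups()
            apis.append(
                {"api": api, "index": int(index), "compute": compute, "driver": driver}
            )
            continue
        m = GPU_RE.search(line)
        if m:
            index, bus, slot, func, vendor, code, description = m.groups()
            gpus.append(
                {
                    "index": int(index),
                    "bus": int(bus),
                    "slot": int(slot),
                    "func": int(func),
                    "vendor": vendor,
                    "code": int(code),
                    "description": description.strip(),
                }
            )
    return gpus, apis


sample = """\
18:19:51: OS Arch: AMD64
18:19:51: GPUs: 1
18:19:51: GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:7 GP107 [GeForce GTX 1050 Ti] 2138
18:19:51: CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:6.1 Driver:11.0
18:19:51:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:446.14
"""

gpus, apis = parse_gpu_config(sample)
print(gpus)
print(apis)
```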
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sun Aug 09, 2020 5:47 am
by Nuitari
06:28:12: GPU 0: Bus:0 Slot:2 Func:0 INTEL:1 Gen9p5/GT2 [UHD Graphics 630]
06:28:12: GPU 1: Bus:1 Slot:0 Func:0 AMD:5 Baffin XT [Radeon RX 460]
Card is an RX560
Lots of failures for both p16600 and p13421.
Out of 39 WU for 16600, only 6 succeeded...
Of particular note, this one project:16600 run:0 clone:1314 gen:197 has a driver error displayed in dmesg
[Sat Aug 8 04:55:18 2020] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0d780402 for process FahCore_22 pid 1343 thread FahCore_22 pid 1343
[Sat Aug 8 04:55:18 2020] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0090AFAF
[Sat Aug 8 04:55:18 2020] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E004002
[Sat Aug 8 04:55:18 2020] amdgpu 0000:01:00.0: VM fault (0x02, vmid 7, pasid 32769) at page 9482159, read from 'TC3' (0x54433300) (4)
It's the only one with that error message where I could match it up with the entry.
There are a few more very similar errors on one of my RX570-based rigs
[Thu Aug 6 23:25:24 2020] amdgpu 0000:07:00.0: GPU fault detected: 147 0x0cf80402 for process FahCore_22 pid 3728 thread FahCore_22 pid 3728
[Thu Aug 6 23:25:24 2020] amdgpu 0000:07:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00C0AD9F
[Thu Aug 6 23:25:24 2020] amdgpu 0000:07:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A004002
[Thu Aug 6 23:25:24 2020] amdgpu 0000:07:00.0: VM fault (0x02, vmid 5, pasid 32800) at page 12627359, read from 'TC3' (0x54433300) (4)
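To compare faults across rigs, a rough sketch like this can pull the key fields out of dmesg. The patterns are based only on the amdgpu lines pasted above, and `parse_amdgpu_faults` is a made-up helper name. One thing it makes easy to check: in both excerpts the decimal page number in the "VM fault" line is the same value as the hex VM_CONTEXT1_PROTECTION_FAULT_ADDR.

```python
import re

# Patterns based on the amdgpu messages quoted above (assumed layout).
FAULT_RE = re.compile(
    r"amdgpu (\S+): GPU fault detected: \d+ 0x[0-9a-fA-F]+ for process (\S+) pid (\d+)"
)
ADDR_RE = re.compile(r"amdgpu (\S+): VM_CONTEXT1_PROTECTION_FAULT_ADDR (0x[0-9a-fA-F]+)")
PAGE_RE = re.compile(
    r"amdgpu (\S+): VM fault \(.*?vmid (\d+), pasid (\d+)\) at page (\d+), "
    r"(read|write) from '(\w+)'"
)


def parse_amdgpu_faults(dmesg_text):
    """Group each 'GPU fault detected' line with the detail lines that follow it."""
    faults = []
    for line in dmesg_text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            device, process, pid = m.groups()
            faults.append({"device": device, "process": process, "pid": int(pid)})
            continue
        if not faults:
            continue
        m = ADDR_RE.search(line)
        if m and m.group(1) == faults[-1]["device"]:
            faults[-1]["fault_addr"] = int(m.group(2), 16)
            continue
        m = PAGE_RE.search(line)
        if m and m.group(1) == faults[-1]["device"]:
            _, vmid, pasid, page, access, client = m.groups()
            faults[-1].update(
                vmid=int(vmid), pasid=int(pasid), page=int(page),
                access=access, client=client,
            )
    return faults


sample = """\
[Sat Aug 8 04:55:18 2020] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0d780402 for process FahCore_22 pid 1343 thread FahCore_22 pid 1343
[Sat Aug 8 04:55:18 2020] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0090AFAF
[Sat Aug 8 04:55:18 2020] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E004002
[Sat Aug 8 04:55:18 2020] amdgpu 0000:01:00.0: VM fault (0x02, vmid 7, pasid 32769) at page 9482159, read from 'TC3' (0x54433300) (4)
"""

for fault in parse_amdgpu_faults(sample):
    print(fault)
```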
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sun Aug 09, 2020 6:42 am
by muziqaz
Nuitari wrote:06:28:12: GPU 0: Bus:0 Slot:2 Func:0 INTEL:1 Gen9p5/GT2 [UHD Graphics 630]
06:28:12: GPU 1: Bus:1 Slot:0 Func:0 AMD:5 Baffin XT [Radeon RX 460]
Card is an RX560
Lots of failures for both p16600 and p13421.
Out of 39 WU for 16600, only 6 succeeded...
Of particular note, this one project:16600 run:0 clone:1314 gen:197 has a driver error displayed in dmesg
[Sat Aug 8 04:55:18 2020] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0d780402 for process FahCore_22 pid 1343 thread FahCore_22 pid 1343
[Sat Aug 8 04:55:18 2020] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0090AFAF
[Sat Aug 8 04:55:18 2020] amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E004002
[Sat Aug 8 04:55:18 2020] amdgpu 0000:01:00.0: VM fault (0x02, vmid 7, pasid 32769) at page 9482159, read from 'TC3' (0x54433300) (4)
It's the only one with that error message where I could match it up with the entry.
There are a few more very similar errors on one of my RX570-based rigs
[Thu Aug 6 23:25:24 2020] amdgpu 0000:07:00.0: GPU fault detected: 147 0x0cf80402 for process FahCore_22 pid 3728 thread FahCore_22 pid 3728
[Thu Aug 6 23:25:24 2020] amdgpu 0000:07:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00C0AD9F
[Thu Aug 6 23:25:24 2020] amdgpu 0000:07:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A004002
[Thu Aug 6 23:25:24 2020] amdgpu 0000:07:00.0: VM fault (0x02, vmid 5, pasid 32800) at page 12627359, read from 'TC3' (0x54433300) (4)
In FAHControl, you have to delete the Intel GPU slot. Best would be to manually remove both GPU slots and then add a new slot for just the Intel iGPU. Also, your RX560 is weirdly recognised as an RX460. Is the OS in a VM?
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Sun Aug 09, 2020 7:11 am
by Joe_H
muziqaz wrote:In FAHControl, you have to delete the Intel GPU slot. Best would be to manually remove both GPU slots and then add a new slot for just the Intel iGPU. Also, your RX560 is weirdly recognised as an RX460. Is the OS in a VM?
Do Not Delete the slot for the Intel GPU if it is detected, especially if running v7.6.13. There is a bug in the F@h client: it will just recreate the slot as long as that Intel GPU is enabled as a test platform for the internal testers. I don't know if the bug also causes the same problem on versions older than 7.6.13.
Deleting the drivers and OpenCL support for the Intel iGPU may keep it from being detected by the client.
The RX 460 and RX 560 are the same device, AMD just reused the GPU chip at a slightly different configuration but with the same Device ID.
To keep the Intel GPU from requesting work since it will not get any, first pause the slot by right-clicking on it in FAHControl. Then in Configure select the Slots tab and click on the GPU slot for the Intel GPU and then Edit. Add the Extra Slot Option 'pause-on-start' and set its value to 'true'. OK the changes and Save.
Afterwards, if you pause folding, restart the slots by right-clicking on each one except the slot for the Intel GPU.
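For those who prefer to script it, the same option can in principle be set in the client's config.xml. The sketch below is only an illustration under an assumption: that slots are stored as `<slot id='N' type='GPU'>` elements with options as child elements carrying a `v` attribute. Check your own config.xml before relying on it, and stop the client before editing the real file. The function name `pause_slot_on_start` is made up for this example.

```python
import xml.etree.ElementTree as ET


def pause_slot_on_start(config_xml, slot_id):
    """Return config XML text with pause-on-start set to true for the given slot id."""
    root = ET.fromstring(config_xml)
    for slot in root.iter("slot"):
        if slot.get("id") == str(slot_id):
            option = slot.find("pause-on-start")
            if option is None:
                option = ET.SubElement(slot, "pause-on-start")
            option.set("v", "true")
            break
    else:
        raise ValueError(f"no slot with id {slot_id}")
    return ET.tostring(root, encoding="unicode")


# Hypothetical two-slot config: slot 0 stands in for the Intel iGPU, slot 1 for the AMD card.
sample_config = """\
<config>
  <slot id='0' type='GPU'/>
  <slot id='1' type='GPU'/>
</config>
"""

print(pause_slot_on_start(sample_config, 0))
```

Going through FAHControl as described above is still the safer route, since the client writes the file itself.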
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Mon Aug 10, 2020 12:00 am
by Nuitari
Nowhere was it ever mentioned that the Intel GPU slot was causing any issue. It's part of what the client sees; however, it cannot use it as it's not included in the OpenCL devices.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Mon Aug 10, 2020 8:05 am
by NormalDiffusion
Some more info about 16600 WUs over the weekend:
It seems to be an AMD-only problem:
- Titan Xp on i9-7900x: 24 from 24 completed -- 0% failure rate
- RTX 2070 Super on dual Xeon E5-2690: 13 from 13 completed -- 0% failure rate
- Radeon VII on Xeon E5-1650v4: 4 from 19 completed -- 79% failure rate
- Radeon VII on Xeon E5-1620v2: 2 from 26 completed -- 92% failure rate
- Radeon 290x on Xeon E5-1620v2: 2 from 20 completed -- 90% failure rate
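For reference, the percentages are just (total - completed) / total, rounded to the nearest whole percent. A quick sketch that recomputes them from the counts listed above (the helper name `failure_rate` is only for illustration):

```python
# (label, completed, total) taken from the list above.
results = [
    ("Titan Xp on i9-7900x", 24, 24),
    ("RTX 2070 Super on dual Xeon E5-2690", 13, 13),
    ("Radeon VII on Xeon E5-1650v4", 4, 19),
    ("Radeon VII on Xeon E5-1620v2", 2, 26),
    ("Radeon 290x on Xeon E5-1620v2", 2, 20),
]


def failure_rate(completed, total):
    """Percentage of work units that did not complete."""
    return 100.0 * (total - completed) / total


for label, completed, total in results:
    rate = failure_rate(completed, total)
    print(f"- {label}: {completed} from {total} completed -- {rate:.0f}% failure rate")
```

The same helper can be pointed at any other set of counts posted in the thread.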
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Mon Aug 10, 2020 8:23 am
by muziqaz
The project owners haven't replied yet. Until they do, there is nothing we can do, I'm afraid.
Re: 16600 consistently crashing on AMD Radeon VII
Posted: Mon Aug 10, 2020 11:02 pm
by ViTe
muziqaz wrote:The project owners haven't replied yet. Until they do, there is nothing we can do, I'm afraid.
We can. Ban the WS on your machine and you will never see 16600 again.