AMD GPU Error sortShortList on some projects

If you think it might be a driver problem, see viewforum.php?f=79

MrFrizzy
Posts: 123
Joined: Fri Feb 14, 2020 4:48 am

Re: AMD GPU Error sortShortList on some projects

Post by MrFrizzy »

muziqaz wrote: Researchers are looking into disabling those projects on AMD GPUs until a fix has been found.
Just to add to this discussion, I have a 5700 XT and have had no failures on any of the projects mentioned in this thread. Perhaps the source of the error isn't present on Navi cards?

Successful projects (tracked in the spreadsheet in my sig): 11741-11752, 11755, 11759, 11762-11764, 11776-11778, 11780, 11781

Driver: 20.2.2

Similar post here: viewtopic.php?f=81&t=32771
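
The project list above uses range shorthand. For anyone who wants to compare it against projects reported as failing elsewhere in the thread, here is a minimal sketch that expands the shorthand into individual project numbers. The function name `expand_projects` is made up for illustration; only the project string itself comes from the post.

```python
def expand_projects(spec):
    """Expand a string like '11741-11752, 11755' into a set of ints."""
    projects = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range such as 11741-11752
            start, end = (int(x) for x in part.split("-"))
            projects.update(range(start, end + 1))
        else:
            projects.add(int(part))
    return projects


# Project list copied from the post above
successful = expand_projects(
    "11741-11752, 11755, 11759, 11762-11764, 11776-11778, 11780, 11781"
)
print(len(successful), 11776 in successful)
```

A membership check such as `11776 in successful` then shows whether a project that failed on another card was one that completed on the 5700 XT.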
S1: AMD R5 3600 & Sapphire RX 5700 XT Reference @2.1GHz under water
S2: Intel Xeon E5-2620v3 & MSI GTX 1650

RX 5700 XT Project & PPD Tracking Spreadsheet

muziqaz
Posts: 946
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Re: AMD GPU Error sortShortList on some projects

Post by muziqaz »

Thank you for the information. It seems that GCN-based cards are the ones affected.
Big Navi can't come quick enough :D
FAH Omega tester
alxbelu
Posts: 105
Joined: Sat Mar 14, 2020 6:28 pm

Re: AMD GPU Error sortShortList on some projects

Post by alxbelu »

Yep, and yep! (I was planning on upgrading my desktop this year; my 290X just turned 6 and deserves retirement. But I guess we'll see if launches actually happen as planned this year.)
Official F@H Twitter (frequently updated): https://twitter.com/foldingathome
Official F@H Facebook: https://www.facebook.com/Foldinghome-136059519794607/

(I'm not affiliated with the F@H Team, just promoting these channels for official updates)
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: AMD GPU Error sortShortList on some projects

Post by _r2w_ben »

muziqaz wrote: Researchers are looking into disabling those projects on AMD GPUs until a fix has been found.
Thank you for understanding
The restriction needs to be added to p14533. A work unit from that project was assigned at 2020-03-25T23:29:18Z.
muziqaz
Posts: 946
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Re: AMD GPU Error sortShortList on some projects

Post by muziqaz »

Thanks for the info. It was passed to researchers.
FAH Omega tester
MrFrizzy
Posts: 123
Joined: Fri Feb 14, 2020 4:48 am

Re: AMD GPU Error sortShortList on some projects

Post by MrFrizzy »

On the 5700 XT, the only project 14533 work unit I received ran to 100%, but when the results were sent, the server dumped them. This is a different outcome from the kernel error discussed earlier, so I think the distinction is worth pointing out. Whatever is causing the kernel error, it does not affect all AMD cards.

As pointed out in an earlier post, I can process all of the COVID-19 core22 projects just fine; not one has errored out for any reason other than me messing with my overclock. In the spreadsheet in my sig, I have tracked 95 successful COVID-19 core22 projects (85 are shown). If any of the devs/researchers need more information, I can provide PRCG numbers for all projects with timestamps, or even the full logs (I archive all of them before the client can clean them out).

I would suggest not blocking all AMD cards on these projects, and allowing species 6 to continue folding.
S1: AMD R5 3600 & Sapphire RX 5700 XT Reference @2.1GHz under water
S2: Intel Xeon E5-2620v3 & MSI GTX 1650

RX 5700 XT Project & PPD Tracking Spreadsheet

muziqaz
Posts: 946
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Re: AMD GPU Error sortShortList on some projects

Post by muziqaz »

At the moment, projects which are known to fail on AMD are being blocked. The rest of them are freely available (relatively speaking).
FAH Omega tester
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: AMD GPU Error sortShortList on some projects

Post by _r2w_ben »

The restriction needs to be added to p11781. A work unit from that project was assigned at 2020-03-28T20:10:11Z.
IkkeDus
Posts: 14
Joined: Wed Jun 18, 2008 10:42 am
Hardware configuration: Q9550 @ 2.8 GHz
WIN10 x64
2x Radeon R9 280X-3GB
1x Radeon HD 7950-3GB
Location: Amsterdam, The Netherlands

Re: AMD GPU Error sortShortList on some projects

Post by IkkeDus »

I also see this problem.
AMD R9 280X 3GB (ID: 6798 SUB: 3001)

Project: 11776

It often seems to get stuck after the "...0x22:Version 0.0.2" log line. If I leave it alone, it stays there for hours. If I pause and unpause it, it either gets stuck there again or finishes with the error shown in the log. At least after the error it goes on to try to fetch another WU.

Code:

20:42:10:WU02:FS02:Starting
20:42:10:WU02:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Ray\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 02 -suffix 01 -version 705 -lifeline 8668 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 2 -gpu 2
20:42:10:WU02:FS02:Started FahCore on PID 8908
20:42:10:WU02:FS02:Core PID:8932
20:42:10:WU02:FS02:FahCore 0x22 started
20:42:11:WU02:FS02:0x22:*********************** Log Started 2020-03-28T20:42:10Z ***********************
20:42:11:WU02:FS02:0x22:*************************** Core22 Folding@home Core ***************************
20:42:11:WU02:FS02:0x22:       Type: 0x22
20:42:11:WU02:FS02:0x22:       Core: Core22
20:42:11:WU02:FS02:0x22:    Website: https://foldingathome.org/
20:42:11:WU02:FS02:0x22:  Copyright: (c) 2009-2018 foldingathome.org
20:42:11:WU02:FS02:0x22:     Author: John Chodera <[email protected]> and Rafal Wiewiora
20:42:11:WU02:FS02:0x22:             <[email protected]>
20:42:11:WU02:FS02:0x22:       Args: -dir 02 -suffix 01 -version 705 -lifeline 8908 -checkpoint 15
20:42:11:WU02:FS02:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 2 -gpu 2
20:42:11:WU02:FS02:0x22:     Config: <none>
20:42:11:WU02:FS02:0x22:************************************ Build *************************************
20:42:11:WU02:FS02:0x22:    Version: 0.0.2
20:42:11:WU02:FS02:0x22:       Date: Dec 6 2019
20:42:11:WU02:FS02:0x22:       Time: 21:30:31
20:42:11:WU02:FS02:0x22: Repository: Git
20:42:11:WU02:FS02:0x22:   Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
20:42:11:WU02:FS02:0x22:     Branch: HEAD
20:42:11:WU02:FS02:0x22:   Compiler: Visual C++ 2008
20:42:11:WU02:FS02:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
20:42:11:WU02:FS02:0x22:   Platform: win32 10
20:42:11:WU02:FS02:0x22:       Bits: 64
20:42:11:WU02:FS02:0x22:       Mode: Release
20:42:11:WU02:FS02:0x22:************************************ System ************************************
20:42:11:WU02:FS02:0x22:        CPU: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
20:42:11:WU02:FS02:0x22:     CPU ID: GenuineIntel Family 6 Model 23 Stepping 10
20:42:11:WU02:FS02:0x22:       CPUs: 4
20:42:11:WU02:FS02:0x22:     Memory: 4.00GiB
20:42:11:WU02:FS02:0x22:Free Memory: 2.13GiB
20:42:11:WU02:FS02:0x22:    Threads: WINDOWS_THREADS
20:42:11:WU02:FS02:0x22: OS Version: 6.2
20:42:11:WU02:FS02:0x22:Has Battery: false
20:42:11:WU02:FS02:0x22: On Battery: false
20:42:11:WU02:FS02:0x22: UTC Offset: 1
20:42:11:WU02:FS02:0x22:        PID: 8932
20:42:11:WU02:FS02:0x22:        CWD: C:\Users\\AppData\Roaming\FAHClient\work
20:42:11:WU02:FS02:0x22:         OS: Windows 10 Pro
20:42:11:WU02:FS02:0x22:    OS Arch: AMD64
20:42:11:WU02:FS02:0x22:********************************************************************************
20:42:11:WU02:FS02:0x22:Project: 11776 (Run 0, Clone 1781, Gen 6)
20:42:11:WU02:FS02:0x22:Unit: 0x0000000f287234c95e73c47b56c80b8a
20:42:11:WU02:FS02:0x22:Reading tar file core.xml
20:42:11:WU02:FS02:0x22:Reading tar file integrator.xml
20:42:11:WU02:FS02:0x22:Reading tar file state.xml
20:42:12:WU02:FS02:0x22:Reading tar file system.xml
20:42:14:WU02:FS02:0x22:Digital signatures verified
20:42:14:WU02:FS02:0x22:Folding@home GPU Core22 Folding@home Core
20:42:14:WU02:FS02:0x22:Version 0.0.2
20:42:45:WU02:FS02:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
20:42:45:WU02:FS02:0x22:Saving result file ..\logfile_01.txt
20:42:45:WU02:FS02:0x22:Saving result file science.log
20:42:45:WU02:FS02:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
20:42:45:WARNING:WU02:FS02:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
20:42:45:WU02:FS02:Sending unit results: id:02 state:SEND error:FAULTY project:11776 run:0 clone:1781 gen:6 core:0x22 unit:0x0000000f287234c95e73c47b56c80b8a
20:42:45:WU02:FS02:Uploading 15.00KiB to 40.114.52.201
20:42:45:WU02:FS02:Connecting to 40.114.52.201:8080
20:42:46:WU03:FS02:Connecting to 65.254.110.245:8080
20:42:46:WARNING:WU03:FS02:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
20:42:46:WU03:FS02:Connecting to 18.218.241.186:80
20:42:47:WARNING:WU03:FS02:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
20:42:47:ERROR:WU03:FS02:Exception: Could not get an assignment
20:42:47:WU03:FS02:Connecting to 65.254.110.245:8080
20:42:47:WARNING:WU03:FS02:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
20:42:47:WU03:FS02:Connecting to 18.218.241.186:80
20:42:48:WARNING:WU03:FS02:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
20:42:48:ERROR:WU03:FS02:Exception: Could not get an assignment
20:43:06:WARNING:WU02:FS02:WorkServer connection failed on port 8080 trying 80
20:43:06:WU02:FS02:Connecting to 40.114.52.201:80
20:43:14:WU02:FS02:Upload 100.00%
20:43:30:WU02:FS02:Upload complete
20:43:30:WU02:FS02:Server responded WORK_ACK (400)
20:43:30:WU02:FS02:Cleaning up
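
For anyone collecting reports like the one above, here is a minimal sketch that scans FAHClient log lines for the kernel exception and pairs it with the most recent Project/Run/Clone/Gen line. The regular expressions are written against the log lines shown in this post; the function name `find_kernel_failures` and the result layout are assumptions for illustration, and logs with several slots interleaved would need per-slot tracking that this sketch does not do.

```python
import re

# Patterns written against the log lines in this thread (hypothetical helper)
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
KERNEL_ERROR_RE = re.compile(
    r"ERROR:exception: Error invoking kernel (\w+): (\w+) \((-?\d+)\)"
)


def find_kernel_failures(log_lines):
    """Return one dict per kernel error, tagged with the latest PRCG seen."""
    failures = []
    current_prcg = None
    for line in log_lines:
        project = PROJECT_RE.search(line)
        if project:
            current_prcg = tuple(int(g) for g in project.groups())
            continue
        error = KERNEL_ERROR_RE.search(line)
        if error:
            failures.append({
                "prcg": current_prcg,
                "kernel": error.group(1),
                "call": error.group(2),
                "code": int(error.group(3)),
            })
    return failures


# Three lines copied from the log above
sample = [
    "20:42:11:WU02:FS02:0x22:Project: 11776 (Run 0, Clone 1781, Gen 6)",
    "20:42:14:WU02:FS02:0x22:Version 0.0.2",
    "20:42:45:WU02:FS02:0x22:ERROR:exception: Error invoking kernel "
    "sortShortList: clEnqueueNDRangeKernel (-5)",
]
failures = find_kernel_failures(sample)
print(failures)
```

In practice the list of lines would come from reading the client's log file, for example `open(path).read().splitlines()`, where `path` is wherever your client keeps its log.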
Q9550 @ 2.8 GHz | 2x R9 280X-3GB | HD 7950-3GB | Win10 x64

muziqaz
Posts: 946
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 7950x3D, 5950x, 5800x3D, 3900x
7900xtx, Radeon 7, 5700xt, 6900xt, RX 550 640SP
Location: London

Re: AMD GPU Error sortShortList on some projects

Post by muziqaz »

Just an update, some people at AMD are aware of this issue and are looking into it :)
Hopefully we will have it solved sooner rather than later :)
Thank you for your patience
FAH Omega tester
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: AMD GPU Error sortShortList on some projects

Post by bruce »

First, a temporary solution from FAH: those projects will not be assigned to that group of GPUs.
Second, a permanent solution: new AMD drivers or a new FAHCore from FAH will be prepared that fixes the original problem. (The temporary solution will then be removed.)
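
To picture the temporary solution, here is a minimal sketch of a per-family project blocklist. It is not how the FAH assignment servers are implemented; the names `BLOCKED` and `may_assign` and the family labels are made up, and the project numbers are simply ones users in this thread reported as needing the restriction or as failing.

```python
# Hypothetical blocklist: GPU family label -> projects not to assign.
# Project numbers are taken from reports in this thread, for illustration.
BLOCKED = {
    "GCN": {11759, 11776, 11781, 14533},
}


def may_assign(project, gpu_family):
    """Return True if the project is not blocked for this GPU family."""
    return project not in BLOCKED.get(gpu_family, set())


print(may_assign(11776, "GCN"))   # blocked family
print(may_assign(11776, "Navi"))  # family with no blocklist entry
```

Keying the blocklist on the GPU family rather than on the vendor is what lets unaffected cards keep receiving the same projects. Removing the temporary solution then amounts to emptying the family's entry.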
alxbelu
Posts: 105
Joined: Sat Mar 14, 2020 6:28 pm

Re: AMD GPU Error sortShortList on some projects

Post by alxbelu »

That's great news! Thanks for the update!
Official F@H Twitter (frequently updated): https://twitter.com/foldingathome
Official F@H Facebook: https://www.facebook.com/Foldinghome-136059519794607/

(I'm not affiliated with the F@H Team, just promoting these channels for official updates)
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: AMD GPU Error sortShortList on some projects

Post by _r2w_ben »

The restriction needs to be added to p11759. A work unit from that project was assigned at 2020-04-04T20:59:25Z.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: AMD GPU Error sortShortList on some projects

Post by bruce »

MrFrizzy wrote:
muziqaz wrote: Researchers are looking into disabling those projects on AMD GPUs until a fix has been found.
Just to add to this discussion, I have a 5700 XT and have had no failures on any of the projects mentioned in this thread. Perhaps the source of the error isn't present on Navi cards?

Right. Navi is the one exception.
Hey_Allen
Posts: 1
Joined: Thu Apr 02, 2020 2:34 am

Re: AMD GPU Error sortShortList on some projects

Post by Hey_Allen »

It appears that AMD GPUs are still getting this family of projects.

Project 11776 just failed on my RX 580, and I've had a few work units end with the status "Failure 2" as reported on the stats page.
I've also seen a few instances where a ~20 credit job was submitted; when I caught one and examined it, it turned out to be a failed unit.

Code:

20:05:04:WU02:FS01:Connecting to 65.254.110.245:8080
20:05:04:WU02:FS01:Assigned to work server 140.163.4.231
20:05:04:WU02:FS01:Requesting new work unit for slot 01: READY gpu:0:Ellesmere XT [Radeon RX 470/480/570/580/590] from 140.163.4.231
20:05:04:WU02:FS01:Connecting to 140.163.4.231:8080
20:05:25:WARNING:WU02:FS01:WorkServer connection failed on port 8080 trying 80
20:05:25:WU02:FS01:Connecting to 140.163.4.231:80
20:05:46:ERROR:WU02:FS01:Exception: Failed to connect to 140.163.4.231:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
******************************* Date: 2020-04-06 *******************************
01:27:04:WU02:FS01:Connecting to 65.254.110.245:8080
01:27:04:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
01:27:04:WU02:FS01:Connecting to 18.218.241.186:80
01:27:04:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:27:04:ERROR:WU02:FS01:Exception: Could not get an assignment
01:51:27:WU02:FS01:Connecting to 65.254.110.245:8080
01:51:28:WARNING:WU02:FS01:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
01:51:28:WU02:FS01:Connecting to 18.218.241.186:80
01:51:29:WARNING:WU02:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
01:51:29:ERROR:WU02:FS01:Exception: Could not get an assignment
01:53:04:WU02:FS01:Connecting to 65.254.110.245:8080
01:53:04:WU02:FS01:Assigned to work server 40.114.52.201
01:53:04:WU02:FS01:Requesting new work unit for slot 01: READY gpu:0:Ellesmere XT [Radeon RX 470/480/570/580/590] from 40.114.52.201
01:53:04:WU02:FS01:Connecting to 40.114.52.201:8080
01:53:32:WU02:FS01:Downloading 79.12MiB
01:53:38:WU02:FS01:Download 7.74%
01:53:44:WU02:FS01:Download 19.59%
01:53:50:WU02:FS01:Download 30.10%
01:53:56:WU02:FS01:Download 43.84%
01:54:02:WU02:FS01:Download 57.90%
01:54:08:WU02:FS01:Download 71.96%
01:54:14:WU02:FS01:Download 86.57%
01:54:19:WU02:FS01:Download complete
01:54:19:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:11776 run:0 clone:31304 gen:7 core:0x22 unit:0x0000000b287234c95e7931c2b282407f
01:54:20:WU02:FS01:Starting
01:54:20:WU02:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Josh\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 02 -suffix 01 -version 705 -lifeline 14036 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
01:54:20:WU02:FS01:Started FahCore on PID 2444
01:54:20:WU02:FS01:Core PID:13684
01:54:20:WU02:FS01:FahCore 0x22 started
01:54:20:WU02:FS01:0x22:*********************** Log Started 2020-04-07T01:54:20Z ***********************
01:54:20:WU02:FS01:0x22:*************************** Core22 Folding@home Core ***************************
01:54:20:WU02:FS01:0x22:       Type: 0x22
01:54:20:WU02:FS01:0x22:       Core: Core22
01:54:20:WU02:FS01:0x22:    Website: https://foldingathome.org/
01:54:20:WU02:FS01:0x22:  Copyright: (c) 2009-2018 foldingathome.org
01:54:20:WU02:FS01:0x22:     Author: John Chodera <[email protected]> and Rafal Wiewiora
01:54:20:WU02:FS01:0x22:             <[email protected]>
01:54:20:WU02:FS01:0x22:       Args: -dir 02 -suffix 01 -version 705 -lifeline 2444 -checkpoint 15
01:54:20:WU02:FS01:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
01:54:20:WU02:FS01:0x22:     Config: <none>
01:54:20:WU02:FS01:0x22:************************************ Build *************************************
01:54:20:WU02:FS01:0x22:    Version: 0.0.2
01:54:20:WU02:FS01:0x22:       Date: Dec 6 2019
01:54:20:WU02:FS01:0x22:       Time: 21:30:31
01:54:20:WU02:FS01:0x22: Repository: Git
01:54:20:WU02:FS01:0x22:   Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
01:54:20:WU02:FS01:0x22:     Branch: HEAD
01:54:20:WU02:FS01:0x22:   Compiler: Visual C++ 2008
01:54:20:WU02:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
01:54:20:WU02:FS01:0x22:   Platform: win32 10
01:54:20:WU02:FS01:0x22:       Bits: 64
01:54:20:WU02:FS01:0x22:       Mode: Release
01:54:20:WU02:FS01:0x22:************************************ System ************************************
01:54:20:WU02:FS01:0x22:        CPU: AMD Ryzen 5 3600 6-Core Processor
01:54:20:WU02:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
01:54:20:WU02:FS01:0x22:       CPUs: 12
01:54:20:WU02:FS01:0x22:     Memory: 31.94GiB
01:54:20:WU02:FS01:0x22:Free Memory: 25.86GiB
01:54:20:WU02:FS01:0x22:    Threads: WINDOWS_THREADS
01:54:20:WU02:FS01:0x22: OS Version: 6.2
01:54:20:WU02:FS01:0x22:Has Battery: false
01:54:20:WU02:FS01:0x22: On Battery: false
01:54:20:WU02:FS01:0x22: UTC Offset: -7
01:54:20:WU02:FS01:0x22:        PID: 13684
01:54:20:WU02:FS01:0x22:        CWD: C:\Users\Josh\AppData\Roaming\FAHClient\work
01:54:20:WU02:FS01:0x22:         OS: Windows 10 Pro
01:54:20:WU02:FS01:0x22:    OS Arch: AMD64
01:54:20:WU02:FS01:0x22:********************************************************************************
01:54:20:WU02:FS01:0x22:Project: 11776 (Run 0, Clone 31304, Gen 7)
01:54:20:WU02:FS01:0x22:Unit: 0x0000000b287234c95e7931c2b282407f
01:54:20:WU02:FS01:0x22:Reading tar file core.xml
01:54:20:WU02:FS01:0x22:Reading tar file integrator.xml
01:54:20:WU02:FS01:0x22:Reading tar file state.xml
01:54:21:WU02:FS01:0x22:Reading tar file system.xml
01:54:21:WU02:FS01:0x22:Digital signatures verified
01:54:21:WU02:FS01:0x22:Folding@home GPU Core22 Folding@home Core
01:54:21:WU02:FS01:0x22:Version 0.0.2
01:54:37:WU02:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
01:54:37:WU02:FS01:0x22:Saving result file ..\logfile_01.txt
01:54:37:WU02:FS01:0x22:Saving result file science.log
01:54:37:WU02:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
01:54:37:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
01:54:37:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:11776 run:0 clone:31304 gen:7 core:0x22 unit:0x0000000b287234c95e7931c2b282407f
01:54:37:WU02:FS01:Uploading 8.00KiB to 40.114.52.201
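
The number in parentheses after clEnqueueNDRangeKernel is an OpenCL status code. In the standard OpenCL headers, -5 is CL_OUT_OF_RESOURCES, a fairly generic failure that does not by itself say what went wrong in the driver or the kernel. Here is a small sketch that decodes the code from a log line; the table lists only a few values from the Khronos `CL/cl.h` header, and the function name `describe_opencl_error` is made up.

```python
import re

# A few status codes from the Khronos OpenCL header (CL/cl.h); not exhaustive
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

# Matches e.g. "clEnqueueNDRangeKernel (-5)" as printed by the core
CALL_CODE_RE = re.compile(r"(cl\w+) \((-?\d+)\)")


def describe_opencl_error(line):
    """Return (call, code, name) for a log line, or None if no code is found."""
    match = CALL_CODE_RE.search(line)
    if not match:
        return None
    code = int(match.group(2))
    return match.group(1), code, OPENCL_ERRORS.get(code, "UNKNOWN")


# Line copied from the log above
line = (
    "01:54:37:WU02:FS01:0x22:ERROR:exception: Error invoking kernel "
    "sortShortList: clEnqueueNDRangeKernel (-5)"
)
result = describe_opencl_error(line)
print(result)
```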