It seems that a lot of GPU problems revolve around specific versions of drivers. Though AMD has their own support structure, you can often learn from information reported by others who fold.
mwroggenbuck wrote:I may be dense, but if this is a driver problem, why do my other tasks (and games) work? This seems to be specific to Folding At Home.
Just a thought. I am just trying to problem solve and not point fingers. I do not want to hurt anyone's feelings.
That's a good question. The fact is that the error "Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)" is a known AMD driver bug.
In fact, the bug doesn't always occur. If you look at the 3rd column of https://apps.foldingathome.org/psummary, you'll notice that the proteins that FAH analyzes typically have a lot of atoms -- and that number varies a lot depending on which protein is being analyzed. (I can't compare that with what happens in a game. I don't know that much about code generated by the game industry.)
The error message ending in (-5) means that the sortShortList kernel ran out of resources (in OpenCL, -5 is CL_OUT_OF_RESOURCES). Somewhere in the driver there is a process that's doing some sorting, and the driver SHOULD know how much memory is available to perform the required sort, but if the driver gives the code the wrong size for the available memory, it will fail. The peculiar thing about this error is that proteins of certain sizes cause it while both larger and smaller proteins can be processed.
I don't think your game is doing the same kind of analysis so I'm not surprised that the driver bug surfaces here while not in your game.
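For anyone who wants to decode the number in parentheses themselves, here is a minimal sketch. The numeric values are the standard status codes from the Khronos OpenCL header (only a small subset is listed); the helper name `decode_cl_error` is just an illustration and is not part of FAH or of any OpenCL library.

```python
import re

# Subset of the OpenCL status codes defined in the Khronos CL/cl.h header.
OPENCL_STATUS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

# Matches an OpenCL API name followed by a status code in parentheses,
# e.g. "clEnqueueNDRangeKernel (-5)".
_CL_CALL_RE = re.compile(r"\b(cl\w+)\s*\((-?\d+)\)")


def decode_cl_error(message):
    """Return (api_call, code, name) for a core error message, or None.

    Hypothetical helper for illustration only. Codes that are not in the
    table above are reported as "UNKNOWN".
    """
    match = _CL_CALL_RE.search(message)
    if match is None:
        return None
    code = int(match.group(2))
    return match.group(1), code, OPENCL_STATUS.get(code, "UNKNOWN")


if __name__ == "__main__":
    msg = "Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)"
    print(decode_cl_error(msg))
```

Note that CL_OUT_OF_RESOURCES only says that the device-side call could not get what it asked for; it does not by itself say whether the driver or the application is at fault.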
User    Team   CPUID             Credit  Assigned             Returned             Credited             Days   Code
Jp      0      7E978B5E72C12135  320.21  2020-04-06 20:57:50  2020-04-06 21:35:05  2020-04-06 21:27:47  0.021  Faulty 2
kwthom  35780  AA4C715E10F25523  27.44   2020-04-06 21:27:54  2020-04-06 21:35:08  2020-04-06 21:30:28  0.002  Faulty 2
Hi Jp (team 0), Your WU (P11776 R0 C14981 G19) was added to the stats database on Tue, 07 Apr 2020 04:27:47 GMT for 320.21 points of credit.
Hi kwthom (team 35780), Your WU (P11776 R0 C14981 G19) was added to the stats database on Tue, 07 Apr 2020 04:30:28 GMT for 27.44 points of credit.
If my interpretation is correct, we were both assigned this WU - and we both had issues.
kwthom wrote:...If my interpretation is correct, we were both assigned this WU - and we both had issues.
Is this a legit bad WU?
There should be 2 additional copies of the WU floating around. If they too come back bad and the number of failures reaches a specified threshold, then the server will automatically block the WU from being assigned again and it will be treated as a bad WU.
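The blocking idea described above can be pictured with a small sketch. This is not the actual server code: the class name, the counting rule (distinct users per WU), and the threshold value of 3 are all assumptions made purely for illustration, since the real threshold is not stated in this thread.

```python
from collections import defaultdict


class FaultTracker:
    """Toy model of a server that stops assigning a WU after repeated faults.

    Hypothetical: the real FAH work server logic and its threshold are not
    described in this thread.
    """

    def __init__(self, threshold=3):  # assumed value, for illustration only
        self.threshold = threshold
        # Maps a WU identifier to the set of users who returned it as faulty.
        self._faulty_by_wu = defaultdict(set)

    def report_faulty(self, wu_id, user):
        """Record one faulty return and say whether the WU is now blocked."""
        self._faulty_by_wu[wu_id].add(user)
        return self.is_blocked(wu_id)

    def is_blocked(self, wu_id):
        """A WU is blocked once enough distinct users have failed on it."""
        return len(self._faulty_by_wu.get(wu_id, ())) >= self.threshold


if __name__ == "__main__":
    tracker = FaultTracker(threshold=3)
    wu = "P11776 R0 C14981 G19"
    for user in ("Jp", "kwthom", "third_user"):  # third_user is made up
        print(user, tracker.report_faulty(wu, user))
```

Counting distinct users rather than raw returns is a design choice in this sketch, so that one machine failing repeatedly on the same WU cannot block it alone.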
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time
03:01:53:WU03:FS01:Connecting to 65.254.110.245:8080
03:01:53:WARNING:WU03:FS01:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
03:01:53:WU03:FS01:Connecting to 18.218.241.186:80
03:01:54:WARNING:WU03:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
03:01:54:ERROR:WU03:FS01:Exception: Could not get an assignment
03:03:30:WU03:FS01:Connecting to 65.254.110.245:8080
03:03:31:WU03:FS01:Assigned to work server 128.252.203.10
03:03:31:WU03:FS01:Requesting new work unit for slot 01: READY gpu:0:Ellesmere XT [Radeon RX 470/480/570/580] from 128.252.203.10
03:03:31:WU03:FS01:Connecting to 128.252.203.10:8080
03:03:52:WARNING:WU03:FS01:WorkServer connection failed on port 8080 trying 80
03:03:52:WU03:FS01:Connecting to 128.252.203.10:80
03:04:48:WU03:FS01:Downloading 86.24MiB
03:04:54:WU03:FS01:Download 7.18%
03:05:00:WU03:FS01:Download 15.58%
03:05:06:WU03:FS01:Download 21.45%
03:05:12:WU03:FS01:Download 29.79%
03:05:18:WU03:FS01:Download 42.69%
03:05:24:WU03:FS01:Download 54.07%
03:05:30:WU03:FS01:Download 61.31%
03:05:36:WU03:FS01:Download 67.98%
03:05:42:WU03:FS01:Download 75.37%
03:05:48:WU03:FS01:Download 84.72%
03:05:54:WU03:FS01:Download 96.17%
03:05:55:WU03:FS01:Download complete
03:05:55:WU03:FS01:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:11764 run:0 clone:5592 gen:25 core:0x22 unit:0x0000003280fccb0a5e71130ac303f133
03:05:55:WU03:FS01:Starting
03:05:55:WU03:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\kwtho\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 03 -suffix 01 -version 705 -lifeline 1028 -checkpoint 20 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
03:05:55:WU03:FS01:Started FahCore on PID 4560
03:05:55:WU03:FS01:Core PID:11136
03:05:55:WU03:FS01:FahCore 0x22 started
03:05:56:WU03:FS01:0x22:*********************** Log Started 2020-04-07T03:05:55Z ***********************
03:05:56:WU03:FS01:0x22:*************************** Core22 Folding@home Core ***************************
03:05:56:WU03:FS01:0x22: Type: 0x22
03:05:56:WU03:FS01:0x22: Core: Core22
03:05:56:WU03:FS01:0x22: Website: https://foldingathome.org/
03:05:56:WU03:FS01:0x22: Copyright: (c) 2009-2018 foldingathome.org
03:05:56:WU03:FS01:0x22: Author: John Chodera <[email protected]> and Rafal Wiewiora
03:05:56:WU03:FS01:0x22: <[email protected]>
03:05:56:WU03:FS01:0x22: Args: -dir 03 -suffix 01 -version 705 -lifeline 4560 -checkpoint 20
03:05:56:WU03:FS01:0x22: -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
03:05:56:WU03:FS01:0x22: Config: <none>
03:05:56:WU03:FS01:0x22:************************************ Build *************************************
03:05:56:WU03:FS01:0x22: Version: 0.0.2
03:05:56:WU03:FS01:0x22: Date: Dec 6 2019
03:05:56:WU03:FS01:0x22: Time: 21:30:31
03:05:56:WU03:FS01:0x22: Repository: Git
03:05:56:WU03:FS01:0x22: Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
03:05:56:WU03:FS01:0x22: Branch: HEAD
03:05:56:WU03:FS01:0x22: Compiler: Visual C++ 2008
03:05:56:WU03:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
03:05:56:WU03:FS01:0x22: Platform: win32 10
03:05:56:WU03:FS01:0x22: Bits: 64
03:05:56:WU03:FS01:0x22: Mode: Release
03:05:56:WU03:FS01:0x22:************************************ System ************************************
03:05:56:WU03:FS01:0x22: CPU: Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz
03:05:56:WU03:FS01:0x22: CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
03:05:56:WU03:FS01:0x22: CPUs: 6
03:05:56:WU03:FS01:0x22: Memory: 15.93GiB
03:05:56:WU03:FS01:0x22:Free Memory: 11.63GiB
03:05:56:WU03:FS01:0x22: Threads: WINDOWS_THREADS
03:05:56:WU03:FS01:0x22: OS Version: 6.2
03:05:56:WU03:FS01:0x22:Has Battery: false
03:05:56:WU03:FS01:0x22: On Battery: false
03:05:56:WU03:FS01:0x22: UTC Offset: -7
03:05:56:WU03:FS01:0x22: PID: 11136
03:05:56:WU03:FS01:0x22: CWD: C:\Users\kwtho\AppData\Roaming\FAHClient\work
03:05:56:WU03:FS01:0x22: OS: Windows 10 Pro
03:05:56:WU03:FS01:0x22: OS Arch: AMD64
03:05:56:WU03:FS01:0x22:********************************************************************************
03:05:56:WU03:FS01:0x22:Project: 11764 (Run 0, Clone 5592, Gen 25)
03:05:56:WU03:FS01:0x22:Unit: 0x0000003280fccb0a5e71130ac303f133
03:05:56:WU03:FS01:0x22:Reading tar file core.xml
03:05:56:WU03:FS01:0x22:Reading tar file integrator.xml
03:05:56:WU03:FS01:0x22:Reading tar file state.xml
03:05:56:WU03:FS01:0x22:Reading tar file system.xml
03:05:57:WU03:FS01:0x22:Digital signatures verified
03:05:57:WU03:FS01:0x22:Folding@home GPU Core22 Folding@home Core
03:05:57:WU03:FS01:0x22:Version 0.0.2
03:06:10:WU03:FS01:0x22:ERROR:exception: Error invoking kernel sortShortList: clEnqueueNDRangeKernel (-5)
03:06:10:WU03:FS01:0x22:Saving result file ..\logfile_01.txt
03:06:10:WU03:FS01:0x22:Saving result file science.log
03:06:10:WU03:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
03:06:11:WARNING:WU03:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
03:06:11:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:11764 run:0 clone:5592 gen:25 core:0x22 unit:0x0000003280fccb0a5e71130ac303f133
03:06:11:WU03:FS01:Uploading 8.00KiB to 128.252.203.10
03:06:11:WU03:FS01:Connecting to 128.252.203.10:8080
03:06:11:WU01:FS01:Connecting to 65.254.110.245:8080
03:06:11:WARNING:WU01:FS01:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
03:06:11:WU01:FS01:Connecting to 18.218.241.186:80
03:06:12:WARNING:WU01:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
03:06:12:ERROR:WU01:FS01:Exception: Could not get an assignment
03:06:12:WU01:FS01:Connecting to 65.254.110.245:8080
03:06:12:WARNING:WU01:FS01:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
03:06:12:WU01:FS01:Connecting to 18.218.241.186:80
03:06:12:WARNING:WU01:FS01:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
03:06:12:ERROR:WU01:FS01:Exception: Could not get an assignment
03:07:04:WU03:FS01:Upload complete
03:07:04:WU03:FS01:Server responded WORK_ACK (400)
03:07:04:WU03:FS01:Cleaning up
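When a log is this long, the two lines that matter (the core's ERROR line and the "FahCore returned" line) are easy to miss. Below is a minimal sketch that pulls them out. The regular expressions are inferred only from the lines in the log above, so other client versions may format things differently, and the function name `find_failures` is hypothetical.

```python
import re

# "HH:MM:SS:[LEVEL:]WUnn:FSnn:message", as seen in the log above.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):"
    r"(?:(?P<level>WARNING|ERROR):)?"
    r"(?P<wu>WU\d+):(?P<slot>FS\d+):(?P<msg>.*)$"
)
# Core output is prefixed with the core type, e.g. "0x22:ERROR:...".
CORE_ERROR_RE = re.compile(r"^0x[0-9a-fA-F]+:ERROR:(?P<text>.*)$")
# e.g. "FahCore returned: BAD_WORK_UNIT (114 = 0x72)".
RETURNED_RE = re.compile(
    r"^FahCore returned: (?P<status>\w+) \((?P<code>\d+) = 0x[0-9a-fA-F]+\)$"
)


def find_failures(log_text):
    """Return a list of dicts describing core errors and flagged core exits."""
    failures = []
    for raw in log_text.splitlines():
        line = LINE_RE.match(raw.strip())
        if line is None:
            continue  # not a per-slot log line
        base = {k: line.group(k) for k in ("time", "wu", "slot")}
        msg = line.group("msg")
        core_error = CORE_ERROR_RE.match(msg)
        returned = RETURNED_RE.match(msg)
        if core_error:
            failures.append(
                dict(base, kind="core_error", detail=core_error.group("text"))
            )
        elif returned and line.group("level"):
            # Only exits that the client itself logged as WARNING or ERROR.
            failures.append(
                dict(
                    base,
                    kind="core_exit",
                    status=returned.group("status"),
                    code=int(returned.group("code")),
                )
            )
    return failures


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for item in find_failures(handle.read()):
            print(item)
```

Run it with the path to a saved copy of the log as the only argument. Assignment failures such as "Could not get an assignment" are deliberately ignored, since they are server-side and unrelated to the GPU error.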
This is a known AMD OpenCL bug. We're waiting either for AMD to accept responsibility for the problem and fix the drivers, or for OpenMM to figure out how to work around it. The OpenCL code is unable to allocate the right amount of memory to sortShortList for certain proteins. The same WU running on an NVIDIA GPU or on an AMD Navi GPU does not encounter this problem.
Unfortunately, my log file is gone (I exited the software after the error), but the job ran to 14%, then it crashed my Radeon control software. The Radeon software restarted, and the FAH log file showed nothing during all this time. When the percent complete bar (and numbers) in the advanced control reached 15%, that progress was not written to the log window.
This is exactly how it worked before. It would look like it was running, but not log progress to the log screen (although the percent complete bar would increment). I know that if I let this go, it would eventually give up.
If it is any help, even though the percent complete bar shows progress, the GPU is not utilized or drawing any significant power. It is like the control software thinks things are fine when nothing is running.
Please post the segment of your log for the WU that you say appeared to get to 15%, from where it was downloaded and started up to the point that it crashed. (See my Sig to find your log or use FAHControl after clicking Refresh.)
Unfortunately, I have removed the FAH software. I was going to try again in a few weeks. I went back to World Community Grid and Einstein at home.
When I looked at the log file, I saw no indication of an error for that particular work unit. My only clue is that my Radeon control software (the icon in the task bar) was restarted after the screen hung for a short period of time. Then the log file would just stop showing anything for that GPU slot (the CPU slot was just fine). The web page GUI and advanced control status would increment, but the log file would not.
I can't be more help at this time. I will continue to watch this thread.
There is one more piece of information that perhaps is relevant. The first work unit my system downloaded for the GPU did cause a sortShortList error that was detected immediately. It happened even before it appeared to start processing the work unit. The log file showed a bad work unit error, the client uploaded some log file, and then it immediately obtained another unit. That unit worked until 14% as I described above.
Perhaps the GPU was still in a bad state from the first unit? Nothing was restarted between the two.
No log file or anything. I uninstalled the program (by the way, your uninstall is excellent. Most programs leave a few files or empty directories around; yours did not).
If I do try again, should I increase the log level? I seem to remember there was a setting for this in the advanced controls. If yes, what level should I use? (I think the default was 3.)
FYI
I happen to be a chemist and a software engineer (retired). I greatly respect what this project is trying to do and I wish you all the best of luck. Like I said, I will try again in the future, probably when AMD publishes new drivers, but I know it is hard to get all hardware configurations to work. I have a home-built system, and something could be just different enough to cause problems.
mwroggenbuck wrote:...If I do try again, should I increase the log level? I seem to remember there was a setting for this in the advanced controls. If yes, what level should I use (I think the default was 3)...
Please leave the default log level of 3. Setting it any higher will actually hinder us since it will produce information that isn't needed to troubleshoot this issue.