
Bad Work Unit - need good stress test?

Posted: Sat May 23, 2020 9:23 pm
by KA1J
I've run into another bad work unit seen below at 16:33. The first bad work unit/crash was, I was told, picked up by another cruncher and was completed successfully.

I have tried to stress test the GPU, and every attempt has passed. I tried two tests. The first was FAHBench using OpenCL, single precision, WU DHFR, NaN check disabled. It ran for 30 minutes with no drama.
Results: Score 179.037 & 23558 atoms.

I ran Unigine Heaven with the below results:

[Image: Unigine Heaven benchmark results]

I do find that the GPU runs a few degrees C cooler under benchmarks than under F@H. Since the benchmarks run without trouble and report no errors, I'm stumped. Is there a test that challenges the GPU as much as F@H does and that will tell me if any errors are coming up?

Searching this forum, I found a suggested fix for this error: Configure > Slots > GPU > Edit, then change OpenCL-index from -1 to 1. I'm asking here before I make that change.
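
Before changing that setting, it may help to confirm what the slot is currently set to. Here is a minimal Python sketch that reads the GPU slots out of a FAHClient-style config. The function name `gpu_slot_indices` and the sample string are illustrative assumptions, not part of F@H; the sample is just a small made-up config in the shape FAHClient echoes into its log at startup.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of FAHClient's config.xml.
SAMPLE_CONFIG = """<config>
  <slot id='0' type='GPU'>
    <opencl-index v='0'/>
  </slot>
</config>"""


def gpu_slot_indices(config_xml):
    """Map each GPU slot id to its opencl-index value.

    The value is None when the slot has no <opencl-index> element,
    i.e. the setting is not written in the file at all.
    """
    root = ET.fromstring(config_xml)
    result = {}
    for slot in root.iter("slot"):
        if slot.get("type") != "GPU":
            continue  # skip CPU slots
        node = slot.find("opencl-index")
        result[slot.get("id")] = node.get("v") if node is not None else None
    return result


if __name__ == "__main__":
    print(gpu_slot_indices(SAMPLE_CONFIG))
```

To check a real install, read the text of your own config.xml and pass it to the function in place of the sample string.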

Also, is there a better F@H-compatible stress test?

Thanks

Code:

*********************** Log Started 2020-05-23T15:37:48Z ***********************
16:18:43:WARNING:WU01:FS00:Failed to get assignment from 'assign1.foldingathome.org:80': No WUs available for this configuration
16:18:43:WARNING:WU01:FS00:Failed to get assignment from 'assign2.foldingathome.org:80': No WUs available for this configuration
16:18:43:WARNING:WU01:FS00:Failed to get assignment from 'assign3.foldingathome.org:80': No WUs available for this configuration
16:20:41:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
16:21:02:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.133.76.19:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
16:21:23:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
16:21:44:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 3.133.76.19:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
16:22:24:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
16:23:48:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
16:33:22:WU01:FS00:0x22:ERROR:exception: clWaitForEvents
16:33:26:WARNING:WU01:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
16:33:27:WARNING:WU00:FS00:Failed to get assignment from 'assign1.foldingathome.org:80': No WUs available for this configuration
16:33:27:WARNING:WU00:FS00:Failed to get assignment from 'assign2.foldingathome.org:80': No WUs available for this configuration
16:33:27:WARNING:WU00:FS00:Failed to get assignment from 'assign3.foldingathome.org:80': No WUs available for this configuration
16:33:27:WARNING:WU00:FS00:Failed to get assignment from 'assign4.foldingathome.org:80': No WUs available for this configuration
16:33:27:ERROR:WU00:FS00:Exception: Could not get an assignment
16:33:27:WARNING:WU00:FS00:Failed to get assignment from 'assign1.foldingathome.org:80': No WUs available for this configuration
16:33:47:WARNING:WU01:FS00:WorkServer connection failed on port 8080 trying 80
16:34:08:WARNING:WU01:FS00:Exception: Failed to send results to work server: Failed to connect to 3.133.76.19:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
16:34:23:WARNING:WU01:FS00:Exception: Failed to send results to work server: Transfer failed
16:35:29:WARNING:WU01:FS00:WorkServer connection failed on port 8080 trying 80
16:35:50:WARNING:WU01:FS00:Exception: Failed to send results to work server: Failed to connect to 3.133.76.19:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
16:36:19:WU00:FS00:0x22:ERROR:exception: clWaitForEvents
16:36:20:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
16:36:20:WARNING:WU02:FS00:Failed to get assignment from 'assign1.foldingathome.org:80': No WUs available for this configuration
16:36:21:WARNING:WU02:FS00:Failed to get assignment from 'assign2.foldingathome.org:80': No WUs available for this configuration
16:36:21:WARNING:WU02:FS00:Failed to get assignment from 'assign3.foldingathome.org:80': No WUs available for this configuration
16:36:21:WARNING:WU02:FS00:Failed to get assignment from 'assign4.foldingathome.org:80': No WUs available for this configuration
16:36:21:ERROR:WU02:FS00:Exception: Could not get an assignment
16:36:21:WARNING:WU02:FS00:Failed to get assignment from 'assign1.foldingathome.org:80': No WUs available for this configuration
16:36:22:WARNING:WU02:FS00:Failed to get assignment from 'assign2.foldingathome.org:80': No WUs available for this configuration
16:36:22:WARNING:WU02:FS00:Failed to get assignment from 'assign3.foldingathome.org:80': No WUs available for this configuration
16:36:22:WARNING:WU02:FS00:Failed to get assignment from 'assign4.foldingathome.org:80': No WUs available for this configuration
16:36:22:ERROR:WU02:FS00:Exception: Could not get an assignment
16:36:41:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
16:37:06:WARNING:WU01:FS00:WorkServer connection failed on port 8080 trying 80
16:37:21:WARNING:WU02:FS00:Failed to get assignment from 'assign1.foldingathome.org:80': No WUs available for this configuration
16:37:21:WARNING:WU02:FS00:Failed to get assignment from 'assign2.foldingathome.org:80': No WUs available for this configuration
16:37:22:WARNING:WU02:FS00:Failed to get assignment from 'assign3.foldingathome.org:80': No WUs available for this configuration
16:37:22:WARNING:WU02:FS00:Failed to get assignment from 'assign4.foldingathome.org:80': No WUs available for this configuration
16:37:22:ERROR:WU02:FS00:Exception: Could not get an assignment
16:37:28:WARNING:WU01:FS00:Exception: Failed to send results to work server: Failed to connect to 3.133.76.19:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
16:38:58:WARNING:WU02:FS00:Failed to get assignment from 'assign1.foldingathome.org:80': No WUs available for this configuration
16:38:59:WARNING:WU02:FS00:Failed to get assignment from 'assign2.foldingathome.org:80': No WUs available for this configuration
16:38:59:WARNING:WU02:FS00:Failed to get assignment from 'assign3.foldingathome.org:80': No WUs available for this configuration
16:38:59:WARNING:WU02:FS00:Failed to get assignment from 'assign4.foldingathome.org:80': No WUs available for this configuration
16:38:59:ERROR:WU02:FS00:Exception: Could not get an assignment

Re: Bad Work Unit - need good stress test?

Posted: Sat May 23, 2020 11:59 pm
by PantherX
Please note that we need to see the entire section of the log file just before the failure, during the failure and after the failure. Unfortunately, this line doesn't tell me what WU it was:
16:36:20:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
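
As a rough illustration of the kind of excerpt that helps, here is a minimal Python sketch that pulls the lines around each BAD_WORK_UNIT entry and keeps only those tagged with the same WU and slot. The function name `failure_context` and the window sizes are assumptions chosen for the example, not anything built into F@H.

```python
import re

# Matches lines such as:
# 16:36:20:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
FAILURE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:WARNING:(WU\d+):(FS\d+):FahCore returned: BAD_WORK_UNIT"
)


def failure_context(lines, before=200, after=20):
    """Yield (wu, slot, excerpt) for every BAD_WORK_UNIT line.

    The excerpt holds the lines within the window around the failure
    that carry the same WUxx:FSxx tag, so interleaved messages from
    other work units are left out.
    """
    lines = [ln.rstrip("\n") for ln in lines]
    for i, line in enumerate(lines):
        match = FAILURE.match(line)
        if not match:
            continue
        wu, slot = match.groups()
        tag = f":{wu}:{slot}:"
        window = lines[max(0, i - before): i + after + 1]
        yield wu, slot, [ln for ln in window if tag in ln]


if __name__ == "__main__":
    sample = [
        "16:36:19:WU00:FS00:0x22:ERROR:exception: clWaitForEvents",
        "16:36:20:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    ]
    for wu, slot, excerpt in failure_context(sample):
        print(wu, slot)
        print("\n".join(excerpt))
```

On a full log, pass the open file object as `lines`. The excerpt should then include the earlier lines for that WU that name the project, run, clone and gen, provided they fall inside the `before` window.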

Generally speaking, most stress-test and benchmarking applications are less stressful than F@H. They focus on the rendering side of the GPU, while folding focuses on the compute side. Hence, it is tough to measure the stability of a GPU using third-party applications. The current version of FAHBench doesn't support FahCore_22, but there are plans to get to it once a new version of FahCore_22 is released (no ETA).

BTW, can you also please post the first 100 lines of the log file which will contain the details of your system and the client configuration?

Re: Bad Work Unit - need good stress test?

Posted: Sun May 24, 2020 12:44 am
by KA1J
That's all that's left of the log from when the bad work unit appeared; I can't post any more from that one. F@H isn't delivering WUs, so I just closed down F@H and restarted. Here's the log.

Code:

*********************** Log Started 2020-05-24T00:32:06Z ***********************
00:32:06:Trying to access database...
00:32:06:Successfully acquired database lock
00:32:06:Read GPUs.txt
00:32:07:Enabled folding slot 00: READY gpu:0:TU102 [GeForce RTX 2080 Ti Rev. A] M 13448
00:32:07:****************************** FAHClient ******************************
00:32:07:        Version: 7.6.13
00:32:07:         Author: Joseph Coffland <[email protected]>
00:32:07:      Copyright: 2020 foldingathome.org
00:32:07:       Homepage: https://foldingathome.org/
00:32:07:           Date: Apr 27 2020
00:32:07:           Time: 21:21:01
00:32:07:       Revision: 5a652817f46116b6e135503af97f18e094414e3b
00:32:07:         Branch: master
00:32:07:       Compiler: Visual C++ 2008
00:32:07:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
00:32:07:       Platform: win32 10
00:32:07:           Bits: 32
00:32:07:           Mode: Release
00:32:07:           Args: --open-web-control
00:32:07:         Config: C:\Users\Zuul\AppData\Roaming\FAHClient\config.xml
00:32:07:******************************** CBang ********************************
00:32:07:           Date: Apr 24 2020
00:32:07:           Time: 17:07:55
00:32:07:       Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
00:32:07:         Branch: master
00:32:07:       Compiler: Visual C++ 2008
00:32:07:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
00:32:07:       Platform: win32 10
00:32:07:           Bits: 32
00:32:07:           Mode: Release
00:32:07:******************************* System ********************************
00:32:07:            CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
00:32:07:         CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
00:32:07:           CPUs: 12
00:32:07:         Memory: 31.92GiB
00:32:07:    Free Memory: 17.89GiB
00:32:07:        Threads: WINDOWS_THREADS
00:32:07:     OS Version: 6.2
00:32:07:    Has Battery: false
00:32:07:     On Battery: false
00:32:07:     UTC Offset: -4
00:32:07:            PID: 20836
00:32:07:            CWD: C:\Users\Zuul\AppData\Roaming\FAHClient
00:32:07:  Win32 Service: false
00:32:07:             OS: Windows 10 Enterprise
00:32:07:        OS Arch: AMD64
00:32:07:           GPUs: 1
00:32:07:          GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:8 TU102 [GeForce RTX 2080 Ti Rev. A]
00:32:07:                 M 13448
00:32:07:  CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:7.5 Driver:11.0
00:32:07:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:445.87
00:32:07:******************************* libFAH ********************************
00:32:07:           Date: Apr 15 2020
00:32:07:           Time: 14:53:14
00:32:07:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
00:32:07:         Branch: master
00:32:07:       Compiler: Visual C++ 2008
00:32:07:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
00:32:07:       Platform: win32 10
00:32:07:           Bits: 32
00:32:07:           Mode: Release
00:32:07:***********************************************************************
00:32:07:<config>
00:32:07:  <!-- Folding Core -->
00:32:07:  <checkpoint v='3'/>
00:32:07:  <core-priority v='low'/>
00:32:07:
00:32:07:  <!-- Network -->
00:32:07:  <proxy v=':8080'/>
00:32:07:
00:32:07:  <!-- Slot Control -->
00:32:07:  <power v='FULL'/>
00:32:07:
00:32:07:  <!-- User Information -->
00:32:07:  <passkey v='*****'/>
00:32:07:  <team v='246763'/>
00:32:07:  <user v='KA1J'/>
00:32:07:
00:32:07:  <!-- Folding Slots -->
00:32:07:  <slot id='0' type='GPU'>
00:32:07:    <opencl-index v='0'/>
00:32:07:  </slot>
00:32:07:</config>
00:32:07:WU00:FS00:Connecting to assign1.foldingathome.org:80
00:32:07:WARNING:WU00:FS00:Failed to get assignment from 'assign1.foldingathome.org:80': No WUs available for this configuration
00:32:07:WU00:FS00:Connecting to assign2.foldingathome.org:80
00:32:08:WARNING:WU00:FS00:Failed to get assignment from 'assign2.foldingathome.org:80': No WUs available for this configuration
00:32:08:WU00:FS00:Connecting to assign3.foldingathome.org:80
00:32:08:WARNING:WU00:FS00:Failed to get assignment from 'assign3.foldingathome.org:80': No WUs available for this configuration
00:32:08:WU00:FS00:Connecting to assign4.foldingathome.org:80
00:32:08:WU00:FS00:Assigned to work server 3.133.76.19
00:32:08:WU00:FS00:Requesting new work unit for slot 00: READY gpu:0:TU102 [GeForce RTX 2080 Ti Rev. A] M 13448 from 3.133.76.19
00:32:08:WU00:FS00:Connecting to 3.133.76.19:8080
00:32:08:3:127.0.0.1:New Web session
00:32:29:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
00:32:29:WU00:FS00:Connecting to 3.133.76.19:80

It's unfortunate there's nothing easily available to properly challenge the computational end of the GPU as regards F@H. Thanks for the info.

Re: Bad Work Unit - need good stress test?

Posted: Sun May 24, 2020 1:57 am
by bruce
FAH officially does not support overclocking. Maybe that's just another way of saying the FAHCore for GPUs is the de facto benchmark for GPUs :D ... and all those other benchmarks are less stressful than FAH, so don't depend on them for Stream Computing. (Yes, benchmarking video frame rates is something different.)

If you're looking for other WUs that reported errors, you can find a number of previous log files in the "logs" subdirectory of FAH's data directory.
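
A quick way to sift through those older logs, sketched in Python. This assumes the archived logs are plain-text files ending in .txt; the function name `count_bad_work_units` and the `pattern` default are my own choices for the example, so adjust them to whatever is actually in your logs folder.

```python
from collections import Counter
from pathlib import Path

MARKER = "FahCore returned: BAD_WORK_UNIT"


def count_bad_work_units(log_dir, pattern="*.txt"):
    """Return a Counter of {log file name: number of BAD_WORK_UNIT lines}.

    Files without any such line are left out of the result.
    """
    counts = Counter()
    for path in sorted(Path(log_dir).glob(pattern)):
        with path.open(encoding="utf-8", errors="replace") as fh:
            hits = sum(MARKER in line for line in fh)
        if hits:
            counts[path.name] = hits
    return counts
```

Call it with the path to the "logs" folder inside your own data directory and print the result. Any file it names is worth opening to see which WU failed and what happened just before.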

Re: Bad Work Unit - need good stress test?

Posted: Sun May 24, 2020 2:25 am
by KA1J
Heh. Short of making a Macrium backup to a new drive, restoring that clone, and then rerunning F@H on the same WU under the same operating conditions (with internet access off), F@H would be a tough one to use as a repeatable benchmark. :)

For me, and I'm sure for you and most of us, this is important for science and especially for defeating Covid, which is killing thousands worldwide every day. I want to complete the maximum number of WUs possible, and I don't like corrupting WUs if my system is the cause. If my system has good integrity and the WU crashes, so be it; that's beyond my control.

Since I read of so many at F@H who overclock to get more WUs done, it seems a test for fine-tuning a GPU against the current core would be a proper quality-control tool.

Re: Bad Work Unit - need good stress test?

Posted: Sun May 24, 2020 2:45 am
by bruce
If FAH could block assignments to overclocked machines, they'd probably do it. Unfortunately when a machine asks for a new assignment, it doesn't say "...and I'm not overclocked" so we have to give them an assignment which may run without error on a cool day and may fail on a hot day (or whatever). FAH's design goal is to make use of ALL of the available resources you choose to donate ... and that has to be based on hardware that meets the original hardware design goals -- no more, no less.

On the other hand, when new code is being developed, beta testing is supposed to find all software QC problems. Unfortunately, the beta testing team doesn't happen to have representatives of all OSs, all GPUs and all other client variations, so it's difficult to define how long beta testing needs to run to catch all the potential issues.

Because of the urgency of the MoonShot project, that "supposed to" was relaxed because there were many, many clients that were not well represented during beta testing. To find those other issues, a few projects like 134xx temporarily slipped through the "beta testing only" firewall.

Sorry.