"Hung download" bug still present in v7.6.13

Posted: Sun May 17, 2020 5:04 pm
by kc2lrc
FAH Team,

I've found that the "stuck download" bug is still present in v7.6.13, in which a stalled download will never reset, locking up the slot until FAHClient is forcibly restarted. I understand this is a known issue going back several releases.

Unfortunately, with the 9 CPU and 12 GPU slots I maintain, I run into this problem quite often. In Linux, pausing and unpausing the slot does not help, and the client becomes quite difficult to terminate during this - systemctl cannot do it, and it takes several attempts of 'killall FAHClient' before it yields. In Windows, right-clicking on the taskbar icon and selecting Quit causes the icon to disappear, but then one must go into Task Manager and forcibly close the process.

If there is still client development work planned, I'd appreciate it if a fix for this made it in. Notwithstanding my lack of white-box knowledge of the client, from the outside I think it would be fairly easy to do by hooking into whatever code in the client produces the "Download xx%" messages in the log - whenever the issue comes up, these messages stop being printed. Perhaps a watchdog could be implemented that closes the TCP connection and restarts the work unit acquisition process if more than a minute or so elapses between those messages.
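To make the idea concrete, here is a minimal sketch in Python. It is not FAHClient's actual code, which I haven't seen. `StallWatchdog`, `TransferStalled`, `download_with_watchdog`, the `read_chunk` callable and the 60-second threshold are all hypothetical names and values picked for illustration.

```python
import time


class StallWatchdog:
    """Remembers when progress was last reported and flags a stall."""

    def __init__(self, timeout_s=60.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_progress = clock()

    def progress(self):
        # Call this wherever a "Download xx%" message would be logged.
        self.last_progress = self.clock()

    def stalled(self):
        return self.clock() - self.last_progress > self.timeout_s


class TransferStalled(Exception):
    """Raised when no data has arrived within the watchdog timeout."""


def download_with_watchdog(read_chunk, total_bytes, watchdog, poll_s=0.0):
    """Read until total_bytes have arrived; raise TransferStalled on a stall.

    read_chunk is a hypothetical non-blocking reader: it returns bytes,
    or b"" when nothing has arrived yet.
    """
    received = bytearray()
    while len(received) < total_bytes:
        chunk = read_chunk()
        if chunk:
            received.extend(chunk)
            watchdog.progress()
        elif watchdog.stalled():
            # The caller would close the TCP connection here and start
            # the work unit request over.
            percent = 100 * len(received) / total_bytes
            raise TransferStalled(
                f"no data for over {watchdog.timeout_s}s at {percent:.2f}%")
        else:
            time.sleep(poll_s)
    return bytes(received)
```

With a plain blocking socket, much the same effect comes from calling `settimeout(60)` on it, so that a `recv` that gets no data within that window raises a timeout exception instead of waiting forever.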

Whatever happens, I will still be online to support your work. I realize you are dealing with a lot right now, and am impressed with the issues you've been able to tackle so far.

Cheers-
Sam (aka k2cc_amateur_radio_kc2lrc)

Re: "Hung download" bug still present in v7.6.13

Posted: Sun May 17, 2020 6:56 pm
by Joe_H
There is already a sort of "watchdog" implemented; most of the time the client does detect a stalled or hung download within 15-30 minutes and then retries it. The same applies to uploads, as the bug shows up for both at times.

It is a long-standing issue. What I can say is that the code does a much better job of detecting and retrying a stalled connection than it did prior to version 7.5.

As for client development work, I don't know what current plans are. I do know that some volunteer developers are working on moving the FAHControl portion to Python 3. As to the other components, I have no idea for the short term. Long term, there were plans in place before COVID-19 for a major rewrite, but that was put on hold because of COVID-19.

Re: "Hung download" bug still present in v7.6.13

Posted: Mon May 18, 2020 2:55 am
by kc2lrc
Interesting! I do think it has improved since v7.4.4, but I got the issue 5 times on one system last week, and twice on another. Perhaps it's a factor of the first system having a faster GPU and turning over more work units, but I'm not sure. I'll keep an eye on it, and post logs if it happens again.

Cheers -
Sam

Re: "Hung download" bug still present in v7.6.13

Posted: Mon May 18, 2020 3:00 am
by PantherX
Welcome to the F@H Forum kc2lrc,

Generally speaking, a faster system will likely encounter the issue more often than a slower one: it makes more network connections, so it has a higher probability of hitting a network-related issue.
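To put rough numbers on that: if each download hangs independently with some small probability p, the chance of at least one hang in n downloads is 1 - (1 - p)^n. The sketch below works this out in Python. The 1% per-download rate is made up for illustration and is not a measured figure, and `p_at_least_one_hang` is just a name chosen here.

```python
def p_at_least_one_hang(p_per_download, n_downloads):
    """Chance of one or more hangs in n independent downloads."""
    return 1.0 - (1.0 - p_per_download) ** n_downloads


# Hypothetical per-download hang rate of 1%; not a measured figure.
p = 0.01
for downloads_per_week in (10, 50):
    chance = p_at_least_one_hang(p, downloads_per_week)
    print(downloads_per_week, round(chance, 3))
```

Under that assumed rate, a slot that fetches 10 work units a week sees at least one hang roughly 10% of the time, while one that fetches 50 sees one roughly 40% of the time.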

Re: "Hung download" bug still present in v7.6.13

Posted: Tue May 19, 2020 2:52 am
by kc2lrc
Makes sense - the slot it usually affects is a GTX 1080, which turns over more work units per day than any of my other GPU slots. I've had 5 lockups on that one, and 2 on a GTX 970 slot, so I figured I'd rattle the cages a bit. On the other hand, there have been no lockups in the past few days, so maybe I jinxed it in the right direction by posting this? No problem, if so!

I'll post here about how it's going if it becomes an issue again.

Cheers -
Sam

Re: "Hung download" bug still present in v7.6.13

Posted: Tue May 19, 2020 6:52 pm
by Brad_C
Hello
I also came across this today. Just one CPU slot and one GPU slot; the GPU slot hung on download, stuck at 19.72%. That was over 2 hours ago, so any watchdog doesn't seem to be working. The concurrent upload of the completed WU took three tries, so perhaps the download is less able to recover from network glitches than the upload.

Code:

......
16:17:40:WU00:FS01:0x22:Completed 980000 out of 1000000 steps (98%)
16:21:03:WU00:FS01:0x22:Completed 990000 out of 1000000 steps (99%)
16:21:04:WU01:FS01:Connecting to assign1.foldingathome.org:80
16:21:05:WU01:FS01:Assigned to work server 140.163.4.241
16:21:05:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GP106 [GeForce GTX 1060 6GB] 4372 from 140.163.4.241
16:21:05:WU01:FS01:Connecting to 140.163.4.241:8080
16:21:27:WU01:FS01:Downloading 7.92MiB
16:21:35:WU01:FS01:Download 19.72%
16:24:24:WU00:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
16:24:31:WU00:FS01:0x22:Saving result file ..\logfile_01.txt
16:24:31:WU00:FS01:0x22:Saving result file checkpointState.xml
16:24:37:WU00:FS01:0x22:Saving result file checkpt.crc
16:24:37:WU00:FS01:0x22:Saving result file positions.xtc
16:24:40:WU00:FS01:0x22:Saving result file science.log
16:24:40:WU00:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
16:24:41:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
16:24:41:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11752 run:0 clone:5118 gen:39 core:0x22 unit:0x0000004d8ca304e75e6bbd9f646093eb
16:24:41:WU00:FS01:Uploading 24.34MiB to 140.163.4.231
16:24:41:WU00:FS01:Connecting to 140.163.4.231:8080
16:24:55:WU00:FS01:Upload 0.77%
16:24:55:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
16:24:55:WU00:FS01:Trying to send results to collection server
16:24:55:WU00:FS01:Uploading 24.34MiB to 52.224.109.74
16:24:55:WU00:FS01:Connecting to 52.224.109.74:8080
16:25:21:WU00:FS01:Upload 1.80%
16:25:29:WU00:FS01:Upload 3.34%
16:25:35:WU00:FS01:Upload 10.53%
16:25:41:WU00:FS01:Upload 13.61%
16:25:49:WU00:FS01:Upload 13.87%
16:26:45:WU00:FS01:Upload 19.00%
16:27:06:WU00:FS01:Upload 20.54%
16:27:06:ERROR:WU00:FS01:Exception: Transfer failed
16:27:06:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11752 run:0 clone:5118 gen:39 core:0x22 unit:0x0000004d8ca304e75e6bbd9f646093eb
16:27:06:WU00:FS01:Uploading 24.34MiB to 140.163.4.231
16:27:06:WU00:FS01:Connecting to 140.163.4.231:8080
16:27:12:WU00:FS01:Upload 0.51%
16:27:33:WU00:FS01:Upload 0.77%
16:27:39:WU00:FS01:Upload 1.54%
16:27:47:WU00:FS01:Upload 3.08%
[ .. snip .. ]
16:29:37:WU00:FS01:Upload 92.45%
16:29:53:WU00:FS01:Upload complete
16:29:53:WU00:FS01:Server responded WORK_ACK (400)
16:29:53:WU00:FS01:Final credit estimate, 83862.00 points
16:29:53:WU00:FS01:Cleaning up
[ no more entries ]
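A log like the one above can be checked for this pattern automatically. What follows is a rough sketch and not part of FAHClient: `find_stalled_downloads` is a hypothetical helper that assumes the line format shown in this log, and assumes a successful download ends with a `Download complete` line in the same way the upload above ends with `Upload complete`. The 1800-second default is only a guess based on the 15-30 minute figure mentioned earlier in the thread.

```python
import re

# Matches "HH:MM:SS:WUxx:FSxx:Download 19.72%" and "...:Download complete".
# "Downloading 7.92MiB" does not match because no space follows "Download".
LINE_RE = re.compile(
    r"^(\d\d):(\d\d):(\d\d):(WU\d+):(FS\d+):Download (complete|[\d.]+%)")


def find_stalled_downloads(log_lines, now_hms, max_idle_s=1800):
    """Return {(wu, slot): (last_percent, idle_seconds)} for stalled downloads.

    now_hms is the current time on the log's clock as "HH:MM:SS".
    Log timestamps carry no date, so this sketch assumes that every
    line and now_hms fall on the same day.
    """
    def seconds(h, m, s):
        return int(h) * 3600 + int(m) * 60 + int(s)

    now = seconds(*now_hms.split(":"))
    pending = {}
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        h, m, s, wu, slot, status = match.groups()
        if status == "complete":
            pending.pop((wu, slot), None)
        else:
            pending[(wu, slot)] = (status, seconds(h, m, s))
    return {key: (percent, now - t)
            for key, (percent, t) in pending.items()
            if now - t > max_idle_s}
```

Fed the lines above with a current time of `"18:30:00"`, it would report WU01 on FS01 as stuck at 19.72%.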

Re: "Hung download" bug still present in v7.6.13

Posted: Mon May 25, 2020 3:24 pm
by kc2lrc
Got a hang today - as of now (15:19Z) the watchdog has not reset this download. The system is Windows 10 64-bit with dual GTX 960s. I had to use Task Manager to kill FAHClient to resolve this.

Code:

******************************* Date: 2020-05-25 *******************************
03:07:43:WU00:FS02:0x22:Completed 470000 out of 500000 steps (94%)
03:14:44:WU00:FS02:0x22:Completed 475000 out of 500000 steps (95%)
03:21:46:WU00:FS02:0x22:Completed 480000 out of 500000 steps (96%)
03:29:09:WU00:FS02:0x22:Completed 485000 out of 500000 steps (97%)
03:36:09:WU00:FS02:0x22:Completed 490000 out of 500000 steps (98%)
03:43:11:WU00:FS02:0x22:Completed 495000 out of 500000 steps (99%)
03:43:12:WU02:FS02:Connecting to assign1.foldingathome.org:80
03:43:12:WU02:FS02:Assigned to work server 155.247.166.220
03:43:12:WU02:FS02:Requesting new work unit for slot 02: RUNNING gpu:1:GM206 [GeForce GTX 960] 2308 from 155.247.166.220
03:43:12:WU02:FS02:Connecting to 155.247.166.220:8080
03:43:12:WU02:FS02:Downloading 5.12MiB
03:43:20:WU02:FS02:Download 4.89%
03:43:43:WU02:FS02:Download 6.11%
03:44:00:WU02:FS02:Download 7.33%
03:48:07:WU02:FS02:Download 10.99%
03:50:13:WU00:FS02:0x22:Completed 500000 out of 500000 steps (100%)
03:50:33:WU00:FS02:0x22:Saving result file ..\logfile_01.txt
03:50:33:WU00:FS02:0x22:Saving result file checkpointState.xml
03:50:40:WU00:FS02:0x22:Saving result file checkpt.crc
03:50:40:WU00:FS02:0x22:Saving result file positions.xtc
03:50:41:WU00:FS02:0x22:Saving result file science.log
03:50:41:WU00:FS02:0x22:Folding@home Core Shutdown: FINISHED_UNIT
03:50:42:WU00:FS02:FahCore returned: FINISHED_UNIT (100 = 0x64)
03:50:43:WU00:FS02:Sending unit results: id:00 state:SEND error:NO_ERROR project:14201 run:591 clone:1 gen:18 core:0x22 unit:0x00000018cedfaa925eb99bb986cc3f13
03:50:43:WU00:FS02:Uploading 40.91MiB to 206.223.170.146
03:50:43:WU00:FS02:Connecting to 206.223.170.146:8080
03:50:49:WU00:FS02:Upload 24.75%
03:50:55:WU00:FS02:Upload 50.57%
03:51:01:WU00:FS02:Upload 76.23%
03:51:07:WU00:FS02:Upload complete
03:51:07:WU00:FS02:Server responded WORK_ACK (400)
03:51:07:WU00:FS02:Final credit estimate, 96418.00 points
03:51:07:WU00:FS02:Cleaning up
******************************* Date: 2020-05-25 *******************************
******************************* Date: 2020-05-25 *******************************
System info:

Code:

*********************** Log Started 2020-05-15T22:36:37Z ***********************
22:36:37:Trying to access database...
22:36:37:Successfully acquired database lock
22:36:37:Read GPUs.txt
22:36:37:Enabled folding slot 00: PAUSED cpu:6 (by user)
22:36:37:Enabled folding slot 01: PAUSED gpu:0:GM206 [GeForce GTX 960] 2308 (by user)
22:36:37:Enabled folding slot 02: PAUSED gpu:1:GM206 [GeForce GTX 960] 2308 (by user)
22:36:37:****************************** FAHClient ******************************
22:36:37:        Version: 7.6.13
22:36:37:         Author: Joseph Coffland <[email protected]>
22:36:37:      Copyright: 2020 foldingathome.org
22:36:37:       Homepage: https://foldingathome.org/
22:36:37:           Date: Apr 27 2020
22:36:37:           Time: 21:21:01
22:36:37:       Revision: 5a652817f46116b6e135503af97f18e094414e3b
22:36:37:         Branch: master
22:36:37:       Compiler: Visual C++ 2008
22:36:37:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
22:36:37:       Platform: win32 10
22:36:37:           Bits: 32
22:36:37:           Mode: Release
22:36:37:         Config: G:\Installed Programs\FAHData\config.xml
22:36:37:******************************** CBang ********************************
22:36:37:           Date: Apr 24 2020
22:36:37:           Time: 17:07:55
22:36:37:       Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
22:36:37:         Branch: master
22:36:37:       Compiler: Visual C++ 2008
22:36:37:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
22:36:37:       Platform: win32 10
22:36:37:           Bits: 32
22:36:37:           Mode: Release
22:36:37:******************************* System ********************************
22:36:37:            CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
22:36:37:         CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
22:36:37:           CPUs: 8
22:36:37:         Memory: 15.91GiB
22:36:37:    Free Memory: 12.71GiB
22:36:37:        Threads: WINDOWS_THREADS
22:36:37:     OS Version: 6.2
22:36:37:    Has Battery: false
22:36:37:     On Battery: false
22:36:37:     UTC Offset: -4
22:36:37:            PID: 109492
22:36:37:            CWD: G:\Installed Programs\FAHData
22:36:37:  Win32 Service: false
22:36:37:             OS: Windows 10 Enterprise
22:36:37:        OS Arch: AMD64
22:36:37:           GPUs: 2
22:36:37:          GPU 0: Bus:2 Slot:0 Func:0 NVIDIA:5 GM206 [GeForce GTX 960] 2308
22:36:37:          GPU 1: Bus:1 Slot:0 Func:0 NVIDIA:5 GM206 [GeForce GTX 960] 2308
22:36:37:  CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:5.2 Driver:10.2
22:36:37:  CUDA Device 1: Platform:0 Device:1 Bus:2 Slot:0 Compute:5.2 Driver:10.2
22:36:37:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:441.87
22:36:37:OpenCL Device 1: Platform:0 Device:1 Bus:2 Slot:0 Compute:1.2 Driver:441.87
22:36:37:******************************* libFAH ********************************
22:36:37:           Date: Apr 15 2020
22:36:37:           Time: 14:53:14
22:36:37:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
22:36:37:         Branch: master
22:36:37:       Compiler: Visual C++ 2008
22:36:37:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
22:36:37:       Platform: win32 10
22:36:37:           Bits: 32
22:36:37:           Mode: Release
22:36:37:***********************************************************************
Cheers -
Sam

Re: "Hung download" bug still present in v7.6.13

Posted: Tue May 26, 2020 1:19 am
by PantherX
Thanks for the reports. I have updated the issue with links here: https://github.com/FoldingAtHome/fah-issues/issues/983

Re: "Hung download" bug still present in v7.6.13

Posted: Thu May 28, 2020 12:32 pm
by G3WGV
Just a +1 for this problem. It's happened a couple of times recently on different PCs, so I don't think it's a system problem at this end. The only solution I have found is to terminate the client in task manager. Just closing the client and restarting it doesn't work: although the client task bar icon goes away when the client is closed, the restarted client doesn't work (no connection from web control, nothing in advanced control). At that point I terminate the remnants of the client in task manager and then I can restart it OK.

Re: "Hung download" bug still present in v7.6.13

Posted: Thu May 28, 2020 2:16 pm
by Neil-B
You might want to try downloading TCPView and killing the established connection. This resolved a similar issue without any need to stop, pause, or otherwise touch the client.

Re: "Hung download" bug still present in v7.6.13

Posted: Thu May 28, 2020 2:30 pm
by G3WGV
That's an interesting idea Neil. Easy to do, so I will try to remember next time I get a download hang.

John

Re: "Hung download" bug still present in v7.6.13

Posted: Sun May 31, 2020 8:38 am
by TheFreshPrince
It's a bit annoying to wake up and find out my 2080ti has been idle for 8 hours because of a download that got stuck again.
Happens too often, have to babysit the client...

Really hope this gets fixed.

Re: "Hung download" bug still present in v7.6.13

Posted: Tue Jun 16, 2020 9:10 pm
by Kilrah
Same here today, still stuck after 4 hours. Really needs a timeout...

To avoid issues killing the client just delete the slot and recreate it instead. The stuck download will stay in the queue but the new slot works normally.

Re: "Hung download" bug still present in v7.6.13

Posted: Wed Jun 17, 2020 12:31 am
by Joe_H
Kilrah wrote: Same here today, still stuck after 4 hours. Really needs a timeout...

To avoid issues killing the client just delete the slot and recreate it instead. The stuck download will stay in the queue but the new slot works normally.

I can only recommend that as a temporary fix; reboot or restart the FAHClient process fairly soon afterwards. Prior experience is that a stuck download or upload connected with this bug will eventually cause other problems.