
Re: Multiple WU's Fail to Upload 155.247.166.*

Posted: Tue Oct 15, 2019 4:00 am
by bruce
Hmmm. You're suggesting that when an upload hangs, your local clock stops running. I've never observed that on my systems.

Pay attention to your time-of-day clock and notice when it stops keeping time.

When was the last time you put a new CMOS battery in that computer?

Re: Multiple WU's Fail to Upload 155.247.166.*

Posted: Tue Oct 15, 2019 9:26 pm
by dfgirl12
I think it's the opposite: the FAH client falsely reports clock drift when the software is overloaded with failing or slow uploads.
I just searched the current logs on all the PCs, and there are no 'Detected clock skew' messages today.
I replaced about 50% of the CMOS batteries this year and the other 50% the year before, so that's probably not the problem. I have about 10 more replacement batteries if I need them.

Everything has been running smoothly since yesterday afternoon, even with a rate-limited, constant 50-100 KB/s torrent upload today (just to prove it's not my connection).

Re: Multiple WU's Fail to Upload 155.247.166.*

Posted: Tue Oct 15, 2019 11:30 pm
by bruce
Running a torrent will certainly reduce FAH's communications speed --- and perhaps that interference contributes to FAHClient's difficulties. The same is true if Windows Update is allowed to share its downloads with others (torrent-style sharing). I'm not suggesting that FAHClient is incompatible, but it does like to have a reasonable share of your bandwidth while a WU is uploading or downloading.

Re: Multiple WU's Fail to Upload 155.247.166.*

Posted: Wed Oct 16, 2019 12:58 am
by dfgirl12
I'll try to watch the PC clocks more closely the next time this happens, to check for clock skew on all of them.

I choose when to run updates on my PCs, so that is not a problem.
The WU upload backlogs have also happened when no torrent program was running, and the torrent program is rate-limited to download/upload speeds that normally keep everything running smoothly.

My best guess is that my Internet connection is not the cause of the WU upload backlogs; it just doesn't help clear the congestion.
So far, collection server reboots have been the only thing that really helps with the large WU congestion.

Re: Multiple WU's Fail to Upload 155.247.166.*

Posted: Wed Oct 16, 2019 6:02 am
by dfgirl12
I did finally find the open ticket, from 2013, asking for a setting to limit the number of uploads: https://github.com/FoldingAtHome/fah-issues/issues/1038. I'm not the first person with this issue. :)

This comment was interesting: https://github.com/FoldingAtHome/fah-is ... -294264890. If a connection timeout was never added to the WS code, that would explain why the servers need to be rebooted over time to clear the current file-transfer problems with the larger GPU WUs.
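
For illustration, here is a rough sketch of the idea behind that missing timeout. It is only my guess in Python, not the actual work server code, and the port, timeout value, and handler are assumptions: a per-connection receive timeout lets a stalled upload be dropped instead of holding its connection open until the next reboot.

Code: Select all

# Hypothetical sketch (not WS code): drop a connection when an upload stalls,
# instead of keeping it open indefinitely.
import socket

LISTEN_ADDR = ("0.0.0.0", 8080)   # assumption: same port the WS listens on
RECV_TIMEOUT = 120                # seconds with no data before dropping the peer

def handle_chunk(chunk: bytes) -> None:
    pass  # placeholder for whatever the real server does with WU data

def serve():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(LISTEN_ADDR)
    srv.listen(128)
    while True:
        conn, _peer = srv.accept()            # one connection at a time, sketch only
        conn.settimeout(RECV_TIMEOUT)         # the timeout the ticket asks about
        try:
            while True:
                chunk = conn.recv(65536)
                if not chunk:
                    break                     # client finished or closed cleanly
                handle_chunk(chunk)
        except socket.timeout:
            pass                              # stalled transfer: free the slot
        finally:
            conn.close()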

Re: Multiple WU's Fail to Upload 155.247.166.*

Posted: Fri Oct 18, 2019 9:52 pm
by dfgirl12
The WU upload failures and slowdowns are starting to happen again on 155.247.166.219. Both Temple servers need to be rebooted, since they are past 3 days of uptime. :)

Re: Multiple WU's Fail to Upload 155.247.166.*

Posted: Fri Oct 18, 2019 11:09 pm
by JohnJohn
Is anyone going to fix this permanently? This has been happening for about a month.

Re: Multiple WU's Fail to Upload 155.247.166.*

Posted: Sat Oct 19, 2019 1:22 am
by JimF
A stuck download happened to me for the first time; I had to reboot to fix it.

Code: Select all

00:45:26:WU01:FS01:Uploading 14.88MiB to 155.247.166.220
00:45:26:WU01:FS01:Connecting to 155.247.166.220:8080
00:45:32:WU01:FS01:Upload 35.70%
00:45:38:WU01:FS01:Upload 68.88%
00:45:43:WU00:FS01:Download 6.58%
00:45:44:WU01:FS01:Upload complete
00:45:44:WU01:FS01:Server responded WORK_ACK (400)
00:45:44:WU01:FS01:Final credit estimate, 162894.00 points
00:45:44:WU01:FS01:Cleaning up
00:45:49:WU00:FS01:Download 8.62%
00:45:55:WU00:FS01:Download 10.21%
01:11:13:Lost lifeline PID 1662, exiting
01:11:20:Caught signal SIGTERM(15) on PID 1665
01:11:20:ERROR:Receive error: 4: Interrupted system call
01:11:20:Exiting, please wait. . .
EDIT: Even stranger, when I rebooted, all the Rosettas that were running on the CPU had errored out. I am not sure how anything on the GPU could affect them.
But it would seem that when FAH gets stuck, it really gets stuck.

Re: Multiple WU's Fail to Upload 155.247.166.*

Posted: Sat Oct 19, 2019 8:55 am
by dfgirl12
It looks like all the downloads from 155.247.166.219 are dying (at 0-5% downloaded) and hanging FAHClient folding slots now. Killing FAH and its processes and restarting it all weekend long is going to be a hassle... Time to start running DarthMouse's Reboot FAH script for Linux again: viewtopic.php?f=96&t=30504#p301797 :)
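
The rough idea behind such a watchdog is something like the sketch below (my own minimal version, not DarthMouse's actual script; the log path and service name are assumptions for a typical Linux install): if the client log hasn't changed for a long time, restart the FAHClient service. It's a crude heuristic, since the folding cores also write to the log, but it conveys the idea.

Code: Select all

# Crude watchdog sketch (assumed log path and service name).
import subprocess
import time
from pathlib import Path

LOG_FILE = Path("/var/lib/fahclient/log.txt")   # assumed default log location
STALL_SECONDS = 30 * 60                          # no log activity for 30 minutes
CHECK_INTERVAL = 5 * 60                          # check every 5 minutes

def main():
    while True:
        if LOG_FILE.exists():
            idle = time.time() - LOG_FILE.stat().st_mtime
            if idle > STALL_SECONDS:
                # Restart the packaged client service (service name is an assumption).
                subprocess.run(["systemctl", "restart", "FAHClient"], check=False)
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()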

Re: Multiple WU's Fail to Upload 155.247.166.*

Posted: Sat Oct 19, 2019 1:09 pm
by JimF
It happened on a second Ubuntu machine as well, but the server did not freeze in the middle of a download; it just did not start one.
A reboot was required to fix it, but fortunately it did not cause a problem for the Rosetta work units running on BOINC.

Code: Select all

03:38:32:WU00:FS01:0x21:Completed 12250000 out of 12500000 steps (98%)
03:41:57:WU00:FS01:0x21:Completed 12375000 out of 12500000 steps (99%)
03:41:58:WU01:FS01:Connecting to 65.254.110.245:8080
03:41:58:WU01:FS01:Assigned to work server 155.247.166.219
03:41:58:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GP104 [GeForce GTX 1070] 6463 from 155.247.166.219
03:41:58:WU01:FS01:Connecting to 155.247.166.219:8080
03:41:58:WU01:FS01:Downloading 27.47MiB
03:42:04:WU01:FS01:Download 2.96%
03:42:19:WU01:FS01:Download 3.19%
03:42:37:WU01:FS01:Download 4.10%
03:42:47:WU01:FS01:Download 4.55%
03:42:56:WU01:FS01:Download 4.78%
03:43:02:WU01:FS01:Download 5.69%
03:43:17:WU01:FS01:Download 6.14%
03:43:36:WU01:FS01:Download 6.83%
03:43:42:WU01:FS01:Download 8.87%
03:43:49:WU01:FS01:Download 11.38%
03:45:22:WU00:FS01:0x21:Completed 12500000 out of 12500000 steps (100%)
03:45:24:WU00:FS01:0x21:Saving result file logfile_01.txt
03:45:24:WU00:FS01:0x21:Saving result file checkpointState.xml
03:45:24:WU00:FS01:0x21:Saving result file checkpt.crc
03:45:24:WU00:FS01:0x21:Saving result file log.txt
03:45:24:WU00:FS01:0x21:Saving result file positions.xtc
03:45:24:WU00:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
03:45:24:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
03:45:24:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:14180 run:2 clone:674 gen:28 core:0x21 unit:0x0000002e0002894c5d3b54882c3bef38
03:45:24:WU00:FS01:Uploading 14.75MiB to 155.247.166.220
03:45:24:WU00:FS01:Connecting to 155.247.166.220:8080
03:45:30:WU00:FS01:Upload 45.35%
03:45:36:WU00:FS01:Upload 87.31%
03:45:38:WU00:FS01:Upload complete
03:45:38:WU00:FS01:Server responded WORK_ACK (400)
03:45:38:WU00:FS01:Final credit estimate, 162318.00 points
03:45:38:WU00:FS01:Cleaning up
******************************* Date: 2019-10-19 *******************************
12:42:47:Lost lifeline PID 1424, exiting
12:42:49:Caught signal SIGTERM(15) on PID 1427
12:42:49:ERROR:Receive error: 4: Interrupted system call
12:42:49:Exiting, please wait. . .
I am beginning to wonder whether the server is sending out some errant signal that causes our FAHClients to hang up. They may not see a problem with the server on their end.
But the experts will have to chase that one down. All I can do is pull cards off of Folding. I am beginning to need the heat in my basement.

Re: Multiple WU's Fail to Upload 155.247.166.*

Posted: Sat Oct 19, 2019 2:37 pm
by MeeLee
I hope they'll have it fixed by wintertime!
It is getting annoying having to physically be there every 2 to 3 hours or so to force a restart. The procedure in my case takes about 15 minutes per server.

They should also fix fahclient to automatically seek out other sources (servers) if one isn't responding.
And FAH could do well by sending low-priority WUs to solid-performing PCs during server downtime.
That way the downtime isn't entirely wasted.
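
For what I mean, here's a rough illustration (not FAHClient code; the addresses and timeout are just examples from this thread): try each known server with a short connection timeout and fall back to the next one that answers.

Code: Select all

# Illustration only: pick the first server that accepts a connection.
import socket

CANDIDATE_SERVERS = [            # example addresses, for illustration
    ("155.247.166.219", 8080),
    ("155.247.166.220", 8080),
]
CONNECT_TIMEOUT = 30             # seconds before giving up on one server

def first_responding_server():
    for host, port in CANDIDATE_SERVERS:
        try:
            with socket.create_connection((host, port), timeout=CONNECT_TIMEOUT):
                return (host, port)   # this server at least accepts connections
        except OSError:
            continue                  # unreachable or too slow: try the next one
    return None                       # nobody answered; retry later with backoff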

Re: Multiple WU's Fail to Upload 155.247.166.*

Posted: Sat Oct 19, 2019 6:10 pm
by JimF
MeeLee wrote:They should also fix fahclient to automatically seek out other sources (servers) if one isn't responding.
And FAH could do well by sending low-priority WUs to solid-performing PCs during server downtime.
That way the downtime isn't entirely wasted.
They could probably just adopt the BOINC client and server. BOINC has a "zero resource share" option that more or less does the same thing as FAH, only more reliably.
A crisis is too good an opportunity to waste.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Sat Oct 19, 2019 9:07 pm
by Catalina588
I agree that:
1) Temple has been problematic since September 26th.
2) It's a huge labor sink for active folders, including those who have already posted in this thread.
3) The problem occurs on Linux as well as Windows.
4) The FAHClient problem/bug is that the client cannot, or does not, automatically recover.
5) The Temple servers appear to be dedicated to high-end work units; only my RTX 2080/Ti GPUs are affected, which makes the penalty even worse.

There does not appear to be any way for folders to alert the server operators. Hey, it's 2019 and there are lots of no-reply social media options. Furthermore, there's no apparent way to find out when things are right again. The Project Summary is useless as far as this thread is concerned; no help at all.

Here's a representative example of today's fiasco:

Code: Select all

 *********************** Log Started 2019-10-19T20:30:20Z ***********************
20:30:20:************************* Folding@home Client *************************
20:30:20:        Website: https://foldingathome.org/
20:30:20:      Copyright: (c) 2009-2018 foldingathome.org
20:30:20:         Author: Joseph Coffland <[email protected]>
20:30:20:           Args: --child --lifeline 1368 /etc/fahclient/config.xml --run-as
20:30:20:                 fahclient --pid-file=/var/run/fahclient.pid --daemon
20:30:20:         Config: /etc/fahclient/config.xml
20:30:20:******************************** Build ********************************
20:30:20:        Version: 7.5.1
20:30:20:           Date: May 11 2018
20:30:20:           Time: 19:59:04
20:30:20:     Repository: Git
20:30:20:       Revision: 4705bf53c635f88b8fe85af7675557e15d491ff0
20:30:20:         Branch: master
20:30:20:       Compiler: GNU 6.3.0 20170516
20:30:20:        Options: -std=gnu++98 -O3 -funroll-loops
20:30:20:       Platform: linux2 4.14.0-3-amd64
20:30:20:           Bits: 64
20:30:20:           Mode: Release
20:30:20:******************************* System ********************************
20:30:20:            CPU: Intel(R) Pentium(R) CPU G4620 @ 3.70GHz
20:30:20:         CPU ID: GenuineIntel Family 6 Model 158 Stepping 9
20:30:20:           CPUs: 4
20:30:20:         Memory: 7.75GiB
20:30:20:    Free Memory: 7.17GiB
20:30:20:        Threads: POSIX_THREADS
20:30:20:     OS Version: 4.15
20:30:20:    Has Battery: false
20:30:20:     On Battery: false
20:30:20:     UTC Offset: -4
20:30:20:            PID: 1370
20:30:20:            CWD: /var/lib/fahclient
20:30:20:             OS: Linux 4.15.0-47-generic x86_64
20:30:20:        OS Arch: AMD64
20:30:20:           GPUs: 2
20:30:20:          GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:8 TU104 [GeForce RTX 2080]
20:30:20:          GPU 1: Bus:2 Slot:0 Func:0 NVIDIA:8 TU102 [GeForce RTX 2080 Ti] M
20:30:20:                 13448
20:30:20:  CUDA Device 0: Platform:0 Device:0 Bus:2 Slot:0 Compute:7.5 Driver:10.2
20:30:20:  CUDA Device 1: Platform:0 Device:1 Bus:1 Slot:0 Compute:7.5 Driver:10.2
20:30:20:OpenCL Device 0: Platform:0 Device:0 Bus:2 Slot:0 Compute:1.2 Driver:430.14
20:30:20:OpenCL Device 1: Platform:0 Device:1 Bus:1 Slot:0 Compute:1.2 Driver:430.14
20:30:20:***********************************************************************
20:30:20:<config>
20:30:20:  <!-- HTTP Server -->
20:30:20:  <allow v='127.0.0.1 192.168.1.0/24'/>
20:30:20:
20:30:20:  <!-- Network -->
20:30:20:  <proxy v=':8080'/>
20:30:20:
20:30:20:  <!-- Remote Command Server -->
20:30:20:  <command-allow-no-pass v='127.0.0.1 192.168.1.0/24'/>
20:30:20:
20:30:20:  <!-- Slot Control -->
20:30:20:  <power v='full'/>
20:30:20:
20:30:20:  <!-- User Information -->
20:30:20:  <passkey v='********************************'/>
20:30:20:  <team v='224497'/>
20:30:20:  <user v='Catalina588_ALL_1EMQiByPxuaffjHVyb4RDLXChMkwgWmYUn'/>
20:30:20:
20:30:20:  <!-- Work Unit Control -->
20:30:20:  <next-unit-percentage v='100'/>
20:30:20:
20:30:20:  <!-- Folding Slots -->
20:30:20:  <slot id='1' type='GPU'/>
20:30:20:  <slot id='0' type='GPU'/>
20:30:20:</config>
20:30:20:Switching to user fahclient
20:30:20:Trying to access database...
20:30:20:Successfully acquired database lock
20:30:20:Enabled folding slot 01: READY gpu:0:TU104 [GeForce RTX 2080]
20:30:20:Enabled folding slot 00: READY gpu:1:TU102 [GeForce RTX 2080 Ti] M 13448
20:30:20:WU02:FS00:Starting
20:30:20:WU02:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/Linux/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 705 -lifeline 1370 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
20:30:20:WU02:FS00:Started FahCore on PID 1380
20:30:20:WU02:FS00:Core PID:1384
20:30:20:WU02:FS00:FahCore 0x21 started
20:30:20:WU00:FS01:Connecting to 65.254.110.245:8080
20:30:20:WU02:FS00:0x21:*********************** Log Started 2019-10-19T20:30:20Z ***********************
20:30:20:WU02:FS00:0x21:Project: 14191 (Run 9, Clone 14, Gen 37)
20:30:20:WU02:FS00:0x21:Unit: 0x000000310002894c5d5d741dcf57a569
20:30:20:WU02:FS00:0x21:CPU: 0x00000000000000000000000000000000
20:30:20:WU02:FS00:0x21:Machine: 0
20:30:20:WU02:FS00:0x21:Digital signatures verified
20:30:20:WU02:FS00:0x21:Folding@home GPU Core21 Folding@home Core
20:30:20:WU02:FS00:0x21:Version 0.0.20
20:30:20:WU02:FS00:0x21:  Found a checkpoint file
20:30:21:WU00:FS01:Assigned to work server 155.247.166.219
20:30:21:WU00:FS01:Requesting new work unit for slot 01: READY gpu:0:TU104 [GeForce RTX 2080] from 155.247.166.219
20:30:21:WU00:FS01:Connecting to 155.247.166.219:8080
20:30:22:WU00:FS01:Downloading 27.48MiB
20:30:23:WU02:FS00:0x21:Completed 13500000 out of 25000000 steps (54%)
20:30:23:WU02:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
20:31:28:WU00:FS01:Download 0.23%
20:31:47:WU00:FS01:Download 0.45%
20:32:12:WU02:FS00:0x21:Completed 13750000 out of 25000000 steps (55%)
20:34:02:WU02:FS00:0x21:Completed 14000000 out of 25000000 steps (56%)
20:35:53:WU02:FS00:0x21:Completed 14250000 out of 25000000 steps (57%)
20:37:44:WU02:FS00:0x21:Completed 14500000 out of 25000000 steps (58%)
20:39:35:WU02:FS00:0x21:Completed 14750000 out of 25000000 steps (59%)
20:41:25:WU02:FS00:0x21:Completed 15000000 out of 25000000 steps (60%)
20:43:16:WU02:FS00:0x21:Completed 15250000 out of 25000000 steps (61%)
20:45:07:WU02:FS00:0x21:Completed 15500000 out of 25000000 steps (62%)
20:46:57:WU02:FS00:0x21:Completed 15750000 out of 25000000 steps (63%)
20:48:48:WU02:FS00:0x21:Completed 16000000 out of 25000000 steps (64%)
20:50:39:WU02:FS00:0x21:Completed 16250000 out of 25000000 steps (65%)
20:52:30:WU02:FS00:0x21:Completed 16500000 out of 25000000 steps (66%)
 

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Sat Oct 19, 2019 9:17 pm
by HaloJones
This is affecting multiple clients, and they're my fastest water-cooled dedicated rigs. Please get this sorted or turn the servers off.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Sat Oct 19, 2019 9:17 pm
by bruce
Here's what I THINK I know:

* Something is getting overloaded at temple.edu (communications get very slow).
* Several things have been tried, including rebooting their WS/CS (and possibly the campus router). This drops all connections so all clients can try connecting again (at least until the next overload occurs).
* FAHClient does not recover that connection, nor does it recognize that it needs to drop that connection and restart it. Restarting FAHClient is required to do that.

Obviously you can't predict whether the server-side problem has been cleared when you restart your FAHClient. Dumping WUs rarely helps, unless enough people do it concurrently to reduce the server overload.

All WUs downloaded from a WS must eventually be returned to that WS. If your results happen to go to a CS (Collection Server), they will be timestamped as returned and eventually forwarded to the WS (Work Server), so that's a good thing. I suspect that communications between the CS and the WS can recover from failures more effectively than your local client can.
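
For illustration only (a sketch with assumed values, not FAHClient internals), the recovery step described as missing would amount to something like the following: abort a transfer when no bytes have moved for a while, so the WU can be retried or sent elsewhere instead of the client hanging until someone restarts it.

Code: Select all

# Sketch of client-side stall detection for an upload (assumptions throughout).
import socket

IDLE_LIMIT = 300          # seconds with no progress before aborting the transfer

def upload_with_stall_detection(host: str, port: int, payload: bytes) -> bool:
    try:
        with socket.create_connection((host, port), timeout=IDLE_LIMIT) as conn:
            conn.settimeout(IDLE_LIMIT)       # applies to every send() below
            sent = 0
            while sent < len(payload):
                sent += conn.send(payload[sent:sent + 65536])
            return True                       # caller can then wait for WORK_ACK
    except OSError:                           # includes socket.timeout in Python 3
        return False                          # stalled or refused: retry or fall back later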