Multiple WUs Fail to Download/Upload to 155.247.166.*

Moderators: Site Moderators, FAHC Science Team

gordonbb
Posts: 511
Joined: Mon May 21, 2018 4:12 pm
Hardware configuration: Ubuntu 22.04.2 LTS; NVidia 525.60.11; 2 x 4070ti; 4070; 4060ti; 3x 3080; 3070ti; 3070
Location: Great White North

Re: Really slow WU downloads; failed download.

Post by gordonbb »

Glad to know it's not just me. We're ramping up for a Month-Long competition on my Team, and on multiple occasions over the last 3 weeks or so I've noticed downloads that normally finish within a couple of seconds on my 50Mbps VDSL taking from 5 minutes to over half an hour. I've collected some logs when they happen but I'm at work now and don't have access to them.

I've also had systems that had been stable for months have their folding slots lock up and need a restart.

Now I'm seeing a few others on my Team also experiencing the same issues.

I have rebooted my home router after 78 days of uptime to fix a DNS issue that started yesterday evening.

I'll ask my Team members to collect logs when their systems are having issues and submit them with any I collect.
vvoelz
Pande Group Member
Posts: 552
Joined: Sun Dec 02, 2007 8:07 pm
Location: Temple University, Philadelphia PA

Re: Multiple WU's Fail to Upload

Post by vvoelz »

Hi all --

The misbehaving servers (vav3 155.247.166.219 and vav4 155.247.166.220) are from our lab. We're trying to diagnose the problem; it might be network problems on campus at Temple. In the meantime we're turning down the assignment weights.
--Vince
DocJonz
Posts: 244
Joined: Thu Dec 06, 2007 6:31 pm
Hardware configuration: Folding with: 4x RTX 4070Ti, 1x RTX 4080 Super
Location: United Kingdom

Re: New Work Units Fail to Complete Download

Post by DocJonz »

Glad it's not just me having this problem :D
It has been an issue I have noticed since Tuesday. As an example:
On returning from work today, two GPU slots on one machine had stopped mid 'Downloading' the next WUs, and as a result had sat at 'Ready' all day. This has happened across all five of my Folding rigs this week at different times.
Has something changed at the Stanford end?
Could pulling down multiple large WUs at the same time be affecting the server (as it is not bandwidth at my end)?
Last edited by DocJonz on Thu Sep 26, 2019 8:01 pm, edited 1 time in total.
Folding Stats (HFM.NET): DocJonz Folding Farm Stats
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: New Work Units Fail to Complete Download

Post by bruce »

The network admins at temple.edu have been notified.
DutchForce
Posts: 60
Joined: Sun Sep 08, 2013 12:43 pm
Location: Netherlands

Re: New Work Units Fail to Complete Download

Post by DutchForce »

I had the same problem of slow downloads from server 155.247.166.220 (P14180 WUs), so I've looked back through my logs and the slow downloads started for me on September 21 (they went from the usual less than 1 minute to more than 6 - 9 minutes). But a few hours ago I downloaded two P14180 WUs in less than 30 seconds (at 15:43 and 15:55 UTC log time).
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: New Work Units Fail to Complete Download

Post by bruce »

The servers 155.247.166.219 and 155.247.166.220 both are on the same campus and communicate through the same network routers.
dfgirl12
Posts: 38
Joined: Fri Aug 21, 2009 8:34 am

Re: Multiple WU's Fail to Upload

Post by dfgirl12 »

Thanks Vince! :)
gordonbb
Posts: 511
Joined: Mon May 21, 2018 4:12 pm
Hardware configuration: Ubuntu 22.04.2 LTS; NVidia 525.60.11; 2 x 4070ti; 4070; 4060ti; 3x 3080; 3070ti; 3070
Location: Great White North

Re: Multiple WU's Fail to Upload

Post by gordonbb »

dfgirl12 wrote:Thanks Vince! :)

--
I'm starting to experience a different, but related problem of WUs taking longer to upload than it takes a 2080Ti / 1080Ti to complete them. It could just be my cable internet connection, but currently having 37+ concurrent 50MB-100MB uploads is choking it to death (fast or slow servers: once the connection is clogged, then none of them get uploaded for hours or days). My uploads are 4+ WU deep on some machines, and getting worse today. Are there any plans for the larger WU sizes and not enough return data channel capacity to handle it (like longer WU run times)? Any suggestions, other than turning off half (~20) of the GPUs, or getting a fiber connection (not really available yet)?

Try running a Speedtest on your internet connection to see what your actual upload speed is, and if it's not what your service is supposed to provide, contact the ISP. They should be able to take a look at your cable modem remotely, and if there don't appear to be any line issues it might be a hardware issue with your cable modem or the hardware it's connected to upstream.

On an otherwise unloaded connection, a 50MB file should transfer in 400s on a 1Mbps upload path and 40s on a 10Mbps path.
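Those two figures come from converting megabytes to megabits and dividing by the link speed in megabits per second. A minimal sketch of the arithmetic, assuming decimal units (1 MB = 8 Mb) and no protocol overhead; the function name `transfer_seconds` is just made up for the illustration:

```python
def transfer_seconds(size_megabytes: float, link_megabits_per_s: float) -> float:
    """Ideal transfer time for a file on an otherwise idle link.

    Assumes decimal units (1 MB = 8 Mb) and ignores protocol overhead,
    so real transfers will take somewhat longer.
    """
    if link_megabits_per_s <= 0:
        raise ValueError("link speed must be positive")
    size_megabits = size_megabytes * 8  # bytes -> bits
    return size_megabits / link_megabits_per_s


if __name__ == "__main__":
    for speed in (1, 10):
        print(f"50 MB at {speed} Mbps: {transfer_seconds(50, speed):.0f} s")
```

Plugging in the upload speed a Speedtest actually reports gives a quick sanity check on how long a single WU result should take to return.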

I have 6 GPUs folding, 6 doing Einstein at home and 50 CPU threads on World Community Grid, and have no issues on a 50/10 vDSL connection; my average upload speed over 7 days is 270kbps. Some cable connections are limited to 1Mbps upload but most these days are 10-20Mbps.

Often these types of issues will magically disappear with a reboot of the Cable Modem. Another gotcha can be an MTU mismatch, but that is rarer these days as most modern TCP/IP stacks implement Path MTU Discovery, which scales the MTU to match the smallest observed along the path between end-points.
dfgirl12
Posts: 38
Joined: Fri Aug 21, 2009 8:34 am

Re: Multiple WU's Fail to Upload

Post by dfgirl12 »

I moved my issue to a different topic, here: viewtopic.php?f=74&t=31882

I think that, with the other GPU WU servers having stopped assigning, the other couple of servers are overloaded now and slower for uploads??? https://apps.foldingathome.org/serverstats

I would have to kill the 50 WU transfers in progress to do a speed test. I know I can typically upload 100-300 KB/s with my connection. But, all it takes is a few slow FAH uploads and it starts backing up when the WUs take ~30 minutes to fold.
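Whether a line can keep up comes down to comparing the sustained upload rate the finished WUs need with what the connection delivers. A rough sketch, assuming every GPU returns one result per fold time and all results are the same size. The function names are made up, and the example numbers (40 GPUs, 75 MB results, 30-minute WUs, 300 KB/s uplink) are illustrative picks from the ranges mentioned in this thread, not measurements:

```python
def required_upload_kb_per_s(gpus: int, result_mb: float, fold_minutes: float) -> float:
    """Sustained upload rate (KB/s) needed so finished WUs never queue up.

    Assumes each GPU returns one result of result_mb megabytes every
    fold_minutes minutes, using decimal units (1 MB = 1000 KB).
    """
    if gpus < 0 or result_mb < 0 or fold_minutes <= 0:
        raise ValueError("gpus and result_mb must be >= 0, fold_minutes > 0")
    return gpus * result_mb * 1000 / (fold_minutes * 60)


def backlog_growth_mb_per_hour(required_kb_s: float, available_kb_s: float) -> float:
    """How fast the upload queue grows (MB per hour); 0.0 if the line keeps up."""
    deficit_kb_s = max(0.0, required_kb_s - available_kb_s)
    return deficit_kb_s * 3600 / 1000


if __name__ == "__main__":
    # Hypothetical farm: all values are assumptions for illustration only.
    needed = required_upload_kb_per_s(gpus=40, result_mb=75, fold_minutes=30)
    growth = backlog_growth_mb_per_hour(needed, available_kb_s=300)
    print(f"needed: {needed:.0f} KB/s, backlog grows by {growth:.0f} MB/hour")
```

If the needed rate is above what the uplink sustains, the queue can only grow, no matter how fast the servers accept data; if it is below, a backlog points at the far end or the path in between.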
gordonbb
Posts: 511
Joined: Mon May 21, 2018 4:12 pm
Hardware configuration: Ubuntu 22.04.2 LTS; NVidia 525.60.11; 2 x 4070ti; 4070; 4060ti; 3x 3080; 3070ti; 3070
Location: Great White North

Re: Multiple WU's Fail to Upload

Post by gordonbb »

Stalled uploads from 155.247.166.220 are still occurring but are much less frequent (I've only seen one today across 6 slots), so the adjustment of the assignment weight has helped.
dfgirl12
Posts: 38
Joined: Fri Aug 21, 2009 8:34 am

Re: Multiple WU's Fail to Upload

Post by dfgirl12 »

With the vav3/vav4 servers available again, the congestion I was seeing to the other server is being alleviated. I didn't reboot my cable modem or make any network settings changes. The completed WUs are not piling up and failing to upload like they were yesterday or this morning. Thanks! :)
HaloJones
Posts: 906
Joined: Thu Jul 24, 2008 10:16 am

Re: New Work Units Fail to Complete Download

Post by HaloJones »

Can they not be turned off for now? They're not just stopping their own units from being worked on but killing clients, stopping them from working on any units at all!
single 1070

bollix47
Posts: 2963
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: New Work Units Fail to Complete Download

Post by bollix47 »

@HaloJones
Could you please describe your current situation? I've had no problem since Temple resolved their comm problems on Thursday.

If you or others are still experiencing problems downloading and/or uploading I suggest a Pause for all slots, reboot and then start them up again by right-clicking on each slot and selecting Fold ... do one at a time and wait for that slot to clear if it was stuck. i.e. wait for any communications to complete before starting up another slot.

One other problem I had during this last week was that somehow these problems messed with my download and upload speeds. I went to speedtest.net and ran their test and both speeds were much lower than they should have been. Both were corrected with a router reboot.
snapshot
Posts: 132
Joined: Thu Apr 09, 2009 7:25 pm
Location: Wiltshire, UK

Re: New Work Units Fail to Complete Download

Post by snapshot »

I'm seeing downloads hang on one out of two PCs. The one I started yesterday after a week's break has been fine (so far). The one I started this morning hung instantly but got a WU after I restarted the FAHClient task. It then hung at about 60% downloading the next WU once the first had finished. Again, restarting the FAHClient task let a download succeed. We'll see what happens with the current WU in about 75 minutes....

I'm running just one GPU slot on each PC. Logs available if you think they'll add anything.
snapshot
Posts: 132
Joined: Thu Apr 09, 2009 7:25 pm
Location: Wiltshire, UK

Re: New Work Units Fail to Complete Download

Post by snapshot »

And the next time that client tried to download a WU it again failed but at least it did actually fail rather than hang. I'm not sure if it was then assigned to the same work server or not but it successfully downloaded a WU. The failed downloads are characterised by being very slow but there's obviously no local problem as the 'good' ones download very quickly.

Code:

18:52:11:WU01:FS01:Connecting to 65.254.110.245:8080
18:52:11:WU01:FS01:Assigned to work server 155.247.166.220
18:52:11:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:TU116 [GeForce GTX 1660 Ti] from 155.247.166.220
18:52:11:WU01:FS01:Connecting to 155.247.166.220:8080
18:52:12:WU01:FS01:Downloading 15.85MiB
18:52:19:WU01:FS01:Download 1.58%
18:52:25:WU01:FS01:Download 2.37%
18:52:34:WU01:FS01:Download 3.15%
18:52:41:WU01:FS01:Download 4.73%
18:52:47:WU01:FS01:Download 6.31%
18:52:58:WU01:FS01:Download 7.49%
18:53:05:WU01:FS01:Download 8.28%
18:53:07:WU00:FS01:0x21:Completed 500000 out of 500000 steps (100%)
18:53:12:WU01:FS01:Download 9.46%
18:53:13:WU00:FS01:0x21:Saving result file logfile_01.txt
18:53:13:WU00:FS01:0x21:Saving result file checkpointState.xml
18:53:14:WU00:FS01:0x21:Saving result file checkpt.crc
18:53:14:WU00:FS01:0x21:Saving result file log.txt
18:53:14:WU00:FS01:0x21:Saving result file positions.xtc
18:53:14:WU00:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
18:53:14:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
18:53:14:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:14229 run:1434 clone:0 gen:27 core:0x21 unit:0x0000001e80fccb0a5d65528c9ff1d337
18:53:14:WU00:FS01:Uploading 51.50MiB to 128.252.203.10
18:53:14:WU00:FS01:Connecting to 128.252.203.10:8080
18:53:19:WU01:FS01:Download 10.64%
18:53:20:WU00:FS01:Upload 4.13%
18:53:26:WU00:FS01:Upload 9.10%
18:53:26:WU01:FS01:Download 11.83%
18:53:32:WU00:FS01:Upload 18.21%
18:53:32:WU01:FS01:Download 12.61%
18:53:38:WU00:FS01:Upload 30.34%
18:53:39:WU01:FS01:Download 13.80%
18:53:44:WU00:FS01:Upload 42.96%
18:53:47:WU01:FS01:Download 14.98%
18:53:50:WU00:FS01:Upload 54.49%
18:53:55:WU01:FS01:Download 15.37%
18:53:56:WU00:FS01:Upload 64.45%
18:54:02:WU00:FS01:Upload 76.58%
18:54:02:WU01:FS01:Download 16.56%
18:54:08:WU00:FS01:Upload 88.96%
18:54:14:WU00:FS01:Upload 98.19%
18:54:15:WU01:FS01:Download 16.95%
18:54:17:WU00:FS01:Upload complete
18:54:17:WU00:FS01:Server responded WORK_ACK (400)
18:54:17:WU00:FS01:Final credit estimate, 36768.00 points
18:54:17:WU00:FS01:Cleaning up
18:54:26:WU01:FS01:Download 18.53%
18:54:32:WU01:FS01:Download 19.32%
18:55:21:WU01:FS01:Download 20.50%
18:55:44:WU01:FS01:Download 20.89%
18:55:50:WU01:FS01:Download 21.68%
18:55:59:WU01:FS01:Download 22.47%
18:56:05:WU01:FS01:Download 23.26%
18:56:18:WU01:FS01:Download 24.05%
18:56:30:WU01:FS01:Download 24.44%
18:56:37:WU01:FS01:Download 25.23%
18:56:43:WU01:FS01:Download 26.41%
18:56:50:WU01:FS01:Download 27.59%
18:57:02:WU01:FS01:Download 29.17%
18:57:09:WU01:FS01:Download 29.96%
18:57:22:WU01:FS01:Download 30.35%
18:57:30:WU01:FS01:Download 31.54%
18:57:39:WU01:FS01:Download 32.72%
18:57:59:WU01:FS01:Download 33.11%
18:58:06:WU01:FS01:Download 34.30%
18:58:14:WU01:FS01:Download 35.48%
18:58:22:WU01:FS01:Download 36.66%
18:58:31:WU01:FS01:Download 37.84%
18:58:38:WU01:FS01:Download 39.42%
18:58:45:WU01:FS01:Download 40.60%
18:58:54:WU01:FS01:Download 41.79%
18:59:02:WU01:FS01:Download 42.97%
18:59:08:WU01:FS01:Download 44.15%
18:59:14:WU01:FS01:Download 45.33%
18:59:21:WU01:FS01:Download 47.31%
18:59:27:WU01:FS01:Download 48.88%
18:59:33:WU01:FS01:Download 52.04%
18:59:40:WU01:FS01:Download 55.19%
18:59:47:WU01:FS01:Download 56.77%
19:00:10:WU01:FS01:Download 57.16%
19:02:51:WU01:FS01:Download 57.55%
19:02:51:ERROR:WU01:FS01:Exception: Transfer failed
19:02:51:WU01:FS01:Connecting to 65.254.110.245:8080
19:02:52:WU01:FS01:Assigned to work server 128.252.203.10
19:02:52:WU01:FS01:Requesting new work unit for slot 01: READY gpu:0:TU116 [GeForce GTX 1660 Ti] from 128.252.203.10
19:02:52:WU01:FS01:Connecting to 128.252.203.10:8080
19:02:53:WU01:FS01:Downloading 68.43MiB
19:02:59:WU01:FS01:Download 34.61%
19:03:05:WU01:FS01:Download 70.78%
19:03:09:WU01:FS01:Download complete
19:03:09:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:14228 run:585 clone:1 gen:13 core:0x21 unit:0x0000001280fccb0a5d716f5ad52a22a9
19:03:09:WU01:FS01:Starting
19:03:09:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\FAHClient\cores/cores.foldingathome.org/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 01 -suffix 01 -version 705 -lifeline 8016 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
19:03:10:WU01:FS01:Started FahCore on PID 3404
19:03:10:WU01:FS01:Core PID:7784
19:03:10:WU01:FS01:FahCore 0x21 started
19:03:10:WU01:FS01:0x21:*********************** Log Started 2019-09-28T19:03:10Z ***********************
19:03:10:WU01:FS01:0x21:Project: 14228 (Run 585, Clone 1, Gen 13)
19:03:10:WU01:FS01:0x21:Unit: 0x0000001280fccb0a5d716f5ad52a22a9
19:03:10:WU01:FS01:0x21:CPU: 0x00000000000000000000000000000000
19:03:10:WU01:FS01:0x21:Machine: 1
19:03:10:WU01:FS01:0x21:Reading tar file core.xml
19:03:10:WU01:FS01:0x21:Reading tar file integrator.xml
19:03:10:WU01:FS01:0x21:Reading tar file state.xml
19:03:11:WU01:FS01:0x21:Reading tar file system.xml
19:03:11:WU01:FS01:0x21:Digital signatures verified
19:03:11:WU01:FS01:0x21:Folding@home GPU Core21 Folding@home Core
19:03:11:WU01:FS01:0x21:Version 0.0.20
19:03:30:WU01:FS01:0x21:Completed 0 out of 2000000 steps (0%)
19:03:30:WU01:FS01:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
19:07:09:WU01:FS01:0x21:Completed 20000 out of 2000000 steps (1%)
19:10:55:WU01:FS01:0x21:Completed 40000 out of 2000000 steps (2%)
19:14:40:WU01:FS01:0x21:Completed 60000 out of 2000000 steps (3%)
19:18:25:WU01:FS01:0x21:Completed 80000 out of 2000000 steps (4%)
19:22:10:WU01:FS01:0x21:Completed 100000 out of 2000000 steps (5%)
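
To put numbers on "very slow" versus "very quickly", the download lines in a log like the one above can be turned into an average rate per work server. This is only a sketch: the function name `download_rates` is made up, and it assumes the line layout shown in this log, sizes reported in MiB, and timestamps that do not cross midnight.

```python
import re
from datetime import datetime

# Matches lines such as "18:52:12:WU01:FS01:Downloading 15.85MiB".
LINE = re.compile(r"^(\d\d:\d\d:\d\d):(?:ERROR:)?(WU\d+):FS\d+:(.*)$")


def download_rates(log_lines):
    """Return one (server, MiB received, seconds, MiB/s) tuple per download attempt.

    Assumes timestamps do not cross midnight and that the reported
    percentage is a fraction of the size announced by "Downloading".
    """
    server = {}    # WU id -> most recently assigned work server
    current = {}   # WU id -> attempt currently being tracked
    attempts = []

    for raw in log_lines:
        match = LINE.match(raw.strip())
        if not match:
            continue
        stamp, wu, msg = match.groups()
        now = datetime.strptime(stamp, "%H:%M:%S")

        if msg.startswith("Assigned to work server "):
            server[wu] = msg.rsplit(" ", 1)[1]
        elif msg.startswith("Downloading "):
            size = float(msg.split()[1].replace("MiB", ""))
            current[wu] = {"server": server.get(wu, "unknown"), "size": size,
                           "start": now, "last": now, "pct": 0.0}
            attempts.append(current[wu])
        elif msg.startswith("Download ") and wu in current:
            progress = msg.split()[1]
            done = progress == "complete"
            current[wu]["pct"] = 100.0 if done else float(progress.rstrip("%"))
            current[wu]["last"] = now

    results = []
    for a in attempts:
        seconds = (a["last"] - a["start"]).total_seconds()
        received = a["size"] * a["pct"] / 100
        rate = received / seconds if seconds > 0 else 0.0
        results.append((a["server"], received, seconds, rate))
    return results


if __name__ == "__main__":
    # A few lines copied from the log above; a real run would read log.txt.
    sample = [
        "18:52:11:WU01:FS01:Assigned to work server 155.247.166.220",
        "18:52:12:WU01:FS01:Downloading 15.85MiB",
        "19:02:51:WU01:FS01:Download 57.55%",
        "19:02:51:ERROR:WU01:FS01:Exception: Transfer failed",
        "19:02:52:WU01:FS01:Assigned to work server 128.252.203.10",
        "19:02:53:WU01:FS01:Downloading 68.43MiB",
        "19:03:09:WU01:FS01:Download complete",
    ]
    for srv, mib, secs, rate in download_rates(sample):
        print(f"{srv}: {mib:.2f} MiB in {secs:.0f} s = {rate:.3f} MiB/s")
```

For the two attempts in this log that works out to roughly 0.014 MiB/s from 155.247.166.220 against roughly 4.3 MiB/s from 128.252.203.10 a few seconds later, which supports the point that the bottleneck is not local.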