18201 Upload repeating fails at 57.27% - (60 seconds)

parkut
Posts: 363
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

18201 Upload repeating fails at 57.27% - (60 seconds)

Post by parkut »

I have a problem with project:18201 run:5707 clone:0 gen:5

Similar to the previously reported 18201 work unit failing to upload here: viewtopic.php?f=18&t=37456

Found a Linux system with a stuck work unit. On examination, it tried an upload
that stalled at 57.27%, and it has retried over 80 times as of this writing. Each
upload retry stalls at the same 57.27% mark (it takes about 60 seconds).

I checked the WU status here: https://apps.foldingathome.org/wu#proje ... ne=0&gen=5
and it returns "not found".

Code:

05:14:57:WU02:FS01:0x22:Completed 1212500 out of 1250000 steps (97%)
05:18:00:WU02:FS01:0x22:Completed 1225000 out of 1250000 steps (98%)
05:18:01:WU02:FS01:0x22:Checkpoint completed at step 1225000
05:21:04:WU02:FS01:0x22:Completed 1237500 out of 1250000 steps (99%)
05:24:07:WU02:FS01:0x22:Completed 1250000 out of 1250000 steps (100%)
05:24:07:WU02:FS01:0x22:Average performance: 9.44262 ns/day
05:24:09:WU02:FS01:0x22:Checkpoint completed at step 1250000
05:24:16:WU02:FS01:0x22:Saving result file ../logfile_01.txt
05:24:16:WU02:FS01:0x22:Saving result file checkpointIntegrator.xml
05:24:16:WU02:FS01:0x22:Saving result file checkpointState.xml
05:24:21:WU02:FS01:0x22:Saving result file positions.xtc
05:24:21:WU02:FS01:0x22:Saving result file science.log
05:24:21:WU02:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
05:24:22:WU02:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
05:24:22:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:18201 run:5707 clone:0 gen:5 core:0x22 unit:0x0000000000000005000047190000164b
05:24:22:WU02:FS01:Uploading 27.50MiB to 128.252.203.11
05:24:22:WU02:FS01:Connecting to 128.252.203.11:8080
05:24:28:WU02:FS01:Upload 11.82%
05:24:34:WU02:FS01:Upload 24.09%
05:24:40:WU02:FS01:Upload 36.13%
05:24:46:WU02:FS01:Upload 48.63%
05:25:22:WU02:FS01:Upload 57.27%
05:25:22:WARNING:WU02:FS01:Exception: Failed to send results to work server: Transfer failed
05:25:22:WU02:FS01:Trying to send results to collection server
05:25:22:WU02:FS01:Uploading 27.50MiB to 128.252.203.13
05:25:22:WU02:FS01:Connecting to 128.252.203.13:8080
05:25:28:WU02:FS01:Upload 11.59%
05:25:34:WU02:FS01:Upload 23.63%
05:25:40:WU02:FS01:Upload 35.90%
05:25:46:WU02:FS01:Upload 48.18%
05:26:21:WU02:FS01:Upload 57.27%
05:26:21:ERROR:WU02:FS01:Exception: Transfer failed
05:26:21:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:18201 run:5707 clone:0 gen:5 core:0x22 unit:0x0000000000000005000047190000164b
05:26:22:WU02:FS01:Uploading 27.50MiB to 128.252.203.11
05:26:22:WU02:FS01:Connecting to 128.252.203.11:8080
05:26:28:WU02:FS01:Upload 12.73%
05:26:34:WU02:FS01:Upload 24.77%
05:26:40:WU02:FS01:Upload 37.04%
05:26:46:WU02:FS01:Upload 49.08%
05:27:21:WU02:FS01:Upload 57.27%
05:27:21:WARNING:WU02:FS01:Exception: Failed to send results to work server: Transfer failed

It has received new work and returned it since the WU upload stalled.

Code:

13:07:16:WU00:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
13:07:16:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
13:07:16:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:17804 run:23 clone:35 gen:420 core:0x22 unit:0x00000023000001a40000458c00000017
13:07:17:WU00:FS01:Uploading 8.58MiB to 207.53.233.146
13:07:17:WU00:FS01:Connecting to 207.53.233.146:8080
13:07:23:WU00:FS01:Upload 40.07%
13:07:29:WU00:FS01:Upload 78.69%
13:07:32:WU00:FS01:Upload complete
13:07:32:WU00:FS01:Server responded WORK_ACK (400)
13:07:32:WU00:FS01:Final credit estimate, 129377.00 points

Should I dump the stuck WU 18201?
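
The retry pattern in the log above (each upload reaches the same percentage and then ends in "Transfer failed") can be checked mechanically. Below is a minimal Python sketch, assuming the client log line format shown in this post. The function name `stall_points` and both regexes are illustrative and are not part of the FAH client.

```python
import re
from collections import Counter

# Matches progress lines such as "05:25:22:WU02:FS01:Upload 57.27%".
UPLOAD_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:(WU\d+):FS\d+:Upload (\d+(?:\.\d+)?)%"
)
# Matches failure lines such as
# "05:26:21:ERROR:WU02:FS01:Exception: Transfer failed".
FAIL_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:(?:WARNING|ERROR):(WU\d+):FS\d+:Exception:.*Transfer failed"
)


def stall_points(lines):
    """Count how often each (WU id, last upload percentage) pair precedes a failure."""
    last_pct = {}       # most recent upload percentage seen per WU id
    stalls = Counter()  # (WU id, percentage) -> number of failed attempts
    for line in lines:
        m = UPLOAD_RE.match(line)
        if m:
            last_pct[m.group(1)] = float(m.group(2))
            continue
        m = FAIL_RE.match(line)
        if m and m.group(1) in last_pct:
            # Consume the percentage so a failure is only counted once per attempt.
            stalls[(m.group(1), last_pct.pop(m.group(1)))] += 1
    return stalls


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python stall_points.py /path/to/log.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for (wu, pct), count in stall_points(fh).most_common():
            print(f"{wu}: {count} failed attempt(s) after reaching {pct}%")
```

A single pair with a high count, such as WU02 at 57.27, suggests the transfer is failing at a fixed point rather than at random.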
Gary480six
Posts: 93
Joined: Mon Jan 21, 2008 6:42 pm

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by Gary480six »

I'm having a similar issue - also with a P18201 work unit. P 18201 R:43928, G:0, C:4
Only mine gets all the way to 92.06% before failing. And I'm at 53 attempts.
The work server is 128.252.203.11 and it rolls over to the collection server 128.252.203.13

I don't care too much about the points..... but if there are two of us reporting - it's likely more people are affected.

Size of the file it's trying to upload is 27.50MB (if that helps the diagnosis in any way)
Joe_H
Site Admin
Posts: 7937
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by Joe_H »

I have sent a report to the person running this project and the servers.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
jjmiller
Scientist
Posts: 139
Joined: Fri Apr 09, 2021 4:43 pm

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by jjmiller »

Hi all,

Thanks for reporting this. I've taken a peek at the logs on the server side but unfortunately also don't see any information on the stuck WUs. I do see that each of you has returned other WUs from 18201 recently - if I may ask, were these completed on the same computers/internet connection/AV programs? I've also raised this to the folks above me and am hopeful we're able to get this figured out this time.

Thanks,
parkut
Posts: 363
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by parkut »

If the question was directed at me: all were returned from the same residential ISP account and from
this one Ubuntu 18.04 Linux machine, which has returned many without issue and has returned more since the one got stuck.

Code:

All of these ran to completion and uploaded with no trouble

log-20210929-031152.txt:18:08:38:WU01:FS01:0x22:Project: 18201 (Run 2775, Clone 4, Gen 3)
log-20210929-031152.txt:04:16:29:WU00:FS01:0x22:Project: 18201 (Run 1498, Clone 3, Gen 15)
log-20210929-031152.txt:15:46:55:WU02:FS01:0x22:Project: 18201 (Run 2239, Clone 4, Gen 12)
log-20210929-031152.txt:10:02:14:WU00:FS01:0x22:Project: 18201 (Run 2082, Clone 1, Gen 16)
log-20210929-031152.txt:19:59:25:WU01:FS01:0x22:Project: 18201 (Run 2094, Clone 4, Gen 15)
log-20210929-031152.txt:10:31:35:WU00:FS01:0x22:Project: 18201 (Run 1144, Clone 3, Gen 28)
log-20211020-120158.txt:06:57:58:WU01:FS01:0x22:Project: 18201 (Run 1126, Clone 2, Gen 26)
log-20211020-120158.txt:07:42:08:WU01:FS01:0x22:Project: 18201 (Run 2352, Clone 4, Gen 17)
log-20211020-120158.txt:17:50:25:WU01:FS01:0x22:Project: 18201 (Run 85, Clone 3, Gen 13)
log-20211020-120158.txt:22:56:10:WU00:FS01:0x22:Project: 18201 (Run 56, Clone 1, Gen 37)
log-20211020-120158.txt:21:19:16:WU00:FS01:0x22:Project: 18201 (Run 2986, Clone 4, Gen 3)
log-20211020-120158.txt:02:25:32:WU01:FS01:0x22:Project: 18201 (Run 1501, Clone 3, Gen 34)
log-20211020-120158.txt:16:05:06:WU01:FS01:0x22:Project: 18201 (Run 1985, Clone 4, Gen 32)
log-20211020-120158.txt:14:28:00:WU01:FS01:0x22:Project: 18201 (Run 1605, Clone 2, Gen 12)
log-20211020-120158.txt:22:37:12:WU01:FS01:0x22:Project: 18201 (Run 1788, Clone 2, Gen 37)
log-20211020-120158.txt:09:15:17:WU01:FS01:0x22:Project: 18201 (Run 3292, Clone 4, Gen 3)
log-20211020-120158.txt:19:54:16:WU03:FS01:0x22:Project: 18201 (Run 3451, Clone 3, Gen 0)
log-20211020-120158.txt:01:07:31:WU00:FS01:0x22:Project: 18201 (Run 2963, Clone 4, Gen 14)
log-20211020-120158.txt:21:47:17:WU01:FS01:0x22:Project: 18201 (Run 3584, Clone 0, Gen 1)
log-20211020-120158.txt:18:51:17:WU02:FS01:0x22:Project: 18201 (Run 1202, Clone 0, Gen 36)
log-20211020-120158.txt:19:50:21:WU00:FS01:0x22:Project: 18201 (Run 4457, Clone 0, Gen 0)
log-20211020-120158.txt:12:06:16:WU02:FS01:0x22:Project: 18201 (Run 1988, Clone 0, Gen 23)
log-20211020-120158.txt:15:48:06:WU02:FS01:0x22:Project: 18201 (Run 7152, Clone 0, Gen 0)
log-20211020-120158.txt:20:54:01:WU01:FS01:0x22:Project: 18201 (Run 12110, Clone 0, Gen 0)
log-20211020-120158.txt:07:29:34:WU00:FS01:0x22:Project: 18201 (Run 25018, Clone 0, Gen 0)
log-20211020-120158.txt:12:33:37:WU01:FS01:0x22:Project: 18201 (Run 31581, Clone 0, Gen 0)
log-20211020-120158.txt:17:37:57:WU02:FS01:0x22:Project: 18201 (Run 37844, Clone 0, Gen 0)
log-20211020-120158.txt:22:51:26:WU00:FS01:0x22:Project: 18201 (Run 5362, Clone 0, Gen 1)
log-20211020-120158.txt:03:58:53:WU01:FS01:0x22:Project: 18201 (Run 11926, Clone 0, Gen 1)
log-20211020-120158.txt:18:24:22:WU02:FS01:0x22:Project: 18201 (Run 32824, Clone 0, Gen 1)
log-20211020-120158.txt:20:57:55:WU01:FS01:0x22:Project: 18201 (Run 24192, Clone 0, Gen 2)
log-20211020-120158.txt:01:55:16:WU02:FS01:0x22:Project: 18201 (Run 14750, Clone 0, Gen 3)
log-20211020-120158.txt:13:05:35:WU01:FS01:0x22:Project: 18201 (Run 31207, Clone 0, Gen 3)
log-20211020-120158.txt:19:32:31:WU01:FS01:0x22:Project: 18201 (Run 41233, Clone 0, Gen 3)
log-20211020-120158.txt:00:44:33:WU02:FS01:0x22:Project: 18201 (Run 6841, Clone 0, Gen 1)
log-20211020-120158.txt:17:44:26:WU01:FS01:0x22:Project: 18201 (Run 20025, Clone 0, Gen 4)
log-20211020-120158.txt:22:49:01:WU02:FS01:0x22:Project: 18201 (Run 24977, Clone 0, Gen 4)
log-20211020-120158.txt:10:47:30:WU02:FS01:0x22:Project: 18201 (Run 39160, Clone 0, Gen 4)
log-20211020-120158.txt:19:05:27:WU00:FS01:0x22:Project: 18201 (Run 49570, Clone 0, Gen 4)
log-20211020-120158.txt:05:24:22:WU00:FS01:0x22:Project: 18201 (Run 11173, Clone 0, Gen 5)
log-20211020-120158.txt:19:05:15:WU00:FS01:0x22:Project: 18201 (Run 27161, Clone 0, Gen 5)
log-20211020-151421.txt:13:07:17:WU01:FS01:0x22:Project: 18201 (Run 43662, Clone 0, Gen 5)
Model Name: GTX 1660 Ti
Driver Version: 460.91.03
Gpu temp: 64C
Power Draw: 121.92W
...
Client Version: 7.6.21
Core: FahCore_22
Core Version: 0.0.13

edited for clarity
Last edited by parkut on Thu Oct 21, 2021 12:03 am, edited 2 times in total.
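
A list like the one above can be turned into project/run/clone/gen tuples for comparison against a stuck WU. Below is a minimal Python sketch, assuming grep-style input lines in the format shown in this post. The names `PRCG_RE` and `parse_prcg` are illustrative only.

```python
import re

# Matches the tail of lines such as
# "log-20211020-120158.txt:05:24:22:WU00:FS01:0x22:Project: 18201 (Run 11173, Clone 0, Gen 5)"
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def parse_prcg(line):
    """Return (project, run, clone, gen) as ints, or None if the line has no PRCG."""
    m = PRCG_RE.search(line)
    return tuple(int(g) for g in m.groups()) if m else None


def collect_prcgs(lines):
    """Return the set of PRCG tuples found in an iterable of log lines."""
    return {p for p in map(parse_prcg, lines) if p is not None}
```

For example, `(18201, 5707, 0, 5) in collect_prcgs(lines)` reports whether the stuck unit from the first post appears in a given set of log lines.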
jchang6
Posts: 65
Joined: Sat May 09, 2020 2:13 pm
Hardware configuration: Intel Xeon E3/E5, various generations from Westmere to Skylake. AMD Radeon RX5x00 and nVidia RTX 2080 Super.
Location: Boston

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by jchang6 »

18201 works great for me, regularly getting 8M+ PPD on 3080 TI
parkut
Posts: 363
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by parkut »

Also, there is no AV running on this machine.
jjmiller
Scientist
Posts: 139
Joined: Fri Apr 09, 2021 4:43 pm

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by jjmiller »

Thanks parkut for the followup information.

This remains perplexing: the only thing that has recently changed on our end is that I reset the server a couple of days before the stuck WU was assigned (on the 17th), but I wouldn't expect any wrinkles from that several days later. Other WU returns are of similar packet size and there are a ton of WUs being returned (highlighted by both you and jchang6). I'm really stuck on what else may be behind WUs sporadically getting stuck.

Based on the previous post (viewtopic.php?f=18&t=37456) it seems unlikely that the WU will unstick and resetting the server doesn't seem to affect the sticking. I just issued a restart to the server, with the hope that things are different this time around?
Gary480six
Posts: 93
Joined: Mon Jan 21, 2008 6:42 pm

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by Gary480six »

I have at least seven systems Folding here from the same ISP. I did not notice whether the one system that is having an issue with the uploads has completed another P18201 since the troubles started... but I did spot a P18201 that system completed some weeks back.

My 'troubled' system is running Windows 7 Pro and has just Microsoft Security Essentials. And this system has completed other GPU work units since the P18201 got 'stuck'.

I was not expecting a 'fix' on your end. On the 23rd it will reach its five-day final deadline... and get deleted.

Maybe this was just a blip that happened to a handful of work units assigned on the 17th/18th - and that's all.
toTOW
Site Moderator
Posts: 6359
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by toTOW »

Sometimes, a reboot of the modem and/or the router helps ...

Or a reboot of the FAH server :D

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
Gary480six
Posts: 93
Joined: Mon Jan 21, 2008 6:42 pm

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by Gary480six »

Thought I'd follow up here - because it looks like there might be a major problem with this project or this server.

I have a different system stuck. The work is still from P18201 but now it's R:54378 C:0 G:9
Sent to me on 10/30. Finished but failing to upload. Stopping at 62.06% this time.

But look at the status page for that work

Code:

https://apps.foldingathome.org/wu#project=18201&run=54378&clone=0&gen=9
It only shows two assignments of that work - neither of them me. So not even a record of the failure. But also notice that it was first assigned on AUGUST 31 - and not completed till NOVEMBER 03.
That work unit has a FIVE day final expiration. Does that mean it was sent out EIGHTEEN times before it was completed?

Is nobody seeing this as a problem?
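
The WU status link quoted in this post follows a simple pattern, so it can be built from the project, run, clone and gen numbers. Below is a minimal Python sketch. The URL layout is copied from the link above, and the function name `wu_status_url` is only illustrative.

```python
def wu_status_url(project, run, clone, gen):
    """Build a WU status page link in the same form as the one quoted in this thread."""
    return (
        "https://apps.foldingathome.org/wu"
        f"#project={project}&run={run}&clone={clone}&gen={gen}"
    )


# The work unit discussed in this post: P18201 R:54378 C:0 G:9
print(wu_status_url(18201, 54378, 0, 9))
```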
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by Neil-B »

Reported this thread to the researcher concerned.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
jjmiller
Scientist
Posts: 139
Joined: Fri Apr 09, 2021 4:43 pm

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by jjmiller »

Hi Gary480six and thanks for the followup.

Regarding the status of this particular WU: it was assigned once on 8/31 and failed within 20 minutes of assignment (as seen on WU Status). The very next assignment of this WU was on 10/30 (to you). When it got stuck and hit the timeout (2 days), it became eligible for reassignment and was reassigned on 11/3. The record of your stuck WU should not hit WU Status until it reaches the deadline (5 days), but it is recorded in our private logs. The reason for the change in WU assignment timing is that we swapped the prioritization of WUs for this project around a little and are currently prioritizing runs > gens > clones, in accordance with some timeline developments on our end.
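
The timeline above can be checked with simple date arithmetic. Below is a minimal Python sketch, assuming the year is 2021 (the thread's other posts date from October 2021) and counting in whole days, since the exact assignment times are not given. The names `TIMEOUT`, `DEADLINE` and `wu_windows` are illustrative only.

```python
from datetime import date, timedelta

TIMEOUT = timedelta(days=2)   # after this, the WU may be reassigned
DEADLINE = timedelta(days=5)  # after this, the result is no longer accepted


def wu_windows(assigned):
    """Return (timeout date, deadline date) for a WU assigned on the given day."""
    return assigned + TIMEOUT, assigned + DEADLINE


assigned = date(2021, 10, 30)    # assignment to Gary480six (year assumed)
reassigned = date(2021, 11, 3)   # next assignment, per the post above

timeout_on, deadline_on = wu_windows(assigned)
print("eligible for reassignment from:", timeout_on)
print("final deadline:", deadline_on)
print("reassigned after timeout:", reassigned >= timeout_on)
```

Under these assumptions the reassignment on 11/3 falls after the 2-day timeout and before the 5-day deadline, which fits the explanation that the stuck WU had not yet appeared on WU Status.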

WUs becoming stuck is a big concern. I don't like that we're sending WUs out and not able to retrieve them after people have donated their resources folding them. At the moment we cannot differentiate between a WU that was completed and the upload failed vs a WU that was assigned but then the person immediately turned off their computer, so these reports are vital in our efforts to fix this. I am continuing to troubleshoot what may be causing these issues. At one point we were concerned that WUs assigned or returning at the time of a server restart might be getting stuck, but this report rules that out, as the WS for P18201 has been live for 13 days. I've raised this as an issue to the higher-ups on my end.

Out of curiosity: do you have a max packet/upload size set?

Thanks,
Gary480six
Posts: 93
Joined: Mon Jan 21, 2008 6:42 pm

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by Gary480six »

JJ,

Thanks for the information. No - I don't have max packet/upload size set. Even though I have been Folding for years... I don't even know where that setting would be? Is that something in the Configure/Expert section?

You said that the original release of this work unit failed and was returned after only 20 minutes. Does that follow any pattern with the other work that is getting 'stuck'? (a quick failure followed by issues later)

The i7-4770 with its GTX 1070 that has this specific stuck work just finished and returned a different P18201. So the hardware, network etc. do not have an issue with this work or those servers. This is one quirky issue. :-)
jjmiller
Scientist
Posts: 139
Joined: Fri Apr 09, 2021 4:43 pm

Re: 18201 Upload repeating fails at 57.27% - (60 seconds)

Post by jjmiller »

Max-packet used to be a setting that could be used for folks working on dial-up modems but I don't think it is widely used any more (and may not be supported?).
Technically, it's still supported but it has been adjusted in a recent client update -- perhaps because it was causing too many WUs to be reassigned...perhaps not.

It CAN be set by the donor and used by the project owner ... but it's not widely used.