Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Wed Oct 30, 2019 5:22 pm
by bollix47
I had a similar problem last weekend ... here's what worked for me:

Open FAHControl
Click on Pause
Exit FAHControl
Re-boot computer
Open FAHControl
Click on Fold

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Wed Oct 30, 2019 5:50 pm
by Frisa
As a CPU-only folder, I found one interesting thing: the A7 projects never failed to download/upload.
Currently most A7 WUs are assigned from the A7-only server 128.252.203.9, but occasionally I get WUs from 155.247.166.219. Sometimes I get a transfer failure from 128.252.203.9, but I have not had a SINGLE transfer failure from the .219 server in one month.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Wed Oct 30, 2019 5:59 pm
by bollix47
My failure on the weekend was an a7 project ... unfortunately it happens to all projects regardless of the core. It does not happen every time and some will not experience it for weeks at a time or maybe never, but it's certainly a 'pain' when it happens. The fact is that the core has nothing to do with the download/upload sequences ... that's all done by the client.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Wed Oct 30, 2019 6:16 pm
by biodoc
bollix47 wrote:I had a similar problem last weekend ... here's what worked for me:

Open FAHControl
Click on Pause
Exit FAHControl
Re-boot computer
Open FAHControl
Click on Fold
This worked. Thanks!

I tried stopping the fahclient, killing any remaining processes for user fahclient, and then restarting the client, but that didn't work in this case. Rebooting Linux is not a satisfying solution for me, but it worked.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Thu Oct 31, 2019 1:27 pm
by bruce
There's a known bug in FAHCore_A7. It has been fixed in a new version of that FAHCore and that version is being beta tested so it should be ready to release soon. The bug causes extra (unnecessary) information to be added to the result, making the file too large to upload. Excessively large uploads are being rejected by the servers. (Yours is 68 MiB and it should be maybe 10 MiB)

I would probably discard that file, but it will eventually expire and delete itself from your system.

Most likely you have been processing that WU with the "on idle" setting. I recommend you discontinue using that setting until the new CPU FAHCore_a7 is released.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Wed Nov 06, 2019 9:16 pm
by Catalina588
November 6 1600 EST - Temple server .219 is failing to download GPU work units.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Wed Nov 06, 2019 10:27 pm
by DocJonz
Catalina588 wrote:November 6 1600 EST - Temple server .219 is failing to download GPU work units.
I concur - looks like the download issues are back with the 155.247.166.* server.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Wed Nov 06, 2019 10:34 pm
by JimF
I am down on two out of four Folding machines. I will keep them down until/unless someone can give the "all clear".

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Wed Nov 06, 2019 11:21 pm
by bruce
What messages are you seeing when .219 doesn't issue a WU?

=====

The bug in the CPU FAHCore_A7 has been fixed, so CPU WUs going out now will no longer be inflated and will no longer consume extra bandwidth. Over the next couple of weeks, the CPU WUs that are currently being processed will be completed and the congestion problem will gradually be reduced.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Wed Nov 06, 2019 11:58 pm
by dfgirl12
Same. I've been getting hung folding slots for the past 4+ hours. Downloads fail from *.219, and just stop like this:

2019-11-06:23:46:53:WU01:FS01:0x21:Completed 25000000 out of 25000000 steps (100%)
2019-11-06:23:46:53:WU01:FS01:0x21:Saving result file logfile_01.txt
2019-11-06:23:46:53:WU01:FS01:0x21:Saving result file checkpointState.xml
2019-11-06:23:46:53:WU01:FS01:0x21:Saving result file checkpt.crc
2019-11-06:23:46:53:WU01:FS01:0x21:Saving result file log.txt
2019-11-06:23:46:53:WU01:FS01:0x21:Saving result file positions.xtc
2019-11-06:23:46:54:WU01:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
2019-11-06:23:46:54:WU01:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
2019-11-06:23:46:54:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:14191 run:17 clone:13 gen:89 core:0x21 unit:0x0000007f0002894c5d5d742b6b992b05
2019-11-06:23:46:54:WU01:FS01:Uploading 9.30MiB to 155.247.166.220
2019-11-06:23:46:54:WU01:FS01:Connecting to 155.247.166.220:8080
2019-11-06:23:46:54:WU02:FS01:Connecting to 65.254.110.245:8080
2019-11-06:23:46:54:WU02:FS01:Assigned to work server 155.247.166.219
2019-11-06:23:46:54:WU02:FS01:Requesting new work unit for slot 01: READY gpu:1:GP104 [GeForce GTX 1080] 8873 from 155.247.166.219
2019-11-06:23:46:54:WU02:FS01:Connecting to 155.247.166.219:8080
2019-11-06:23:46:55:WU02:FS01:Downloading 27.50MiB
2019-11-06:23:47:00:WU01:FS01:Upload 11.43%
2019-11-06:23:47:02:WU02:FS01:Download 1.59%
2019-11-06:23:47:06:WU01:FS01:Upload 21.51%
2019-11-06:23:47:09:WU02:FS01:Download 2.27%
2019-11-06:23:47:13:WU01:FS01:Upload 30.92%
2019-11-06:23:47:17:WU02:FS01:Download 2.73%
2019-11-06:23:47:19:WU01:FS01:Upload 43.02%
2019-11-06:23:47:24:WU02:FS01:Download 3.18%
2019-11-06:23:47:25:WU01:FS01:Upload 56.46%
2019-11-06:23:47:31:WU01:FS01:Upload 67.88%
2019-11-06:23:47:33:WU02:FS01:Download 3.86%
2019-11-06:23:47:37:WU01:FS01:Upload 84.02%
2019-11-06:23:47:42:WU02:FS01:Download 4.32%
2019-11-06:23:47:43:WU01:FS01:Upload 96.79%
2019-11-06:23:47:45:WU01:FS01:Upload complete
2019-11-06:23:47:45:WU01:FS01:Server responded WORK_ACK (400)
2019-11-06:23:47:45:WU01:FS01:Final credit estimate, 192064.00 points
2019-11-06:23:47:45:WU01:FS01:Cleaning up
2019-11-06:23:48:21:WU02:FS01:Download 4.55%
2019-11-06:23:48:29:WU02:FS01:Download 4.77%
2019-11-06:23:48:35:WU02:FS01:Download 5.23%
2019-11-06:23:51:11:WU02:FS01:Download 5.45%
2019-11-06:23:51:12:ERROR:WU02:FS01:Exception: Transfer failed
2019-11-06:23:51:13:WU02:FS01:Connecting to 65.254.110.245:8080
2019-11-06:23:51:13:WU02:FS01:Assigned to work server 155.247.166.219
2019-11-06:23:51:13:WU02:FS01:Requesting new work unit for slot 01: READY gpu:1:GP104 [GeForce GTX 1080] 8873 from 155.247.166.219
2019-11-06:23:51:13:WU02:FS01:Connecting to 155.247.166.219:8080
2019-11-06:23:51:13:WU02:FS01:Downloading 27.45MiB
2019-11-06:23:51:21:WU02:FS01:Download 0.46%
2019-11-06:23:51:28:WU02:FS01:Download 1.59%
2019-11-06:23:51:54:WU02:FS01:Download 2.28%
2019-11-06:23:52:20:WU02:FS01:Download 2.96%
2019-11-06:23:52:26:WU02:FS01:Download 3.87%
2019-11-06:23:52:33:WU02:FS01:Download 4.55%
2019-11-06:23:52:39:WU02:FS01:Download 5.46%
2019-11-06:23:52:45:WU02:FS01:Download 6.15%
2019-11-06:23:52:56:WU02:FS01:Download 7.06%
2019-11-06:23:53:03:WU02:FS01:Download 7.51%
2019-11-06:23:53:10:WU02:FS01:Download 8.42%
2019-11-06:23:53:20:WU02:FS01:Download 8.65%
2019-11-06:23:53:26:WU02:FS01:Download 8.88%
2019-11-06:23:53:35:WU02:FS01:Download 9.56%
2019-11-06:23:53:42:WU02:FS01:Download 9.79%
2019-11-06:23:53:49:WU02:FS01:Download 10.47%
2019-11-06:23:53:56:WU02:FS01:Download 10.93%
2019-11-06:23:54:02:WU02:FS01:Download 11.15%
2019-11-06:23:54:26:WU02:FS01:Download 11.38%
2019-11-06:23:54:32:WU02:FS01:Download 12.29%
2019-11-06:23:54:40:WU02:FS01:Download 12.98%
2019-11-06:23:54:46:WU02:FS01:Download 13.43%
2019-11-06:23:54:53:WU02:FS01:Download 14.34%
2019-11-06:23:55:14:WU02:FS01:Download 14.57%
2019-11-06:23:55:20:WU02:FS01:Download 15.94%
2019-11-06:23:55:26:WU02:FS01:Download 16.85%
2019-11-06:23:55:32:WU02:FS01:Download 17.76%
2019-11-06:23:55:38:WU02:FS01:Download 19.35%
2019-11-06:23:55:45:WU02:FS01:Download 20.72%
2019-11-06:23:55:52:WU02:FS01:Download 21.63%
2019-11-06:23:55:58:WU02:FS01:Download 22.31%
2019-11-06:23:56:05:WU02:FS01:Download 23.22%
2019-11-06:23:56:11:WU02:FS01:Download 24.13%
2019-11-06:23:56:17:WU02:FS01:Download 25.04%
2019-11-06:23:56:36:WU02:FS01:Download 25.95%
2019-11-06:23:56:45:WU02:FS01:Download 26.18%
2019-11-06:23:56:53:WU02:FS01:Download 26.41%

Or, this one that fixed itself. Downloads failed from *.219, but are OK from *.220:

2019-11-06:23:47:36:WU00:FS00:Connecting to 65.254.110.245:8080
2019-11-06:23:47:36:WU00:FS00:Assigned to work server 155.247.166.219
2019-11-06:23:47:36:WU00:FS00:Requesting new work unit for slot 00: READY gpu:0:TU102 [GeForce RTX 2080 Ti Rev. A] M 13448 from 155.247.166.219
2019-11-06:23:47:36:WU00:FS00:Connecting to 155.247.166.219:8080
2019-11-06:23:47:37:ERROR:WU00:FS00:Exception: Server did not assign work unit
2019-11-06:23:47:37:WU00:FS00:Connecting to 65.254.110.245:8080
2019-11-06:23:47:38:WU00:FS00:Assigned to work server 155.247.166.219
2019-11-06:23:47:38:WU00:FS00:Requesting new work unit for slot 00: READY gpu:0:TU102 [GeForce RTX 2080 Ti Rev. A] M 13448 from 155.247.166.219
2019-11-06:23:47:38:WU00:FS00:Connecting to 155.247.166.219:8080
2019-11-06:23:47:38:WU00:FS00:Downloading 27.49MiB
2019-11-06:23:47:52:WU00:FS00:Download 0.68%
2019-11-06:23:47:59:WU00:FS00:Download 0.91%
2019-11-06:23:48:09:WU00:FS00:Download 1.36%
2019-11-06:23:48:17:WU00:FS00:Download 2.05%
2019-11-06:23:48:25:WU00:FS00:Download 2.73%
2019-11-06:23:48:47:WU00:FS00:Download 2.96%
2019-11-06:23:49:45:WU00:FS00:Download 3.08%
2019-11-06:23:49:45:ERROR:WU00:FS00:Exception: Transfer failed
2019-11-06:23:49:45:WU00:FS00:Connecting to 65.254.110.245:8080
2019-11-06:23:49:46:WU00:FS00:Assigned to work server 155.247.166.220
2019-11-06:23:49:46:WU00:FS00:Requesting new work unit for slot 00: READY gpu:0:TU102 [GeForce RTX 2080 Ti Rev. A] M 13448 from 155.247.166.220
2019-11-06:23:49:46:WU00:FS00:Connecting to 155.247.166.220:8080
2019-11-06:23:49:46:WU00:FS00:Downloading 15.58MiB
2019-11-06:23:49:52:WU00:FS00:Download 89.88%
2019-11-06:23:49:52:WU00:FS00:Download complete
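
Logs like the two above can be summarized mechanically. The following is a minimal Python sketch, not part of FAHClient, that tallies logged exceptions per work server. It assumes the line layout seen in these excerpts (`...:WUnn:FSnn:Assigned to work server <ip>` and `...:ERROR:WUnn:FSnn:Exception: <message>`). It attributes each exception to the server most recently assigned to that WU/slot pair, which is a heuristic and not something the log states outright. The name `count_failures_by_server` is hypothetical.

```python
import re
from collections import Counter

# Patterns assumed from the log excerpts in this thread, not from any official spec.
ASSIGNED_RE = re.compile(r":(WU\d+):(FS\d+):Assigned to work server (\S+)$")
ERROR_RE = re.compile(r":ERROR:(WU\d+):(FS\d+):Exception: (.+)$")


def count_failures_by_server(log_lines):
    """Count exceptions, keyed by (server, message).

    Each exception is attributed to the work server most recently
    assigned to the same WU/slot pair (a heuristic).
    """
    current_server = {}
    failures = Counter()
    for raw in log_lines:
        line = raw.strip()
        assigned = ASSIGNED_RE.search(line)
        if assigned:
            wu, slot, server = assigned.groups()
            current_server[(wu, slot)] = server
            continue
        error = ERROR_RE.search(line)
        if error:
            wu, slot, message = error.groups()
            server = current_server.get((wu, slot), "unknown")
            failures[(server, message)] += 1
    return failures


# Lines copied from the second log above.
sample = [
    "2019-11-06:23:47:36:WU00:FS00:Assigned to work server 155.247.166.219",
    "2019-11-06:23:47:37:ERROR:WU00:FS00:Exception: Server did not assign work unit",
    "2019-11-06:23:47:38:WU00:FS00:Assigned to work server 155.247.166.219",
    "2019-11-06:23:49:45:ERROR:WU00:FS00:Exception: Transfer failed",
    "2019-11-06:23:49:46:WU00:FS00:Assigned to work server 155.247.166.220",
    "2019-11-06:23:49:52:WU00:FS00:Download complete",
]

failures = count_failures_by_server(sample)
for (server, message), count in sorted(failures.items()):
    print(server, message, count)
```

Run against a full log file, a tally like this would make it easier to report which server and which error message are involved, which is the information asked for earlier in the thread.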

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Thu Nov 07, 2019 3:28 pm
by absolutefunk
Both of my 1070s were hung on 'download' this morning. 155.247.166.219 needs to be pulled until the underlying problem can be fixed. It's been over a month intermittently now. This is not a good look for the project.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Fri Nov 08, 2019 2:54 pm
by bruce
absolutefunk wrote:Both of my 1070s were hung on 'download' this morning. 155.247.166.219 needs to be pulled until the underlying problem can be fixed. It's been over a month intermittently now. This is not a good look for the project.

I don't think there's any chance that the server will be pulled. vav3.ocis.temple.edu is currently supporting about 25% of FAH's activity. Taking that capacity off-line would hugely disrupt your ability to get an assignment when you need one. I understand it looks bad and is an inconvenience for you, but pulling the server is unrealistic and would turn a moderate problem into a big one. New hardware is being ordered to handle the recent increase in production, but provisioning for that increase takes time and money.

Besides, the first step has been completed (fixing the FAHCore_a7 software), and rolling out that fix takes time because it cannot be called "completed" until all WUs currently in the field are refreshed with new ones, no matter how slow the Donor hardware happens to be.

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Fri Nov 08, 2019 3:42 pm
by absolutefunk
Wow, 25%. I thought the load was more distributed than that. These issues don't bother me that much, but the hanging downloads require manual intervention on our part, and for folders who don't (or can't) check their systems periodically, it results in lost output. I'm hoping the next client release supports a hard timeout on downloads, which would help a lot.
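
Until the client has such a timeout, a stalled download can at least be spotted from the log. The sketch below is an external check, not a client feature. It measures the longest gap between consecutive `Download nn.nn%` progress lines for one work unit, using the timestamp layout seen in the log posted earlier in this thread. The function name `longest_download_gap` and the 120-second limit are assumptions chosen for illustration.

```python
import re
from datetime import datetime

# Progress-line layout assumed from the log excerpts posted earlier in the thread.
PROGRESS_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}:\d{2}:\d{2}:\d{2}):(WU\d+):(FS\d+):Download \d+(?:\.\d+)?%$"
)

# Hypothetical threshold; pick whatever suits your connection.
STALL_LIMIT_SECONDS = 120


def longest_download_gap(log_lines, wu):
    """Return the longest pause, in seconds, between progress lines for `wu`."""
    times = []
    for raw in log_lines:
        match = PROGRESS_RE.match(raw.strip())
        if match and match.group(2) == wu:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d:%H:%M:%S"))
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
    return max(gaps, default=0.0)


# Lines copied from the first log posted earlier, just before "Transfer failed".
sample = [
    "2019-11-06:23:48:21:WU02:FS01:Download 4.55%",
    "2019-11-06:23:48:29:WU02:FS01:Download 4.77%",
    "2019-11-06:23:48:35:WU02:FS01:Download 5.23%",
    "2019-11-06:23:51:11:WU02:FS01:Download 5.45%",
]

gap = longest_download_gap(sample, "WU02")
print(gap, gap > STALL_LIMIT_SECONDS)
```

A check like this only reports a stall. What to do about it (pause and unpause the slot, restart the client, or reboot as described earlier in the thread) is left to the operator.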

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Fri Nov 08, 2019 3:47 pm
by bruce
My 25% number came from https://apps.foldingathome.org/serverstats.

There are a lot of servers currently off-line, and I don't know why. (Possibly related to a recent critical upgrade of the server software.)

The essential part of my post is that the problems are understood and are being addressed --- and it takes a lot of "red tape" to get enough signatures to spend as much money as it takes to get a new server(s).

Re: Multiple WU's Fail downld/upld to 155.247.166.*

Posted: Fri Nov 08, 2019 3:57 pm
by Joe_H
Also probably related to servers no longer being operated out of Stanford; the last one was shut off in the last couple of months. That leaves servers at WUSTL, Temple and MSKCC.