Re: WU's not being send to work server 3.*
Posted: Thu Apr 30, 2020 5:50 pm
by Neil-B
Whilst I know how you must feel, and uploading straight away or before Timeout is obviously preferable, the WU is only down the drain once it reaches expiration.
Re: WU's not being send to work server
Posted: Thu Apr 30, 2020 8:11 pm
by CaptainHalon
Neil-B wrote: but if you think it helps then feel free to state your opinions just as others might feel free to state contradictory ones
I think a lot of the frustration on the donor side could have been mitigated by putting simple controls in the client a long time ago. A white list/black list feature for project numbers would alleviate much frustration, and I think it should be something that's allowed. If you consider how much a research team would have to pay for AWS or Azure resources to accomplish the same tasks that the donors allow them to accomplish for free, then it should be a donor's right to omit problem projects that are wasting their hardware resources and electricity.
More often than not when thumbing through the forums, I just see donors getting pushback for complaints and being told they can quit FAH if they don't like it. It's akin to giving a homeless man $100, watching him spend it all on liquor, and then being told "hey buddy, if ya don't like it, don't give me any money." I suppose that's his right, but it's still rather tasteless.
Re: WU's not being send to work server 3.*
Posted: Thu Apr 30, 2020 9:06 pm
by HenrikJolsen
Same problem here
19:52:35:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:16435 run:2891 clone:0 gen:0 core:0x22 unit:0x0000000203854c135e9a4ef77d34b1df
19:52:35:WU00:FS01:Uploading 133.16MiB to 3.133.76.19
19:52:35:WU00:FS01:Connecting to 3.133.76.19:8080
19:52:36:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
19:52:36:WU00:FS01:Trying to send results to collection server
19:52:36:WU00:FS01:Uploading 133.16MiB to 3.21.157.11
19:52:36:WU00:FS01:Connecting to 3.21.157.11:8080
19:52:37:ERROR:WU00:FS01:Exception: Transfer failed
13 attempts already...
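For anyone comparing stuck units across posts, the "Sending unit results" line already carries the useful identifiers (project, run, clone, gen) and the error state. A minimal sketch for pulling them out, assuming the line keeps the space-separated key:value layout shown above; the function name `parse_send_line` is made up for illustration and is not part of the client.

```python
from typing import Dict, Optional

MARKER = "Sending unit results: "

def parse_send_line(line: str) -> Optional[Dict[str, str]]:
    """Return the key:value fields of a 'Sending unit results' log line,
    or None if the line is some other kind of log entry."""
    _, found, rest = line.partition(MARKER)
    if not found:
        return None
    # Each field looks like "project:16435"; split on the first colon only.
    return dict(field.split(":", 1) for field in rest.split())

line = ("19:52:35:WU00:FS01:Sending unit results: id:00 state:SEND "
        "error:FAULTY project:16435 run:2891 clone:0 gen:0 core:0x22 "
        "unit:0x0000000203854c135e9a4ef77d34b1df")

fields = parse_send_line(line)
print(fields["project"], fields["run"], fields["clone"], fields["gen"], fields["error"])
```

Note that this unit reports error:FAULTY, whereas the stuck units posted later in the thread report error:NO_ERROR, so the two cases may not be identical.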
Re: WU's not being send to work server 3.*
Posted: Thu Apr 30, 2020 9:12 pm
by HaloJones
Allowing the donors to decide which units to do leads to cherry-picking and impacts the science. This project isn't run for the sake of the donors and for getting points. When it started there were no points, only units and the only statistics were how many units had been done. Many of the changes in the points system have been to try to prevent the donors from having to do anything other than accept work and do work. Donate the hardware available and let them fold.
Until the start of this CV-19 work, the number of donors and the amount of server hardware and work was pretty equitable. It worked most of the time and the only real recurring problem was the stats server constantly stopping.
Now with a new project, a twentyfold ramp-up in donors, huge publicity from Nvidia, Intel and a bunch of influential YouTubers, the project is getting constant repeat questions on a forum where there are no staffers, only other donors who try to answer and help.
Has this project been a victim of its own sudden success? Yes, of course it has. But arguing with 20/20 hindsight that it should have done x or y without stopping to perhaps ASK why it is the way it is, is perhaps not overly helpful.
It's been going for over a decade and the method of allocating work automatically via the priorities of the scientists has always worked fine. The new CV-19 units were put at the top of the priority list as soon as they were ready to be worked on and everyone would have got that work but oh no, the new donors complained that they couldn't specify that they would only do CV-19. Is there something wrong with curing cancer while waiting for the CV-19 work? Is there something wrong with being allocated the work that the researchers need doing?
We get enough questions about simply adding a GPU slot without having to deal with users asking which units they should blacklist and which ones get the best points. Who are you suggesting will maintain this white/black list? Are you volunteering?
Re: WU's not being send to work server 3.*
Posted: Thu Apr 30, 2020 11:28 pm
by bruce
I understand this has just been corrected.
Re: WU's not being send to work server 3.*
Posted: Fri May 01, 2020 1:02 am
by Epsilon_Process
bruce wrote: I understand this has just been corrected.
Yes, looks like it. Both my stuck work units over 130MiB did finally upload and receive points, although it took many tries before a server connected. I can only imagine there must be a considerable backlog.
Thanks for getting it all sorted out.
Re: WU's not being send to work server 3.*
Posted: Fri May 01, 2020 1:42 am
by schertt
ChrisD5710 wrote: Maybe you should consider supporting work servers with less storage?
The science itself necessitates a large amount of storage space for the data. Fragmenting the infrastructure into even more servers that would then require even more attention than before is a troublesome way to approach the issue. Those types of servers aren't meant to be hosted by the average user; hosting them requires a degree of understanding of both hardware and networking and a level of financial resource that typically comes at the institution level.
Re: WU's not being send to work server 3.*
Posted: Fri May 01, 2020 2:12 am
by lazyacevw
I've been failing with a large result over the last several days as well. No other issues with any of my slots over the past week:
Waiting on: Send Results
Attempts: 25
Assigned: 2020-04-28T21:46:14Z
Expiration: 2020-05-05T21:46:14Z
Bonus: 0
3.133.76.19 was restarted 30 minutes ago.
3.21.157.11 was restarted 3 hours ago.
We will see....
https://apps.foldingathome.org/serverstats
01:12:32:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:16435 run:1479 clone:4 gen:0 core:0x22 unit:0x0000000003854c135e9a4ef9e4ff1a84
01:12:32:WU02:FS01:Uploading 141.53MiB to 3.133.76.19
01:12:32:WU02:FS01:Connecting to 3.133.76.19:8080
01:14:42:WARNING:WU02:FS01:WorkServer connection failed on port 8080 trying 80
01:14:42:WU02:FS01:Connecting to 3.133.76.19:80
01:16:53:WARNING:WU02:FS01:Exception: Failed to send results to work server: Failed to connect to 3.133.76.19:80: Connection timed out
01:16:53:WU02:FS01:Trying to send results to collection server
01:16:53:WU02:FS01:Uploading 141.53MiB to 3.21.157.11
01:16:53:WU02:FS01:Connecting to 3.21.157.11:8080
01:19:05:WARNING:WU02:FS01:WorkServer connection failed on port 8080 trying 80
01:19:05:WU02:FS01:Connecting to 3.21.157.11:80
01:21:16:ERROR:WU02:FS01:Exception: Failed to connect to 3.21.157.11:80: Connection timed out
******************************* Date: 2020-05-01 *******************************
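To put the Assigned and Expiration values above into perspective, here is a quick sketch of how much time such a unit still has. It assumes the log timestamps are UTC and that the 01:21:16 failure falls on 2020-05-01, as the date line in the log suggests; the helper name `parse_utc` is just for illustration.

```python
from datetime import datetime, timezone

def parse_utc(stamp: str) -> datetime:
    """Parse a timestamp like 2020-04-28T21:46:14Z (trailing Z means UTC)."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

assigned = parse_utc("2020-04-28T21:46:14Z")
expiration = parse_utc("2020-05-05T21:46:14Z")

# Assumption: the last failed attempt in the log (01:21:16) is UTC on 2020-05-01.
last_failure = datetime(2020, 5, 1, 1, 21, 16, tzinfo=timezone.utc)

window = expiration - assigned         # total time between assignment and expiration
remaining = expiration - last_failure  # time left when the last attempt failed

print(window)     # 7 days, 0:00:00
print(remaining)  # 4 days, 20:24:58
```

So at the time of that failure there were still several days in which a retry could succeed before the expiration.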
Re: WU's not being send to work server 3.*
Posted: Fri May 01, 2020 3:49 am
by anandhanju
@lazyacevw, Can you try stopping your client entirely and restarting the program to ensure it retries again on startup? Please post an update if it fails.
Re: WU's not being send to work server 3.*
Posted: Fri May 01, 2020 5:49 am
by lazyacevw
anandhanju wrote: @lazyacevw, Can you try stopping your client entirely and restarting the program to ensure it retries again on startup? Please post an update if it fails.
Thanks! I restarted the computer yesterday but it didn't work. I went to try a service restart but as I was about to do so, the upload cleared right before my eyes! Must've taken 30 or so attempts.
01:12:32:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:16435 run:1479 clone:4 gen:0 core:0x22 unit:0x0000000003854c135e9a4ef9e4ff1a84
01:12:32:WU02:FS01:Uploading 141.53MiB to 3.133.76.19
01:12:32:WU02:FS01:Connecting to 3.133.76.19:8080
01:14:42:WARNING:WU02:FS01:WorkServer connection failed on port 8080 trying 80
01:14:42:WU02:FS01:Connecting to 3.133.76.19:80
01:16:53:WARNING:WU02:FS01:Exception: Failed to send results to work server: Failed to connect to 3.133.76.19:80: Connection timed out
01:16:53:WU02:FS01:Trying to send results to collection server
01:16:53:WU02:FS01:Uploading 141.53MiB to 3.21.157.11
01:16:53:WU02:FS01:Connecting to 3.21.157.11:8080
01:19:05:WARNING:WU02:FS01:WorkServer connection failed on port 8080 trying 80
01:19:05:WU02:FS01:Connecting to 3.21.157.11:80
01:21:16:ERROR:WU02:FS01:Exception: Failed to connect to 3.21.157.11:80: Connection timed out
...
04:31:33:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:16435 run:1479 clone:4 gen:0 core:0x22 unit:0x0000000003854c135e9a4ef9e4ff1a84
04:31:33:WU02:FS01:Uploading 141.53MiB to 3.133.76.19
04:31:33:WU02:FS01:Connecting to 3.133.76.19:8080
04:33:42:WARNING:WU02:FS01:WorkServer connection failed on port 8080 trying 80
04:33:42:WU02:FS01:Connecting to 3.133.76.19:80
04:33:46:WU02:FS01:Upload 0.04%
04:35:33:WU02:FS01:Upload 0.13%
04:35:39:WU02:FS01:Upload 0.97%
...
04:45:27:WU02:FS01:Upload 97.59%
04:45:33:WU02:FS01:Upload 98.61%
04:45:39:WU02:FS01:Upload 99.67%
04:45:43:WU02:FS01:Upload complete
04:45:43:WU02:FS01:Server responded WORK_ACK (400)
04:45:43:WU02:FS01:Final credit estimate, 43291.00 points
04:45:43:WU02:FS01:Cleaning up
Weird. The servers must still not be in a happy place.
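Reading the log above, the order of attempts appears to be: work server on port 8080, work server on port 80, then the collection server on the same two ports. The sketch below only mirrors that visible order; it is not the client's actual code, and `upload_targets`, `first_reachable` and the `can_connect` callback are hypothetical names used for illustration.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Endpoint = Tuple[str, int]

def upload_targets(work_server: str, collection_server: str,
                   ports: Iterable[int] = (8080, 80)) -> List[Endpoint]:
    """List (host, port) pairs in the order the log shows them being tried."""
    ports = tuple(ports)
    return [(host, port)
            for host in (work_server, collection_server)
            for port in ports]

def first_reachable(targets: Iterable[Endpoint],
                    can_connect: Callable[[Endpoint], bool]) -> Optional[Endpoint]:
    """Return the first endpoint that accepts a connection, or None if all fail."""
    for target in targets:
        if can_connect(target):
            return target
    return None

targets = upload_targets("3.133.76.19", "3.21.157.11")

# Simulate the 04:31 attempt: port 8080 on the work server failed, port 80 worked.
chosen = first_reachable(targets, lambda t: t == ("3.133.76.19", 80))
print(chosen)
```

In the 01:12 attempt all four endpoints timed out, which in this sketch corresponds to `first_reachable` returning None and the unit staying queued for a later retry.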
Re: WU's not being send to work server 3.*
Posted: Fri May 01, 2020 11:37 am
by hnapel
My WU, for which I started this topic, eventually got uploaded within the deadline. I'm not sure whether it was simply overload or whether the server had some other issue, but either way, patience helps.
Re: WU's not being send to work server 3.*
Posted: Fri May 01, 2020 12:06 pm
by Neil-B
From a post in another thread - and reading between the lines - the cause of the issues was identified and a solution put in place.
Really glad they got the server accepting WUs before yours expired.
Re: WU's not being send to work server 3.*
Posted: Fri May 01, 2020 2:39 pm
by jrweiss
Finally uploading after restarting client this morning. Very slow start, though, then accepting at ~4-5Mbps:
14:29:01:WARNING:WU02:FS01:WorkServer connection failed on port 8080 trying 80
14:29:01:WU02:FS01:Connecting to 3.133.76.19:80
14:29:22:WARNING:WU02:FS01:Exception: Failed to send results to work server: Failed to connect to 3.133.76.19:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
14:29:22:WU02:FS01:Trying to send results to collection server
14:29:22:WU02:FS01:Uploading 140.41MiB to 3.21.157.11
14:29:22:WU02:FS01:Connecting to 3.21.157.11:8080
14:29:32:WU02:FS01:Upload 0.09%
14:33:08:WU02:FS01:Upload 0.13%
14:33:14:WU02:FS01:Upload 2.23%
14:33:20:WU02:FS01:Upload 3.96%
14:33:26:WU02:FS01:Upload 6.37%
14:33:32:WU02:FS01:Upload 8.59%
14:33:38:WU02:FS01:Upload 10.51%
14:33:44:WU02:FS01:Upload 12.06%
14:33:50:WU02:FS01:Upload 14.16%
14:33:56:WU02:FS01:Upload 15.22%
14:34:02:WU02:FS01:Upload 17.23%
14:34:08:WU02:FS01:Upload 19.23%
14:34:14:WU02:FS01:Upload 21.59%
14:34:20:WU02:FS01:Upload 23.81%
14:34:26:WU02:FS01:Upload 25.64%
14:34:32:WU02:FS01:Upload 27.42%
14:34:38:WU02:FS01:Upload 29.69%
14:34:44:WU02:FS01:Upload 31.52%
14:34:50:WU02:FS01:Upload 33.30%
14:34:56:WU02:FS01:Upload 35.12%
14:35:02:WU02:FS01:Upload 36.68%
14:35:08:WU02:FS01:Upload 38.73%
14:35:14:WU02:FS01:Upload 40.55%
14:35:20:WU02:FS01:Upload 42.06%
14:35:26:WU02:FS01:Upload 44.34%
14:35:32:WU03:FS00:0xa7:Completed 172500 out of 250000 steps (69%)
14:35:32:WU02:FS01:Upload 45.76%
14:35:38:WU02:FS01:Upload 47.81%
14:35:44:WU02:FS01:Upload 50.17%
14:35:50:WU02:FS01:Upload 51.55%
14:35:56:WU02:FS01:Upload 53.68%
14:36:02:WU02:FS01:Upload 56.00%
14:36:08:WU02:FS01:Upload 58.13%
14:36:14:WU02:FS01:Upload 60.00%
14:36:20:WU02:FS01:Upload 62.18%
14:36:26:WU02:FS01:Upload 64.32%
14:36:32:WU02:FS01:Upload 66.59%
14:36:38:WU02:FS01:Upload 67.66%
14:36:44:WU02:FS01:Upload 69.62%
14:36:50:WU02:FS01:Upload 72.16%
14:36:56:WU02:FS01:Upload 74.34%
14:37:02:WU02:FS01:Upload 76.52%
14:37:08:WU02:FS01:Upload 79.01%
14:37:14:WU02:FS01:Upload 80.93%
14:37:20:WU02:FS01:Upload 82.53%
14:37:26:WU02:FS01:Upload 84.40%
14:37:32:WU02:FS01:Upload 86.36%
14:37:38:WU02:FS01:Upload 88.45%
14:37:44:WU02:FS01:Upload 89.65%
14:37:50:WU02:FS01:Upload 91.96%
14:37:56:WU02:FS01:Upload 93.74%
14:38:02:WU02:FS01:Upload 95.84%
14:38:08:WU02:FS01:Upload 97.57%
14:38:14:WU02:FS01:Upload 99.98%
14:38:16:WU02:FS01:Upload complete
14:38:16:WU02:FS01:Server responded WORK_ACK (400)
14:38:16:WU02:FS01:Final credit estimate, 51387.00 points
14:38:16:WU02:FS01:Cleaning up
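The average rate can be estimated straight from the "Upload N%" lines and the reported size. A rough sketch, assuming the size in the "Uploading 140.41MiB" line is the full payload and that all timestamps fall on the same day (no midnight rollover handling); the function names are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as "14:33:08:WU02:FS01:Upload 0.13%"
PROGRESS = re.compile(r"^(\d{2}:\d{2}:\d{2}):WU\d+:FS\d+:Upload (\d+(?:\.\d+)?)%$")

def progress_points(log_lines):
    """Extract (time, percent) pairs from upload progress lines."""
    points = []
    for line in log_lines:
        match = PROGRESS.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%H:%M:%S")
            points.append((when, float(match.group(2))))
    return points

def average_mbps(size_mib, start, end):
    """Average rate in megabits per second between two progress points."""
    (t0, p0), (t1, p1) = start, end
    seconds = (t1 - t0).total_seconds()
    bits = size_mib * 1024 * 1024 * 8 * (p1 - p0) / 100
    return bits / seconds / 1e6

sample = [
    "14:29:32:WU02:FS01:Upload 0.09%",
    "14:33:08:WU02:FS01:Upload 0.13%",
    "14:33:14:WU02:FS01:Upload 2.23%",
    "14:38:14:WU02:FS01:Upload 99.98%",
]
points = progress_points(sample)

# Skip the slow start: measure from 0.13% (14:33:08) to 99.98% (14:38:14).
rate = average_mbps(140.41, points[1], points[-1])
print(f"{rate:.2f} Mbit/s")
```

Measured that way, the bulk of the transfer averages roughly 3.8 Mbit/s, which is in the same ballpark as the figure quoted in the post.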
3.133.76.19 (aws1.foldingathome.org)
Posted: Sun May 03, 2020 4:16 am
by tbonse
This server has been up for less than 40 minutes and is already having problems again.
3.133.76.19 aws1.foldingathome.org WS 9.6.8 joseph 10,764.00/hr 0 0 Yes Assign 18,116 7,896 OPENMM_22, GRO_A7 6.72TiB 37 minutes 2020-05-03T04:09:54Z
04:07:02:WU01:FS01:Uploading 78.05MiB to 3.133.76.19
04:07:02:WU01:FS01:Connecting to 3.133.76.19:8080
04:08:42:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
04:08:42:WU00:FS01:Connecting to 3.133.76.19:80
04:09:09:WARNING:WU01:FS01:WorkServer connection failed on port 8080 trying 80
04:09:09:WU01:FS01:Connecting to 3.133.76.19:80
04:10:49:ERROR:WU00:FS01:Exception: Failed to connect to 3.133.76.19:80: Connection timed out
04:10:49:WU00:FS01:Connecting to 65.254.110.245:80
04:10:50:WU00:FS01:Assigned to work server 128.252.203.10
04:10:50:WU00:FS01:Requesting new work unit for slot 01: READY gpu:0:GP107GL [Quadro P1000] from 128.252.203.10
04:10:50:WU00:FS01:Connecting to 128.252.203.10:8080
04:11:16:WARNING:WU01:FS01:Exception: Failed to send results to work server: Failed to connect to 3.133.76.19:80: Connection timed out
04:11:16:WU01:FS01:Trying to send results to collection server
Re: 3.133.76.19 (aws1.foldingathome.org)
Posted: Sun May 03, 2020 7:22 am
by foldy
Stats say it is online but I also cannot connect to
http://aws1.foldingathome.org/