Re: 128.252.203.10 problem or WU?
Posted: Sat May 02, 2020 7:23 pm
by PantherX
Neil-B wrote:...I have never worked out if CS is set when project is created, when WU is issued, or continually when infrastructure is adjusted...
The CS is set on a Project level and is optional. If the project started with no CS and then it was later added, it will only take effect on the new WUs, not the old ones. That's my understanding based on observation which may or may not have changed given the various tweaks and optimizations done to the infrastructure over the last month or so.
Re: 128.252.203.10 problem or WU?
Posted: Sat May 02, 2020 7:27 pm
by Neil-B
Ta for that ... makes sense from a number of things I've seen
Re: 128.252.203.10 problem or WU?
Posted: Sun May 03, 2020 4:24 am
by esfishox
I saw my WU uploaded after nearly three days of mostly trying.
https://apps.foldingathome.org/wu#proje ... 163&gen=59
Re: 128.252.203.10 problem or WU?
Posted: Mon May 04, 2020 7:58 pm
by GDF
The server seems to be rebooting every 20 minutes or so. It often goes 5-10 minutes without updating the last contact timestamp. It has been doing so since I started watching it two days ago. That doesn't sound normal.
Re: 128.252.203.10 problem or WU?
Posted: Tue May 05, 2020 4:04 am
by level6
Hello. We are a new team that has been going for 2 weeks and 2 days.
I am having the same problem with this troubled server for more than 1 day:
Code:
02:30:40:WU03:FS00:0xa7:Completed 80000 out of 250000 steps (32%)
02:32:40:WU00:FS01:0x22:Completed 390000 out of 1000000 steps (39%)
02:32:50:WU01:FS01:Upload 1.63%
02:32:56:WU01:FS01:Upload 7.05%
02:33:03:WU01:FS01:Upload 11.92%
02:33:03:WARNING:WU01:FS01:Exception: Failed to send results to work server: Transfer failed
02:33:03:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:11760 run:0 clone:2274 gen:19 core:0x22 unit:0x0000002680fccb0a5e6d7ce977531da8
02:33:04:WU01:FS01:Uploading 23.06MiB to 128.252.203.10
02:33:04:WU01:FS01:Connecting to 128.252.203.10:8080
02:33:10:WU01:FS01:Upload 6.23%
02:33:16:WU01:FS01:Upload 13.55%
02:33:22:WU01:FS01:Upload 20.60%
02:33:28:WU01:FS01:Upload 27.91%
02:33:29:WARNING:WU01:FS01:Exception: Failed to send results to work server: Transfer failed
02:34:23:WU03:FS00:0xa7:Completed 82500 out of 250000 steps (33%)
02:37:19:WU00:FS01:0x22:Completed 400000 out of 1000000 steps (40%)
02:38:25:WU03:FS00:0xa7:Completed 85000 out of 250000 steps (34%)
02:39:55:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:11760 run:0 clone:2274 gen:19 core:0x22 unit:0x0000002680fccb0a5e6d7ce977531da8
02:39:55:WU01:FS01:Uploading 23.06MiB to 128.252.203.10
02:39:55:WU01:FS01:Connecting to 128.252.203.10:8080
02:40:16:WARNING:WU01:FS01:WorkServer connection failed on port 8080 trying 80
02:40:16:WU01:FS01:Connecting to 128.252.203.10:80
02:40:19:WARNING:WU01:FS01:Exception: Failed to send results to work server: Failed to connect to 128.252.203.10:80: No connection could be made because the target machine actively refused it.
I see from https://apps.foldingathome.org/serverstats that 155.247.164.213 runs the same version and works on the same project types. Is there any way to force F@H to use another server? Would that even work? Do they only expect results that were assigned? I could set up a proxy to try to force it, maybe?
Re: 128.252.203.10 problem or WU?
Posted: Tue May 05, 2020 4:33 am
by bruce
I'm frustrated, too. I've repeatedly reported problems with that server to the people who can fix it. They get it running and before long, it fails again.
Re: 128.252.203.10 problem or WU?
Posted: Tue May 05, 2020 5:47 am
by level6
I was working on trying to set up a web proxy, thinking I was going to lose this WU anyway, and that required a reboot. It failed once more, and then:
Code:
05:37:19:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:11760 run:0 clone:2274 gen:19 core:0x22 unit:0x0000002680fccb0a5e6d7ce977531da8
05:37:19:WU01:FS01:Uploading 23.06MiB to 128.252.203.10
05:37:19:WU01:FS01:Connecting to 128.252.203.10:8080
05:37:20:WU03:FS00:0xa7:Completed 200000 out of 250000 steps (80%)
05:37:25:WU01:FS01:Upload 7.59%
05:37:31:WU01:FS01:Upload 14.63%
05:37:37:WU01:FS01:Upload 20.87%
05:37:43:WU01:FS01:Upload 27.91%
05:37:49:WU01:FS01:Upload 34.96%
05:37:55:WU01:FS01:Upload 42.00%
05:38:01:WU01:FS01:Upload 49.05%
05:38:07:WU01:FS01:Upload 56.10%
05:38:13:WU01:FS01:Upload 62.87%
05:38:19:WU01:FS01:Upload 66.93%
05:38:25:WU01:FS01:Upload 71.54%
05:38:31:WU01:FS01:Upload 78.59%
05:38:37:WU01:FS01:Upload 85.63%
05:38:43:WU01:FS01:Upload 92.68%
05:38:49:WU01:FS01:Upload 99.72%
05:38:55:WU01:FS01:Upload complete
05:38:55:WU01:FS01:Server responded WORK_ACK (400)
05:38:55:WU01:FS01:Final credit estimate, 27322.00 points
05:38:55:WU01:FS01:Cleaning up
This happens to be a Windows 10 box running FAH 7.6.9, btw. Man, those points are pitiful, but it's arbitrary. It's the WUs that count, right.
We're comin' for ya, Corona.
Re: 128.252.203.10 problem or WU?
Posted: Tue May 05, 2020 5:56 am
by level6
bruce wrote: I'm frustrated, too. I've repeatedly reported problems with that server to the people who can fix it. They get it running and before long, it fails again.
It was probably your reporting that did it. I see a 16-minute uptime on it, now. Thanks, man!
These guys need some real CS help. I'm sure it's easier to make that judgement never having seen the complexities involved behind that curtain. Still,... I hope they have people dedicated to this sort of thing, and it's not stealing from the bio-physicists' time. I've seen the crazy specs required to run a server. They aren't normal machines, for sure.
Re: 128.252.203.10 problem or WU?
Posted: Tue May 05, 2020 3:30 pm
by Oussebon
One of our machines is still wrestling with sending a WU to this server. Mostly it says Transfer failed, but will periodically start uploading, then get interrupted.
For instance:
Code:
*********************** Log Started 2020-05-05T14:55:53Z ***********************
14:55:53:WU00:FS02:Sending unit results: id:00 state:SEND error:NO_ERROR project:11764 run:0 clone:5195 gen:51 core:0x22 unit:0x0000005d80fccb0a5e71130f4744690a
14:55:53:WU00:FS02:Uploading 55.24MiB to 128.252.203.10
14:55:53:WU00:FS02:Connecting to 128.252.203.10:8080
14:55:59:WU00:FS02:Upload 0.45%
14:56:46:WU00:FS02:Upload 0.57%
14:56:53:WU00:FS02:Upload 0.79%
14:56:59:WU00:FS02:Upload 1.24%
14:57:05:WU00:FS02:Upload 1.81%
14:57:12:WU00:FS02:Upload 2.26%
14:57:18:WU00:FS02:Upload 2.38%
14:57:27:WU00:FS02:Upload 2.49%
14:57:33:WU00:FS02:Upload 2.83%
14:57:39:WU00:FS02:Upload 3.28%
14:57:47:WU00:FS02:Upload 3.62%
14:57:53:WU00:FS02:Upload 3.96%
14:58:07:WU00:FS02:Upload 4.07%
14:58:13:WU00:FS02:Upload 4.75%
14:58:20:WU00:FS02:Upload 5.32%
14:58:26:WU00:FS02:Upload 6.00%
14:58:32:WU00:FS02:Upload 6.68%
14:58:39:WU00:FS02:Upload 7.24%
14:58:45:WU00:FS02:Upload 7.81%
14:58:54:WU00:FS02:Upload 8.37%
14:59:00:WU00:FS02:Upload 8.71%
14:59:06:WU00:FS02:Upload 9.28%
14:59:12:WU00:FS02:Upload 9.96%
14:59:21:WU00:FS02:Upload 10.41%
14:59:29:WU00:FS02:Upload 11.09%
14:59:35:WU00:FS02:Upload 11.77%
14:59:43:WU00:FS02:Upload 12.45%
14:59:50:WU00:FS02:Upload 12.56%
14:59:56:WU00:FS02:Upload 12.78%
15:00:02:WU00:FS02:Upload 13.01%
15:00:08:WU00:FS02:Upload 13.35%
15:00:14:WU00:FS02:Upload 13.69%
15:00:20:WU00:FS02:Upload 14.26%
15:00:26:WU00:FS02:Upload 14.82%
15:00:32:WU00:FS02:Upload 15.50%
15:00:38:WU00:FS02:Upload 16.07%
15:00:44:WU00:FS02:Upload 16.63%
15:00:51:WU00:FS02:Upload 17.20%
15:00:57:WU00:FS02:Upload 17.99%
15:01:04:WU00:FS02:Upload 18.55%
15:01:10:WU00:FS02:Upload 19.23%
15:01:17:WU00:FS02:Upload 19.69%
15:01:23:WU00:FS02:Upload 20.14%
15:01:29:WU00:FS02:Upload 20.93%
15:01:36:WU00:FS02:Upload 21.38%
15:01:42:WU00:FS02:Upload 21.84%
15:01:48:WU00:FS02:Upload 22.74%
15:01:58:WU00:FS02:Upload 23.53%
15:02:04:WU00:FS02:Upload 24.44%
15:02:10:WU00:FS02:Upload 25.34%
15:02:16:WU00:FS02:Upload 26.25%
15:02:22:WU00:FS02:Upload 27.27%
15:02:28:WU00:FS02:Upload 28.28%
15:02:34:WU00:FS02:Upload 29.19%
15:02:47:WU00:FS02:Upload 29.87%
15:02:53:WU00:FS02:Upload 30.32%
15:03:01:WU00:FS02:Upload 31.11%
15:03:07:WU00:FS02:Upload 32.13%
15:03:13:WU00:FS02:Upload 33.04%
15:04:50:WU00:FS02:Upload 33.83%
15:04:50:WARNING:WU00:FS02:Exception: Failed to send results to work server: Transfer failed
15:04:51:WU00:FS02:Sending unit results: id:00 state:SEND error:NO_ERROR project:11764 run:0 clone:5195 gen:51 core:0x22 unit:0x0000005d80fccb0a5e71130f4744690a
15:04:51:WU00:FS02:Uploading 55.24MiB to 128.252.203.10
15:04:51:WU00:FS02:Connecting to 128.252.203.10:8080
15:04:54:WARNING:WU00:FS02:WorkServer connection failed on port 8080 trying 80
15:04:54:WU00:FS02:Connecting to 128.252.203.10:80
15:04:58:WARNING:WU00:FS02:Exception: Failed to send results to work server: Failed to connect to 128.252.203.10:80: No connection could be made because the target machine actively refused it.
15:06:28:WU00:FS02:Sending unit results: id:00 state:SEND error:NO_ERROR project:11764 run:0 clone:5195 gen:51 core:0x22 unit:0x0000005d80fccb0a5e71130f4744690a
15:06:28:WU00:FS02:Uploading 55.24MiB to 128.252.203.10
15:06:28:WU00:FS02:Connecting to 128.252.203.10:8080
15:06:30:WARNING:WU00:FS02:WorkServer connection failed on port 8080 trying 80
15:06:30:WU00:FS02:Connecting to 128.252.203.10:80
15:06:33:WARNING:WU00:FS02:Exception: Failed to send results to work server: Failed to connect to 128.252.203.10:80: No connection could be made because the target machine actively refused it.
15:09:05:WU00:FS02:Sending unit results: id:00 state:SEND error:NO_ERROR project:11764 run:0 clone:5195 gen:51 core:0x22 unit:0x0000005d80fccb0a5e71130f4744690a
15:09:05:WU00:FS02:Uploading 55.24MiB to 128.252.203.10
15:09:05:WU00:FS02:Connecting to 128.252.203.10:8080
Code:
*********************** Log Started 2020-05-05T15:19:53Z ***********************
15:19:53:WU00:FS02:Sending unit results: id:00 state:SEND error:NO_ERROR project:11764 run:0 clone:5195 gen:51 core:0x22 unit:0x0000005d80fccb0a5e71130f4744690a
15:19:53:WU00:FS02:Uploading 55.24MiB to 128.252.203.10
15:19:53:WU00:FS02:Connecting to 128.252.203.10:8080
15:20:08:WU00:FS02:Upload 0.23%
15:20:08:WARNING:WU00:FS02:Exception: Failed to send results to work server: Transfer failed
15:20:09:WU00:FS02:Sending unit results: id:00 state:SEND error:NO_ERROR project:11764 run:0 clone:5195 gen:51 core:0x22 unit:0x0000005d80fccb0a5e71130f4744690a
15:20:09:WU00:FS02:Uploading 55.24MiB to 128.252.203.10
15:20:09:WU00:FS02:Connecting to 128.252.203.10:8080
15:20:24:WU00:FS02:Upload 0.11%
15:20:56:WU00:FS02:Upload 0.23%
15:20:56:WARNING:WU00:FS02:Exception: Failed to send results to work server: Transfer failed
15:21:09:WU00:FS02:Sending unit results: id:00 state:SEND error:NO_ERROR project:11764 run:0 clone:5195 gen:51 core:0x22 unit:0x0000005d80fccb0a5e71130f4744690a
15:21:09:WU00:FS02:Uploading 55.24MiB to 128.252.203.10
15:21:09:WU00:FS02:Connecting to 128.252.203.10:8080
15:22:19:WU00:FS02:Upload 0.23%
15:22:19:WARNING:WU00:FS02:Exception: Failed to send results to work server: Transfer failed
15:22:46:WU00:FS02:Sending unit results: id:00 state:SEND error:NO_ERROR project:11764 run:0 clone:5195 gen:51 core:0x22 unit:0x0000005d80fccb0a5e71130f4744690a
15:22:46:WU00:FS02:Uploading 55.24MiB to 128.252.203.10
15:22:46:WU00:FS02:Connecting to 128.252.203.10:8080
15:23:07:WARNING:WU00:FS02:WorkServer connection failed on port 8080 trying 80
15:23:07:WU00:FS02:Connecting to 128.252.203.10:80
15:23:28:WARNING:WU00:FS02:Exception: Failed to send results to work server: Failed to connect to 128.252.203.10:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
It's been like this for several days now, with the PC and FAHclient left running throughout the night and for most of the daytime too each day. Restarting the PC and/or FAH client seems to make no difference. The server will apparently start taking the upload and then stop, usually after 0.23%, very occasionally - as above - after a lot more. I really thought today would be the day it would actually take the WU, but it failed at just under 34%. I saw the server was restarted recently, but that was before the attempts in the logs above.
Anything to be done?
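Editor's sketch: logs like the ones above are easier to compare when each upload attempt is reduced to one row showing how far it got and how it ended. The Python below is a minimal sketch that assumes the line layout seen in these excerpts (`HH:MM:SS:WUxx:FSxx:...`); the function name `summarize_uploads` and the record fields are hypothetical and not part of FAHClient.

```python
import re

PREFIX = r"^\d\d:\d\d:\d\d:"
# "...:Uploading 55.24MiB to 128.252.203.10" marks the start of an attempt.
START_RE = re.compile(PREFIX + r"(WU\d+:FS\d+):Uploading \S+ to (\S+)$")
# "...:Upload 33.83%" reports progress within the current attempt.
PROGRESS_RE = re.compile(PREFIX + r"(WU\d+:FS\d+):Upload (\d+(?:\.\d+)?)%$")
# "...:Upload complete" ends an attempt successfully.
DONE_RE = re.compile(PREFIX + r"(WU\d+:FS\d+):Upload complete$")
# "...:WARNING:...:Exception: Failed to send results ..." ends an attempt in failure.
FAIL_RE = re.compile(PREFIX + r"WARNING:(WU\d+:FS\d+):Exception: Failed to send results")


def summarize_uploads(lines):
    """Return one dict per upload attempt, in the order the attempts appear."""
    attempts = []      # every attempt seen so far
    open_attempt = {}  # "WUxx:FSxx" -> dict of the attempt still in progress

    def current(unit):
        # Create a record on demand so excerpts that start mid-upload still work.
        if unit not in open_attempt:
            record = {"unit": unit, "server": "unknown",
                      "max_percent": 0.0, "outcome": "unfinished"}
            attempts.append(record)
            open_attempt[unit] = record
        return open_attempt[unit]

    for raw in lines:
        line = raw.strip()
        match = START_RE.match(line)
        if match:
            open_attempt.pop(match.group(1), None)  # a new attempt begins
            current(match.group(1))["server"] = match.group(2)
            continue
        match = PROGRESS_RE.match(line)
        if match:
            record = current(match.group(1))
            record["max_percent"] = max(record["max_percent"], float(match.group(2)))
            continue
        match = DONE_RE.match(line)
        if match:
            record = current(match.group(1))
            record["max_percent"] = 100.0
            record["outcome"] = "complete"
            open_attempt.pop(match.group(1), None)
            continue
        match = FAIL_RE.match(line)
        if match:
            current(match.group(1))["outcome"] = "failed"
            open_attempt.pop(match.group(1), None)
    return attempts
```

Feeding it the lines of a log file (for example `summarize_uploads(open("log.txt"))`, where the file name is only a placeholder) gives a quick view of whether attempts mostly die near 0% or occasionally get much further.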
Re: 128.252.203.10 problem or WU?
Posted: Wed May 06, 2020 3:08 am
by GDF
This is only anecdotal, worked for me, and might have been complete coincidence. I paused the slot with the problem, waited for the server to reboot (which you can see on the serverstats page by watching uptime roll back to zero), then restarted the slot. The upload went right through.
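Editor's sketch: the "wait for uptime to roll back to zero" step can be written as a small check over uptime readings noted from the serverstats page. This is only an illustration of the logic; the function names and the 300-second `fresh_window` are assumptions, the readings are taken to be in seconds, and nothing here talks to the server or to FAHClient.

```python
def reboot_indices(uptimes):
    """Indices where uptime dropped versus the previous reading (a reboot in between)."""
    return [i for i in range(1, len(uptimes)) if uptimes[i] < uptimes[i - 1]]


def should_resume(previous_uptime, current_uptime, fresh_window=300):
    """True when the server has just rebooted and is still within fresh_window seconds."""
    rebooted = current_uptime < previous_uptime
    return rebooted and current_uptime <= fresh_window


# Readings taken a few minutes apart; the drop from 1200 to 60 marks a reboot.
readings = [600, 900, 1200, 60, 360]
print(reboot_indices(readings))
print(should_resume(1200, 60))
```

Whether unpausing the slot right after a reboot actually helps is, as stated above, anecdotal.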
It's annoying that there have been days of reports about this and no real response. But I get that there are a lot of moving parts and a federated (or loosely collaborative?) management structure. It would just be nice to know that the problem has been formally reported to someone who can address it.
Re: 128.252.203.10 problem or WU?
Posted: Wed May 06, 2020 3:45 am
by anandhanju
Thanks for your reports. The necessary folks have been notified and they will be looking into this.
Re: 128.252.203.10 problem or WU?
Posted: Wed May 06, 2020 4:16 am
by PantherX
GDF wrote:...It's annoying that there have been days of reports about this and no real response. But I get that there are a lot of moving parts and a federated (or loosely collaborative?) management structure. It would just be nice to know that the problem has been formally reported to someone who can address it.
Welcome to the F@H Forum GDF,
I do understand your POV, and the problem negatively impacts all involved, the researchers and the donors. However, considering that there are multiple labs involved (https://foldingathome.org/about/the-fol ... onsortium/) across the globe in various countries dealing with various lock-down policies, even on a "good" day, it would take a bit of time. In a pandemic situation, it is a lot harder, but no one has given up; instead, they have doubled down and are working on improving various aspects to ensure that it is fixed. Sometimes, labs will have to involve their internal IT department, which can also add to the delay if it is a university infrastructure limitation like internet or electricity.
Re: 128.252.203.10 problem or WU?
Posted: Wed May 06, 2020 4:53 am
by level6
There will always be trouble, somewhere, though. We just need a better way of redirecting the work, so we aren't feeling like our electricity bill was not worth it (I personally don't care, but many others might). Especially when we have enough smarts to determine there is a problematic server, even giving us a way to do this on our end would be a great advancement. Heck, the more complicated and challenging the better for some of us... IF it's possible.
I have seen other gentle suggestions like this go unanswered, as I've lurked around the last couple of weeks. Is it that it's an ignorant concept that we will learn better about wishing for, as we learn more of the details? Or, is it that no one knows whether it is possible? If the latter, then there is hope and we can play.
Is that communication locked into place, once the job begins? If A assigned to me, I'm reporting to B, but B breaks, is there no possibility of a C that can accept the work and pass it on to B later to aggregate? Assuming B and C are similar in every important way (if so, what are those ways?... same projects, same arch job types, same version of F@H?) Is B the only machine who will ever accept the final data for this job? Or, can any similar server accept it, and it's just a matter of luck and there not yet being a mechanism in place to send it to C?
Re: 128.252.203.10 problem or WU?
Posted: Wed May 06, 2020 4:59 am
by level6
And, if our client knows where B is... then it must have saved that in a file, somewhere, right? Could a file be changed to replace B with C? That seems too simple to work. There are no plain strings of my collection server's IP address in these files. Is it maybe encoded in that client.db sqlite DB?
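Editor's sketch: an IP address need not be stored as a dotted string, which may be why a plain-text search finds nothing. Written as hex, 128.252.203.10 is `80fccb0a`, and that sequence appears inside the `unit:` IDs in the logs earlier in this thread. Whether the client actually takes the server address from there, or what client.db contains, is not known here; the helper below is a hypothetical, generic byte search and does not read or modify any F@H file.

```python
import ipaddress


def ip_to_hex(ip):
    """Return the IPv4 address as 8 lowercase hex digits, e.g. for substring searches."""
    return format(int(ipaddress.IPv4Address(ip)), "08x")


def find_ip_encodings(data, ip):
    """Return the first offset of each encoding of ip in data (bytes), or -1 if absent."""
    addr = ipaddress.IPv4Address(ip)
    needles = {
        "dotted": ip.encode("ascii"),                # b"128.252.203.10"
        "hex_text": ip_to_hex(ip).encode("ascii"),   # b"80fccb0a"
        "packed": addr.packed,                       # 4 raw bytes, network order
    }
    return {name: data.find(needle) for name, needle in needles.items()}


unit_id = "0x0000002680fccb0a5e6d7ce977531da8"  # copied from the log above
print(ip_to_hex("128.252.203.10") in unit_id)
```

To try it on a file, read the file in binary mode and pass the bytes to `find_ip_encodings`; a hit only shows that the bytes occur, not that the client would honour an edited value.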
Re: 128.252.203.10 problem or WU?
Posted: Wed May 06, 2020 5:18 am
by PantherX
level6 wrote:There will always be trouble, somewhere, though. We just need a better way of redirecting the work, so we aren't feeling like our electricity bill was not worth it (I personally don't care, but many others might). Especially when we have enough smarts to determine there is a problematic server, even giving us a way to do this on our end would be a great advancement. Heck, the more complicated and challenging the better for some of us... IF it's possible...
Work has happened over the last few weeks where multiple WS (Work Servers) were spawned, and some included cloud services too. There is still more to come.
level6 wrote:...I have seen other gentle suggestions like this go unanswered, as I've lurked around the last couple of weeks. Is it that it's an ignorant concept that we will learn better about wishing for, as we learn more of the details? Or, is it that no one knows whether it is possible? If the latter, then there is hope and we can play.
...
Sorry, I am not following you 100%. There has been engagement between the F@H Team and other parties across the Forum, Email, Twitter, Discord, etc. AFAIK, troubleshooting details were asked for and donors responded. Troubleshooting issues in production can be challenging, especially when new features have been deployed and rolling forward is the only way.
level6 wrote:...Is that communication locked into place, once the job begins? If A assigned to me, I'm reporting to B, but B breaks, is there no possibility of a C that can accept the work and pass it on to B later to aggregate? Assuming B and C are similar in every important way (if so, what are those ways?... same projects, same arch job types, same version of F@H?) Is B the only machine who will ever accept the final data for this job? Or, can any similar server accept it, and it's just a matter of luck and there not yet being a mechanism in place to send it to C?
The downloaded WU can only be uploaded to the WS that it came from. Historically, there was the CS (Collection Server), which was optional and up to the researcher to configure or not. A CS only collects WUs. The reason only the originating WS can accept the WU is that WUs are sequential: once one is uploaded, the next one in the sequence is generated. Thus, WUs from a particular WS have to be returned to it. If a WU does go to the CS, the WS has to "pull" it back and then process it to generate the next one in the sequence. If you would like to know a bit more about this, please read this topic, as it provides an overview of the various servers at play and their roles: viewtopic.php?f=18&t=17794
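Editor's sketch: the sequencing described above can be pictured with a toy model in which a WS only creates generation N+1 of a trajectory after the result for generation N comes back, while a CS merely holds results until the WS pulls them. This is an illustration only, not F@H server code; the class and method names are hypothetical, and the project/run/clone/gen numbers are borrowed from the logs earlier in the thread.

```python
class CollectionServer:
    """Toy CS: stores returned results and hands them over when asked."""

    def __init__(self):
        self.held = []

    def accept(self, trajectory, gen):
        self.held.append((trajectory, gen))

    def drain(self):
        held, self.held = self.held, []
        return held


class WorkServer:
    """Toy WS: tracks which gen of each trajectory is currently outstanding."""

    def __init__(self, outstanding):
        self.outstanding = dict(outstanding)  # (project, run, clone) -> gen

    def assign(self, trajectory):
        # Until the outstanding gen is returned, there is nothing newer to hand out.
        return self.outstanding[trajectory]

    def accept(self, trajectory, gen):
        if self.outstanding.get(trajectory) != gen:
            raise ValueError("result does not match the outstanding gen")
        self.outstanding[trajectory] = gen + 1  # the next WU in the sequence
        return gen + 1

    def pull_from(self, collection_server):
        # Results parked on a CS only advance the sequence once the WS pulls them.
        for trajectory, gen in collection_server.drain():
            self.accept(trajectory, gen)


trajectory = (11760, 0, 2274)          # project, run, clone
ws = WorkServer({trajectory: 19})
cs = CollectionServer()
cs.accept(trajectory, 19)              # donor uploads to the CS instead
print(ws.assign(trajectory))           # still the old gen until the WS pulls
ws.pull_from(cs)
print(ws.assign(trajectory))
```

In this picture a second, unrelated WS has no record of the trajectory, so it has nothing to attach the result to; that is the intuition behind results having to return to the WS that issued them.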