Re: Project 6318: Collection server misconfigured?

Posted: Thu Feb 04, 2010 5:09 pm
by bruce
Server 171.64.65.60 was down for most of the night (Stanford time). It's just coming back now and it will take a while to accept the backlog.

At the time you uploaded, the Collection Server did not have a record of the WU, but I suspect that the Work Server does have that information, so everything should straighten itself out in a while. Please be patient.

Re: Project 6318: Collection server misconfigured?

Posted: Thu Feb 04, 2010 5:43 pm
by Pette Broad
bruce wrote:Server 171.64.65.60 was down for most of the night (Stanford time). It's just coming back now and it will take a while to accept the backlog.
Yes, just sent one back, 34 more to go :)

EDIT: It looks like I just got lucky with that one unit. Despite multiple attempts every six hours and on completion of each unit, no further units have uploaded. I'm now up to 37 waiting; at least I have no more to complete. FAHSpy is overloaded with red error messages. :D

Pete

Re: Project 6318: Collection server misconfigured?

Posted: Thu Feb 04, 2010 11:55 pm
by whiteb
Is the work server still down?

[23:46:48] Project: 6316 (Run 430, Clone 12, Gen 24)
[23:46:48]
[23:46:49] - Couldn't send HTTP request to server
[23:46:49] + Could not connect to Work Server (results)
[23:46:49] (171.64.65.60:8080)
[23:46:49] + Retrying using alternative port
[23:46:49] Assembly optimizations on if available.
[23:46:49] Entering M.D.
[23:46:50] - Couldn't send HTTP request to server
[23:46:50] + Could not connect to Work Server (results)
[23:46:50] (171.64.65.60:80)
[23:46:50] - Error: Could not transmit unit 04 (completed February 4) to work server.


[23:46:50] + Attempting to send results [February 4 23:46:50 UTC]
[23:47:09] (Starting from checkpoint)
[23:47:09] Protein: p6316_sh3_with_ALA_frags
[23:47:09]
[23:47:09] Writing local files
[23:47:09] Completed 110000 out of 500000 steps (22%)
[23:47:09] Extra SSE boost OK.
[23:49:42] - Server does not have record of this unit. Will try again later.
[23:49:42] Could not transmit unit 04 to Collection server; keeping in queue.

Re: Project 6318: Collection server misconfigured?

Posted: Fri Feb 05, 2010 1:32 am
by bruce
whiteb wrote:Is the work server still down ?
The answer to that question can be found on the "Server Status" link at the top of the page.

In the log you posted, you'll see that it tried to upload to 171.64.65.60 and failed. On the page that opens from that link, scroll down until you see 171.64.65.60 (very near the bottom).

At the present time, it says standby / Not Accept so, yes, it's still down. It has been up several times today for varying periods but apparently they're still working hard to fix whatever they're trying to correct.

Re: Project 6318: Collection server misconfigured?

Posted: Fri Feb 05, 2010 5:08 pm
by brityank
Not only is server 171.64.65.60 in Standby/Not Accept, but the Collection Server's IP, 171.67.108.26, is not in the Server List:

Code:

[16:18:02] Connecting to http://171.64.65.60:8080/
[16:18:04] - Couldn't send HTTP request to server
[16:18:04] + Could not connect to Work Server (results)
[16:18:04]     (171.64.65.60:8080)
[16:18:04] + Retrying using alternative port
[16:18:04] Connecting to http://171.64.65.60:80/
[16:18:05] - Couldn't send HTTP request to server
[16:18:05] + Could not connect to Work Server (results)
[16:18:05]     (171.64.65.60:80)
[16:18:05] - Error: Could not transmit unit 00 (completed February 5) to work server.
[16:18:05] - 4 failed uploads of this unit.
[16:18:05] - Read packet limit of 540015616... Set to 524286976.


[16:18:05] + Attempting to send results [February 5 16:18:05 UTC]
[16:18:05] - Reading file work/wuresults_00.dat from core
[16:18:05]   (Read 6877118 bytes from disk)
[16:18:05] Connecting to http://171.67.108.26:8080/
[16:20:30] Posted data.
[16:20:30] Initial: 0000; - Uploaded at ~46 kB/s
[16:20:30] - Averaged speed for that direction ~42 kB/s
[16:20:30] - Server does not have record of this unit. Will try again later.
[16:20:30]   Could not transmit unit 00 to Collection server; keeping in queue.
I now have three systems sending completed WUs back to these servers - all show the same hangups.

So where are the completed WUs going? :?

Re: Project 6318: Collection server misconfigured?

Posted: Sat Feb 06, 2010 1:16 am
by bruce
brityank wrote:So where are the completed WUs going? :?
I hate to state the obvious . . .

Code:

. . .keeping in queue.
They're in the queue on your machine until a server is ready to accept the upload.
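
For anyone who wants to see that retry order spelled out, here is a minimal sketch in Python of the sequence visible in brityank's log: Work Server on port 8080, then port 80, then the Collection Server, and finally "keeping in queue". This only models what the logs show; it is not the client's actual code, and `try_upload` and the `send` callback are hypothetical names made up for illustration.

```python
from typing import Callable, List, Tuple

# Endpoints copied from the logs in this thread. The order mirrors what the
# client appears to do in those logs; it is an assumption, not client source.
WORK_SERVER = [("171.64.65.60", 8080), ("171.64.65.60", 80)]
COLLECTION_SERVER = [("171.67.108.26", 8080)]


def try_upload(unit: str,
               send: Callable[[str, int, str], bool]) -> Tuple[bool, List[str]]:
    """Try each endpoint in turn and report whether the unit left the queue.

    `send` is a hypothetical callback returning True when a server
    accepts the result and False otherwise.
    """
    events: List[str] = []
    for host, port in WORK_SERVER + COLLECTION_SERVER:
        if send(host, port, unit):
            events.append(f"sent {unit} to {host}:{port}")
            return True, events
        events.append(f"failed {host}:{port}")
    # Nothing accepted the upload, so the result stays on the local machine.
    events.append(f"keeping {unit} in queue")
    return False, events
```

With a `send` that always fails, the last event is the "keeping in queue" case, which is the situation described in this thread.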

Re: Project 6318: Collection server misconfigured?

Posted: Sat Feb 06, 2010 4:56 am
by chrisretusn
Getting a lot of the "Server does not have record of this unit. Will try again later." entries in the log.

Project: 6318 (Run 3075, Clone 18, Gen 1)

Code:

[10:17:30] - Autosending finished units...
[10:17:30] Trying to send all finished work units


[10:17:30] + Attempting to send results
[10:17:30] - Reading file work/wuresults_04.dat from core
[10:17:30]   (Read 6874049 bytes from disk)
[10:17:30] Connecting to http://171.64.65.60:8080/
[10:17:30] - Couldn't send HTTP request to server
[10:17:30] + Could not connect to Work Server (results)
[10:17:30]     (171.64.65.60:8080)
[10:17:30] - Error: Could not transmit unit 04 (completed February 4) to work server.
[10:17:30] - 6 failed uploads of this unit.


[10:17:30] + Attempting to send results
[10:17:30] - Reading file work/wuresults_04.dat from core
[10:17:30]   (Read 6874049 bytes from disk)
[10:17:30] Connecting to http://171.67.108.26:8080/
[10:23:07] Timered checkpoint triggered.
[10:23:07] Posted data.
[10:23:07] Initial: 0000; - Uploaded at ~19 kB/s
[10:23:07] - Averaged speed for that direction ~19 kB/s
[10:23:07] - Server does not have record of this unit. Will try again later.
[10:23:07]   Could not transmit unit 04 to Collection server; keeping in queue.
[10:23:07] + Sent 0 of 1 completed units to the server
[10:23:07] - Autosend completed
<< SNIP >>
[16:23:07] - Autosending finished units...
[16:23:07] Trying to send all finished work units


[16:23:07] + Attempting to send results
[16:23:07] - Reading file work/wuresults_04.dat from core
[16:23:07]   (Read 6874049 bytes from disk)
[16:23:07] Connecting to http://171.64.65.60:8080/
[16:23:07] - Couldn't send HTTP request to server
[16:23:07] + Could not connect to Work Server (results)
[16:23:07]     (171.64.65.60:8080)
[16:23:07] - Error: Could not transmit unit 04 (completed February 4) to work server.
[16:23:07] - 7 failed uploads of this unit.


[16:23:07] + Attempting to send results
[16:23:07] - Reading file work/wuresults_04.dat from core
[16:23:07]   (Read 6874049 bytes from disk)
[16:23:07] Connecting to http://171.67.108.26:8080/
[16:28:44] Posted data.
[16:28:44] Initial: 0000; - Uploaded at ~19 kB/s
[16:28:44] - Averaged speed for that direction ~19 kB/s
[16:28:44] - Server does not have record of this unit. Will try again later.
[16:28:44]   Could not transmit unit 04 to Collection server; keeping in queue.
[16:28:44] + Sent 0 of 1 completed units to the server
[16:28:44] - Autosend completed
[22:28:44] - Autosending finished units...
[22:28:44] Trying to send all finished work units


[22:28:44] + Attempting to send results
[22:28:44] - Reading file work/wuresults_04.dat from core
[22:28:44]   (Read 6874049 bytes from disk)
[22:28:44] Connecting to http://171.64.65.60:8080/
[22:32:58] Writing local files
[22:32:58] Completed 335000 out of 500000 steps  (67%)
[22:34:22] Posted data.
[22:34:22] Initial: 0000; - Uploaded at ~19 kB/s
[22:34:22] - Averaged speed for that direction ~19 kB/s
[22:34:22] - Server does not have record of this unit. Will try again later.
[22:34:22] - Error: Could not transmit unit 04 (completed February 4) to work server.
[22:34:22] - 8 failed uploads of this unit.


[22:34:22] + Attempting to send results
[22:34:22] - Reading file work/wuresults_04.dat from core
[22:34:22]   (Read 6874049 bytes from disk)
[22:34:22] Connecting to http://171.67.108.26:8080/
[22:40:30] Posted data.
[22:40:30] Initial: 0000; - Uploaded at ~18 kB/s
[22:40:30] - Averaged speed for that direction ~19 kB/s
[22:40:30] - Server does not have record of this unit. Will try again later.
[22:40:30]   Could not transmit unit 04 to Collection server; keeping in queue.
[22:40:30] + Sent 0 of 1 completed units to the server
[22:40:30] - Autosend completed
This is from running fah6 -send 04:

Code:


--- Opening Log file [February 6 04:21:14] 


# Linux Console Edition #######################################################
###############################################################################

                       Folding@Home Client Version 6.02

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/folder/fah
Executable: fah6
Arguments: -send 04 

[04:21:14] - Ask before connecting: No
[04:21:14] - User name: chrisretusn (Team 2291)
[04:21:14] - User ID: xxxx
[04:21:14] - Machine ID: 1
[04:21:14] 
[04:21:14] Loaded queue successfully.
[04:21:14] Attempting to return result(s) to server...


[04:21:14] + Attempting to send results
[04:26:51] - Server does not have record of this unit. Will try again later.
[04:26:51] - Error: Could not transmit unit 04 (completed February 4) to work server.


[04:26:51] + Attempting to send results
[04:32:30] - Server does not have record of this unit. Will try again later.
[04:32:30]   Could not transmit unit 04 to Collection server; keeping in queue.
[04:32:31] - Failed to send unit 04 to server

Folding@Home Client Shutdown.
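
If anyone wants to tally these entries rather than read through the log by eye, a rough Python sketch follows. It assumes the `[HH:MM:SS] message` line format seen in the logs above and matches the message strings as they appear there. The function name `summarize_upload_failures` and the counter keys are my own, not part of the client.

```python
import re
from collections import Counter

# Matches lines such as "[10:23:07] - Server does not have record ..."
LINE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s*(.*)$")


def summarize_upload_failures(log_text: str) -> Counter:
    """Count the upload-related messages seen in a pasted FAHlog excerpt."""
    counts: Counter = Counter()
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # banner lines, blank lines, "<< SNIP >>", etc.
        message = match.group(2)
        if "Server does not have record of this unit" in message:
            counts["no_record"] += 1
        elif "Couldn't send HTTP request to server" in message:
            counts["connect_failed"] += 1
        elif "keeping in queue" in message:
            counts["kept_in_queue"] += 1
    return counts
```

Pasting a FAHlog excerpt into a string and calling `summarize_upload_failures(text)` gives a quick count of how often each condition occurred.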

Re: Project 6318: Collection server misconfigured?

Posted: Sat Feb 06, 2010 4:06 pm
by VijayPande
Dr. Voelz has been on this one for the last few days. In working to fix this, we've been slowed down since this has exposed a bug in the new v5 server code, which Joe has been working on. (Briefly, it was running out of RAM, due to 32-bit binaries). It looks like it's now fixed (running as 64-bit).

Thanks for bearing with us on this one. The v5 server code is still having growing pains here and there, but it's getting there. In particular, it's extremely capable (we could likely run all of FAH on *one* of our newer 24GB RAM servers with this new code, a big improvement over the v4 version) and has lots of neat functions which we'll be using in the future. My hope is that this is the last big issue, but there will undoubtedly be more smaller issues as time goes on.

Re: Project 6318: Collection server misconfigured?

Posted: Sat Feb 06, 2010 4:20 pm
by AgrFan
VijayPande wrote:Dr. Voelz has been on this one for the last few days. In working to fix this, we've been slowed down since this has exposed a bug in the new v5 server code, which Joe has been working on. (Briefly, it was running out of RAM, due to 32-bit binaries). It looks like it's now fixed (running as 64-bit).

Thanks for bearing with us on this one. The v5 server code is still having growing pains here and there, but it's getting there. In particular, it's extremely capable (we could likely run all of FAH on *one* of our newer 24GB RAM servers with this new code, a big improvement over the v4 version) and has lots of neat functions which we'll be using in the future. My hope is that this is the last big issue, but there will undoubtedly be more smaller issues as time goes on.
Thanks Vijay for the update!

Any ETA on when points for the 63xx units will be credited?

Re: Project 6318: Collection server misconfigured?

Posted: Sat Feb 06, 2010 5:45 pm
by toTOW
AgrFan wrote:Any ETA on when points for the 63xx units will be credited?
Probably a few hours after the server goes back online and accepts your WUs.

Re: Project 6318: Collection server misconfigured?

Posted: Sat Feb 06, 2010 6:22 pm
by AgrFan
toTOW wrote:
AgrFan wrote:Any ETA on when points for the 63xx units will be credited?
Probably a few hours after the server goes back online and accepts your WUs.
I'm missing points for 63xx units uploaded over the last couple of weeks. I can provide logs if needed. Hopefully once the backlog is processed the points will arrive.

Re: Project 6318: Collection server misconfigured?

Posted: Sat Feb 06, 2010 8:28 pm
by DrSpalding
I would actually like to see the assignment server stop giving out packets to 171.64.65.60 for a while. One of my machines has been unsuccessfully waiting to get a WU from it since 17:08 UTC (9:08 PST) today even though it was finally able to upload the two or three completed 6318 WUs I had waiting. I see that the server is pretty heavily loaded (some columns removed):

Code:

Sat Feb 6 08:55:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  9.86  455  
Sat Feb 6 09:10:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  10.40  432 
Sat Feb 6 09:25:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  8.95  414 
Sat Feb 6 09:40:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  9.15  532 
Sat Feb 6 09:55:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  9.88  513 
Sat Feb 6 10:10:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  10.03  458
Sat Feb 6 10:25:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  10.81  505
Sat Feb 6 10:40:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  11.84  531
Sat Feb 6 10:55:09 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  13.22  614
Sat Feb 6 11:10:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  21.65  823
Sat Feb 6 11:25:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  25.55  1031
Sat Feb 6 11:40:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  15.14  421
Sat Feb 6 11:55:11 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  7.70  523
I expect that the last two columns are the cpuload and netload values.
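
For what it's worth, here is a small Python sketch for pulling the numbers out of rows like the ones above. It assumes the trimmed layout I pasted (six date tokens, then the IP, with the two load values last) and that those last two columns really are CPU load and net load. `StatusRow`, `parse_status_rows` and `peak_net_load` are names I made up for this example.

```python
from typing import List, NamedTuple


class StatusRow(NamedTuple):
    timestamp: str
    ip: str
    cpu_load: float
    net_load: int


def parse_status_rows(text: str) -> List[StatusRow]:
    """Parse whitespace-separated status rows in the trimmed layout above."""
    rows: List[StatusRow] = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 9:
            continue  # too short to be a status row
        try:
            rows.append(StatusRow(
                timestamp=" ".join(tokens[:6]),  # e.g. "Sat Feb 6 11:25:10 PST 2010"
                ip=tokens[6],
                cpu_load=float(tokens[-2]),      # assumed to be CPU load
                net_load=int(tokens[-1]),        # assumed to be net load
            ))
        except ValueError:
            continue  # last two columns were not numeric; skip the line
    return rows


def peak_net_load(rows: List[StatusRow]) -> StatusRow:
    """Return the row with the highest net load."""
    return max(rows, key=lambda row: row.net_load)
```

Run over the rows above, `peak_net_load` picks out the 11:25:10 entry, where the net load reached 1031.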

Re: Project 6318: Collection server misconfigured?

Posted: Sat Feb 06, 2010 9:25 pm
by bruce
Yes, the last two columns are CPU LOAD and NET LOAD.

The CPU LOAD includes not only what happens at high priority (managing the internet connection) but any number of low priority tasks such as recompiling some of the code. A large number doesn't necessarily imply that the server isn't able to handle the load.

I don't know what a reasonable value of NET LOAD is for this server. Unlike the older servers, which were much more limited in both RAM and core count, this server might be dealing with the load quite well. I'm sure that people are watching it carefully.

I don't mean to discount your observations. It's possible that a reduction in assignment rate is in order. It's also possible that you have attempted to upload during times that the server has been down. They've been working very hard on this server over the past couple of days. (See VijayPande's post above.)

Please post FAHlog showing the errors that you're getting and when they occurred.

Re: Project 6318: Collection server misconfigured?

Posted: Sat Feb 06, 2010 11:08 pm
by DrSpalding
At 21:35 UTC today, the assignment server changed this particular machine to server 171.67.108.13 and picked up a P4441 r143c3g45 project instead of a 6318.

The one thing I have noticed on another machine is that it routinely took 7-9 minutes to be told that the server had no record of the WU. After about three days, it finally uploaded today at 13:58 UTC. That was a p6318 r3344 c43 g1.

If you would still like to see the relevant portions of those two logs, send me a PM with an email address I can mail them to. Otherwise, I am content to let it go for now. The machine is pretty obviously heavily loaded while it tries to catch up with all the 6318 (and others?) WUs that were waiting to be uploaded.

Re: Project 6318: Collection server misconfigured?

Posted: Sun Feb 07, 2010 12:14 am
by bruce
The assignment to a particular server leads to the issuing of a new WU. If it's a new server, you'll get assignments from a different family of projects. That has nothing to do with what you've already completed.

Work which has been completed is often uploaded immediately, but there are numerous reasons why it might not be. In the case of p6318, there is more than one reason, and one of them is that the Collection Server does not have a record of it. I am still trying to figure out why it has not been uploaded to the primary Work Server.