Project 6318: Collection server misconfigured?

Moderators: Site Moderators, FAHC Science Team

bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project 6318: Collection server misconfigured?

Post by bruce »

Server 171.64.65.60 was down for most of the night (Stanford time). It's just coming back now and it will take a while to accept the backlog.

At the time you uploaded, the Collection Server did not have a record of the WU, but I suspect that the Work Server does have that information, so everything should straighten itself out in a while. Please be patient.
Pette Broad
Posts: 128
Joined: Mon Dec 03, 2007 9:38 pm
Hardware configuration: CPU folding on only one machine a laptop

GPU Hardware..
3 x 460
1 X 260
4 X 250

+ 1 X 9800GT (3 days a week)
Location: Chester U.K

Re: Project 6318: Collection server misconfigured?

Post by Pette Broad »

bruce wrote:Server 171.64.65.60 was down for most of the night (Stanford time). It's just coming back now and it will take a while to accept the backlog.
Yes, just sent one back, 34 more to go :)

EDIT: From the looks of it, I just got lucky with that one unit. There have been multiple attempts every six hours and on completion of units, yet no further units have uploaded. I'm now up to 37; at least I've no more to complete. FAHSpy is overloaded with red error messages. :D

Pete
whiteb
Posts: 1
Joined: Mon Jul 21, 2008 4:30 pm

Re: Project 6318: Collection server misconfigured?

Post by whiteb »

Is the work server still down?

[23:46:48] Project: 6316 (Run 430, Clone 12, Gen 24)
[23:46:48]
[23:46:49] - Couldn't send HTTP request to server
[23:46:49] + Could not connect to Work Server (results)
[23:46:49] (171.64.65.60:8080)
[23:46:49] + Retrying using alternative port
[23:46:49] Assembly optimizations on if available.
[23:46:49] Entering M.D.
[23:46:50] - Couldn't send HTTP request to server
[23:46:50] + Could not connect to Work Server (results)
[23:46:50] (171.64.65.60:80)
[23:46:50] - Error: Could not transmit unit 04 (completed February 4) to work server.


[23:46:50] + Attempting to send results [February 4 23:46:50 UTC]
[23:47:09] (Starting from checkpoint)
[23:47:09] Protein: p6316_sh3_with_ALA_frags
[23:47:09]
[23:47:09] Writing local files
[23:47:09] Completed 110000 out of 500000 steps (22%)
[23:47:09] Extra SSE boost OK.
[23:49:42] - Server does not have record of this unit. Will try again later.
[23:49:42] Could not transmit unit 04 to Collection server; keeping in queue.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project 6318: Collection server misconfigured?

Post by bruce »

whiteb wrote:Is the work server still down?
The answer to that question can be found on the "Server Status" link at the top of the page.

In the log you posted, you'll see that it tried to upload to 171.64.65.60 and failed. On the page that opens from that link, scroll down until you see 171.64.65.60 (very near the bottom).

At the present time, it says standby / Not Accept so, yes, it's still down. It has been up several times today for varying periods but apparently they're still working hard to fix whatever they're trying to correct.
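
For anyone who wants to pull the failing server addresses out of a FAHlog before checking them against the Server Status page, here is a minimal sketch. It assumes the client prints the address as "(ip:port)" on its own line, as in the log posted above. The function name `failed_servers` is made up for illustration and is not part of the client.

```python
import re

# Matches the "(ip:port)" lines the client prints after a failed connection.
# "(results)" and "(Run 430, ...)" do not match because an IPv4 address is required.
SERVER_RE = re.compile(r"\((\d{1,3}(?:\.\d{1,3}){3}):(\d+)\)")

def failed_servers(log_text):
    """Return the unique (ip, port) pairs found in the log, in first-seen order."""
    seen = []
    for ip, port in SERVER_RE.findall(log_text):
        pair = (ip, int(port))
        if pair not in seen:
            seen.append(pair)
    return seen

sample = """\
[23:46:49] + Could not connect to Work Server (results)
[23:46:49] (171.64.65.60:8080)
[23:46:50] + Could not connect to Work Server (results)
[23:46:50] (171.64.65.60:80)
"""
print(failed_servers(sample))
```

Each address it prints is one to look up in the Server Status table.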
brityank
Posts: 161
Joined: Wed Dec 05, 2007 9:16 pm
Location: SE Pennsylvania

Re: Project 6318: Collection server misconfigured?

Post by brityank »

Not only is server 171.64.65.60 showing Standby/Not Accept, but the Collection Server's IP, 171.67.108.26, is not in the Server List at all:

Code:

[16:18:02] Connecting to http://171.64.65.60:8080/
[16:18:04] - Couldn't send HTTP request to server
[16:18:04] + Could not connect to Work Server (results)
[16:18:04]     (171.64.65.60:8080)
[16:18:04] + Retrying using alternative port
[16:18:04] Connecting to http://171.64.65.60:80/
[16:18:05] - Couldn't send HTTP request to server
[16:18:05] + Could not connect to Work Server (results)
[16:18:05]     (171.64.65.60:80)
[16:18:05] - Error: Could not transmit unit 00 (completed February 5) to work server.
[16:18:05] - 4 failed uploads of this unit.
[16:18:05] - Read packet limit of 540015616... Set to 524286976.


[16:18:05] + Attempting to send results [February 5 16:18:05 UTC]
[16:18:05] - Reading file work/wuresults_00.dat from core
[16:18:05]   (Read 6877118 bytes from disk)
[16:18:05] Connecting to http://171.67.108.26:8080/
[16:20:30] Posted data.
[16:20:30] Initial: 0000; - Uploaded at ~46 kB/s
[16:20:30] - Averaged speed for that direction ~42 kB/s
[16:20:30] - Server does not have record of this unit. Will try again later.
[16:20:30]   Could not transmit unit 00 to Collection server; keeping in queue.
I now have three systems sending completed WUs back to these servers - all show the same hangups.

So where are the completed WUs going? :?
... ... Free Republic Folders - A Tribute to Ronald Reagan ... ...
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project 6318: Collection server misconfigured?

Post by bruce »

brityank wrote:So where are the completed WUs going? :?
I hate to state the obvious . . .

Code:

. . .keeping in queue.
They're in the queue on your machine until a server is ready to accept the upload.
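
As a rough illustration of what the logs in this thread show (Work Server on port 8080, then port 80, then the Collection Server, and otherwise "keeping in queue"), here is a minimal sketch of that fallback order. This is not the client's actual code. `try_upload`, `autosend` and the `send` callable are hypothetical names, and the server order is simply read off the logs posted above.

```python
def try_upload(unit, servers, send):
    """Try each server in order; return the one that accepted, or None.

    `send(unit, server)` is a hypothetical callable returning True on success.
    """
    for server in servers:
        if send(unit, server):
            return server
    return None

def autosend(queue, servers, send):
    """One autosend pass: units that no server accepts stay in the queue."""
    remaining = []
    for unit in queue:
        if try_upload(unit, servers, send) is None:
            remaining.append(unit)  # "keeping in queue"
    return remaining

# Order taken from the posted logs: Work Server, its alternate port, Collection Server.
servers = ["171.64.65.60:8080", "171.64.65.60:80", "171.67.108.26:8080"]

all_down = lambda unit, server: False
print(autosend(["unit 04"], servers, all_down))  # the unit is still queued

one_up = lambda unit, server: server == "171.67.108.26:8080"
print(autosend(["unit 04"], servers, one_up))    # the queue is empty
```

Nothing is lost while every server refuses the upload; the unit is just retried on the next pass.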
chrisretusn
Posts: 101
Joined: Sat Feb 02, 2008 10:12 am
Hardware configuration: AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
AMD Athlon(tm) XP 2600+
Location: Philippines

Re: Project 6318: Collection server misconfigured?

Post by chrisretusn »

Getting a lot of the "Server does not have record of this unit. Will try again later." entries in the log.

Project: 6318 (Run 3075, Clone 18, Gen 1)

Code:

[10:17:30] - Autosending finished units...
[10:17:30] Trying to send all finished work units


[10:17:30] + Attempting to send results
[10:17:30] - Reading file work/wuresults_04.dat from core
[10:17:30]   (Read 6874049 bytes from disk)
[10:17:30] Connecting to http://171.64.65.60:8080/
[10:17:30] - Couldn't send HTTP request to server
[10:17:30] + Could not connect to Work Server (results)
[10:17:30]     (171.64.65.60:8080)
[10:17:30] - Error: Could not transmit unit 04 (completed February 4) to work server.
[10:17:30] - 6 failed uploads of this unit.


[10:17:30] + Attempting to send results
[10:17:30] - Reading file work/wuresults_04.dat from core
[10:17:30]   (Read 6874049 bytes from disk)
[10:17:30] Connecting to http://171.67.108.26:8080/
[10:23:07] Timered checkpoint triggered.
[10:23:07] Posted data.
[10:23:07] Initial: 0000; - Uploaded at ~19 kB/s
[10:23:07] - Averaged speed for that direction ~19 kB/s
[10:23:07] - Server does not have record of this unit. Will try again later.
[10:23:07]   Could not transmit unit 04 to Collection server; keeping in queue.
[10:23:07] + Sent 0 of 1 completed units to the server
[10:23:07] - Autosend completed
<< SNIP >>
[16:23:07] - Autosending finished units...
[16:23:07] Trying to send all finished work units


[16:23:07] + Attempting to send results
[16:23:07] - Reading file work/wuresults_04.dat from core
[16:23:07]   (Read 6874049 bytes from disk)
[16:23:07] Connecting to http://171.64.65.60:8080/
[16:23:07] - Couldn't send HTTP request to server
[16:23:07] + Could not connect to Work Server (results)
[16:23:07]     (171.64.65.60:8080)
[16:23:07] - Error: Could not transmit unit 04 (completed February 4) to work server.
[16:23:07] - 7 failed uploads of this unit.


[16:23:07] + Attempting to send results
[16:23:07] - Reading file work/wuresults_04.dat from core
[16:23:07]   (Read 6874049 bytes from disk)
[16:23:07] Connecting to http://171.67.108.26:8080/
[16:28:44] Posted data.
[16:28:44] Initial: 0000; - Uploaded at ~19 kB/s
[16:28:44] - Averaged speed for that direction ~19 kB/s
[16:28:44] - Server does not have record of this unit. Will try again later.
[16:28:44]   Could not transmit unit 04 to Collection server; keeping in queue.
[16:28:44] + Sent 0 of 1 completed units to the server
[16:28:44] - Autosend completed
[22:28:44] - Autosending finished units...
[22:28:44] Trying to send all finished work units


[22:28:44] + Attempting to send results
[22:28:44] - Reading file work/wuresults_04.dat from core
[22:28:44]   (Read 6874049 bytes from disk)
[22:28:44] Connecting to http://171.64.65.60:8080/
[22:32:58] Writing local files
[22:32:58] Completed 335000 out of 500000 steps  (67%)
[22:34:22] Posted data.
[22:34:22] Initial: 0000; - Uploaded at ~19 kB/s
[22:34:22] - Averaged speed for that direction ~19 kB/s
[22:34:22] - Server does not have record of this unit. Will try again later.
[22:34:22] - Error: Could not transmit unit 04 (completed February 4) to work server.
[22:34:22] - 8 failed uploads of this unit.


[22:34:22] + Attempting to send results
[22:34:22] - Reading file work/wuresults_04.dat from core
[22:34:22]   (Read 6874049 bytes from disk)
[22:34:22] Connecting to http://171.67.108.26:8080/
[22:40:30] Posted data.
[22:40:30] Initial: 0000; - Uploaded at ~18 kB/s
[22:40:30] - Averaged speed for that direction ~19 kB/s
[22:40:30] - Server does not have record of this unit. Will try again later.
[22:40:30]   Could not transmit unit 04 to Collection server; keeping in queue.
[22:40:30] + Sent 0 of 1 completed units to the server
[22:40:30] - Autosend completed
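
A quick sanity check on the numbers in that log: the whole results file appears to be uploaded before the server reports that it has no record of the unit, so each failed attempt costs a full transfer. The sketch below assumes the "Connecting to" and "Posted data." timestamps bracket the transfer and that "kB" in the log means 1024 bytes. Both are guesses, and `upload_rate_kib` is a made-up helper name.

```python
from datetime import datetime

def upload_rate_kib(bytes_sent, start, end):
    """Average transfer rate in KiB/s between two HH:MM:SS log timestamps."""
    fmt = "%H:%M:%S"
    seconds = (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()
    return bytes_sent / seconds / 1024

# First Collection Server attempt above: 6874049 bytes, 10:17:30 to 10:23:07.
rate = upload_rate_kib(6874049, "10:17:30", "10:23:07")
print(f"{rate:.1f} KiB/s")
```

That works out to a little under 20 KiB/s, in line with the "~19 kB/s" the client reports, and to about five and a half minutes spent per rejected attempt.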
This is from a fah6 -send 04

Code:


--- Opening Log file [February 6 04:21:14] 


# Linux Console Edition #######################################################
###############################################################################

                       Folding@Home Client Version 6.02

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/folder/fah
Executable: fah6
Arguments: -send 04 

[04:21:14] - Ask before connecting: No
[04:21:14] - User name: chrisretusn (Team 2291)
[04:21:14] - User ID: xxxx
[04:21:14] - Machine ID: 1
[04:21:14] 
[04:21:14] Loaded queue successfully.
[04:21:14] Attempting to return result(s) to server...


[04:21:14] + Attempting to send results
[04:26:51] - Server does not have record of this unit. Will try again later.
[04:26:51] - Error: Could not transmit unit 04 (completed February 4) to work server.


[04:26:51] + Attempting to send results
[04:32:30] - Server does not have record of this unit. Will try again later.
[04:32:30]   Could not transmit unit 04 to Collection server; keeping in queue.
[04:32:31] - Failed to send unit 04 to server

Folding@Home Client Shutdown.
Folding on Slackware Linux.
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: Project 6318: Collection server misconfigured?

Post by VijayPande »

Dr. Voelz has been on this one for the last few days. In working to fix this, we've been slowed down since this has exposed a bug in the new v5 server code, which Joe has been working on. (Briefly, it was running out of RAM, due to 32-bit binaries). It looks like it's now fixed (running as 64-bit).

Thanks for bearing with us on this one. The v5 server code is still having growing pains here and there, but it's getting there. In particular, it's extremely capable (we could likely run all of FAH on *one* of our newer 24GB RAM servers with this new code, a big improvement over the v4 version) and has lots of neat functions which we'll be using in the future. My hope is that this is the last big issue, but there will undoubtedly be smaller issues as time goes on.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
AgrFan
Posts: 63
Joined: Sat Mar 15, 2008 8:07 pm

Re: Project 6318: Collection server misconfigured?

Post by AgrFan »

VijayPande wrote:Dr. Voelz has been on this one for the last few days. In working to fix this, we've been slowed down since this has exposed a bug in the new v5 server code, which Joe has been working on. (Briefly, it was running out of RAM, due to 32-bit binaries). It looks like it's now fixed (running as 64-bit).

Thanks for bearing with us on this one. The v5 server code is still having growing pains here and there, but it's getting there. In particular, it's extremely capable (we could likely run all of FAH on *one* of our newer 24GB RAM servers with this new code, a big improvement over the v4 version) and has lots of neat functions which we'll be using in the future. My hope is that this is the last big issue, but there will undoubtedly be smaller issues as time goes on.
Thanks Vijay for the update!

Any ETA on when points for the 63xx units will be credited?
toTOW
Site Moderator
Posts: 6359
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Project 6318: Collection server misconfigured?

Post by toTOW »

AgrFan wrote:Any ETA on when points for the 63xx units will be credited?
Probably a few hours after the server goes back online and accepts your WUs.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
AgrFan
Posts: 63
Joined: Sat Mar 15, 2008 8:07 pm

Re: Project 6318: Collection server misconfigured?

Post by AgrFan »

toTOW wrote:
AgrFan wrote:Any ETA on when points for the 63xx units will be credited?
Probably a few hours after the server goes back online and accepts your WUs.
I'm missing points for 63xx units uploaded over the last couple of weeks. I can provide logs if needed. Hopefully once the backlog is processed the points will arrive.
DrSpalding
Posts: 136
Joined: Wed May 27, 2009 4:48 pm
Hardware configuration: Dell Studio 425 MTS-Core i7-920 c0 stock
evga SLI 3x o/c Core i7-920 d0 @ 3.9GHz + nVidia GTX275
Dell 5150 + nVidia 9800GT

Re: Project 6318: Collection server misconfigured?

Post by DrSpalding »

I would actually like to see the assignment server stop giving out packets to 171.64.65.60 for a while. One of my machines has been unsuccessfully waiting to get a WU from it since 17:08 UTC (9:08 PST) today even though it was finally able to upload the two or three completed 6318 WUs I had waiting. I see that the server is pretty heavily loaded (some columns removed):

Code:

Sat Feb 6 08:55:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  9.86  455  
Sat Feb 6 09:10:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  10.40  432 
Sat Feb 6 09:25:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  8.95  414 
Sat Feb 6 09:40:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  9.15  532 
Sat Feb 6 09:55:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  9.88  513 
Sat Feb 6 10:10:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  10.03  458
Sat Feb 6 10:25:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  10.81  505
Sat Feb 6 10:40:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  11.84  531
Sat Feb 6 10:55:09 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  13.22  614
Sat Feb 6 11:10:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  21.65  823
Sat Feb 6 11:25:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  25.55  1031
Sat Feb 6 11:40:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  15.14  421
Sat Feb 6 11:55:11 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  7.70  523
I expect that the last two columns are the cpuload and netload values.
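
To find the peak in a listing like that without eyeballing it, a small sketch follows. It assumes, as guessed above, that the last two whitespace-separated fields are CPU load and net load. The field names in the dictionary are mine, not the status page's, and only four of the rows are repeated here to keep it short.

```python
rows = """\
Sat Feb 6 10:55:09 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  13.22  614
Sat Feb 6 11:10:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  21.65  823
Sat Feb 6 11:25:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  25.55  1031
Sat Feb 6 11:40:10 PST 2010  171.64.65.60  classic  vspg10a  vvoelz  accept  Accepting  15.14  421
"""

def parse_row(line):
    """Split one status row; assumes the last two fields are CPU load and net load."""
    fields = line.split()
    return {
        "time": fields[3],               # e.g. "11:25:10"
        "cpu_load": float(fields[-2]),
        "net_load": int(fields[-1]),
    }

samples = [parse_row(line) for line in rows.splitlines() if line.strip()]
peak = max(samples, key=lambda s: s["cpu_load"])
print(peak["time"], peak["cpu_load"], peak["net_load"])
```

On these rows the busiest sample is the 11:25:10 one, where both columns hit their highest values.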
Not a real doctor, I just play one on the 'net!
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project 6318: Collection server misconfigured?

Post by bruce »

Yes, the last two columns are CPU LOAD and NET LOAD.

The CPU LOAD includes not only what happens at high priority (managing the internet connection) but any number of low priority tasks such as recompiling some of the code. A large number doesn't necessarily imply that the server isn't able to handle the load.

I don't know what a reasonable value of NET LOAD is for this server. Unlike the older servers, which were much more limited in both RAM and core count, this server might be dealing with the load quite well. I'm sure that people are watching it carefully.

I don't mean to discount your observations. It's possible that a reduction in assignment rate is in order. It's also possible that you have attempted to upload during times that the server has been down. They've been working very hard on this server over the past couple of days. (See VijayPande's post above.)

Please post FAHlog showing the errors that you're getting and when they occurred.
DrSpalding
Posts: 136
Joined: Wed May 27, 2009 4:48 pm
Hardware configuration: Dell Studio 425 MTS-Core i7-920 c0 stock
evga SLI 3x o/c Core i7-920 d0 @ 3.9GHz + nVidia GTX275
Dell 5150 + nVidia 9800GT

Re: Project 6318: Collection server misconfigured?

Post by DrSpalding »

At 21:35 UTC today, the assignment server changed this particular machine to server 171.67.108.13 and picked up a P4441 r143c3g45 project instead of a 6318.

The one thing I have noticed on another machine is that it took 7-9 minutes routinely to be told that the server had no record of the WU. After about three days, it finally uploaded today at 13:58 UTC. That was a p6318 r3344 c43 g1.

If you still would like to see the relevant portions of those two logs, let me know via a PM and an email address to mail them to you. Otherwise, I am content to let it go for now. The machine is pretty obviously heavily loaded while it tries to catch up with all the 6318 (and others?) WUs that were waiting to be uploaded.
Not a real doctor, I just play one on the 'net!
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project 6318: Collection server misconfigured?

Post by bruce »

The assignment to a particular server leads to the issuing of a new WU. If it's a new server, you'll get assignments from a different family of projects. That has nothing to do with what you've already completed.

Work which has been completed is often uploaded immediately, but there are numerous reasons why it might not be. In the case of p6318, there is more than one reason, and one of them is that the Collection Server does not have a record of it. I am still trying to figure out why it has not been uploaded to the primary Work Server.