Project 6318: Collection server misconfigured?
Moderators: Site Moderators, FAHC Science Team
Re: Project 6318: Collection server misconfigured?
Server 171.64.65.60 was down for most of the night (Stanford time). It's just coming back now and it will take a while to accept the backlog.
At the time you uploaded, the Collection Server did not have a record of the WU, but I suspect that the Work Server does have that information, so everything should straighten itself out in a while. Please be patient.
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 128
- Joined: Mon Dec 03, 2007 9:38 pm
- Hardware configuration: CPU folding on only one machine a laptop
GPU Hardware..
3 x 460
1 X 260
4 X 250
+ 1 X 9800GT (3 days a week)
- Location: Chester U.K
Re: Project 6318: Collection server misconfigured?
bruce wrote: Server 171.64.65.60 was down for most of the night (Stanford time). It's just coming back now and it will take a while to accept the backlog.
Yes, just sent one back, 34 more to go.
EDIT: Just got lucky with that one unit, from the looks of it. Multiple attempts every six hours and on completion of units, and yet no further units uploaded... now up to 37. At least I've no more to complete. FAHSpy is overloaded with red error messages.
Pete
Re: Project 6318: Collection server misconfigured?
Is the work server still down?
[23:46:48] Project: 6316 (Run 430, Clone 12, Gen 24)
[23:46:48]
[23:46:49] - Couldn't send HTTP request to server
[23:46:49] + Could not connect to Work Server (results)
[23:46:49] (171.64.65.60:8080)
[23:46:49] + Retrying using alternative port
[23:46:49] Assembly optimizations on if available.
[23:46:49] Entering M.D.
[23:46:50] - Couldn't send HTTP request to server
[23:46:50] + Could not connect to Work Server (results)
[23:46:50] (171.64.65.60:80)
[23:46:50] - Error: Could not transmit unit 04 (completed February 4) to work server.
[23:46:50] + Attempting to send results [February 4 23:46:50 UTC]
[23:47:09] (Starting from checkpoint)
[23:47:09] Protein: p6316_sh3_with_ALA_frags
[23:47:09]
[23:47:09] Writing local files
[23:47:09] Completed 110000 out of 500000 steps (22%)
[23:47:09] Extra SSE boost OK.
[23:49:42] - Server does not have record of this unit. Will try again later.
[23:49:42] Could not transmit unit 04 to Collection server; keeping in queue.
Re: Project 6318: Collection server misconfigured?
whiteb wrote: Is the work server still down?
The answer to that question can be found on the "Server Status" link at the top of the page.
In the log you posted, you'll see that it tried to upload to 171.64.65.60 and failed. On the page that opens from that link, scroll down until you see 171.64.65.60 (very near the bottom).
At the present time, it says standby / Not Accept so, yes, it's still down. It has been up several times today for varying periods but apparently they're still working hard to fix whatever they're trying to correct.
Re: Project 6318: Collection server misconfigured?
Not only is server 171.64.65.60 in Standby/Not Accept, but the IP for the Collection Server, 171.67.108.26, is not in the Server List.
I now have three systems sending completed WUs back to these servers - all show the same hangups.
So where are the completed WUs going?
Code:
[16:18:02] Connecting to http://171.64.65.60:8080/
[16:18:04] - Couldn't send HTTP request to server
[16:18:04] + Could not connect to Work Server (results)
[16:18:04] (171.64.65.60:8080)
[16:18:04] + Retrying using alternative port
[16:18:04] Connecting to http://171.64.65.60:80/
[16:18:05] - Couldn't send HTTP request to server
[16:18:05] + Could not connect to Work Server (results)
[16:18:05] (171.64.65.60:80)
[16:18:05] - Error: Could not transmit unit 00 (completed February 5) to work server.
[16:18:05] - 4 failed uploads of this unit.
[16:18:05] - Read packet limit of 540015616... Set to 524286976.
[16:18:05] + Attempting to send results [February 5 16:18:05 UTC]
[16:18:05] - Reading file work/wuresults_00.dat from core
[16:18:05] (Read 6877118 bytes from disk)
[16:18:05] Connecting to http://171.67.108.26:8080/
[16:20:30] Posted data.
[16:20:30] Initial: 0000; - Uploaded at ~46 kB/s
[16:20:30] - Averaged speed for that direction ~42 kB/s
[16:20:30] - Server does not have record of this unit. Will try again later.
[16:20:30] Could not transmit unit 00 to Collection server; keeping in queue.
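As a sanity check on the log above, the reported upload speed can be reproduced from the timestamps and the file size. This is only a minimal Python sketch: it assumes the client's "kB" means 1024 bytes (the log does not say), that the transfer runs from the "Connecting" line to the "Posted data" line, and that both timestamps fall on the same day. The function name `upload_rate_kib` is made up for illustration.

```python
from datetime import datetime

def upload_rate_kib(start: str, end: str, size_bytes: int) -> float:
    """Average rate in KiB/s between two HH:MM:SS log timestamps (same day assumed)."""
    fmt = "%H:%M:%S"
    elapsed = (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()
    return size_bytes / elapsed / 1024

# Values taken from the log: 6877118 bytes, connect at 16:18:05, "Posted data" at 16:20:30.
rate = upload_rate_kib("16:18:05", "16:20:30", 6877118)
print(f"{rate:.1f} KiB/s")
```

The result lands between 46 and 47 KiB/s, in line with the "~46 kB/s" the client logged. So the whole 6.5 MB result file really was transferred before the Collection Server answered that it had no record of the unit.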
... ... Free Republic Folders - A Tribute to Ronald Reagan ... ...
Re: Project 6318: Collection server misconfigured?
brityank wrote: So where are the completed WUs going?
I hate to state the obvious . . .
Code:
. . .keeping in queue.
-
- Posts: 101
- Joined: Sat Feb 02, 2008 10:12 am
- Hardware configuration: AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
AMD Athlon(tm) XP 2600+
- Location: Philippines
Re: Project 6318: Collection server misconfigured?
Getting a lot of the "Server does not have record of this unit. Will try again later." entries in the log.
Project: 6318 (Run 3075, Clone 18, Gen 1)
This is from a fah6 -send 04
Code:
[10:17:30] - Autosending finished units...
[10:17:30] Trying to send all finished work units
[10:17:30] + Attempting to send results
[10:17:30] - Reading file work/wuresults_04.dat from core
[10:17:30] (Read 6874049 bytes from disk)
[10:17:30] Connecting to http://171.64.65.60:8080/
[10:17:30] - Couldn't send HTTP request to server
[10:17:30] + Could not connect to Work Server (results)
[10:17:30] (171.64.65.60:8080)
[10:17:30] - Error: Could not transmit unit 04 (completed February 4) to work server.
[10:17:30] - 6 failed uploads of this unit.
[10:17:30] + Attempting to send results
[10:17:30] - Reading file work/wuresults_04.dat from core
[10:17:30] (Read 6874049 bytes from disk)
[10:17:30] Connecting to http://171.67.108.26:8080/
[10:23:07] Timered checkpoint triggered.
[10:23:07] Posted data.
[10:23:07] Initial: 0000; - Uploaded at ~19 kB/s
[10:23:07] - Averaged speed for that direction ~19 kB/s
[10:23:07] - Server does not have record of this unit. Will try again later.
[10:23:07] Could not transmit unit 04 to Collection server; keeping in queue.
[10:23:07] + Sent 0 of 1 completed units to the server
[10:23:07] - Autosend completed
<< SNIP >>
[16:23:07] - Autosending finished units...
[16:23:07] Trying to send all finished work units
[16:23:07] + Attempting to send results
[16:23:07] - Reading file work/wuresults_04.dat from core
[16:23:07] (Read 6874049 bytes from disk)
[16:23:07] Connecting to http://171.64.65.60:8080/
[16:23:07] - Couldn't send HTTP request to server
[16:23:07] + Could not connect to Work Server (results)
[16:23:07] (171.64.65.60:8080)
[16:23:07] - Error: Could not transmit unit 04 (completed February 4) to work server.
[16:23:07] - 7 failed uploads of this unit.
[16:23:07] + Attempting to send results
[16:23:07] - Reading file work/wuresults_04.dat from core
[16:23:07] (Read 6874049 bytes from disk)
[16:23:07] Connecting to http://171.67.108.26:8080/
[16:28:44] Posted data.
[16:28:44] Initial: 0000; - Uploaded at ~19 kB/s
[16:28:44] - Averaged speed for that direction ~19 kB/s
[16:28:44] - Server does not have record of this unit. Will try again later.
[16:28:44] Could not transmit unit 04 to Collection server; keeping in queue.
[16:28:44] + Sent 0 of 1 completed units to the server
[16:28:44] - Autosend completed
[22:28:44] - Autosending finished units...
[22:28:44] Trying to send all finished work units
[22:28:44] + Attempting to send results
[22:28:44] - Reading file work/wuresults_04.dat from core
[22:28:44] (Read 6874049 bytes from disk)
[22:28:44] Connecting to http://171.64.65.60:8080/
[22:32:58] Writing local files
[22:32:58] Completed 335000 out of 500000 steps (67%)
[22:34:22] Posted data.
[22:34:22] Initial: 0000; - Uploaded at ~19 kB/s
[22:34:22] - Averaged speed for that direction ~19 kB/s
[22:34:22] - Server does not have record of this unit. Will try again later.
[22:34:22] - Error: Could not transmit unit 04 (completed February 4) to work server.
[22:34:22] - 8 failed uploads of this unit.
[22:34:22] + Attempting to send results
[22:34:22] - Reading file work/wuresults_04.dat from core
[22:34:22] (Read 6874049 bytes from disk)
[22:34:22] Connecting to http://171.67.108.26:8080/
[22:40:30] Posted data.
[22:40:30] Initial: 0000; - Uploaded at ~18 kB/s
[22:40:30] - Averaged speed for that direction ~19 kB/s
[22:40:30] - Server does not have record of this unit. Will try again later.
[22:40:30] Could not transmit unit 04 to Collection server; keeping in queue.
[22:40:30] + Sent 0 of 1 completed units to the server
[22:40:30] - Autosend completed
Code:
--- Opening Log file [February 6 04:21:14]
# Linux Console Edition #######################################################
###############################################################################
Folding@Home Client Version 6.02
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: /home/folder/fah
Executable: fah6
Arguments: -send 04
[04:21:14] - Ask before connecting: No
[04:21:14] - User name: chrisretusn (Team 2291)
[04:21:14] - User ID: xxxx
[04:21:14] - Machine ID: 1
[04:21:14]
[04:21:14] Loaded queue successfully.
[04:21:14] Attempting to return result(s) to server...
[04:21:14] + Attempting to send results
[04:26:51] - Server does not have record of this unit. Will try again later.
[04:26:51] - Error: Could not transmit unit 04 (completed February 4) to work server.
[04:26:51] + Attempting to send results
[04:32:30] - Server does not have record of this unit. Will try again later.
[04:32:30] Could not transmit unit 04 to Collection server; keeping in queue.
[04:32:31] - Failed to send unit 04 to server
Folding@Home Client Shutdown.
-
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: Project 6318: Collection server misconfigured?
Dr. Voelz has been on this one for the last few days. In working to fix this, we've been slowed down since this has exposed a bug in the new v5 server code, which Joe has been working on. (Briefly, it was running out of RAM, due to 32-bit binaries). It looks like it's now fixed (running as 64-bit).
Thanks for bearing with us on this one. The v5 server code is still having growing pains here and there, but it's getting there. In particular, it's extremely capable (we could likely run all of FAH on *one* of our newer 24GB RAM servers with this new code, a big improvement over the v4 version) and has lots of neat functions which we'll be using in the future. My hope is that this is the last big issue, but there will undoubtedly be more small issues as time goes on.
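For readers wondering how a server with plenty of RAM can still run out of memory: the post gives no details, but the general arithmetic for 32-bit binaries is simple. A 32-bit pointer can address at most 2^32 bytes, so a single 32-bit process cannot use more than 4 GiB no matter how much RAM is installed (and in practice usually gets less). A minimal sketch, assuming "24GB" means 24 GiB:

```python
GIB = 1024 ** 3

# Upper bound on what one 32-bit process can address.
address_space_32bit = 2 ** 32
address_space_gib = address_space_32bit / GIB

# Assumption: the "24GB RAM" servers mentioned above have 24 GiB installed.
server_ram = 24 * GIB
ratio = server_ram / address_space_32bit

print(f"32-bit address space: {address_space_gib} GiB")
print(f"Server RAM is {ratio}x what one 32-bit process can address")
```

A 64-bit build removes that per-process ceiling, which fits the fix described above.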
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
Re: Project 6318: Collection server misconfigured?
VijayPande wrote: Dr. Voelz has been on this one for the last few days. In working to fix this, we've been slowed down since this has exposed a bug in the new v5 server code, which Joe has been working on. (Briefly, it was running out of RAM, due to 32-bit binaries). It looks like it's now fixed (running as 64-bit).
Thanks for bearing with us on this one. The v5 server code is still having growing pains here and there, but it's getting there. In particular, it's extremely capable (we could likely run all of FAH on *one* of our newer 24GB RAM servers with this new code, a big improvement over the v4 version) and has lots of neat functions which we'll be using in the future. My hope is that this is the last big issue, but there will undoubtedly be more small issues as time goes on.
Thanks Vijay for the update!
Any ETA on when points for the 63xx units will be credited?
-
- Site Moderator
- Posts: 6359
- Joined: Sun Dec 02, 2007 10:38 am
- Location: Bordeaux, France
- Contact:
Re: Project 6318: Collection server misconfigured?
AgrFan wrote: Any ETA on when points for the 63xx units will be credited?
Probably a few hours after the server goes back online and accepts your WUs.
Re: Project 6318: Collection server misconfigured?
AgrFan wrote: Any ETA on when points for the 63xx units will be credited?
toTOW wrote: Probably a few hours after the server goes back online and accepts your WUs.
I'm missing points for 63xx units uploaded over the last couple of weeks. I can provide logs if needed. Hopefully once the backlog is processed the points will arrive.
-
- Posts: 136
- Joined: Wed May 27, 2009 4:48 pm
- Hardware configuration: Dell Studio 425 MTS-Core i7-920 c0 stock
evga SLI 3x o/c Core i7-920 d0 @ 3.9GHz + nVidia GTX275
Dell 5150 + nVidia 9800GT
Re: Project 6318: Collection server misconfigured?
I would actually like to see the assignment server stop giving out packets to 171.64.65.60 for a while. One of my machines has been unsuccessfully waiting to get a WU from it since 17:08 UTC (9:08 PST) today even though it was finally able to upload the two or three completed 6318 WUs I had waiting. I see that the server is pretty heavily loaded (some columns removed):
Code:
Sat Feb 6 08:55:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 9.86 455
Sat Feb 6 09:10:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 10.40 432
Sat Feb 6 09:25:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 8.95 414
Sat Feb 6 09:40:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 9.15 532
Sat Feb 6 09:55:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 9.88 513
Sat Feb 6 10:10:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 10.03 458
Sat Feb 6 10:25:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 10.81 505
Sat Feb 6 10:40:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 11.84 531
Sat Feb 6 10:55:09 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 13.22 614
Sat Feb 6 11:10:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 21.65 823
Sat Feb 6 11:25:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 25.55 1031
Sat Feb 6 11:40:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 15.14 421
Sat Feb 6 11:55:11 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 7.70 523
I expect that the last two columns are the cpuload and netload values.
Not a real doctor, I just play one on the 'net!
Re: Project 6318: Collection server misconfigured?
Yes, the last two columns are CPU LOAD and NET LOAD.
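For anyone who wants to track those two columns over time, here is a minimal Python sketch that pulls them out of rows shaped like the ones posted above. The field layout is an assumption based only on those trimmed rows (the poster removed some columns, so the real status page has more), and `parse_status_row` is a made-up helper name.

```python
def parse_status_row(row: str) -> dict:
    """Split one trimmed status row; the last two fields are CPU LOAD and NET LOAD."""
    fields = row.split()
    return {
        "time": " ".join(fields[:6]),   # e.g. "Sat Feb 6 11:25:10 PST 2010"
        "server": fields[6],
        "cpu_load": float(fields[-2]),
        "net_load": int(fields[-1]),
    }

# Three rows copied from the post above.
rows = [
    "Sat Feb 6 11:10:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 21.65 823",
    "Sat Feb 6 11:25:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 25.55 1031",
    "Sat Feb 6 11:40:10 PST 2010 171.64.65.60 classic vspg10a vvoelz accept Accepting 15.14 421",
]
parsed = [parse_status_row(r) for r in rows]
peak = max(parsed, key=lambda r: r["net_load"])
print(peak["time"], peak["cpu_load"], peak["net_load"])
```

On these rows the peak is the 11:25:10 sample, where both values are at their highest.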
The CPU LOAD includes not only what happens at high priority (managing the internet connection) but any number of low priority tasks such as recompiling some of the code. A large number doesn't necessarily imply that the server isn't able to handle the load.
I don't know what value of NET LOAD is reasonable for this server. Unlike the older servers, which were much more limited in both RAM and core count, this server might be dealing with the load quite well. I'm sure that people are watching it carefully.
I don't mean to discount your observations. It's possible that a reduction in assignment rate is in order. It's also possible that you have attempted to upload during times that the server has been down. They've been working very hard on this server over the past couple of days. (See VijayPande's post above.)
Please post FAHlog showing the errors that you're getting and when they occurred.
-
- Posts: 136
- Joined: Wed May 27, 2009 4:48 pm
- Hardware configuration: Dell Studio 425 MTS-Core i7-920 c0 stock
evga SLI 3x o/c Core i7-920 d0 @ 3.9GHz + nVidia GTX275
Dell 5150 + nVidia 9800GT
Re: Project 6318: Collection server misconfigured?
At 21:35 UTC today, the assignment server changed this particular machine to server 171.67.108.13 and picked up a P4441 r143c3g45 project instead of a 6318.
The one thing I have noticed on another machine is that it took 7-9 minutes routinely to be told that the server had no record of the WU. After about three days, it finally uploaded today at 13:58 UTC. That was a p6318 r3344 c43 g1.
If you would still like to see the relevant portions of those two logs, let me know via PM with an email address I can send them to. Otherwise, I am content to let it go for now. The server is pretty obviously heavily loaded while it tries to catch up with all the 6318 (and others?) WUs that were waiting to be uploaded.
Re: Project 6318: Collection server misconfigured?
The assignment to a particular server leads to the issuing of a new WU. If it's a new server, you'll get assignments from a different family of projects. That has nothing to do with what you've already completed.
Work which has been completed is often uploaded immediately, but there are numerous reasons why it might not be. In the case of p6318, there is more than one reason, and one of them is that the Collection Server does not have a record of it. I am still trying to figure out why it has not been uploaded to the primary Work Server.
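To make the upload sequence easier to follow, here is a rough Python sketch of the fallback order as it appears in the logs posted in this thread: Work Server on port 8080, then the same server on port 80, then the Collection Server, and if all fail the unit stays in the queue. This is reconstructed from the log messages only and is not the client's actual code; `try_upload` and the `send` callback are hypothetical names, and not every log in the thread shows the port 80 retry.

```python
from typing import Callable, List, Tuple

WORK_SERVER = "171.64.65.60"
COLLECTION_SERVER = "171.67.108.26"

def try_upload(send: Callable[[str, int], bool]) -> Tuple[bool, List[str]]:
    """Try each target in the order seen in the logs; return (sent, attempts)."""
    targets = [(WORK_SERVER, 8080), (WORK_SERVER, 80), (COLLECTION_SERVER, 8080)]
    attempts = []
    for host, port in targets:
        attempts.append(f"{host}:{port}")
        if send(host, port):
            return True, attempts
    # Nothing accepted the result: the caller keeps the unit queued and retries later.
    return False, attempts

# Simulate the situation in this thread: every target refuses the unit.
sent, attempts = try_upload(lambda host, port: False)
print(sent, attempts)
```

In the failing case all three targets are attempted and the result is kept, which matches the "keeping in queue" lines in the logs.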