171.64.65.64 overloaded
Moderators: Site Moderators, FAHC Science Team
-
- Posts: 270
- Joined: Sun Dec 02, 2007 2:26 pm
- Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+ - Location: Belgium, near the International Sea-Port of Antwerp
Re: 171.64.65.64 overloaded
Same reports from my friend; it's not possible to get a constant stream of Work and swift uploads.
As a result, production is seriously hampered and very inefficient; on average, the donor systems are sitting idle for too long!
He runs a farm with a mix of SMP and multi-GPU machines, almost all of them exactly the same systems (hardware and software).
Their configurations are also identical.
This is hurting production really badly!
Can someone take control of that server (171.64.65.64) and repair it, please?
.
- stopped Linux SMP w. HT on [email protected] GHz
....................................
Folded since 10-06-04 till 09-2010
-
- Posts: 270
- Joined: Sun Dec 02, 2007 2:26 pm
- Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+ - Location: Belgium, near the International Sea-Port of Antwerp
Re: 171.64.65.64 overloaded
.
Now it's 171.64.65.64 GPU vspg2v lin5 full Reject 1.92 0 0 6 17883 5049 in REJECT again!
.
- stopped Linux SMP w. HT on [email protected] GHz
....................................
Folded since 10-06-04 till 09-2010
-
- Posts: 660
- Joined: Mon Oct 25, 2010 5:57 am
- Hardware configuration: a) Main unit
Sandybridge in HAF922 w/200 mm side fan
--i7 [email protected] GHz
--ASUS P8P67 DeluxeB3
--4GB ADATA 1600 RAM
--750W Corsair PS
--2Seagate Hyb 750&500 GB--WD Caviar Black 1TB
--EVGA 660GTX-Ti FTW - Signature 2 GPU@ 1241 Boost
--MSI GTX560Ti @900MHz
--Win7Home64; FAH V7.3.2; 327.23 drivers
b) 2004 HP a475c desktop, 1 core Pent 4 [email protected] GHz; Mem 2GB;HDD 160 GB;Zotac GT430PCI@900 MHz
WinXP SP3-32 FAH v7.3.6 301.42 drivers - GPU slot only
c) 2005 Toshiba M45-S551 laptop w/2 GB mem, 160GB HDD;Pent M 740 CPU @ 1.73 GHz
WinXP SP3-32 FAH v7.3.6 [Receiving Core A4 work units]
d) 2011 lappy-15.6"-1920x1080;i7-2860QM,2.5;IC Diamond Thermal Compound;GTX 560M 1,536MB u/c@700;16GB-1333MHz RAM;HDD:500GBHyb w/ 4GB SSD;Win7HomePrem64;320.18 drivers FAH 7.4.2ß - Location: Saratoga, California USA
Re: 171.64.65.64 overloaded
I've extracted selected columns from the server log for the 77 times between May 29 and today, June 22, when the CONNECT status was anything but Accepting (see the extract below).
As a "frequent flyer" for 171.64.65.64, I'm wondering whether there is something systemic that takes this server down so much - the server code, the hardware, the WUs that are loaded on it, or something else. Is that likely to be fixed anytime soon, or should we adjust our expectations and accept that it will be periodically off-line?
Since this server feeds WUs to very fast Fermi GPUs that come back to the well every couple of hours, there seems to be a lot of dead time while clients try to upload completed WUs and can't get new Fermi WUs from one of the other servers that is still up until X number of failed attempts to connect to 171.64.65.64.
I appreciate the members of PG, including Dr. Pande, giving updates in this thread. Maybe it's time for another brief one.
As a "frequent flyer" for 171.64.65.64, I'm wondering if there is something systemic that makes this server down so much - is it the server code, the hardware, the WUs that are loaded, or what. Is that likely to be fixed anytime soon? Or should we adjust our expectations that it will be periodically off-line?
Since this is serving WUs to very fast Fermi GPUs who come back to the well every couple of hours, there seems to be a lot of dead time while clients are trying to unload completed WUs, and can't get new WUs from one of the other servers for Fermi WUs that are still up until X number of failed attempts to connect to 171.64.65.64.
I'm appreciative of members of PG, including Dr. Pande, in giving updates in this thread. Maybe it's time for another brief update.
Re: 171.64.65.64 overloaded
by yslin » Sat May 21, 2011 1:49 pm
Hi,
I've been working on this server, but it might take more time to fix. Sorry for the inconvenience!
yslin
yslin
Pande Group Member
Re: 171.64.65.64 overloaded
by VijayPande » Sat May 21, 2011 3:56 pm
It's still having problems, so we're doing a hard reboot. The machine will likely fsck for a while. We'll give you an update when we know more.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
VijayPande
Pande Group Member
Code:
DATE SERVER IP WHO STATUS CONNECT CPU LOAD NET LOAD DL WUs AVAIL WUs to go WUs WAIT
Sun May 29 17:05:10 PDT 2011 171.64.65.64 lin5 full Reject 0.64 44 31 135700 135700 135700
Sun May 29 20:00:10 PDT 2011 171.64.65.64 lin5 full Reject 0.7 40 44 135596 135596 135596
Mon May 30 12:00:10 PDT 2011 171.64.65.64 lin5 full Reject 1.02 26 36 135674 135674 135674
Tue May 31 08:05:10 PDT 2011 171.64.65.64 lin5 full Reject 0.74 52 36 135886 135886 135886
Tue May 31 11:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.59 44 36 135978 135978 135978
Tue May 31 19:10:10 PDT 2011 171.64.65.64 lin5 full Reject 0.66 44 29 135973 135973 135973
Wed Jun 1 10:30:10 PDT 2011 171.64.65.64 lin5 full Reject 0.67 37 25 135771 135771 135771
Wed Jun 1 14:35:11 PDT 2011 171.64.65.64 lin5 full Reject 0.76 37 31 135856 135856 135856
Thu Jun 2 01:45:10 PDT 2011 171.64.65.64 lin5 full Reject 0.62 45 33 135894 135894 135894
Thu Jun 2 07:10:10 PDT 2011 171.64.65.64 lin5 full Reject 0.83 54 25 136006 136006 136006
Thu Jun 2 12:25:10 PDT 2011 171.64.65.64 - full DOWN - - 34 - - -
Thu Jun 2 20:40:10 PDT 2011 171.64.65.64 lin5 full Reject 0.42 43 33 135845 135845 135845
Fri Jun 3 04:30:10 PDT 2011 171.64.65.64 lin5 full Reject 0.59 69 27 135852 135852 135852
Fri Jun 3 16:10:10 PDT 2011 171.64.65.64 lin5 full Reject 0.67 80 30 135683 135683 135683
Sun Jun 5 08:55:10 PDT 2011 171.64.65.64 lin5 full Reject 0.61 52 22 135881 135881 135881
Sun Jun 5 14:45:10 PDT 2011 171.64.65.64 lin5 full Reject 0.97 33 26 135884 135884 135884
Sun Jun 5 17:05:10 PDT 2011 171.64.65.64 lin5 full Reject 0.87 19 30 135759 135759 135759
Mon Jun 6 05:35:10 PDT 2011 171.64.65.64 lin5 full Reject 1.02 0 30 135671 135671 135671
Mon Jun 6 09:45:10 PDT 2011 171.64.65.64 lin5 full Reject 1.08 0 23 135712 135712 135712
Mon Jun 6 15:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.6 52 32 135820 135820 135820
Mon Jun 6 22:00:10 PDT 2011 171.64.65.64 lin5 full Reject 0.99 34 31 0 0 0
Tue Jun 7 19:15:10 PDT 2011 171.64.65.64 lin5 full Reject 0.86 45 32 135975 135975 135975
Tue Jun 7 21:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.62 25 25 135738 135738 135738
Wed Jun 8 13:40:10 PDT 2011 171.64.65.64 lin5 full Reject 0.68 43 40 135959 135959 135959
Wed Jun 8 17:45:10 PDT 2011 171.64.65.64 lin5 full Reject 1.19 40 37 135953 135953 135953
Thu Jun 9 03:50:10 PDT 2011 171.64.65.64 lin5 full Reject 1.04 41 30 135867 135867 135867
Thu Jun 9 12:00:10 PDT 2011 171.64.65.64 lin5 full Reject 0.97 39 25 0 0 0
Thu Jun 9 17:15:10 PDT 2011 171.64.65.64 lin5 full Reject 0.92 61 20 135880 135880 135880
Thu Jun 9 19:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.87 41 20 135973 135973 135973
Thu Jun 9 21:55:10 PDT 2011 171.64.65.64 lin5 full Reject 1.03 46 25 135907 135907 135907
Thu Jun 9 23:40:11 PDT 2011 171.64.65.64 lin5 full Reject 1 58 29 135722 135722 135722
Fri Jun 10 03:55:10 PDT 2011 171.64.65.64 lin5 full Reject 1.09 102 29 135833 135833 135833
Fri Jun 10 14:25:10 PDT 2011 171.64.65.64 lin5 full Reject 0.98 47 20 136055 136055 136055
Fri Jun 10 19:05:10 PDT 2011 171.64.65.64 lin5 full Reject 0.62 47 21 136026 136026 136026
Fri Jun 10 21:25:10 PDT 2011 171.64.65.64 lin5 full Reject 0.66 41 28 135969 135969 135969
Sat Jun 11 08:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.56 58 18 136001 136001 136001
Sat Jun 11 12:40:10 PDT 2011 171.64.65.64 lin5 full Reject 0.96 42 21 135974 135974 135974
Sat Jun 11 20:15:10 PDT 2011 171.64.65.64 lin5 full Reject 1.01 75 25 0 0 0
Sun Jun 12 04:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.85 53 18 135859 135859 135859
Sun Jun 12 12:10:10 PDT 2011 171.64.65.64 lin5 full Reject 0.84 50 23 136084 136084 136084
Sun Jun 12 13:55:10 PDT 2011 171.64.65.64 lin5 full Reject 0.36 50 23 135951 135951 135951
Sun Jun 12 23:15:10 PDT 2011 171.64.65.64 lin5 full Reject 1.09 0 28 135780 135780 135780
Mon Jun 13 01:00:10 PDT 2011 171.64.65.64 lin5 full Reject 0.73 63 17 136218 136218 136218
Mon Jun 13 05:40:10 PDT 2011 171.64.65.64 lin5 full Reject 1.03 0 39 135797 135797 135797
Mon Jun 13 08:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.92 45 39 135854 135854 135854
Mon Jun 13 15:00:10 PDT 2011 171.64.65.64 lin5 full Reject 0.65 61 26 136054 136054 136054
Tue Jun 14 02:55:10 PDT 2011 171.64.65.64 lin5 full Reject 0.75 56 15 136042 136042 136042
Wed Jun 15 03:05:10 PDT 2011 171.64.65.64 lin5 standby Not Accept 0.61 41 25 135992 135992 135992
Wed Jun 15 05:25:10 PDT 2011 171.64.65.64 lin5 full Reject 0.89 21 20 135759 135759 135759
Wed Jun 15 08:20:10 PDT 2011 171.64.65.64 lin5 full Reject 0.74 89 23 135805 135805 135805
Wed Jun 15 10:40:10 PDT 2011 171.64.65.64 lin5 full Reject 0.7 69 23 135897 135897 135897
Thu Jun 16 06:55:10 PDT 2011 171.64.65.64 lin5 full Reject 0.91 57 11 119531 119531 119531
Thu Jun 16 13:55:10 PDT 2011 171.64.65.64 lin5 full Reject 0.73 41 15 112457 112457 112457
Thu Jun 16 15:40:10 PDT 2011 171.64.65.64 lin5 full Reject 0.89 81 16 0 0 0
Fri Jun 17 07:40:10 PDT 2011 171.64.65.64 lin5 full Reject 0.83 28 28 99541 99541 99541
Fri Jun 17 11:10:10 PDT 2011 171.64.65.64 lin5 full Reject 0.55 66 28 98062 98062 98062
Sat Jun 18 10:45:11 PDT 2011 171.64.65.64 lin5 full Reject 0.49 50 14 96491 96491 96491
Sat Jun 18 14:20:10 PDT 2011 171.64.65.64 lin5 full Reject 1.02 134 16 0 0 0
Sat Jun 18 16:05:10 PDT 2011 171.64.65.64 lin5 full Reject 0.44 38 17 96379 96379 96379
Sat Jun 18 21:55:10 PDT 2011 171.64.65.64 lin5 full Reject 1.01 31 19 0 0 0
Sun Jun 19 04:00:11 PDT 2011 171.64.65.64 lin5 full Reject 1.06 36 14 96604 96604 96604
Sun Jun 19 11:00:10 PDT 2011 171.64.65.64 lin5 full Reject 0.49 53 13 96562 96562 96562
Mon Jun 20 02:55:10 PDT 2011 171.64.65.64 lin5 standby Not Accept 1.49 58 19 96788 96788 96788
Mon Jun 20 04:45:10 PDT 2011 171.64.65.64 lin5 full Reject 1.08 0 12 96622 96622 96622
Mon Jun 20 05:20:11 PDT 2011 171.64.65.64 lin5 full Reject 0.84 45 12 96904 96904 96904
Mon Jun 20 07:05:11 PDT 2011 171.64.65.64 lin5 full Reject 0.97 305 12 0 0 0
Mon Jun 20 10:00:10 PDT 2011 171.64.65.64 lin5 full Reject 1 0 17 96553 96553 96553
Mon Jun 20 10:35:11 PDT 2011 171.64.65.64 lin5 full Reject 0.61 45 17 96740 96740 96740
Mon Jun 20 18:10:10 PDT 2011 171.64.65.64 lin5 full Reject 2.61 0 33 96348 96348 96348
Mon Jun 20 18:45:11 PDT 2011 171.64.65.64 lin5 full Reject 2.6 0 33 96348 96348 96348
Mon Jun 20 19:20:10 PDT 2011 171.64.65.64 lin5 full Reject 2.65 0 33 96348 96348 96348
Mon Jun 20 19:55:10 PDT 2011 171.64.65.64 lin5 full Reject 1.67 971 1 0 0 0
Mon Jun 20 22:15:10 PDT 2011 171.64.65.64 lin5 full Reject 0.31 68 1 96951 96951 96951
Tue Jun 21 14:20:10 PDT 2011 171.64.65.64 lin5 full Reject 2.02 479 9 0 0 0
Tue Jun 21 16:05:10 PDT 2011 171.64.65.64 lin5 full Reject 2.14 212 7 0 0 0
Wed Jun 22 09:05:10 PDT 2011 171.64.65.64 lin5 full Reject 1.92 0 6 96429 96429 96429
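In case anyone wants to reproduce the extract above, here is a minimal sketch of the kind of filter involved, assuming the status table has been saved as a plain-text file; the file name and the simple "does the line mention Accepting" test are my own assumptions, not an official tool.

Code:
import sys

# Minimal sketch: filter a plain-text dump of the server status log, keeping
# only the snapshots where the CONNECT status is anything but "Accepting".
# The file name "serverstat.txt" and one-snapshot-per-line layout are assumptions.
SERVER_IP = "171.64.65.64"

def non_accepting_lines(path):
    """Yield log lines for SERVER_IP whose CONNECT status is not 'Accepting'."""
    with open(path) as log:
        for line in log:
            line = line.strip()
            if SERVER_IP in line and "Accepting" not in line:
                yield line

if __name__ == "__main__":
    hits = list(non_accepting_lines(sys.argv[1] if len(sys.argv) > 1 else "serverstat.txt"))
    print(f"{len(hits)} snapshots with CONNECT != Accepting")
    for entry in hits:
        print(entry)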
Re: 171.64.65.64 overloaded
The Pande Group is aware of the problem. The ultimate fix is complex and will probably not happen quickly.
I don't think that taking the server off-line is what anybody wants to do.
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 270
- Joined: Sun Dec 02, 2007 2:26 pm
- Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+ - Location: Belgium, near the International Sea-Port of Antwerp
Re: 171.64.65.64 overloaded
Where is the redundancy?
- stopped Linux SMP w. HT on [email protected] GHz
....................................
Folded since 10-06-04 till 09-2010
-
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: 171.64.65.64 overloaded
We've continued to have trouble with the CS code, and Joe is in the process of overhauling it. The current code works well under medium loads but doesn't scale well; Joe's new scheme greatly simplifies how the CS works to help it scale better. He has been working on this for the last few weeks, which has slowed down his v7 client work, but in my mind this is a very high priority for situations like this one.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
-
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: 171.64.65.64 overloaded
PS: I've turned this server's weight down to try to help balance it with the other GPU servers.
Also, I've emailed Dr. Lin and asked her to push to get new projects going on a new server that has been assigned to her projects. That new server is much more powerful, so it can take a much greater load.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
Re: 171.64.65.64 overloaded
noorman wrote: Where is the redundancy?
There are three different answers to your question, depending on which problem you're referring to. The title of the topic says "overloaded," but a few posts up GreyWhiskers asked about the server status being something other than Accepting, and that's an entirely different question.
Redundancy for downloading new WUs comes from the Assignment Server sending requests for new work to another Work Server. This concept works fine when one work server is down or is out of work. In that regard, it's better for the server to take itself off-line rather than to be overloaded. You'll notice that there are other GPU servers, so that doesn't seem to be a problem. Dr. Pande's statement about adjusting the weights is important, too. This helps balance the load when there are several Work Servers with WUs available (helping to keep any one of them from being overloaded unless they're all overloaded). Adding more projects to a different Work Server helps, too, but can't be done as quickly.
When WUs are finished, they need to find their way back to the same server that assigned them. If that server is overloaded with uploads or is off-line, the upload redundancy comes from the Collection Servers. That's an entirely different question from redundancy for downloads, but it is clearly being worked on. Although many people may gripe about it, it's not really a problem from an individual's point of view, since there is no QRB (yet) for GPUs and all clients manage un-uploaded WUs by retrying later. Yes, this does delay the science, but that's a problem for the Pande Group, not for you. Please recognize that they need to address it based on PG priorities, not based on donor opinions.
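To illustrate the download-side idea described above (the Assignment Server steering clients toward Work Servers that are up and still have WUs, biased by their weights), here is a rough, hypothetical sketch. It is not the real Assignment Server code; the server entries, weights, and counts are invented for illustration only.

Code:
import random

# Hypothetical sketch of weighted work-server assignment with fallback.
# NOT the real Assignment Server code; entries, weights, and counts are invented.
work_servers = [
    {"name": "171.64.65.64",       "weight": 1, "accepting": False, "wus_avail": 96429},
    {"name": "WS-2 (placeholder)", "weight": 3, "accepting": True,  "wus_avail": 50000},
    {"name": "WS-3 (placeholder)", "weight": 2, "accepting": True,  "wus_avail": 12000},
]

def assign_work_server(servers):
    """Pick an accepting server that still has WUs, weighted by its assignment weight."""
    candidates = [s for s in servers if s["accepting"] and s["wus_avail"] > 0]
    if not candidates:
        return None  # nothing available right now; the client simply retries later
    weights = [s["weight"] for s in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

chosen = assign_work_server(work_servers)
print("assigned to:", chosen["name"] if chosen else "no server available, retry later")

Lowering one server's weight in a scheme like this is exactly the kind of adjustment that shifts more of the new assignments toward the other servers.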
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 270
- Joined: Sun Dec 02, 2007 2:26 pm
- Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+ - Location: Belgium, near the International Sea-Port of Antwerp
Re: 171.64.65.64 overloaded
.
Sorry, it delays the science, but it also leaves lots of systems sitting idle ...
That doesn't help anyone and is very inefficient (costs for the donors that bring nothing to anyone).
.
- stopped Linux SMP w. HT on [email protected] GHz
....................................
Folded since 10-06-04 till 09-2010
Re: 171.64.65.64 overloaded
No, it does NOT leave systems idle as long as there are other servers with WUs to assign -- and that appears to be the case.
Which of the three problems I mentioned are you having?
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 270
- Joined: Sun Dec 02, 2007 2:26 pm
- Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+ - Location: Belgium, near the International Sea-Port of Antwerp
Re: 171.64.65.64 overloaded
.
I'm not having trouble (I stopped F@H because energy prices are too high over here, plus my financial situation); my friend is.
He's not getting WUs and is unable to send back results on an almost regular basis.
That is a consequence of the overload (and of the server being in REJECT).
.
- stopped Linux SMP w. HT on [email protected] GHz
....................................
Folded since 10-06-04 till 09-2010
-
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: 171.64.65.64 overloaded
Sorry to hear you're still having problems. The server is not in REJECT and has been working pretty well today. Right now it only has 23 connections, which is pretty low. I wonder if there's something else going on other than the machine being loaded? Any chance it's an issue with your friend's ISP? At times, we see weird things with ISPs.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
-
- Posts: 660
- Joined: Mon Oct 25, 2010 5:57 am
- Hardware configuration: a) Main unit
Sandybridge in HAF922 w/200 mm side fan
--i7 [email protected] GHz
--ASUS P8P67 DeluxeB3
--4GB ADATA 1600 RAM
--750W Corsair PS
--2Seagate Hyb 750&500 GB--WD Caviar Black 1TB
--EVGA 660GTX-Ti FTW - Signature 2 GPU@ 1241 Boost
--MSI GTX560Ti @900MHz
--Win7Home64; FAH V7.3.2; 327.23 drivers
b) 2004 HP a475c desktop, 1 core Pent 4 [email protected] GHz; Mem 2GB;HDD 160 GB;Zotac GT430PCI@900 MHz
WinXP SP3-32 FAH v7.3.6 301.42 drivers - GPU slot only
c) 2005 Toshiba M45-S551 laptop w/2 GB mem, 160GB HDD;Pent M 740 CPU @ 1.73 GHz
WinXP SP3-32 FAH v7.3.6 [Receiving Core A4 work units]
d) 2011 lappy-15.6"-1920x1080;i7-2860QM,2.5;IC Diamond Thermal Compound;GTX 560M 1,536MB u/c@700;16GB-1333MHz RAM;HDD:500GBHyb w/ 4GB SSD;Win7HomePrem64;320.18 drivers FAH 7.4.2ß - Location: Saratoga, California USA
Re: 171.64.65.64 overloaded
I wanted to toss a few numbers out to show that, while the server in question has had a lot of downtime, the effect on my folding over the last month has been minimal. Hats off to PG and to the flexibility of the system.
I'm running one GTX 560Ti, still on v6, so I can, and do, track all of the WU-by-WU stats in the HFM WU history log. I typically run DatAdmin 3 to export the MySQL DB to a CSV file for massaging with Excel.
Bottom line: during the month of June, I have 9 instances out of 226 completed GPU WUs where the "turnaround time" (the time from the completion of one WU until the start of the next) was anomalous.
158 of the 226 June WUs were P6801, which is served by 171.64.65.64. In the last couple of weeks, more and more of the WUs have come from other projects.
96% of my WUs in June turned around in 10-30 seconds. The 9 anomalous instances range from 1:02 (mm:ss) to 17:35, with one outlier at over 4 hours. This is also the period in which so many of the 171.64.65.64 REJECT periods occurred.
The anomalous turnarounds were mostly correlated either with 171.64.65.64 REJECT periods or with heavy "WU Received" periods, as reported in the Server Stats. And some of those WU RCV loads have been heavy - see the charts at the bottom of this post.
The outlier occurred during one of the server REJECT periods. I can't remember exactly, but I may have stopped GPU folding for a while to play with my SMP settings.
My 96% quick-turnaround figure could have been a little better under v7, since v6 won't attempt to get a new WU until after several failed attempts to upload the just-completed one. v7 separates uploads from downloads, so if the assignment server knows not to assign me to the downed work server, I could possibly pick up a new WU sooner.
Bottom line, for me at least: good job, PG. The SYSTEM seems to be working well. Once I looked at the actual data, I was surprised how good the overall turnaround for my series of Core 16 Nvidia GPU projects was.
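For anyone curious, the gap calculation itself is simple once the history is exported to CSV. Here is a rough Python sketch of what the spreadsheet does; the file name, column names, and timestamp format are assumptions and would need to be adjusted to match the actual HFM export.

Code:
import csv
from datetime import datetime, timedelta

# Rough sketch of the finish-to-start gap calculation on a CSV export of the
# WU history. File name, column names ("CompletedTime", "DownloadTime"), and
# timestamp format are assumptions; adjust to the real export.
TS_FMT = "%Y-%m-%d %H:%M:%S"

def turnaround_gaps(path, threshold=timedelta(minutes=1)):
    """Return (prev_finish, next_start, gap) tuples for gaps larger than threshold."""
    with open(path, newline="") as f:
        rows = sorted(csv.DictReader(f), key=lambda r: r["DownloadTime"])
    anomalies = []
    for prev, cur in zip(rows, rows[1:]):
        finished = datetime.strptime(prev["CompletedTime"], TS_FMT)
        started = datetime.strptime(cur["DownloadTime"], TS_FMT)
        gap = started - finished
        if gap > threshold:
            anomalies.append((finished, started, gap))
    return anomalies

for finished, started, gap in turnaround_gaps("wu_history.csv"):
    print(f"{finished} -> {started}  gap {gap}")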
Code:
"anomalous" finish-to-start times
hh:mm:ss
00:05:09
00:04:53
00:13:04
00:12:34
00:01:02
04:13:30
00:01:59
00:02:36
00:17:25
While I was looking at the server stats, I ran a couple of Excel spreadsheets and charts. These two charts show how busy this server has been. No deep message or analysis here, just some interesting collateral information. The timescale is each individual half-hour update to the stats page as of a couple of days ago, when I pulled it: http://fah-web.stanford.edu/logs/171.64.65.64.log.html
NETLOAD tells how busy the server is, as measured by netstat (i.e., how many current connections the server is handling); see the small sketch after the charts for what such a count measures. Too many connections means that the server is heavily loaded. How many are "too many" depends on the server, but most of our servers can now handle a couple hundred connections without a problem. (Note: log scale.)
WUS RCVD shows how many WUs have been received since the last time the server's WUs were updated into the stats. This shows the relative number of WUs being received on the different servers (if all is fine), or which servers are not being entered into the stats DB if there is some problem. (Note: linear scale.)
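And just to make the NETLOAD column concrete: it is essentially a connection count of the sort netstat reports. The tiny sketch below is illustrative only (it is not how the stats page itself is generated) and assumes a Linux box with netstat installed.

Code:
import subprocess

# Illustrative only: count established TCP connections involving a given IP,
# roughly the kind of number a NETLOAD-style column reports. Not how the
# stats page is actually produced; it just shows what a connection count means.
SERVER_IP = "171.64.65.64"

def connection_count(ip):
    """Count ESTABLISHED connections to/from `ip` as reported by `netstat -tn`."""
    out = subprocess.run(["netstat", "-tn"], capture_output=True, text=True).stdout
    return sum(1 for line in out.splitlines()
               if ip in line and "ESTABLISHED" in line)

print(f"{SERVER_IP}: {connection_count(SERVER_IP)} established connections")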
-
- Posts: 270
- Joined: Sun Dec 02, 2007 2:26 pm
- Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+ - Location: Belgium, near the International Sea-Port of Antwerp
Re: 171.64.65.64 overloaded
VijayPande wrote: Sorry to hear you're still having problems. The server is not in REJECT and has been working pretty well today. Right now it only has 23 connections, which is pretty low. I wonder if there's something else going on other than the machine being loaded? Any chance it's an issue with your friend's ISP? At times, we see weird things with ISPs.
I'm quite sure my friend is perfectly capable of distinguishing local network (or ISP) problems from server problems.
He has many systems running, all with multiple-GPU setups.
He'd just like his systems to run F@H without near-regular interruptions because a finished WU (results) cannot be returned, because there is no new Work available, or because the server is overloaded.
This server in particular, serving Fermi Work, has to be a heavy-duty system because of the very fast turnaround of the Work coming from it.
If, in the future, Fermi hardware is used more fully, the rate of returned Work might increase further, which would load that server even more than it is already.
Another point: I'm sure a network or ISP problem at my friend's end could not explain a 'low' connection count on the named server ...
.
- stopped Linux SMP w. HT on [email protected] GHz
....................................
Folded since 10-06-04 till 09-2010
Re: 171.64.65.64 overloaded
noorman wrote: I'm sure a network or ISP problem at my friend's end could not explain a 'low' connection count on the named server ...
Minor technicality: 171.64.65.64 is NOT a named server as far as FAH is concerned. Yes, the server has a name, but FAH does not reference DNS; it uses the IP address.
GreyWhiskers wrote: WUS RCVD shows how many WUs have been received since the last time the server's WUs were updated into the stats. This shows the relative number of WUs being received on the different servers (if all is fine), or which servers are not being entered into the stats DB if there is some problem.
Minor question: how do you account for the WUs Received count being reset when the stats are uploaded to the stats server?
Posting FAH's log:
How to provide enough info to get helpful support.