171.64.65.56

Moderators: Site Moderators, FAHC Science Team

weedacres
Posts: 138
Joined: Mon Dec 24, 2007 11:18 pm
Hardware configuration: UserNames: weedacres_gpu ...
Location: Eastern Washington

Re: 171.64.65.56

Post by weedacres »

The last few days have been particularly bad.
Between 171.67.108.25 and 171.64.65.56 returning 503s and only occasionally completing an upload, and -send rarely working, I've had to do a lot of restarts to get anything uploaded.
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am
Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5Ghz, 96GB G.Skill DDR3 1333Mhz Ubuntu 10.10
2 - Asus P6X58D-E i7 980X 4.4Ghz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
1 - Asus Rampage Gene III 17 970 4.3Ghz DDR3 2000 2-500GB Segate 7200.11 0-Raid Ubuntu 10.10
1 - Asus G73JH Laptop i7 740QM 1.86Ghz ATI 5870M

Re: 171.64.65.56

Post by Grandpa_01 »

Yes, they are having some pretty bad issues with this server. Most of the -smp WUs I complete take between 1 and 3 hours from the time I complete them for the server to accept them. I am sure Stanford will be doing something about the issue soon :? After all, the purpose of the smp and bonus points was to encourage us to get them back as fast as possible. I don't think this hurry-up-and-wait thing was part of the plan. :ewink:
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am
Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5Ghz, 96GB G.Skill DDR3 1333Mhz Ubuntu 10.10
2 - Asus P6X58D-E i7 980X 4.4Ghz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
1 - Asus Rampage Gene III 17 970 4.3Ghz DDR3 2000 2-500GB Segate 7200.11 0-Raid Ubuntu 10.10
1 - Asus G73JH Laptop i7 740QM 1.86Ghz ATI 5870M

Re: 171.64.65.56

Post by Grandpa_01 »

bruce wrote:
FlipBack wrote:My work unit has finally been returned successfully. After like 22 hours. Oh well, at least it is in. Sounds like the server just has a high load or something...
Something like that. The Net Load is around 200, which is probably the limit that it can handle. . . . too many people all hitting that same server at the same time.

I suppose everybody is running Langouste. Doesn't it increase the number of connections that each person has? The standard client uses one at a time but I think now folks are using three and the server can't handle it. One gets aborted but the server probably doesn't realize that, a second is started for an immediate download, and a third for the upload.
I looked at the server log after reading your comment about Langouste. The log only goes back to the 7th, but the server has been pretty busy since then. If it is Langouste that is causing this problem, perhaps they can figure out a way to bring the ban hammer down on those who are using it.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 171.64.65.56

Post by bruce »

Grandpa_01 wrote:I looked at the server log after reading your comment about Langouste. The log only goes back to the 7th, but the server has been pretty busy since then. If it is Langouste that is causing this problem, perhaps they can figure out a way to bring the ban hammer down on those who are using it.
I do not know if it's fair to call this the cause. It's strictly speculation on my part about what might have changed. If it does happen to be the cause, it seems a shame: saving the relatively small amount of time required to upload a WU before downloading a new one is hardly worth the congestion that we see and the extra hassle that it's causing everyone.

By default, the client tries to upload once or twice when the WU is finished. If those attempts happen to fail, the client usually does not retry for 6 hours. Why was the client designed that way? So that people do not contribute to an overloaded server. Under those conditions, I've seen reports of people repeatedly restarting their client (with or without -send all) to try to force their WU to upload. This DOES contribute to a server overload. Each person who intentionally attempts to jump to the front of the line not only makes others wait longer, but also adds to the turbulence in the line, which slows others down even more. I can understand selfish behavior, but that doesn't mean I condone it.
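
To put rough numbers on that point, here is a minimal sketch comparing how many upload attempts reach a congested server under the default retry policy versus repeated manual restarts. The 6-hour interval and the "once or twice" initial attempts come from the description above; the donor count, the length of the outage, the 15-minute restart interval, and the function names are all hypothetical, chosen only for illustration.

```python
RETRY_INTERVAL_HOURS = 6  # default retry spacing described above


def attempts_default(hours_stuck, initial_attempts=2):
    """Upload attempts by a client left alone while the server refuses uploads."""
    return initial_attempts + int(hours_stuck // RETRY_INTERVAL_HOURS)


def attempts_with_restarts(hours_stuck, restart_every_hours, attempts_per_restart=1):
    """Upload attempts by a donor who restarts the client at a fixed interval.

    Assumes (hypothetically) that every restart triggers one upload attempt.
    """
    restarts = int(hours_stuck // restart_every_hours)
    return attempts_per_restart * (1 + restarts)


if __name__ == "__main__":
    donors = 100  # hypothetical number of donors holding a stuck WU
    hours = 24    # hypothetical length of the congestion

    patient = donors * attempts_default(hours)
    impatient = donors * attempts_with_restarts(hours, restart_every_hours=0.25)

    print(f"default retry policy: {patient} attempts")
    print(f"restart every 15 min: {impatient} attempts")
```

Under these assumed numbers the restarting donors generate many times the attempts of the patient ones, which is the "turbulence in the line" effect; the actual ratio depends entirely on how often people really restart.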

Hopefully Stanford will stop assigning any new work from that server until most of the backlog of uploads can be accepted -- and then assign new work very sparingly. I have no idea whether that will mean that folks can't get their favorite project, but even if it does, it would at least be a small improvement.
weedacres
Posts: 138
Joined: Mon Dec 24, 2007 11:18 pm
Hardware configuration: UserNames: weedacres_gpu ...
Location: Eastern Washington

Re: 171.64.65.56

Post by weedacres »

From what I saw of Langouste, it does not try to send data any more than the client itself. If it can't upload, it waits for the client to try again in 6 hours. Its only function, from what I could see, was to immediately download a new work unit while trying to upload the last. It uses the -send command to accomplish this, and with -send being so unreliable I ended up with more stuck work units than without it, so I stopped using it.

The problem is that often after 6 hours the server is no more able to receive the now-aging files than it was when they were completed. PG's bonus system encourages people to get the data in as soon as it's completed. If they can't handle the load then perhaps they should change something.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 171.64.65.56

Post by bruce »

weedacres wrote:From what I saw of Langouste, it does not try to send data any more than the client itself. If it can't upload, it waits for the client to try again in 6 hours. Its only function, from what I could see, was to immediately download a new work unit while trying to upload the last. It uses the -send command to accomplish this, and with -send being so unreliable I ended up with more stuck work units than without it, so I stopped using it.

The problem is that often after 6 hours the server is no more able to receive the now-aging files than it was when they were completed. PG's bonus system encourages people to get the data in as soon as it's completed. If they can't handle the load then perhaps they should change something.
The every-6-hours bit is only part of the problem. Doesn't it (A) abort a connection (which the server probably waits an hour to close), (B) start a download, and (C) start an upload? By my count that's three connections, whereas the original client would only be using one.

I'm not saying I'm sure of how it works, but it does open more connections, and if that's clogging up an over-committed server, it's a problem. Sure, you could do the same thing manually, but either of those actions contributes to congestion that really needs to be reduced. I have the same problem with people who restart their client unnecessarily.

I have no doubt that Stanford will figure out how to manage the server better, but that's probably not going to happen until tomorrow, and the real question is whether we, as a community, continue to fight over the over-committed resources or if we collectively decide to help each other between now and then.
Hyperlife
Posts: 192
Joined: Sun Dec 02, 2007 7:38 am

Re: 171.64.65.56

Post by Hyperlife »

I'm sure tear will jump in here, but from my experience with it, Langouste doesn't appear to open any more connections than the standard client does.

The main client's initial connection to upload a completed WU never makes it to the server because Langouste interrupts it (by using a proxy port, I believe). The forked client then creates the first (and only) upload connection while the main client moves on to request a new WU. The request gets sent to an assignment server initially, and then to a work server which may or may not be the same work server that's receiving the WU from the forked client.

If the forked client fails to upload the WU, then Langouste does nothing more until the next WU is completed -- the main client will then retry every six hours as usual.

My guess is that Langouste is not causing the problem here. More likely it's from people manually restarting their clients instead of waiting for the six-hour retry -- they don't want to lose bonus points by waiting that long. They now have an incentive to hit the server over and over again for the faster return.

People who haven't actually used Langouste or looked at the source code shouldn't be speculating about its role in this problem.
weedacres
Posts: 138
Joined: Mon Dec 24, 2007 11:18 pm
Hardware configuration: UserNames: weedacres_gpu ...
Location: Eastern Washington

Re: 171.64.65.56

Post by weedacres »

Hyperlife wrote:I'm sure tear will jump in here, but from my experience with it, Langouste doesn't appear to open any more connections than the standard client does.

The main client's initial connection to upload a completed WU never makes it to the server because Langouste interrupts it (by using a proxy port, I believe). The forked client then creates the first (and only) upload connection while the main client moves on to request a new WU. The request gets sent to an assignment server initially, and then to a work server which may or may not be the same work server that's receiving the WU from the forked client.

If the forked client fails to upload the WU, then Langouste does nothing more until the next WU is completed -- the main client will then retry every six hours as usual.

My guess is that Langouste is not causing the problem here. More likely it's from people manually restarting their clients instead of waiting for the six-hour retry -- they don't want to lose bonus points by waiting that long. They now have an incentive to hit the server over and over again for the faster return.

People who haven't actually used Langouste or looked at the source code shouldn't be speculating about its role in this problem.
Hyperlife's description matches what I've observed.
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am
Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5Ghz, 96GB G.Skill DDR3 1333Mhz Ubuntu 10.10
2 - Asus P6X58D-E i7 980X 4.4Ghz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
1 - Asus Rampage Gene III 17 970 4.3Ghz DDR3 2000 2-500GB Segate 7200.11 0-Raid Ubuntu 10.10
1 - Asus G73JH Laptop i7 740QM 1.86Ghz ATI 5870M

Re: 171.64.65.56

Post by Grandpa_01 »

So from what I am reading in the descriptions of what Langouste is doing, it is interrupting the upload attempt and making the connection. That is 1 connection per FAH client. Then the FAH client makes a connection to download the next WU; that appears to be 2 connections. So if my math is right, 100 people running Langouste will use as many connections as 200 people not using it. Somehow in my mind it seems Langouste could very easily be contributing to the problem.
Hyperlife
Posts: 192
Joined: Sun Dec 02, 2007 7:38 am

Re: 171.64.65.56

Post by Hyperlife »

Grandpa_01 wrote:So from what I am reading in the descriptions of what Langouste is doing, it is interrupting the upload attempt and making the connection. That is 1 connection per FAH client. Then the FAH client makes a connection to download the next WU; that appears to be 2 connections. So if my math is right, 100 people running Langouste will use as many connections as 200 people not using it. Somehow in my mind it seems Langouste could very easily be contributing to the problem.
Which is no different than a client running without Langouste. Every client will make two connections (actually 3, if you count the assignment server connection) during a WU-return-and-request cycle; a client running with Langouste merely makes the second (WU download) connection happen a few minutes earlier than before. The second connection is also completed in a few seconds, so the two connections are hardly equivalent in server processor and/or net load cost.
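
As a rough illustration of that count, the sketch below tallies the server connections in one WU-return-and-request cycle for each setup. The connection lists only restate the behavior as described in this thread; they are not taken from the Langouste source code, and the names used are hypothetical. The tally also says nothing about how long the server keeps an aborted connection open.

```python
from collections import Counter

# One WU-return-and-request cycle for a standard client, as described above.
STANDARD_CLIENT = [
    ("work server", "upload result"),
    ("assignment server", "request assignment"),
    ("work server", "download new WU"),
]

# With Langouste (per the description in this thread): the main client's
# upload is intercepted locally and never reaches the server, so the forked
# client makes the only upload connection. The download simply starts earlier.
LANGOUSTE_CLIENT = [
    ("work server", "upload result (forked client)"),
    ("assignment server", "request assignment"),
    ("work server", "download new WU"),
]


def connections_per_server(sequence):
    """Count connections per server type in one cycle.

    The upload and download may go to different work servers, so the
    "work server" total is an upper bound for any single machine.
    """
    return Counter(server for server, _action in sequence)


if __name__ == "__main__":
    for name, seq in (("standard", STANDARD_CLIENT), ("langouste", LANGOUSTE_CLIENT)):
        print(name, dict(connections_per_server(seq)))
```

If the description is accurate, both tallies come out the same; what differs is the timing of the download, not the number of connections.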

Langouste was released nearly a year ago in September 2009. Don't you think that every work server would have experienced high net load from day one if Langouste was the problem?
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: 171.64.65.56

Post by 7im »

Server issues have been ongoing, one might even say since late 2009. And Langouste for Windows was just recently released, potentially increasing the problems we are seeing.

See, Hyperlife, it's just as easy to infer that Langouste IS the problem as it was for you to infer it ISN'T.

I'm not taking sides. Unsupported conjecture is not helpful, for or against. Until someone can state authoritatively how Langouste does or does not make connections (and how many), and how Stanford's servers do or do not timeout those connections, we're not helping the situation by pointing fingers back and forth.

Gather facts first. Then if any finger pointing or defending needs to be done, it can be.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
toTOW
Site Moderator
Posts: 6359
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: 171.64.65.56

Post by toTOW »

This is the server for p670x ... many people must be happy to not get them :roll:

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
Hyperlife
Posts: 192
Joined: Sun Dec 02, 2007 7:38 am

Re: 171.64.65.56

Post by Hyperlife »

7im wrote:Gather facts first. Then if any finger pointing or defending needs to be done, it can be.
I've used Langouste since soon after release, and I've looked at the source code. Have you?

Please follow your own advice. I've already done so.
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: 171.64.65.56

Post by 7im »

Hyperlife wrote:
7im wrote:Gather facts first. Then if any finger pointing or defending needs to be done, it can be.
I've used Langouste since soon after release, and I've looked at the source code. Have you?

Please follow your own advice. I've already done so.

My own advice is to not point fingers, and I have followed that advice.

Since you seem to be an authority on Langouste, please break it down for us and describe how the connections are made, and in what order, to what parts of Stanford or local ports, etc. Thanks.
AtwaterFS
Posts: 30
Joined: Wed Jan 21, 2009 9:08 pm

Re: 171.64.65.56

Post by AtwaterFS »

Hyperlife wrote: My guess is that Langouste is not causing the problem here. More likely it's from people manually restarting their clients instead of waiting for the six-hour retry -- they don't want to lose bonus points by waiting that long. They now have an incentive to hit the server over and over again for the faster return.
I think the problem is the whole bonus scheme - it's pretty obnoxious... It's a "thumbed nose" to part-time folders and in conjunction w/ buggy SMP3 has led to a LOT of aggravation.

Bonus scheme -> emphasis on faster results and full-time folding -> more results produced -> increased strain on infra -> downtime and pulled projects -> loss of points due to bonus system -> donor aggro...