108.11 & 108.21 GPU servers both down

Teddy · Post by **Teddy** » Wed Apr 28, 2010 6:10 am

As the title suggests, these GPU servers are both down, causing a large backlog of work awaiting upload.
Is there any update on this? Given the speed with which GPUs crunch proteins, surely the WU results will be overwritten soon, especially if the other GPU servers also start to give grief?

Teddy

Edit: Saw this on the news page. I guess that is the GPU servers?

One physical server (and several virtual interfaces) down for maintenance

We needed to take one physical server down, but that takes down several interfaces, including

vsp07, vsp07v, vsp07b, vsp17, vsp17v, vsp22, vsp22v

The machine is fscking now, so it may take a few hours for it to come back.

I guess that means a while still to go, as that was posted at 3:17 pm, I assume local California time? So that was 8 hours ago; it is 16:15 local time here in Canberra.

7im · Post by **7im** » Wed Apr 28, 2010 6:40 am

You know as much as anyone here, having read the news page. fscking takes a long time on large volumes.

Teddy · Post by **Teddy** » Wed Apr 28, 2010 7:24 am

7im wrote:You know as much as anyone here, having read the news page. fscking takes a long time on large volumes.

I don't even know what fscking is; I assume it is some sort of rebuild, and HDDs only spin so fast...

Teddy

DrSpalding · Post by **DrSpalding** » Wed Apr 28, 2010 2:06 pm

Teddy wrote: I don't even know what fscking is; I assume it is some sort of rebuild, and HDDs only spin so fast...

Depends on the context, but in *nix, it stands for "File System ChecK", i.e. fsck. It is the original chkdsk I guess.

In a non-OS but geek context, it is also used to replace the word "fXXk" to get past filters. As in (and I don't mean this!), "Go fsck yourself".

Ragnar Dan · Post by **Ragnar Dan** » Wed Apr 28, 2010 4:38 pm

And because your collection server(s) all too often lack information that they should have, they reject the uploaded WU's that you allowed to go out anyway on a server you knew you were taking down.

Post by **bruce** » Wed Apr 28, 2010 4:56 pm

Ragnar Dan wrote:And because your collection server(s) all too often lack information that they should have, they reject the uploaded WU's that you allowed to go out anyway on a server you knew you were taking down.

I understand your perspective, but I don't understand the concern. When a server cannot accept an upload, the WU remains on your local disk and will upload whenever it's possible to do so. If the servers had refused to issue WUs earlier, by next week, you would have completed one less WU. It's much better to have work to do even if it can't be uploaded immediately than to be idle because you can't get work.

In my experience (perhaps applicable here, perhaps not) when a server has to be taken off-line, you don't always know it in advance. Moreover, you almost NEVER know how long it will take to bring it back on-line. The fact that they had to check the filesystem means either that it had to do a forced shutdown or that something bad had already happened to the filesystem, so I doubt they had enough time to stop assigning new work, even if there had been a reason to do that.
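
The "stays on local disk and uploads when possible" behaviour can be pictured as a simple retry loop. This is only a minimal sketch of the general idea, not the actual FAH client code: `upload_pending`, `try_upload`, and the retry parameters are hypothetical names made up for illustration.

```python
import time


def upload_pending(pending, try_upload, retry_delay=0.0, max_rounds=10, sleep=time.sleep):
    """Keep retrying uploads; return whatever is still waiting afterwards.

    pending     -- list of finished WUs kept locally (hypothetical representation)
    try_upload  -- callable returning True if the server accepted the WU
    """
    queue = list(pending)
    for _ in range(max_rounds):
        # Keep only the WUs the server did not accept this round.
        queue = [wu for wu in queue if not try_upload(wu)]
        if not queue:
            break
        sleep(retry_delay)  # wait before trying again
    return queue


# Fake server that is "down" for the first two attempts, then accepts uploads.
calls = {"n": 0}


def fake_upload(wu):
    calls["n"] += 1
    return calls["n"] > 2


left = upload_pending(["wu_a"], fake_upload)
print(left)
```

Nothing is lost while the server is down; the finished WU simply waits in the queue until an attempt succeeds.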

Ragnar Dan · Post by **Ragnar Dan** » Wed Apr 28, 2010 5:57 pm

I suppose my "perspective" was informed by the use of the term "down for maintenance" in the headline on http://folding.typepad.com/'s April 27 entry. Once you know a server is going to be removed, can't you move its store of WU's to a different server so that when it is offline the client doesn't keep pounding on a dead IP address? If, however, the server shutdown was unexpected, then I would hope that your system would have a central DB holding the assigned WU's and the servers that assigned them, which could be read by the collection servers so that the error happening right now does not happen.

It may not mean too much right now for GPU clients, but since your SMP clients now give bonus points for how quickly A3 core WU's are returned, that can make a big difference if the problem occurs there.

Teddy · Post by **Teddy** » Wed Apr 28, 2010 7:37 pm

"UPDATE 7am PST Apr 28 2010: This machine is still down and may be having trouble coming back from its restart. We will know more when the sysadmins report on this later today."

Bummer, looks like more of a wait now...

kevinksu · Post by **kevinksu** » Wed Apr 28, 2010 7:52 pm

Will those WUs on the 21 and 11 servers receive credit if their deadlines have expired by the time the servers are back up?

Post by **bruce** » Wed Apr 28, 2010 10:37 pm

Ragnar Dan wrote:Once you know a server is going to be removed, can't you move its store of WU's to a different server so that when it is offline the client doesn't keep pounding on a dead IP address?

One reason it takes about a day to do a fsck is because the RAID on those servers is HUGE. If you understood the magnitude of the FAH projects, the idea of moving all that data somewhere else temporarily wouldn't occur to you. Moving the data is a bigger task than running fsck. Of course you'd also have to shut down two servers for a day or two so that there wouldn't be any updates while the data was being moved, even if you had a spare server to move it to.

Post by **bruce** » Wed Apr 28, 2010 10:40 pm

kevinksu wrote:Will those WUs on the 21 and 11 servers receive credit if their deadlines have expired by the time the servers are back up?

I don't know. The deadlines supposedly include an extra day to allow for contingencies like that. Stanford does not award bonuses if an SMP WU can't be returned. They do say they'll make a concerted effort to keep the servers operational.

The servers have been up for a couple of hours now.

Ragnar Dan · Post by **Ragnar Dan** » Thu Apr 29, 2010 4:14 am

Thinking about what I said in my post it's obvious you're right (I was forgetting about the size of SMP WU's for one thing), but there should be a way to make it so that a downed server doesn't cause disruption. Spend a few billion on SSD's, heh.

BTW, my WU finally went through about 57 minutes before your post time.

Post by **bruce** » Thu Apr 29, 2010 7:32 am

Ragnar Dan wrote:Thinking about what I said in my post it's obvious you're right (I was forgetting about the size of SMP WU's for one thing), but there should be a way to make it so that a downed server doesn't cause disruption. Spend a few billion on SSD's, heh.

BTW, my WU finally went through about 57 minutes before your post time.

I thought we were talking about GPU WUs, but I'm not sure it matters that much. It's not so much the size of the WUs but the fact that Stanford needs ALL of them. Say a project has 50 runs, 250 clones, and is expected to run for 200 - 300 Gens (the actual numbers will vary). Say a server has 20 or 30 active projects, plus others that may have been completed but not yet analyzed and others still being developed. You do the math.
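
For anyone who wants the math spelled out, here is a rough sketch using only the illustrative figures above. It assumes one WU per run/clone/gen combination and counts only the active projects; the function names are made up for this example, and since no per-WU data size is given it counts WUs, not bytes.

```python
def wu_count(runs, clones, gens):
    """Number of WUs in one project, assuming one WU per run/clone/gen."""
    return runs * clones * gens


def server_wu_range(runs, clones, gens_range, projects_range):
    """Low and high WU estimates for one server's active projects."""
    low = wu_count(runs, clones, gens_range[0]) * projects_range[0]
    high = wu_count(runs, clones, gens_range[1]) * projects_range[1]
    return low, high


# Illustrative figures from the post: 50 runs, 250 clones,
# 200 to 300 gens, 20 to 30 active projects per server.
low, high = server_wu_range(50, 250, (200, 300), (20, 30))
print(f"{low:,} to {high:,} WUs")  # 50,000,000 to 112,500,000 WUs
```

Even at the low end that is tens of millions of WUs on one server, before counting the completed or in-development projects.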

Ragnar Dan · Post by **Ragnar Dan** » Fri Apr 30, 2010 3:41 am

Yes it was about GPU's, but I was thinking about moving the WU's around and since I don't know much about what servers are doing what and how many are assigned to the various tasks, it just was a point of consideration I brought up to indicate my realization. I'm sure GPU's are data heavy too, especially since they generally turn around much faster.

And as your information evidences, there's obviously a lot to contend with.

Folding Forum
