108.11 & 108.21 GPU servers both down


Teddy
Posts: 134
Joined: Tue Feb 12, 2008 3:05 am
Location: Canberra, Australia

108.11 & 108.21 GPU servers both down

Post by Teddy »

As the title suggests, these GPU servers are both down, causing a large backlog of work awaiting upload.
Is there any update on this? With the speed at which GPUs crunch proteins, surely the WU results will be overwritten soon, especially if the other GPU servers also start to give grief...?

Teddy


Edit: Saw this on the news page. I guess that is the GPU servers?

One physical server (and several virtual interfaces) down for maintenance

We needed to take one physical server down, but that takes down several interfaces, including

vsp07, vsp07v, vsp07b, vsp17, vsp17v, vsp22, vsp22v

The machine is fscking now, so it may take a few hours for it to come back.


I guess that means a while still to go, as that was posted at 3:17 pm, I assume local California time? So that was 8 hrs ago; it is 16:15 local time here in Canberra.
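
The time arithmetic above can be checked with a short Python sketch. It assumes the news item went up on April 27, 2010 at 3:17 pm Pacific Daylight Time (UTC-7), that Canberra was on AEST (UTC+10), and that the 16:15 Canberra time falls on April 28. The fixed offsets and the dates are assumptions, not a time zone database lookup.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets (no tz database lookup): PDT = UTC-7, AEST = UTC+10.
PDT = timezone(timedelta(hours=-7), "PDT")
AEST = timezone(timedelta(hours=10), "AEST")

# News item timestamp, assumed to be California local time on April 27, 2010.
posted = datetime(2010, 4, 27, 15, 17, tzinfo=PDT)
posted_canberra = posted.astimezone(AEST)

# Canberra time quoted in the post, assumed to be April 28.
now_canberra = datetime(2010, 4, 28, 16, 15, tzinfo=AEST)
elapsed = now_canberra - posted_canberra

print(posted_canberra.strftime("%Y-%m-%d %H:%M %Z"))
print(elapsed)
```

Under those assumptions the news item appeared at 08:17 Canberra time, a little under 8 hours before 16:15.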
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicone fan gaskets and silicone mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: 108.11 & 108.21 GPU servers both down

Post by 7im »

You know as much as anyone here, having read the news page. fscking takes a long time on large volumes.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
Teddy
Posts: 134
Joined: Tue Feb 12, 2008 3:05 am
Location: Canberra, Australia

Re: 108.11 & 108.21 GPU servers both down

Post by Teddy »

7im wrote: You know as much as anyone here, having read the news page. fscking takes a long time on large volumes.
I don't even know what fscking is; I assume it is some sort of rebuild, & HDDs only spin so fast...


Teddy
DrSpalding
Posts: 136
Joined: Wed May 27, 2009 4:48 pm
Hardware configuration: Dell Studio 425 MTS-Core i7-920 c0 stock
evga SLI 3x o/c Core i7-920 d0 @ 3.9GHz + nVidia GTX275
Dell 5150 + nVidia 9800GT

Re: 108.11 & 108.21 GPU servers both down

Post by DrSpalding »

Teddy wrote: I don't even know what fscking is; I assume it is some sort of rebuild, & HDDs only spin so fast...
Depends on the context, but in *nix, it stands for "File System ChecK", i.e. fsck. It is the original chkdsk I guess. ;)

In a non-OS but geek context, it is also used to replace the word "fXXk" to get past filters. As in (and I don't mean this!), "Go fsck yourself".
Not a real doctor, I just play one on the 'net!
Ragnar Dan
Posts: 52
Joined: Fri Dec 07, 2007 3:21 am
Location: U.S. (TechReport.com's Team 2630)

Re: 108.11 & 108.21 GPU servers both down

Post by Ragnar Dan »

And because your collection server(s) all too often lack information that they should have, they reject the uploaded WU's that you allowed to go out anyway on a server you knew you were taking down.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 108.11 & 108.21 GPU servers both down

Post by bruce »

Ragnar Dan wrote: And because your collection server(s) all too often lack information that they should have, they reject the uploaded WU's that you allowed to go out anyway on a server you knew you were taking down.
I understand your perspective, but I don't understand the concern. When a server cannot accept an upload, the WU remains on your local disk and will upload whenever it's possible to do so. If the servers had refused to issue WUs earlier, by next week, you would have completed one less WU. It's much better to have work to do even if it can't be uploaded immediately than to be idle because you can't get work.

In my experience (perhaps applicable here, perhaps not) when a server has to be taken off-line, you don't always know it in advance. Moreover, you almost NEVER know how long it will take to bring it back on-line. The fact that they had to check the filesystem either means that it had to do a forced shutdown :( or something bad had already happened to the filesystem, :( :( so I doubt they had enough time to stop assigning new work, even if there had been a reason to do that.
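
The "stays on disk and uploads later" behaviour described above can be pictured with a minimal Python sketch. This is purely illustrative and is not the FAH client's actual code; `retry_uploads`, `fake_upload`, and the WU names are hypothetical.

```python
from typing import Callable, List


def retry_uploads(pending: List[str], try_upload: Callable[[str], bool]) -> List[str]:
    """Attempt every pending result once; return the ones still waiting."""
    still_pending = []
    for wu in pending:
        # A failed upload is not fatal here: the result simply stays queued.
        if not try_upload(wu):
            still_pending.append(wu)
    return still_pending


# Hypothetical stand-in for a server that is down first and back later.
server_up = False


def fake_upload(wu: str) -> bool:
    return server_up


queue = ["wu_a", "wu_b"]
queue = retry_uploads(queue, fake_upload)        # server down: nothing leaves the queue
server_up = True
queue_after = retry_uploads(queue, fake_upload)  # server back: the queue drains

print(queue, queue_after)
```

The point of the sketch is only that finished work is kept until an upload succeeds, so an outage delays results rather than discarding them.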
Ragnar Dan
Posts: 52
Joined: Fri Dec 07, 2007 3:21 am
Location: U.S. (TechReport.com's Team 2630)

Re: 108.11 & 108.21 GPU servers both down

Post by Ragnar Dan »

I suppose my "perspective" was informed by the use of the terms "down for maintenance" in the headline on http://folding.typepad.com/'s April 27 entry. Once you know a server is going to be removed, can't you move its store of WU's to a different server so that when it is offline the client doesn't keep pounding on a dead IP address? If, however, the server shut down was unexpected, then I would hope that your system would have a central DB holding assigned WU's, the server that assigned them, and that could be read by the collection servers so that the error happening right now does not happen.

It may not mean too much right now for GPU clients, but since your SMP clients now give bonus points for how quickly A3 core WU's are returned, that can make a big difference if the problem occurs there.
Teddy
Posts: 134
Joined: Tue Feb 12, 2008 3:05 am
Location: Canberra, Australia

Re: 108.11 & 108.21 GPU servers both down

Post by Teddy »

"UPDATE 7am PST Apr 28 2010: This machine is still down and may be having trouble coming back from its restart. We will know more when the sysadmins report on this later today."

Bummer, looks like more of a wait now...
kevinksu
Posts: 1
Joined: Tue Feb 02, 2010 2:41 pm
Hardware configuration: 5_Stars_Generals

Re: 108.11 & 108.21 GPU servers both down

Post by kevinksu »

Will those WUs on the 21 and 11 servers still receive credit if their deadlines have expired by the time the servers are back up?
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 108.11 & 108.21 GPU servers both down

Post by bruce »

Ragnar Dan wrote: Once you know a server is going to be removed, can't you move its store of WU's to a different server so that when it is offline the client doesn't keep pounding on a dead IP address?
One reason it takes about a day to do a fsck is because the RAID on those servers is HUGE. If you understood the magnitude of the FAH projects, the idea of moving all that data somewhere else temporarily wouldn't occur to you. Moving the data is a bigger task than running fsck. Of course you'd also have to shut down two servers for a day or two so that there wouldn't be any updates while the data was being moved, even if you had a spare server to move it to.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 108.11 & 108.21 GPU servers both down

Post by bruce »

kevinksu wrote: Will those WUs on the 21 and 11 servers still receive credit if their deadlines have expired by the time the servers are back up?
I don't know. The deadlines supposedly include an extra day to allow for contingencies like that. Stanford does not award bonuses if an SMP WU can't be returned. They do say they'll make a concerted effort to keep the servers operational.

The servers have been up for a couple of hours now.
Ragnar Dan
Posts: 52
Joined: Fri Dec 07, 2007 3:21 am
Location: U.S. (TechReport.com's Team 2630)

Re: 108.11 & 108.21 GPU servers both down

Post by Ragnar Dan »

Thinking about what I said in my post it's obvious you're right (I was forgetting about the size of SMP WU's for one thing), but there should be a way to make it so that a downed server doesn't cause disruption. Spend a few billion on SSD's, heh. ;)

BTW, my WU finally went through about 57 minutes before your post time.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 108.11 & 108.21 GPU servers both down

Post by bruce »

Ragnar Dan wrote: Thinking about what I said in my post it's obvious you're right (I was forgetting about the size of SMP WU's for one thing), but there should be a way to make it so that a downed server doesn't cause disruption. Spend a few billion on SSD's, heh. ;)

BTW, my WU finally went through about 57 minutes before your post time.
I thought we were talking about GPU WUs, but I'm not sure it matters that much. It's not so much the size of the WUs but the fact that Stanford needs ALL of them. Say a project has 50 runs, 250 clones, and is expected to run for 200 - 300 Gens (the actual numbers will vary). Say a server has 20 or 30 active projects, plus others that may have been completed but not yet analyzed, and others being developed. You do the math.
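
For anyone who wants the math spelled out, here is a rough Python sketch using only the example figures above. It assumes each run/clone/gen combination corresponds to one WU, and it ignores the completed and in-development projects, so treat the totals as a lower-bound illustration rather than real server numbers.

```python
# Example figures from the post (actual numbers will vary).
runs, clones = 50, 250
gens_low, gens_high = 200, 300
projects_low, projects_high = 20, 30

# Assumption: one WU per (run, clone, gen) combination.
trajectories = runs * clones                       # independent run/clone pairs
wus_per_project_low = trajectories * gens_low
wus_per_project_high = trajectories * gens_high

# Active projects only; completed and in-development ones are not counted.
server_low = wus_per_project_low * projects_low
server_high = wus_per_project_high * projects_high

print(f"{trajectories:,} trajectories per project")
print(f"{wus_per_project_low:,} to {wus_per_project_high:,} WUs per project")
print(f"{server_low:,} to {server_high:,} WUs per server")
```

With these figures that works out to 12,500 trajectories and 2.5 to 3.75 million WUs per project, or roughly 50 to 112.5 million WUs on one server.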
Ragnar Dan
Posts: 52
Joined: Fri Dec 07, 2007 3:21 am
Location: U.S. (TechReport.com's Team 2630)

Re: 108.11 & 108.21 GPU servers both down

Post by Ragnar Dan »

Yes it was about GPU's, but I was thinking about moving the WU's around and since I don't know much about what servers are doing what and how many are assigned to the various tasks, it just was a point of consideration I brought up to indicate my realization. I'm sure GPU's are data heavy too, especially since they generally turn around much faster.

And as your information evidences, there's obviously a lot to contend with.