108.11 & 108.21 GPU servers both down
108.11 & 108.21 GPU servers both down
As the title suggests, these GPU servers are both down, causing a large backlog of work awaiting upload.
Is there any update on this? Given the speed with which GPUs crunch proteins, surely the WU results will be overwritten soon, especially if the other GPU servers also start to give grief?
Teddy
Edit: Saw this on the news page; I guess that is the GPU servers?
One physical server (and several virtual interfaces) down for maintenance
We needed to take one physical server down, but that takes down several interfaces, including
vsp07, vsp07v, vsp07b, vsp17, vsp17v, vsp22, vsp22v
The machine is fscking now, so it may take a few hours for it to come back.
I guess that means a while still to go, as that was posted at 3:17pm, I assume local California time? So that was 8 hours ago; it is now 16:15 local time here in Canberra.
Re: 108.11 & 108.21 GPU servers both down
You know as much as anyone here, having read the news page. fscking takes a long time on large volumes.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
Re: 108.11 & 108.21 GPU servers both down
7im wrote: You know as much as anyone here, having read the news page. fscking takes a long time on large volumes.
I don't even know what fscking is; I assume it is some sort of rebuild, and HDDs only spin so fast...
Teddy
Re: 108.11 & 108.21 GPU servers both down
Teddy wrote: I don't even know what fscking is; I assume it is some sort of rebuild, and HDDs only spin so fast...
Depends on the context, but in *nix, it stands for "File System ChecK", i.e. fsck. It is the original chkdsk, I guess.
In a non-OS but geek context, it is also used to replace the word "fXXk" to get past filters. As in (and I don't mean this!), "Go fsck yourself".
Not a real doctor, I just play one on the 'net!
Re: 108.11 & 108.21 GPU servers both down
And because your collection server(s) all too often lack information that they should have, they reject the uploaded WU's that you allowed to go out anyway on a server you knew you were taking down.
Re: 108.11 & 108.21 GPU servers both down
Ragnar Dan wrote: And because your collection server(s) all too often lack information that they should have, they reject the uploaded WU's that you allowed to go out anyway on a server you knew you were taking down.
I understand your perspective, but I don't understand the concern. When a server cannot accept an upload, the WU remains on your local disk and will upload whenever it's possible to do so. If the servers had refused to issue WUs earlier, then by next week you would have completed one fewer WU. It's much better to have work to do even if it can't be uploaded immediately than to be idle because you can't get work.
In my experience (perhaps applicable here, perhaps not) when a server has to be taken off-line, you don't always know it in advance. Moreover, you almost NEVER know how long it will take to bring it back on-line. The fact that they had to check the filesystem either means that it had to do a forced shutdown or something bad had already happened to the filesystem, so I doubt they had enough time to stop assigning new work, even if there had been a reason to do that.
Posting FAH's log:
How to provide enough info to get helpful support.
Re: 108.11 & 108.21 GPU servers both down
I suppose my "perspective" was informed by the use of the terms "down for maintenance" in the headline on http://folding.typepad.com/'s April 27 entry. Once you know a server is going to be removed, can't you move its store of WU's to a different server so that when it is offline the client doesn't keep pounding on a dead IP address? If, however, the server shut down was unexpected, then I would hope that your system would have a central DB holding assigned WU's, the server that assigned them, and that could be read by the collection servers so that the error happening right now does not happen.
It may not mean too much right now for GPU clients, but since your SMP clients now give bonus points for how quickly A3 core WU's are returned, that can make a big difference if the problem occurs there.
Re: 108.11 & 108.21 GPU servers both down
"UPDATE 7am PST Apr 28 2010: This machine is still down and may be having trouble coming back from its restart. We will know more when the sysadmins report on this later today."
Bummer, looks like more of a wait now...
Re: 108.11 & 108.21 GPU servers both down
Will those WUs from the 21 and 11 servers still receive credit if their deadlines expire before the servers are back up?
Re: 108.11 & 108.21 GPU servers both down
Ragnar Dan wrote: Once you know a server is going to be removed, can't you move its store of WU's to a different server so that when it is offline the client doesn't keep pounding on a dead IP address?
One reason it takes about a day to do a fsck is that the RAID on those servers is HUGE. If you understood the magnitude of the FAH projects, the idea of moving all that data somewhere else temporarily wouldn't occur to you. Moving the data is a bigger task than running fsck. Of course you'd also have to shut down two servers for a day or two so that there wouldn't be any updates while the data was being moved, even if you had a spare server to move it to.
Re: 108.11 & 108.21 GPU servers both down
kevinksu wrote: Will those WUs from the 21 and 11 servers still receive credit if their deadlines expire before the servers are back up?
I don't know. The deadlines supposedly include an extra day to allow for contingencies like that. Stanford does not award bonuses if an SMP WU can't be returned. They do say they'll make a concerted effort to keep the servers operational.
The servers have been up for a couple of hours now.
Re: 108.11 & 108.21 GPU servers both down
Thinking about what I said in my post it's obvious you're right (I was forgetting about the size of SMP WU's for one thing), but there should be a way to make it so that a downed server doesn't cause disruption. Spend a few billion on SSD's, heh.
BTW, my WU finally went through about 57 minutes before your post time.
Re: 108.11 & 108.21 GPU servers both down
Ragnar Dan wrote: Thinking about what I said in my post it's obvious you're right (I was forgetting about the size of SMP WU's for one thing), but there should be a way to make it so that a downed server doesn't cause disruption. Spend a few billion on SSD's, heh. BTW, my WU finally went through about 57 minutes before your post time.
I thought we were talking about GPU WUs, but I'm not sure it matters that much. It's not so much the size of the WUs but the fact that Stanford needs ALL of them. Say a project has 50 runs, 250 clones, and is expected to run for 200 - 300 Gens (the actual numbers will vary). Say a server has 20 or 30 active projects, plus others that may have been completed but not yet analyzed, plus others being developed. You do the math.
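For anyone who wants to actually do the math, here is a rough back-of-the-envelope sketch in Python. It uses only the illustrative figures from this post (50 runs, 250 clones, 200 to 300 gens, 20 to 30 active projects) and assumes one WU result per run/clone/gen combination. The function name `wu_count` is made up for this example, and real projects will differ.

```python
def wu_count(runs, clones, gens, projects=1):
    """WU results to store, assuming one per (run, clone, gen) per project."""
    return runs * clones * gens * projects

# Illustrative figures from the post; actual project sizes vary.
per_project_low = wu_count(50, 250, 200)
per_project_high = wu_count(50, 250, 300)
server_low = wu_count(50, 250, 200, projects=20)
server_high = wu_count(50, 250, 300, projects=30)

print(f"per project: {per_project_low:,} to {per_project_high:,} WUs")
print(f"per server:  {server_low:,} to {server_high:,} WUs")
```

With these numbers that works out to roughly 2.5 to 3.75 million WUs per project, or 50 to 112.5 million per server, before counting the completed-but-unanalyzed projects and the ones in development.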
Re: 108.11 & 108.21 GPU servers both down
Yes it was about GPU's, but I was thinking about moving the WU's around and since I don't know much about what servers are doing what and how many are assigned to the various tasks, it just was a point of consideration I brought up to indicate my realization. I'm sure GPU's are data heavy too, especially since they generally turn around much faster.
And as your information evidences, there's obviously a lot to contend with.