Re: 171.64.65.56 is in Reject status
Posted: Tue Dec 02, 2008 9:19 pm
by VijayPande
ArVee wrote: .65.56 is in Reject mode again. I really have to ask how many times this needs to be reported before it sinks in that there may be a workload or balancing problem. I mean c'mon, this is beyond ineptitude and right into ridiculous. Why don't you just get to it and address this with an eye to a permanent solution?
This issue has been answered in another thread as well as a recent blog post (giving the full reason, timeline for update, etc). I can understand your frustration here, and I suggest that you check out the details of our plan:
http://folding.typepad.com/news/2008/11 ... asily.html
Re: 171.64.65.56 is in Reject status
Posted: Tue Dec 02, 2008 9:58 pm
by codysluder
VijayPande wrote:...The SMP and GPU servers will get it last, since they need additional code to bring the v5 code path up to spec with what the GPU and SMP needs. However, we expect this won't be too onerous to get done.
The high performance projects will get fixed last, and they're the ones with the shortest deadlines, so logically they should be first. How about setting up (or reserving) additional Collection Server resources so that the high performance WUs can be returned promptly even after the old binaries hang? The uniprocessor loads on the collection servers may not be huge, but uniprocessor WUs can tolerate additional delays more easily.
Also, how about some small SMP projects for those of us on dial-up?
Re: 171.64.65.56 is in Reject status
Posted: Wed Dec 03, 2008 4:20 am
by AgrFan
ArVee wrote: .65.56 is in Reject mode again. I really have to ask how many times this needs to be reported before it sinks in that there may be a workload or balancing problem. I mean c'mon, this is beyond ineptitude and right into ridiculous. Why don't you just get to it and address this with an eye to a permanent solution?
I'd suggest running the Windows SMP client until more work is available from this server. 171.64.65.64 is fully functional and has plenty of work available.
VijayPande wrote:This issue has been answered in another thread as well as a recent blog post (giving the full reason, timeline for update, etc). I can understand your frustration here, and I suggest that you check out the details of our plan:
http://folding.typepad.com/news/2008/11 ... asily.html
I really don't understand how faulty server code is the answer to this problem. This server was fully functional for quite some time when the 2605 WUs were readily available.
The big questions are a) why is this server low on work, b) why does work not get uploaded to a collection server when this server goes down, and c) why can't a temporary fix be implemented in the low-level HTTP code/library to stop the binaries from hanging when this server gets overloaded?
Re: 171.64.65.56 is in Reject status
Posted: Wed Dec 03, 2008 10:20 am
by bruce
AgrFan wrote:The big questions are a) why is this server low on work, b) why does work not get uploaded to a collection server when this server goes down, and c) why can't a temporary fix be implemented in the low-level HTTP code/library to stop the binaries from hanging when this server gets overloaded?
All servers make new WUs from the results that are returned. If a server is overloaded, it may be able to assign all the WUs (small data transfers) while not being able to accept the large data transfers associated with uploads. Also, projects do end. Then it takes human thought to learn what an old project told them and devise a project to answer the new questions that were discovered. Then new projects must be prepared and tested before they can be distributed. When an individual server is overloaded or down, there usually is redundancy provided by assigning work from other servers.
The collection servers are running at maximum capacity. I'm not sure when new server capacity will come on-line or how it will be allocated.
Vijay said "likely in the low-level HTTP code/library", which means they have not been able to fully identify why the binary hangs, so they also don't know exactly what to fix. Older versions of Linux contain bugs that are fixed in newer versions, so the best approach is to upgrade to a new version. (If anyone knows how to do the suggested temporary fix, let us know.)
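As a rough illustration only of what a "temporary fix" of this kind usually looks like in general: a receive deadline on each connection, so that one stalled transfer cannot block the process forever. This is a sketch in Python, not the actual server code (which has not been shown here); the function name recv_or_give_up and the timeout values are made up for the example.

```python
import socket

def recv_or_give_up(conn, nbytes, timeout_s):
    """Read up to nbytes from conn, or return None if nothing arrives in time.

    Hypothetical helper: the point is only that a blocking read gets a
    deadline instead of waiting indefinitely on a stalled peer.
    """
    conn.settimeout(timeout_s)
    try:
        return conn.recv(nbytes)
    except socket.timeout:
        # The peer sent nothing before the deadline; the caller can drop the
        # connection and move on instead of hanging.
        return None
```

Whether a deadline like this would actually cure the hang depends on where the binary is really stuck, which is exactly the part that has not been identified.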
BTW, the server appears to be running fine right now with a reasonably light load.
Re: 171.64.65.56 is in Reject status
Posted: Thu Dec 04, 2008 3:13 am
by AgrFan
bruce wrote:
All servers make new WUs from the results that are returned. If a server is overloaded, it may be able to assign all the WUs (small data transfers) while not being able to accept the large data transfers associated with uploads. Also, projects do end. Then it takes human thought to learn what an old project told them and devise a project to answer the new questions that were discovered. Then new projects must be prepared and tested before they can be distributed. When an individual server is overloaded or down, there usually is redundancy provided by assigning work from other servers.
It sounds like many of the 26xx projects may be ending soon with new projects coming online in the near future. This would explain the lack of work on this server. It would be helpful if Stanford gave periodic status updates of the projects being serviced by this server, specifically the A2 units since they are the units people are the most interested in for the higher point production. If it was known that A2 units are getting scarce, donors may be willing to run other clients temporarily until more A2 units are available. I stopped running the CPU client for the bonus Amber units when Vince Voelz posted in the forum that those projects were soon coming to an end and their priority was going to be lowered. I know Stanford doesn't want to regularly publicize this kind of information but as a donor it does make us feel like our contributions are valued and adds a nice touch to the whole folding experience.
Re: 171.64.65.56 is in Reject status
Posted: Thu Dec 04, 2008 8:39 pm
by kasson
The 26xx projects are far from monolithic (they represent a fairly broad range of different scientific projects that I'm working on; some are A2 and some are not). Looking specifically at projects 2668, 2669, 2671, 2673, and 2675, we have obtained a number of interesting results and will certainly provide details once the papers complete scientific review. We do anticipate more projects in this series, but it is always difficult to predict such scientific directions in advance.
Re: 171.64.65.56 is in Reject status
Posted: Fri Dec 05, 2008 1:17 am
by WickedPixie
kasson wrote: Looking specifically at projects 2668, 2669, 2671, 2673, and 2675, we have obtained a number of interesting results and will certainly provide details once the papers complete scientific review. We do anticipate more projects in this series, but it is always difficult to predict such scientific directions in advance.
According to the Psummary description,
These projects study how influenza virus recognizes and infects cells. We are developing new simulation methods to better understand these processes.
Are there any SMP projects, or Uniprocessor projects for that matter, doing AD & PD research?
I'd be glad to switch to Uniprocessor projects if it has something to do with AD & PD research...
Re: 171.64.65.56 is in Reject status
Posted: Tue Dec 16, 2008 10:52 pm
by bollix47
Tue Dec 16 14:40:21 PST 2008 171.64.65.56 SMP vspg4 kasson full Reject 2.14 63 12 17883 1877 0 2.01 144 144
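For anyone who wants to watch for this automatically, a line like the one above can be split on whitespace. The sketch below is only a guess at the layout: the field names (ip, client, name, owner, status) are assumptions read off this one line, and the trailing numeric columns are kept raw because their meaning is not stated in this thread.

```python
def parse_status_line(line):
    """Split one pasted server status line into loosely labeled fields.

    All labels are assumptions based on the single example line; the
    remaining numeric columns are returned unparsed in "rest".
    """
    tokens = line.split()
    return {
        "timestamp": " ".join(tokens[:6]),  # e.g. "Tue Dec 16 14:40:21 PST 2008"
        "ip": tokens[6],
        "client": tokens[7],
        "name": tokens[8],
        "owner": tokens[9],
        "status": tokens[11],               # token 10 ("full") is skipped here
        "rest": tokens[12:],
    }

line = ("Tue Dec 16 14:40:21 PST 2008 171.64.65.56 SMP vspg4 kasson "
        "full Reject 2.14 63 12 17883 1877 0 2.01 144 144")
info = parse_status_line(line)
in_reject = info["status"] == "Reject"
```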
Re: 171.64.65.56 is in Reject status
Posted: Tue Dec 16, 2008 11:19 pm
by kasson
It restarted at 14:52 and is now functioning.
Re: 171.64.65.56 is in Reject status
Posted: Tue Dec 16, 2008 11:57 pm
by bollix47
Thanks for the update.
It seems a pity when a WU that only takes 6 hours to complete has to wait another 2-3 hours (or 6 without intervention) to upload because a server has a problem and the collection server has no record of the WU.
Makes it difficult to comply with Pande Group's desire for setting up our computers to return WUs as fast as possible.
It seems that the auto restart for servers takes 2-3 hours to kick in. If that is the case perhaps the client's autosend feature could be changed from 6 hours to 3, at least with the SMP and GPU clients where a fast turnaround is expected?
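To put rough numbers on that, here is a small sketch. It assumes the first upload attempt fails right as the outage begins and that the client simply retries at a fixed interval; the real client's retry schedule may differ, and upload_delay_hours is just a name made up for the example.

```python
import math

def upload_delay_hours(outage_h, retry_interval_h):
    """Hours from the first failed upload to the first retry that finds the server up.

    Assumes fixed-interval retries starting at the moment the outage begins.
    """
    return math.ceil(outage_h / retry_interval_h) * retry_interval_h

# A 2.5 hour outage (server auto restart) under the two retry intervals:
delay_now = upload_delay_hours(2.5, 6)       # current 6 hour autosend
delay_proposed = upload_delay_hours(2.5, 3)  # proposed 3 hour autosend
```

Under those assumptions the finished WU sits for 6 hours with the current setting and 3 hours with the proposed one, which is a big difference for a WU that only took 6 hours to fold.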
Hopefully we'll see less of this as the server code is updated.
Re: 171.64.65.56 is in Reject status
Posted: Wed Dec 24, 2008 4:06 am
by 314159
171.64.65.56 is once again in Reject status.
Help!
Thanks,
John
Re: 171.64.65.56 is in Reject status
Posted: Wed Dec 24, 2008 4:06 am
by JadeMiner
171.64.65.56 SMP vspg4 kasson full Reject
Looks like this server is in reject status again.
Re: 171.64.65.56 is in Reject status
Posted: Wed Dec 24, 2008 5:21 am
by 314159
Hey JadeMiner,
Looks as if we were posting simultaneously.
The GOOD NEWS is that this server came back up about 2 minutes later (an early Xmas Gift, I guess).
I had another machine successfully complete, send, and receive a WU from this server just moments ago.
Thanks to whoever fixed this! (If it was a software reset, thanks to whoever coded that.)
An early Season's Greetings to ALL, especially to our good and dedicated friends at the Pande Group.
John
Re: 171.64.65.56 is in Reject status
Posted: Wed Dec 24, 2008 2:49 pm
by VijayPande
bollix47 wrote:
Hopefully we'll see less of this as the server code is updated.
Yes, further rolling out the new server code is a very high priority for 2009 (hopefully in January, but that will depend on QA for GPU2 and SMP server codes).
Re: 171.64.65.56 is in Reject status
Posted: Fri Jan 02, 2009 6:15 am
by 314159
This server has been relatively reliable recently.
It was in REJECT MODE for close to an hour but that had apparently been corrected a few minutes ago. (THANK YOU!)
My question:
While the server no longer shows REJECT and is pingable, a column on the serverstats page labeled "NMJ" is presently colored blue and has a "1" in it. This was the only server in this condition when I last looked.
May I ask what this "NMJ" status means? - hopefully not "no more jobs" -
<--sad - not mad
Thank you.
John