out of work?????

Posted: Sun Mar 15, 2020 7:00 pm
by Forcinghavok
Yeah, for the past few days the GPU slot was waiting on work, and now both the CPU and GPU slots are waiting for work. It also took about an hour for the work to upload. Any idea what is going on? Do we know when work will be available again?

Thanks

Re: out of work?????

Posted: Sun Mar 15, 2020 7:08 pm
by Forcinghavok
Sorry, I missed the forum above where it mentioned the server outages and running out of work. I will patiently wait for work :D

Re: out of work?????

Posted: Mon Mar 16, 2020 7:24 pm
by Forcinghavok
Wow, it is unbelievable that we have so many new donors that our computers can process the work faster than the servers can supply it. Just incredible!! I wish all these people folded like this full time :D

Re: out of work?????

Posted: Mon Mar 16, 2020 7:25 pm
by Nathan_P
Soon. The servers are issuing work but, as you said, there are too many donors for the infrastructure to keep up. New servers are on the way, as well as other enhancements, so please bear with us.

Re: out of work?????

Posted: Mon Mar 16, 2020 7:44 pm
by scerbera
Has anyone thought of asking someone like Google, Microsoft, or similar to help by providing more servers? Surely someone would be interested.

Re: out of work?????

Posted: Mon Mar 16, 2020 7:47 pm
by Forcinghavok
That's a great idea, I am sure some of those companies would be willing to donate to a really good cause.

Re: out of work?????

Posted: Mon Mar 16, 2020 7:53 pm
by Nathan_P
Discussions are under way. The requirements are steep, though: fast I/O, fast network connections, and tons of storage, like 100TB+ per work server. Let's see what happens.

Re: out of work?????

Posted: Mon Mar 16, 2020 8:07 pm
by codecaine
Actually, I don't think they're that steep. I have experience with AWS, and I'm sure Google / Azure could manage the same. I recently posted in the forums asking about helping with this stuff.

I'm using AWS as the example only because I worked with it recently, so take this with a grain of salt: the details below are specific to one cloud provider.

Requirements mentioned by @Nathan_P:
Fast I/O - check - uses SSD drives (no need for 100TB here, see last item)
Fast network connection - check - depends on server size, and they also have specific instances for high network workloads such as this one. Take for example the "c5n" EC2 instance
Tons of fast, scalable and highly available storage - check - can mount an EFS file system using the NFS 4.1 protocol, with encryption at rest and TLS in transit. Can also apply a lifecycle policy to save some storage cost based on when files were last used.



Admittedly, I don't know too much about the specifics of folding on GPUs, but what I really nerd out on is server infrastructures and high availability.

Re: out of work?????

Posted: Mon Mar 16, 2020 9:21 pm
by v00d00
The problem isn't just the servers. Researchers have to generate new workunits, and I don't think it's an automatic system. Also, there are a finite number of hours in the day. I haven't had work for 24 hours. Admittedly it's been a long time since we've had shortages on this level, and I've been folding since the project started. It will get fixed, but the timetable on it goes along the lines of "how long is a piece of string?".

Re: out of work?????

Posted: Tue Mar 17, 2020 3:34 am
by Caprichosol
"I haven't had work for 24 hours"

It's been at least 48 hours for me without work. I appreciate the herculean efforts the Folding@home folks are making to get the work units out. Waiting eagerly in anticipation.

Re: out of work?????

Posted: Tue Mar 17, 2020 4:52 am
by QuintLeo
For perspective, one team ALONE has added about 18,000 new users in the last week (PC Master Race).
There is no way to tell how many new "anonymous because of the default client setup" folders have been added to the default team - but I'd guess HUNDREDS OF THOUSANDS and perhaps as high as a MILLION over the last week.

For perspective, Coreweave is contributing "more than 6000 Tesla V100" GPUs out of their 45,000 GPU render farm - and they're WAY down the scale in comparison to the default team or anonymous user in work done.

The primary issue appears to be lack of work units - the "60,000" that was added to one of the servers represented less than 6 HOURS of work at the current F@H participation level.
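
As a rough back-of-the-envelope check on those figures, here is a small Python sketch. It assumes the 60,000 work units drained in exactly 6 hours (the post says less, so the real rate would be higher), and the function names are purely illustrative, not part of any F@H tooling.

```python
def implied_rate_per_hour(work_units: int, hours_to_drain: float) -> float:
    """Consumption rate implied by a batch draining in the given time."""
    return work_units / hours_to_drain


def units_needed(rate_per_hour: float, hours: float) -> float:
    """Work units required to keep donors busy for `hours` at that rate."""
    return rate_per_hour * hours


# Figures quoted in the post; 6 hours is an upper bound on the drain time,
# so the rate below is a lower bound on actual demand.
batch = 60_000
drain_hours = 6

rate = implied_rate_per_hour(batch, drain_hours)
per_day = units_needed(rate, 24)

print(f"at least {rate:,.0f} work units per hour")
print(f"at least {per_day:,.0f} work units per day")
```

Under that assumption, a single batch of that size would have to be replaced roughly four times a day just to keep up.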

My personal guess is that participation in number of people has multiplied by at least 10 times, more likely over 100 times, and possibly by A THOUSAND times in the last week. Number of clients hasn't gone up as much, since many of us longer-term folders have been "heavy hitters" with multiple machines/clients.
This would put serious strain on ANY organization.

Might be worth talking to Bill Gates about the infrastructure issue - he's not part of Microsoft any more, but with his personal/foundation STOCK ownership I'm sure he's still got a lot of PULL there, and his foundation IS focused on medical-related issues.
This one seems to be right up its alley.

Re: out of work?????

Posted: Tue Mar 17, 2020 10:52 am
by JimF
If they had needed more servers to handle the work, they probably would have gotten them before Covid-19 came along. It is more likely a lack of work. They have only so many scientists, who were busy enough even before this virus came along.

That is all separate from the temporary server outage they have been facing, which is being solved, if it has not been already.

Re: out of work?????

Posted: Tue Mar 17, 2020 4:43 pm
by tulanebarandgrill
Contacting Bill Gates is a very good idea. It's possible he would assign someone and it just gets done. FWIW I am a Principal Systems Architect with a rather large service provider so if there's any volunteer-ship along those lines needed I'm here :) I don't know what servers you're using but I have a lot of experience with Cisco UCS. (I do NOT work for Cisco). I don't know if it has been considered, but probably a more distributed approach will scale better. Larger numbers of smaller systems. Doing aggregation right becomes key at that point. Still, it sounds like WU generation WILL be the biggest challenge. If it were me I would try to see how much of that process can be automated / re-used to maximize the use of researchers' time.

Re: out of work?????

Posted: Tue Mar 17, 2020 8:57 pm
by Jesse_V
The high levels of growth have eaten up all the workunits in the queue, and they are scrambling to deploy new servers to meet the demand. That's a good problem to have, I suppose. I'm very happy to see the outpouring of support and collaborative efforts here!

Re: out of work?????

Posted: Tue Mar 17, 2020 9:38 pm
by xixou
I don't start FAH anymore, there are just no WUs.