Re: My Beowulf cluster
Posted: Sat Dec 31, 2011 12:57 am
by ATG
Aha! See, that's a problem. I switched the units over to a more unique username, "Aaron_Miller". I'm not sure why they are still showing up in the system as "ATG". I had a feeling there was a problem. Okay, so is there any way to fix this on my end? My configuration files show that the username and passkey are correct for the new username. I was originally using ATG as a short way of writing my usual handle, "AaroneusTheGreat", but after seeing the other "ATG"s in the system I figured the username would cause conflicts; that's when I made the change. Is there any kind of username or machine information I should send you in a personal message so you can figure out what's going on? The passkey, for instance?
The correct ATG would be this one:
#3. ATG 336 4 213973
The team number I set up for the units I run is the one above. Currently, they all should be showing up under the previously mentioned new username, "Aaron_Miller", and my stats show twenty-some-odd work units done under that name, but only 4 active CPUs, which doesn't make much sense to me.
Now in the logs I see a bunch of progress, ending in completion and a sleep cycle. Are those individual sections of a PRCG (Project/Run/Clone/Gen) group, or are those individual work units? I think I may be confusing myself here. I could also grab all the logs and send them to you in a PM if it comes to that, although I'm sure you've got plenty to do and may not have the time to pick through my 7 log files. I would like to make your job of identifying the crossed wires here easier. Thanks again for the help.
Re: My Beowulf cluster
Posted: Sat Dec 31, 2011 3:14 am
by bruce
I don't need anything in a PM (since you already said "Aaron_Miller"). It should be noted that as each machine completes a WU, it should be counted under the new name; depending on the hardware and the exact assignment, that time can vary. The new UserName will need to complete 10 WUs before it's eligible for a bonus.
Re: My Beowulf cluster
Posted: Sat Dec 31, 2011 3:58 am
by ATG
Oh, I wasn't aware of the bonus system. By the way, I wanted to know what the suggestion was when it comes to the size of work units. The info on the systems is as follows:
I haven't taken them apart in a while, so I can't tell you exactly what's in them in terms of hardware.
beowulf0:
AMD Sempron at ~2 GHz
691 MB RAM (DDR2)
beowulf1:
AMD Athlon 2400 at ~2 GHz
1010 MB RAM (DDR2)
beowulf2:
Pentium II or similar (probably), ~1 GHz
256 MB RAM (DDR)
beowulf3:
Pentium II or similar at ~700 MHz (the oldest unit, lol)
192 MB RAM (SDRAM)
beowulf4:
Pentium IV at ~3 GHz
512 MB RAM (DDR)
beowulf5:
Pentium IV at ~3 GHz
256 MB RAM (DDR)
beowulf6:
Pentium IV at ~3 GHz
256 MB RAM (DDR)
Kind of lame in terms of new computers, but perfectly acceptable for running one client on each system. I have some parts I've been meaning to throw into these units to spruce them up a little, so those RAM numbers may change.
Given the age of these units, what size would you suggest for them? I have high-speed internet and a fast network switch between them (gigabit Ethernet), so the network bottleneck is nearly nil. Anyway, any suggestions you have would be great. Thanks.
Re: My Beowulf cluster
Posted: Sat Dec 31, 2011 4:09 am
by bruce
Networking isn't important. Not that I'm recommending you change anything, but I can remember running FAH on individual computers similar to those over dial-up. The Small/Normal/Big setting is pretty much outdated. Leave it at Big, except maybe the 256 MB machines can be set to Normal. If there ever was a concrete recommendation, I've both forgotten what it was for Linux and forgotten where to find it.
The P-IVs probably have Hyperthreading, which might be worthwhile in case you run something else on them. Ignore it for FAH.
Re: My Beowulf cluster
Posted: Sat Dec 31, 2011 8:21 pm
by ATG
The P-IVs do have Hyperthreading, in fact. I'm not sure how to set it to ignore that, honestly; I'll see if I can find it in the guides. I've just added an 8th unit to the cluster and set it up with FAH and all that. It's got its correct config with the big units and so on. I'll check on everything else periodically and see if it's all working right. I may also end up changing the configs to reflect your recommendations about unit size. My network is pretty fast, a cable connection that averages about 16 to 17 Mbit/s down and about 5 to 6 Mbit/s up, so it's not really an issue to send larger chunks; I don't do a whole lot on here that would use all that up.
I would think that changing the configs for unit size will only take effect once the current units are finished, am I correct there? I don't want to interrupt any useful work at this stage; the 7 units have been running for a couple of days, so I imagine they've gotten quite a bit done. Sorry for the numerous questions back to back like this. I'm trying to figure out a lot of this on my own instead of bugging you guys with questions I'm sure you've answered a million times.
Re: My Beowulf cluster
Posted: Sat Dec 31, 2011 11:13 pm
by bruce
You don't have to set anything for HT. Just don't waste your time trying to get FAH to use it (since FAH doesn't gain much from it.)
Same with size. Keep whatever settings you have unless you discover a problem.
Re: My Beowulf cluster
Posted: Sun Jan 01, 2012 1:38 am
by ATG
Okay thanks for the input. I'll let you know if I have any unexpected issues that I can't figure out on my own. Especially if it turns out to be a bug or something. I don't expect anything from this point on. I've got my ship running pretty tight now. I'm just playing the waiting game at this point to see what the stats say, and adding nodes as I go along.
By the way, are there any documented conflicts with running things like OpenMPI programs while FAH is running? If so, a link to read about them would be fantastic. I want to use this thing to run some simulations of my own when I get to that point. In general, in the name of good programming practice, where would be a good place to start reading about the more technical aspects of FAH and how it works on Linux, so I don't step on any toes while programming? I imagine that if I avoid touching things like the PATH variable it should be pretty easy to keep my own code from affecting FAH, but I want to be sure. I've poked around in the documentation for a good long while, but I feel like I may be missing some stuff that's important to know, and I felt I should ask before possibly breaking something and having to fix it.
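For what it's worth, a common way on Linux to keep your own jobs and a background workload from fighting over the CPU is the `nice` command. This is only a generic sketch (the child command here is a stand-in that reports its own priority, nothing FAH-specific):

```shell
# Sketch: launch a job at a lower scheduling priority so it yields the CPU
# to other work. Niceness ranges 0..19 for unprivileged users; a higher
# value means a lower priority. The child here just reports its niceness.
out=$(nice -n 10 sh -c 'ps -o ni= -p $$')
echo "child ran at niceness $(echo "$out" | tr -d ' ')"
```

My understanding is that the classic FAH clients already ran their cores at low priority by default, so your own jobs at normal priority would tend to preempt them anyway; the bigger practical constraint on old boxes like these is RAM, not CPU contention.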
Re: My Beowulf cluster
Posted: Tue Jan 03, 2012 3:18 am
by ATG
It appears that several of my units missed their deadlines. I'm not sure what the deal is there; they aren't running much of anything else. I am guessing it might be a simple matter of processing power. I'm going to go back through the system logs and see if there has been any downtime when I'm not monitoring them. It has to be a problem on my end at this point. Is there a list somewhere of reasons units miss their deadlines and how to fix the issues? I'll get on reading that stuff if I know where to start.
Re: My Beowulf cluster
Posted: Tue Jan 03, 2012 6:13 am
by bruce
SSE has become an essential feature for meeting deadlines, which will disqualify the PII machines. The PIVs, the Athlon XP, and the Sempron should all have SSE. I think those machines would be fast enough to meet the deadlines for the FahCore_78 projects. I'm not sure whether all of the FahCore_a4 projects have been validated against relatively slow machines. Please report what you find.
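If you want to confirm which of your boxes actually have SSE, Linux exposes the CPU feature flags in /proc/cpuinfo on x86 machines. A quick check might look like this (the output strings are just my own labels):

```shell
# Check whether this CPU advertises SSE in its /proc/cpuinfo flags line.
# On non-x86 systems (or where the file is absent) this reports no flag.
msg=$(grep -qw sse /proc/cpuinfo 2>/dev/null \
      && echo "SSE supported" \
      || echo "no SSE flag found")
echo "$msg"
```

Run on each node, that would tell you up front which machines are worth keeping in the cluster for deadline purposes.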
Re: My Beowulf cluster
Posted: Tue Jan 03, 2012 6:25 am
by 7im
It wouldn't hurt to add the -forceasm switch. It's possible the SSE optimizations got turned off if a client instance wasn't shut down gracefully, in which case the client runs at 1/3rd the normal speed until it's restarted gracefully. The switch forces SSE on no matter how the client was shut down last time.
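For reference, a flag like this goes on the client's command line at startup. As a sketch only, with the binary name and location assumed (adjust to wherever your console client actually lives):

```shell
# Hypothetical launch line; './fah6' stands in for your actual
# console-client binary and path.
./fah6 -forceasm
```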
Re: My Beowulf cluster
Posted: Wed Jan 04, 2012 4:04 am
by ATG
That very well could have happened. I'll check what the logs say about -forceasm and the SSE optimizations. I just added an old HP system with an Intel Celeron processor to the mix; this one has 768 MB of RAM, I believe, so it's not terrible. I also cut down the installations, completely getting rid of the GUI on all my systems to save resources. When I started, I relied on the GUI because I didn't know much about Linux; as my familiarity grew, my need for the GUI shrank. That said, I wonder if my units will make their next deadlines.
I am still reading through a lot of documentation (understatement), so I'll try to post whatever weirdness I encounter, but bear in mind that some of this may just be my own ignorance causing problems, so there may not be anything to report with regard to the client. Thanks for the suggestions. I'm too tired tonight, but tomorrow I'll probably look through the logs and figure out whether the SSE optimizations were the problem, add the -forceasm switch to my startup scripts, and see what happens.
The units that missed their deadlines have already downloaded new units and started working on them. I checked the progress of some of them, and based on loose math, several should meet their deadlines as things stand right now. So I may have gotten some of this under control, but that's a very tentative projection given the difficulty I've had with these old systems.
Re: My Beowulf cluster
Posted: Thu Jan 05, 2012 5:18 am
by ATG
Okay, so I've gone back through and looked at the individual setup on each unit, and I think I've discovered that running the FAH clients from my custom startup scripts may have been the issue. Somehow or other, the clients weren't running correctly, most likely because they were started at an improper stage in the boot process. I'm not sure why that happened, but it did. I removed my startup scripts from all the systems and have started running the clients the normal way, as processes started by me. That might be a little tiresome if I have to reboot a system for whatever reason, but the good news is that I have only had to do that when I've changed something drastic in the systems, so it shouldn't happen often at all. They are pretty stable and don't seem to require much help to keep functioning; the only quirks they have are during the boot process, but as long as I've got them wired how they are, they should be just fine.
This may be helpful for the next guy coming through here with a similar problem. It also seems that because the clients were started by root rather than my usual username, there was a conflict there as well, probably because I wrote my own startup script and added it to the boot process myself, which can be tricky from everything I've been reading. It was a bit misleading, because I saw the processes chugging along when I checked on them, and they sometimes completed and sent in work, but not consistently; it was as if they were having to fight for position in the system. I don't understand everything I know about Linux just yet, if that makes any sense, so it was most likely a mistake on my part. When I get better at it, I may write a new startup script that (hopefully) works without any issues. For now I'll just restart the clients myself if I have to. By the way, is sending the process the TERM signal before rebooting the most graceful method? Supposing I end up having to reboot the machines in the future, I would like to do it in a way that's safe for the running FAH clients, so that I don't lose work again like I have recently. Very frustrating, to say the least.
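On the TERM-signal question: SIGTERM is indeed the polite request, since unlike SIGKILL it gives the process a chance to clean up before exiting. A minimal sketch, with a stand-in `sleep` playing the role of the client:

```shell
# Start a stand-in long-running process, then ask it to exit
# gracefully with SIGTERM.
sleep 300 &
pid=$!
kill -TERM "$pid"      # polite: the process may trap this and clean up
wait "$pid" 2>/dev/null
status=$?
# A process killed by signal N reports exit status 128+N; SIGTERM is 15.
echo "exit status: $status"
```

Whether a given client checkpoints its work on TERM is something to verify in its own docs; my understanding is the classic console client also treated Ctrl-C (SIGINT) as a clean shutdown, and worst case you lose work back to the last checkpoint rather than the whole unit.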
Re: My Beowulf cluster
Posted: Sat Feb 20, 2021 1:54 am
by inOr
Well, considering your cluster as a series of networked machines is a good place to start, because that's how FAH is going to treat it.
I'm a newbie and I haven't built my Linux cluster yet, so my understanding of clusters may be deficient. But I want to learn about them and how to use one on FAH problems.
That said, "a series of networked machines" sounds like FAH treats each node of a cluster separately, as though they were separate machines. For instance, if I run an application on some offline cluster that my computer is part of, it seems to me that it would act as one machine, not many; the nodes are merely executing tasks in parallel on a single machine. So why doesn't FAH recognize a cluster as just one machine that happens to run its tasks in parallel, using the cluster for computation?
Re: My Beowulf cluster
Posted: Sat Feb 20, 2021 5:05 am
by JimboPalmer
inOr wrote: So, why doesn't FAH recognize a cluster as just one machine that happens to run its tasks in parallel using the cluster for computation?
Welcome to Folding@Home!
That 'machine' would operate at the speed of the network, not the speed of the CPUs. A cluster is good at some tasks and poor at others; F@H runs best on a single machine, because you do not want to wait to access RAM on another node.
https://webhome.phy.duke.edu/~rgb/Beowu ... ode10.html
That page details some of the advantages and disadvantages.
Remember that the name is Folding@Home and few are going to set up a Beowulf at home, so the program is not optimized for a cluster.
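The gap JimboPalmer describes is worth putting rough numbers on. Using assumed ballpark figures (~100 ns for a local RAM access versus ~100 µs for a small-message round trip over gigabit Ethernet):

```shell
# Back-of-envelope latency comparison, in nanoseconds (assumed figures).
ram_ns=100         # ~100 ns: local RAM access
net_ns=100000      # ~100 us: small-message round trip on gigabit Ethernet
factor=$((net_ns / ram_ns))
echo "reaching memory across the network is ~${factor}x slower than local RAM"
```

A gap of roughly three orders of magnitude is why a tightly coupled simulation wants all of its threads on one box with shared memory, while loosely coupled, embarrassingly parallel work farms out across nodes just fine.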
Re: My Beowulf cluster
Posted: Tue Feb 23, 2021 2:55 am
by v00d00
Still, it's a cool project to do, especially if you are looking at certain tasks: brute-force crypto cracking, a render farm, solving other mathematical problems. But not Folding. You'd be better off buying a graphics card or two and folding that way.
I'm afraid building a Beowulf is real dark arts. Instructions exist, but they aren't complete. I built one about 20 years ago from a bunch of Duron 1300s, and there weren't any instructions per se; I read a lot of manuals that I found on an MIT FTP server, and after around 6 months I eventually created something that worked. It requires you to be a Linux power user at the bare minimum: able to handle and compile source code, hack it preferably, and have some basic understanding of networking. But if you are contemplating building one, you probably know all of this. Good luck and have fun. It's a cool project.