Re: My Beowulf cluster

Posted: Tue Feb 23, 2021 7:03 pm
by bruce
I think the official FAH position is "We don't support Beowulf Clusters."

You might think of FAH as a huge cluster where every node runs a single WU. If you build one locally, you'll probably end up running a different WU on every node, which means you're not really using the cluster's features.

Re: My Beowulf cluster

Posted: Tue Feb 23, 2021 7:36 pm
by bruce
Consider the situation where FAH is asked to compute a single WU across N CPU cores within a single physical CPU. Divide the total number of atoms into N segments. Each CPU_Core computes the forces between atoms within its own segment while the other cores work on other segments of the protein. This makes good use of each CPU_Core (and would work fine on a cluster).

Now devise a plan to compute all the forces between each atom in one segment (i) and the atoms in a different segment (j). To do that, you have to know where every other atom is, so all of the atoms' coordinates need to be in local RAM.
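
Here's a rough sketch of that decomposition in Python (toy forces and made-up sizes, nothing like FAH's real code): N workers, each assigned one segment, but each one reading the entire position array:

[code]
import numpy as np
from concurrent.futures import ThreadPoolExecutor

N_SEGMENTS = 4                                  # the "N pieces"
rng = np.random.default_rng(0)
positions = rng.random((1000, 3))               # every atom: must sit in local RAM
segments = np.array_split(np.arange(len(positions)), N_SEGMENTS)

def forces_on_segment(seg):
    """Toy 1/r^2 pairwise force on segment 'seg' from EVERY atom,
    showing why each worker needs the whole position array."""
    f = np.zeros((len(seg), 3))
    for k, i in enumerate(seg):
        d = positions - positions[i]            # vectors to all other atoms
        r2 = (d * d).sum(axis=1)
        r2[i] = np.inf                          # skip the self-interaction
        f[k] = (d / r2[:, None] ** 1.5).sum(axis=0)
    return f

# one worker per segment, like one CPU_Core per segment
with ThreadPoolExecutor(max_workers=N_SEGMENTS) as pool:
    segment_forces = list(pool.map(forces_on_segment, segments))
[/code]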

Now allow the simulated clock to run for some small increment of time so the atoms can move a little. From the new positions, calculate new forces and new motions. Repeat as necessary.
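
Continuing the toy sketch, the outer loop looks something like this (again, purely illustrative; compute_forces stands in for the segmented routine above):

[code]
import numpy as np

DT = 2e-15                         # a typical MD time-step, about 2 femtoseconds
rng = np.random.default_rng(1)
positions = rng.random((1000, 3))
velocities = np.zeros_like(positions)

def compute_forces(pos):
    return -pos                    # toy restoring force, stand-in for the real thing

for step in range(10):                     # "repeat as necessary"
    forces = compute_forces(positions)     # needs ALL current positions
    velocities += forces * DT              # new motions...
    positions += velocities * DT           # ...new positions
    # on a cluster, this is the point where every node would have to
    # exchange its segment's updated positions with every other node
[/code]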

Once the atoms move, the coordinates for the entire protein need to be updated, not just the ones in the local segment. At the end of every time-step, the positions of the atoms in each segment must be re-distributed to all the other nodes. (On a cluster, this synchronization process would be prohibitively slow.)

Even if your cluster's nodes are interconnected by a really, really fast network, the cluster will spend far more time re-synchronizing than actually computing motions.
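
Some back-of-the-envelope numbers (purely illustrative, not measurements): a production trajectory might cover a microsecond of simulated time in ~2 fs steps, i.e. on the order of 5x10^8 time-steps. Suppose one node computes a step in 2 ms; splitting the work across 16 nodes would ideally cut that to roughly 125 µs, but a full exchange of positions over commodity gigabit Ethernet costs on the order of 100 µs or more per step in latency alone, and that price is paid every single step, hundreds of millions of times. The interconnect swallows most of what the extra nodes bought you.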

FAH manages this process by distributing a different Run/Clone/Gen to each node with no need for direct re-synchronization.

Re: My Beowulf cluster

Posted: Tue Nov 15, 2022 9:28 pm
by CarlHolmberg
Jumping in here with a few thoughts since I'm about to attempt Beowulfing the situation on Linux using HTCondor as a job scheduler for fah client V7:
1. The job scheduler is your friend. The user will usually write a configuration file which specifies a job's resource requirements, the name of and arguments for the executable, paths for data and log files, and usually a maximum run time (see the submit-file sketch after this list).
2. While the client can handle dividing up a multi-socket, multi-core server into multiple CPU slots, I suspect it would simplify things to tailor each job to request all of the local CPU resources on a single node.
3. The job setup would prep to run one WU, then exit (including a client execution string something like 'fahclient --finish --idle-seconds=0 --cpus=0 --smp=true --log-redirect=true --log=</path/log.txt> --run-as=<user name> --power=full --passkey=<fah passkey> --team=<fah team ID> --user=<fah user ID>').
4. The job teardown would copy the log file to a permanent location, and nuke the fah client process.
5. Alternatively, one might set the fah job on each node to run open-ended, with a user-specified method to detect when a job is hung so it can be terminated, and with the scheduler handling cleanup and relaunch.
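
For what it's worth, here's the shape of an HTCondor submit description along those lines. Every path, value, and expression below is a placeholder I haven't tested, not a known-good configuration:

[code]
# Hypothetical HTCondor submit description; all names and values are
# placeholders, not a tested configuration.
executable      = /usr/bin/FAHClient
# flags as in item 3 above; <...> fields to be filled in by the user
arguments       = --finish --power=full --user=<fah user ID> --team=<fah team ID> --passkey=<fah passkey>
request_cpus    = 8
output          = fah_$(Cluster).out
error           = fah_$(Cluster).err
log             = fah_$(Cluster).log
# crude maximum-run-time cap: remove the job after 24 hours of wall time
periodic_remove = (time() - JobCurrentStartDate) > 86400
queue
[/code]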