
other than FAH work

Posted: Wed Jan 18, 2012 7:15 am
by beer
As far as I know (please correct me if I am wrong), FAH uses almost exclusively floating-point operations on the CPU. I am wondering if someone has tried to run Folding@home SMP together with another program for science that is not heavy on floating-point operations?

Re: other than FAH work

Posted: Wed Jan 18, 2012 11:38 am
by Napoleon
I haven't done so specifically for scientific apps, but I've been pleasantly surprised by how well HyperThreading does on my 2C/4T Atom330, allowing me to run my occasional ALU-intensive workloads concurrently with CPU folding. Not quite as good as having four real CPU cores, no, but I can get quite close in some specific real-life scenarios of mine. One recent example is in viewtopic.php?f=66&t=20407&start=30#p203413 / scenario 7. CPU TPF increased roughly 13min => 17min and combined compression speed dropped about 800kBps => 600kBps, a far cry from the 13min => 26min and 800kBps => 400kBps I would expect without HT doing its magic. :D

Then again, my scenario 7 was a nice symmetric workload, quite friendly for FAH and HT. Got to take another look at BOINC, for example, in case there is something similar available.

Re: other than FAH work

Posted: Wed Jan 18, 2012 7:06 pm
by bruce
beer wrote:As far as I know (please correct me if I am wrong), FAH uses almost exclusively floating-point operations on the CPU. I am wondering if someone has tried to run Folding@home SMP together with another program for science that is not heavy on floating-point operations?
Napoleon has explained it very well. Here's some additional detail:

There are two limiting factors here. One is whether the FPU and the ALU can both be active at the same time and the other is whether the OS can manage your workload.

Hyperthreading (or Bulldozer) gives the OS the capability of running two threads that have to share the same FPU. If both tasks need the FPU, they compete with each other and both slow down to about half speed. If only one needs the FPU and the other uses the ALU, both tasks can run at (almost) full speed.

Assume a Quad plus HT, which gives your OS 8 threads to work with. A) Run SMP8 and there's nothing free to run anything else. Add 4 ALU tasks and they'll have to compete for OS resources, slowing things down. B) Run SMP4 plus 4 other tasks that use just the ALU and (if the OS assigns them in the optimum order) it's possible that all 8 will run at "normal" speed. In one case, HT gives you no extra performance; in the other case, you get twice the throughput you would get without HT. That's why the advertisements for HT are very careful to use the words "depending on..." or "as much as..."

In fact, on an otherwise idle machine, SMP8 does give maybe 15% faster results than SMP4, but that's because no code uses ONLY the FPU. One FAH task uses the FPU maybe 90% of the time and the ALU the rest of the time. Another SMP thread can use the unused resources. My "about half speed" allows for that extra capacity.
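Here's a minimal pthread sketch of that idea, assuming a POSIX-style toolchain; the loop bodies and iteration count are arbitrary placeholders, so treat it as an illustration rather than a benchmark. Confine the process to the two hardware threads of one physical core (e.g. with taskset on Linux or start /affinity on Windows) and the FPU+FPU pair should finish clearly slower than the FPU+ALU pair, which is exactly the sharing effect described above.

/* toy_ht_probe.c - hypothetical demo, not FAH code.
 * Build: gcc -O2 -pthread toy_ht_probe.c -o toy_ht_probe
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 200000000ULL

static volatile double dsink;     /* sinks keep the compiler from deleting the loops */
static volatile uint64_t isink;

/* FPU-heavy worker: four independent floating-point multiply-add chains */
static void *fpu_work(void *arg)
{
    (void)arg;
    double a = 1.0, b = 1.1, c = 1.2, d = 1.3;
    for (uint64_t i = 0; i < ITERS; i++) {
        a = a * 1.000000001 + 1e-9;
        b = b * 1.000000001 + 1e-9;
        c = c * 1.000000001 + 1e-9;
        d = d * 1.000000001 + 1e-9;
    }
    dsink = a + b + c + d;
    return NULL;
}

/* ALU-heavy worker: integer shifts, xors and adds only */
static void *alu_work(void *arg)
{
    (void)arg;
    uint64_t v = 0x0123456789abcdefULL, w = 0xfedcba9876543210ULL;
    for (uint64_t i = 0; i < ITERS; i++) {
        v = (v << 13) ^ (v >> 7) ^ i;
        w = (w + i) ^ (w >> 3);
    }
    isink = v ^ w;
    return NULL;
}

/* run two workers concurrently and return the wall-clock time in seconds */
static double run_pair(void *(*a)(void *), void *(*b)(void *))
{
    struct timespec t0, t1;
    pthread_t ta, tb;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, a, NULL);
    pthread_create(&tb, NULL, b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    /* On two HT siblings of one core: FPU+FPU contends for the single shared
     * FPU, while FPU+ALU mostly uses separate execution resources. */
    printf("FPU + FPU: %6.2f s\n", run_pair(fpu_work, fpu_work));
    printf("FPU + ALU: %6.2f s\n", run_pair(fpu_work, alu_work));
    printf("ALU + ALU: %6.2f s\n", run_pair(alu_work, alu_work));
    return 0;
}

Which logical CPU numbers are siblings depends on the machine's topology, so check that first (e.g. /proc/cpuinfo on Linux) before pinning the test.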

Re: other than FAH work

Posted: Mon Jan 23, 2012 12:03 am
by Napoleon
Unless I'm mistaken, distributed.net OGR is ALU-only. This particular math puzzle seems to have some interesting practical applications:
OGR's have many applications including sensor placements for X-ray crystallography and radio astronomy. Golomb rulers can also play a significant role in combinatorics, coding theory and communications, and Dr. Golomb was one of the first to analyze them for use in these areas.
X-ray crystallography is used in protein studies. Who knows, OGR might even benefit FAH indirectly. According to Wikipedia, one of the coordinators of distributed.net in its current form is a certain Adam L. Beberg. Hmm, why does the name Beberg sound vaguely familiar... :ewink:

For starters, I was surprised to see that running 4 OGR crunchers instead of just 2 gave me over a 1.5x performance boost. Then again, ALUs are much simpler than FPUs and HyperThreads have some resources duplicated, so I presume the ALU side of my 2C/4T gets closer to a true quad than the FPU side. Consider AMD Bulldozer, for example: the ALU side is a true octocore, but it actually has only 4 FPUs... Without further ado, let's have FAH and OGR duke it out on my 2C/4T CPU.

OGR only:
  • 28 Mnodes/s == 36 ms/Mnode (2 crunchers, 50% CPU)
  • 43 Mnodes/s == 23 ms/Mnode (4 crunchers, 100% CPU)
2x P6892 uniprocessor CPU WUs only:
  • 27min TPF (50% CPU)
2x P6892 + 2x OGR:
  • 29min TPF + 20 Mnodes/s == 50 ms/Mnode (50% uni + 48% OGR, 100% total)
2x P6892, P5770 and P7630:
  • 29min TPF (50% uni + 0.5% GPU2 + 3.5% GPU3)
2x P6892, P5770 and P7630 + 2x OGR:
  • 34min TPF + 17 Mnodes/s == 59 ms/Mnode (50% uni + 0.5% GPU2 + 3.5% GPU3 + 43% OGR, 100% total)
I use Process Lasso to tweak priorities and affinities (a rough scripted equivalent is sketched after this list). Here are some further details and observations:
  • OGR is Low priority and running on cores 0 and 3 along with GPU cores, ensuring access to both physical ALUs
  • Uniprocessor slots are Above Normal priority and running on cores 1 and 2, ensuring access to both physical FPUs as well as minimal preemption from normal processes
  • GPU2 slot is High priority and running on core 0, preempting just about everything on it
  • GPU3 slot is High priority and running on core 3, likewise preempting just about everything on it
  • GPU folding performance remains unaffected in all cases, no surprises there
  • GPU folding also requires some kernel time because it needs to access the GPU hardware through drivers
  • CPU kernel time produced by GPU folding seems to stick to cores 0 and 3 according to Task Manager graphs
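For the record, roughly the same rules expressed as plain Win32 calls would look like the sketch below. The PIDs are placeholders you'd look up in Task Manager first, and this only illustrates the affinity masks and priority classes involved, not what Process Lasso actually does internally (Lasso also keeps re-applying the rules as processes restart).

/* affinity_sketch.c - hypothetical example; build with MinGW: gcc affinity_sketch.c -o affinity_sketch */
#include <windows.h>
#include <stdio.h>

/* Apply an affinity mask and a priority class to one process, identified by PID. */
static void pin(DWORD pid, DWORD_PTR mask, DWORD priority)
{
    HANDLE h = OpenProcess(PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION, FALSE, pid);
    if (h == NULL) {
        printf("could not open PID %lu\n", (unsigned long)pid);
        return;
    }
    SetProcessAffinityMask(h, mask);  /* which logical cores the process may use */
    SetPriorityClass(h, priority);    /* Task Manager style priority class       */
    CloseHandle(h);
}

int main(void)
{
    /* Placeholder PIDs - substitute the real ones from Task Manager. */
    DWORD ogr1 = 1111, ogr2 = 2222, uni1 = 3333, uni2 = 4444, gpu2 = 5555, gpu3 = 6666;

    pin(ogr1, 0x9, IDLE_PRIORITY_CLASS);          /* OGR cruncher #1: cores 0+3, Low  */
    pin(ogr2, 0x9, IDLE_PRIORITY_CLASS);          /* OGR cruncher #2: cores 0+3, Low  */
    pin(uni1, 0x6, ABOVE_NORMAL_PRIORITY_CLASS);  /* uniprocessor slot #1: cores 1+2  */
    pin(uni2, 0x6, ABOVE_NORMAL_PRIORITY_CLASS);  /* uniprocessor slot #2: cores 1+2  */
    pin(gpu2, 0x1, HIGH_PRIORITY_CLASS);          /* GPU2 slot: core 0, High          */
    pin(gpu3, 0x8, HIGH_PRIORITY_CLASS);          /* GPU3 slot: core 3, High          */
    return 0;
}

The mask 0x9 keeps OGR on cores 0 and 3 (both physical ALUs), while 0x6 leaves cores 1 and 2, and therefore both physical FPUs, to the uniprocessor slots, matching the rules listed above.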
HT is providing decent concurrency in the 2x P6892 + 2x OGR case. The CPU is fully utilized and P6892 frame times increase only about (29-27) / 27 * 100% == 7.4%. Since FAH is my priority charity, it's nice to see the uniprocessor slots going strong while OGR takes the bigger hit in milliseconds per Mnode, (50-36) / 36 * 100% == 39%. I don't quite understand why uniprocessor frame times increase from 29min to 34min when I run everything concurrently. Maybe Task Manager doesn't show every little detail after all, and with 2x uni + 2x GPU + 2x OGR there are bound to be frequent scheduling clashes on cores 0 and 3. OGR's 50 => 59 ms/Mnode increase is easily explained by the CPU overhead from GPU folding, though.

Conclusion: I'm going to stick with 2x uni + 2x GPU + 2x OGR. The roughly (34-29) / 29 * 100% == 17% increase in uniprocessor TPF introduced by adding OGR to the mix isn't that bad. Uniprocessor WUs have long deadlines anyway, and for some reason I never get any A4 WUs, so it's not like I'm losing any QRB either.

Re: other than FAH work

Posted: Mon Jan 23, 2012 3:53 am
by gwildperson
Napoleon wrote:Unless I'm mistaken, distributed.net OGR is ALU-only, and they even have a 64bit client available.
I don't understand why you mentioned 64bit. If we're still talking about HT and sharing ALU-only with FPU-mostly code, it shouldn't matter whether it's 32bit or 64bit.

Perhaps you're responding to the discussions about the absence of a 64bit v7 client for F@h. If so, perhaps someone needs to remind you that F@h has a 64bit SMP core which works with the 32bit client and covers those bigadv cases where large amounts of RAM are needed.

Perhaps you meant something else.

Re: other than FAH work

Posted: Mon Jan 23, 2012 6:05 am
by Napoleon
Okay, I edited the offending sentence and corrected a typo or two. What I meant is that distributed.net has both 32bit and 64bit client versions available. All the rest is strictly about "HT and sharing ALU-only with FPU-mostly code", as you put it. My choice of words would have been "HT and running ALU-only code concurrently with FPU-mostly code".

Gee, I had no idea that merely mentioning 64bit in a single subordinate clause could make all the other sentences in my post appear to be offtopic. Better now? Bear with me, english isn't my native language. :roll:

Re: other than FAH work

Posted: Tue Jan 24, 2012 9:50 am
by Amaruk
Napoleon wrote:...english isn't my native language...
Trust me, your english is much better than many 'native' speakers... :wink: