Highwater Mark

alancabler · Post by **alancabler** » Tue Mar 25, 2008 1:20 pm

We just keep getting faster and faster.
The project has now passed 1.5 PFLOPS.

OS Type          Current TFLOPS* Active CPUs Total CPUs
Windows                     179       188452    1965339
Mac OS X/PowerPC              7         8762     113529
Mac OS X/Intel               21         6724      43180
Linux                        43        25559     279954
GPU                          27          465       5261
PLAYSTATION®3              1225        40577     469305
Total                      1502       270539    2876568

Edit: Fixed table layout -UF

John Naylor · Post by **John Naylor** » Tue Mar 25, 2008 1:35 pm

What dya think... First project to 10PFLOPS by 2015?

alancabler · Post by **alancabler** » Tue Mar 25, 2008 2:04 pm

Hi John,

John Naylor wrote:What dya think... First project to 10PFLOPS by 2015?

If the project speed increases according to Moore's Law, then we should be somewhere around 16 PFLOPS by 2015. Since our production depends on the number of donors, as well as machine speed and time spent folding, there are more variables to consider. OTOH, the Moore's Law timescale has been getting shorter, and there are some extraordinary tech advances in view.
Personal TFLOP contributions will become commonplace, and soon...
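
As a rough check on the ~16 PFLOPS figure, here is a minimal sketch of the projection. It assumes "Moore's Law" means a doubling every 24 months (the post does not state a period, so an 18-month doubling is shown for comparison) and that nothing else changes, such as donor count. The function name `project_pflops` is made up for illustration.

```python
def project_pflops(start_pflops, years, doubling_period_years):
    """Exponential growth: the value doubles once per doubling period."""
    return start_pflops * 2 ** (years / doubling_period_years)

start = 1.5           # PFLOPS in March 2008, from the first post
years = 2015 - 2008   # 7 years

two_year = project_pflops(start, years, 2.0)        # assumed 24-month doubling
eighteen_month = project_pflops(start, years, 1.5)  # assumed 18-month doubling

print(f"24-month doubling: {two_year:.1f} PFLOPS")
print(f"18-month doubling: {eighteen_month:.1f} PFLOPS")
```

With a 24-month doubling the result lands near 17 PFLOPS, close to the figure quoted; a shorter doubling time pushes it well higher.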

alancabler · Post by **alancabler** » Thu Mar 27, 2008 1:49 pm

Since this thread was started, just 2 days ago, donors have added 31 TFLOPS of folding power to the project.
http://fah-web.stanford.edu/cgi-bin/mai ... pe=osstats
That's more computing power than the world's 38th largest supercomputer (27.38 TFLOPS) at Lawrence Livermore National Laboratory. http://www.top500.org/list/2007/11/100

The Folding@home project is currently running faster than the Top 12 supercomputers in the world combined, plus the aforementioned Lawrence Livermore system (as of Nov. '07). Additionally, the project's power is calculated from sustained, actual performance, and not from theoretical peak performance.

note: F@h is a distributed computing project and is neither a classically defined supercomputer, nor a supercluster.
You will only find F@h mentioned rarely, if at all, in the publications dealing with supercomputers or superclusters.
Cluster computing is anathema to many supercomputing purists (their reasons have merit). Still, those involved in such pursuits studiously ignore the 800 pound gorilla in the room.

butc8 · Post by **butc8** » Fri Mar 28, 2008 4:13 pm

With the new GPU client on the way and a 3870 at +-1TF/core, we will see a 2x3870x2 run +-4TF on one computer, probably next month. That's faster than some DC projects on just one computer! 1 yottaflop is just around the corner.

http://en.wikipedia.org/wiki/Peta-

Foxery · Post by **Foxery** » Fri Mar 28, 2008 6:04 pm

GPU 27 465 5261
I'm glad you posted this, so we have a reference saved for after the new GPU client arrives.

Today it's:
GPU 27 450 5266

butc8 wrote:With the new GPU client on the way and a 3870 at +-1TF/core, we will see a 2x3870x2 run +-4TF on one computer, probably next month. That's faster than some DC projects on just one computer! 1 yottaflop is just around the corner. http://en.wikipedia.org/wiki/Peta-

Actually, the X1900 XT was also advertised as 1 TFLOP, so take it with a grain of salt. This doesn't directly translate into actual calculations relevant to protein folding. 27 TFLOPs / 450 GPUs = roughly 60 GFLOPs average per card. Even after considering that many don't crunch 24/7, and many are 1600-series, the actual work performed is nowhere near 1000 GFLOPs apiece, largely because Folding is more complex than rendering polygons.

A more conservative guess would be to believe that new cards are at least ~2X the speed, and at least ~2X as many people will Fold on them. I'd expect to see the GPU Client stats to be in the realm of 150 GFLOPs at the end of April. Hopefully the truth will be double this much again, as I think I am greatly underestimating. I would not, however, expect the GPU total to outpace PS3s any time soon.
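
The arithmetic in this post can be sketched as follows. The inputs are the GPU row quoted above; the two factors of 2 are the post's "at least ~2X" guesses taken at face value, so the projected total is a floor under those assumptions, not a prediction.

```python
total_tflops = 27    # GPU row, "Current TFLOPS"
active_gpus = 450    # active GPU clients in today's figures

# Average sustained rate per active card, in GFLOPS.
avg_gflops_per_card = total_tflops * 1000 / active_gpus
print(f"Average per card: {avg_gflops_per_card:.0f} GFLOPS")

# Assumed scaling from the post: ~2x faster cards and ~2x as many donors.
speed_factor = 2
donor_factor = 2
projected_total_tflops = total_tflops * speed_factor * donor_factor
print(f"Projected GPU total: {projected_total_tflops} TFLOPS")
```

That works out to 60 GFLOPS per card today and about 108 TFLOPS in total if both guesses hold exactly. Since both are lower bounds, a somewhat higher total would still be consistent with them, and either way it stays far below the PS3 row's 1225 TFLOPS.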

zorzyk · Post by **zorzyk** » Fri Mar 28, 2008 6:12 pm

How can we translate power of SMP client into flops?
I'd like to know, what is the percentage of SMP clients in total number 179 TFlops of Windows OS.
Is it possible to estimate?

butc8 · Post by **butc8** » Fri Mar 28, 2008 6:28 pm

AFAIK the SMP client is efficient (it splits up the WU). On PC Wizard I'm getting about 13GF on my [email protected].

Edit: There was this post that explained the GPU as the drag racer and CPU as mini van, now with the SMP they gonna bring in trailers and roof racks and off road tires haha

Foxery · Post by **Foxery** » Fri Mar 28, 2008 7:20 pm

I wonder if Stanford can break up the Client Statistics page into Uniprocessor and SMP figures.

edit:
Sorry Beberg. My "bad math" was worse than I thought.

My technical knowledge is too outdated to be more specific.

When my Friday headache clears, I'll do some reading!

Post by **Beberg** » Fri Mar 28, 2008 9:01 pm

Foxery wrote:The Core 2 architecture has a pipeline depth of 14, meaning it takes 14 cycles to finish a complex instruction.

That's not how pipelining works. Please consult Wikipedia...

Foxery · Post by **Foxery** » Sat Mar 29, 2008 12:05 am

Shoot... Edited out my previous garbage entirely.

One of Core2's many improvements over the Pentium4 stems from its shorter pipeline, but glancing at a few old articles from its introduction, I didn't fully absorb the details.

butc8, I think the figure you are looking at shows either Integer ops, or possibly SSE-optimized instructions.

Now that I'm home and have some peace and quiet, I went out and downloaded LINPACK, an old, classic benchmark. Full description quoted below for the curious. Results for one core out of my C2Duo, running at 3.4 GHz, rated me at 1.85 GFLOPs. This implies that all four cores in a Q6600, similarly overclocked, put out 7.4 GFLOPs. (A general purpose metric would be: 0.544 GFLOPs per-GHz per-Core.)

I also have a copy that's optimized for SSE2 instructions, which reports 2.08 GFLOPs/core... but Folding cores benefit far more from SSE2, so I'm not sure what to make of that.
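
The per-GHz, per-core metric can be reproduced like this. It is only a linear extrapolation from my one measured core, and it assumes perfect scaling across cores and clock speeds, which real hardware will not quite deliver. The helper name `estimate_gflops` is just for illustration.

```python
measured_gflops = 1.85   # one core, plain LINPACK, from the result above
clock_ghz = 3.4

gflops_per_ghz_per_core = measured_gflops / clock_ghz

def estimate_gflops(clock_ghz, cores, rate=gflops_per_ghz_per_core):
    """Linear extrapolation; assumes perfect scaling across cores and clock."""
    return rate * clock_ghz * cores

print(f"{gflops_per_ghz_per_core:.3f} GFLOPS per GHz per core")
print(f"Quad core at 3.4 GHz: {estimate_gflops(3.4, 4):.1f} GFLOPS")
```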

LINPACK
The LINPACK Benchmarks are a measure of a system's floating point computing power. Introduced by Jack Dongarra, they measure how fast a computer solves a dense n by n system of linear equations Ax=b, which is a common task in engineering. It was written in Fortran by Jack Dongarra, Jim Bunch, Cleve Moler, and Pete Stewart, and was intended for use on supercomputers in the 1970s and early 1980s.
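
For anyone curious what "solving a dense n by n system" looks like, here is a toy sketch in pure Python. It is not the real LINPACK code: it uses plain Gaussian elimination with partial pivoting and the conventional operation count of 2n³/3 + 2n². Because it runs in the interpreter, the MFLOPS figure it prints mostly measures Python overhead and should not be compared with the results above.

```python
import random
import time

def solve_dense(a, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    a = [row[:] for row in a]   # work on copies, leave the inputs intact
    b = b[:]
    for k in range(n):
        # Partial pivoting: move the largest entry in column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x

n = 100
random.seed(0)
A = [[random.random() for _ in range(n)] for _ in range(n)]
x_true = [1.0] * n
b = [sum(A[i][j] * x_true[j] for j in range(n)) for i in range(n)]

t0 = time.perf_counter()
x = solve_dense(A, b)
elapsed = time.perf_counter() - t0

flops = 2 * n**3 / 3 + 2 * n**2   # conventional LINPACK operation count
max_error = max(abs(xi - 1.0) for xi in x)
print(f"n={n}: {flops / elapsed / 1e6:.1f} MFLOPS, max error {max_error:.2e}")
```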

alancabler · Post by **alancabler** » Sat Mar 29, 2008 2:43 am

Greetings Foxery,

Foxery wrote: ...I also have a copy that's optimized for SSE2 instructions, which reports 2.08 GFLOPs/core... but Folding cores benefit far more from SSE2, so I'm not sure what to make of that.

Pande Group has achieved 3.8 GFLOPS sustained performance from a 3.0 GHz P4 CPU using SSE intrinsics and Intel's C++ Compiler v9.0 running highly optimized, hand-coded Gromacs code, i.e. folding algorithms. *1

A QX9650 will yield 96 GFLOPS. *2

GPUs: R580 (X1900) achieves 20-40X the speed of a 2.8 GHz P4 and has been shown in the old forum to also run FAHcode at + 96 GFLOPS. Current generation (RV670) ATI chips have demonstrated close to 0.5 TFLOPS performance. *3

PS3 Cell processors running FAHcode achieved ~ 83 GFLOPS in the early implementations of the FAH client, but several code improvements have increased the PS3's performance. Sorry, all of my links to the info detailing the PS3's power were through the old folding-community.org forum, which system failures have rendered inaccessible.

*1 N-Body Simulations on GPUs p.5
*2 A Portable Run-Time Interface for Multi-level Memory Hierarchies p.8
*3 ibid p.101 also see GPGPU

Post by **bruce** » Sat Mar 29, 2008 5:37 am

butc8 wrote:With the new GPU client on the way and a 3870 at +-1TF/core, we will see a 2x3870x2 run +-4TF on one computer, probably next month. That's faster than some DC projects on just one computer! 1 yottaflop is just around the corner.

Foxery wrote:Actually, the X1900 XT was also advertised as 1 TFLOP, so take it with a grain of salt. This doesn't directly translate into actual calculations relevant to protein folding. 27 TFLOPs / 450 GPUs = roughly 60 GFLOPs average per card. Even after considering that many don't crunch 24/7, and many are 1600-series, the actual work performed is nowhere near 1000 GFLOPs apiece, largely because Folding is more complex than rendering polygons.

A more conservative guess would be to believe that new cards are at least ~2X the speed, and at least ~2X as many people will Fold on them. I'd expect to see the GPU Client stats to be in the realm of 150 GFLOPs at the end of April. Hopefully the truth will be double this much again, as I think I am greatly underestimating. I would not, however, expect the GPU total to outpace PS3s any time soon.

You're both making one critical mistake. You're assuming that the number of FLOPS actually means something.

Well, it does mean something, but it's not as meaningful as you're making it.

In the hypothetical 2x3870x2 machine, the limiting factor is going to be the PCI-e connection between main RAM and the GPU's VRAM. It's just not possible to move all of the data needed for protein folding in and out of the GPU fast enough to keep it 100% busy. That means that the useful flops are a lot smaller than the potential number that they like to advertise. Moreover, if you actually figured out how to get four GPUs folding from the same motherboard, they'd have to share some or all of the bandwidth of the PCI-e bus, so you'd get less than 4 times what one would do by itself.

Two x1950xtx's in the same machine fold a little faster than one, but certainly not double.
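
The bandwidth argument can be put into a simple model: the sustained rate is the smaller of what the chip can compute and what the bus can feed it. Every number below is an illustrative assumption, not a measurement of any real card or of FAH code, and `useful_gflops` is a made-up helper.

```python
def useful_gflops(peak_gflops, bus_gb_per_s, flops_per_byte):
    """Sustained rate is capped by whichever is smaller: compute or data feed."""
    feed_limited = bus_gb_per_s * flops_per_byte
    return min(peak_gflops, feed_limited)

# All values are assumptions chosen only to illustrate the effect.
PEAK = 500.0       # advertised GFLOPS per GPU
BUS = 4.0          # GB/s of bus bandwidth available in total
INTENSITY = 20.0   # floating-point operations performed per byte transferred

one_gpu = useful_gflops(PEAK, BUS, INTENSITY)
# Four GPUs fully sharing the same bus each get a quarter of the bandwidth.
four_gpus = 4 * useful_gflops(PEAK, BUS / 4, INTENSITY)

print(f"one GPU:   {one_gpu:.0f} GFLOPS useful of {PEAK:.0f} advertised")
print(f"four GPUs: {four_gpus:.0f} GFLOPS useful of {4 * PEAK:.0f} advertised")
```

In this fully shared, transfer-bound case the four cards together do no more than one. With only partial sharing the total would land somewhere between 1x and 4x.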

Foxery · Post by **Foxery** » Sat Mar 29, 2008 5:35 pm

Right, FLOPS only means Floating (Point) Operations per Second. The result depends on the type of operations you are testing. Linpack is based on linear algebra; Folding@home involves geometry. Popular benchmarks these days are FutureMark, PCMark, Sandra, etc. I don't know what type of math they run, but the big numbers look exciting, right?

It might be more helpful to show the relative speeds of machines throughout the years, rather than fixating on some magical number to represent "performance." Here are a few common examples, using Linpack as the benchmark of choice:

(source: http://freespace.virgin.net/roy.longbot ... esults.htm)


CPU                MHz    MFLOPS 
Pentium III         450     61
Athlon              500    180
Pentium III        1000    316
Athlon-TBird       1000    372
Pentium 4          1700    382 (Note the poor performance vs. slower Athlons/P3s)
Athlon-Barton      1800    659 (Marketed as a "2500" vs. a Pentium4)
Opteron-?          2000    753
Pentium 4          3066    840
Athlon 64          2200    838
Core 2 Duo-1 Core  2400   1315
Core 2 Duo-1 Core  3400   1844
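
To compare architectures rather than clock speeds, the same table can be normalized to MFLOPS per MHz. This is a small sketch using only the rows listed above; it says nothing about how these chips handle Folding code.

```python
# (CPU, MHz, MFLOPS) rows copied from the table above.
results = [
    ("Pentium III", 450, 61),
    ("Athlon", 500, 180),
    ("Pentium III", 1000, 316),
    ("Athlon-TBird", 1000, 372),
    ("Pentium 4", 1700, 382),
    ("Athlon-Barton", 1800, 659),
    ("Opteron-?", 2000, 753),
    ("Pentium 4", 3066, 840),
    ("Athlon 64", 2200, 838),
    ("Core 2 Duo-1 Core", 2400, 1315),
    ("Core 2 Duo-1 Core", 3400, 1844),
]

# Work done per clock cycle, which factors out raw clock speed.
per_mhz = [(cpu, mhz, mflops / mhz) for cpu, mhz, mflops in results]

for cpu, mhz, ratio in per_mhz:
    print(f"{cpu:<18} {mhz:>5} MHz  {ratio:.3f} MFLOPS/MHz")

best = max(per_mhz, key=lambda row: row[2])
print(f"Best per-clock result: {best[0]} at {best[1]} MHz")
```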

Post by **bruce** » Sat Mar 29, 2008 7:36 pm

Foxery wrote:the big numbers look exciting, right?

right

Foxery wrote:It might be more helpful to show the relative speeds of machines throughout the years, rather than fixating on some magical number to represent "performance." Here are a few common examples, using Linpack as the benchmark of choice:

But they're not a reasonable comparison.

Suppose my favorite benchmark is calculating a lot of square roots. On machine A, square root is calculated with a small subprogram that typically requires about 20 Floating Point OPerations. I want to compare that to machine B, but this hardware happens to have a specialized Floating Point OPeration called SQRT that can perform the entire calculation in one operation without the help of a subprogram. Machine B running at 200 MFLOPS is doing 20 times as much work as machine A, which also is rated at 200 MFLOPS.

Now add machine C which can do 1,000 SQRT operations in the same time that machine B performs one of them (so it is rated at 200,000 MFLOPS or 4,000,000 MFLOPS, depending on which method you used in the first answer) but there's a catch. It can perform those operations only if the data is already in VRAM, and it can only load 200 M numbers per second into VRAM. I can load one value, find its square root, but then it has to sit idle for 999 more operations before it gets the next number to work on. This machine is no faster than machine B, but it has a much higher MFLOP rating.

So now we have to ask the question: is my benchmark, which shows that machine C performs USEFUL operations at only 1/1000th of its rated MFLOPS, representative of FAH code? Probably not, but you do get the point that USEFUL operations are all that really matters, not the big numbers.
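
The three hypothetical machines can be put side by side in a few lines. This is a sketch of the thought experiment only, using the numbers given above; the helper name `useful_sqrts_per_s` is invented, and machine C is rated here by counting each SQRT as one operation.

```python
def useful_sqrts_per_s(rated_mflops, flops_per_sqrt, load_limit_m_per_s=None):
    """Millions of square roots per second that a machine actually delivers."""
    compute_bound = rated_mflops / flops_per_sqrt
    if load_limit_m_per_s is None:
        return compute_bound
    # The machine cannot finish results faster than it can load inputs.
    return min(compute_bound, load_limit_m_per_s)

a = useful_sqrts_per_s(200, 20)           # software sqrt, about 20 ops each
b = useful_sqrts_per_s(200, 1)            # hardware SQRT, one op each
c = useful_sqrts_per_s(200_000, 1, 200)   # 1,000x B, but fed only 200 M values/s

for name, rated, useful in [("A", 200, a), ("B", 200, b), ("C", 200_000, c)]:
    print(f"Machine {name}: rated {rated:>7} MFLOPS, {useful:.0f} M sqrt/s useful")
```

A and B carry the same rating but differ by 20x in useful work, while C carries 1,000 times B's rating and delivers exactly the same throughput.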
