MtM wrote:Amount of science done is not dependant on throughput of wu's can't you read or are you just trying to perpetuate your wrong arguments with more wrong arguments?
That sounds like a contradiction in terms to me. For the record, I understand there is some extra usefulness in a WU being completed more quickly, but even though this is the only argument you could be using against what I said, you don't seem to have gotten to that point (or were you partaking in this discussion with too limited an understanding?).
MtM wrote:FLOPS and PPD don't have that tie when you talk about single core vs multi core ( or even comparing single cores with eachother but see 7im's post for that ), multi core enabels more complex workunits which are diffrent then the single core work units in their complexity and/or simulation length.
Actually, I was only talking about SMP clients with SMP work units. I never mentioned, nor was I intending to imply, using single-processor WUs. Everything I said referred only to SMP WUs (a2, 4-thread ones for quad-core CPUs, to be specific).
MtM wrote:2x2 = more then 1x4 because you run those complex work units on them which couldn't be ran ( within reasonable times or due to hw limitiations ) on a 1x4.
There are no such hardware limitations (even theoretically). A parallelized task can at best scale linearly, and achieving even that is highly unusual in the real world. You can run an SMP client on a single-core processor. As long as that CPU is at least roughly a 1.6GHz Core2-class processor, and is running 24/7 with no other significant load, it will still make the deadline for most SMP WUs. I'm not sure how that holds for 3-thread or 8-thread WUs; their deadlines may require a faster single core than 1.6GHz, as I haven't tested those configurations.
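To put a number on "at best linear": Amdahl's law (a standard result, not something from this thread) bounds the speedup of any task of which only a fraction p parallelizes:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n cores when a fraction p of the
    work parallelizes perfectly and (1 - p) stays serial."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% parallel work, 4 cores give well under a 4x speedup.
print(amdahl_speedup(0.95, 4))   # ~3.48
print(amdahl_speedup(1.0, 4))    # 4.0 -- the linear upper bound
```

Linear scaling is the p = 1.0 limit; any serial fraction at all pulls the multi-core configuration below it.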
MtM wrote:The scaling is very much non linear, and I'm not adressing utilisation of cores, but scaling in work unit complexity and that is the tie to scientific results. Not your constant refering to ppd being a good indication while you know it is not.
Actually, any SMP WU will quite happily run on a single-core processor. SMP provides nothing beyond the WU completing faster; there is no extra complexity that using multiple processors magically solves, which is what you seem to be implying. The problem is that overheads eat much of the gain, as I explained. The MPI process controller appears to use a naive round-robin scheduler, which makes throughput suffer heavily whenever the machine has any other load. Running 4 threads on an idle 4-core CPU will yield very good results, but the moment you do something else and generate, say, 50% load on one core, all 4 workers end up running at only 50%, because throughput seems to scale at 4x the slowest worker thread - and that is not counting the migration of processes between processor cores, which makes the scaling slightly worse still.
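The "4x the slowest worker" behaviour can be sketched with a toy model (my own illustration, not anything taken from the client): lock-step workers synchronize every step, so each step takes as long as the slowest core, and the aggregate rate collapses to n times the slowest speed:

```python
def lockstep_throughput(core_speeds):
    # Barrier-synchronized workers: every step waits for the slowest
    # worker, so the aggregate rate is n * min(speed), not sum(speeds).
    return len(core_speeds) * min(core_speeds)

print(lockstep_throughput([1.0, 1.0, 1.0, 1.0]))  # 4.0 -- idle machine
print(lockstep_throughput([1.0, 1.0, 1.0, 0.5]))  # 2.0 -- one core half busy
```

Note that taking 50% from one core costs half the total throughput, not the eighth you might naively expect.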
Specifically, on the setup I'm using, I have 6x GPU clients and a quad-core CPU. Each GPU client consumes around 20% of a single core (it's actually a bit less, but 20% is 1/5, a nice round number for the sake of the explanation). Here's an ASCII-art example of what happens when they all run at the same time, alongside a single SMP client.
First line is the CPU core number. G is GPU client's CPU usage, I is idle CPU time, S is SMP client's CPU time.
Code:
0 1 2 3
G G G G
G I G I
S S S S
S S S S
S S S S
Essentially, it means that 40% of one core goes unused, because the MPI scheduler is not quite up to the task of balancing the workload across the cores. The OS process scheduler then notices that the SMP workers want more CPU, tries to reschedule things to optimize utilization, and starts throwing processes from core to core in an attempt to do a better job. This introduces CPU migration latencies (typically 100-150ns) all over the place, and performance drops through the floor.
If you had a single core (or bound a process set, such as a single instance of an SMP client, to a single core), this imbalance and process migration wouldn't happen, yielding a massive saving in wasted CPU time and increasing WU throughput. As I said, I have observed a difference of 2x under real-world conditions.
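Binding a client to a single core, as described above, can be done on Linux with the sched_setaffinity call (a generic sketch of the OS facility, not part of the folding client; the taskset command-line tool does the same thing):

```python
import os

# Linux-only sketch: pin the current process (PID 0 means "self")
# to CPU core 0 so the scheduler can no longer migrate it.
if hasattr(os, "sched_setaffinity"):
    os.sched_setaffinity(0, {0})
    print(os.sched_getaffinity(0))  # {0}
else:
    print("CPU affinity not available on this platform")
```

Pinning each client's processes to fixed cores removes the migration latency entirely, at the cost of the OS no longer being able to rebalance around other load.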
Even theoretically, 1x4GHz will come out at worst equal to, and on average significantly ahead of, a 2x2GHz solution. Trust me - I'm a computer scientist.
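As a back-of-the-envelope check (the overhead figure is illustrative, my own assumption rather than a measurement), any per-core synchronization cost penalizes only the multi-core configuration:

```python
def effective_rate(cores, ghz_each, sync_overhead=0.0):
    # Aggregate compute rate in GHz-equivalents, discounted by an
    # assumed fractional cost of inter-core synchronization.
    return cores * ghz_each * (1.0 - sync_overhead)

print(effective_rate(1, 4.0))                      # 4.0 -- a single core pays no sync cost
print(effective_rate(2, 2.0, sync_overhead=0.1))   # 3.6 -- 2x2GHz loses ground
```

With zero overhead the two are exactly equal (4.0 each); any nonzero overhead tips the balance toward the single fast core.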
MtM wrote:If I didn't know any better. I would say you where trying to get into an argument with me by taking the opposite stance in this... again.
I think you need to read the quote above again.
I promise you, I'm not picking on you specifically. I apologize if it looks that way. Perhaps I shouldn't have inserted a snippet from your post, and should have stuck purely to answering the original question. I'll try not to repeat that mistake.
uncle_fungus wrote:Calm down people, there's no need to get hot under the collar about this.
Sorry. I hope there's no offense taken.