High System CPU Usage on WUs (7611 mostly)
The issue is not that the WU errors out, but that it is folding in a way that is not efficient for my AMD FX-8120 processor. Other WUs have similar issues, but this WU has it the worst of all. I run Ubuntu 11.10, so when I use top I can see where the CPU usage is going. This is what I see:
Cpu(s): 0.3%us, 41.1%sy, 58.5%ni, 0.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16430528k total, 1537932k used, 14892596k free, 47508k buffers
Swap: 15624996k total, 0k used, 15624996k free, 551428k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1385 fahclien 39 19 1117m 57m 3000 S 793 0.4 175:19.95 FahCore_a4
The %sy figure fluctuates between 40 and 50%, so nearly half of the CPU time is lost somewhere at the system (kernel) level rather than actually producing results. With other WUs it may be 10-15%, which is still bad, because on my Phenom II X6 machine %sy is at most 2-4%, which is normal. This abnormal spike in %sy leads me to believe that there is some sort of software issue. I would also point out that on this WU my FX is slightly slower than my Phenom II X6, despite normally being twice as fast, which leads me to believe that although I'm folding 2x faster on other WUs, this one is not using the CPU efficiently.
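top derives its "Cpu(s)" line from tick counters the Linux kernel exposes in /proc/stat. Because FahCore_a4 runs at nice 19 (the NI column above), its user-space work is counted under %ni, while the kernel time it triggers is counted under %sy. A minimal sketch for reproducing those percentages from two samples, assuming the standard column order of the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal); the function names are hypothetical helpers, not part of any FAH tool:

```python
# Sketch: reproduce top's "Cpu(s)" percentages from Linux /proc/stat.
# Assumed column order of the aggregate "cpu" line:
# user nice system idle iowait irq softirq steal [further columns ignored]
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line: str) -> list[int]:
    """Return the first eight tick counters from a 'cpu ...' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(x) for x in parts[1:1 + len(FIELDS)]]


def cpu_percentages(before: str, after: str) -> dict[str, float]:
    """Share of the ticks elapsed between two samples spent in each state."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * d / total for name, d in zip(FIELDS, deltas)}


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat, which totals all CPUs."""
    with open(path) as f:
        return f.readline()
```

On a live system, call read_cpu_line() twice a few seconds apart and pass both lines to cpu_percentages(); logging %sy this way over a whole WU would show the 40-50% swings more reliably than watching top.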
I also note that World Community Grid (WCG) utilizes this processor a lot better:
Cpu(s): 1.2%us, 0.2%sy, 98.6%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16430528k total, 1725316k used, 14705212k free, 46188k buffers
Swap: 15624996k total, 0k used, 15624996k free, 633784k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2723 boinc 39 19 34092 28m 1432 R 100 0.2 5:07.55 wcg_hcc1_img_6.
2726 boinc 39 19 34092 29m 1740 R 100 0.2 5:14.89 wcg_hcc1_img_6.
2732 boinc 39 19 34092 29m 1740 R 100 0.2 5:16.12 wcg_hcc1_img_6.
2720 boinc 39 19 34092 29m 1736 R 100 0.2 5:05.79 wcg_hcc1_img_6.
2735 boinc 39 19 34272 29m 1740 R 100 0.2 4:57.65 wcg_hcc1_img_6.
2630 boinc 39 19 35244 29m 1740 R 100 0.2 11:46.59 wcg_hcc1_img_6.
2729 boinc 39 19 33964 28m 1736 R 100 0.2 5:03.18 wcg_hcc1_img_6.
2738 boinc 39 19 33964 28m 1736 R 99 0.2 4:49.86 wcg_hcc1_img_6.
I'm guessing that the way these a3/a4/a5 WUs do SMP works poorly with Bulldozer processors, so a good portion of the potential (10-50%) is lost.
Edit: Now my X6 is seeing similar results:
Tasks: 157 total, 1 running, 156 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 12.0%sy, 87.1%ni, 0.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3797580k total, 1619412k used, 2178168k free, 211800k buffers
Swap: 975868k total, 0k used, 975868k free, 875288k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5379 mmstick 39 19 402m 37m 3024 S 598 1.0 1724:05 FahCore_a4
It's not as bad, though; this is with a 7600 WU.
-
- Posts: 10179
- Joined: Thu Nov 29, 2007 4:30 pm
- Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicone fan gaskets and silicone mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
- Location: Arizona
- Contact:
Re: High System CPU Usage on WUs (7611 mostly)
Your FX-8120 is trying to funnel 8 cores' worth of data down through only 4 Floating Point Units (FPUs). That may account for the increased CPU slack times. Also note that your X6 has a full 6 FPUs. Clock speed isn't everything when FAH uses FPUs so heavily.
Also, SMP moves a lot of data around. SMP will never scale up to 100% usage. One or more of the CPU cores is often waiting to sync data from one or more of the other CPU cores. 80-90% usage is great, all things considered.
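As a rough illustration of why synchronization keeps SMP below 100%, here is a toy model (an assumption for illustration, not how the FAH cores are actually structured): each step's work is split evenly across the cores, then every core waits a fixed time at a barrier before the next step can start. The function name and the numbers are hypothetical.

```python
def parallel_efficiency(work_per_step: float, cores: int, sync_per_step: float) -> float:
    """Toy model of one SMP step: `work_per_step` (single-core time) is split
    evenly over `cores`, then each core waits `sync_per_step` at a barrier.
    Returns speedup / cores, i.e. the fraction of core time doing useful work."""
    if cores < 1 or work_per_step <= 0 or sync_per_step < 0:
        raise ValueError("invalid model parameters")
    step_time = work_per_step / cores + sync_per_step
    speedup = work_per_step / step_time
    return speedup / cores


# Hypothetical numbers: 8 ms of work per step, 0.1 ms or 0.2 ms of sync per step.
for sync in (0.1, 0.2):
    print(f"8 cores, sync {sync} ms: {parallel_efficiency(8.0, 8, sync):.0%} efficient")
```

With these made-up numbers the model gives about 91% and 83%, and the sync cost weighs more heavily as cores are added, since efficiency here works out to work / (work + cores × sync).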
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
Re: High System CPU Usage on WUs (7611 mostly)
Has anybody made a similar observation of how top reports FAH when running on an i7? Although both AMD and Intel will deny that they're using the same ideas, the i7 also tries to "funnel 8 cores worth of data down through only 4 FPUs", although Intel calls them "virtual" cores. The only reports I've seen have been from Windows Task Manager, which doesn't provide information like "0.2%sy, 98.6%ni".
mmstick wrote: I also note, that World Community Grid utilizes this processor a lot better.
From that statement, I'm going to assume that the WCG analysis is doing integer calculations rather than floating-point calculations. (That may not be true, but it's a rational guess.) Lots of computer code never needs an FPU, so chip manufacturers can reduce the transistor count for those folks who don't need that hardware. As 7im points out, the "cores" in an X6 are different from the "cores" in an FX-8120. For more information, see any Intel discussion about hyperthreading, because from my perspective, AMD has done exactly the same thing without ever using the words "virtual cores" for what's in the FX-8120.
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 1164
- Joined: Wed Apr 01, 2009 9:22 pm
- Hardware configuration: Asus Z8NA D6C, 2 [email protected] Ghz, , 12gb Ram, GTX 980ti, AX650 PSU, win 10 (daily use)
Asus Z87 WS, Xeon E3-1230L v3, 8gb ram, KFA GTX 1080, EVGA 750ti , AX760 PSU, Mint 18.2 OS
Not currently folding
Asus Z9PE- D8 WS, 2 [email protected] Ghz, 16Gb 1.35v Ram, Ubuntu (Fold only)
Asus Z9PA, 2 Ivy 12 core, 16gb Ram, H folding appliance (fold only) - Location: Jersey, Channel islands
Re: High System CPU Usage on WUs (7611 mostly)
Hmm, I always thought that %CPU on the process line showed what percentage was being used by that particular process; in your case it's showing 793, which is about as good as it gets. If I am wrong, though, please correct me, as I'm in no way familiar with Linux apart from installing Ubuntu and running F@H.
For comparison, I ran top on my L5640 machine and got values ranging from 4.7% to 7.2% for %sy over the space of a couple of minutes, changing constantly between those limits. My folding process's CPU percentage was maxed out at 2395 out of 2400 (24-thread machine, so 24x100).
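top's per-process %CPU counts one fully busy logical CPU as 100% (its default Irix mode), so the ceiling is 100 times the number of logical CPUs: 800 on the 8-core FX-8120, 2400 on the 24-thread L5640 machine. A minimal sketch of that normalization, using a hypothetical helper name:

```python
import os


def machine_utilization(percent_cpu: float, logical_cpus: int) -> float:
    """Convert top's per-process %CPU (Irix mode: 100 = one busy logical CPU)
    into a fraction of the whole machine's capacity."""
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    return percent_cpu / (100.0 * logical_cpus)


# Figures quoted in this thread.
print(f"FX-8120, 8 threads:    {machine_utilization(793, 8):.3%}")
print(f"L5640 box, 24 threads: {machine_utilization(2395, 24):.3%}")
# On the machine being measured, os.cpu_count() reports the logical CPU count.
print(f"this machine has {os.cpu_count()} logical CPUs")
```

Keep in mind that a process's %CPU includes both its user and its system time, so a value near the ceiling does not by itself show how much of that time is useful folding rather than kernel overhead.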
Edit: I just ran a quick Google search and I was correct. Your %sy and %ni are the percentages of CPU time spent executing at the system (kernel) level and at the user level with nice priority, respectively.
Your CPU is running F@H fine - Linux is just having difficulty with where to run everything.
2nd edit - thanks Bruce
Re: High System CPU Usage on WUs (7611 mostly)
I wouldn't take the FPUs in Bulldozer lightly or compare them to the Phenom II X6 FPUs. Phenom II FPU != FX FPU. The FPUs in the FX are strong FPUs that I have benchmarked at somewhere close to double the floating-point capacity of the Phenom II X6. The FX has four 256-bit FPUs that are capable of running eight 128-bit FPU instructions at the same time; therefore, it is very similar to what Intel does with their cores, except that the FX has eight real ALUs and four strong FPUs versus four strong ALUs and four strong FPUs. I also assure you that this issue only happens with Folding@Home and nothing else. Even running a floating-point-heavy benchmark debunks the theory that Linux and Bulldozer are unable to keep up because of the shared FPUs. WCG is floating-point heavy as well, as most scientific apps are. I'm fine with 5% being lost due to how difficult it is to code SMP, but when 40-50% is lost due to issues with how SMP is done with these WUs, there is obviously something wrong.
Re: High System CPU Usage on WUs (7611 mostly)
I didn't say BD couldn't keep up; I just said that Linux was maybe having issues running things in the correct user space. What frame times are you getting for a selection of projects? That will be the real indicator of how good your machine is. According to the top report that you posted, your CPU is running F@H 99.125% of the time (793 out of a possible 800).
The Linux guys are already reporting much better performance than Windows; however, that does not mean that this cannot be improved upon even further. Perhaps the cores do need a recompile, but so far the PPD that you state in your sig is nearly the same as an i7-2600K's. Not bad for a brand-new CPU architecture that has been on general release for 4 weeks.
-
- Posts: 2522
- Joined: Mon Feb 16, 2009 4:12 am
- Location: Greenwood MS USA
Re: High System CPU Usage on WUs (7611 mostly)
mmstick wrote: The FX has four 256-bit FPUs that is capable of running eight 128-bit FPU instructions at the same time, therefore, it is very much identical to what Intel does with their cores, only FX has eight real ALUs and four strong FPUs versus four strong ALUs and four strong FPUs.
If you have 8 ALUs instead of 4 ALUs, won't 4 (~50%) be waiting for an FPU most of the time?
...
mmstick wrote: but when 40-50% is lost due to issues with how SMP is done with these WUs there is obviously something wrong.
It would be interesting to use SMP -4 and see if this 'problem' disappears.
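The question of whether half of the FX's cores would wait on an FPU can be put into a toy model (an illustration only: it ignores pipelining, instruction latency, and the integer work interleaved with FP code). Each FP-heavy thread wants to issue one FP instruction per cycle, and the chip offers a fixed number of issue slots: 4 if each FPU takes one thread's instruction at a time, 8 if, as described above, each 256-bit FPU can run two 128-bit instructions at once. The function name is hypothetical.

```python
def waiting_fraction(fp_threads: int, fp_issue_slots: int) -> float:
    """Toy model: fraction of FP-heavy threads stalled in a cycle when each
    thread wants one FP instruction and only `fp_issue_slots` can issue."""
    if fp_threads < 1 or fp_issue_slots < 0:
        raise ValueError("need at least one thread and a non-negative slot count")
    return max(0.0, 1.0 - fp_issue_slots / fp_threads)


# (threads, issue slots) under the two readings of the FX's FPUs.
scenarios = {
    "8 threads, 4 FPUs taking one instruction each": (8, 4),
    "8 threads, 4 FPUs taking two 128-bit instructions each": (8, 8),
    "4 threads, 4 FPUs taking one instruction each": (4, 4),
}
for label, (threads, slots) in scenarios.items():
    print(f"{label}: {waiting_fraction(threads, slots):.0%} of threads waiting")
```

Under the first reading roughly half the threads stall, matching the ~50% guess; under the second none do; and dropping to four threads removes the contention in either case, which is what the suggested four-thread experiment would test.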
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
Re: High System CPU Usage on WUs (7611 mostly)
WCG is FPU intensive? Does it heat up the CPUs like FAH does?
If this is an SMP problem, then the Intel chips should have the same problem, right? So do they?
As a double check, does the BD perform at the same speed as i7s in WCG? If, as you claim, the BD and i7 are so close in architecture and performance, that should be easy to show.
And I've seen the marketing info on the "strong FPUs" taking in double the bits, but if that were happening, BD would be producing more than double the points of i7s in FAH, and they do not.
So there is a bottleneck somewhere. But without more comparative data, there is no way to be sure where the problem is. SMP SSE code? AMD's first attempt at hyperthreading as compared to Intel's seasoned solution? Differences in the OS schedulers? Let's not point fingers without some hard numbers to back it up.
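One way to gather harder numbers on Linux is to read the cumulative user and system CPU time of the FahCore process itself from /proc/<pid>/stat (fields 14, utime, and 15, stime, in clock ticks, as documented in proc(5)) and compare runs on the FX, the X6, and an i7, or with different thread counts. A minimal sketch with hypothetical helper names; the PID is whatever top shows for FahCore_a4 (1385 in the first post):

```python
import os


def parse_utime_stime(stat_text: str) -> tuple[int, int]:
    """Return (utime, stime) in clock ticks from the text of /proc/<pid>/stat."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')', so split only after the LAST ')'. The remainder starts at field 3.
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    return int(rest[11]), int(rest[12])  # fields 14 (utime) and 15 (stime)


def read_times(pid: int) -> tuple[int, int]:
    """Sample (utime, stime) for a running process."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_utime_stime(f.read())


def system_share(before: tuple[int, int], after: tuple[int, int]) -> float:
    """Fraction of the process's CPU time between two samples spent in the kernel."""
    d_user = after[0] - before[0]
    d_sys = after[1] - before[1]
    total = d_user + d_sys
    return d_sys / total if total else 0.0


def ticks_to_seconds(ticks: int) -> float:
    """Convert clock ticks to seconds (Unix only)."""
    return ticks / os.sysconf("SC_CLK_TCK")
```

Sampling read_times() a few seconds apart and passing both samples to system_share() gives a per-process counterpart to top's %sy; repeating it across the machines or thread counts in question would help separate an FX-specific effect from a general SMP one.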
-
- Posts: 450
- Joined: Tue Dec 04, 2007 8:36 pm
Re: High System CPU Usage on WUs (7611 mostly)
Gromacs optimizations use SSE instructions to perform 4 single-precision operations simultaneously. That's 4x32-bit operations performed by single instructions. I'm not sure if that counts as one 128-bit instruction or not. Quoting sales figures for other types of operations doesn't tell us how Gromacs will perform.
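On the bit-counting question: an SSE register is 128 bits wide and holds four 32-bit single-precision floats, so one packed SSE instruction (ADDPS, for example) performs four single-precision operations as a single 128-bit operation; a 256-bit AVX register holds eight. The Python below only does that arithmetic and emulates the lane-wise semantics; it is not how Gromacs is implemented and says nothing about how fast Bulldozer executes these instructions. The names are hypothetical.

```python
import struct

SSE_REGISTER_BITS = 128   # width of an SSE (XMM) register
AVX_REGISTER_BITS = 256   # width of an AVX (YMM) register
FLOAT32_BITS = 32         # one IEEE-754 single-precision value


def lanes(register_bits: int, element_bits: int = FLOAT32_BITS) -> int:
    """How many elements one packed instruction operates on."""
    return register_bits // element_bits


def packed_add_ps(a: bytes, b: bytes) -> bytes:
    """Emulate the semantics of a 128-bit packed single-precision add (like
    SSE's ADDPS): four float32 lanes added element-wise in one operation."""
    xs = struct.unpack("<4f", a)
    ys = struct.unpack("<4f", b)
    # Python adds in double precision; packing rounds each lane back to float32.
    return struct.pack("<4f", *(x + y for x, y in zip(xs, ys)))


a = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
b = struct.pack("<4f", 0.5, 0.5, 0.5, 0.5)
print(lanes(SSE_REGISTER_BITS), lanes(AVX_REGISTER_BITS))
print(struct.unpack("<4f", packed_add_ps(a, b)))
```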
JimboPalmer wrote: If you have 8 ALUs instead of 4 ALUs, won't 4 (~50%) be waiting for a FPU most of the time?
When running FAH, yes, that's true. When running "normal" applications that only need an ALU, the FPUs will be idle.
JimboPalmer wrote: but when 40-50% is lost due to issues with how SMP is done with these WUs there is obviously something wrong.
...
It would be interesting to use SMP -4 and see if this 'problem' disappears.
In fact, you could run -smp 4 concurrently with clients for 4 ATi GPUs. The ATi GPU code uses 100% of one ALU, which is terribly wasteful compared to what nvidia does, but with four free ALUs, who cares? [This trick was discovered on the i7, which also has 8 ALUs plus 4 FPUs.]
Re: High System CPU Usage on WUs (7611 mostly)
OP, have you patched your Linux kernel with the F15h IC (cache) aliasing patch?
http://www.phoronix.com/scan.php?page=a ... sing&num=1
-
- Site Admin
- Posts: 7937
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2 - Location: W. MA
Re: High System CPU Usage on WUs (7611 mostly)
Given prior experience with the differences between AMD and Intel in how they implement their similar CPU features, I suspect that taking full advantage of the FX's FPUs will require different compiler flags, and possibly tweaks to the code. How well this works out for folding depends on how those fit in with the core developers' set of tools, and on the need to maintain compatibility with other architectures. Given how recent the FX is, would they even have access to a setup with one to test yet?
iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3