Re: Enormous job: how did I get it, how can I fix it?
Posted: Wed Jun 03, 2020 5:13 pm
by JimboPalmer
It starts by reserving one thread for the GPU, all right:
19:38:12:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Sony\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah/FahCore_a7.exe -dir 00 -suffix 01 -version 706 -lifeline 8728 -checkpoint 15 -np 7
The -np 7 at the end means the number of processors is 7.
However, the CPU Core_a7 hates thread counts that are prime numbers over 3 (or that have prime factors over 3):
19:38:13:WU00:FS00:0xa7:Reducing thread count from 7 to 6 to avoid domain decomposition by a prime number > 3
19:38:13:WU00:FS00:0xa7:Calling: mdrun -s frame30.tpr -o frame30.trr -x frame30.xtc -cpt 15 -nt 6
Now, again at the end of the line (-nt 6), we see that the number of threads is 6.
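Roughly, the core walks the thread count down until its largest prime factor is 3 or less. A quick Python sketch of that rule (my own illustration, not FahCore_a7's actual code):

def largest_prime_factor(n: int) -> int:
    """Largest prime factor of n (n >= 2), by trial division."""
    p, largest = 2, 1
    while n > 1:
        if n % p == 0:
            largest = p
            n //= p
        else:
            p += 1
    return largest

def usable_thread_count(requested: int) -> int:
    # Step down until no prime factor exceeds 3 (7 -> 6, 5 -> 4, ...).
    nt = requested
    while nt > 3 and largest_prime_factor(nt) > 3:
        nt -= 1
    return nt

print(usable_thread_count(7))  # 6, matching the log lines above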
Re: Enormous job: how did I get it, how can I fix it?
Posted: Wed Jun 03, 2020 5:19 pm
by Rwolf01
>>> If your WU passes Timeout, the Work Server reissues the WU to the next folder.
Good to know. If it is going to duplicate the work anyway, it sounds like I should dump the job as soon as I see it is going to run long. (Hopefully that will get the job assigned to a faster GPU quicker, too.)
So I'll go ahead and terminate that other job and restart with a different one.
Re: Enormous job: how did I get it, how can I fix it?
Posted: Wed Jun 03, 2020 5:43 pm
by Rwolf01
>> Now we see that the number of threads is 6 again at the end of the line.
Thanks for the tutorial. I (sort of) understand... (but you gave me something cool to Google and read).
I was just going by all 8 CPU graphs being fairly full, but I guess the 6 threads hop around between the cores to create that impression.
So I'm losing 1/4 of the CPU threads to support the GPU, not 1/8. But the PC still has other OS stuff to do, so I imagine having one thread free for that work improves the uptime of the other 6, partially making up for having lost one of them?
But even if it fully doubles the price I pay to support the GPU, it doesn't alter the conclusion, unless there is an "exchange rate" between the value of CPU and GPU points which I don't know about yet.
Re: Enormous job: how did I get it, how can I fix it?
Posted: Wed Jun 03, 2020 6:09 pm
by JimboPalmer
I was just trying to explain why I was more optimistic than you about the gains from CPU-only folding.
I have much older and slower CPUs than you and can get 11,000 PPD with them. (My total is like 950k PPD, but I have multiple 'new' midrange desktop GPUs)
Re: Enormous job: how did I get it, how can I fix it?
Posted: Wed Jun 03, 2020 6:59 pm
by Rwolf01
I appreciate the education. It's also clear that GPUs have improved at a faster rate than CPUs over the last 10 years or so, so there is a much wider dynamic range of hardware on the GPU side.
It has convinced me to spring for a new graphics card for my old HP home PC.
Say, is there a way to deliberately constrain the number of CPU threads? My husband has a late-model laptop with a lot of excess capacity, but I know he would hate the lag of fully utilizing it. I'm thinking I could use 6 or 8 of the 12 threads plus the GPU for F@H and he'd still have a tolerably responsive user experience.
Re: Enormous job: how did I get it, how can I fix it?
Posted: Wed Jun 03, 2020 7:08 pm
by peterjammo
Rwolf01 wrote:>>> If your WU passes Timeout, the Work Server reissues the WU to the next folder.
Good to know. If it is going to duplicate the work anyway, it sounds like I should dump the job as soon as I see it is going to run long. (Hopefully that will get the job assigned to a faster GPU quicker, too.)
So I'll go ahead and terminate that other job and restart with a different one.
I'm not 100% sure, but I think the problem with dumping a WU that's not going to make Timeout is that a dumped WU is not reported back to the work server, so the server waits for Timeout to reissue it anyway. If you're going to complete between Timeout and Expiry, it's worth completing, as the reissued WU may go to another slow GPU, or fail, perhaps more than once. In either of those circumstances your slow WU, if finished successfully, could still be the fastest one back by a worthwhile margin.
I'm folding with a variety of slow CPU-only machines, but they are all capable of almost always meeting Timeout. If they couldn't, I would sideline them too. If there were still significant periods of difficulty in picking up CPU WUs, I'd be thinking twice about folding with the slowest ones.
From what I've seen posted, it seems there are still WU supply problems on the GPU side, so folding with a slow GPU seems likely to hold things up rather than help. On the other hand, despite possibly gaining fewer points, adding an extra 2 cores to your decent-spec CPU slot seems to me a better idea for now. I strongly suspect the current points system may be offering perverse incentives in some cases.
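In other words, the decision hinges on Expiry, not Timeout. A tiny Python sketch of that reasoning (my own illustration; the client performs no such calculation itself):

def worth_finishing(eta_days: float, days_to_expiry: float) -> bool:
    # Dumping is never reported back to the work server, so the WU
    # waits out the full Timeout before reissue either way. The only
    # question is whether your result can still arrive before Expiry,
    # in which case it may well beat the reissued copy home.
    return eta_days <= days_to_expiry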
Re: Enormous job: how did I get it, how can I fix it?
Posted: Wed Jun 03, 2020 7:27 pm
by Joe_H
If you set a specific value for the number of CPU threads using Configure in FAHControl, the client will always use that number. By default there is a '-1' there, which leaves control to the software and the slider position.
The slider position of Light corresponds to using half the available CPU threads when that '-1' is in the configuration for the CPU slot.
In any case, the CPU folding runs at the lowest priority, so usually does not cause lag as it releases resources to higher priority processes. Most lag comes from GPU folding on lower power cards as each access to the GPU needs to wait for the previous item to reach a pause point.
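For example, pinning a CPU slot to 6 threads in the client's config.xml looks something like this (a sketch; the slot id and surrounding options will differ on your machine, so check your own config.xml):

<config>
  <!-- CPU slot pinned to 6 threads; -1 would restore slider control -->
  <slot id='0' type='CPU'>
    <cpus v='6'/>
  </slot>
</config>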
I would like to address this:
Rwolf01 wrote:>> Is it worth folding with a very slow GPU: I'd say no.
That's a straw man argument. I am not convinced the GPU is that slow. (Certainly there are faster ones, but I'm comparing the CPU and GPU on their ability to earn points. Assuming the points are a fair measure of the computational value of the work, it's a good metric.)
Under normal conditions and project deadlines, I would agree with you that this is a bit of a straw man argument. Timeouts used to be longer before a WU would be reissued, and lower-powered GPUs would finish within them. Some small delay on one trajectory or another was expected in projects that might run six months to a year or more before the researcher had collected all of the data needed to analyze it and either write a paper or use it as part of their PhD dissertation.
With the COVID-19 related research, the timeout and deadline values are shorter than for prior work. The researchers are trying to complete the work in a much shorter time so it can be analyzed towards identifying possible usable binding sites and drugs to bind to them. The shorter time frame changes how useful slow versus fast GPUs are.
P.S. Ending up with duplicate WU returns is not as useless as some make it out to be. I know of at least one researcher who has posted elsewhere that they do occasional comparisons of duplicates to see if they actually are the same. I would not be surprised if others do the same just to keep a check on the software's computational behaviour.
Re: Enormous job: how did I get it, how can I fix it?
Posted: Wed Jun 03, 2020 10:05 pm
by bruce
The i7-3612QM has 8 threads (FAH calls them CPUs). A reasonably intelligent OS will first assign FAH's threads to hardware with separate FPUs (sometimes incorrectly called "real CPUs") and then begin assigning tasks to CPU threads which share an FPU with a previous hardware thread (sometimes incorrectly called "virtual CPUs"). Performance therefore scales more rapidly across the first half of the threads than across the second half.
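You can see that split on any machine; a quick Python sketch (assumes the third-party psutil package is installed):

import psutil

# Physical cores each have their own FPU; the logical count includes
# the hyperthreaded siblings that share one.
print(psutil.cpu_count(logical=False))  # 4 on an i7-3612QM
print(psutil.cpu_count(logical=True))   # 8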
Re: Enormous job: how did I get it, how can I fix it?
Posted: Thu Jun 04, 2020 4:53 am
by PantherX
Rwolf01 wrote:...The CPU has a job worth 525 base points (2449 estimated) that is 26% done after 1:29:18. Converting that to points/day, I see the CPU (which is keeping all 8 threads busy, btw) as being worth 2201 base points/day and 10,268 estimated points/day.
The GPU on the same machine has a job worth 2500 base points (5368 estimated) that is 86% done after 6:41:39. That works out to 7708 base points/day or 16,551 estimated.
If it only costs me 1 thread out of 8 to support the GPU, I am paying 275 points/day to get 7700. That seems like a no brainer. Doing the same math on extended points says I'm paying 1280 points to get 16,550. Also a clear win...
Mathematically, it might make sense, but allow me to introduce a teeny tiny wrench in there... a comparison of CPU points to GPU points isn't a fair one.
CPUs are able to perform serial tasks and highly specialized calculations; GPUs are able to perform parallel tasks and (comparatively) simple calculations. The points are an indicator of how much science a CPU/GPU is doing, and the scientific value of CPU and GPU work is exactly the same (in other words, both are equally valuable to researchers). Using points was an easier way for an average Donor to see how much scientific contribution their system was providing.
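For reference, the PPD figures quoted above are straight linear extrapolations; here is that arithmetic as a small Python sketch (my own reconstruction of the quoted numbers):

def points_per_day(points: float, fraction_done: float, elapsed_hours: float) -> float:
    """Extrapolate the credit earned so far to a 24-hour rate."""
    return points * fraction_done * 24 / elapsed_hours

# CPU WU from the quote: 525 base / 2449 estimated, 26% done in 1:29:18
cpu_hours = 1 + 29/60 + 18/3600
print(points_per_day(525, 0.26, cpu_hours))    # ~2201 base PPD
print(points_per_day(2449, 0.26, cpu_hours))   # ~10268 estimated PPD

# GPU WU: 2500 base / 5368 estimated, 86% done in 6:41:39
gpu_hours = 6 + 41/60 + 39/3600
print(points_per_day(2500, 0.86, gpu_hours))   # ~7708 base PPD
print(points_per_day(5368, 0.86, gpu_hours))   # ~16551 estimated PPD

# Cost of reserving 1 of the 8 CPU threads for the GPU:
print(points_per_day(525, 0.26, cpu_hours) / 8)    # ~275 base PPD
print(points_per_day(2449, 0.26, cpu_hours) / 8)   # ~1283 estimated PPD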
Re: Enormous job: how did I get it, how can I fix it?
Posted: Thu Jun 04, 2020 8:14 am
by HugoNotte
Rwolf01 wrote:>>> If your WU passes Timeout, the Work Server reissues the WU to the next folder.
Good to know. If it is going to duplicate the work anyway, it sounds like I should dump the job as soon as I see it is going to run long. (Hopefully that will get the job assigned to a faster GPU quicker, too.)
So I'll go ahead and terminate that other job and restart with a different one.
If you keep dumping WUs because they seem likely to exceed the timeout, you end up requesting WUs more often, and you will probably dump some of those again too.
To the project servers it makes no difference whether you let the WU finish or dump it; there is no further communication between your computer and the server. The server only notices that the WU has not been returned by the timeout and then puts it back into the queue to be sent out again.
If you poll for WUs more often by dumping large ones, you add to the strain on the servers. By chance you will also receive more large WUs, which you again won't finish and which will have to be reissued again (after the timeout).
So the practice of regularly dumping WUs is actually quite bad for the entire system. In my opinion, we are here to help science, not to cherry-pick the WUs that seem to gain the most points.
As to whether your GPU is slow, it's all relative. Any mid-range GPU of a given generation/age is a lot faster, and earns more points, than a CPU of the same age/generation. Your CPU's integrated GPU would even be faster than the CPU it's attached to. If your GPU is not able to finish all the WUs it receives by the timeout, that's a clear indication it is very much at the bottom end of the performance spectrum. FAH WUs are certainly not sized so that only top-of-the-range hardware can cope.
After all, a lower-mid-range graphics card from six years ago is still able to complete all the WUs it receives before the timeout and earn the QRB (quick return bonus).
Re: Enormous job: how did I get it, how can I fix it?
Posted: Sun Oct 04, 2020 11:16 pm
by midhart90
I will point out that it would be useful to be able to configure the client to accept WUs only from projects with at least a specified minimum timeout. I noticed that I kept getting dealt WUs with 1-day timeouts, only to have them finish in 1.3-1.5 days (depending on how much I use the computer for other things), while there are plenty of other active projects with longer timeouts.
Just now, I got a WU that appears to be about the same size as the previous ones, but this one has a 2-day timeout, meaning it will actually finish on time (with quite a bit to spare, at that!).
I'd rather be able to have the client reject WUs at the outset (and communicate this to the server so the WU in question can be reassigned immediately) if they have no reasonable chance of finishing on time, as opposed to the current situation where it continues in vain and the WU gets duplicated, often just a few hours before finishing.
Re: Enormous job: how did I get it, how can I fix it?
Posted: Mon Oct 05, 2020 4:22 am
by PantherX
Welcome to the F@H Forum midhart90,
If you can describe your hardware and use-case, we might be able to discuss this with the researcher and see what we can do.
Re: Enormous job: how did I get it, how can I fix it?
Posted: Mon Oct 05, 2020 7:59 am
by aetch
GPUs have the species list to break down the capabilities and rough performance of each card.
Why not do something similar for processors?
Re: Enormous job: how did I get it, how can I fix it?
Posted: Mon Oct 05, 2020 8:36 am
by PantherX
The process currently in place for GPUs was fit for purpose ~10 years ago. However, it is no longer ideal, so it is being overhauled.
When it comes to CPUs, it is trickier, since you can always specify X CPU threads but you can't do the same for a GPU. Moreover, there are a lot more CPU models than GPU models, so manually classifying them isn't feasible. On top of that, CPUs see much more varied use than GPUs, since far more applications run on them, which makes classifying any given CPU even harder.
Maybe once the automated classification of GPUs is in production and working without issues, something similar can be done for CPUs. That would be harder than for GPUs, so it is logical to start with the easier task and then move on to the harder one.
Re: Enormous job: how did I get it, how can I fix it?
Posted: Mon Oct 05, 2020 9:17 am
by hnapel
Rwolf01 wrote:
1: How do I kill a job? (and properly report to the mother-ship that I am punting it, so it can get reassigned ASAP)
One method is to remove the slot(s) that can process it (then add them back later). When the F@H client detects it has no resources to process a WU, it will report the WU back as failed (or similar), so it can be sent to another volunteer.
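A sketch of that using FAHClient's local command socket (default port 36330; the slot-delete / slot-add / save commands are from the v7 third-party API docs, so confirm them with "help" in a telnet session on your build):

import socket

def fah_command(*commands: str, host: str = "127.0.0.1", port: int = 36330) -> None:
    # Fire-and-forget: send each command on its own line and ignore replies.
    with socket.create_connection((host, port)) as s:
        for cmd in commands:
            s.sendall(cmd.encode() + b"\n")

fah_command("slot-delete 01", "save")  # remove the slot holding the stuck WU
# ...later, to bring the slot back:
fah_command("slot-add cpu", "save")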