
Suggestion [or how to eliminate slack time between WUs?]

Posted: Sat Sep 13, 2008 6:00 pm
by fridgepants
I have noticed that once a work unit completes, there is dead time with no GPU utilization until the results are either sent or put into the queue. It seems to average about 10 minutes of no work occurring every time a WU ends. Though it matters less with these new, longer WUs, is there any reason you can't have overlapping GPU work and result sending?

Interestingly, if I shut down and then start up FAH mid-unit, exactly what I'm suggesting happens: the WU continues to progress while results are sent concurrently.

This is what I like to see:

[17:41:44] Completed 45%
[17:42:18] - Couldn't send HTTP request to server
[17:42:18] + Could not connect to Work Server (results)
[17:42:18] (171.64.65.103:8080)
[17:42:18] + Retrying using alternative port
[17:43:39] Completed 46%
[17:44:01] - Couldn't send HTTP request to server
[17:44:01] + Could not connect to Work Server (results)
[17:44:01] (171.64.65.103:80)
[17:44:01] - Error: Could not transmit unit 01 (completed September 13) to work server.
[17:44:01] - Read packet limit of 540015616... Set to 524286976.


[17:44:01] + Attempting to send results [September 13 17:44:01 UTC]
[17:44:02] - Couldn't send HTTP request to server
[17:44:02] (Got status 503)
[17:44:02] + Could not connect to Work Server (results)
[17:44:02] (171.64.122.86:8080)
[17:44:02] + Retrying using alternative port
[17:44:02] - Couldn't send HTTP request to server
[17:44:02] (Got status 503)
[17:44:02] + Could not connect to Work Server (results)
[17:44:02] (171.64.122.86:80)
[17:44:02] Could not transmit unit 01 to Collection server; keeping in queue.
[17:44:02] Project: 4744 (Run 8, Clone 59, Gen 2)
[17:44:02] - Read packet limit of 540015616... Set to 524286976.


[17:44:02] + Attempting to send results [September 13 17:44:02 UTC]
[17:44:37] - Couldn't send HTTP request to server
[17:44:37] + Could not connect to Work Server (results)
[17:44:37] (171.64.65.103:8080)
[17:44:37] + Retrying using alternative port
[17:45:33] Completed 47%
[17:47:25] + Results successfully sent
[17:47:25] Thank you for your contribution to Folding@Home.
[17:47:25] + Number of Units Completed: 68

[17:47:28] Completed 48%
[17:49:23] Completed 49%

Re: Suggestion

Posted: Sun Sep 14, 2008 9:54 pm
by slavas
Seems more like an ISP/internet connection issue than FAH, or it was some temporary issue with the server; in that case there's nothing you can do :)

Here everything looks normal:

Code:

[20:06:02] Completed 100%
[20:07:02] 
[20:07:02] Finished Work Unit:
[20:07:02] - Reading up to 30216 from "work/wudata_00.trr": Read 30216
[20:07:02] trr file hash check passed.
[20:07:02] - Reading up to 1940412 from "work/wudata_00.xtc": Read 1940412
[20:07:02] xtc file hash check passed.
[20:07:02] edr file hash check passed.
[20:07:02] logfile size: 67676
[20:07:02] Leaving Run
[20:07:03] - Writing 2039376 bytes of core data to disk...
[20:07:03] Done: 2038864 -> 1969384 (compressed to 96.5 percent)
[20:07:03]   ... Done.
[20:07:04] - Shutting down core 
[20:07:04] 
[20:07:04] Folding@home Core Shutdown: FINISHED_UNIT
[20:07:08] CoreStatus = 64 (100)
[20:07:08] Sending work to server
[20:07:08] Project: 4742 (Run 3, Clone 67, Gen 10)
[20:07:08] - Read packet limit of 540015616... Set to 524286976.


[20:07:08] + Attempting to send results [September 14 20:07:08 UTC]
[20:07:16] + Results successfully sent
[20:07:16] Thank you for your contribution to Folding@Home.
[20:07:16] + Number of Units Completed: 111

[20:07:21] - Preparing to get new work unit...
[20:07:21] + Attempting to get work packet
[20:07:21] - Connecting to assignment server
[20:07:22] - Successful: assigned to (171.64.65.103).
[20:07:22] + News From Folding@Home: GPU folding beta
[20:07:22] Loaded queue successfully.
[20:07:25] + Closed connections
[20:07:25] 
[20:07:25] + Processing work unit
[20:07:25] Core required: FahCore_11.exe
[20:07:25] Core found.
[20:07:25] Working on queue slot 01 [September 14 20:07:25 UTC]
[20:07:25] + Working ...
[20:07:25] 
[20:07:25] *------------------------------*
[20:07:25] Folding@Home GPU Core - Beta
[20:07:25] Version 1.10 (Tue Aug 12 10:03:11 PDT 2008)
[20:07:25] 
[20:07:25] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[20:07:25] Build host: amoeba
[20:07:25] Board Type: AMD
[20:07:25] Core      : 
[20:07:25] Preparing to commence simulation
[20:07:25] - Looking at optimizations...
[20:07:25] - Created dyn
[20:07:25] - Files status OK
[20:07:25] - Expanded 88131 -> 447304 (decompressed 507.5 percent)
[20:07:25] Called DecompressByteArray: compressed_data_size=88131 data_size=447304, decompressed_data_size=447304 diff=0
[20:07:25] - Digital signature verified
[20:07:25] 
[20:07:25] Project: 4747 (Run 1, Clone 5, Gen 26)
[20:07:25] 
[20:07:25] Assembly optimizations on if available.
[20:07:25] Entering M.D.
[20:07:32] Working on p4747_lam5w_300K
[20:07:32] Client config found, loading data.
[20:07:32] Starting GUI Server

Re: Suggestion

Posted: Sun Sep 14, 2008 11:41 pm
by fridgepants
Hmmm. Maybe the duration I usually see is unique to me, which would make this somewhat of a moot point. As I said before, my suggestion is simply to overlap the sending of results with GPU utilization, since that seems a more efficient use of resources. If other people aren't seeing a noticeable pause, maybe it's not a big deal.

Re: Suggestion

Posted: Mon Sep 15, 2008 5:34 am
by db597
I have the same problem. My GPU is usually idle for 2-15 minutes between WUs, though sometimes it can be idle for much longer if it's unsuccessful at uploading the WU. This is already a big improvement since 6.20r1 came out - prior to that I used to have idle periods of a few hours!

The best solution is to have a WU buffer. If PG is worried, they could monitor clients and provide the buffer only to those found to complete units in less than 80% of the time to the deadline.

Re: Suggestion

Posted: Mon Sep 15, 2008 6:14 am
by 7im
Thank you for the suggestion.

FYI, there has always been lag time between work units, longer on some client types than others. Yes, figuring out a way to lessen the lag would improve the performance of the project. However, this has been a perpetual suggestion over the many years of the project, which tells me there are reasons Stanford has not changed this behavior yet. I do know that caching multiple work units to reduce the lag does not improve throughput, due to the serial nature of work units. I don't know all the other reasons, and I'm not here to debate this topic again, just to say that I know Pande Group is aware of the behavior and may change it as needed to best improve the project. Sorry if this seems like a brush-off; it's not intended that way. I just thought it would help you to know a little background.

Re: Suggestion

Posted: Mon Sep 15, 2008 7:47 am
by MstrBlstr
The main issue is the way the AS (assignment server) logic is set up. The AS expects you to return the last unit it gave your system. When that does not occur, the AS has no way of knowing what happened to the unit it sent you, so it will resend you the same unit for processing.

So you see, a unit has to do one of the following:
  • Upload successfully
  • EUE
  • Get placed in your queue
BEFORE the client can tell the AS what is going on, so it knows how to proceed. If the client tries to get a new unit before one of those three things occurs, the AS will assume that something went seriously wrong with the unit and will reassign you the same (PRCG) unit again.
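
In pseudo-form, the AS decision looks something like this (a Python sketch; the names, statuses, and structure are all invented for illustration, not Stanford's actual code):

Code:

# Sketch of the assignment-server logic described above; everything
# here is hypothetical, invented only to illustrate the three outcomes.
ACCOUNTED_FOR = {"uploaded", "eue", "queued"}   # the three outcomes listed above

def assign_work(client_record, fresh_units):
    """Decide which unit to hand a client that is asking for work."""
    last = client_record.get("last_assigned")    # PRCG of the previous unit
    status = client_record.get("last_status")    # what the client reported back
    if last is not None and status not in ACCOUNTED_FOR:
        # The previous unit was never accounted for, so the AS assumes
        # something went seriously wrong and reassigns the same (PRCG) unit.
        return last
    return fresh_units.pop(0)                    # otherwise, hand out fresh work

# Example: a client that asks for work mid-unit gets the same PRCG back.
client = {"last_assigned": "p4744 (R8, C59, G2)", "last_status": "in_progress"}
print(assign_work(client, ["p4747 (R1, C5, G26)"]))   # -> p4744 (R8, C59, G2)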

So you can see why parallel upload/download or download/upload of work units is not practical for FAH.

Re: Suggestion

Posted: Mon Sep 15, 2008 8:36 am
by osgorth
What could be done is that when the client finishes a unit, it sends a message to the server saying the unit has been completed successfully.
Then the client could pull a new unit right away, while spawning a background process that uploads the finished unit simultaneously. When the upload completes, credit is given and the background process just sits back and waits for the next finished unit.
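
Something like this, very roughly (a Python sketch; every function name here is made up, not the real client's internals):

Code:

# Rough sketch of the proposed flow; all helpers are placeholders.
import threading

def notify_completed(wu):                  # placeholder: tell the server the WU finished OK
    print(f"reporting {wu} complete")

def upload_results(wu):                    # placeholder: push the result files
    print(f"uploading {wu} in the background")

def download_new_wu():                     # placeholder: fetch the next unit
    return "next-wu"

def start_folding(wu):                     # placeholder: hand the unit to the core
    print(f"GPU working on {wu}")

def on_wu_finished(wu):
    notify_completed(wu)                             # server learns the WU is done
    threading.Thread(target=upload_results,          # upload runs in the background...
                     args=(wu,)).start()
    start_folding(download_new_wu())                 # ...while the GPU gets new work right away

on_wu_finished("wu-01")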

This would certainly help people with slower internet connections, as some of the results are pretty large and take some time to upload.
I think the worst that could happen is that results are delayed by a couple of hours. The benefit is that if there are upload problems, at least work is being done at the same time.
And if the client tells the servers that the unit has been successfully completed, it shouldn't cause any problems science-wise either - merely a delay of a couple of hours at most.

What do you think?

Re: Suggestion

Posted: Mon Sep 15, 2008 9:13 am
by MstrBlstr
So what you are saying is for the client to first queue the finished work unit for upload, request a new unit (telling the AS that the prior unit is awaiting upload in the queue), download the new unit, then upload the queued unit after the new unit starts processing.

I suppose that it is possible, but I am not sure that PG has an easy way to implement it.

As it stands, the client would try to upload the queued unit at the six-hour mark, so that would be a six-hour delay. Not that I think that is an issue, or they would have made it try at a more frequent interval. I am not sure whether they have a way to make the client attempt an auto-send at the start of a unit or not.

I will see if I can find out.

Re: Suggestion

Posted: Mon Sep 15, 2008 9:36 am
by P5-133XL
Is there a reason people can't run multiple clients with different machineIDs, all on the same GPU using the -gpu 0 flag? That way, when one is done, the others will still be processing ...

Re: Suggestion

Posted: Mon Sep 15, 2008 9:46 am
by MstrBlstr
Yes, but I will not go into the details. Other than to say that the GPU client is a strange beast, and that it is called a "high performance client" for a reason.

The method that you suggest is not even recommended for the CPU client.

Re: Suggestion

Posted: Mon Sep 15, 2008 10:07 am
by osgorth
MstrBlstr wrote:So what you are saying is for the client to first queue the finished work unit for upload, request a new unit (telling the AS that the prior unit is awaiting upload in the queue), download the new unit, then upload the queued unit after the new unit starts processing.
Yes, exactly. :)

That should be fairly simple to do; hopefully they'll see the benefits.

One very simple way to do it would be to have a separate thread that monitors the queue, uploads anything that has been completed, then waits an hour, checks again, and so on.
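
As a sketch (Python; the queue interface and upload helper are invented for illustration):

Code:

# Sketch of the monitor-thread idea; the queue layout is hypothetical.
import threading, time

def upload_results(wu):                     # placeholder for the real upload
    print(f"uploading {wu}")

def monitor_queue(queue, interval=3600):
    """Hourly pass: upload any finished results still sitting in the queue."""
    while True:
        for slot in queue:
            if slot["done"] and not slot["sent"]:
                try:
                    upload_results(slot["wu"])
                    slot["sent"] = True
                except OSError:             # server down or busy
                    pass                    # leave it queued; retry next pass
        time.sleep(interval)

# started once at client startup, e.g.:
# threading.Thread(target=monitor_queue, args=(work_queue,), daemon=True).start()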

But I think the safest way is if the client flags to the server when the unit has been completed, and then simply launches the upload in a separate thread while continuing on with the work.
That way the server will know that the current unit was completed successfully - and the client can safely get the next unit to work on.
What's required is the messaging system and perhaps some validation on the client side that the unit was indeed completed. The rest of the code is there already; it's just a matter of using it slightly differently.

Re: Suggestion

Posted: Mon Sep 15, 2008 2:25 pm
by 7im
Yes, but how much does delaying the return of a completed work unit (in order to download a new WU first) hurt the performance of the project? And does the few minutes of idle time you avoid more than make up for that delay?

With such a large focus on getting WUs completed and returned to Stanford faster and faster from the likes of SMP and GPU clients, I would offer a guess that because of the serial nature of work units, getting the WU back sooner is more helpful to the project than reducing the slack time between WUs.

Again, I'm not here to debate, just to say that in almost 7 years of the project, if ending this slack had been significantly helpful, Stanford would have changed it.

Re: Suggestion

Posted: Mon Sep 15, 2008 6:29 pm
by codysluder
I support this idea but do understand the limitations.
MstrBlstr wrote:So what you are saying is for the client to first queue the finished work unit for upload, request a new unit (telling the AS that the prior unit is awaiting upload in the queue), download the new unit, then upload the queued unit after the new unit starts processing.
osgorth wrote:One very simple way to do it would be to have a separate thread that monitors the queue, uploads anything that has been completed, then waits an hour, checks again, and so on.
That already exists, except it's set to six hours (probably because if a server is down or busy, it's not likely to be fixed that soon).
osgorth wrote:But I think the safest way is if the client flags to the server when the unit has been completed, and then simply launches the upload in a separate thread while continuing on with the work.
I think it's very simple: when a WU finishes, download a new WU first, then attempt the upload. Once the new WU has been downloaded, the client goes to work and the upload can proceed concurrently. For those with dial-up connections, this would normally still complete within a single connection. The only real requirement has already been mentioned: the download has to know whether the previous WU is ready to upload.
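
In sketch form (Python again; the pending-upload flag and helper names are hypothetical, not the real client's):

Code:

# Sketch of the download-first ordering; all names are invented.
import threading

def download_new_wu(pending_upload=None):
    # Hypothetical: the download request carries a flag telling the
    # server that the previous WU is queued and awaiting upload.
    print(f"downloading next WU (pending upload: {pending_upload})")
    return "next-wu"

def upload_results(wu):                     # placeholder
    print(f"uploading {wu}")

def start_folding(wu):                      # placeholder
    print(f"GPU working on {wu}")

def finish_unit(finished_wu):
    new_wu = download_new_wu(pending_upload=finished_wu)   # fetch work first
    start_folding(new_wu)                                  # GPU is busy again
    threading.Thread(target=upload_results,                # then upload concurrently
                     args=(finished_wu,)).start()

finish_unit("wu-01")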

Re: Suggestion

Posted: Mon Sep 15, 2008 7:58 pm
by Ren02
The upload and download can happen at the same time. This is already implemented for the case where you start the client and there are results to upload but a new WU has not been downloaded yet. In that case the client downloads a new WU and uploads the results simultaneously. Usually the download takes much less time, so the client is already happily processing while the upload continues in the background. No idea why Stanford went halfway on this and doesn't use this approach all the time.

Re: Suggestion

Posted: Mon Sep 15, 2008 8:53 pm
by codysluder
Ren02 wrote:Usually the download takes much less time, so the client is already happily processing while the upload continues in the background. No idea why Stanford went halfway on this and doesn't use this approach all the time.
My theory: they wanted to try the upload more than once if it failed. By sequentially uploading, then downloading, then re-uploading, there's enough of a pause that the server overload might have eased enough for the second upload to go through.
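
In other words, something like this (the attempt count and pause length are made-up numbers):

Code:

# Sketch of paced retries; the gap between attempts is the point,
# since it gives an overloaded server a chance to recover.
import time

def send_with_retry(send, payload, attempts=3, pause=60):
    """Try sequentially; pause between attempts so overload can clear."""
    for i in range(attempts):
        if send(payload):
            return True
        time.sleep(pause)       # e.g. do the download here instead of sleeping
    return False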

Also, for those poor souls who are still on dial-up, simultaneous upload and download are really, really slow compared to doing one at a time. (This could be enabled/disabled based on the client's knowledge of recent upload/download speeds, but now we're talking increased complexity.)