Lost Time

Moderators: Site Moderators, FAHC Science Team

noorman
Posts: 270
Joined: Sun Dec 02, 2007 2:26 pm
Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+
Location: Belgium, near the International Sea-Port of Antwerp

Re: Lost Time

Post by noorman »

bruce wrote:
noorman wrote:There 's a bug somewhere; they just don't deem it important enough to fix it.
Oh, come now. It's not a question of importance, it's a question of reproducibility. Every time they test their fix, it's going to work correctly but then when it gets out in the field, it's going to fail 1% of the time (or however often it fails now.) They can't fix a bug that doesn't happen when they test it.

If you can demonstrate a reproducible method to make this happen, they'd be glad to fix it -- and quickly, I suppose.
.


The immediate upload and almost simultaneous download happened every time I repaired the faulty queue.dat (Qfix twice, with a -delete x in between) ...

Is that reproducible enough ?
# SMP Client ##################################################################
###############################################################################

Folding@Home Client Version 6.02beta

http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/noorman/Folding@Home
Executable: ./fah6
Arguments: -smp -delete 02

[09:37:02] - Ask before connecting: No
[09:37:02] - User name: noorman (Team 734)
[09:37:02] - User ID: 48B83D25538777D9
[09:37:02] - Machine ID: 1
[09:37:02]
[09:37:03] Loaded queue successfully.
[09:37:03] Deleting work unit #2 from work queue...
[09:41:24] - Failed to delete the requested work unit

Folding@Home Client Shutdown.


--- Opening Log file [June 24 09:42:18]


# SMP Client ##################################################################
###############################################################################

Folding@Home Client Version 6.02beta

http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/noorman/Folding@Home
Executable: ./fah6
Arguments: -smp -verbosity 9

[09:42:18] - Ask before connecting: No
[09:42:18] - User name: noorman (Team 734)
[09:42:18] - User ID: 48B83D25538777D9
[09:42:18] - Machine ID: 1
[09:42:18]
[09:42:18] Loaded queue successfully.
[09:42:18] - Autosending finished units...
[09:42:18] Trying to send all finished work units


[09:42:18] + Attempting to send results
[09:42:18] - Reading file work/wuresults_02.dat from core
[09:42:18] (Read 5530530 bytes from disk)
[09:42:18] Connecting to http://171.64.65.56:8080/
[09:42:18] - Preparing to get new work unit...
[09:42:18] + Attempting to get work packet
[09:42:18] - Will indicate memory of 2014 MB
[09:42:18] - Detect CPU. Vendor: AuthenticAMD, Family: 15, Model: 3, Stepping: 2
[09:42:18] - Connecting to assignment server
[09:42:18] Connecting to http://assign.stanford.edu:8080/
[09:42:19] Posted data.
[09:42:19] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[09:42:19] + News From Folding@Home: Welcome to Folding@Home
[09:42:19] Loaded queue successfully.
[09:42:19] Connecting to http://171.64.65.56:8080/
[09:42:23] Posted data.
[09:42:23] Initial: 0000; - Receiving payload (expected size: 2444530)
[09:42:32] - Downloaded at ~265 kB/s
[09:42:32] - Averaged speed for that direction ~485 kB/s
[09:42:32] + Received work.
[09:42:32] + Closed connections
[09:42:32]
[09:42:32] + Processing work unit
[09:42:32] Core required: FahCore_a1.exe
[09:42:32] Core found.
[09:42:32] Working on Unit 03 [June 24 09:42:32]
[09:42:32] + Working ...
[09:42:32] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 03 -checkpoint 3 -verbose -lifeline 14561 -version 602'

[09:42:32]
[09:42:32] *------------------------------*
[09:42:32] Folding@Home Gromacs SMP Core
[09:42:32] Version 1.74 (November 27, 2006)
[09:42:32]
[09:42:32] Preparing to commence simulation
[09:42:32] - Ensuring status. Please wait.
[09:42:49] - Looking at optimizations...
[09:42:49] - Working with standard loops on this execution.
[09:42:49] - Previous termination of core was improper.
[09:42:49] - Going to use standard loops.
[09:42:49] - Files status OK
[09:42:50] - Expanded 2444018 -> 1290766- Starting from initial work packet
[09:42:50]
[09:42:50] Project: 2605 (Run 12, Clone 127, Gen 65)
[09:42:50]
[09:42:50] Entering M.D.
[09:42:50] ne 127, Gen 65)
[09:42:50]
[09:42:50] Entering M.D.
[09:42:57] les
[09:42:57] cal files
[09:42:57] in in POPC
[09:42:57] Writing local files
[09:42:57] Extra SSE boost OK.
[09:42:58] 0000 steps (0 percent)
[09:43:51] Posted data.
[09:43:51] Initial: 0000; - Uploaded at ~57 kB/s
[09:43:52] - Averaged speed for that direction ~57 kB/s
[09:43:52] + Results successfully sent
[09:43:52] Thank you for your contribution to Folding@Home.
[09:43:52] + Number of Units Completed: 2

[09:43:53] + Sent 1 of 1 completed units to the server
[09:43:53] - Autosend completed
[09:45:59] Timered checkpoint triggered.
[09:48:58] Timered checkpoint triggered.
[09:51:58] Timered checkpoint triggered.
[09:54:58] Timered checkpoint triggered.
[09:57:58] Timered checkpoint triggered.
[10:00:58] Timered checkpoint triggered.
[10:03:59] Timered checkpoint triggered.
[10:06:31] Writing local files
[10:06:31] Completed 5000 out of 500000 steps (1 percent)
I wouldn't have posted it if I hadn't seen it more than once; I'm a lifelong technician/electronics pro, and I know better than to report a single event as a bug.

ALSO: the 4 minute delay while running -delete (in the v6 client anyway) is reproducible too; it happened every time, and from the 2nd time on I timed it (4 times in total since the end of April 2008).
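One way to put a number on that delay is to read it straight from the client log timestamps. This is only a minimal sketch, assuming the log keeps the `[HH:MM:SS]` prefix shown in the excerpt above; the function name `pause_between` is made up for illustration and is not part of any FAH tool.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "[09:37:03] Loaded queue successfully."
STAMP = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s?(.*)$")

def pause_between(log_text, start_marker, end_marker):
    """Return the seconds between the first line containing start_marker
    and the next line containing end_marker, or None if either is missing."""
    start = None
    for line in log_text.splitlines():
        m = STAMP.match(line.strip())
        if not m:
            continue  # banner and blank lines carry no timestamp
        stamp = datetime.strptime(m.group(1), "%H:%M:%S")
        message = m.group(2)
        if start is None:
            if start_marker in message:
                start = stamp
        elif end_marker in message:
            delta = stamp - start
            if delta < timedelta(0):
                delta += timedelta(days=1)  # the log rolled past midnight
            return int(delta.total_seconds())
    return None

sample = """\
[09:37:03] Loaded queue successfully.
[09:37:03] Deleting work unit #2 from work queue...
[09:41:24] - Failed to delete the requested work unit
"""
print(pause_between(sample, "Deleting work unit", "Failed to delete"))  # 261
```

For the -delete run logged above this gives 261 seconds, i.e. 4 minutes 21 seconds.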

It is known that the client has a 4 minute wait after every WU it finishes (100%) before downloading a new WU ...


.
- stopped Linux SMP w. HT on [email protected] GHz
....................................
Folded since 10-06-04 till 09-2010
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Lost Time

Post by bruce »

noorman wrote:Is that reproducible enough ?
Probably. Next time one of my linux SMP clients gets to the end of a WU I'll try it.

Have you tried leaving out the -smp ? A deleted WU is a deleted WU.
Executable: ./fah6
Arguments: -delete 02
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Lost Time

Post by 7im »

Read the known bugs list. -delete xx probably doesn't even work right to begin with. It's like saying the nail is defective while trying to drive it with a broken hammer.

Hey look, no car analogy. :)
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.

Re: Lost Time

Post by noorman »

bruce wrote:
noorman wrote:Is that reproducible enough ?
Probably. Next time one of my linux SMP clients gets to the end of a WU I'll try it.

Have you tried leaving out the -smp ? A deleted WU is a deleted WU.
Executable: ./fah6
Arguments: -delete 02
Ho there !
It is not the WU that gets deleted, but the entry in the queue.dat !

The WU results are 100% OK and are not touched; it's the queue.dat file that is corrupt, in that it has a wrong Status value: it says 1 where it should say 0.
The 1 indicates that the WU still needs work, while it has actually been finished. Because of this error, it doesn't get sent to Stanford (of course).

I've just followed the sequence 7im pointed me to for fixing the problem with the old tools in a v6 client environment.
( see fixing tools for v6 ? )
leexgx
Posts: 25
Joined: Mon Dec 03, 2007 8:05 am
Hardware configuration: snip

Re: Lost Time

Post by leexgx »

But the delay is there when it wastes 1-30 mins trying to send a work unit, when it could be working on the next work unit while uploading the 1-30 MB file. It already does this when it fails to send a unit: it gives up, downloads the next work unit, and tries to send the unit later on. Why can't it just start the next one while it's uploading? Work unit turnaround would be faster if you think on a big scale. The client can do this, as it does when the upload server is down for that work unit.

Re: Lost Time

Post by noorman »

.

An hour or so ago, I watched a WU finish (100%).
FaH then tried to upload the results, and succeeded.
Then, normally, the wait cycle starts (4 minutes) ...

This time I just shut down FaH, then restarted it immediately; I immediately got sent a new WU !

So, it is possible ... (sending my system a new WU right after receiving the results of a previous one)



Re: Lost Time

Post by noorman »

.


I guess we'll keep on losing time at 4 mins * 250,000 CPUs per Folded WU, 24/7 ...


4 mins * 250,000 = 16,666.67 hrs = 694.44 days (or nearly 2 years) per WU finished by every one of those 250,000 CPUs
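The arithmetic can be checked with a few lines of Python. This is just a sketch; the 4 minute pause and the 250,000 CPU count are the round figures used in this post, not measured values.

```python
PAUSE_MINUTES = 4      # pause per finished WU, as reported in this thread
CPU_COUNT = 250_000    # round figure for active CPUs used in this post

total_minutes = PAUSE_MINUTES * CPU_COUNT
total_hours = total_minutes / 60
total_days = total_hours / 24
total_years = total_days / 365.25

print(f"{total_hours:,.2f} hours")   # 16,666.67 hours
print(f"{total_days:,.2f} days")     # 694.44 days
print(f"{total_years:.2f} years")    # 1.90 years
```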


That's a lot of Energy lost !
A lot of Folding lost too :(


I don't know from experience what the GPU(2) WUs do, but if they work in the same way, the loss grows in proportion, since they work through a WU that much more quickly :e(



Re: Lost Time

Post by leexgx »

As long as the last work unit completed properly, I do not see why this is a problem. Just use FINISHED_UNIT / CoreStatus = 64 (100) as a trigger to start the download, as the client already does when it fails to send a work unit or after it has sent one. It can't be that hard to do.

If any other error happens, just let it go the normal way: send the errored work unit back and then download, to prevent a bad PC/GPU from wasting time.
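The idea above could be sketched roughly like this. It is not the FAH client's code: `handle_core_exit`, `upload_results` and `fetch_next_unit` are hypothetical names, and the only value taken from the logs is the `CoreStatus = 64 (100)` exit code, read here as hex 64.

```python
import threading

# The client log prints "CoreStatus = 64 (100)"; hex 64 is decimal 100.
FINISHED_UNIT = 0x64

def handle_core_exit(core_status, upload_results, fetch_next_unit):
    """Hypothetical scheduler step run when the core exits.

    Clean finish: upload in the background and fetch the next unit at once.
    Anything else: upload first, then fetch, as the client does today.
    Returns (next_unit, uploader_thread_or_None).
    """
    if core_status == FINISHED_UNIT:
        uploader = threading.Thread(target=upload_results)
        uploader.start()
        return fetch_next_unit(), uploader
    upload_results()  # errored unit: report it before asking for more work
    return fetch_next_unit(), None

if __name__ == "__main__":
    log = []
    unit, uploader = handle_core_exit(
        FINISHED_UNIT,
        lambda: log.append("uploaded"),   # stand-in for the real upload
        lambda: "next WU",                # stand-in for the real download
    )
    uploader.join()
    print(unit, log)
```

The error path stays sequential on purpose, so a machine that keeps producing bad units reports each one before it is handed more work.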

It is a lot of lost time if you work out the math like above.

Re: Lost Time

Post by 7im »

You need to look at the bigger picture, noorman. Stanford has limited resources, and can't improve very small issues when there are much bigger improvements to be made. And as I said before, if Stanford didn't make this improvement back in the days of dialup speeds, they probably aren't going to make a change now that broadband is here. And with broadband getting faster and faster, the small waste gets smaller and smaller each day. Why waste time trying to improve a very small item when that item is slowly fixing itself with faster upload speeds?

And why waste time improving a 2 year waste when you can build a GPU2 client and gain a 10 year advantage? I'd gladly lose 2 years of folding to gain 10. Remember the bigger picture.
jrweiss
Posts: 704
Joined: Tue Dec 04, 2007 6:56 am
Hardware configuration: Ryzen 7 5700G, 22.40.46 VGA driver; 32GB G-Skill Trident DDR4-3200; Samsung 860EVO 1TB Boot SSD; VelociRaptor 1TB; MSI GTX 1050ti, 551.23 studio driver; BeQuiet FM 550 PSU; Lian Li PC-9F; Win11Pro-64, F@H 8.3.5.

[Suspended] Ryzen 7 3700X, MSI X570MPG, 32GB G-Skill Trident Z DDR4-3600; Corsair MP600 M.2 PCIe Gen4 Boot, Samsung 840EVO-250 SSDs; VelociRaptor 1TB, Raptor 150; MSI GTX 1050ti, 526.98 driver; Kingwin Stryker 500 PSU; Lian Li PC-K7B. Win10Pro-64, F@H 8.3.5.
Location: @Home

Re: Lost Time

Post by jrweiss »

Regarding "the bigger picture" and "very small issues," I have to respectfully disagree.

If we all prioritized issues, and ONLY tackled the "much bigger" ones, we would NEVER fix any of the small ones! In fact, as soon as a "bigger" big issue came up, even the former "bigger" one would fall by the wayside.

When I was Chief Test Director of a Flight Test organization, I often had to find enough minutes in the day to squeeze in all the "high priority" issues. However, I found out early on that taking a relatively few minutes every day to resolve the "small" issues only delayed the "bigger" ones by those [insignificant] few minutes, and made MANY people MUCH happier because all their individual problems -- each of which was "small" relative to the department but "big" to the individual -- were resolved in a timely manner.

Personally, I suspect the "project" of tying the download of a new WU to the "FINISHED UNIT" trigger would be a matter of a few minutes each over a few days (for coding, V&V, and testing) and would delay all of the "bigger" issues by an unnoticeable amount of time. The code would then be ready for all the clients for their individual next releases.

While such an issue may seem insignificant to "the project," I submit that a small effort to help keep the 100s of 1000s of volunteers happy will benefit the project overall in the long run.
Ryzen 7 5700G, 22.40.46 VGA driver; MSI GTX 1050ti, 551.23 studio driver
Ryzen 7 3700X; MSI GTX 1050ti, 551.23 studio driver [Suspended]

Re: Lost Time

Post by 7im »

jrweiss wrote:...
While such an issue may seem insignificant to "the project," I submit that a small effort to help keep the 100s of 1000s of volunteers happy will benefit the project overall in the long run.
I very much doubt more than a handful of people even notice the delay. Certainly not 100s of 1000s, and they are certainly not unhappy. I submit that Pande Group is already making a greater effort to keep the volunteers happy. Can I assume you have seen additional Pande Group members posting on a daily basis to answer questions?

And there is a big difference between fixing a small bug to correct a small issue where the feature doesn't work at all, as compared to this example where it is a process improvement on something that is already working well.

Since you are a tester, I know you are familiar with diminishing returns. And you are also familiar with the risk of making an insignificant change that has such a long and wide-reaching effect. Do it right, and save minutes. Do it wrong, and potentially cripple 300,000+ clients and lose a lot more than a few minutes. Hmm... decisions, decisions... ;)

And again, if it were so simple, why haven't they done it already?

Re: Lost Time

Post by noorman »

.

I don't agree either;

when the GPU client (you speak about) is running 30 min WUs, then 4 mins is a VERY significant amount of time, about 13% if I'm calculating correctly ...

Even on an SMP WU, which can be finished in far less than 1 day (24 hrs), it is still significant.

I agree that it was insignificant in the days when I - and many others - were Folding single core WUs which had expiry periods of months !

But now, with every new & faster project, those 4 mins become a larger and larger part/percentage of the total processing time of a WU.
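To see how the share changes with WU length, here is a quick sketch. It assumes a fixed 4 minute pause per WU; the 30 minute and 24 hour figures come from this post, the 6 hour one is only an illustrative value, and `pause_share` is a made-up helper name.

```python
def pause_share(pause_min, wu_min):
    """Pause expressed as a percentage of the WU's processing time."""
    return 100.0 * pause_min / wu_min

for label, wu_min in [("30 min WU", 30), ("6 hour WU", 360), ("24 hour WU", 1440)]:
    print(f"{label}: {pause_share(4, wu_min):.1f}%")
# 30 min WU: 13.3%
# 6 hour WU: 1.1%
# 24 hour WU: 0.3%
```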

So, the comparison of 'old' dial-up with broadband doesn't wash; even now, there are many people - mostly in the US I might add - who have 'only' dial-up as their connection to Internet/WWW.
It's primarily the city folk that have the possibility of getting broadband !


I rest my case ...


bollix47
Posts: 2957
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Lost Time

Post by bollix47 »

when the GPU client (you speak about) is running 30 min WUs, then 4 mins is a VERY significant amount of time, about 13% if I'm calculating correctly ...
FYI:

There is no 4 minute delay on the GPU client.

After the 100% message there is a 1 minute pause, which is the same on the SMP client.
After the "Number of Units Completed" message there is a 4-5 second pause; on the SMP client that pause is 4 minutes.

Re: Lost Time

Post by noorman »

bollix47 wrote:
when the GPU client (you speak about) is running 30 min WUs, then 4 mins is a VERY significant amount of time, about 13% if I'm calculating correctly ...
FYI:

There is no 4 minute delay on the GPU client.

After the 100% message there is a 1 minute pause, which is the same on the SMP client.
After the "Number of Units Completed" message there is a 4-5 second pause; on the SMP client that pause is 4 minutes.
.


Is that 4-5 sec pause the one between that message and the start of the download of a fresh WU ?



Re: Lost Time

Post by bollix47 »


[07:01:26] Completed 100%
[07:02:26] 
[07:02:26] Finished Work Unit:
[07:02:26] - Reading up to 3550680 from "work/wudata_04.trr": Read 3550680
[07:02:26] trr file hash check passed.
[07:02:26] - Reading up to 1143048 from "work/wudata_04.xtc": Read 1143048
[07:02:26] xtc file hash check passed.
[07:02:26] edr file hash check passed.
[07:02:26] logfile size: 129855
[07:02:26] Leaving Run
[07:02:30] - Writing 4824655 bytes of core data to disk...
[07:02:31] Done: 4824143 -> 4146200 (compressed to 85.9 percent)
[07:02:31]   ... Done.
[07:02:31] - Shutting down core
[07:02:31] 
[07:02:31] Folding@home Core Shutdown: FINISHED_UNIT
[07:02:34] CoreStatus = 64 (100)
[07:02:34] Sending work to server


[07:02:34] + Attempting to send results
[07:03:27] + Results successfully sent
[07:03:27] Thank you for your contribution to Folding@Home.
[07:03:27] + Number of Units Completed: 44

[07:03:32] - Preparing to get new work unit...
[07:03:32] + Attempting to get work packet
[07:03:32] - Connecting to assignment server
[07:03:32] - Successful: assigned to (171.64.65.20).
[07:03:32] + News From Folding@Home: GPU folding beta
[07:03:33] Loaded queue successfully.
[07:03:33] + Closed connections
[07:03:33] 
[07:03:33] + Processing work unit
[07:03:33] Core required: FahCore_11.exe
[07:03:33] Core found.
[07:03:33] Working on queue slot 05 [June 30 07:03:33]
[07:03:33] + Working ...
[07:03:33] 
[07:03:33] *------------------------------*
[07:03:33] Folding@Home GPU Core - Beta
[07:03:33] Version 1.06 (Mon Jun 23 10:53:13 PDT 2008)
[07:03:33] 
[07:03:33] Compiler  : 
[07:03:33] Build host: amoeba 
[07:03:33] Preparing to commence simulation
[07:03:33] - Looking at optimizations...
[07:03:33] - Created dyn
[07:03:33] - Files status OK
[07:03:33] - Expanded 43546 -> 246249 (decompressed 565.4 percent)
[07:03:33] Called DecompressByteArray: compressed_data_size=43546 data_size=246249, decompressed_data_size=246249 diff=0
[07:03:33] - Digital signature verified
[07:03:33] 
[07:03:33] Project: 5004 (Run 0, Clone 13, Gen 23)
[07:03:33] 
[07:03:33] Assembly optimizations on if available.
[07:03:33] Entering M.D.
[07:03:40] Working on p5002_supervillin_e1
[07:03:40] Client config found, loading data.
[07:03:41] Starting GUI Server
[07:05:13] Completed 1%