Page 1 of 2

Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Mon Feb 11, 2013 3:35 pm
by ptschnack
Is this project 7808 WU configured correctly or did an extra 0 get added to the total number of steps which according to the log is 1500000? Or has my SMP performance suddenly gone south?

FAH 7.2.9
Project 7808 (0, 406, 9)
FahCore 0xa4
ETA 21 days !!
TPF 5 hours 13 minutes !!
Estimated PPD 94.11
Estimated Credit 2045.84

After running 7 days, the log entries for the start of this 7808 WU have been scrolled away by GPU WU log entries, so the startup log entries are unavailable.

These have me wondering if this WU was set up correctly:

1) The TPF at 5 hours 13 mins means a lot of work is thrown away if the WU needs to be paused.
2) ETA of 3 weeks is far beyond the longest ETA I've seen for any WU. The longest ETA for a WU I recall has been ~12(?) hours. Also, I could swear I've completed other 7808 WU's, but none that took days, much less weeks.
3) The very low PPD and credit awarded for the WU for 3 weeks of CPU is very different from all other WU's I've seen; based on total time to execute, these are below the usual by a factor of ~100.
4) I've been folding on this machine and configuration for 3 months w/o problems and with steady throughput, so the sudden occurrence of a very, very long WU is a surprise.

All that said, the WU seems to be progressing w/o problems, so I'll let it run to completion. But if it was misconfigured or there's a bug killing performance, you might want to check if other 7808's have the same problem.

My machine:

Win7 Pro
Intel i7-3770 quad core at 3.4 GHz (hyper-threading results in FahCore 0xa4 using smp:8 (8 CPUs))
32GB RAM
GTX 660 TI still running FahCore 0x15 GPU work units at its typical throughput of ~24000 PPD.

======================================== EDITED ================================
Okay, found log.txt.

Here's the start of 7808 (Run 0, Clone 406, Gen 9) -- preceded by end of 7808 (7, 253, 127) and intermixed with 1) the upload of 7808 (7, 253, 127) and 2) progress logging for a GPU WU. I've removed ~60 lines of GPU WU progress logging. I've removed NO lines for the 7808 unit.

Code: Select all

04:15:37:WU00:FS01:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
04:15:37:WU02:FS00:0x15:Completed  14800000 out of 40000000 steps (37%).
04:15:38:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
04:15:38:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:7808 run:7 clone:253 gen:127 core:0xa4 unit:0x000000be0a3b1e874e30ff1d1b45c50b
04:15:38:WU00:FS01:Uploading 4.05MiB to 171.64.65.99
04:15:38:WU00:FS01:Connecting to 171.64.65.99:8080
04:15:38:WU01:FS01:Starting
04:15:38:WU01:FS01:Running FahCore: "F:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" f:/ProgramData/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 01 -suffix 01 -version 702 -lifeline 9808 -checkpoint 15 -np 8
04:15:38:WU01:FS01:Started FahCore on PID 4976
04:15:38:Started thread 153 on PID 9808
04:15:38:WU01:FS01:Core PID:4508
04:15:38:WU01:FS01:FahCore 0xa4 started
04:15:39:WU01:FS01:0xa4:
04:15:39:WU01:FS01:0xa4:*------------------------------*
04:15:39:WU01:FS01:0xa4:Folding@Home Gromacs GB Core
04:15:39:WU01:FS01:0xa4:Version 2.27 (Dec. 15, 2010)
04:15:39:WU01:FS01:0xa4:
04:15:39:WU01:FS01:0xa4:Preparing to commence simulation
04:15:39:WU01:FS01:0xa4:- Looking at optimizations...
04:15:39:WU01:FS01:0xa4:- Created dyn
04:15:39:WU01:FS01:0xa4:- Files status OK
04:15:39:WU01:FS01:0xa4:- Expanded 702264 -> 5365960 (decompressed 764.0 percent)
04:15:39:WU01:FS01:0xa4:Called DecompressByteArray: compressed_data_size=702264 data_size=5365960, decompressed_data_size=5365960 diff=0
04:15:39:WU01:FS01:0xa4:- Digital signature verified
04:15:39:WU01:FS01:0xa4:
04:15:39:WU01:FS01:0xa4:Project: 7808 (Run 0, Clone 406, Gen 9)
04:15:39:WU01:FS01:0xa4:
04:15:39:WU01:FS01:0xa4:Assembly optimizations on if available.
04:15:39:WU01:FS01:0xa4:Entering M.D.
04:15:44:WU00:FS01:Upload 30.88%
04:15:45:WU01:FS01:0xa4:Mapping NT from 8 to 8 
04:15:49:WU01:FS01:0xa4:Completed 0 out of 1500000 steps  (0%)
04:15:50:WU00:FS01:Upload 63.30%
04:15:56:WU00:FS01:Upload 91.10%
04:15:58:WU00:FS01:Upload complete
04:15:58:WU00:FS01:Server responded WORK_ACK (400)
04:15:58:WU00:FS01:Final credit estimate, 12708.00 points
04:15:58:WU00:FS01:Cleaning up
04:21:23:WU02:FS00:0x15:Completed  15200000 out of 40000000 steps (38%).
04:27:13:WU02:FS00:0x15:Completed  15600000 out of 40000000 steps (39%).
******************************** Date: 04/02/13 ********************************
04:33:00:WU02:FS00:0x15:Completed  16000000 out of 40000000 steps (40%).
04:38:47:WU02:FS00:0x15:Completed  16400000 out of 40000000 steps (41%).

+++++ ~60 lines GPU logging removed +++++

09:19:08:WU01:FS01:0xa4:Completed 15000 out of 1500000 steps  (1%)
Here's sysinfo and config:

Code: Select all

02:50:55:************************* Folding@home Client *************************
02:50:55:      Website: http://folding.stanford.edu/
02:50:55:    Copyright: (c) 2009-2012 Stanford University
02:50:55:       Author: Joseph Coffland <[email protected]>
02:50:55:         Args: --lifeline 10796 --command-port=36330
02:50:55:       Config: f:/ProgramData/FAHClient/config.xml
02:50:55:******************************** Build ********************************
02:50:55:      Version: 7.2.9
02:50:55:         Date: Oct 3 2012
02:50:55:         Time: 18:05:48
02:50:55:      SVN Rev: 3578
02:50:55:       Branch: fah/trunk/client
02:50:55:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
02:50:55:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
02:50:55:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
02:50:55:     Platform: win32 XP
02:50:55:         Bits: 32
02:50:55:         Mode: Release
02:50:55:******************************* System ********************************
02:50:55:          CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
02:50:55:       CPU ID: GenuineIntel Family 6 Model 58 Stepping 9
02:50:55:         CPUs: 8
02:50:55:       Memory: 31.98GiB
02:50:55:  Free Memory: 28.49GiB
02:50:55:      Threads: WINDOWS_THREADS
02:50:55:   On Battery: false
02:50:55:   UTC offset: -8
02:50:55:          PID: 9808
02:50:55:          CWD: f:/ProgramData/FAHClient
02:50:55:           OS: Windows 7 Professional
02:50:55:      OS Arch: AMD64
02:50:55:         GPUs: 1
02:50:55:        GPU 0: NVIDIA:3 GK104 [GeForce GTX 660 Ti]
02:50:55:         CUDA: 3.0
02:50:55:  CUDA Driver: 5000
02:50:55:Win32 Service: false
02:50:55:***********************************************************************
02:50:55:<config>
02:50:55:  <service-description v='Folding@home Client'/>
02:50:55:  <service-restart v='true'/>
02:50:55:  <service-restart-delay v='5000'/>
02:50:55:
02:50:55:  <!-- Client Control -->
02:50:55:  <cycle-rate v='4'/>
02:50:55:  <cycles v='-1'/>
02:50:55:  <data-directory v='.'/>
02:50:55:  <disable-project-lookup v='false'/>
02:50:55:  <exec-directory v='F:\Program Files (x86)\FAHClient'/>
02:50:55:  <exit-when-done v='false'/>
02:50:55:  <threads v='4'/>
02:50:55:
02:50:55:  <!-- Configuration -->
02:50:55:  <config-rotate v='true'/>
02:50:55:  <config-rotate-dir v='configs'/>
02:50:55:  <config-rotate-max v='16'/>
02:50:55:
02:50:55:  <!-- Debugging -->
02:50:55:  <assignment-servers>
02:50:55:    assign3.stanford.edu:8080 assign4.stanford.edu:80
02:50:55:  </assignment-servers>
02:50:55:  <capture-directory v='capture'/>
02:50:55:  <capture-sockets v='false'/>
02:50:55:  <debug-sockets v='false'/>
02:50:55:  <exception-locations v='true'/>
02:50:55:  <gpu-assignment-servers>
02:50:55:    assign-GPU.stanford.edu:80 assign-GPU.stanford.edu:8080
02:50:55:  </gpu-assignment-servers>
02:50:55:  <stack-traces v='false'/>
02:50:55:
02:50:55:  <!-- Error Handling -->
02:50:55:  <max-slot-errors v='5'/>
02:50:55:  <max-unit-errors v='5'/>
02:50:55:
02:50:55:  <!-- FahCore Control -->
02:50:55:  <checkpoint v='15'/>
02:50:55:  <core-dir v='cores'/>
02:50:55:  <core-priority v='idle'/>
02:50:55:  <cpu-affinity v='false'/>
02:50:55:  <cpu-usage v='100'/>
02:50:55:  <no-assembly v='false'/>
02:50:55:
02:50:55:  <!-- Folding Slot Configuration -->
02:50:55:  <cause-pref v='ANY'/>
02:50:55:  <client-subtype v='STDCLI'/>
02:50:55:  <client-type v='normal'/>
02:50:55:  <cpu-species v='X86_PENTIUM_II'/>
02:50:55:  <cpu-type v='AMD64'/>
02:50:55:  <cpus v='-1'/>
02:50:55:  <cuda-index v='0'/>
02:50:55:  <gpu v='true'/>
02:50:55:  <gpu-usage v='100'/>
02:50:55:  <max-packet-size v='normal'/>
02:50:55:  <opencl-index v='0'/>
02:50:55:  <os-species v='UNKNOWN'/>
02:50:55:  <os-type v='WIN32'/>
02:50:55:  <project-key v='0'/>
02:50:55:  <smp v='true'/>
02:50:55:
02:50:55:  <!-- Logging -->
02:50:55:  <log v='log.txt'/>
02:50:55:  <log-color v='false'/>
02:50:55:  <log-crlf v='true'/>
02:50:55:  <log-date v='false'/>
02:50:55:  <log-date-periodically v='21600'/>
02:50:55:  <log-debug v='true'/>
02:50:55:  <log-domain v='false'/>
02:50:55:  <log-header v='true'/>
02:50:55:  <log-level v='true'/>
02:50:55:  <log-no-info-header v='true'/>
02:50:55:  <log-redirect v='false'/>
02:50:55:  <log-rotate v='true'/>
02:50:55:  <log-rotate-dir v='logs'/>
02:50:55:  <log-rotate-max v='16'/>
02:50:55:  <log-short-level v='false'/>
02:50:55:  <log-simple-domains v='true'/>
02:50:55:  <log-thread-id v='false'/>
02:50:55:  <log-thread-prefix v='true'/>
02:50:55:  <log-time v='true'/>
02:50:55:  <log-to-screen v='true'/>
02:50:55:  <log-truncate v='false'/>
02:50:55:  <verbosity v='5'/>
02:50:55:
02:50:55:  <!-- Network -->
02:50:55:  <proxy v=':8080'/>
02:50:55:  <proxy-enable v='false'/>
02:50:55:  <proxy-pass v=''/>
02:50:55:  <proxy-user v=''/>
02:50:55:
02:50:55:  <!-- Process Control -->
02:50:55:  <child v='false'/>
02:50:55:  <daemon v='false'/>
02:50:55:  <pid v='false'/>
02:50:55:  <pid-file v='Folding@home Client.pid'/>
02:50:55:  <respawn v='false'/>
02:50:55:  <service v='false'/>
02:50:55:
02:50:55:  <!-- Remote Command Server -->
02:50:55:  <command-address v='0.0.0.0'/>
02:50:55:  <command-allow v='127.0.0.1 10.0.0.0/24'/>
02:50:55:  <command-allow-no-pass v='127.0.0.1 10.0.0.0/24'/>
02:50:55:  <command-deny v='0.0.0.0/0'/>
02:50:55:  <command-deny-no-pass v='0.0.0.0/0'/>
02:50:55:  <command-port v='36330'/>
02:50:55:
02:50:55:  <!-- Slot Control -->
02:50:55:  <max-shutdown-wait v='60'/>
02:50:55:  <pause-on-battery v='false'/>
02:50:55:  <pause-on-start v='false'/>
02:50:55:
02:50:55:  <!-- User Information -->
02:50:55:  <machine-id v='0'/>
02:50:55:  <passkey v=''/>
02:50:55:  <team v='0'/>
02:50:55:  <user v='ptschnack'/>
02:50:55:
02:50:55:  <!-- Work Unit Control -->
02:50:55:  <dump-after-deadline v='true'/>
02:50:55:  <max-queue v='16'/>
02:50:55:  <max-units v='0'/>
02:50:55:  <next-unit-percentage v='99'/>
02:50:55:
02:50:55:  <!-- Folding Slots -->
02:50:55:  <slot id='0' type='GPU'/>
02:50:55:  <slot id='1' type='SMP'/>
02:50:55:</config>

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Mon Feb 11, 2013 4:34 pm
by bollix47
Hi ptschnack and welcome to the folding support forum.

The correct number of steps for P7808 is indeed 1.5 million.

Please post your System Info and Config sections of your log as explained here and 40-50 lines from the start of the actual work unit in progress.

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Tue Feb 12, 2013 12:06 am
by ptschnack
I've added the requested information to the original description.

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Tue Feb 12, 2013 12:56 am
by bruce
While most of what you say makes sense, I believe that you misunderstand checkpointing.
ptschnack wrote:1) The TPF at 5 hours 13 mins means a lot of work is thrown away if the WU needs to be paused.
When you pause a WU, it will restart from the most recent checkpoint. The default checkpoint setting for FahCore_a4 is every 15 minutes. That means when you pause and later restart, you may have to recalculate between 0 and 15 minutes worth of work. The total processing time or the TPF of a specific project has nothing to do with how much work must be repeated.

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Tue Feb 12, 2013 1:12 am
by bollix47
Open Task manager and check the CPU usage column for FAHCore_a4.exe and FAHCore_15.exe. There should be only one instance of each. What percentages do you see? Check if anything else is using your CPU.

Have you tried pausing both slots and doing a re-boot?

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Tue Feb 12, 2013 2:31 am
by ptschnack
'Before' and 'after' refer to before and after reboot. All parameters were the same before and after reboot except where noted:

Taskman Processes:

One each of:

FAHClient.exe CPU 0% (156K before, 14K after -- memory private working set)
FAHControl.exe CPU 0% (32K before, 14K after)
FahCore_15.exe CPU 0% (34K)
FahCore_14.exe CPU 99% (129K)

Two of:

FAHCoreWrapper.exe CPU 0% (1.1K each)

Taskman Performance:
100% on all eight 'cores'
Total memory consumption is 5.61GB before and 3GB after.

MSI Afterburner Utility:

GPU utilization 99%

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Tue Feb 12, 2013 2:39 am
by ptschnack
BTW, this is my dev box which is running Visual Studio 2010 most of the time as an editor, but sometimes I recompile a semi-large project. Other work I do on other machines to which I Remote Desktop from the dev box.

My usage patterns for the dev box have not changed in the last few months, so I doubt very much that they are relevant to this problem.

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Tue Feb 12, 2013 3:56 am
by mmonnin
ptschnack wrote:'Before' and 'after' refer to before and after reboot. All parameters were the same before and after reboot except where noted:

Taskman Processes:

One each of:

FAHClient.exe CPU 0% (156K before, 14K after -- memory private working set)
FAHControl.exe CPU 0% (32K before, 14K after)
FahCore_15.exe CPU 0% (34K)
FahCore_14.exe CPU 99% (129K)

Two of:

FAHCoreWrapper.exe CPU 0% (1.1K each)

Taskman Performance:
100% on all eight 'cores'
Total memory consumption is 5.61GB before and 3GB after.

MSI Afterburner Utility:

GPU utilization 99%
Did you mean fahcore_a4.exe by chance? If so then that would look ok as the SMP is taking all CPU cycles and I'm guessing fahcore_15 takes 1% once in awhile.

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Tue Feb 12, 2013 7:52 am
by ptschnack
There's nothing so simple that I can't insert at least one typo. Yes, that's supposed to be FahCore_a4.exe.

I don't know if FahCore_15 ever takes enough CPU to register even 1% of CPU. What I do know is it gets enough CPU since GPU WU's run just fine.

For jollies, I just checked and most of the time, the a4 core has 99% of the CPU and the other 1% is shared by all other processes all of which report 0%. Every 10-15 seconds the "NT Kernel & System" uses 1% of CPU and the a4 core drops to 98% while all other processes remain at 0%.

BTW, the CPU and memory usage pattern I see now is normal for when the machine is folding. So I'm pretty convinced that the folding project itself is wonky -- unless this case is anomalous and really needs 3 weeks of SMP:8 CPU.

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Tue Feb 12, 2013 9:12 am
by ptschnack
I don't know why I didn't try this earlier, but when I start the viewer for the 7808 project, nothing but the background image of the earth at night is displayed. It's a nice background and all, but I was expecting to see some molecular action. Instead got nadda.

When I view the GPU project that's running, it displays a jiggling molecule as expected.

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Tue Feb 12, 2013 9:18 am
by mmonnin
You'll probably be able to see the memory usage drop after rebooting a Windows machine w/o running FAH at all. Its just...windows after all.

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Tue Feb 12, 2013 10:50 am
by bollix47
@ptschnack

What's your current TPF for the 7808 slot? Did anything change after your reboot?

As far as the viewer is concerned it does take a while to load. How long did you wait? I just opened the same project in the Viewer and it took a while, maybe 20-30 seconds, before the molecules showed up. While it was sitting there with the world view there was a line (bottom left) that said Loading.

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Tue Feb 12, 2013 11:45 am
by ptschnack
No significant change on reboot. TPF is now 5 hours 4 mins instead of 5 hours 13 mins, the 9 minute gain probably just reflects that all other apps are closed and folding is the only app running.

I waited a a minute or two for the viewer to show something. No joy. Viewing the GPU work unit, it takes at most ~15s for the molecule to be displayed.

If you've started to run this project, what ETA and TPF did you get? Anything similar to my numbers (taking into account hardware differences)? If not and your ETA was in the hours range and TPF less than an hour, then the project is hosed on my machine and I should just kill it.

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Tue Feb 12, 2013 12:35 pm
by bollix47
I've run many work units from this project but don't recall seeing any with such large TPFs. The current one:

Project: 7808 (Run 6, Clone 64, Gen 153) is showing TPFs of under 7 minutes.

I've asked the PG contact to have a look at this thread but they are very busy so I have no idea when or even if they would have time to look into why this work unit is taking so long. If you want to give him some time to respond he may have some suggestions but I understand if you don't want to wait any longer than you already have.

We don't usually recommend dumping because until it can be determined that the work unit is bad and is marked as such it will just be sent to another folder and become their problem.

If you do decide to dump the work unit please use the following procedure:

Pause and quit folding. ( Since I've never done this I'm not sure how necessary this is but it shouldn't hurt )
Open a cmd window and type:

FAHClient --dump 01

Using this method should inform the server that the work unit has been dumped and avoid the long timeout.

Re: Project 7808 (0, 406, 9) - ETA 21 DAYS!, TPF 5hrs

Posted: Tue Feb 12, 2013 12:40 pm
by ptschnack
I'll continue with the WU since this might be doing useful work. I just have doubts that it is.