Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Tue May 27, 2008 1:40 am
by anko1
Hi, all.
I was running 2170 (Run 46, Clone 234, Gen 2) on my graphical client (Windows XP) and it hung at "new time frame estimate working" for a really long time, so I quit the application using the icon. Well, the whole log disappeared, and didn't go into FAHlog Prev (which is why some of my info is a little vague). Anyway, when I restarted the client, hoping that it would send, it kept trying to do the same work unit, but using standard loops because the prior termination was improper [but it wasn't!!! I swear I used the icon!!! ;-)]. I've just stopped the client. Since I have a WU results file for that project, I'm planning on trying qfix. Any other suggestions, words of wisdom? If it helps, here's the entry from the queue:
CURRENT QUEUE:
00 EMPTY
01 EMPTY
02 EMPTY
03 EMPTY
04 EMPTY
05 *ACTIVE "Folding@Home" (82) 171.65.103.160:8080 May 17 04:27 | July 22 04:27
06 EMPTY
07 EMPTY
08 EMPTY
09 EMPTY
Thanks very much,
Angela
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Fri May 30, 2008 1:05 am
by 7im
You could add the -forceasm switch to the shortcut that launches the client to avoid the standard loops, even when the client behaves badly. Not sure what caused the original hang. You probably need to let it run, see if it happens again, and collect more info about what the computer and the client are doing when it hangs.
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Fri May 30, 2008 4:17 am
by anko1
So I should go ahead and rerun it? Should I use qfix to send the original work?
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Fri May 30, 2008 4:28 am
by anandhanju
If you're game for a little bit of experimentation, you could
a) Take a backup of your FAH directory.
b) Disable your internet connection temporarily on that system and start the client.
c) If it attempts to send the result, good. You can now enable the internet connection and send the result on its way.
d) If it starts working from a checkpoint, you can let it chug along while you re-enable the connection.
e) If it starts from 0%, you have nothing to lose. Try qfixing it. Restart client and enable connection. Whatever happens now is the only alternative.
Or you could wait until someone less sleepy than me thinks of a simple way to get around this. ;-)
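Step a) above can be scripted. The following is a minimal sketch, assuming the client has been stopped first; the function name `backup_fah_dir` and the backup location are hypothetical and not part of the client.

```python
import shutil
import time
from pathlib import Path


def backup_fah_dir(fah_dir, backup_root):
    """Copy the whole client directory into a timestamped folder; return the new path."""
    src = Path(fah_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"no such directory: {src}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-backup-{stamp}"
    # copytree refuses to write into an existing destination, so an earlier
    # backup cannot be overwritten by accident.
    shutil.copytree(src, dest)
    return dest


# Example call (paths are assumptions; the launch directory in the log
# later in this thread is C:\Folding@Home):
#   backup_fah_dir(r"C:\Folding@Home", r"C:\FAH-backups")
```

Copying the whole directory rather than just the work folder keeps the queue file and the work files in step with each other, which matters if you later want to restore and retry.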
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Fri May 30, 2008 4:47 am
by anko1
The client doesn't recognize that the unit is there. Any other suggestions? Thanks for taking the time to answer.
--- Opening Log file [May 20 14:01:38]
# Windows Graphical Edition ###################################################
###############################################################################
Folding@Home Client Version 5.03
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: C:\Folding@Home
Arguments: -local -verbosity 9
[14:01:38] - Ask before connecting: No
[14:01:38] - User name: anko1 (Team 47815)
[14:01:38] - User ID: 14991F842ED3B1A8
[14:01:38] - Machine ID: 3
[14:01:38]
[14:01:38] Loaded queue successfully.
[14:01:38] Initialization complete
[14:01:38] + Benchmarking ...
[14:01:41] The benchmark result is 4896
[14:01:41]
[14:01:41] + Processing work unit
[14:01:41] - Autosending finished units...
[14:01:41] Trying to send all finished work units
[14:01:41] + No unsent completed units remaining.
[14:01:41] - Autosend completed
[14:01:41] Core required: FahCore_82.exe
[14:01:41] Core found.
[14:01:41] Working on Unit 05 [May 20 14:01:41]
[14:01:41] + Working ...
[14:01:41] - Calling 'FahCore_82.exe -dir work/ -suffix 05 -checkpoint 15 -verbose -lifeline 404 -version 503'
[14:01:41]
[14:01:41] *------------------------------*
[14:01:41] Folding@Home PMD Core
[14:01:41] Version 1.03 (September 7, 2005)
[14:01:41]
[14:01:41] Preparing to commence simulation
[14:01:41] - Ensuring status. Please wait.
[14:01:58] - Looking at optimizations...
[14:01:58] - Working with standard loops on this execution.
[14:01:58] - Previous termination of core was improper.
[14:01:58] - Files status OK
[14:01:59] - Expanded 92947 -> 599777 (decompressed 645.2 percent)
[14:01:59]
[14:01:59] Project: 2170 (Run 46, Clone 234, Gen 2)
[14:01:59]
[14:01:59] Entering M.D.
[14:02:06] Protein: p2170_lambda_obc_300K
[14:02:06]
[14:02:06] Completed 0 out of 500000 steps (0)
[14:05:22] Printing Queue Information
CURRENT QUEUE:
00 EMPTY
01 EMPTY
02 EMPTY
03 EMPTY
04 EMPTY
05 *ACTIVE "Folding@Home" (82) 171.65.103.160:8080 May 17 04:27 | July 22 04:27
06 EMPTY
07 EMPTY
08 EMPTY
09 EMPTY
[14:10:06] ***** Got a SIGTERM signal (2)
Folding@Home Client Shutdown.
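To tell quickly whether a restart resumed from a checkpoint or began again at step 0, the log can be scanned for a few marker lines. This is a rough sketch that only matches the message texts visible in the log above; the function name `summarize_log` is made up for illustration.

```python
import re

PROGRESS_RE = re.compile(r"Completed (\d+) out of (\d+) steps")


def summarize_log(text):
    """Report restart-related markers and the first progress line found in a client log."""
    info = {
        "improper_termination": False,
        "standard_loops": False,
        "first_progress": None,  # (steps done, total steps) at the first report
    }
    for line in text.splitlines():
        if "Previous termination of core was improper" in line:
            info["improper_termination"] = True
        elif "Working with standard loops" in line:
            info["standard_loops"] = True
        elif info["first_progress"] is None:
            m = PROGRESS_RE.search(line)
            if m:
                info["first_progress"] = (int(m.group(1)), int(m.group(2)))
    return info
```

A first progress report of 0 steps means the unit started over; anything higher means it picked up from a checkpoint.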
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Fri May 30, 2008 4:51 am
by anandhanju
I'm out of ideas. If I were you, I'd just let it run and see if it hangs again.
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Fri May 30, 2008 5:11 am
by anko1
Thanks for trying!
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Fri May 30, 2008 5:58 am
by codysluder
anko1 wrote: The client doesn't recognize that the unit is there. Any other suggestions?
CURRENT QUEUE:
00 EMPTY
01 EMPTY
02 EMPTY
03 EMPTY
04 EMPTY
05 *ACTIVE "Folding@Home" (82) 171.65.103.160:8080 May 17 04:27 | July 22 04:27
06 EMPTY
07 EMPTY
08 EMPTY
09 EMPTY
Look in the work folder for WURESULTS_04.dat or _03.dat. If either is there, stop the client and run qfix from the CLI.
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Fri May 30, 2008 1:10 pm
by anko1
I have results from 05. Would that be the one I'm looking for?
Thanks for the help.
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Fri May 30, 2008 9:33 pm
by bruce
The queue says that 05 is active. If that WU isn't finished yet, you can expect there to be quite a few files with *_05* in their name. Normally, when a WU finishes, the important data is collected into a file called wuresults_0*.dat (which will be uploaded) and most of the other files are deleted. Then the status of the WU is changed from active to ready-to-upload.
When a WU has an error, this process may be disrupted. It's difficult to know whether wuresults_*.dat was created before the disruption or the disruption prevented it from being created. Qfix will look for wuresults* files and in some cases is able to correct for all or part of the disruption.
Since WU 05 is active, I assumed that the error you reported earlier was WU 04 or perhaps 03.
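The check described above (which queue slots have a results file, and whether any belongs to the active slot) can be sketched as follows. This only inspects the work folder and repairs nothing; it is not a replacement for qfix. The function names are hypothetical, and the `.dat` extension is treated as optional because Windows may hide it in a directory listing.

```python
import re
from pathlib import Path

RESULTS_RE = re.compile(r"^wuresults_(\d{2})(\.dat)?$", re.IGNORECASE)


def find_results(work_dir):
    """Map queue slot number -> results file found in the work folder."""
    found = {}
    for entry in Path(work_dir).iterdir():
        m = RESULTS_RE.match(entry.name)
        if m and entry.is_file():
            found[int(m.group(1))] = entry
    return found


def classify(found, active_slot):
    """Split results files into (active slot, other slots)."""
    active = {s: p for s, p in found.items() if s == active_slot}
    others = {s: p for s, p in found.items() if s != active_slot}
    return active, others
```

A results file for a slot other than the active one suggests a finished unit that was never marked ready to upload. A results file for the active slot is ambiguous: it may predate the disruption or be left over from it.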
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Wed Jun 04, 2008 5:18 am
by anko1
Sorry, guess I wasn't clear. WU5 is the one that I stopped when it hung up at "new time frame...." Then when I restarted, hoping that it would finish up and send, it began the same WU at the start, so I have a results file for it (I presume generated on July 22). So what do you suggest? Qfix it and then run the unit to see if it hangs again? Or just let the unit proceed and replace whatever is currently in Results_05?
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Wed Jun 04, 2008 7:24 am
by bruce
I'm confused.
Post a list of the contents of "work" including the date-time together with the queueinfo output taken at the same time.
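One way to produce such a listing with date-times, as a minimal sketch (the function name `list_work_folder` is hypothetical; the path in the comment assumes the launch directory shown in the logs in this thread):

```python
import time
from pathlib import Path


def list_work_folder(work_dir):
    """Return (name, size in bytes, last-modified time) for each file, sorted by name."""
    rows = []
    for entry in sorted(Path(work_dir).iterdir(), key=lambda p: p.name.lower()):
        if entry.is_file():
            st = entry.stat()
            stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime))
            rows.append((entry.name, st.st_size, stamp))
    return rows


# Example use (path is an assumption):
#   for name, size, stamp in list_work_folder(r"C:\Folding@Home\work"):
#       print(f"{stamp}  {size:>10}  {name}")
```

The modification times are the useful part: they show whether a results file was written before or after the unit was restarted.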
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Thu Jun 05, 2008 5:41 am
by anko1
Thanks for trying to help me resolve this, Bruce. Here's the contents of my work folder:
core82.sta
current.xyz
current.xyz_temp
logfile_02
logfile_03
logfile_05
logfile_05-2170restart [a file I saved]
wudata_05 [dat file]
wudata_05 [INC file]
wudata_05 [MSInfo document]
wudata_05.dyn
wudata_05.eng
wudata_05.inp
wudata_05.out
wudata_05.top
wudata_05.trj
wudata_05CP.arc
wuinfo_05
wuresults_05
and here's the current queue:
--- Opening Log file [June 5 05:28:58]
# Windows Graphical Edition ###################################################
###############################################################################
Folding@Home Client Version 5.03
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: C:\Folding@Home
Arguments: -local -verbosity 9 -forceasm
Warning:
By using the -forceasm flag, you are overriding
safeguards in the program. If you did not intend to
do this, please restart the program without -forceasm.
If work units are not completing fully (and particularly
if your machine is overclocked), then please discontinue
use of the flag.
[05:28:58] - Ask before connecting: No
[05:28:58] - User name: anko1 (Team 47815)
[05:28:58] - User ID: 14991F842ED3B1A8
[05:28:58] - Machine ID: 3
[05:28:58]
[05:28:58] Loaded queue successfully.
[05:28:58] Initialization complete
[05:28:58] + Benchmarking ...
[05:29:02] The benchmark result is 4596
[05:29:02]
[05:29:02] + Processing work unit
[05:29:02] - Autosending finished units...
[05:29:02] Trying to send all finished work units
[05:29:02] + No unsent completed units remaining.
[05:29:02] - Autosend completed
[05:29:02] Core required: FahCore_82.exe
[05:29:02] Core found.
[05:29:02] Working on Unit 05 [June 5 05:29:02]
[05:29:02] + Working ...
[05:29:02] - Calling 'FahCore_82.exe -dir work/ -suffix 05 -checkpoint 15 -forceasm -verbose -lifeline 2232 -version 503'
[05:29:02]
[05:29:02] *------------------------------*
[05:29:02] Folding@Home PMD Core
[05:29:02] Version 1.03 (September 7, 2005)
[05:29:02]
[05:29:02] Preparing to commence simulation
[05:29:02] - Assembly optimizations manually forced on.
[05:29:02] - Not checking prior termination.
[05:29:03] - Expanded 92947 -> 599777 (decompressed 645.2 percent)
[05:29:03]
[05:29:03] Project: 2170 (Run 46, Clone 234, Gen 2)
[05:29:03]
[05:29:03] Assembly optimizations on if available.
[05:29:03] Entering M.D.
[05:30:09] Protein: p2170_lambda_obc_300K
[05:30:09]
[05:30:09] Completed 57000 out of 500000 steps (11)
[05:37:00] Printing Queue Information
CURRENT QUEUE:
00 EMPTY
01 EMPTY
02 EMPTY
03 EMPTY
04 EMPTY
05 *ACTIVE "Folding@Home" (82) 171.65.103.160:8080 May 17 04:27 | July 22 04:27
06 EMPTY
07 EMPTY
08 EMPTY
09 EMPTY
I see that the second date in the active line is a July date, and not an earlier date. I hadn't looked closely enough, and thought that was the date the unit first finished and hung.
Thanks again!!
Angela
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Thu Jun 05, 2008 11:17 am
by bruce
Everything looks normal to me.
anko1 wrote:
[05:29:02] + Processing work unit
[05:29:02] - Autosending finished units...
[05:29:02] Trying to send all finished work units
[05:29:02] + No unsent completed units remaining.
[05:29:02] - Autosend completed
[05:29:02] Core required: FahCore_82.exe
[05:29:02] Core found.
[05:29:02] Working on Unit 05 [June 5 05:29:02] <---- currently working on unit 05
[05:29:02] + Working ...
<snip>
[05:29:03] Project: 2170 (Run 46, Clone 234, Gen 2)
<snip>
[05:30:09] Completed 57000 out of 500000 steps (11) <--- and 11% has already been finished.
[05:37:00] Printing Queue Information
CURRENT QUEUE:
00 EMPTY
01 EMPTY
02 EMPTY
03 EMPTY
04 EMPTY
05 *ACTIVE "Folding@Home" (82) 171.65.103.160:8080 May 17 04:27 | July 22 04:27 <----Active WU downloaded 17May is due 22 July
06 EMPTY
07 EMPTY
08 EMPTY
09 EMPTY <---- . . . and nothing else is in queue except the one you're working on.
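The fields of a queue line can be pulled apart as in the sketch below. The pattern is inferred only from the lines shown in this thread, so other statuses or layouts may not match. Reading the number in parentheses as the core number is an assumption, based on the log requiring FahCore_82.exe; the two dates are the download time and the deadline, as annotated above. The name `parse_queue_line` is hypothetical.

```python
import re

QUEUE_RE = re.compile(
    r'^(?P<slot>\d{2})\s+(?P<current>\*)?(?P<status>[A-Z]+)'
    r'(?:\s+"(?P<name>[^"]*)"\s+\((?P<core>\d+)\)\s+(?P<server>\S+)'
    r'\s+(?P<begin>.+?)\s+\|\s+(?P<due>.+?))?\s*$'
)


def parse_queue_line(line):
    """Parse one line of the CURRENT QUEUE listing; return None if it does not match."""
    m = QUEUE_RE.match(line.strip())
    if m is None:
        return None
    entry = {
        "slot": int(m.group("slot")),
        "current": m.group("current") == "*",  # '*' marks the slot being worked on
        "status": m.group("status"),
    }
    if m.group("name") is not None:
        entry.update(
            name=m.group("name"),
            core=int(m.group("core")),      # assumed to be the core number
            server=m.group("server"),
            downloaded=m.group("begin"),    # first date: when the WU was downloaded
            due=m.group("due"),             # second date: the deadline
        )
    return entry
```

Labelling the second date as the deadline makes it clear that July 22 is not the date a results file was generated.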
Re: Project: 2170 (Run 46, Clone 234, Gen 2) hung
Posted: Thu Jun 05, 2008 2:37 pm
by anko1
Well, I was wondering if I had results already for this WU. It is the same one I was working on that hung up after 100% at "Estimating time frame..." Should I qfix it, and if I do, should I continue to rerun the same project to see if it hangs again? Or should I just continue with the second run of this WU?