Project: 3062 (Run 3, Clone 26, Gen 1)
Project: 3062 (Run 3, Clone 26, Gen 1)
It crashed 3 times at the same point (59%).
I made hourly backups of the fah folder, so if anyone is interested, I can provide the backup from just before the crash.
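For anyone who wants to keep similar hourly snapshots, here is a minimal sketch of one way to do it. It is not the script used above; the function names and the example paths are hypothetical, and it assumes the whole client folder can simply be copied while the client is running (a copy taken mid-write may not be consistent).

```python
import shutil
import time
from datetime import datetime
from pathlib import Path


def backup_once(source: Path, backup_root: Path) -> Path:
    """Copy the whole client folder into a new timestamped directory."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{source.name}-{stamp}"
    # copytree creates the target (and any missing parents) itself.
    shutil.copytree(source, target)
    return target


def backup_every(source: Path, backup_root: Path, interval_s: int = 3600) -> None:
    """Take a snapshot, sleep for the interval, and repeat until interrupted."""
    while True:
        print("backed up to", backup_once(source, backup_root))
        time.sleep(interval_s)

# Example call (hypothetical paths, adjust to your own install):
# backup_every(Path.home() / "fah", Path.home() / "fah-backups")
```

Restoring is then just a matter of stopping the client and copying the chosen snapshot back over the client folder.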
Team #35819 P2P-Community
- Posts: 266
- Joined: Sun Dec 02, 2007 6:08 pm
- Location: Central New York
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
_ikki_, have you tried restarting the WU from a point just before the usual crash-point, say at 56 - 57% or so?
Some folks have found that the WU will complete if done this way, though not all.
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
No, I didn't try restarting the WU from before the crash point, but it's worth testing, and thanks to my backups I can.
Team #35819 P2P-Community
- Posts: 450
- Joined: Tue Dec 04, 2007 8:36 pm
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
_ikki_ wrote:
It crashed 3 times at the same point (59%)

What was the error message? We might be able to help a bit more if you posted FAHlog.txt. There are lots of possible meanings for "crashed" and I don't want to guess.
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
The error looks like this:
[10:51:13] Completed 2900000 out of 5000000 steps (58 percent)
[11:00:31] Writing local files
[11:00:31] Completed 2950000 out of 5000000 steps (59 percent)
[11:08:25] Warning: long 1-4 interactions
[11:08:29] CoreStatus = 1 (1)
[11:08:29] Client-core communications error: ERROR 0x1
[11:08:29] Deleting current work unit & continuing...
[11:12:51] - Warning: Could not delete all work unit files (7): Core returned invalid code
[11:12:51] Trying to send all finished work units
[11:12:51] + No unsent completed units remaining.
[11:12:51] - Preparing to get new work unit...
[11:12:51] + Attempting to get work packet
[11:12:51] - Will indicate memory of 2014 MB
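To spot this kind of failure without reading the whole log by hand, something like the following sketch could be used. It only relies on the two line shapes visible in the excerpt above ("Completed ... steps (NN percent)" and "Client-core communications error"); the function name is hypothetical and other error messages are not handled.

```python
import re

# Matches progress lines such as:
# [11:00:31] Completed 2950000 out of 5000000 steps  (59 percent)
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps\s+\((\d+) percent\)")


def find_crashes(log_text):
    """Return (last_percent_seen, error_line) for each client-core error."""
    crashes = []
    last_percent = None
    for line in log_text.splitlines():
        match = PROGRESS.search(line)
        if match:
            last_percent = int(match.group(3))
        elif "Client-core communications error" in line:
            crashes.append((last_percent, line.strip()))
    return crashes

# Example call (FAHlog.txt in the current directory is an assumption):
# print(find_crashes(open("FAHlog.txt", errors="replace").read()))
```

If the same percentage shows up for every crash, that is a hint the failure is tied to a specific point in the WU rather than being random.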
Team #35819 P2P-Community
- Posts: 1037
- Joined: Sun Dec 02, 2007 3:47 pm
- Location: Colorado @ 10,000 feet
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
This WU hasn't been submitted by anyone yet. If it's been issued to other people they may be having the same issues as you.
Look in the work folder to see if there are any wuresults_0x.dat files. If there are I would try running qfix to get credit for what you did and get this WU entered into the WU database.
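A quick way to check for those files is sketched below. It assumes that "0x" stands for a two-digit slot number (for example wuresults_07.dat, matching the "-suffix 07" seen in client logs); that naming is an assumption, and the function name is hypothetical.

```python
from pathlib import Path


def find_result_files(work_dir):
    """List files in work_dir whose names look like wuresults_NN.dat."""
    # [0-9][0-9] is the assumed two-digit slot number.
    return sorted(Path(work_dir).glob("wuresults_[0-9][0-9].dat"))

# Example call (run from the client folder, so "work" is the work folder):
# for path in find_result_files("work"):
#     print(path.name, path.stat().st_size, "bytes")
```

An empty list means there is nothing for qfix to recover.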
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
There is no wuresults_0x.dat file in the work folder.
Team #35819 P2P-Community
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
For debugging, here is the data from one frame (about 10 minutes) before the crash:
http://rapidshare.com/files/80698307/fa ... 1.tgz.html
Happy new year :d
Last edited by _ikki_ on Wed Jan 02, 2008 2:00 pm, edited 1 time in total.
Team #35819 P2P-Community
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
Thanks! We'll take a look once the full team gets back from the Stanford holiday break.
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
okay, I hope this data will be useful.
Team #35819 P2P-Community
- Posts: 1037
- Joined: Sun Dec 02, 2007 3:47 pm
- Location: Colorado @ 10,000 feet
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
Someone else has successfully finished this WU.
Your WU (P3062 R3 C26 G1) was added to the stats database on 2007-12-31 22:59:05 for 1324 points of credit.
After changing the date on my computer, because the deadline had passed, I was able to run the WU past frame 59 as well. If someone else hadn't already submitted it for credit I would keep going to the end.
[17:53:44] Working on Unit 07 [December 23 17:53:44]
[17:53:44] + Working ...
[17:53:44] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 07 -checkpoint 15 -forceasm -verbose -lifeline 20217 -version 600'
[17:53:44]
[17:53:44] *------------------------------*
[17:53:44] Folding@Home Gromacs SMP Core
[17:53:44] Version 1.74 (November 27, 2006)
[17:53:44]
[17:53:44] Preparing to commence simulation
[17:53:44] - Ensuring status. Please wait.
[17:54:01] - Assembly optimizations manually forced on.
[17:54:01] - Not checking prior termination.
[17:54:01] - Expanded 607662 -> 3257309 (decompressed 536.0 percent)
[17:54:01]
[17:54:01] Project: 3062 (Run 3, Clone 26, Gen 1)
[17:54:01]
[17:54:01] Assembly optimizations on if available.
[17:54:01] Entering M.D.
[17:54:07] Calling FAH init
[17:54:07] mbda5_99sb
[17:54:07] Writing local files
[17:54:07] Completed 2900000 out of 5000000 steps (58 percent)
[17:54:07] Extra SSE boost OK.
[17:54:07]
[17:54:07] Completed 2900000 out of 5000000 steps (58 percent)
[17:54:07] Extra SSE boost OK.
[18:02:07] Writing local files
[18:02:08] Completed 2950000 out of 5000000 steps (59 percent)
[18:10:16] Writing local files
[18:10:16] Completed 3000000 out of 5000000 steps (60 percent)
[18:18:21] Writing local files
[18:18:21] Completed 3050000 out of 5000000 steps (61 percent)
[18:26:20] Writing local files
[18:26:20] Completed 3100000 out of 5000000 steps (62 percent)
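For comparing frame times between runs, the "Completed" lines can be turned into per-frame durations. This is only a sketch based on the line format shown in the log above; the function name is hypothetical, and since the log timestamps carry no date, it assumes consecutive frames are less than 24 hours apart.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as: [18:02:08] Completed 2950000 out of 5000000 steps ...
FRAME = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] Completed (\d+) out of \d+ steps")


def frame_durations(log_text):
    """Return (steps_completed, seconds_since_previous_frame) pairs."""
    results = []
    prev_time = None
    prev_steps = None
    for line in log_text.splitlines():
        match = FRAME.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        steps = int(match.group(2))
        # Skip repeated lines for the same step count (e.g. after a restart).
        if prev_time is not None and steps > prev_steps:
            # The modulo handles a frame that crosses midnight.
            delta = (stamp - prev_time) % timedelta(days=1)
            results.append((steps, int(delta.total_seconds())))
        prev_time, prev_steps = stamp, steps
    return results
```

On the log above this gives roughly eight minutes per frame, so a frame that takes much longer than its neighbours stands out quickly.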
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
ChelseaOilman wrote:
Someone else has successfully finished this WU.

... since the last time you checked this WU?
What does that mean?
Did the donor who finished the WU restart the client before the crash, or did the WU finish without any intervention? (Only the donor can answer that.)
Next time I'll try restarting the client if the deadline hasn't passed, but that means inspecting the log file regularly...
Let me ask a few questions to conclude:
- Should we notify Stanford if a WU crashed and we managed to complete it after restarting the client?
- What should we do if the WU crashed and the deadline has passed? Should we report it?
Team #35819 P2P-Community
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
_ikki_ wrote:
Next time I'll try restarting the client if the deadline hasn't passed, but that means inspecting the log file regularly...
Let me ask a few questions to conclude:
- Should we notify Stanford if a WU crashed and we managed to complete it after restarting the client?
- What should we do if the WU crashed and the deadline has passed? Should we report it?

At some point I'm sure that Stanford will figure out how to prevent this problem before it happens, but at this point the only thing we can do is continue to gather data that may help them find the problem. I'd recommend that we do report cases where a WU failed without a restart but was able to proceed if it was stopped/restarted. That may be the only way we can help.
The fact that ChelseaOilman was able to resume processing and continue past the point of the original error is important. For debugging purposes, what I'd like to see is a captured WU that will fail in the next frame, not one that can continue.

ChelseaOilman wrote:
After changing the date on my computer, because the deadline had passed, I was able to run the WU past frame 59 as well. If someone else hadn't already submitted it for credit I would keep going to the end.

For the purposes of learning more about the protein itself, finishing the project is important. For the purposes of debugging, finding a repeatable error is important.
If you want to help with the debugging, it's probably better to use the advanced setting to ignore local deadlines than it is to change your system clock, but both work.
@ _ikki_
If you disable local deadlines and restart the WU that you published, does it fail for you (indicating a possible hardware issue) or does it continue like it did for ChelseaOilman (indicating that restarting changed something)?
Posting FAH's log:
How to provide enough info to get helpful support.
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
bruce wrote:
@ _ikki_
If you disable local deadlines and restart the WU that you published, does it fail for you (indicating a possible hardware issue) or does it continue like it did for ChelseaOilman (indicating that restarting changed something)?

How do I disable local deadlines without modifying the date?
Team #35819 P2P-Community
Re: Project: 3062 (Run 3, Clone 26, Gen 1)
_ikki_ wrote:
how to disable local deadlines without modifying the date?

With the console client, restart with -config or with -configonly.
. . .
Change advanced options (yes/no) [no]? y
. . .
Ignore any deadline information (mainly useful if
system clock frequently has errors) (no/yes) [no]? y
. . .
Posting FAH's log:
How to provide enough info to get helpful support.