Project: 3062 (Run 3, Clone 26, Gen 1)


_ikki_
Posts: 27
Joined: Wed Dec 05, 2007 8:38 am


Post by _ikki_ »

It crashed 3 times at the same point (59%).

I made hourly backups of the fah folder, so if anyone is interested, I can provide the backup from just before the crash.
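
For readers who want to keep similar snapshots, here is a minimal sketch of a timestamped folder backup. The folder names, the `keep` limit, and the function name `backup_folder` are assumptions for illustration only; nothing here is part of the Folding@Home client, and the scheduling (for example an hourly cron job) is left to the reader.

```python
import shutil
import time
from pathlib import Path


def backup_folder(src: Path, dest_root: Path, keep: int = 24) -> Path:
    """Copy src into dest_root/<name>-<timestamp> and prune older snapshots."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest_root / f"{src.name}-{stamp}"
    shutil.copytree(src, target)  # raises if a snapshot with this stamp exists

    # Timestamped names sort chronologically, so the oldest come first.
    snapshots = sorted(dest_root.glob(f"{src.name}-*"))
    excess = max(len(snapshots) - keep, 0)
    for old in snapshots[:excess]:
        shutil.rmtree(old)
    return target
```

A copy taken while the client is writing its checkpoint may be inconsistent, so pausing the client during the copy is probably the safer choice.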
Team #35819 P2P-Community
Flathead74
Posts: 266
Joined: Sun Dec 02, 2007 6:08 pm
Location: Central New York

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by Flathead74 »

_ikki_, have you tried restarting the WU from a point just before the usual crash-point, say at 56 - 57% or so?

Some folks have found that the WU will complete if done this way, though not all.
_ikki_
Posts: 27
Joined: Wed Dec 05, 2007 8:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by _ikki_ »

No, I didn't try restarting the WU before it crashed, but it's worth testing, thanks to my backups ;)
Team #35819 P2P-Community
gwildperson
Posts: 450
Joined: Tue Dec 04, 2007 8:36 pm

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by gwildperson »

_ikki_ wrote: It crashed 3 times at the same point (59%).
What was the error message? We might be able to help a bit more if you posted FAHlog.txt. There are lots of possible meanings for "crashed" and I don't want to guess. ;)
_ikki_
Posts: 27
Joined: Wed Dec 05, 2007 8:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by _ikki_ »

The error looks like this:
[10:51:13] Completed 2900000 out of 5000000 steps (58 percent)
[11:00:31] Writing local files
[11:00:31] Completed 2950000 out of 5000000 steps (59 percent)
[11:08:25] Warning: long 1-4 interactions
[11:08:29] CoreStatus = 1 (1)
[11:08:29] Client-core communications error: ERROR 0x1
[11:08:29] Deleting current work unit & continuing...
[11:12:51] - Warning: Could not delete all work unit files (7): Core returned invalid code
[11:12:51] Trying to send all finished work units
[11:12:51] + No unsent completed units remaining.
[11:12:51] - Preparing to get new work unit...
[11:12:51] + Attempting to get work packet
[11:12:51] - Will indicate memory of 2014 MB
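
As a rough aid for spotting this pattern in a long FAHlog.txt, the sketch below pairs every `CoreStatus` line with the last completed percentage seen before it. The regular expressions are written against the excerpt above only, and the function name `core_status_events` is hypothetical. The sketch does not decide which status codes count as failures.

```python
import re

COMPLETED = re.compile(r"Completed \d+ out of \d+ steps\s+\((\d+) percent\)")
CORE_STATUS = re.compile(r"CoreStatus = (\w+)")


def core_status_events(lines):
    """Return [(last_percent, status_code)] for every CoreStatus line."""
    events = []
    last_percent = None
    for line in lines:
        done = COMPLETED.search(line)
        if done:
            last_percent = int(done.group(1))
            continue
        status = CORE_STATUS.search(line)
        if status:
            events.append((last_percent, status.group(1)))
    return events
```

Run over the excerpt above, it should report a single event: status `1` after 59 percent.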
Team #35819 P2P-Community
ChelseaOilman
Posts: 1037
Joined: Sun Dec 02, 2007 3:47 pm
Location: Colorado @ 10,000 feet

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by ChelseaOilman »

This WU hasn't been submitted by anyone yet. If it's been issued to other people, they may be having the same issues as you.

Look in the work folder to see if there are any wuresults_0x.dat files. If there are, I would try running qfix to get credit for what you did and get this WU entered into the WU database.
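
A quick way to check for those files without browsing the folder by hand is sketched below. It assumes the `x` in `wuresults_0x.dat` stands for a single digit and matches any two-digit suffix. The function name `find_results` is made up for this example, and the sketch only lists files; it does not run qfix.

```python
from pathlib import Path


def find_results(work_dir):
    """Return the names of wuresults_NN.dat files found in work_dir, sorted."""
    matches = Path(work_dir).glob("wuresults_[0-9][0-9].dat")
    return sorted(p.name for p in matches)
```

An empty list means there is nothing in the work folder for qfix to pick up.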
_ikki_
Posts: 27
Joined: Wed Dec 05, 2007 8:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by _ikki_ »

There is no wuresults_0x.dat file in the work folder.
Team #35819 P2P-Community
_ikki_
Posts: 27
Joined: Wed Dec 05, 2007 8:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by _ikki_ »

For debugging, here is the data from one frame (about 10 minutes) before the crash:

http://rapidshare.com/files/80698307/fa ... 1.tgz.html

Happy new year :d
Last edited by _ikki_ on Wed Jan 02, 2008 2:00 pm, edited 1 time in total.
Team #35819 P2P-Community
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by VijayPande »

Thanks! We'll take a look once the full team gets back from the Stanford holiday break.
_ikki_
Posts: 27
Joined: Wed Dec 05, 2007 8:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by _ikki_ »

Okay, I hope this data will be useful.
Team #35819 P2P-Community
ChelseaOilman
Posts: 1037
Joined: Sun Dec 02, 2007 3:47 pm
Location: Colorado @ 10,000 feet

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by ChelseaOilman »

Someone else has successfully finished this WU.

Your WU (P3062 R3 C26 G1) was added to the stats database on 2007-12-31 22:59:05 for 1324 points of credit.

After changing the date on my computer, because the deadline had passed, I was able to run the WU past frame 59 as well. If someone else hadn't already submitted it for credit I would keep going to the end.
[17:53:44] Working on Unit 07 [December 23 17:53:44]
[17:53:44] + Working ...
[17:53:44] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 07 -checkpoint 15 -forceasm -verbose -lifeline 20217 -version 600'

[17:53:44]
[17:53:44] *------------------------------*
[17:53:44] Folding@Home Gromacs SMP Core
[17:53:44] Version 1.74 (November 27, 2006)
[17:53:44]
[17:53:44] Preparing to commence simulation
[17:53:44] - Ensuring status. Please wait.
[17:54:01] - Assembly optimizations manually forced on.
[17:54:01] - Not checking prior termination.
[17:54:01] - Expanded 607662 -> 3257309 (decompressed 536.0 percent)
[17:54:01]
[17:54:01] Project: 3062 (Run 3, Clone 26, Gen 1)
[17:54:01]
[17:54:01] Assembly optimizations on if available.
[17:54:01] Entering M.D.
[17:54:07] Calling FAH init
[17:54:07] mbda5_99sb
[17:54:07] Writing local files
[17:54:07] Completed 2900000 out of 5000000 steps (58 percent)
[17:54:07] Extra SSE boost OK.
[17:54:07]
[17:54:07] Completed 2900000 out of 5000000 steps (58 percent)
[17:54:07] Extra SSE boost OK.
[18:02:07] Writing local files
[18:02:08] Completed 2950000 out of 5000000 steps (59 percent)
[18:10:16] Writing local files
[18:10:16] Completed 3000000 out of 5000000 steps (60 percent)
[18:18:21] Writing local files
[18:18:21] Completed 3050000 out of 5000000 steps (61 percent)
[18:26:20] Writing local files
[18:26:20] Completed 3100000 out of 5000000 steps (62 percent)
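
To compare frame times between a run that crashes and one that continues, a small sketch like the following can pull per-frame durations out of a log in this format. It relies only on the `[HH:MM:SS] Completed ... (N percent)` lines shown above. The function name `frame_durations` is hypothetical, and the midnight handling assumes no single frame takes longer than a day.

```python
import re
from datetime import datetime, timedelta

FRAME = re.compile(
    r"\[(\d\d:\d\d:\d\d)\] Completed \d+ out of \d+ steps\s+\((\d+) percent\)"
)


def frame_durations(lines):
    """Return [(percent, seconds)] for each newly completed frame."""
    durations = []
    prev_time = None
    prev_percent = None
    for line in lines:
        match = FRAME.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        percent = int(match.group(2))
        if prev_percent is not None and percent <= prev_percent:
            continue  # repeated line, e.g. the duplicate printed on resume
        if prev_percent is not None:
            delta = stamp - prev_time
            if delta < timedelta(0):
                delta += timedelta(days=1)  # timestamps rolled past midnight
            durations.append((percent, int(delta.total_seconds())))
        prev_time, prev_percent = stamp, percent
    return durations
```

The first duration after a restart is measured from the resume point rather than from the start of the frame, so it is not necessarily a full frame.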
_ikki_
Posts: 27
Joined: Wed Dec 05, 2007 8:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by _ikki_ »

Someone else has successfully finished this WU.
... since the last time you checked this WU?

What does that mean?

Did the donor who finished the WU restart the client before the crash, or did the WU finish without any intervention? (Only the donor himself can answer that ;) )

Next time I'll try restarting the client if the deadline hasn't passed, but that means checking the log file regularly...

Let me ask a couple of questions to conclude:
- Should we notify Stanford if a WU crashes and we then manage to complete it after restarting the client?
- What should we do if a WU crashes and the deadline has passed? Should we report it?
Team #35819 P2P-Community
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by bruce »

_ikki_ wrote: Next time I'll try restarting the client if the deadline hasn't passed, but that means checking the log file regularly...

Let me ask a couple of questions to conclude:
- Should we notify Stanford if a WU crashes and we then manage to complete it after restarting the client?
- What should we do if a WU crashes and the deadline has passed? Should we report it?
At some point I'm sure that Stanford will figure out how to prevent this problem before it happens, but at this point the only thing we can do is continue to gather data that may help them find the problem. I'd recommend that we do report cases where a WU failed without a restart but was able to proceed if it was stopped/restarted. That may be the only way we can help.

The fact that ChelseaOilman was able to resume processing and continue past the point of the original error is important. For debugging purposes, what I'd like to see is a captured WU that will fail in the next frame, not one that can continue.
ChelseaOilman wrote:After changing the date on my computer, because the deadline had passed, I was able to run the WU past frame 59 as well. If someone else hadn't already submitted it for credit I would keep going to the end.
For the purposes of learning more about the protein itself, finishing the project is important. For the purposes of debugging, finding a repeatable error is important.

If you want to help with the debugging, it's probably better to use the advanced setting to ignore local deadlines than it is to change your system clock, but both work.

@ _ikki_
If you disable local deadlines and restart the WU that you published, does it fail for you (indicating a possible hardware issue) or does it continue like it did for ChelseaOilman (indicating that restarting changed something) :?:
_ikki_
Posts: 27
Joined: Wed Dec 05, 2007 8:38 am

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by _ikki_ »

bruce wrote:
@ _ikki_
If you disable local deadlines and restart the WU that you published, does it fail for you (indicating a possible hardware issue) or does it continue like it did for ChelseaOilman (indicating that restarting changed something) :?:
How do I disable local deadlines without changing the date? :?:
Team #35819 P2P-Community
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 3062 (Run 3, Clone 26, Gen 1)

Post by bruce »

_ikki_ wrote: How do I disable local deadlines without changing the date? :?:
With the console client, restart with -config or with -configonly.
. . .
Change advanced options (yes/no) [no]? y
. . .
Ignore any deadline information (mainly useful if
system clock frequently has errors) (no/yes) [no]? y
. . .