Page 1 of 1

Project: 895 (Run 6, Clone 485, Gen 1)

Posted: Wed Jun 25, 2008 6:38 am
by Raxle
Hanging up after 100% when "shutting down core", cpu usage goes to 0.

win 5.04 client

Code: Select all


[06:54:00] Loaded queue successfully.
[06:54:00] + Benchmarking ...
[06:54:03] 
[06:54:03] + Processing work unit
[06:54:03] Core required: FahCore_82.exe
[06:54:03] Core found.
[06:54:03] Working on Unit 03 [June 24 06:54:03]
[06:54:03] + Working ...
[06:54:03] 
[06:54:03] *------------------------------*
[06:54:03] Folding@Home PMD Core
[06:54:03] Version 1.03 (September 7, 2005)
[06:54:03] 
[06:54:03] Preparing to commence simulation
[06:54:04] - Ensuring status. Please wait.
[06:54:20] - Looking at optimizations...
[06:54:20] - Working with standard loops on this execution.
[06:54:21] - Previous termination of core was improper.
[06:54:21] - Going to use standard loops.
[06:54:21] - Files status OK
[06:54:21] - Expanded 18817 -> 118805 (decompressed 631.3 percent)
[06:54:21] 
[06:54:21] Project: 895 (Run 6, Clone 485, Gen 1)
[06:54:21] 
[06:54:21] Entering M.D.
[07:03:36] Protein: p895_p53peptides_gb
[07:03:36] 
[07:03:36] Completed 3000000 out of 5000000 steps  (60)
[07:09:22] Writing checkpoint files
[07:24:22] Writing checkpoint files
[07:28:14] Writing local files
[07:28:14] Completed 3050000 out of 5000000 steps  (61)
[07:28:14] Writing checkpoint files
[07:43:14] Writing checkpoint files
[07:53:45] Writing local files
[07:53:45] Completed 3100000 out of 5000000 steps  (62)
[07:53:45] Writing checkpoint files
[08:08:45] Writing checkpoint files
[08:19:43] Writing local files
[08:19:43] Completed 3150000 out of 5000000 steps  (63)
[08:19:43] Writing checkpoint files
[08:34:43] Writing checkpoint files
[08:45:47] Writing local files
[08:45:47] Completed 3200000 out of 5000000 steps  (64)
[08:45:47] Writing checkpoint files
[09:00:48] Writing checkpoint files
[09:11:47] Writing local files
[09:11:47] Completed 3250000 out of 5000000 steps  (65)
[09:11:47] Writing checkpoint files
[09:26:46] Writing checkpoint files
[09:36:39] Writing local files
[09:36:39] Completed 3300000 out of 5000000 steps  (66)
[09:36:39] Writing checkpoint files
[09:51:40] Writing checkpoint files
[10:01:24] Writing local files
[10:01:24] Completed 3350000 out of 5000000 steps  (67)
[10:01:24] Writing checkpoint files
[10:16:24] Writing checkpoint files
[10:25:47] Writing local files
[10:25:47] Completed 3400000 out of 5000000 steps  (68)
[10:25:47] Writing checkpoint files
[10:40:48] Writing checkpoint files
[10:50:23] Writing local files
[10:50:23] Completed 3450000 out of 5000000 steps  (69)
[10:50:23] Writing checkpoint files
[11:05:22] Writing checkpoint files
[11:14:57] Writing local files
[11:14:57] Completed 3500000 out of 5000000 steps  (70)
[11:14:57] Writing checkpoint files
[11:29:58] Writing checkpoint files
[11:39:30] Writing local files
[11:39:30] Completed 3550000 out of 5000000 steps  (71)
[11:39:30] Writing checkpoint files
[11:54:30] Writing checkpoint files
[12:04:08] Writing local files
[12:04:08] Completed 3600000 out of 5000000 steps  (72)
[12:04:08] Writing checkpoint files
[12:19:08] Writing checkpoint files
[12:28:22] Writing local files
[12:28:22] Completed 3650000 out of 5000000 steps  (73)
[12:28:22] Writing checkpoint files
[12:43:22] Writing checkpoint files
[12:52:57] Writing local files
[12:52:57] Completed 3700000 out of 5000000 steps  (74)
[12:52:57] Writing checkpoint files
[13:07:57] Writing checkpoint files
[13:17:32] Writing local files
[13:17:32] Completed 3750000 out of 5000000 steps  (75)
[13:17:32] Writing checkpoint files
[13:32:33] Writing checkpoint files
[13:42:22] Writing local files
[13:42:22] Completed 3800000 out of 5000000 steps  (76)
[13:42:22] Writing checkpoint files
[13:57:23] Writing checkpoint files
[14:10:04] Writing local files
[14:10:04] Completed 3850000 out of 5000000 steps  (77)
[14:10:04] Writing checkpoint files
[14:25:04] Writing checkpoint files
[14:33:57] Writing local files
[14:33:57] Completed 3900000 out of 5000000 steps  (78)
[14:33:58] Writing checkpoint files
[14:48:58] Writing checkpoint files
[14:58:51] Writing local files
[14:58:51] Completed 3950000 out of 5000000 steps  (79)
[14:58:51] Writing checkpoint files
[15:13:52] Writing checkpoint files
[15:21:39] Writing local files
[15:21:39] Completed 4000000 out of 5000000 steps  (80)
[15:21:39] Writing checkpoint files
[15:36:40] Writing checkpoint files
[15:45:03] Writing local files
[15:45:03] Completed 4050000 out of 5000000 steps  (81)
[15:45:03] Writing checkpoint files
[16:00:04] Writing checkpoint files
[16:07:44] Writing local files
[16:07:44] Completed 4100000 out of 5000000 steps  (82)
[16:07:44] Writing checkpoint files
[16:22:48] Writing checkpoint files
[16:30:31] Writing local files
[16:30:31] Completed 4150000 out of 5000000 steps  (83)
[16:30:31] Writing checkpoint files
[16:45:31] Writing checkpoint files
[16:53:40] Writing local files
[16:53:40] Completed 4200000 out of 5000000 steps  (84)
[16:53:40] Writing checkpoint files
[17:08:41] Writing checkpoint files
[17:16:18] Writing local files
[17:16:18] Completed 4250000 out of 5000000 steps  (85)
[17:16:18] Writing checkpoint files
[17:31:20] Writing checkpoint files
[17:40:07] Writing local files
[17:40:07] Completed 4300000 out of 5000000 steps  (86)
[17:40:07] Writing checkpoint files
[17:55:07] Writing checkpoint files
[18:02:47] Writing local files
[18:02:47] Completed 4350000 out of 5000000 steps  (87)
[18:02:47] Writing checkpoint files
[18:17:48] Writing checkpoint files
[18:25:26] Writing local files
[18:25:26] Completed 4400000 out of 5000000 steps  (88)
[18:25:26] Writing checkpoint files
[18:40:27] Writing checkpoint files
[18:48:50] Writing local files
[18:48:50] Completed 4450000 out of 5000000 steps  (89)
[18:48:50] Writing checkpoint files
[19:03:50] Writing checkpoint files
[19:11:29] Writing local files
[19:11:29] Completed 4500000 out of 5000000 steps  (90)
[19:11:29] Writing checkpoint files
[19:26:29] Writing checkpoint files
[19:34:36] Writing local files
[19:34:36] Completed 4550000 out of 5000000 steps  (91)
[19:34:36] Writing checkpoint files
[19:49:36] Writing checkpoint files
[19:57:15] Writing local files
[19:57:15] Completed 4600000 out of 5000000 steps  (92)
[19:57:15] Writing checkpoint files
[20:12:15] Writing checkpoint files
[20:19:52] Writing local files
[20:19:52] Completed 4650000 out of 5000000 steps  (93)
[20:19:52] Writing checkpoint files
[20:34:52] Writing checkpoint files
[20:43:01] Writing local files
[20:43:01] Completed 4700000 out of 5000000 steps  (94)
[20:43:01] Writing checkpoint files
[20:58:02] Writing checkpoint files
[21:05:41] Writing local files
[21:05:41] Completed 4750000 out of 5000000 steps  (95)
[21:05:41] Writing checkpoint files
[21:20:41] Writing checkpoint files
[21:29:04] Writing local files
[21:29:04] Completed 4800000 out of 5000000 steps  (96)
[21:29:04] Writing checkpoint files
[21:44:05] Writing checkpoint files
[21:51:44] Writing local files
[21:51:44] Completed 4850000 out of 5000000 steps  (97)
[21:51:44] Writing checkpoint files
[22:06:44] Writing checkpoint files
[22:14:24] Writing local files
[22:14:24] Completed 4900000 out of 5000000 steps  (98)
[22:14:24] Writing checkpoint files
[22:29:25] Writing checkpoint files
[22:37:48] Writing local files
[22:37:48] Completed 4950000 out of 5000000 steps  (99)
[22:37:48] Writing checkpoint files
[22:52:48] Writing checkpoint files
[23:00:28] Writing local files
[23:00:28] Completed 5000000 out of 5000000 steps  (100)
[23:00:28] Writing checkpoint files
[23:01:28] 
[23:01:28] Finished Work Unit:
[23:01:29] Leaving Run
[23:01:33] - Writing 625983 bytes of core data to disk...
[23:01:33]   ... Done.
[23:01:33] - Shutting down core

Re: Project: 895 (Run 6, Clone 485, Gen 1)

Posted: Wed Jun 25, 2008 6:55 am
by bruce
This bug has been reported a few times but nobody has been able to figure out what causes it.

Which version of WIndows are you running?
Are other FAH clients running on the same machine? if so what WUs were active? What MachineIDs are assigned?
What hardware is this happening on?

Re: Project: 895 (Run 6, Clone 485, Gen 1)

Posted: Wed Jun 25, 2008 7:47 am
by Raxle
Windows XP Professional SP2 with automatic updates.
Other client running as service with machine ID 1 (worked on Project: 2170 (Run 43, Clone 833, Gen 2)), problem client running in console with machine ID 2.
All running in Lenovo R61i notebook (ThinkPad R61i Intel Core 2 Duo Processor T5250 1.50GHz L2 cache 2MB FSB 800MHz; RAM 2x512MB PC2-5300 DDR2 SDRAM; HDD 160Gb SATA 5400prm; Intel Graphics Media Accelerator X3100)

Re: Project: 895 (Run 6, Clone 485, Gen 1)

Posted: Wed Jun 25, 2008 6:06 pm
by bruce
This is useful information, but I still don't know what the developers are going to need to track this one down.

One of the previous reports had two AMBER WUs running at the same time and one was a service and one was not, which appears to apply in your case. Please fill in the missing data "xx" below. I'm going to assume that the service was running continuously and it finished one WU and started another during the time that p895 r6 c485 g1 was running. If that's not true, you won't have any data for the blank lines, or you may need more data from the service -- whatever events it takes to cover the timespan that the WU which hung was being processed.

MachineID = 2 (non service)
Started p895 r6 c485 g1 [xx:xx:xx]
Stopped p895 r6 c485 g1 [xx:xx:xx]
Restarted p895 r6 c485 g1 [06:54:21]
Finished p895 r6 c485 g1 [23:01:33]

MachineID = 1 (service)
Started pxx rxx cxx gxx [xx:xx:xx]
Finished pxx rxx cxx gxx [xx:xx:xx]
Started p2170 r43 c833 g2 [xx:xx:xx]
Finished p2170 r43 c833 g2 [xx:xx:xx]


As far as what you can do to avoid the problem in the future, I have a couple of guesses which might help, but nothing is certain.
1) Configure one client to run AMBER and the other to run something else. (Probably neither desirable, points-wise, or possible)
2) Run both copies interactively or both copies as services.

If you choose either of these options and you ever see a WU hang like this again, be sure to notify us. As I've said, the bug is really obscure and we need all the help we can get to figure it out.

Re: Project: 895 (Run 6, Clone 485, Gen 1)

Posted: Wed Jun 25, 2008 11:48 pm
by nwkelley
i've passed this on to the group so we're at least aware of the situation. as bruce said, might be a tricky one!

please keep us updated if it reoccurs...

-nick

Re: Project: 895 (Run 6, Clone 485, Gen 1)

Posted: Thu Jun 26, 2008 7:10 am
by Raxle
MachineID = 2 (non service)
Started p895 r6 c485 g1 [06:40:xx] - [07:10:xx] June 23--- approximately
Stopped p895 r6 c485 g1 [06:36:xx] June 24 --- approximately
------------------Sorry, have no logs before this point..
Restarted p895 r6 c485 g1 [06:54:21] June 24
Finished p895 r6 c485 g1 [23:01:33] June 24

MachineID = 1 (service)
Started p2170 r43 c833 g2 [02:19:40] June 23
Stopped p2170 r43 c833 g2 [06:36:36] June 23
Restarted p2170 r43 c833 g2 [06:37:53] June 23
Stopped p2170 r43 c833 g2 [06:48:30] June 24 --- reboot, so another client was stopped approx. at 06:36
Restarted p2170 r43 c833 g2 [06:49:48] June 24
Stopped p2170 r43 c833 g2 [06:19:33] June 26 (at 98 %, still working :) )

So, MachineID = 1 (service) was running WU="p2170 r43 c833" g2 before and after problem WU="p895 r6 c485 g1".
I hope this helps.

Running 2 clients as services not always handy - sometimes I have to shutdown one of the clients.