FahMon still hangs, but I can reinstall that. That is a curiosity rather than a necessity. After I restarted WU 2668, before I connected the wireless or checked FahMon, I got a fatal error, but I have no idea how fatal. I killed the program and reincarnated it. It seems to be running smoothly and has been doing so for the last hour. I know changing settings can upset some units, but so far I have never had a problem like this. I'll include the sections of the log that should have most of the relevant details. Maybe someone can tell me if I've caused the problem, so I can avoid doing it again in the future, or if there is a break in the unit that has made it invalid.
In the fah6 directory, there are also files named step50144b_n1.pdb, step50144c to step50169b_n1.pdb, step50169c_n1, which were created starting at 4:49 and were done at about 5:05, between 20% and 21%. Are these files of any use? Or did the program just resume from the last clean checkpoint?
[23:16:43] Completed 500000 out of 500000 steps (100 percent)
[23:16:43] Writing final coordinates.
[23:16:43] Past main M.D. loop
[23:16:43] Will end MPI now
[23:17:43]
[23:17:43] Finished Work Unit:
[23:17:43] - Reading up to 3723696 from "work/wudata_00.arc": Read 3723696
[23:17:43] - Reading up to 1778268 from "work/wudata_00.xtc": Read 1778268
[23:17:43] goefile size: 0
[23:17:43] logfile size: 16908
[23:17:43] Leaving Run
[23:17:46] - Writing 5523272 bytes of core data to disk...
[23:17:46] ... Done.
[23:17:47] - Shutting down core
[23:17:47]
[23:17:47] Folding@home Core Shutdown: FINISHED_UNIT
[23:17:53] CoreStatus = 64 (100)
[23:17:53] Unit 0 finished with 69 percent of time to deadline remaining.
[23:17:53] Updated performance fraction: 0.659703
[23:17:53] Sending work to server
[23:17:53] + Attempting to send results
[23:17:53] - Reading file work/wuresults_00.dat from core
[23:17:53] (Read 5523272 bytes from disk)
[23:17:53] Connecting to http://171.64.65.56:8080/
[23:22:21] Posted data.
[23:22:21] Initial: 0000; - Uploaded at ~20 kB/s
[23:22:22] - Averaged speed for that direction ~22 kB/s
[23:22:22] + Results successfully sent
[23:22:22] Thank you for your contribution to Folding@Home.
[23:22:22] + Number of Units Completed: 49
[23:26:27] - Warning: Could not delete all work unit files (0): Core returned invalid code
[23:26:27] Trying to send all finished work units
[23:26:27] + No unsent completed units remaining.
[23:26:27] - Preparing to get new work unit...
[23:26:27] + Attempting to get work packet
[23:26:27] - Connecting to assignment server
[23:26:27] Connecting to http://assign.stanford.edu:8080/
[23:26:27] Posted data.
[23:26:27] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[23:26:27] + News From Folding@Home: Welcome to Folding@Home
[23:26:28] Loaded queue successfully.
[23:26:28] Connecting to http://171.64.65.56:8080/
[23:26:32] Posted data.
[23:26:32] Initial: 0000; - Receiving payload (expected size: 4088766)
[23:27:04] - Downloaded at ~124 kB/s
[23:27:04] - Averaged speed for that direction ~120 kB/s
[23:27:04] + Received work.
[23:27:04] Trying to send all finished work units
[23:27:04] + No unsent completed units remaining.
[23:27:04] + Closed connections
[23:27:04]
[23:27:04] + Processing work unit
[23:27:04] Core required: FahCore_a2.exe
[23:27:04] Core found.
[23:27:04] Working on Unit 01 [August 27 23:27:04]
[23:27:04] + Working ...
[23:27:04] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -priority 96 -checkpoint 15 -verbose -lifeline 9057 -version 602'
[23:27:04]
[23:27:04] *------------------------------*
[23:27:04] Folding@Home Gromacs SMP Core
[23:27:04] Version 2.00 (Wed Jul 9 13:11:25 PDT 2008)
[23:27:04]
[23:27:04] Preparing to commence simulation
[23:27:04] - Ensuring status. Please wait.
[23:27:04] Files status OK
[23:27:05] - Expanded 4088254 -> 23992989 (decompressed 586.8 percent)
[23:27:05] Called DecompressByteArray: compressed_data_size=4088254 data_size=23992989, decompressed_data_size=23992989 diff=0
[23:27:05] - Digital signature verified
[23:27:05]
[23:27:05] Project: 2668 (Run 0, Clone 48, Gen 0)
[23:27:05]
[23:27:05] Assembly optimizations on if available.
[23:27:05] Entering M.D.
[23:27:15] (Run 0, Clone 48, Gen 0)
[23:27:15]
[23:27:15] Entering M.D.
[23:55:56] Completed 5000 out of 250000 steps (2%)
[04:02:36] Completed 47500 out of 250000 steps (19%)
[04:02:55] - Autosending finished units...
[04:02:55] Trying to send all finished work units
[04:02:55] + No unsent completed units remaining.
[04:02:55] - Autosend completed
[04:18:40] Completed 50000 out of 250000 steps (20%)
[04:27:58] ***** Got a SIGTERM signal (15)
[04:27:58] Killing all core threads
Folding@Home Client Shutdown.
[04:48:05]
[04:48:05] Preparing to commence simulation
[04:48:05] - Ensuring status. Please wait.
[04:48:05] Files status OK
[04:48:05] - Expanded 4088254 -> 23992989 (decompressed 586.8 percent)
[04:48:05] Called DecompressByteArray: compressed_data_size=4088254 data_size=23992989, decompressed_data_size=23992989 diff=0
[04:48:06] - Digital signature verified
[04:48:06]
[04:48:06] Project: 2668 (Run 0, Clone 48, Gen 0)
[04:48:06]
[04:48:06] Assembly optimizations on if available.
[04:48:06] Entering M.D.
[04:48:12] Will resume from checkpoint file
[04:48:15] ng M.D.
NNODES=4, MYRANK=0, HOSTNAME=hal5
NNODES=4, MYRANK=2, HOSTNAME=hal5
NNODES=4, MYRANK=3, HOSTNAME=hal5
NNODES=4, MYRANK=1, HOSTNAME=hal5
NODEID=0 argc=19
NODEID=2 argc=19
NODEID=1 argc=19
NODEID=3 argc=19
:-) G R O M A C S (-:
Groningen Machine for Chemical Simulation
:-) VERSION 3.3.99_development_200800503 (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2008, The GROMACS development team,
check out http://www.gromacs.org for more information.
:-) mdrun (-:
Reading file work/wudata_01.tpr, VERSION 3.3.99_development_20070618 (single precision)
[04:48:21] Will resume from checkpoint file
Note: tpx file_version 48, software version 56
Making 1D domain decomposition 1 x 1 x 4
starting mdrun 'IBX in water'
250000 steps, 500.0 ps.
[04:48:27] data_01.log
[04:48:27] Verified work/wudata_01.trr
[04:48:27] Verified work/wudata_01.xtc
[04:48:27] Verified work/wudata_01.edr
[04:48:27] Completed 50010 out of 250000 steps (20%)
t = 100.288 ps: Water molecule starting at atom 117853 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.290 ps: Water molecule starting at atom 117853 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.292 ps: Water molecule starting at atom 117853 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.294 ps: Water molecule starting at atom 117853 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.296 ps: Water molecule starting at atom 117853 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.298 ps: Water molecule starting at atom 117853 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.316 ps: Water molecule starting at atom 117346 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.318 ps: Water molecule starting at atom 117346 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.322 ps: Water molecule starting at atom 117346 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.324 ps: Water molecule starting at atom 115984 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.326 ps: Water molecule starting at atom 117346 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.328 ps: Water molecule starting at atom 135829 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.330 ps: Water molecule starting at atom 117346 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.332 ps: Water molecule starting at atom 115984 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.334 ps: Water molecule starting at atom 115984 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.336 ps: Water molecule starting at atom 115984 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
t = 100.338 ps: Water molecule starting at atom 115984 can not be settled.
Check for bad contacts and/or reduce the timestep.
Wrote pdb files with previous and current coordinates
Step 50170:
The charge group starting at atom 117346 moved than the distance allowed by the domain decomposition (1.200000) in direction Z
distance out of cell 59.336140
Old coordinates: 9.597 2.748 4.253
New coordinates: -76.197 -34.908 67.316
Old cell boundaries in direction Z: 4.123 7.907
New cell boundaries in direction Z: 4.049 7.980
-------------------------------------------------------
Program mdrun, VERSION 3.3.99_development_200800503
Source code file: domdec.c, line: 2644
Fatal error:
A charge group move too far between two domain decomposition steps
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
Error on node 1, will try to stop all the nodes
Halting parallel program mdrun on CPU 1 out of 4
gcq#0: Thanx for Using GROMACS - Have a Nice Day
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
[cli_0]: aborting job:
Fatal error in MPI_Sendrecv: Error message texts are not available
[04:57:06] Preparing to commence simulation
[04:57:06] - Ensuring status. Please wait.
[04:57:15] - Looking at optimizations...
[04:57:15] - Working with standard loops on this execution.
[04:57:15] - Files status OK
[04:57:17] - Expanded 4088254 -> 23992989 (decompressed 586.8 percent)
[04:57:17] Called DecompressByteArray: compressed_data_size=4088254 data_size=23992989, decompressed_data_size=23992989 diff=0
[04:57:17] - Digital signature verified
[04:57:17]
[04:57:17] Project: 2668 (Run 0, Clone 48, Gen 0)
[04:57:17]
[04:57:17] Entering M.D.
[04:57:23] Will resume from checkpoint file
NNODES=4, MYRANK=3, HOSTNAME=hal5
NNODES=4, MYRANK=2, HOSTNAME=hal5
NNODES=4, MYRANK=1, HOSTNAME=hal5
NNODES=4, MYRANK=0, HOSTNAME=hal5
NODEID=3 argc=19
NODEID=2 argc=19
NODEID=1 argc=19
NODEID=0 argc=19
:-) G R O M A C S (-:
Groningen Machine for Chemical Simulation
:-) VERSION 3.3.99_development_200800503 (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2008, The GROMACS development team,
check out http://www.gromacs.org for more information.
:-) mdrun (-:
Reading file work/wudata_01.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 56
Making 1D domain decomposition 1 x 1 x 4
starting mdrun 'IBX in water'
250000 steps, 500.0 ps.
[04:57:29] Resuming from checkpoint
[04:57:29] fcSaveRestoreState: I/O failed dir=0, var=0000000001E6DC00, varsize=581268
[04:57:29] Verified work/wudata_01.log
[04:57:29] Verified work/wudata_01.trr
[04:57:29] Verified work/wudata_01.xtc
[04:57:29] Verified work/wudata_01.edr
[04:57:29] Completed 50020 out of 250000 steps (20%)
[05:12:07] Completed 52500 out of 250000 steps (21%)
Writing checkpoint, step 52560 at Thu Aug 28 15:12:28 2008
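In case it helps anyone reading a log like the one above, here is a rough Python sketch for pulling out the parts that matter: how often each water molecule failed to settle, the text of any GROMACS fatal error, and the last progress line. The function name `summarize` and the `SAMPLE` string are made up for illustration (the sample is just a few lines copied from the log above), and the patterns only assume the line formats visible in this log, so treat it as a starting point rather than a proper parser.

```python
import re
from collections import Counter

# Patterns based only on the line formats visible in the log above.
SETTLE_RE = re.compile(
    r"t = ([\d.]+) ps: Water molecule starting at atom (\d+) can not be settled"
)
PROGRESS_RE = re.compile(r"Completed (\d+) out of (\d+) steps")


def summarize(log_text):
    """Collect SETTLE warnings, fatal error messages and the last progress line."""
    settle_counts = Counter()   # atom index -> number of warnings
    fatal_messages = []         # text on the line following "Fatal error:"
    last_progress = None        # (completed_steps, total_steps)

    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        settle = SETTLE_RE.search(line)
        if settle:
            settle_counts[int(settle.group(2))] += 1

        # In this log the message sits on the line after "Fatal error:".
        if line.strip() == "Fatal error:" and i + 1 < len(lines):
            fatal_messages.append(lines[i + 1].strip())

        progress = PROGRESS_RE.search(line)
        if progress:
            last_progress = (int(progress.group(1)), int(progress.group(2)))

    return {
        "settle_counts": dict(settle_counts),
        "fatal_messages": fatal_messages,
        "last_progress": last_progress,
    }


# A few lines copied from the log above, used as hypothetical test input.
SAMPLE = """\
[04:48:27] Completed 50010 out of 250000 steps (20%)
t = 100.288 ps: Water molecule starting at atom 117853 can not be settled.
t = 100.316 ps: Water molecule starting at atom 117346 can not be settled.
t = 100.318 ps: Water molecule starting at atom 117346 can not be settled.
Fatal error:
A charge group move too far between two domain decomposition steps
[04:57:29] Completed 50020 out of 250000 steps (20%)
"""

summary = summarize(SAMPLE)
print(summary)
```

A progress line that appears after the fatal error, as it does here, only shows that the core restarted and carried on from a checkpoint. It does not say whether the unit itself is valid.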