
Project: 7504 (Run 19, Clone 11, Gen 0)

Posted: Fri Nov 11, 2011 3:28 am
by HayesK
i7-930, Ubuntu 11.04, client 6.34, core 2.27. I started getting "Client-core communications error: ERROR 0x8b" on Project: 7504 (Run 19, Clone 11, Gen 0) a few days ago. The first time, the client recovered after a few failures and downloaded a different WU, which completed OK. The second time, the client paused after repeated failures but downloaded a different WU when restarted, which completed OK. The third time, the client paused after repeated failures and would not download a different WU until I deleted the work folder and the queue.dat and machinedependent.dat files. The rig has completed 2 prior p7504 WUs OK. A copy of the terminal text is below.


[19:33:10] Project: 7504 (Run 19, Clone 11, Gen 0)
[19:33:10] 
[19:33:10] Assembly optimizations on if available.
[19:33:10] Entering M.D.
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                            :-)  VERSION 4.5.3  (-:

        Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
      Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra, 
        Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff, 
           Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz, 
                Michael Shirts, Alfons Sijbers, Peter Tieleman,

               Berk Hess, David van der Spoel, and Erik Lindahl.

       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
            Copyright (c) 2001-2010, The GROMACS development team at
        Uppsala University & The Royal Institute of Technology, Sweden.
            check out http://www.gromacs.org for more information.


                               :-)  Gromacs  (-:

Reading file work/wudata_07.tpr, VERSION 4.5.3-dev-20101129-58a6b (single precision)
Starting 8 threads
Making 2D domain decomposition 4 x 2 x 1
starting mdrun 'KPC in water'
500000 steps,   2000.0 ps.
[19:33:16] Mapping NT from 8 to 8 
[19:33:16] Completed 0 out of 500000 steps  (0%)

Step 6, time 0.024 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.008367, max 0.214761 (between atoms 2120 and 2121)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
   2120   2121  101.2    0.0919   0.1120      0.0922
   2126   2127   99.4    0.0921   0.1008      0.0922

Step 7, time 0.028 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.002713, max 0.066975 (between atoms 2126 and 2127)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
   2126   2127   92.1    0.1008   0.0983      0.0922

Step 8, time 0.032 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.005786, max 0.127879 (between atoms 2120 and 2121)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
   2120   2121   97.4    0.0931   0.1039      0.0922
   2126   2127   98.2    0.0983   0.1016      0.0922

Step 9, time 0.036 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.006935, max 0.194229 (between atoms 2126 and 2127)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
   2126   2127  101.1    0.1016   0.1101      0.0922

NOTE: Turning on dynamic load balancing


Step 10, time 0.04 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.028186, max 0.840337 (between atoms 2126 and 2127)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
   2120   2121   95.4    0.0898   0.1019      0.0922
   2126   2127  123.0    0.1101   0.1696      0.0922

Step 11, time 0.044 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 192.414480, max 3864.146240 (between atoms 2115 and 2118)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
   2118   2126  112.3    0.2094 178.7684      0.1664
   2118   2127  104.3    0.1585 174.2906      0.1664
   2126   2127  117.5    0.1696   3.2380      0.0922
   2136   2144  109.0    0.1522   0.9631      0.1522
   2136   2138  113.3    0.1526   1.1632      0.1526
   2138   2141  105.5    0.1522   0.5683      0.1522
   2144   2146  107.2    0.1335   0.4836      0.1335
   2144   2145  115.5    0.1229   0.2654      0.1229
   2146   2148   91.7    0.1449   0.4996      0.1449
   2091   2094  106.9    0.1526   1.5871      0.1526
   2109   2111  107.9    0.1335  82.3349      0.1335
   2109   2110   96.1    0.1229  13.1070      0.1229
   2111   2113  105.6    0.1446 129.0341      0.1449
   2113   2132   99.2    0.1520 178.8563      0.1522
   2113   2115   95.9    0.1520 463.1650      0.1526
   2132   2134   98.1    0.1335  38.1713      0.1335
   2132   2133  109.8    0.1229  38.3856      0.1229
   2094   2097  154.1    0.1526   0.1679      0.1526
   2089   2091  130.5    0.1526   7.4305      0.1526
   2069   2085  146.7    0.1522   0.3818      0.1522
   2085   2087  109.7    0.1335   1.5452      0.1335
   2085   2086  135.6    0.1229   0.4953      0.1229
Segmentation fault
[19:33:17] CoreStatus = 8B (139)
[19:33:17] Client-core communications error: ERROR 0x8b
[19:33:17] Deleting current work unit & continuing...
[19:33:27] Trying to send all finished work units
[19:33:27] + No unsent completed units remaining.
[19:33:27] - Preparing to get new work unit...
[19:33:27] Cleaning up work directory
[19:33:27] + Attempting to get work packet
[19:33:27] Passkey found
[19:33:27] - Will indicate memory of 3800 MB
[19:33:27] - Connecting to assignment server
[19:33:27] Connecting to http://assign.stanford.edu:8080/
[19:33:28] Posted data.
[19:33:28] Initial: 8F80; - Successful: assigned to (128.143.199.97).
[19:33:28] + News From Folding@Home: Welcome to Folding@Home
[19:33:28] Loaded queue successfully.
[19:33:28] Sent data
[19:33:28] Connecting to http://128.143.199.97:8080/
[19:33:29] Posted data.
[19:33:29] Initial: 0000; - Receiving payload (expected size: 1510591)
[19:33:32] - Downloaded at ~491 kB/s
[19:33:32] - Averaged speed for that direction ~502 kB/s
[19:33:32] + Received work.
[19:33:32] + Closed connections
[19:33:37] 
[19:33:37] + Processing work unit
[19:33:37] Core required: FahCore_a3.exe
[19:33:37] Core found.
[19:33:37] Working on queue slot 08 [November 10 19:33:37 UTC]
[19:33:37] + Working ...
[19:33:37] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 08 -np 8 -checkpoint 30 -verbose -lifeline 1487 -version 634'

[19:33:37] 
[19:33:37] *------------------------------*
[19:33:37] Folding@Home Gromacs SMP Core
[19:33:37] Version 2.27 (Dec. 15, 2010)
[19:33:37] 
[19:33:37] Preparing to commence simulation
[19:33:37] - Looking at optimizations...
[19:33:37] - Created dyn
[19:33:37] - Files status OK
[19:33:38] - Expanded 1510079 -> 2700832 (decompressed 178.8 percent)
[19:33:38] Called DecompressByteArray: compressed_data_size=1510079 data_size=2700832, decompressed_data_size=2700832 diff=0
[19:33:38] - Digital signature verified
[19:33:38] 
[19:33:38] Project: 7504 (Run 19, Clone 11, Gen 0)
[19:33:38] 
[19:33:38] Assembly optimizations on if available.
[19:33:38] Entering M.D.
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                            :-)  VERSION 4.5.3  (-:

        Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
      Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra, 
        Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff, 
           Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz, 
                Michael Shirts, Alfons Sijbers, Peter Tieleman,

               Berk Hess, David van der Spoel, and Erik Lindahl.

       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
            Copyright (c) 2001-2010, The GROMACS development team at
        Uppsala University & The Royal Institute of Technology, Sweden.
            check out http://www.gromacs.org for more information.


                               :-)  Gromacs  (-:

Reading file work/wudata_08.tpr, VERSION 4.5.3-dev-20101129-58a6b (single precision)
Starting 8 threads
Making 2D domain decomposition 4 x 2 x 1
starting mdrun 'KPC in water'
500000 steps,   2000.0 ps.
[19:33:44] Mapping NT from 8 to 8 
[19:33:44] Completed 0 out of 500000 steps  (0%)

Step 6, time 0.024 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.008367, max 0.214761 (between atoms 2120 and 2121)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
   2120   2121  101.2    0.0919   0.1120      0.0922
   2126   2127   99.4    0.0921   0.1008      0.0922

Step 7, time 0.028 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.002713, max 0.066975 (between atoms 2126 and 2127)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
   2126   2127   92.1    0.1008   0.0983      0.0922

Step 8, time 0.032 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.005786, max 0.127879 (between atoms 2120 and 2121)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
   2120   2121   97.4    0.0931   0.1039      0.0922
   2126   2127   98.2    0.0983   0.1016      0.0922

Step 9, time 0.036 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.006935, max 0.194229 (between atoms 2126 and 2127)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
   2126   2127  101.1    0.1016   0.1101      0.0922

NOTE: Turning on dynamic load balancing


Step 10, time 0.04 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.028172, max 0.840337 (between atoms 2126 and 2127)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
   2120   2121   95.4    0.0898   0.1019      0.0922
   2126   2127  123.0    0.1101   0.1696      0.0922

Step 11, time 0.044 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 192.318370, max 3864.146240 (between atoms 2115 and 2118)
bonds that rotated more than 90 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
   2118   2126  112.3    0.2094 178.7684      0.1664
   2118   2127  104.3    0.1585 174.2906      0.1664
   2126   2127  117.5    0.1696   3.2380      0.0922
   2136   2144  109.0    0.1522   0.9631      0.1522
   2136   2138  113.3    0.1526   1.1632      0.1526
   2138   2141  105.5    0.1522   0.5683      0.1522
   2144   2146  107.2    0.1335   0.4836      0.1335
   2144   2145  115.5    0.1229   0.2654      0.1229
   2146   2148   91.7    0.1449   0.4996      0.1449
   2089   2091  130.5    0.1526   7.4305      0.1526
   2091   2094  106.9    0.1526   1.5871      0.1526
   2109   2111  107.9    0.1335  82.3349      0.1335
   2109   2110   96.1    0.1229  13.1070      0.1229
   2111   2113  105.6    0.1446 129.0341      0.1449
   2113   2132   99.2    0.1520 178.8563      0.1522
   2113   2115   95.9    0.1520 463.1650      0.1526
   2132   2134   98.1    0.1335  38.1713      0.1335
   2132   2133  109.8    0.1229  38.3856      0.1229
   2094   2097  154.1    0.1526   0.1679      0.1526
   2069   2085  146.7    0.1522   0.3818      0.1522
   2085   2087  109.7    0.1335   1.5452      0.1335
   2085   2086  135.6    0.1229   0.4953      0.1229
Segmentation fault
[19:33:45] CoreStatus = 8B (139)
[19:33:45] Client-core communications error: ERROR 0x8b
[19:33:45] 
Folding@Home will go to sleep for 1 day as there have been 5 consecutive Cores executed which failed to complete a work unit.
[19:33:45] (To wake it up early, quit the application and restart it.)
[19:33:45] If problems persist, please visit our website at http://folding.stanford.edu for help.
[19:33:45] + Sleeping...

[19:36:56] - Autosending finished units... [November 10 19:36:56 UTC]
[19:36:56] Trying to send all finished work units
[19:36:56] + No unsent completed units remaining.
[19:36:56] - Autosend completed
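
For readers decoding the status line in the log above: "CoreStatus = 8B (139)" printed right after "Segmentation fault" is consistent with the common Unix shell convention of reporting a process killed by a signal as 128 plus the signal number, which would make this signal 11. The sketch below only illustrates that arithmetic; the function names are hypothetical, and whether the client derives CoreStatus this way is an assumption, not something the log confirms.

```python
import signal


def decode_core_status(hex_status):
    """Split a CoreStatus hex string into (decimal value, signal number or None)."""
    value = int(hex_status, 16)
    # Assumption: values in 129..255 follow the shell's "128 + signal" convention.
    signal_number = value - 128 if 128 < value < 256 else None
    return value, signal_number


def signal_name(signal_number):
    """Return the platform's name for a signal number, or "unknown"."""
    try:
        return signal.Signals(signal_number).name
    except (ValueError, TypeError):
        return "unknown"


value, sig = decode_core_status("8B")
print(value, sig, signal_name(sig))
```

On Linux, signal 11 is SIGSEGV, which matches the "Segmentation fault" line the core printed before exiting.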

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 5:17 am
by bruce
Please note. I've added the following note to the first post in this topic but will repeat it here for those of you who may not start on the first page:

This is the first time I've ever seen this many reports of this type, and FINALLY a pattern has emerged. It looks like there's a very high percentage of bad clones for Run 19, Gen 0 so I'm merging them and changing the topic title.

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 5:33 am
by Grandpa_01
Bruce, I have a feeling you may be right here, but just a little observation about the 7504 WUs: I had 2 of these in a row the other day, 7504 (8,45,6) and (8,45,7). The first completed with no problem, but the second failed 2 times at 80+% before I completed it. The rig they folded on was folding 6903 / 6904 stably and has not had a failure in a long time. I had to drop the OC down a notch to get it to complete. It did complete successfully on the 3rd attempt, though. I think these WUs may be a little on the tough side when it comes to stability and are worth keeping an eye on.

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 6:00 am
by bruce
All I'm saying is that there's a pattern that's emerging that covers a number of failures. I'm not making any assumptions about 7504 (8,45,6) and (8,45,7) or whether they might have a common cause or not. They don't fit the pattern I'm seeing but that proves nothing.

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 6:26 am
by Grandpa_01
bruce wrote:All I'm saying is that there's a pattern that's emerging that covers a number of failures. I'm not making any assumptions about 7504 (8,45,6) and (8,45,7) or whether they might have a common cause or not. They don't fit the pattern I'm seeing but that proves nothing.
I think the pattern you are seeing is a correct assumption, since they all fail at 0%. I would like to get one of these failed WUs to see what I could do with it. Got a magic wand you can wave to send me one?

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 6:37 am
by bruce
Nope. I don't.

I don't think even the PG has such a magic wand.

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 2:18 pm
by HayesK
My WU history shows 50 p7504 WUs completed and 8 more currently in progress, none of which are Run 19. All are running Ubuntu, client 6.34, core 2.27.

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 2:24 pm
by Just Brew It!
FYI I am seeing the same repeat failure (error 0x8b) on one of my systems with project 7504 this morning. Have tried deleting the queue but it keeps re-downloading the same WU. Will try changing the machine ID as recommended earlier in this thread...

Question: Do these failures count against the 80% successful return rate we need to maintain in order to get the SMP bonus points?

Edit: The failing WU is Run 19, Clone 52, Gen 0.

Edit #2: After changing the machine ID it downloaded a Run 16, Clone 134, Gen 7 (still Project 7504). This one looks like it is going to be OK (4% completed so far).

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 5:06 pm
by gwildperson
Just Brew It! wrote:Question: Do these failures count against the 80% successful return rate we need to maintain in order to get the SMP bonus points?
Of course. A failure is a failure, no matter what caused it. That's why it's 80% and not 99%.

I would reformulate the question slightly differently.
Question: Do repeated failures OF THE SAME UNIT all count against the 80% successful return rate we need to maintain in order to get the SMP bonus points or is each WU counted separately?

We used to think that a WU would be reassigned no more than three times if it failed repeatedly. After the same failure repeatedly happens on the same WU a large number of times, there are two problems that need solving. One is why do we need to change the MachineID to get rid of the problem? The second is how can repeated failures about which we can do nothing disable our bonus (if, in fact, they do)?

Can somebody who is represented on the QRB pass this on to their team representative? My team isn't represented.

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 5:16 pm
by Just Brew It!
gwildperson wrote:I would reformulate the question slightly differently.
Question: Do repeated failures OF THE SAME UNIT all count against the 80% successful return rate we need to maintain in order to get the SMP bonus points or is each WU counted separately?
Yes, that is actually what I meant.

A follow-on question would be: Is that 80% calculated based on recent history only (e.g. the last X WUs assigned, the past X days, or whatever), or is it calculated based on all WUs which have been assigned to that passkey since the beginning of time?

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 5:43 pm
by 7im
None of the above. Since the start of the QRB program was the original answer. However, after the program went mainstream (a few months after the start), the stats were reset to give everyone a fair shake. So, it is almost from the beginning, total WUs since then, IIRC.

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 5:59 pm
by Just Brew It!
7im wrote:None of the above. Since the start of the QRB program was the original answer. However, after the program went mainstream (a few months after the start), the stats were reset to give everyone a fair shake. So, it is almost from the beginning, total WUs since then, IIRC.
OK, that sounds fairly reasonable. It should prevent problem WUs like 7504 from causing people to fall below 80% unless it happens very soon after they start using the passkey.

I'm still curious to know whether each crash on the same WU counts as a separate failure though...

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 6:38 pm
by Tobit
A teammate is also having issues, I don't have any more details at this time:


[16:10:41] Project: 7504 (Run 19, Clone 134, Gen 0)
[16:10:41] 
[16:10:41] Entering M.D.
[16:10:47] Mapping NT from 8 to 8 
[16:10:48] Completed 0 out of 500000 steps  (0%)
[16:10:56] CoreStatus = C0000029 (-1073741783)
[16:10:56] Client-core communications error: ERROR 0xc0000029
[16:10:56] Deleting current work unit & continuing...
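
A note on the two numbers in the CoreStatus line above: the hex value C0000029 and the decimal -1073741783 are the same 32-bit pattern, read once as unsigned and once as signed two's complement. The sketch below shows just that conversion; the helper name is hypothetical and nothing here says what the code means to the core.

```python
def to_signed32(hex_status):
    """Interpret a hex string as a signed 32-bit two's-complement integer."""
    value = int(hex_status, 16)
    # Values with the top bit set (>= 2**31) wrap around to negative numbers.
    return value - (1 << 32) if value >= (1 << 31) else value


print(to_signed32("C0000029"))  # -1073741783, as in the log line above
print(to_signed32("8B"))        # 139, small values are unchanged
```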

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 6:40 pm
by 7im
Just Brew It! wrote: I'm still curious to know whether each crash on the same WU counts as a separate failure though...

One WU = one fail. If you keep getting the same WU, it's still just one fail.
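
To make the "one WU = one fail" rule concrete, here is a minimal sketch of how a return rate could be computed per unique WU rather than per crash. It is not the project's actual accounting. The function name and the attempt history are made up for illustration (the identifiers only reuse numbers from this thread), and treating a WU as successful if any attempt on it succeeded is an assumption.

```python
def return_rate(attempts):
    """Success rate over unique WUs.

    attempts: iterable of (wu_id, succeeded) pairs, one per core run.
    Repeated crashes on the same wu_id collapse into a single failure.
    """
    outcome = {}
    for wu_id, succeeded in attempts:
        # Assumption: a WU counts as returned if any attempt on it succeeded.
        outcome[wu_id] = outcome.get(wu_id, False) or succeeded
    if not outcome:
        return None
    return sum(outcome.values()) / len(outcome)


# Hypothetical history: one WU crashing 5 times, plus 4 WUs that completed.
attempts = (
    [((7504, 19, 52, 0), False)] * 5
    + [((7504, 16, 134, 7), True)]
    + [((7504, 1, 1, 1), True), ((7504, 2, 2, 2), True), ((7504, 3, 3, 3), True)]
)
print(return_rate(attempts))  # 0.8: 4 of 5 unique WUs succeeded
```

Counted per attempt, the same history would be 4 successes out of 9 runs, which is why the distinction matters for the 80% threshold.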

Re: Project: 7504 (Run 19, Clone *, Gen 0)

Posted: Fri Nov 11, 2011 6:47 pm
by Just Brew It!
Tobit wrote:A teammate is also having issues, I don't have any more details at this time:

...
Looks similar to what was happening to me this morning.

If it keeps happening tell him to delete the current WU, change the machine ID, and restart the client...