Memory leak @ 2416
Please help. This is what is running on my Linux machine:
...
[23:49:22] Folding@Home Gromacs Core
[23:49:22] Version 1.90 (March 8, 2006)
...
[23:49:24] Project: 2416 (Run 60, Clone 62, Gen 7)
...
and it's eating almost all my memory (as sys, not nice) and growing (~ 1M/sec) ...
Cpu(s): 0.0%us, 99.3%sy, 0.7%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 953948k total, 947952k used, 5996k free, 71904k buffers
Swap: 1270072k total, 1270028k used, 44k free, 101644k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3661 root 39 19 1788m 596m 752 R 99.6 64.1 36:11.59 FahCore_78.exe
3429 stanisla 15 0 14644 708 468 R 0.7 0.1 0:07.76 top
1 root 15 0 10316 264 236 S 0.0 0.0 0:00.30 init
I'm not sure what to do, or whether I have made a mistake somewhere ...
Thanks!
Stanislav
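For anyone who wants to put a number on the "~ 1M/sec" growth instead of watching top, here is a minimal sketch. It assumes a Linux system where /proc/<pid>/status exposes VmRSS and VmSize in kB; the function names are illustrative and not part of any FAH tool.

```python
import re


def parse_vm_kb(status_text, field="VmRSS"):
    """Return a Vm* field (in kB) from /proc/<pid>/status text, or None if absent."""
    match = re.search(rf"^{field}:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def growth_kb_per_sec(samples):
    """Average growth in kB/s between the first and last (seconds, kB) sample."""
    (t0, kb0), (t1, kb1) = samples[0], samples[-1]
    if t1 == t0:
        raise ValueError("samples must span a non-zero time interval")
    return (kb1 - kb0) / (t1 - t0)


def read_status(pid):
    """Read the raw status text for a process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return handle.read()
```

To use it, call read_status(3661) every few seconds (3661 being the FahCore_78.exe PID from the top output above), collect (time.time(), parse_vm_kb(text)) pairs, and pass them to growth_kb_per_sec.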
Make a backup of your FAH folder (including everything), just in case you need to send it to Stanford ... then delete the WU.
We'll wait for an answer from Stanford to see whether they need you to send the backup or whether they can have a look at that particular WU.
Edit: I sent a mail to Paula, who is in charge of this project ... let's wait for her answer.
I will take a look
Hey guys!
thank you for taking care of this.
I will take a look and come back to you asap.
pau
Re: I will take a look
Okay, so if you are still interested in it, so look at http://rapidshare.com/files/76186778/FA ... k.zip.html
Thanks!
Stanislav
Re: I will take a look
czonkin wrote:
Okay, so if you are still interested in it, so look at http://rapidshare.com/files/76186778/FA ... k.zip.html
Thanks!
Stanislav
This WU is broken. I hope the Pande Group can nail down at least one cause of the 0x79 error with this WU.
The FAH504-Linux.exe reports the following. My box has 2 GB of memory, and it managed to allocate 80% of it before erroring out:
[10:05:40] Project: 2416 (Run 60, Clone 62, Gen 7)
[10:05:40]
[10:05:40] Assembly optimizations on if available.
[10:05:40] Entering M.D.
Gromacs is Copyright (c) 1991-2003, University of Groningen, The Netherlands
This inclusion of Gromacs code in the Folding@Home Core is under
a special license (see http://folding.stanford.edu/gromacs.html)
specially granted to Stanford by the copyright holders. If you
are interested in using Gromacs, visit http://www.gromacs.org where
you can download a free version of Gromacs under
the terms of the GNU General Public License (GPL) as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.
[10:05:47] Protein: p2416_Ribosome_Na
[10:05:47]
[10:05:47] Writing local files
Fatal error: realloc for nlist->jjnr (1041039360 bytes, file ns.c, line 388, nlist->jjnr=0x0x74c48008): Cannot allocate memory
[10:07:01] Gromacs error.
[10:07:01]
[10:07:01] Folding@home Core Shutdown: UNKNOWN_ERROR
[10:07:02] CoreStatus = 79 (121)
[10:07:02] Client-core communications error: ERROR 0x79
[10:07:02] Deleting current work unit & continuing...
[10:07:19] - Preparing to get new work unit...
[10:07:19] + Attempting to get work packet
[10:07:19] - Connecting to assignment server
[10:07:20] - Successful: assigned to (171.65.103.162).
[10:07:20] + News From Folding@Home: Welcome to Folding@Home
With fah6, this time it managed to allocate 85% of memory before dying:
[10:11:36] Project: 2416 (Run 60, Clone 62, Gen 7)
[10:11:36]
[10:11:36] Assembly optimizations on if available.
[10:11:36] Entering M.D.
Gromacs is Copyright (c) 1991-2003, University of Groningen, The Netherlands
This inclusion of Gromacs code in the Folding@Home Core is under
a special license (see http://folding.stanford.edu/gromacs.html)
specially granted to Stanford by the copyright holders. If you
are interested in using Gromacs, visit http://www.gromacs.org where
you can download a free version of Gromacs under
the terms of the GNU General Public License (GPL) as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.
[10:11:43] Protein: p2416_Ribosome_Na
[10:11:43]
[10:11:43] Writing local files
Fatal error: realloc for nlist->jjnr (1055457280 bytes, file ns.c, line 388, nlist->jjnr=0x0x73e61008): Cannot allocate memory
[10:13:13] Gromacs error.
[10:13:13]
[10:13:13] Folding@home Core Shutdown: UNKNOWN_ERROR
[10:13:13] CoreStatus = 79 (121)
[10:13:13] Client-core communications error: ERROR 0x79
[10:13:13] Deleting current work unit & continuing...
[10:13:24] - Preparing to get new work unit...
[10:13:24] + Attempting to get work packet
[10:13:24] - Connecting to assignment server
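To compare failures like the two above across machines, the key facts can be pulled out of the log lines mechanically. The sketch below assumes the line formats shown in these logs; the regular expressions and function names are illustrative, not part of the FAH client. It also treats the two numbers in "CoreStatus = 79 (121)" as the same code in hex and decimal, which matches the "ERROR 0x79" line.

```python
import re

# Patterns assume the exact line shapes seen in the logs above.
FATAL_RE = re.compile(r"realloc for (\S+) \((\d+) bytes, file (\S+), line (\d+)")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def parse_realloc_failure(line):
    """Extract the failing buffer, request size, and source location, or None."""
    match = FATAL_RE.search(line)
    if not match:
        return None
    return {
        "field": match.group(1),
        "bytes": int(match.group(2)),
        "file": match.group(3),
        "line": int(match.group(4)),
    }


def parse_core_status(line):
    """Return (code parsed as hex, code parsed as decimal), or None."""
    match = STATUS_RE.search(line)
    if not match:
        return None
    return int(match.group(1), 16), int(match.group(2))
```

Applied to the first log, the failed request for nlist->jjnr is 1041039360 bytes, which is 992.8125 MiB in a single realloc. The second log asks for 1055457280 bytes, so the neighbor list was still growing when the core died.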
Thank you!
Ok. I am convinced
I will remove it for now. In the meantime, I am still running it...
Thank you everybody!
pau
Re: Thank you!
ppetrone wrote:
Ok. I am convinced
I will remove it for now. In the meantime, I am still running it...
Thank you everybody!
pau
I hope the action taken is not only pulling the WU, but also investigating why it behaves this way and implementing a fix for that issue in the FAH Core files.
Re: Thank you!
Yes, exactly. That is the only reason why I am running it.
The research will be to find out whether this is a specific WU problem (most likely) or a more general problem.
Thanks,
Pau
ppetrone wrote: The research will be to find out whether this is a specific WU problem (most likely) or a more general problem.
Even if you find that this specific WU has a problem, there is also a more general problem. The client deleted the WU rather than reporting the problem to the server. You (i.e., Stanford) should have some indication that this WU has failed 100 times (or some much smaller number) so you can decide whether it needs to be removed from circulation, without having us keep the statistics for you.
As I said before, I am currently working on this specific WU to understand whether the issue is general or not.
I am sorry if it seems as if "you" are keeping statistics for "us".
I understand Folding@Home as a big group of people (donors+scientists) doing statistics *together* to solve relevant biological questions. For that reason, I believe there is a tacit agreement of collaboration and patience.
Paula
ppetrone wrote: I am sorry if it seems as if "you" are keeping statistics for "us".
Sorry, I didn't mean that there was a big distinction between "you" and "us" but I can see how it sounded like that.
In fact, there are three types of things that are collaboratively working on FAH. Some things are best done by the Pande-group-type people. Some things are best done by the donor-type people. Some things are best done by software.
If several of the donor-type-people all encounter the same error, it's statistically unlikely that they'll find each other. Nevertheless, in this instance they did. Because of that, the donor-type-people were able to generate a request to find out what's going on with this case. (And a big thank you for accepting this responsibility)
Figuring out why the WU failed is best done by you, and that issue is (probably) important in more WUs than just that one, but even if it's unique, it's important.
If the FAH client is able to report this condition to the server, it's a statistical certainty that those various error reports can find each other and be examined by the Pande-group-type-people like yourself. I decided to call that a universal bug, though maybe it can be considered as an enhancement request. In any case, the reports to the server are a universal problem that is best done by improved software (no matter what is actually wrong with the WU).
As a donor-type-person, I'm also saying that it seems like we're wasting valuable resources (much more than "usually") repeating the same WUs with the same errors many, many times, and there ought to be a better way to find them and transfer them from our queue to your queue with less wasted resources.
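The server-side bookkeeping proposed above could be as small as a per-WU counter. The following is a hypothetical sketch only: it is not how the FAH servers work, and the class name, the reporting call, and the threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class FailureTally:
    """Hypothetical server-side tally of failed work units (not actual FAH code)."""

    def __init__(self, threshold=3):
        # Assumed cutoff; the post only asks for "some much smaller number" than 100.
        self.threshold = threshold
        self.counts = Counter()
        self.codes = defaultdict(set)

    def report(self, project, run, clone, gen, error_code):
        """Record one failure report; return True once the WU should be reviewed."""
        wu = (project, run, clone, gen)
        self.counts[wu] += 1
        self.codes[wu].add(error_code)
        return self.counts[wu] >= self.threshold

    def flagged(self):
        """Work units whose failure count has reached the threshold, sorted."""
        return sorted(wu for wu, n in self.counts.items() if n >= self.threshold)
```

With something like this, each client that hits 0x79 on Project 2416 (Run 60, Clone 62, Gen 7) would send one report, and the WU would show up in flagged() after a handful of failures without donors having to find each other on the forum.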