Folding Forum

Posted: **Wed Oct 29, 2008 1:05 am**

And this is understandable Mr. Pande. Thanks for your feedback

Posted: **Wed Oct 29, 2008 1:10 am**

Let's hope the QA process sees some vast improvements.
I'm having issues with 5 GPUs here.

Posted: **Wed Oct 29, 2008 1:11 am**

And make it a rule to always re-QA a project on the latest forced core before you distribute it. Record which core the project was QA'd on so you know whether you have to re-QA it before release.
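A minimal sketch of the bookkeeping suggested above, assuming each project carries a record of the core version it last passed QA on. The `QARecord` type, the helper names, and the version strings other than 1.15 are hypothetical illustrations, not anything taken from the actual F@H release tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QARecord:
    """Hypothetical record of the core version a project last passed QA on."""
    project: int
    qa_core: str  # dotted version string, e.g. "1.15"


def version_key(version: str) -> tuple[int, ...]:
    """Turn '1.15' into (1, 15) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def needs_reqa(record: QARecord, forced_core: str) -> bool:
    """A project needs re-QA if it was last tested on a core older than the forced one."""
    return version_key(record.qa_core) < version_key(forced_core)


def releasable(records: list[QARecord], forced_core: str) -> list[int]:
    """Return the project numbers whose QA is current for the forced core."""
    return [r.project for r in records if not needs_reqa(r, forced_core)]
```

Comparing parsed tuples rather than raw strings matters here: as text, "1.9" sorts after "1.15", which would wrongly mark an older core as current.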

Posted: **Wed Oct 29, 2008 1:36 am**

theo343 wrote:And make it a rule to always re-QA a project on the latest forced core before you distribute it. Record which core the project was QA'd on so you know whether you have to re-QA it before release.

Precisely... nothing revolutionary... if even just a couple of WUs had been run, this problem would have been evident and halted before it ever became a problem.

Posted: **Wed Oct 29, 2008 1:39 am**

VijayPande wrote:We keep an eye on the forum, but the first post was just a few hours ago. Due to staff having other responsibilities, our response will typically be on a time scale of hours, not minutes, for issues like this. I wish it could be faster, but that's what we're staffed to do at the moment.

Mr. Pande has the patience of a saint.

I do have this issue with two GPU (Nvidia) machines. After about 6 hours a project 5506 unit was finally sent out successfully. The UNSTABLE_MACHINE issue with the project 5801 unit persists. Any recommendations?

Posted: **Wed Oct 29, 2008 1:51 am**

I've got 15 Nvidia GPUs that I have to restart periodically to dump this WU. I'm ready for things to get back to normal (whatever that is).

Posted: **Wed Oct 29, 2008 2:15 am**

...

To V.P. aka Dr. Pande aka Vijay Pande

... much obliged sir, and God Bless.

Peace

VijayPande wrote:PS In case you're curious:

MoneyGuyBK wrote: I am surprised that:
1) F@H released this WU in such a bad state

This was beta tested before (this was a project # change due to a move onto a new server -- which was done to try to keep work around while the CS servers were down).

However, more stumped that:
2) F@H has not chimed in here officially after 7 Pages of comments

We keep an eye on the forum, but the first post was just a few hours ago. Due to staff having other responsibilities, our response will typically be on a time scale of hours, not minutes, for issues like this. I wish it could be faster, but that's what we're staffed to do at the moment.

Posted: **Wed Oct 29, 2008 2:28 am**

VijayPande wrote:PS In case you're curious:
MoneyGuyBK wrote: I am surprised that:
1) F@H released this WU in such a bad state
This was beta tested before (this was a project # change due to a move onto a new server -- which was done to try to keep work around while the CS servers were down).

Two words: regression testing
This makes it all the more shocking just how broken the nVidia core 1.15 is.

Posted: **Wed Oct 29, 2008 5:26 am**

Well, I come back home from work & see the p5801s have been pulled, switch on the machines, flush all the bad work units & we are up & running again.

I have NO server issues at the moment, all units have been returned safely to their servers & I see nothing but green ink in Fahspy.
Congratulations Vijay & co... I can rest easy for now that all my Linux SMP, Windows SMP, my standard clients, my ATI clients & especially my Nvidia clients are happy for now.

Cheers Teddy

Posted: **Wed Oct 29, 2008 9:31 am**

VijayPande wrote:Sorry about the really nasty problem on this one. It was definitely strange since these WU's were QA'd before. I think this may be an issue where they were QA'd on an earlier core and 1.15 is causing issues.

Well ... I think you missed at least one of the QA steps ...

p5800 was fully tested through the whole QA process ... but not the p5801

Posted: **Wed Oct 29, 2008 11:56 am**

The sad thing about this is that half of my GPU folders (3 of 7 cards in total) will be dead in the water for 24 hours or more, as I cannot reach them until tomorrow (too much chaos on the roads today, so I'm working from the home office).

Those 3 cards are also the most powerful. This P5801 thing was extremely bad timing for me, as I've been working my arse off with the clients the last couple of weeks to be competitive with a couple of guys on my team. I was just knifing and was ready to pass. I can now say goodbye to that, as my PPD statistic will plummet with only half my PPD for more than 24 hours, while the other guys have access to all their folding machines and have lost minimal PPD during these problems.

EDIT:
I also wonder how many Nvidia GPUs in total will lie dead in the water for 24 hours or more because of the P5801 distribution.

I truly hope the QA procedure will get some improvements after this blunder.

Posted: **Wed Oct 29, 2008 12:25 pm**

toTOW wrote:
VijayPande wrote:Sorry about the really nasty problem on this one. It was definitely strange since these WU's were QA'd before. I think this may be an issue where they were QA'd on an earlier core and 1.15 is causing issues.
Well ... I think you missed at least one of the QA steps ...

p5800 was fully tested through the whole QA process ... but not the p5801

5801 was just a copy of another project, which did go all the way through QA. Nevertheless, I will have a talk with the responsible parties about this.

Posted: **Wed Oct 29, 2008 12:28 pm**

shatteredsilicon wrote: Two words: regression testing
This makes it all the more shocking just how broken the nVidia core 1.15 is.

1.15 passed all of the regression testing on machines at Stanford and NVIDIA and then passed FAH beta testing. There's not much more we can do than that before releasing it. Keep in mind that we now know that for many people (some boards), 1.15 is perfectly fine and stable, whereas for others, it doesn't work at all. If that's the case, my guess is that this is a CUDA or hardware issue. If the code in 1.15 were really broken, it would not work on any hardware, which is definitely not the case. We're working with NVIDIA on this one. The first step is to get the problem reproducible in their labs.

The bottom line here is that it is becoming clear that what works on some CUDA hardware platforms does not universally work on all. We have since gotten a few of the boards that cause problems and have included them in our recent testing.
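The reasoning above (a core that fails on every board points at the code, while one that fails only on some boards points at a driver or hardware interaction) can be sketched as a simple tally of test results per board. This is only an illustration of that heuristic, not the team's actual test harness; the function names and the board labels in the usage note are made up for the example.

```python
from collections import defaultdict


def failure_rate_by_board(results):
    """Compute the fraction of failed runs per board.

    `results` is an iterable of (board, passed) pairs, one per test run.
    """
    counts = defaultdict(lambda: [0, 0])  # board -> [failures, total runs]
    for board, passed in results:
        counts[board][1] += 1
        if not passed:
            counts[board][0] += 1
    return {board: fails / total for board, (fails, total) in counts.items()}


def classify(rates):
    """Apply the heuristic from the post to per-board failure rates."""
    if all(rate == 0 for rate in rates.values()):
        return "no failures observed"
    if all(rate > 0 for rate in rates.values()):
        return "fails on every board: suspect the core code"
    return "fails on some boards only: suspect a driver or hardware interaction"
```

For example, feeding in runs from two placeholder boards, `[("board_A", True), ("board_B", False)]`, falls into the last category. The heuristic narrows the search but does not prove where the fault lies, which is why the problem boards still have to be included in testing.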

Posted: **Wed Oct 29, 2008 12:35 pm**

VijayPande wrote:The bottom line here is that it is becoming clear that what works on some CUDA hardware platforms does not universally work on all. We have since gotten a few of the boards that cause problems and have included them in our recent testing.

So does this mean CUDA isn't compatible with all the hardware that is supposed to be compatible with it, or does it mean that the clients' implementation of CUDA isn't compatible with all hardware? Or is it too soon to tell? I would hope it's the last option, as in the first case I'm afraid you don't have the same expedience in getting it sorted.

Posted: **Wed Oct 29, 2008 12:54 pm**

MtM wrote:
VijayPande wrote:The bottom line here is that it is becoming clear that what works on some CUDA hardware platforms does not universally work on all. We have since gotten a few of the boards that cause problems and have included them in our recent testing.
So does this mean CUDA isn't compatible with all the hardware that is supposed to be compatible with it, or does it mean that the clients' implementation of CUDA isn't compatible with all hardware? Or is it too soon to tell? I would hope it's the last option, as in the first case I'm afraid you don't have the same expedience in getting it sorted.

Technically, if the same code works on certain cards but not on others, we can look at the driver or hardware level. However, the core is partly responsible for this as well, so finding out what is wrong is a two-sided job (NVIDIA with the CUDA code and PG with the core). This is what makes debugging this issue very hard.

Think of a car engine choking under load. There can be multiple causes: fuel quality, air quality, timing adjustment, ECU programming, a mechanical problem, or something else, so it takes a lot of diagnostics to find out what went wrong.

Project 5801 issues. [Should be Offline]
