UNSTABLE_MACHINE resets GPU at ~1.51%

Post by **CaulfieldCap** » Mon Jun 17, 2013 7:39 pm

I'm very new to Folding@Home, as I've just created my account and installed everything. I apologize if this isn't an exceptional question, but it is one that I can't seem to figure out.

So right now, my GPU (nVidia GeForce 660Ti) is running PRCG 8074 (42, 27, 69). Every time it reaches 1.51 or 1.52 percent complete, it resets back to 0% complete, reaches 1.51 or 1.52 percent complete again, and repeats. How can I fix this, and is there any information anyone would need to help diagnose my problem? Thank you very much!

-CC

Post by **bollix47** » Mon Jun 17, 2013 7:50 pm

Welcome to the Folding@home support forum, CaulfieldCap.

If you could please supply the log as described here we will try to help.

Post by **CaulfieldCap** » Mon Jun 17, 2013 7:53 pm

Sure, here's what shows up in the console as it resets.

As you can see, after it passes 2%, something returns 52 and UNSTABLE_MACHINE comes up...

Then it appears to restart.

Code:

19:48:03:WU02:FS00:0x15:Setting checkpoint frequency: 500000
19:48:03:WU02:FS00:0x15:Completed         3 out of 50000000 steps (0%).
19:48:36:WARNING:Exception: 8:127.0.0.1: Send error: 10053: An established connection was aborted by the software in your host machine.
19:49:22:WU00:FS01:0xa3:Completed 15000 out of 500000 steps  (3%)
19:49:39:12:127.0.0.1:New Web connection
19:50:03:WU02:FS00:0x15:Completed    500000 out of 50000000 steps (1%).
19:52:02:WU02:FS00:0x15:Completed   1000000 out of 50000000 steps (2%).
19:52:02:WU02:FS00:0x15:mdrun_gpu returned 52
19:52:02:WU02:FS00:0x15:NANs detected on GPU
19:52:02:WU02:FS00:0x15:
19:52:02:WU02:FS00:0x15:Folding@home Core Shutdown: UNSTABLE_MACHINE
19:52:02:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
19:52:02:WARNING:WU02:FS00:Too many errors, failing
19:52:02:WU02:FS00:Sending unit results: id:02 state:SEND error:FAILED project:8074 run:42 clone:27 gen:69 core:0x15 unit:0x0000004f6652edb450b430db82969efb
19:52:02:WU02:FS00:Connecting to 171.67.108.36:8080
19:52:02:WU01:FS00:Connecting to assign-GPU.stanford.edu:80
19:52:03:WU02:FS00:Server responded WORK_QUIT (404)
19:52:03:WARNING:WU02:FS00:Server did not like results, dumping
19:52:03:WU02:FS00:Cleaning up
19:52:03:WU01:FS00:News: Welcome to Folding@Home
19:52:03:WU01:FS00:Assigned to work server 171.67.108.36
19:52:03:WU01:FS00:Requesting new work unit for slot 00: READY gpu:0:GK104 [GeForce GTX 660 Ti] from 171.67.108.36
19:52:03:WU01:FS00:Connecting to 171.67.108.36:8080
19:52:04:WU01:FS00:Downloading 59.59KiB
19:52:04:WU01:FS00:Download complete
19:52:04:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8074 run:53 clone:27 gen:54 core:0x15 unit:0x0000003a6652edb450b431192c7ad41a
19:52:04:WU01:FS00:Starting
19:52:04:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" "C:/Users/Ian Zane/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_15.fah/FahCore_15.exe" -dir 01 -suffix 01 -version 703 -lifeline 4148 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
19:52:04:WU01:FS00:Started FahCore on PID 6400
19:52:04:WU01:FS00:Core PID:6696
19:52:04:WU01:FS00:FahCore 0x15 started
19:52:05:WU01:FS00:0x15:

Post by **bollix47** » Mon Jun 17, 2013 8:07 pm

As long as your next work unit is progressing properly you should be okay. You might want to check your temperatures to ensure they're not too high. If you have this type of error every time on your 660ti you could also run a memory test.

There are no returns in the database for project:8074 run:42 clone:27 gen:69 at this time. I will mark it for followup.

Status 7A

Post by **CaulfieldCap** » Mon Jun 17, 2013 8:16 pm

My GPU is actually running relatively cool, at only 65 C (in game it runs at ~84 C). What do you mean by my next work unit? I'm sorry, I'm very new. Will I ever reach my next work unit if I don't complete this one?

Also, after doing a bit more research, I found that the issue may have come up due to an unstable overclock on my GPU. I've un-overclocked it now, and it seems to be working better (currently at 4.09%). I'll post later in this thread if it remains stable.

Post by **bollix47** » Mon Jun 17, 2013 8:25 pm

CaulfieldCap wrote: "What do you mean by my next work unit? I'm sorry, I'm very new. Will I ever reach my next work unit if I don't complete this one?"

You received a different work unit after the one you reported (project:8074 run:42 clone:27 gen:69) was dumped: project:8074 run:53 clone:27 gen:54.

Sometimes the same work unit (same PRCG numbers) is sent back to you for processing if it failed. That can happen a number of times (I think up to 5) before the server realizes it's not going to get this work unit back from you, so it moves on to a different work unit, which is what you're working on now. Since that one is working better, the failed one could very well have been a bad work unit. Or it could have been a bad overclock. We'll know more if and when the work unit is returned by someone else.
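
For anyone who wants to check from the log whether the client moved on to a different work unit, comparing the PRCG fields of the failed and the newly received unit is enough. The snippet below is only a rough sketch: the helper name `extract_prcg` is made up for illustration, and the two sample lines are shortened copies of the log lines posted above (the trailing `unit:` field is dropped).

```python
import re

# Matches the "project:N run:N clone:N gen:N" fields as they appear in the client log.
PRCG_RE = re.compile(r"project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)")

def extract_prcg(line):
    """Return (project, run, clone, gen) as ints, or None if the line has no PRCG."""
    match = PRCG_RE.search(line)
    return tuple(int(g) for g in match.groups()) if match else None

# Shortened copies of the two log lines from the post above.
failed = extract_prcg(
    "19:52:02:WU02:FS00:Sending unit results: id:02 state:SEND error:FAILED "
    "project:8074 run:42 clone:27 gen:69 core:0x15"
)
received = extract_prcg(
    "19:52:04:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR "
    "project:8074 run:53 clone:27 gen:54 core:0x15"
)

# Different tuples mean the server handed out a different work unit.
print(failed, received, failed == received)
```

If the two tuples were equal, the same work unit would have been reassigned rather than replaced.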

Post by **bruce** » Mon Jun 17, 2013 8:59 pm

FAH is known to "push" GPUs harder than many other applications, which has often meant that unstable overclocks are "discovered" by FAH.

Post by **bollix47** » Tue Jun 18, 2013 11:37 am

The work unit was completed by another folder:

Hi ***** (team *****),
Your WU (P8074 R42 C27 G69) was added to the stats database on 2013-06-17 21:00:09 for 3874 points of credit.

Looks more likely that the overclock may have caused the problem.

Followup report closed.

Post by **ntsarb** » Fri Sep 13, 2013 6:42 pm

I think I'm dealing with the same problem. Any help would be appreciated.

The machine is based on i7 860 (stock clock), 2 x MSI GTX 660 Twin Frozr (stock clock), 4 x 4 GB DDR3 1333MHz RAM

Here's the Log:

Code:

05:21:12:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
05:45:04:WARNING:WU01:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
05:45:06:WARNING:WU03:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:80': Empty work server assignment
05:45:06:WARNING:WU03:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:8080': Empty work server assignment
05:45:06:ERROR:WU03:FS01:Exception: Could not get an assignment
05:45:07:WARNING:WU03:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:80': Empty work server assignment
05:45:08:WARNING:WU03:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:8080': Empty work server assignment
05:45:08:ERROR:WU03:FS01:Exception: Could not get an assignment
05:46:07:WARNING:WU03:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:80': Empty work server assignment
05:46:08:WARNING:WU03:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:8080': Empty work server assignment
05:46:08:ERROR:WU03:FS01:Exception: Could not get an assignment
08:14:03:WARNING:WU02:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
08:14:03:WARNING:WU03:FS01:Detected clock skew (1 mins 06 secs), adjusting time estimates
******************************* Date: 2013-09-12 *******************************
10:33:25:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:34:31:WARNING:WU02:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
11:03:29:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
11:04:35:WARNING:WU02:FS00:Detected clock skew (1 mins 05 secs), adjusting time estimates
14:10:23:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
14:11:29:WARNING:WU02:FS00:Detected clock skew (1 mins 05 secs), adjusting time estimates
******************************* Date: 2013-09-12 *******************************
19:09:20:WARNING:WU01:FS01:Detected clock skew (1 mins 05 secs), adjusting time estimates
19:09:21:WARNING:WU02:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
******************************* Date: 2013-09-13 *******************************
02:36:57:WARNING:WU01:FS01:Detected clock skew (1 mins 08 secs), adjusting time estimates
02:36:58:WARNING:WU02:FS00:Detected clock skew (1 mins 09 secs), adjusting time estimates
04:09:03:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
04:10:11:WARNING:WU02:FS00:Detected clock skew (1 mins 08 secs), adjusting time estimates
04:26:53:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
04:28:00:WARNING:WU02:FS00:Detected clock skew (1 mins 07 secs), adjusting time estimates
06:19:36:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
06:20:43:WARNING:WU02:FS00:Detected clock skew (1 mins 07 secs), adjusting time estimates
06:24:08:WARNING:WU03:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
06:25:13:WARNING:WU03:FS01:Detected clock skew (1 mins 05 secs), adjusting time estimates
06:51:24:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
06:52:30:WARNING:WU02:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
08:31:42:WARNING:WU03:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
08:32:48:WARNING:WU03:FS01:Detected clock skew (1 mins 05 secs), adjusting time estimates
******************************* Date: 2013-09-13 *******************************
09:20:13:WARNING:WU03:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:07:15:WARNING:WU03:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:08:21:WARNING:WU03:FS01:Detected clock skew (1 mins 05 secs), adjusting time estimates
10:12:34:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:12:34:WARNING:WU02:FS00:Too many errors, failing
10:12:35:WARNING:WU02:FS00:Server did not like results, dumping
10:14:49:WARNING:WU03:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:14:49:WARNING:WU03:FS01:Too many errors, failing
10:14:50:WARNING:WU03:FS01:Server did not like results, dumping
10:16:38:WARNING:WU01:FS00:FahCore has not changed since last download, aborting core update
10:32:30:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:33:37:WARNING:WU01:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
10:48:30:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:49:37:WARNING:WU01:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
11:03:27:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
11:04:33:WARNING:WU01:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
11:11:57:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
11:13:02:WARNING:WU01:FS00:Detected clock skew (1 mins 04 secs), adjusting time estimates
11:20:38:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
11:20:38:WARNING:WU01:FS00:Too many errors, failing
11:20:39:WARNING:WU01:FS00:Server did not like results, dumping
******************************* Date: 2013-09-13 *******************************

Mod edit: Added Code tags to log

Post by **P5-133XL** » Fri Sep 13, 2013 8:01 pm

Not even close to enough information to diagnose. Please include the entire log, including the system and config portions and everything from where you got the WUs to after they failed.

Post by **Joe_H** » Fri Sep 13, 2013 8:02 pm

You provided too small a section of your log to tell much of anything from. Please provide the beginning section of the log which contains version and system information as well as the system configuration so we can tell what folding slots correspond to what. The Welcome to the Forum posts at the top of this sub-forum give useful information on how to post and find log information.

Also post where the error messages start; the posted messages are somewhat after the fact.

Post by **bruce** » Fri Sep 13, 2013 9:11 pm

The default installation for any system that folds with a mixture of ATI and NVidia GPUs is likely to be misconfigured. Providing the requested log information will certainly be helpful. I recommend you start by removing either type of GPU, leaving a single type. That will change the configuration and there's a (slim) chance it will start folding. If not, try deleting all of the GPU slots and let the system rebuild them. With more information, the next steps will become clear.

Post by **7im** » Fri Sep 13, 2013 9:32 pm

05:21:12:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
05:45:04:WARNING:WU01:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)

Very unusual to have two failures within minutes of each other, unless there was a driver crash, or the system was overheating. This points more towards a systemic issue than a work unit or client issue, otherwise only one would have failed, not both, IMO. Having bad weather? Power brownout from lightning or too much AC? System on a UPS? Gaming while folding? Lots of things can contribute to a problem like this.
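
To put a number on how close together failures on different slots are, the timestamps can be subtracted directly. This is a minimal sketch, not part of the client: the function names are invented for illustration, and because the log lines carry no date, it assumes both events fall on the same day.

```python
import re
from datetime import datetime

# Captures the timestamp and the folding slot of each UNSTABLE_MACHINE warning.
FAIL_RE = re.compile(
    r"^(\d\d:\d\d:\d\d):WARNING:WU\d+:(FS\d+):FahCore returned: UNSTABLE_MACHINE"
)

def failures(lines):
    """Yield (time, slot) for every UNSTABLE_MACHINE warning line."""
    for line in lines:
        match = FAIL_RE.match(line)
        if match:
            yield datetime.strptime(match.group(1), "%H:%M:%S"), match.group(2)

def cross_slot_gaps(lines):
    """Seconds between consecutive failures that happened on different slots."""
    events = list(failures(lines))
    return [
        (slot_a, slot_b, int((time_b - time_a).total_seconds()))
        for (time_a, slot_a), (time_b, slot_b) in zip(events, events[1:])
        if slot_a != slot_b
    ]

log = [
    "05:21:12:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)",
    "05:45:04:WARNING:WU01:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)",
]
gaps = cross_slot_gaps(log)
print(gaps)  # [('FS00', 'FS01', 1432)], i.e. 23 min 52 s apart
```

Short gaps across slots suggest a cause shared by the whole machine; a single slot failing on its own points more toward that slot's hardware or work unit.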

Like they said above, please post more info.

Post by **ntsarb** » Sat Sep 14, 2013 10:22 am

Many thanks for your responses. I have now run Windows Memory Diagnostics utility (extended run) and did not find any issues.

The previous log was automatically deleted. I let FaH run overnight and found lots of errors in the morning, but the log is too long to fit into a forum message (over 60000 characters). Is there an alternative way to share it? Here's the configuration part:

Code:

*********************** Log Started 2013-09-13T22:31:51Z ***********************
22:31:51:************************* Folding@home Client *************************
22:31:51:      Website: http://folding.stanford.edu/
22:31:51:    Copyright: (c) 2009-2013 Stanford University
22:31:51:       Author: Joseph Coffland <[email protected]>
22:31:51:         Args: --open-web-control
22:31:51:       Config: C:/Users/Nikos/AppData/Roaming/FAHClient/config.xml
22:31:51:******************************** Build ********************************
22:31:51:      Version: 7.3.6
22:31:51:         Date: Feb 18 2013
22:31:51:         Time: 15:25:17
22:31:51:      SVN Rev: 3923
22:31:51:       Branch: fah/trunk/client
22:31:51:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
22:31:51:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
22:31:51:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
22:31:51:     Platform: win32 XP
22:31:51:         Bits: 32
22:31:51:         Mode: Release
22:31:51:******************************* System ********************************
22:31:51:          CPU: Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
22:31:51:       CPU ID: GenuineIntel Family 6 Model 30 Stepping 5
22:31:51:         CPUs: 8
22:31:51:       Memory: 16.00GiB
22:31:51:  Free Memory: 13.16GiB
22:31:51:      Threads: WINDOWS_THREADS
22:31:51:  Has Battery: false
22:31:51:   On Battery: false
22:31:51:   UTC offset: 1
22:31:51:          PID: 7908
22:31:51:          CWD: C:/Users/Nikos/AppData/Roaming/FAHClient
22:31:51:           OS: Windows 7 Home Premium
22:31:51:      OS Arch: AMD64
22:31:51:         GPUs: 2
22:31:51:        GPU 0: NVIDIA:3 GK106 [GeForce GTX 660]
22:31:51:        GPU 1: NVIDIA:3 GK106 [GeForce GTX 660]
22:31:51:         CUDA: 3.0
22:31:51:  CUDA Driver: 5050
22:31:51:Win32 Service: false
22:31:51:***********************************************************************

Post by **ntsarb** » Sat Sep 14, 2013 10:36 am

By filtering the Log entries per "Slot", I found that only the first GPU is exhibiting the problem. Notably, the second GPU and the CPU did not give any errors.
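
Filtering per slot can also be done with a few lines of script instead of by eye. The following is a sketch under assumptions: `unstable_per_slot` is a hypothetical helper, and the sample lines are copied from the earlier excerpt in this thread (the overnight log was never posted), so the counts only demonstrate the method.

```python
import re
from collections import Counter

# Captures the folding slot (FS00, FS01, ...) on lines where the core reports the failure.
FAIL_RE = re.compile(r":(FS\d+):FahCore returned: UNSTABLE_MACHINE")

def unstable_per_slot(lines):
    """Count 'FahCore returned: UNSTABLE_MACHINE' lines for each folding slot."""
    counts = Counter()
    for line in lines:
        match = FAIL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample lines taken from the earlier log excerpt, not from the overnight run.
sample = [
    "04:09:03:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)",
    "04:10:11:WARNING:WU02:FS00:Detected clock skew (1 mins 08 secs), adjusting time estimates",
    "06:24:08:WARNING:WU03:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)",
    "06:51:24:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)",
]
counts = unstable_per_slot(sample)
print(dict(counts))  # {'FS00': 2, 'FS01': 1}
```

Which physical GPU sits behind FS00 or FS01 depends on the slot configuration, so the counts still have to be mapped to hardware by hand.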

I ran FaH using the following configurations (one GPU at a time):

1) 1st GPU card on 1st PCI-E slot, 2nd PCI-E slot empty
Produces errors within 2-3 minutes.

2) 1st GPU card on 2nd PCI-E slot, 1st PCI-E slot empty
No errors.

3) 2nd GPU card on 1st PCI-E slot, 2nd PCI-E slot empty
Produces errors within 2-3 minutes.

4) 2nd GPU card on 2nd PCI-E slot, 1st PCI-E slot empty
No errors.

Looks like the GPU cards are good and the 1st (blue) PCI-E slot on the ASUS P7P55D-E EVO motherboard is problematic.
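
The four runs above form a 2x2 test matrix (card x PCI-E slot), and the reasoning behind the conclusion can be written out as a small check. This is just an illustrative sketch: the labels `card1`, `card2`, `slot1`, `slot2` and the helper `always_failing` are invented here, and run 4 is assumed to have used the second card.

```python
# (card, pcie_slot) -> True if errors appeared in that run.
# Run 4 is assumed to be the 2nd card in the 2nd slot.
results = {
    ("card1", "slot1"): True,
    ("card1", "slot2"): False,
    ("card2", "slot1"): True,
    ("card2", "slot2"): False,
}

def always_failing(results, index):
    """Return the values of one factor (0 = card, 1 = slot) whose runs all failed."""
    values = {key[index] for key in results}
    return sorted(
        value
        for value in values
        if all(failed for key, failed in results.items() if key[index] == value)
    )

print("cards failing in every run:", always_failing(results, 0))  # []
print("slots failing in every run:", always_failing(results, 1))  # ['slot1']
```

No card fails everywhere, while one slot fails with both cards, which is what singles out the slot rather than either GPU.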

Many thanks to 7im (who helped me focus on system issues) and all users who kindly offered to help.
