Tesla P4 failing

It seems that a lot of GPU problems revolve around specific versions of drivers. Though NVidia has their own support structure, you can often learn from information reported by others who fold.

Moderators: Site Moderators, FAHC Science Team

TheDevil
Posts: 22
Joined: Mon Aug 15, 2022 11:21 am

Tesla P4 failing

Post by TheDevil »

I let this sit for 3 hours, and this is all I get on this device. I've tried more than a few driver sets. Any advice? I also tried in Linux, and it would not enable.

20:59:14:WU00:FS05:0x22: Version: 7.7.0
20:59:14:WU00:FS05:0x22:********************************************************************************
20:59:14:WU00:FS05:0x22:Project: 17918 (Run 929, Clone 0, Gen 58)
20:59:14:WU00:FS05:0x22:Reading tar file core.xml
20:59:15:WU00:FS05:0x22:Reading tar file integrator.xml
20:59:15:WU00:FS05:0x22:Reading tar file state.xml
20:59:15:WU00:FS05:0x22:Reading tar file system.xml
20:59:18:WU00:FS05:0x22:Digital signatures verified
20:59:18:WU00:FS05:0x22:Folding@home GPU Core22 Folding@home Core
20:59:18:WU00:FS05:0x22:Version 0.0.20
20:59:18:WU00:FS05:0x22: Checkpoint write interval: 50000 steps (5%) [20 total]
20:59:18:WU00:FS05:0x22: JSON viewer frame write interval: 10000 steps (1%) [100 total]
20:59:18:WU00:FS05:0x22: XTC frame write interval: 25000 steps (2.5%) [40 total]
20:59:18:WU00:FS05:0x22: Global context and integrator variables write interval: disabled
20:59:18:WU00:FS05:0x22:There are 4 platforms available.
20:59:18:WU00:FS05:0x22:Platform 0: Reference
20:59:18:WU00:FS05:0x22:Platform 1: CPU
20:59:18:WU00:FS05:0x22:Platform 2: OpenCL
20:59:18:WU00:FS05:0x22: opencl-device 0 specified
20:59:18:WU00:FS05:0x22:Platform 3: CUDA
20:59:18:WU00:FS05:0x22: cuda-device 0 specified
21:00:25:WU00:FS05:0x22:Attempting to create CUDA context:
21:00:25:WU00:FS05:0x22: Configuring platform CUDA
21:00:42:WU00:FS05:0x22:ERROR:Discrepancy: Forces are blowing up! 683 0
21:00:42:WU00:FS05:0x22:Saving result file ..\logfile_01.txt
21:00:42:WU00:FS05:0x22:Saving result file science.log
21:00:42:WU00:FS05:0x22:Saving result file state.xml
21:00:47:WU00:FS05:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT


*********************** Log Started 2022-08-19T19:12:18Z ***********************
20:05:32:WU00:FS05:0x22:WARNING:Console control signal 1 on PID 9796
20:06:33:WARNING:FS05:Killing WU00
20:59:08:WU00:FS05:0x22:ERROR:exception: Error loading CUDA module: CUDA_ERROR_ILLEGAL_ADDRESS (700)
20:59:13:WARNING:WU00:FS05:FahCore returned an unknown error code which probably indicates that it crashed
20:59:13:WARNING:WU00:FS05:FahCore returned: UNKNOWN_ENUM (-1073740791 = 0xc0000409)
21:00:42:WU00:FS05:0x22:ERROR:Discrepancy: Forces are blowing up! 683 0
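
For anyone skimming a long log like this one, a minimal Python sketch along these lines pulls out only the WARNING and ERROR lines and converts the signed exit code into its unsigned 32-bit hex form. The function names and the regular expression are illustrative assumptions inferred from the log lines above; they are not part of the FAH client.

```python
import re

# Assumed line shape, inferred from the excerpt above:
#   HH:MM:SS:[optional WU/FS/core prefixes:]WARNING|ERROR:message
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}):(?:.*?:)?(WARNING|ERROR):(.*)$")


def filter_problems(log_text):
    """Return (time, level, message) tuples for WARNING and ERROR lines."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            hits.append((match.group(1), match.group(2), match.group(3).strip()))
    return hits


def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF
```

Here `hex(to_unsigned32(-1073740791))` gives `0xc0000409`, matching the value the client prints. On Windows that status code is documented as STATUS_STACK_BUFFER_OVERRUN, which is consistent with the core crashing rather than exiting cleanly.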
toTOW
Site Moderator
Posts: 6359
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Tesla P4 failing

Post by toTOW »

Did you test the GPU with other applications? I don't know if OCCT would work on a Tesla card...

Did you check your system RAM for errors with Memtest86+?

Are temperatures and voltages fine on the GPU and the CPU?

Which drivers did you use?
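
If it helps to capture temperature and power while a work unit is running, here is a rough Python sketch that reads them through `nvidia-smi`. It assumes `nvidia-smi` is on the PATH and that the driver reports these query fields for the card; the function names are made up for illustration, and unsupported fields come back as `None`.

```python
import subprocess

# Query fields passed to nvidia-smi; assumed to be supported by the driver.
QUERY = "temperature.gpu,power.draw,power.limit,utilization.gpu"


def _to_float(field):
    """Convert a CSV field to float; return None for '[N/A]'-style values."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_smi_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        temp, draw, limit, util = [_to_float(f.strip()) for f in line.split(",")]
        pct = None
        if draw is not None and limit:
            pct = draw / limit * 100.0  # share of the power limit in use
        rows.append({"temp_c": temp, "draw_w": draw, "limit_w": limit,
                     "util_pct": util, "power_pct": pct})
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_smi_csv(result.stdout)
```

Calling `read_gpus()` every few seconds while the slot is folding shows whether temperature or power draw changes right before the core fails.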

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
TheDevil
Posts: 22
Joined: Mon Aug 15, 2022 11:21 am

Re: Tesla P4 failing

Post by TheDevil »

toTOW wrote: Sat Aug 20, 2022 1:37 pm Did you test the GPU with other applications? I don't know if OCCT would work on a Tesla card...

Did you check your system RAM for errors with Memtest86+?

Are temperatures and voltages fine on the GPU and the CPU?

Which drivers did you use?
Currently 516.94 with CUDA 11.7
Tried:
412.36 - CUDA 10.0
453.64 - CUDA 11.0

Idle temp is 43°C, hotspot is 53°C.
When I send a job to it, temp is 58°C, hotspot 70°C.

Idle draw is 1 W. With a task assigned to it, it sits at 25 W with 0% load, pulling only 30% of TDP.

System RAM: HPE 752369-081 16GB 2Rx4 DDR4 2133MHz PC4-17000 ECC x8 (128GB). The server would know if any of this RAM was bad, IIRC.

FYI, this is a server, so there is no OC or silliness about stability as far as I know. The device is an HP DL360 G9.

The device also worked in ESXi and was able to do graphics duty on my VMs.

GPU-Z validated it - https://www.techpowerup.com/gpuz/details/g3uau

And then, JUST for giggles, I ran it on an ETH miner and got 16-17 MH/s; it ran fine for 45 minutes, purely as a test.
JimboPalmer
Posts: 2522
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: Tesla P4 failing

Post by JimboPalmer »

One idea: are you getting the drivers directly from Nvidia?

https://www.nvidia.com/Download/driverR ... 588/en-us/
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends