Re: Stats not updating
Posted: Fri Apr 18, 2014 8:00 am
by -alias-
7im wrote:
-alias- wrote:As usual, nobody reacts with PG / Stanford to fix the problem.
Except they just fixed the OS STATS page reporting, so that blows your "as usual" theory completely out of the water and into completely unrecognizable dust.
I think that you've missed my point here! The fact is that the stats server was down for about 24 hours, and PG did not discover that the server was out of service before you notified them with a ping, and even then it took several hours before there was any reaction. I am an amateur with 6 servers spread across three buildings, but if an error occurs on one of them I get automatically notified by HFM.NET via e-mail to my smartphone. I can then log on to the server remotely and fix the problem promptly. PG is the professional party in this system, but it looks like they do not have any sort of automatic notification if something goes wrong on their side.
Re: Stats not updating
Posted: Fri Apr 18, 2014 1:51 pm
by drougnor
What you need to remember, -alias-, is that Stanford is a Research University, therefore their IT is focused on monitoring and maintaining the systems that are CORE to that research.
That core focus, unfortunately for us, DOESN'T include our outbound stats server.
So, we monitor the stats and report, and WHEN the appropriate IT folks can, they bring them back up. But we have to be patient and deal with the fact that a small IT team is already spread far too thin. Just like in EVERY professional IT setting.
my $.02. do with it what you will.
Re: Stats not updating
Posted: Fri Apr 18, 2014 2:02 pm
by 7im
No offense, -alias-, but you have no way of knowing what notifications PG did or did not receive about the stats server. My ping post could very well have been redundant.
Not at all related to your position, but it could be said that the Ping post is as much to placate the perpetual complainers as any other function.
Re: Stats not updating
Posted: Fri Apr 18, 2014 2:21 pm
by kasi
OK, have chosen to do another task. Not sure if that is advisable with current tasks that take more than a day to complete, but I'll give it a go and see what happens. If that one goes missing as well I'll stop, from years of distributed computing know better than to continue to send results to servers that are overloaded or malfunctioning.
Re: Stats not updating
Posted: Fri Apr 18, 2014 2:24 pm
by kyleb
I'm looking into this. I definitely have received the results from 13001 R6 C0 G7, so I'm trying to figure out what went wrong here.
Re: Stats not updating
Posted: Fri Apr 18, 2014 3:53 pm
by davidcoton
kasi wrote:OK, have chosen to do another task. Not sure if that is advisable with current tasks that take more than a day to complete, but I'll give it a go and see what happens. If that one goes missing as well I'll stop, from years of distributed computing know better than to continue to send results to servers that are overloaded or malfunctioning.
I don't think there is a problem with the work being received.
Joe_H wrote:
... one of the stats logs may have not been processed into the database.
That implies that everything was received correctly, but the points were not transferred when the stats server came back. It does not imply an overloaded Collection Server, though a separate(?) problem with another server recently DID affect the return of some WUs. 90%+ confidence that the missing log file can be found and transferred, so the credit should turn up eventually.
David
Re: Stats not updating
Posted: Fri Apr 18, 2014 4:21 pm
by Sunny
Points not yet updated for following 13000 Project (completed yesterday morning Apr 17):
18:17:20:WU00:FS01:0x17:Project: 13000 (Run 838, Clone 0, Gen 6)
18:17:20:WU00:FS01:0x17:Unit: 0x00000011538b3db753108822086b93a1
Code:
11:35:23:WU00:FS01:0x17:Completed 5000000 out of 5000000 steps (100%)
11:35:23:WU01:FS01:Connecting to assign-GPU.stanford.edu:80
11:35:24:WU01:FS01:News: Welcome to Folding@Home
11:35:24:WU01:FS01:Assigned to work server 171.64.65.56
11:35:24:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GF114 [GeForce GTX 560 Ti] from 171.64.65.56
11:35:24:WU01:FS01:Connecting to 171.64.65.56:8080
11:35:25:WU01:FS01:Downloading 4.32MiB
11:35:31:WU01:FS01:Download 53.48%
11:35:35:WU01:FS01:Download complete
11:35:35:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:9408 run:451 clone:0 gen:1 core:0x17 unit:0x000000010a3b1e5c5342d982d96a60c7
11:35:37:WU00:FS01:0x17:Saving result file logfile_01.txt
11:35:37:WU00:FS01:0x17:Saving result file checkpointState.xml
11:35:39:WU00:FS01:0x17:Saving result file checkpt.crc
11:35:39:WU00:FS01:0x17:Saving result file log.txt
11:35:39:WU00:FS01:0x17:Saving result file positions.xtc
11:35:42:WU00:FS01:0x17:Folding@home Core Shutdown: FINISHED_UNIT
11:35:42:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
11:35:42:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:13000 run:838 clone:0 gen:6 core:0x17 unit:0x00000011538b3db753108822086b93a1
11:35:42:WU00:FS01:Uploading 12.83MiB to 140.163.4.231
11:35:42:WU00:FS01:Connecting to 140.163.4.231:8080
11:35:42:WU01:FS01:Starting
11:35:42:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /root/spot/Folding/cores/www.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_17.fah/FahCore_17 -dir 01 -suffix 01 -version 703 -lifeline 32361 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
11:35:42:WU01:FS01:Started FahCore on PID 32601
11:35:42:WU01:FS01:Core PID:32605
11:35:42:WU01:FS01:FahCore 0x17 started
11:35:43:WU01:FS01:0x17:*********************** Log Started 2014-04-17T11:35:43Z ***********************
11:35:43:WU01:FS01:0x17:Project: 9408 (Run 451, Clone 0, Gen 1)
11:35:43:WU01:FS01:0x17:Unit: 0x000000010a3b1e5c5342d982d96a60c7
11:35:43:WU01:FS01:0x17:CPU: 0x00000000000000000000000000000000
11:35:43:WU01:FS01:0x17:Machine: 1
11:35:43:WU01:FS01:0x17:Reading tar file state.xml
11:35:43:WU01:FS01:0x17:Reading tar file system.xml
11:35:44:WU01:FS01:0x17:Reading tar file integrator.xml
11:35:44:WU01:FS01:0x17:Reading tar file core.xml
11:35:44:WU01:FS01:0x17:Digital signatures verified
11:35:48:WU00:FS01:Upload 5.36%
11:35:55:WU00:FS01:Upload 10.23%
11:36:01:WU00:FS01:Upload 14.61%
11:36:07:WU00:FS01:Upload 19.00%
11:36:13:WU00:FS01:Upload 23.38%
11:36:19:WU00:FS01:Upload 26.30%
11:36:25:WU00:FS01:Upload 31.66%
11:36:31:WU00:FS01:Upload 35.56%
11:36:38:WU00:FS01:Upload 39.46%
11:36:44:WU00:FS01:Upload 43.84%
11:36:50:WU00:FS01:Upload 48.23%
11:36:56:WU00:FS01:Upload 52.12%
11:37:02:WU00:FS01:Upload 56.02%
11:37:08:WU00:FS01:Upload 60.40%
11:37:14:WU00:FS01:Upload 64.30%
11:37:21:WU00:FS01:Upload 68.68%
11:37:28:WU00:FS01:Upload 74.04%
11:37:34:WU00:FS01:Upload 77.94%
11:37:40:WU00:FS01:Upload 82.32%
11:37:46:WU00:FS01:Upload 86.71%
11:37:53:WU00:FS01:Upload 91.09%
11:37:59:WU00:FS01:Upload 94.99%
11:38:06:WU00:FS01:Upload 99.86%
11:38:19:WU00:FS01:Upload complete
11:38:19:WU00:FS01:Server responded WORK_ACK (400)
11:38:19:WU00:FS01:Final credit estimate, 41042.00 points
11:38:19:WU00:FS01:Cleaning up
Re: Stats not updating
Posted: Fri Apr 18, 2014 9:26 pm
by folding_hoomer
A member of my team is missing the credits for one 13001.
Member: hbf878, Team: 70335
WU 13001 (351,8,3)
Upload finished: 17.4, 16:47:12 UTC
Log:
Code:
15:49:36:WU01:FS00:0x17:Completed 4900000 out of 5000000 steps (98%)
16:16:04:WU01:FS00:0x17:Completed 4950000 out of 5000000 steps (99%)
******************************* Date: 2014-04-17 *******************************
16:42:31:WU01:FS00:0x17:Completed 5000000 out of 5000000 steps (100%)
16:43:01:WU01:FS00:0x17:Saving result file logfile_01.txt
16:43:01:WU01:FS00:0x17:Saving result file checkpointState.xml
16:43:04:WU01:FS00:0x17:Saving result file checkpt.crc
16:43:04:WU01:FS00:0x17:Saving result file log.txt
16:43:04:WU01:FS00:0x17:Saving result file positions.xtc
16:43:07:WU01:FS00:0x17:Folding@home Core Shutdown: FINISHED_UNIT
16:43:08:WU01:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
16:43:09:WU01:FS00:Sending unit results: id:01 state:SEND error:NO_ERROR project:13001 run:351 clone:8 gen:3 core:0x17 unit:0x00000004538b3db75328b3622cd3349f
16:43:09:WU01:FS00:Uploading 12.83MiB to 140.163.4.231
16:43:09:WU01:FS00:Connecting to 140.163.4.231:8080
16:43:16:WU01:FS00:Upload 9.74%
16:43:23:WU01:FS00:Upload 11.69%
16:43:30:WU01:FS00:Upload 17.05%
16:43:42:WU01:FS00:Upload 19.48%
16:43:48:WU01:FS00:Upload 24.84%
16:43:59:WU01:FS00:Upload 27.27%
16:44:06:WU01:FS00:Upload 33.12%
16:44:16:WU01:FS00:Upload 35.06%
16:44:22:WU01:FS00:Upload 40.42%
16:44:33:WU01:FS00:Upload 42.86%
16:44:39:WU01:FS00:Upload 48.21%
16:44:51:WU01:FS00:Upload 50.65%
16:44:57:WU01:FS00:Upload 56.01%
16:45:03:WU01:FS00:Upload 58.44%
16:45:09:WU01:FS00:Upload 60.88%
16:45:15:WU01:FS00:Upload 63.80%
16:45:25:WU01:FS00:Upload 66.72%
16:45:33:WU01:FS00:Upload 70.62%
16:45:40:WU01:FS00:Upload 73.05%
16:45:47:WU01:FS00:Upload 75.97%
16:45:53:WU01:FS00:Upload 78.41%
16:45:59:WU01:FS00:Upload 81.33%
16:46:05:WU01:FS00:Upload 83.77%
16:46:11:WU01:FS00:Upload 86.69%
16:46:17:WU01:FS00:Upload 89.61%
16:46:24:WU01:FS00:Upload 92.05%
16:46:30:WU01:FS00:Upload 94.48%
16:46:36:WU01:FS00:Upload 97.40%
16:46:43:WU01:FS00:Upload 99.84%
16:47:12:WU01:FS00:Upload complete
16:47:12:WU01:FS00:Server responded WORK_ACK (400)
16:47:12:WU01:FS00:Final credit estimate, 39511.00 points
16:47:12:WU01:FS00:Cleaning up
Thanks in advance.
Re: Stats not updating
Posted: Fri Apr 18, 2014 9:44 pm
by kasi
Thank you kyleb for confirming that results have been received for 13001 R6 C0 G7.
I would prefer the credit if/when it can be fixed but my main concerns in this were to avoid burning coal wastefully and to contribute to scientific research.
Re: Stats not updating
Posted: Sat Apr 19, 2014 4:13 am
by MonsterBuilder
Looks like I lost one in the jumble as well - I've not been credited with a completed gpu WU since 4/14 - was this one received successfully?
Code:
02:03:56:WU00:FS00:0x17:*********************** Log Started 2014-04-16T02:03:55Z ***********************
02:03:56:WU00:FS00:0x17:Project: 13001 (Run 264, Clone 6, Gen 3)
02:03:56:WU00:FS00:0x17:Unit: 0x00000005538b3db753289aad433ed310
02:03:56:WU00:FS00:0x17:CPU: 0x00000000000000000000000000000000
02:03:56:WU00:FS00:0x17:Machine: 0
02:03:56:WU00:FS00:0x17:Digital signatures verified
02:03:56:WU00:FS00:0x17:Folding@home GPU core17
02:03:56:WU00:FS00:0x17:Version 0.0.52
02:03:56:WU00:FS00:0x17: Found a checkpoint file
02:04:21:Started thread 10 on PID 4388
02:08:57:WU00:FS00:0x17:Completed 2500000 out of 5000000 steps (50%)
02:08:57:WU00:FS00:0x17:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
02:37:15:WU00:FS00:0x17:Completed 2550000 out of 5000000 steps (51%)
- snip -
******************************* Date: 2014-04-17 *******************************
03:50:30:WU00:FS00:0x17:Completed 4800000 out of 5000000 steps (96%)
04:53:56:WU00:FS00:0x17:Completed 4850000 out of 5000000 steps (97%)
05:22:04:WU00:FS00:0x17:Completed 4900000 out of 5000000 steps (98%)
05:49:46:WU00:FS00:0x17:Completed 4950000 out of 5000000 steps (99%)
06:17:27:WU00:FS00:0x17:Completed 5000000 out of 5000000 steps (100%)
06:17:27:WU02:FS00:Connecting to assign-GPU.stanford.edu:80
06:17:27:WU02:FS00:News: Welcome to Folding@Home
06:17:27:WU02:FS00:Assigned to work server 140.163.4.231
06:17:27:WU02:FS00:Requesting new work unit for slot 00: RUNNING gpu:0:R575A [AMD Radeon HD7700 Series] from 140.163.4.231
06:17:27:WU02:FS00:Connecting to 140.163.4.231:8080
06:17:28:WU02:FS00:Downloading 4.84MiB
06:17:33:WU02:FS00:Download complete
06:17:33:WU02:FS00:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:13001 run:257 clone:6 gen:9 core:0x17 unit:0x00000013538b3db7532898b0b4095bfd
06:17:49:WU00:FS00:0x17:Saving result file logfile_01.txt
06:17:49:WU00:FS00:0x17:Saving result file checkpointState.xml
06:17:52:WU00:FS00:0x17:Saving result file checkpt.crc
06:17:52:WU00:FS00:0x17:Saving result file log.txt
06:17:52:WU00:FS00:0x17:Saving result file positions.xtc
06:17:55:WU00:FS00:0x17:Folding@home Core Shutdown: FINISHED_UNIT
06:17:55:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
06:17:55:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:13001 run:264 clone:6 gen:3 core:0x17 unit:0x00000005538b3db753289aad433ed310
06:17:55:WU00:FS00:Uploading 12.83MiB to 140.163.4.231
06:17:55:WU00:FS00:Connecting to 140.163.4.231:8080
06:17:56:WU02:FS00:Starting
06:17:56:WU02:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" "C:/Users/Eric Buchanan/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/ATI/R600/Core_17.fah/FahCore_17.exe" -dir 02 -suffix 01 -version 703 -lifeline 4388 -checkpoint 15 -gpu 0 -gpu-vendor ati
06:17:57:WU02:FS00:Started FahCore on PID 5568
06:17:57:Started thread 13 on PID 4388
06:17:58:WU02:FS00:Core PID:8420
06:17:58:WU02:FS00:FahCore 0x17 started
06:17:58:WU02:FS00:0x17:*********************** Log Started 2014-04-17T06:17:58Z ***********************
06:17:58:WU02:FS00:0x17:Project: 13001 (Run 257, Clone 6, Gen 9)
06:17:58:WU02:FS00:0x17:Unit: 0x00000013538b3db7532898b0b4095bfd
06:17:58:WU02:FS00:0x17:CPU: 0x00000000000000000000000000000000
06:17:58:WU02:FS00:0x17:Machine: 0
06:17:58:WU02:FS00:0x17:Reading tar file state.xml
06:17:59:WU02:FS00:0x17:Reading tar file system.xml
06:17:59:WU02:FS00:0x17:Reading tar file integrator.xml
06:17:59:WU02:FS00:0x17:Reading tar file core.xml
06:17:59:WU02:FS00:0x17:Digital signatures verified
06:17:59:WU02:FS00:0x17:Folding@home GPU core17
06:17:59:WU02:FS00:0x17:Version 0.0.52
06:18:01:WU00:FS00:Upload 7.79%
06:18:07:WU00:FS00:Upload 13.15%
06:18:13:WU00:FS00:Upload 18.51%
06:18:19:WU00:FS00:Upload 23.87%
06:18:25:WU00:FS00:Upload 29.71%
06:18:31:WU00:FS00:Upload 34.58%
06:18:37:WU00:FS00:Upload 39.94%
06:18:43:WU00:FS00:Upload 45.30%
06:18:49:WU00:FS00:Upload 50.66%
06:18:55:WU00:FS00:Upload 56.02%
06:19:01:WU00:FS00:Upload 61.37%
06:19:07:WU00:FS00:Upload 66.73%
06:19:13:WU00:FS00:Upload 72.09%
06:19:19:WU00:FS00:Upload 77.45%
06:19:25:WU00:FS00:Upload 82.81%
06:19:31:WU00:FS00:Upload 88.16%
06:19:37:WU00:FS00:Upload 93.52%
06:19:43:WU00:FS00:Upload 98.88%
06:19:58:WU00:FS00:Upload complete
06:19:59:WU00:FS00:Server responded WORK_ACK (400)
06:19:59:WU00:FS00:Final credit estimate, 36677.00 points
06:19:59:WU00:FS00:Cleaning up
Re: Stats not updating
Posted: Sat Apr 19, 2014 9:18 am
by Valkyrie
So did 140.163.4.231 accept a bunch of completed WU's and just keep them? That server seems to be a common theme here.
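A rough way to check that observation against a saved client log is to tally the addresses named in the "Uploading ... to ..." lines. This is only a sketch: it assumes the log line format shown in the posts above, and the function name `upload_targets` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   11:35:42:WU00:FS01:Uploading 12.83MiB to 140.163.4.231
# The pattern is an assumption based on the log excerpts in this thread.
UPLOAD_RE = re.compile(r":Uploading [\d.]+\s*\w+ to (\d{1,3}(?:\.\d{1,3}){3})")


def upload_targets(log_lines):
    """Count how many uploads went to each server address."""
    counts = Counter()
    for line in log_lines:
        match = UPLOAD_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "11:35:42:WU00:FS01:Uploading 12.83MiB to 140.163.4.231",
        "11:35:42:WU00:FS01:Connecting to 140.163.4.231:8080",
        "16:43:09:WU01:FS00:Uploading 12.83MiB to 140.163.4.231",
    ]
    for address, count in upload_targets(sample).most_common():
        print(address, count)
```

A tally like this only shows where results were sent. It says nothing about whether the server later passed the points on to the stats server.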
Re: Stats not updating
Posted: Sat Apr 19, 2014 3:49 pm
by Joe_H
Based on what has been reported in prior stats server problems, I would guess that one or more logs of point results were sent from the WS to the stats server but not processed. In the past investigating to find unprocessed logs has usually taken a few days. As I understand it, the information on which WU's were received but not credited is helpful in locating the specific logs.
Re: Stats not updating
Posted: Mon Apr 21, 2014 6:53 am
by DocJonz
Joe_H wrote:Based on what has been reported in prior stats server problems, I would guess that one or more logs of point results were sent from the WS to the stats server but not processed. In the past investigating to find unprocessed logs has usually taken a few days. As I understand it, the information on which WU's were received but not credited is helpful in locating the specific logs.
I'm guessing I fall in that category - my stats for the 16th April were half what they should have been. How would I go about trying to determine which WU's went awry?
Re: Stats not updating
Posted: Mon Apr 21, 2014 7:16 am
by Joe_H
DocJonz wrote:I'm guessing I fall in that category - my stats for the 16th April were half what they should have been. How would I go about trying to determine which WU's went awry?
Everybody's stats were short for the 16th; the stats stopped updating about halfway through the day. The backlog was processed starting about 24 hours later, so you should have seen a bump in your points on the 17th as the WU's turned in during the outage were credited. As for determining which WU's, if any, did not get credited, you would have to compare the estimated points with what was eventually awarded. That can be relatively easy if you have a small number of machines and WU's, but not very easy if you have many. So far, though, it appears that the missing ones are mostly WU's from projects 13000 and 13001. As for specific WU's being reported as uncredited, they may already have enough reports to determine which logs they are looking for.
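For anyone with more than a few machines, the "estimated points" side of that comparison can be pulled out of the client logs with a short script. The following is a minimal sketch, assuming the log line format seen in the posts above ("Sending unit results: ..." followed later by "Final credit estimate, N points" for the same WU and slot). The function name `estimated_credits` is hypothetical, and the awarded points still have to be looked up separately on the stats pages.

```python
import re

# Line formats are assumptions taken from the log excerpts in this thread.
SEND_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:(?P<wu>WU\d+):(?P<fs>FS\d+):Sending unit results:.*?"
    r"project:(?P<project>\d+) run:(?P<run>\d+) "
    r"clone:(?P<clone>\d+) gen:(?P<gen>\d+)"
)
CREDIT_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:(?P<wu>WU\d+):(?P<fs>FS\d+):"
    r"Final credit estimate, (?P<points>[\d.]+) points"
)


def estimated_credits(log_lines):
    """Return [((project, run, clone, gen), estimated_points), ...]."""
    pending = {}  # (WU id, slot id) -> (project, run, clone, gen)
    results = []
    for raw in log_lines:
        line = raw.strip()
        sent = SEND_RE.match(line)
        if sent:
            key = (sent["wu"], sent["fs"])
            pending[key] = tuple(
                int(sent[name]) for name in ("project", "run", "clone", "gen")
            )
            continue
        credit = CREDIT_RE.match(line)
        if credit:
            # Pair the estimate with the most recent send for this WU/slot.
            prcg = pending.pop((credit["wu"], credit["fs"]), None)
            if prcg is not None:
                results.append((prcg, float(credit["points"])))
    return results


if __name__ == "__main__":
    sample = [
        "11:35:42:WU00:FS01:Sending unit results: id:00 state:SEND "
        "error:NO_ERROR project:13000 run:838 clone:0 gen:6 core:0x17 "
        "unit:0x00000011538b3db753108822086b93a1",
        "11:38:19:WU00:FS01:Server responded WORK_ACK (400)",
        "11:38:19:WU00:FS01:Final credit estimate, 41042.00 points",
    ]
    for prcg, points in estimated_credits(sample):
        print(prcg, points)
```

Summing the estimates per day and comparing them with the daily totals on the stats page should narrow down which WU's to report, bearing in mind that the logged figure is only an estimate.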
Re: Stats not updating
Posted: Mon Apr 21, 2014 7:29 am
by DocJonz
Thanks Joe_H.
I was expecting the spike when things were credited (as has happened before) ... but haven't seen one just yet.
I had 5x GTX 780's running 13000/1 over the period (in addition to high core-count CPU's) so maybe they are still in the 'missing' category.