20210206 Missing Work?

cine.chris
Posts: 78
Joined: Sun Apr 26, 2020 1:29 pm

Re: 20210206 Missing Work?

Post by cine.chris »

It'll be one week tomorrow since the stats server issues started.
I'm still seeing only a fraction of my work logged.
20M PPD translates to 2.5M pts/3hr, so I'd reasonably expect to see numbers in the 2-3M range.
In the 3 days before this break, I'd just installed 3070 #2 and was averaging ~19M/day.
Part of the frustration is having added a dual-Xeon server to consolidate GPUs, plus an RTX 3070, and seeing fewer points than before.
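
For anyone checking the arithmetic above, here is a minimal sketch of the PPD conversion. The function name `expected_points` is made up for illustration; it simply scales a points-per-day figure to a shorter window and is not part of any F@H tool.

```python
def expected_points(ppd: float, hours: float) -> float:
    """Scale a points-per-day (PPD) figure to a window of `hours` hours."""
    return ppd * hours / 24.0


# 20M PPD over a 3-hour window
print(expected_points(20_000_000, 3))  # 2500000.0

# The ~19M/day average mentioned above, over the same window
print(expected_points(19_000_000, 3))  # 2375000.0
```

Both figures land inside the 2-3M range quoted in the post.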

new08
Posts: 188
Joined: Fri Jan 04, 2008 11:02 pm
Hardware configuration: Hewlett-Packard 1494 Win10 Build 1836
GeForce [MSI] GTX 950
Runs F@H Ver7.6.21
[As of Jan 2021]
Location: England

Re: 20210206 Missing Work?

Post by new08 »

I have had a similar issue, which I posted on an old thread.
I lost 2 days' work: it was accepted as valid but didn't appear in either the official or the EOC stats.
I corrected it by reinstalling F@H.
Even then, the online control reported differently from the Advanced Control until I stopped and restarted that.
I still have the old logs, so I can post some data if it doesn't recover by itself.
Recent cases of data lost to EOC did get corrected, but that was just their interface issue, I think.
lyuvelch
Posts: 4
Joined: Tue Feb 09, 2021 5:05 am

Re: 20210206 Missing Work?

Post by lyuvelch »

I haven't seen any updates on the stats page for my account since
2021-02-08 09:57:27. It is
2021-02-09 14:18:00 JST now in Japan.

I can't see the 1M+ points my clients have made since then.

stats.foldingathome.org/donor/439946803
At the time of writing, this page is reporting:
Date of last Work Unit 2021-02-08 09:57:27
Total score 30,192,984
Total WUs 855
Overall rank (if points are combined) 42,639 of 2,791,739
Active clients (within 50 days) 16
Active clients (within 7 days) 12
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 20210206 Missing Work?

Post by Neil-B »

cine.chris wrote: It'll be one week tomorrow that the stats servers issues started.
Just to manage your expectations:

Stats issues rarely take hours/days to resolve ... most take a few weeks ... some have taken months ... but in my experience they have always been sorted eventually.

Is it right that it should take so long? Probably not, but with the restricted dev effort and the dispersed (physically and organisationally) nature of the FaH infrastructure, this is unfortunately the reality.

Stats issues have been around since the day points were introduced, but this particular issue may be part of a build-up of things that (as far as I can tell) possibly started over a month ago, when one set of stats issues was sorted but unfortunately seemed to have a knock-on impact on another part. This may make it even more challenging to unpick and resolve ... Posting PRCGs helps the team trace/track down the issues :) ... and a bit of patience helps manage expectations ;)
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
CaptainHalon
Posts: 63
Joined: Mon Apr 13, 2020 11:47 am

Re: 20210206 Missing Work?

Post by CaptainHalon »

Started seeing a similar issue yesterday. EOC showed about 4/7 of my usual PPD. Moreover, stats.foldingathome.org hasn't shown an update for me since about 10AM GMT yesterday.
mgetz
Posts: 57
Joined: Tue Aug 11, 2020 6:23 pm

Re: 20210206 Missing Work?

Post by mgetz »

There seems to be more than one disconnect. If I check some individual work units they show up. But main stats is showing that I last turned in a work unit on the 8th... and I've definitely turned in quite a few (that I can verify checking the WU directly!). So there is all sorts of messed up going on.
SilvioMartin
Posts: 30
Joined: Thu Sep 24, 2020 6:06 pm
Hardware configuration: iMac 2017 Intel Quad-Core i5 3,4 GHz, 8 GB RAM, Radeon Pro 560 4 GB, typically with the latest macOS update. 5 Raspberry Pi 4B (2 GB).
Location: Oberhausen, Germany

Re: 20210206 Missing Work?

Post by SilvioMartin »

I'd say the statistics are completely broken, or they were turned off for maintenance / bug fixing. Even Anonymous hasn't uploaded any good work units since yesterday morning: https://stats.foldingathome.org/donor/1437

The good thing is that Anonymous never ever will complain about it ;)
My Raspberry Pi folding rack: http://www.anne-emscher.net/fah/
cine.chris
Posts: 78
Joined: Sun Apr 26, 2020 1:29 pm

Re: 20210206 Missing Work?

Post by cine.chris »

Hi Neil-B,
Good to hear from you.
It appears to be a hard-coded address issue, from the view of an engineer who was often forced to deal with the vagaries of IT organizations. I heard mention of server transitions.
Perhaps "BIND" could be an appropriate pun to apply to symptoms like this?
It's a fragile architecture that's connected like a chain rather than a web.
I've shut down systems until I see this is rectified.
I'm currently running at about 40% until this is corrected.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 20210206 Missing Work?

Post by Neil-B »

Yup: fragile, in a bind, even "this isn't the way to do this", but it has kind of evolved beyond where it was ever designed to go ... but hey, it is what we have :) ... The science will be progressing fine ... and the points do always catch up ... Shutting down is obviously your choice, but as long as the logs show the work acknowledged and an estimated points value, then science is progressing ... What happens is that at some point a points/stats reconstruction is done and a spike (sometimes a really big one) appears, and everything is back to where it should be ... Shutting down kit means science isn't progressed.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 20210206 Missing Work?

Post by bruce »

When systems that were created 20 years ago by non-programmers reach the point of being fragile and repeatedly failing, there's usually only one alternative: Have a professional programmer rewrite it from the ground up. That means its performance continues to degrade until it can be replaced by a new system.

It looks like we may have reached that point. Of course none of us sees the big picture. Almost everybody looks at their total points -- which is not helpful in identifying a problem which is an aggregate of many small errors plus many small successes -- and not particularly useful in identifying a reparable problem or repairing or replacing the overall system.

Treating it as problems that may be associated with individual work servers, are there identifiable work servers that ARE working correctly? That may be the first sign that progress is being made?
cine.chris
Posts: 78
Joined: Sun Apr 26, 2020 1:29 pm

Re: 20210206 Missing Work?

Post by cine.chris »

It appears that the current Linux client, 7.6.21, ignored the specified collection server and returned work to the errant 206.223.170.146.
The PRCG showed 'not found'.

Update:

Watched another WU with same results for: project:17431 run:0 clone:1731 gen:105
Last edited by cine.chris on Tue Feb 09, 2021 5:59 pm, edited 2 times in total.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 20210206 Missing Work?

Post by Neil-B »

If you mean not found in the stats system, that can happen. If your log shows work ack and estimated points, then the WU will be useful to science. CS are only used if the WS can't accept ... my guess is that the WS accepted it but the stats connection for that WS is borked (and I think this is only one of a number of issues all overlapping), to the point that the stats system doesn't even know the WU exists ... Luckily the stats system can be totally wrecked and the science can still continue ... I am just glad I am not the poor person who has to track down, resolve, then catch up all the stats ... but in my experience it always gets sorted eventually.
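
As a rough illustration of the "CS are only used if WS can't accept" rule, here is a minimal sketch. It is not the actual client code: `return_work_unit` and the `upload` callable are hypothetical names, and the real client's retry behaviour is not described in this thread.

```python
from typing import Callable, Optional


def return_work_unit(
    upload: Callable[[str], bool],
    work_server: str,
    collection_server: Optional[str],
) -> Optional[str]:
    """Return the address of the server that accepted the WU, or None.

    `upload(address)` is a hypothetical callable that returns True when
    the server at `address` acknowledges the upload.
    """
    # The work server (WS) is always tried first.
    if upload(work_server):
        return work_server
    # The collection server (CS) is only a fallback when the WS can't accept.
    if collection_server is not None and upload(collection_server):
        return collection_server
    return None
```

Note that nothing in this path touches the stats system, which is consistent with the point that a WU can be accepted for science while stats never hears about it.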
cine.chris
Posts: 78
Joined: Sun Apr 26, 2020 1:29 pm

Re: 20210206 Missing Work?

Post by cine.chris »

Bruce,
Yes, they need to pick their battles; host resolution appears to be a good candidate for a patch.
Creating 'A' records for critical servers & migrating code to name resolution would be a doable plan. Even 'foreign' servers can have managed 'A' records in the native domain (I just tested that...). It resolved within seconds, the first ping worked. Of course, cached updates would need to be tested for latency.
Services could easily be redirected to a back-up or new service, even migrated back if the 'new' service failed, etc.
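
A minimal sketch of the name-resolution idea above, assuming services are addressed by hostname instead of a hard-coded IP. The helper names `resolve_ipv4` and `first_resolvable` are made up for illustration, and any hostnames passed in would be placeholders, not real F@H servers. Only the standard-library `socket.getaddrinfo` call is real.

```python
import socket
from typing import Callable, Iterable, List, Optional, Tuple


def resolve_ipv4(hostname: str) -> List[str]:
    """Look up the IPv4 addresses ('A' records) for a hostname via the system resolver."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def first_resolvable(
    hostnames: Iterable[str],
    resolver: Callable[[str], List[str]] = resolve_ipv4,
) -> Optional[Tuple[str, str]]:
    """Return (hostname, ip) for the first name that resolves, else None."""
    for name in hostnames:
        try:
            addresses = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            continue
        if addresses:
            return name, addresses[0]
    return None
```

With this shape, pointing a service at a backup only means changing the 'A' record or the order of the hostname list; the calling code does not change.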
Hope this is resolved soon.
WeatherWitch
Posts: 1
Joined: Mon Feb 08, 2021 7:09 pm

Re: 20210206 Missing Work?

Post by WeatherWitch »

Stats seem to be slowly updating now? I just jumped from 5k to 95k.
SilvioMartin
Posts: 30
Joined: Thu Sep 24, 2020 6:06 pm
Hardware configuration: iMac 2017 Intel Quad-Core i5 3,4 GHz, 8 GB RAM, Radeon Pro 560 4 GB, typically with the latest macOS update. 5 Raspberry Pi 4B (2 GB).
Location: Oberhausen, Germany

Re: 20210206 Missing Work?

Post by SilvioMartin »

Neil-B wrote: if your log shows work ack and estimated point then the WU will be useful to science
Good enough for me to keep them running.
My Raspberry Pi folding rack: http://www.anne-emscher.net/fah/