Re: 21210206 Missing Work?
Posted: Tue Feb 09, 2021 1:20 am
by cine.chris
It'll be one week tomorrow that the stats servers issues started.
I'm still seeing a fraction of my work logged.
20M PPD translates to 2.5M pts/3hr. Of course, I'd reasonably expect to see #s in the 2-3M range.
The 3 days prior to this break, I'd just installed 3070#2 and was averaging ~19M/day.
Part of the frustration is adding a dual-Xeon server to consolidate GPUs & an RTX3070 and seeing fewer points than prior.
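For reference, the PPD arithmetic in the post above can be sketched in a few lines of Python. The 3-hour stats interval comes from the post; the function name is made up for illustration.

```python
def points_per_interval(ppd: float, interval_hours: float = 3.0) -> float:
    """Expected points in one stats interval for a given points-per-day rate."""
    return ppd * interval_hours / 24.0


# 20M PPD over a 3-hour update window
print(points_per_interval(20_000_000))  # 2500000.0
# The ~19M/day average mentioned in the post
print(points_per_interval(19_000_000))  # 2375000.0
```

Both figures fall inside the 2-3M range the poster expected to see per update.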
Re: 21210206 Missing Work?
Posted: Tue Feb 09, 2021 3:53 am
by new08
I have had a similar issue- posted on an old thread.
I lost 2 days' work: it was accepted as valid but does not appear in either the official or the EOC stats.
I corrected by re-installing F@H.
Even then the online control reported differently to the Adv.control till I stopped and restarted that.
I have the old logs still so can post some data if it doesn't recover itself.
Recent cases of lost data to EOC did correct- but that was just their interface issue, I think.
Re: 21210206 Missing Work?
Posted: Tue Feb 09, 2021 5:28 am
by lyuvelch
I haven't seen updates on the stats page for my account since 2021-02-08 09:57:27. It is 2021-02-09 14:18:00 JST now in Japan.
I can't see the 1M+ points made by my clients since then.
stats.foldingathome.org/donor/439946803
At the time of writing, this page is reporting:
Date of last Work Unit 2021-02-08 09:57:27
Total score 30,192,984
Total WUs 855
Overall rank (if points are combined) 42,639 of 2,791,739
Active clients (within 50 days) 16
Active clients (within 7 days) 12
Re: 21210206 Missing Work?
Posted: Tue Feb 09, 2021 9:04 am
by Neil-B
cine.chris wrote: It'll be one week tomorrow that the stats servers issues started.
Just to manage your expectations:
Stats issues rarely take hours/days to resolve ... most take a few weeks ... some have taken months ... but in my experience they have always been sorted eventually.
Is it right that it should take so long? Probably not, but with the restricted dev effort and the dispersed (physically and organisationally) nature of the FaH infrastructure, this is unfortunately the reality.
Stats issues have been around since the day points were introduced, but this particular issue may be part of a build-up of things (as far as I can tell) that possibly started over a month ago, where one set of stats issues was sorted but unfortunately seemed to have a knock-on impact on another part. This may make it even more challenging to unpick and resolve ... Posting PRCGs helps the team trace/track down the issues
... and a bit of patience helps manage expectations
Re: 21210206 Missing Work?
Posted: Tue Feb 09, 2021 12:39 pm
by CaptainHalon
Started seeing a similar issue yesterday. EOC showed about 4/7 of my usual PPD. Moreover, stats.foldingathome.org hasn't shown an update for me since about 10AM GMT yesterday.
Re: 21210206 Missing Work?
Posted: Tue Feb 09, 2021 12:56 pm
by mgetz
There seems to be more than one disconnect. If I check some individual work units they show up. But main stats is showing that I last turned in a work unit on the 8th... and I've definitely turned in quite a few (that I can verify checking the WU directly!). So there is all sorts of messed up going on.
Re: 21210206 Missing Work?
Posted: Tue Feb 09, 2021 1:59 pm
by SilvioMartin
I'd say the statistics are completely broken, or they were turned off for maintenance / bug fixing. Even Anonymous didn't upload any good work units since yesterday morning:
https://stats.foldingathome.org/donor/1437
The good thing is that Anonymous never ever will complain about it
Re: 21210206 Missing Work?
Posted: Tue Feb 09, 2021 2:08 pm
by cine.chris
Hi Neil-B,
Good to hear from you.
It appears to be a hard-coded address issue, from the view of an engineer that was often forced to deal with the vagaries of IT organizations. I heard mention of server transitions.
Perhaps "BIND" could be an appropriate pun to apply for symptoms like this?
It's a fragile architecture that's connected like a chain vs a web.
I've shut down systems until I see this is rectified.
Currently at about 40%, until this is corrected.
Re: 21210206 Missing Work?
Posted: Tue Feb 09, 2021 3:17 pm
by Neil-B
Yup: fragile, in a bind, even "this isn't the way to do this", but it has kind of evolved beyond where it was ever designed ... but hey, it is what we have
... The science will be progressing fine ... and the points do always catch up ... shutting down is obviously your choice, but as long as the logs show the work acknowledged and an estimated points value then science is progressing ... what happens is that at some point a points/stats reconstruction is done and a spike, sometimes a really big one, appears and everything is back to where it should be ... shutting down kit means science isn't progressed
Re: 20210206 Missing Work?
Posted: Tue Feb 09, 2021 4:45 pm
by bruce
When systems that were created 20 years ago by non-programmers reach the point of being fragile and repeatedly failing, there's usually only one alternative: Have a professional programmer rewrite it from the ground up. That means its performance continues to degrade until it can be replaced by a new system.
It looks like we may have reached that point. Of course none of us sees the big picture. Almost everybody looks at their total points -- which is not helpful in identifying a problem which is an aggregate of many small errors plus many small successes -- and not particularly useful in identifying a reparable problem or repairing or replacing the overall system.
Treating it as problems that may be associated with individual work servers, are there identifiable work servers that ARE working correctly? That may be the first sign that progress is being made?
Re: 20210206 Missing Work?
Posted: Tue Feb 09, 2021 5:00 pm
by cine.chris
It appears that the current Linux client, 7.6.21, ignored the specified collection server and returned work to the errant 206.223.170.146.
The PRCG showed 'not found'.
Update:
Watched another WU with same results for: project:17431 run:0 clone:1731 gen:105
Re: 20210206 Missing Work?
Posted: Tue Feb 09, 2021 5:31 pm
by Neil-B
If you mean not found in the stats system, that can happen. If your log shows work ack and estimated points then the WU will be useful to science. CS are only used if the WS can't accept ... my guess is the WS accepted it but the stats connection for that WS is borked (and I think this is only one of a number of issues all overlapping), to the point the stats system doesn't even know the WU exists ... Luckily the stats system can be totally wrecked and the science can still continue ... I am just glad I am not the poor person who has to track down, resolve, then catch up all the stats ... but in my experience it always gets sorted eventually
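The "work ack plus estimated points" check described above could be automated with something like the rough Python sketch below. It assumes the client log contains lines of the form "Server responded WORK_ACK" and "Final credit estimate, N points" tagged with a WUxx:FSxx slot; that format is an assumption here, so adjust the patterns to whatever your own log shows. The sample lines are invented for illustration and are not taken from any real log.

```python
import re
from typing import Iterable, List

# Assumed log line shapes; adjust to match your own client log.
ACK_RE = re.compile(r"(WU\d+):FS\d+:Server responded WORK_ACK")
CREDIT_RE = re.compile(r"(WU\d+):FS\d+:Final credit estimate, ([\d.]+) points")


def acknowledged_credits(log_lines: Iterable[str]) -> List[float]:
    """Return credit estimates for WU slots whose upload was acknowledged first."""
    pending_ack = set()  # WU slots that have an ack but no credit line yet
    credits = []
    for line in log_lines:
        ack = ACK_RE.search(line)
        if ack:
            pending_ack.add(ack.group(1))
            continue
        credit = CREDIT_RE.search(line)
        if credit and credit.group(1) in pending_ack:
            credits.append(float(credit.group(2)))
            pending_ack.discard(credit.group(1))
    return credits


# Invented sample lines, for illustration only.
sample = [
    "17:00:01:WU01:FS01:Server responded WORK_ACK (400)",
    "17:00:01:WU01:FS01:Final credit estimate, 123456.00 points",
    "17:05:00:WU02:FS00:Final credit estimate, 500.00 points",  # no ack seen
]
print(acknowledged_credits(sample))  # [123456.0]
```

Summing the returned list over a day gives a figure to compare against what the stats pages report.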
Re: 20210206 Missing Work?
Posted: Tue Feb 09, 2021 5:45 pm
by cine.chris
Bruce,
Yes, they need to pick their battles, host resolution appears to be a good candidate for a patch.
Creating 'A' records for critical servers & migrating code to name resolution would be a doable plan. Even 'foreign' servers can have managed 'A' records in the native domain (I just tested that...). It resolved within seconds, the first ping worked. Of course, cached updates would need to be tested for latency.
Services could easily be redirected to a back-up or new service, even migrated back if the 'new' service failed, etc.
Hope this is resolved soon.
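The name-resolution idea above, looking a server up by hostname instead of hard-coding its IP and falling back to a backup name, might look roughly like this in Python. It is only a sketch: the hostnames in the usage note are placeholders, and nothing here reflects how the FAH client or servers actually resolve addresses.

```python
import socket
from typing import Callable, Iterable, List, Tuple


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the IPv4 addresses (A records) the system resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})


def first_resolvable(
    hostnames: Iterable[str],
    resolver: Callable[[str], List[str]] = resolve_ipv4,
) -> Tuple[str, List[str]]:
    """Try each hostname in order and return the first that resolves."""
    for name in hostnames:
        try:
            addresses = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            continue
        if addresses:
            return name, addresses
    raise LookupError("none of the hostnames resolved")
```

A call such as first_resolvable(["stats-primary.example.org", "stats-backup.example.org"]) would return the backup name and its addresses if the primary had no record. Repointing the 'A' record then redirects clients without a code change, subject to the cache latency mentioned in the post.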
Re: 20210206 Missing Work?
Posted: Tue Feb 09, 2021 5:54 pm
by WeatherWitch
Stats seem to be slowly updating now? I just jumped from 5k to 95k
Re: 20210206 Missing Work?
Posted: Tue Feb 09, 2021 6:19 pm
by SilvioMartin
Neil-B wrote: if your log shows work ack and estimated points then the WU will be useful to science
Good enough for me to keep them running.