171.64.65.64 overloaded
Moderators: Site Moderators, FAHC Science Team
-
- Posts: 270
- Joined: Sun Dec 02, 2007 2:26 pm
- Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+ - Location: Belgium, near the International Sea-Port of Antwerp
Re: 171.64.65.64 overloaded
Same reports from my friend; it's not possible to get a constant stream of Work and swift uploads.
As a result, production is seriously hampered and very inefficient; on average, the donor systems are sitting idle for too long!
He runs a farm with a mix of SMP and multi-GPU machines, almost all of them exactly the same systems (hardware and software).
Their configurations are also identical.
This is hurting production really badly!
Can someone take control of that server (171.64.65.64) and repair it, please?
.
- stopped Linux SMP w. HT on [email protected] GHz
....................................
Folded since 10-06-04 till 09-2010
-
- Posts: 270
- Joined: Sun Dec 02, 2007 2:26 pm
- Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+ - Location: Belgium, near the International Sea-Port of Antwerp
Re: 171.64.65.64 overloaded
.
Now it's 171.64.65.64 GPU vspg2v lin5 full Reject 1.92 0 0 6 17883 5049 in REJECT again!
.
- stopped Linux SMP w. HT on [email protected] GHz
....................................
Folded since 10-06-04 till 09-2010
-
- Posts: 660
- Joined: Mon Oct 25, 2010 5:57 am
- Hardware configuration: a) Main unit
Sandybridge in HAF922 w/200 mm side fan
--i7 [email protected] GHz
--ASUS P8P67 DeluxeB3
--4GB ADATA 1600 RAM
--750W Corsair PS
--2Seagate Hyb 750&500 GB--WD Caviar Black 1TB
--EVGA 660GTX-Ti FTW - Signature 2 GPU@ 1241 Boost
--MSI GTX560Ti @900MHz
--Win7Home64; FAH V7.3.2; 327.23 drivers
b) 2004 HP a475c desktop, 1 core Pent 4 [email protected] GHz; Mem 2GB;HDD 160 GB;Zotac GT430PCI@900 MHz
WinXP SP3-32 FAH v7.3.6 301.42 drivers - GPU slot only
c) 2005 Toshiba M45-S551 laptop w/2 GB mem, 160GB HDD;Pent M 740 CPU @ 1.73 GHz
WinXP SP3-32 FAH v7.3.6 [Receiving Core A4 work units]
d) 2011 lappy-15.6"-1920x1080;i7-2860QM,2.5;IC Diamond Thermal Compound;GTX 560M 1,536MB u/c@700;16GB-1333MHz RAM;HDD:500GBHyb w/ 4GB SSD;Win7HomePrem64;320.18 drivers FAH 7.4.2ß - Location: Saratoga, California USA
Re: 171.64.65.64 overloaded
I've extracted selected columns from the server log for the 77 times between May 29 and today, June 22, when the CONNECT status was anything but Accepting (see the extract below).
As a "frequent flyer" for 171.64.65.64, I'm wondering whether there is something systemic that takes this server down so much - the server code, the hardware, the WUs that are loaded on it, or something else. Is that likely to be fixed anytime soon, or should we adjust our expectations and accept that it will be periodically off-line?
Since this server feeds WUs to very fast Fermi GPUs that come back to the well every couple of hours, there seems to be a lot of dead time while clients try to upload completed WUs and can't get new Fermi WUs from one of the other servers that is still up until X number of failed attempts to connect to 171.64.65.64.
I appreciate the members of PG, including Dr. Pande, giving updates in this thread. Maybe it's time for another brief one.
As a "frequent flyer" for 171.64.65.64, I'm wondering if there is something systemic that makes this server down so much - is it the server code, the hardware, the WUs that are loaded, or what. Is that likely to be fixed anytime soon? Or should we adjust our expectations that it will be periodically off-line?
Since this is serving WUs to very fast Fermi GPUs who come back to the well every couple of hours, there seems to be a lot of dead time while clients are trying to unload completed WUs, and can't get new WUs from one of the other servers for Fermi WUs that are still up until X number of failed attempts to connect to 171.64.65.64.
I'm appreciative of members of PG, including Dr. Pande, in giving updates in this thread. Maybe it's time for another brief update.
Re: 171.64.65.64 overloaded
by yslin » Sat May 21, 2011 1:49 pm
Hi,
I've been working on this server, but it might take more time to fix. Sorry for the inconvenience!
yslin
yslin
Pande Group Member
Re: 171.64.65.64 overloaded
by VijayPande » Sat May 21, 2011 3:56 pm
It's still having problems, so we're doing a hard reboot. The machine will likely fsck for a while. We'll give you an update when we know more.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
VijayPande
Pande Group Member
Code:
DATE SERVER IP WHO STATUS CONNECT CPU LOAD NET LOAD DL WUs AVAIL WUs to go WUs WAIT
Sun May 29 17:05:10 PDT 2011 171.64.65.64 lin5 full Reject 0.64 44 31 135700 135700 135700
Sun May 29 20:00:10 PDT 2011 171.64.65.64 lin5 full Reject 0.7 40 44 135596 135596 135596
Mon May 30 12:00:10 PDT 2011 171.64.65.64 lin5 full Reject 1.02 26 36 135674 135674 135674
Tue May 31 08:05:10 PDT 2011 171.64.65.64 lin5 full Reject 0.74 52 36 135886 135886 135886
Tue May 31 11:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.59 44 36 135978 135978 135978
Tue May 31 19:10:10 PDT 2011 171.64.65.64 lin5 full Reject 0.66 44 29 135973 135973 135973
Wed Jun 1 10:30:10 PDT 2011 171.64.65.64 lin5 full Reject 0.67 37 25 135771 135771 135771
Wed Jun 1 14:35:11 PDT 2011 171.64.65.64 lin5 full Reject 0.76 37 31 135856 135856 135856
Thu Jun 2 01:45:10 PDT 2011 171.64.65.64 lin5 full Reject 0.62 45 33 135894 135894 135894
Thu Jun 2 07:10:10 PDT 2011 171.64.65.64 lin5 full Reject 0.83 54 25 136006 136006 136006
Thu Jun 2 12:25:10 PDT 2011 171.64.65.64 - full DOWN - - 34 - - -
Thu Jun 2 20:40:10 PDT 2011 171.64.65.64 lin5 full Reject 0.42 43 33 135845 135845 135845
Fri Jun 3 04:30:10 PDT 2011 171.64.65.64 lin5 full Reject 0.59 69 27 135852 135852 135852
Fri Jun 3 16:10:10 PDT 2011 171.64.65.64 lin5 full Reject 0.67 80 30 135683 135683 135683
Sun Jun 5 08:55:10 PDT 2011 171.64.65.64 lin5 full Reject 0.61 52 22 135881 135881 135881
Sun Jun 5 14:45:10 PDT 2011 171.64.65.64 lin5 full Reject 0.97 33 26 135884 135884 135884
Sun Jun 5 17:05:10 PDT 2011 171.64.65.64 lin5 full Reject 0.87 19 30 135759 135759 135759
Mon Jun 6 05:35:10 PDT 2011 171.64.65.64 lin5 full Reject 1.02 0 30 135671 135671 135671
Mon Jun 6 09:45:10 PDT 2011 171.64.65.64 lin5 full Reject 1.08 0 23 135712 135712 135712
Mon Jun 6 15:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.6 52 32 135820 135820 135820
Mon Jun 6 22:00:10 PDT 2011 171.64.65.64 lin5 full Reject 0.99 34 31 0 0 0
Tue Jun 7 19:15:10 PDT 2011 171.64.65.64 lin5 full Reject 0.86 45 32 135975 135975 135975
Tue Jun 7 21:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.62 25 25 135738 135738 135738
Wed Jun 8 13:40:10 PDT 2011 171.64.65.64 lin5 full Reject 0.68 43 40 135959 135959 135959
Wed Jun 8 17:45:10 PDT 2011 171.64.65.64 lin5 full Reject 1.19 40 37 135953 135953 135953
Thu Jun 9 03:50:10 PDT 2011 171.64.65.64 lin5 full Reject 1.04 41 30 135867 135867 135867
Thu Jun 9 12:00:10 PDT 2011 171.64.65.64 lin5 full Reject 0.97 39 25 0 0 0
Thu Jun 9 17:15:10 PDT 2011 171.64.65.64 lin5 full Reject 0.92 61 20 135880 135880 135880
Thu Jun 9 19:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.87 41 20 135973 135973 135973
Thu Jun 9 21:55:10 PDT 2011 171.64.65.64 lin5 full Reject 1.03 46 25 135907 135907 135907
Thu Jun 9 23:40:11 PDT 2011 171.64.65.64 lin5 full Reject 1 58 29 135722 135722 135722
Fri Jun 10 03:55:10 PDT 2011 171.64.65.64 lin5 full Reject 1.09 102 29 135833 135833 135833
Fri Jun 10 14:25:10 PDT 2011 171.64.65.64 lin5 full Reject 0.98 47 20 136055 136055 136055
Fri Jun 10 19:05:10 PDT 2011 171.64.65.64 lin5 full Reject 0.62 47 21 136026 136026 136026
Fri Jun 10 21:25:10 PDT 2011 171.64.65.64 lin5 full Reject 0.66 41 28 135969 135969 135969
Sat Jun 11 08:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.56 58 18 136001 136001 136001
Sat Jun 11 12:40:10 PDT 2011 171.64.65.64 lin5 full Reject 0.96 42 21 135974 135974 135974
Sat Jun 11 20:15:10 PDT 2011 171.64.65.64 lin5 full Reject 1.01 75 25 0 0 0
Sun Jun 12 04:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.85 53 18 135859 135859 135859
Sun Jun 12 12:10:10 PDT 2011 171.64.65.64 lin5 full Reject 0.84 50 23 136084 136084 136084
Sun Jun 12 13:55:10 PDT 2011 171.64.65.64 lin5 full Reject 0.36 50 23 135951 135951 135951
Sun Jun 12 23:15:10 PDT 2011 171.64.65.64 lin5 full Reject 1.09 0 28 135780 135780 135780
Mon Jun 13 01:00:10 PDT 2011 171.64.65.64 lin5 full Reject 0.73 63 17 136218 136218 136218
Mon Jun 13 05:40:10 PDT 2011 171.64.65.64 lin5 full Reject 1.03 0 39 135797 135797 135797
Mon Jun 13 08:35:10 PDT 2011 171.64.65.64 lin5 full Reject 0.92 45 39 135854 135854 135854
Mon Jun 13 15:00:10 PDT 2011 171.64.65.64 lin5 full Reject 0.65 61 26 136054 136054 136054
Tue Jun 14 02:55:10 PDT 2011 171.64.65.64 lin5 full Reject 0.75 56 15 136042 136042 136042
Wed Jun 15 03:05:10 PDT 2011 171.64.65.64 lin5 standby Not Accept 0.61 41 25 135992 135992 135992
Wed Jun 15 05:25:10 PDT 2011 171.64.65.64 lin5 full Reject 0.89 21 20 135759 135759 135759
Wed Jun 15 08:20:10 PDT 2011 171.64.65.64 lin5 full Reject 0.74 89 23 135805 135805 135805
Wed Jun 15 10:40:10 PDT 2011 171.64.65.64 lin5 full Reject 0.7 69 23 135897 135897 135897
Thu Jun 16 06:55:10 PDT 2011 171.64.65.64 lin5 full Reject 0.91 57 11 119531 119531 119531
Thu Jun 16 13:55:10 PDT 2011 171.64.65.64 lin5 full Reject 0.73 41 15 112457 112457 112457
Thu Jun 16 15:40:10 PDT 2011 171.64.65.64 lin5 full Reject 0.89 81 16 0 0 0
Fri Jun 17 07:40:10 PDT 2011 171.64.65.64 lin5 full Reject 0.83 28 28 99541 99541 99541
Fri Jun 17 11:10:10 PDT 2011 171.64.65.64 lin5 full Reject 0.55 66 28 98062 98062 98062
Sat Jun 18 10:45:11 PDT 2011 171.64.65.64 lin5 full Reject 0.49 50 14 96491 96491 96491
Sat Jun 18 14:20:10 PDT 2011 171.64.65.64 lin5 full Reject 1.02 134 16 0 0 0
Sat Jun 18 16:05:10 PDT 2011 171.64.65.64 lin5 full Reject 0.44 38 17 96379 96379 96379
Sat Jun 18 21:55:10 PDT 2011 171.64.65.64 lin5 full Reject 1.01 31 19 0 0 0
Sun Jun 19 04:00:11 PDT 2011 171.64.65.64 lin5 full Reject 1.06 36 14 96604 96604 96604
Sun Jun 19 11:00:10 PDT 2011 171.64.65.64 lin5 full Reject 0.49 53 13 96562 96562 96562
Mon Jun 20 02:55:10 PDT 2011 171.64.65.64 lin5 standby Not Accept 1.49 58 19 96788 96788 96788
Mon Jun 20 04:45:10 PDT 2011 171.64.65.64 lin5 full Reject 1.08 0 12 96622 96622 96622
Mon Jun 20 05:20:11 PDT 2011 171.64.65.64 lin5 full Reject 0.84 45 12 96904 96904 96904
Mon Jun 20 07:05:11 PDT 2011 171.64.65.64 lin5 full Reject 0.97 305 12 0 0 0
Mon Jun 20 10:00:10 PDT 2011 171.64.65.64 lin5 full Reject 1 0 17 96553 96553 96553
Mon Jun 20 10:35:11 PDT 2011 171.64.65.64 lin5 full Reject 0.61 45 17 96740 96740 96740
Mon Jun 20 18:10:10 PDT 2011 171.64.65.64 lin5 full Reject 2.61 0 33 96348 96348 96348
Mon Jun 20 18:45:11 PDT 2011 171.64.65.64 lin5 full Reject 2.6 0 33 96348 96348 96348
Mon Jun 20 19:20:10 PDT 2011 171.64.65.64 lin5 full Reject 2.65 0 33 96348 96348 96348
Mon Jun 20 19:55:10 PDT 2011 171.64.65.64 lin5 full Reject 1.67 971 1 0 0 0
Mon Jun 20 22:15:10 PDT 2011 171.64.65.64 lin5 full Reject 0.31 68 1 96951 96951 96951
Tue Jun 21 14:20:10 PDT 2011 171.64.65.64 lin5 full Reject 2.02 479 9 0 0 0
Tue Jun 21 16:05:10 PDT 2011 171.64.65.64 lin5 full Reject 2.14 212 7 0 0 0
Wed Jun 22 09:05:10 PDT 2011 171.64.65.64 lin5 full Reject 1.92 0 6 96429 96429 96429
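In case anyone wants to reproduce the extract above, here is a minimal sketch of the kind of filter involved, assuming the status table has been saved as a plain-text file; the file name and the simple "does the line mention Accepting" test are my own assumptions, not an official tool.

Code:
import sys

# Minimal sketch: filter a plain-text dump of the server status log, keeping
# only the snapshots where the CONNECT status is anything but "Accepting".
# The file name "serverstat.txt" and one-snapshot-per-line layout are assumptions.
SERVER_IP = "171.64.65.64"

def non_accepting_lines(path):
    """Yield log lines for SERVER_IP whose CONNECT status is not 'Accepting'."""
    with open(path) as log:
        for line in log:
            line = line.strip()
            if SERVER_IP in line and "Accepting" not in line:
                yield line

if __name__ == "__main__":
    hits = list(non_accepting_lines(sys.argv[1] if len(sys.argv) > 1 else "serverstat.txt"))
    print(f"{len(hits)} snapshots with CONNECT != Accepting")
    for entry in hits:
        print(entry)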
Re: 171.64.65.64 overloaded
The Pande Group is aware of the problem. The ultimate fix is complex and will probably not happen quickly.
I don't think that taking the server off-line is what anybody wants to do.
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 270
- Joined: Sun Dec 02, 2007 2:26 pm
- Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+ - Location: Belgium, near the International Sea-Port of Antwerp
Re: 171.64.65.64 overloaded
Where is the redundancy?
- stopped Linux SMP w. HT on [email protected] GHz
....................................
Folded since 10-06-04 till 09-2010
-
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: 171.64.65.64 overloaded
We've continued to have trouble with the CS code, and Joe is in the process of overhauling it. The current code works well under medium loads but doesn't scale well; Joe's new scheme greatly simplifies how the CS works to help it scale better. He has been working on this for the last few weeks, which has slowed down his v7 client work, but in my mind this is a very high priority for situations like this one.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
-
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: 171.64.65.64 overloaded
PS: I've turned this server's weight down to try to help balance it with the other GPU servers.
Also, I've emailed Dr. Lin and asked her to push to get new projects going on a new server that has been assigned to her projects. That new server is much more powerful, so it can take a much greater load.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
Re: 171.64.65.64 overloaded
noorman wrote: Where is the redundancy?
There are three different answers to your question, depending on which problem you're referring to. The title of the topic says "overloaded," but a few posts up GreyWhiskers asked about the server status being something other than Accepting, and that's an entirely different question.
Redundancy for downloading new WUs comes from the Assignment Server sending requests for new work to another Work Server. This concept works fine when one work server is down or is out of work. In that regard, it's better for the server to take itself off-line rather than to be overloaded. You'll notice that there are other GPU servers, so that doesn't seem to be a problem. Dr. Pande's statement about adjusting the weights is important, too. This helps balance the load when there are several Work Servers with WUs available (helping to keep any one of them from being overloaded unless they're all overloaded). Adding more projects to a different Work Server helps, too, but can't be done as quickly.
When WUs are finished, they need to find their way back to the same server that assigned them. If that server is overloaded with uploads or is off-line, the upload redundancy comes from the Collection Servers. That's an entirely different question from redundancy for downloads, but it is clearly being worked on. Although many people may gripe about it, it's not really a problem from an individual's point of view, since there is no QRB (yet) for GPUs and all clients manage un-uploaded WUs by retrying later. Yes, this does delay the science, but that's a problem for the Pande Group, not for you. Please recognize that they need to address it based on PG priorities, not based on donor opinions.
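To illustrate the download-side idea described above (the Assignment Server steering clients toward Work Servers that are up and still have WUs, biased by their weights), here is a rough, hypothetical sketch. It is not the real Assignment Server code; the server entries, weights, and counts are invented for illustration only.

Code:
import random

# Hypothetical sketch of weighted work-server assignment with fallback.
# NOT the real Assignment Server code; entries, weights, and counts are invented.
work_servers = [
    {"name": "171.64.65.64",       "weight": 1, "accepting": False, "wus_avail": 96429},
    {"name": "WS-2 (placeholder)", "weight": 3, "accepting": True,  "wus_avail": 50000},
    {"name": "WS-3 (placeholder)", "weight": 2, "accepting": True,  "wus_avail": 12000},
]

def assign_work_server(servers):
    """Pick an accepting server that still has WUs, weighted by its assignment weight."""
    candidates = [s for s in servers if s["accepting"] and s["wus_avail"] > 0]
    if not candidates:
        return None  # nothing available right now; the client simply retries later
    weights = [s["weight"] for s in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

chosen = assign_work_server(work_servers)
print("assigned to:", chosen["name"] if chosen else "no server available, retry later")

Lowering one server's weight in a scheme like this is exactly the kind of adjustment that shifts more of the new assignments toward the other servers.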
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 270
- Joined: Sun Dec 02, 2007 2:26 pm
- Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+ - Location: Belgium, near the International Sea-Port of Antwerp
Re: 171.64.65.64 overloaded
.
Sorry, it delays the science, but it also leaves lots of systems sitting idle ...
That doesn't help anyone and is very inefficient (costs for the donors that bring nothing to anyone).
.
- stopped Linux SMP w. HT on [email protected] GHz
....................................
Folded since 10-06-04 till 09-2010
Re: 171.64.65.64 overloaded
No, it does NOT leave systems idle as long as there are other servers with WUs to assign -- and that appears to be the case.
Which of the three problems I mentioned are you having?
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 270
- Joined: Sun Dec 02, 2007 2:26 pm
- Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+ - Location: Belgium, near the International Sea-Port of Antwerp
Re: 171.64.65.64 overloaded
.
I'm not having trouble (I stopped F@H because energy prices are too high over here, plus my financial situation); my friend is.
He's not getting WUs and is unable to send back results on an almost regular basis.
That is a consequence of the overload (and of the server being in REJECT).
.
- stopped Linux SMP w. HT on [email protected] GHz
....................................
Folded since 10-06-04 till 09-2010
-
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: 171.64.65.64 overloaded
Sorry to hear you're still having problems. The server is not in REJECT and has been working pretty well today. Right now it only has 23 connections, which is pretty low. I wonder if there's something else going on other than the machine being loaded? Any chance it's an issue with your friend's ISP? At times, we see weird things with ISPs.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
-
- Posts: 660
- Joined: Mon Oct 25, 2010 5:57 am
- Hardware configuration: a) Main unit
Sandybridge in HAF922 w/200 mm side fan
--i7 [email protected] GHz
--ASUS P8P67 DeluxeB3
--4GB ADATA 1600 RAM
--750W Corsair PS
--2Seagate Hyb 750&500 GB--WD Caviar Black 1TB
--EVGA 660GTX-Ti FTW - Signature 2 GPU@ 1241 Boost
--MSI GTX560Ti @900MHz
--Win7Home64; FAH V7.3.2; 327.23 drivers
b) 2004 HP a475c desktop, 1 core Pent 4 [email protected] GHz; Mem 2GB;HDD 160 GB;Zotac GT430PCI@900 MHz
WinXP SP3-32 FAH v7.3.6 301.42 drivers - GPU slot only
c) 2005 Toshiba M45-S551 laptop w/2 GB mem, 160GB HDD;Pent M 740 CPU @ 1.73 GHz
WinXP SP3-32 FAH v7.3.6 [Receiving Core A4 work units]
d) 2011 lappy-15.6"-1920x1080;i7-2860QM,2.5;IC Diamond Thermal Compound;GTX 560M 1,536MB u/c@700;16GB-1333MHz RAM;HDD:500GBHyb w/ 4GB SSD;Win7HomePrem64;320.18 drivers FAH 7.4.2ß - Location: Saratoga, California USA
Re: 171.64.65.64 overloaded
I wanted to toss a few numbers out to show that, while the server in question has had a lot of downtime, the effect on my folding over the last month has been minimal. Hats off to PG and to the flexibility of the system.
I'm running one GTX 560Ti, still on v6, so I can, and do, track all of the WU-by-WU stats in the HFM WU history log. I typically run DatAdmin 3 to export the MySQL DB to a CSV file for massaging with Excel.
Bottom line: during the month of June, I have 9 instances out of 226 completed GPU WUs where the "turnaround time" (the time from the completion of one WU until the start of the next) was anomalous.
158 of the 226 June WUs were P6801, which is served by 171.64.65.64. In the last couple of weeks, more and more of the WUs have come from other projects.
96% of my WUs in June turned around in 10-30 seconds. The 9 anomalous instances range from 1:02 (mm:ss) to 17:35, with one outlier at over 4 hours. This is also the period in which so many of the 171.64.65.64 REJECT periods occurred.
The anomalous turnarounds were mostly correlated either with 171.64.65.64 REJECT periods or with heavy "WU Received" periods, as reported in the Server Stats. And some of those WU RCV loads have been heavy - see the charts at the bottom of this post.
The outlier occurred during one of the server REJECT periods. I can't remember exactly, but I may have stopped GPU folding for a while to play with my SMP settings.
My 96% quick-turnaround figure could have been a little better under v7, since v6 won't attempt to get a new WU until after several failed attempts to upload the just-completed one. v7 separates uploads from downloads, so if the assignment server knows not to assign me to the downed work server, I could possibly pick up a new WU sooner.
Bottom line, for me at least: good job, PG. The SYSTEM seems to be working well. Once I looked at the actual data, I was surprised how good the overall turnaround for my series of Core 16 Nvidia GPU projects was.
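For anyone curious, the gap calculation itself is simple once the history is exported to CSV. Here is a rough Python sketch of what the spreadsheet does; the file name, column names, and timestamp format are assumptions and would need to be adjusted to match the actual HFM export.

Code:
import csv
from datetime import datetime, timedelta

# Rough sketch of the finish-to-start gap calculation on a CSV export of the
# WU history. File name, column names ("CompletedTime", "DownloadTime"), and
# timestamp format are assumptions; adjust to the real export.
TS_FMT = "%Y-%m-%d %H:%M:%S"

def turnaround_gaps(path, threshold=timedelta(minutes=1)):
    """Return (prev_finish, next_start, gap) tuples for gaps larger than threshold."""
    with open(path, newline="") as f:
        rows = sorted(csv.DictReader(f), key=lambda r: r["DownloadTime"])
    anomalies = []
    for prev, cur in zip(rows, rows[1:]):
        finished = datetime.strptime(prev["CompletedTime"], TS_FMT)
        started = datetime.strptime(cur["DownloadTime"], TS_FMT)
        gap = started - finished
        if gap > threshold:
            anomalies.append((finished, started, gap))
    return anomalies

for finished, started, gap in turnaround_gaps("wu_history.csv"):
    print(f"{finished} -> {started}  gap {gap}")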
Code:
"anomalous" finish-to-start times
hh:mm:ss
00:05:09
00:04:53
00:13:04
00:12:34
00:01:02
04:13:30
00:01:59
00:02:36
00:17:25
While I was looking at the server stats, I ran a couple of Excel spreadsheets and charts. These two charts show how busy this server has been. No deep message or analysis here, just some interesting collateral information. The timescale is each individual half-hour update to the stats page as of a couple of days ago, when I pulled it: http://fah-web.stanford.edu/logs/171.64.65.64.log.html
NETLOAD tells how busy the server is, as measured by netstat (i.e., how many current connections the server is handling); see the small sketch after the charts for what such a count measures. Too many connections means that the server is heavily loaded. How many are "too many" depends on the server, but most of our servers can now handle a couple hundred connections without a problem. (Note: log scale.)
WUS RCVD shows how many WUs have been received since the last time the server's WUs were updated into the stats. This shows the relative number of WUs being received on the different servers (if all is fine), or which servers are not being entered into the stats DB if there is some problem. (Note: linear scale.)
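And just to make the NETLOAD column concrete: it is essentially a connection count of the sort netstat reports. The tiny sketch below is illustrative only (it is not how the stats page itself is generated) and assumes a Linux box with netstat installed.

Code:
import subprocess

# Illustrative only: count established TCP connections involving a given IP,
# roughly the kind of number a NETLOAD-style column reports. Not how the
# stats page is actually produced; it just shows what a connection count means.
SERVER_IP = "171.64.65.64"

def connection_count(ip):
    """Count ESTABLISHED connections to/from `ip` as reported by `netstat -tn`."""
    out = subprocess.run(["netstat", "-tn"], capture_output=True, text=True).stdout
    return sum(1 for line in out.splitlines()
               if ip in line and "ESTABLISHED" in line)

print(f"{SERVER_IP}: {connection_count(SERVER_IP)} established connections")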
-
- Posts: 270
- Joined: Sun Dec 02, 2007 2:26 pm
- Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+ - Location: Belgium, near the International Sea-Port of Antwerp
Re: 171.64.65.64 overloaded
VijayPande wrote: Sorry to hear you're still having problems. The server is not in REJECT and has been working pretty well today. Right now it only has 23 connections, which is pretty low. I wonder if there's something else going on other than the machine being loaded? Any chance it's an issue with your friend's ISP? At times, we see weird things with ISPs.
I'm quite sure my friend is perfectly capable of distinguishing local network (or ISP) problems from server problems.
He has many systems running, all with multiple-GPU setups.
He'd just like his systems to run F@H without near-regular interruptions because a finished WU (results) cannot be returned, because there is no new Work available, or because the server is overloaded.
This server in particular, serving Fermi Work, has to be a heavy-duty system because of the very fast turnaround of the Work coming from it.
If, in the future, Fermi hardware is used more fully, the rate of returned Work might increase further, which would load that server even more than it is already.
Another point: I'm sure a network or ISP problem at my friend's end could not explain a 'low' connection count on the named server ...
.
- stopped Linux SMP w. HT on [email protected] GHz
....................................
Folded since 10-06-04 till 09-2010
Re: 171.64.65.64 overloaded
noorman wrote: I'm sure a network or ISP problem at my friend's end could not explain a 'low' connection count on the named server ...
Minor technicality: 171.64.65.64 is NOT a named server as far as FAH is concerned. Yes, the server has a name, but FAH does not reference DNS; it uses the IP address.
GreyWhiskers wrote: WUS RCVD shows how many WUs have been received since the last time the server's WUs were updated into the stats. This shows the relative number of WUs being received on the different servers (if all is fine), or which servers are not being entered into the stats DB if there is some problem.
Minor question: how do you account for the WUs Received count being reset when the stats are uploaded to the stats server?
Posting FAH's log:
How to provide enough info to get helpful support.