140.163.4.200

JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: 140.163.4.200

Post by JohnChodera »

Folks: The new work server (pllwskifah1.mskcc.org) ended up in a weird state that was not receiving WUs even though the WS appeared to be running normally. We've restarted it, and it's now receiving the backlog of results.

Please let us know if you notice this happening again! We'll also try to keep a close eye on it and try to figure out what went wrong here.

Apologies for this---it might be the new big NFS storage we mounted on the WS to attempt to avoid out-of-space issues.

~ John Chodera // MSKCC
rickoic
Posts: 320
Joined: Sat May 23, 2009 4:49 pm
Hardware configuration: eVga x299 DARK 2070 Super, eVGA 2080, eVga 1070, eVga 2080 Super
MSI x399 eVga 2080, eVga 1070, eVga 1070, GT970
Location: Mississippi near Memphis, Tn

Re: 140.163.4.200

Post by rickoic »

My backlog is slowly clearing. I had 7 and now it's down to 3, so progress is being made. Thanks a lot for the fix.
I'm folding because Dec 2005 I had radical prostate surgery.
Lost brother to spinal cancer, brother-in-law to prostate cancer.
Several 1st cousins lost and a few who have survived.
mgetz
Posts: 57
Joined: Tue Aug 11, 2020 6:23 pm

Re: 140.163.4.200

Post by mgetz »

JohnChodera wrote: Please let us know if you notice this happening again! We'll also try to keep a close eye on it and try to figure out what went wrong here.
Can we keep it at zero weight through the weekend unless someone is going to actively keep an eye on it? I'd rather not have my GPUs idled for two days if possible (the science must compute!).
rickoic
Posts: 320
Joined: Sat May 23, 2009 4:49 pm
Hardware configuration: eVga x299 DARK 2070 Super, eVGA 2080, eVga 1070, eVga 2080 Super
MSI x399 eVga 2080, eVga 1070, eVga 1070, GT970
Location: Mississippi near Memphis, Tn

Re: 140.163.4.200

Post by rickoic »

Spoke too soon. This just happened a few minutes ago.

Edit: this problem resolved itself a few minutes later. Just slow.

15:40:05:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:40:05:WU04:FS01:Assigned to work server 140.163.4.200
15:40:05:WU04:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:40:05:WU04:FS01:Connecting to 140.163.4.200:8080
15:40:26:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:40:26:WU04:FS01:Connecting to 140.163.4.200:80
15:40:48:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
15:40:48:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:40:48:WU04:FS01:Assigned to work server 140.163.4.200
15:40:48:WU04:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:40:48:WU04:FS01:Connecting to 140.163.4.200:8080
15:41:09:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:41:09:WU04:FS01:Connecting to 140.163.4.200:80
15:41:31:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
15:41:47:WU02:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
15:41:47:WU02:FS01:0x22:Average performance: 83.8835 ns/day
15:41:48:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:41:48:WU04:FS01:Assigned to work server 140.163.4.200
15:41:48:WU04:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:41:48:WU04:FS01:Connecting to 140.163.4.200:8080
15:41:54:WU02:FS01:0x22:Saving result file ..\logfile_01.txt
15:41:54:WU02:FS01:0x22:Saving result file checkpointState.xml.bz2
15:41:55:WU02:FS01:0x22:Saving result file globals.csv
15:41:55:WU02:FS01:0x22:Saving result file positions.xtc
15:41:55:WU02:FS01:0x22:Saving result file science.log
15:41:55:WU02:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
15:41:56:WU02:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
15:41:56:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:13426 run:1456 clone:20 gen:4 core:0x22 unit:0x0000000812bc7d9a5f57207fe28d1881
15:41:56:WU02:FS01:Uploading 5.70MiB to 18.188.125.154
15:41:56:WU02:FS01:Connecting to 18.188.125.154:8080
15:42:02:WU02:FS01:Upload 55.94%
15:42:07:WU02:FS01:Upload complete
15:42:07:WU02:FS01:Server responded WORK_ACK (400)
15:42:07:WU02:FS01:Final credit estimate, 176071.00 points
15:42:07:WU02:FS01:Cleaning up
15:42:09:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:42:09:WU04:FS01:Connecting to 140.163.4.200:80
15:42:31:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
15:43:25:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:43:26:WU04:FS01:Assigned to work server 140.163.4.200
15:43:26:WU04:FS01:Requesting new work unit for slot 01: READY gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:43:26:WU04:FS01:Connecting to 140.163.4.200:8080
15:43:47:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:43:47:WU04:FS01:Connecting to 140.163.4.200:80
15:44:08:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
15:46:02:WU04:FS01:Connecting to assign1.foldingathome.org:80
15:46:02:WU04:FS01:Assigned to work server 140.163.4.200
15:46:03:WU04:FS01:Requesting new work unit for slot 01: READY gpu:0:TU104 [GeForce RTX 2070 SUPER] from 140.163.4.200
15:46:03:WU04:FS01:Connecting to 140.163.4.200:8080
15:46:24:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80
15:46:24:WU04:FS01:Connecting to 140.163.4.200:80
15:46:45:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
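To tally failures like the ones in the log above, a minimal sketch along these lines may help. It assumes the log line layout shown in this post (`HH:MM:SS:ERROR:WUnn:FSnn:Exception: Failed to connect to host:port`); the function name `count_failed_connections` and the abbreviated `sample` lines are made up for illustration and are not part of the FAH client.

```python
import re
from collections import Counter

# Matches lines such as:
# 15:40:48:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: ...
FAILED = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):ERROR:(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"Exception: Failed to connect to (?P<host>[\d.]+):(?P<port>\d+)"
)


def count_failed_connections(lines):
    """Return a Counter mapping (host, port) to the number of failed connects."""
    failures = Counter()
    for line in lines:
        match = FAILED.match(line.strip())
        if match:
            failures[(match["host"], int(match["port"]))] += 1
    return failures


# Abbreviated sample lines in the same layout as the log above (illustrative only).
sample = [
    "15:40:26:WARNING:WU04:FS01:WorkServer connection failed on port 8080 trying 80",
    "15:40:48:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed",
    "15:41:31:ERROR:WU04:FS01:Exception: Failed to connect to 140.163.4.200:80: A connection attempt failed",
    "15:42:07:WU02:FS01:Upload complete",
]

print(count_failed_connections(sample))
```

Only the `ERROR` lines are counted; the `WARNING` line about falling back from port 8080 to 80 is skipped on purpose, since it reports a retry rather than a final failure.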
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: 140.163.4.200

Post by JohnChodera »

Looks like the server ended up not accepting 80/8080 again. We're going to keep it on weight 0 for a while to monitor.

~ John Chodera // MSKCC
mgetz
Posts: 57
Joined: Tue Aug 11, 2020 6:23 pm

Re: 140.163.4.200

Post by mgetz »

JohnChodera wrote: Looks like the server ended up not accepting 80/8080 again. We're going to keep it on weight 0 for a while to monitor.
I have two WUs from it right now:
13436 (run 22, clone 5, gen 2)
13433 (run 63, clone 0, gen 2): completed successfully with no retries, 157.664 ns/day

I'll report back when they finish on whether they upload or not.
LazyDev
Posts: 13
Joined: Tue Aug 30, 2016 7:28 pm

Re: 140.163.4.200

Post by LazyDev »

My two work units have since been uploaded. Thanks for fixing this.
mgetz
Posts: 57
Joined: Tue Aug 11, 2020 6:23 pm

Re: 140.163.4.200

Post by mgetz »

project:13436 run:22 clone:5 gen:2 core:0x22 did upload, but it took forever. Something is seriously messed up with that server.
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: 140.163.4.200

Post by JohnChodera »

Update: it looks like the issue is with an underperforming NFS mount. We're investigating.

Thanks for your patience!

~ John Chodera // MSKCC
hhherby
Posts: 14
Joined: Thu Jan 05, 2017 9:30 pm

Re: 140.163.4.200

Post by hhherby »

I'm seeing a super slow connection to this server that keeps timing out.
hhherby
Posts: 14
Joined: Thu Jan 05, 2017 9:30 pm

Re: 140.163.4.200

Post by hhherby »

Can anyone even ping this server?

Pinging 140.163.4.200 with 32 bytes of data:
Request timed out.
Request timed out.
Request timed out.
Request timed out.
Joe_H
Site Admin
Posts: 7939
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 140.163.4.200

Post by Joe_H »

The server is behind the MSKCC firewall, which blocks pings. If you want to check whether the server is up, just enter the IP address into a browser window.
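The same check can be scripted. Since pings are blocked, the rough sketch below tries a plain TCP connection instead, on ports 8080 and 80 (the two ports the client log earlier in this thread shows being tried). The helper names `port_is_open` and `check_server` and the 5-second timeout are assumptions for illustration only, not anything from the FAH client.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_server(host, ports=(8080, 80), timeout=5.0):
    """Return a dict mapping each port to whether a TCP connection succeeded."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

Calling `check_server("140.163.4.200")` returns a dict with one True/False entry per port. A successful connection only shows that something is listening; it says nothing about whether the server is actually handing out or accepting work.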

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
TristanChen
Posts: 21
Joined: Tue May 30, 2017 4:55 am

Re: 140.163.4.200

Post by TristanChen »

Going to vent here a bit. The collection server (140.163.4.210) tied to this work server has been barely functional for half of December and is still 90% dead today.

I've got no fewer than 20 completed work units, some of them days old with 100+ retries, still waiting for the damned server to fix itself.

Can't admins at least set up some sort of redirect?! If 30% of my daily output is just going to be flushed down the drain anyway, then I might as well be running Nicehash...
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 140.163.4.200

Post by Neil-B »

It still happens a bit .. but it's better than April to June last year .. worth posting here, as the core team can pass the message on to the people who look after each impacted server .. over weekends/holidays issues can be more noticeable, and some of the servers are in different timezones where getting responses can be trickier
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: 140.163.4.200

Post by PantherX »

FYI, the CS 140.163.4.210 has an uptime of about 1 hour, so it was recently rebooted. I am aware that work is being done on it to improve certain aspects.

BTW, redirection will not work with the current setup. The WU will try to reach either the WS or the CS (if one is defined), and both are fixed at the time the client downloads the WU. There's no way to dynamically update that information on the WU end.
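As a toy illustration of why a redirect can't be bolted on afterwards, here is a minimal sketch that models a WU carrying its upload destinations from the moment of download. The `WorkUnit` class, its field names, and `upload_targets` are hypothetical and are not the client's real data structures; the WS-then-CS ordering is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: destinations cannot be changed after creation
class WorkUnit:
    project: int
    work_server: str                          # WS recorded at download time
    collection_server: Optional[str] = None   # CS, only if one was defined


def upload_targets(wu):
    """Return the servers a result may be sent to, WS first, then CS if defined."""
    targets = [wu.work_server]
    if wu.collection_server is not None:
        targets.append(wu.collection_server)
    return targets


# Addresses taken from this thread; the pairing is for illustration only.
wu = WorkUnit(project=13436, work_server="140.163.4.200",
              collection_server="140.163.4.210")
print(upload_targets(wu))
```

In this model, the only way to send a result somewhere else would be to change the record held by the client, which is exactly the dynamic update that isn't available.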
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues