Could not connect with OpenMM work servers for 15 hours
Posted: Thu May 07, 2020 12:55 pm
I've been consistently assigned GPU work units for the last 5 days. Yesterday morning there was a string of server connection errors. The delay between attempts was throttled up to one hour, and I could not get any GPU work unit for the next 15 hours.
Tried pause/unpause to no avail, then I restarted FAHClient and the same work servers started functioning again.
I wonder if that's the expected behaviour during high load?
I've checked the logs and there were 24 cases of 'Failed to get assignment', which seems normal to me. However, the other 28 attempts resulted in an assignment to one of a few work servers, and three of those servers dropped almost all connections with some random error.
Here is the summary table that shows affected work servers and their error messages as a CSV text file.
Code: Select all
10:07:33 128.252.203.10 orkney.seas.wustl.edu "Exception: 10002: Received short response, expected 512 bytes, got 0"
18:43:07 128.252.203.10 orkney.seas.wustl.edu "Exception: 10002: Received short response, expected 512 bytes, got 0"
06:11:24 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond."
11:07:55 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because..."
13:07:33 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because..."
16:07:34 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because..."
18:07:35 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because..."
19:38:31 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because..."
21:12:32 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because..."
06:51:32 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: No connection could be made because..."
06:37:18 128.252.203.10 orkney.seas.wustl.edu "Exception: Server did not assign work unit"
19:13:43 128.252.203.10 orkney.seas.wustl.edu "Exception: Server did not assign work unit"
19:27:26 128.252.203.10 orkney.seas.wustl.edu "Exception: Transfer failed"
21:03:04 128.252.203.10 orkney.seas.wustl.edu "Receive error: 10053: An established connection was aborted by the software in your host machine."
06:22:30 128.252.203.10 orkney.seas.wustl.edu "Received short response, expected 512 bytes, got 0"
06:04:55 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"
06:07:11 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"
08:07:34 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"
09:07:33 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"
20:25:30 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"
21:00:27 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"
18:41:52 3.133.76.19 aws1.foldingathome.org "Exception: Failed to connect: A connection attempt failed because..."
19:10:33 3.133.76.19 aws1.foldingathome.org "Exception: Failed to connect: A connection attempt failed because..."
12:07:33 13.82.98.119 fah3.eastus.cloudapp.azure.com "Exception: Server did not assign work unit"
12:07:33 13.82.98.119 fah3.eastus.cloudapp.azure.com "Exception: Server did not assign work unit"
15:07:34 13.82.98.119 fah3.eastus.cloudapp.azure.com "Exception: Server did not assign work unit"
06:08:48 52.224.109.74 fah4.eastus.cloudapp.azure.com "Exception: Server did not assign work unit"
17:07:35 52.224.109.74 fah4.eastus.cloudapp.azure.com "Exception: Server did not assign work unit"
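In case anyone wants to tally a table like this themselves, here is a rough Python sketch that counts errors per work server and error type. It assumes each row looks like the ones above (time, IP, hostname, then the message in double quotes). The function names `parse_row`, `error_kind` and `summarize` are my own and have nothing to do with FAHClient, and the way the messages are grouped is just one possible choice.

```python
import re
from collections import Counter

# Assumed row layout, copied from the table above:
#   HH:MM:SS  IP  hostname  "message"
ROW = re.compile(r'^(\d{2}:\d{2}:\d{2})\s+(\S+)\s+(\S+)\s+"(.*)"$')


def parse_row(line):
    """Split one summary row into its fields, or return None if it does not match."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    time, ip, host, message = m.groups()
    return {"time": time, "ip": ip, "host": host, "message": message}


def error_kind(message):
    """Reduce a message to a short label so similar errors group together."""
    text = re.sub(r'^Exception:\s*', '', message)  # drop the "Exception: " prefix
    text = re.sub(r'^\d+:\s*', '', text)           # drop a leading numeric code
    return text.split(':', 1)[0].strip()           # keep the part before any detail


def summarize(lines):
    """Count rows per (hostname, error label); rows that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        row = parse_row(line)
        if row is not None:
            counts[(row["host"], error_kind(row["message"]))] += 1
    return counts


if __name__ == "__main__":
    # A few rows copied from the table above, for illustration only.
    sample = [
        '10:07:33 128.252.203.10 orkney.seas.wustl.edu "Exception: 10002: Received short response, expected 512 bytes, got 0"',
        '11:07:55 128.252.203.10 orkney.seas.wustl.edu "Exception: Failed to connect: A connection attempt failed because..."',
        '06:04:55 140.163.4.231 plfah1-1.mskcc.org "Exception: 10002: Received short response, expected 512 bytes, got 0"',
        '12:07:33 13.82.98.119 fah3.eastus.cloudapp.azure.com "Exception: Server did not assign work unit"',
    ]
    for (host, kind), n in summarize(sample).most_common():
        print(f"{n:3d}  {host}  {kind}")
```

Feeding it the whole table instead of `sample` should make it easier to see which servers account for most of the failures.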