[Bug] Client got stuck after initiating connection to WS
Posted: Sat Apr 18, 2020 4:30 pm
My FAHClient has been stuck since 21:32:00 UTC, yesterday (April 17, 2020).
The last lines in the log, reproduced below, show the client connecting to the work server at 13.90.152.57:8080.
Upon inspection, this connection is still in the ESTABLISHED state some 15 hours later (see the socket listing below).
My theory is that this connection is no longer actually alive: it died silently (without an RST) while the client was waiting for a response, which it will now never receive.
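To illustrate the theory, here is a generic Python sketch of the usual defence against a silently dead peer: a read timeout plus TCP keepalive. This is not FAHClient's actual networking code, which I have not looked at; the function names and the timeout values are made up for illustration.

```python
import socket

def open_guarded_socket(host, port, read_timeout=30.0):
    """Connect with a read timeout and TCP keepalive enabled (illustrative helper)."""
    # The timeout passed here also applies to later recv() calls on this socket.
    sock = socket.create_connection((host, port), timeout=read_timeout)
    # Ask the kernel to probe an idle connection so a dead peer is eventually noticed.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Keepalive tuning options are platform-specific; set only the ones that exist.
    for name, value in (("TCP_KEEPIDLE", 60), ("TCP_KEEPINTVL", 10), ("TCP_KEEPCNT", 5)):
        if hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)
    return sock

def read_response(sock, nbytes=4096):
    """Return received bytes, or None if the peer stayed silent past the timeout."""
    try:
        return sock.recv(nbytes)  # b"" would mean the peer closed the connection
    except socket.timeout:
        return None
```

Without either safeguard, a blocking read on a connection that vanished mid-request has nothing to wake it up, which would match what I am seeing.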
I could not clear this condition by pausing/unpausing, or using the request-id or request-ws commands. request-ws did connect to an AS and get an "Assigned to work server 150.136.14.110" message, but nothing else happened after that. The original socket to 13.90.152.57:8080 remained open in the ESTABLISHED state throughout these attempts to jog the client.
The FAHControl UI continued to display 13.90.152.57 as the work server, with no next attempt. The full queue-info output is at the end of this post.
I had to restart to get it folding again. Sending SIGINT once did not close the client; the 13.90.152.57:8080 socket remained open. I had to send it again to force exit.
The last lines in the log were as follows:
21:32:00:WU00:FS00:Connecting to 18.218.241.186:80
21:32:00:WU00:FS00:Assigned to work server 13.90.152.57
21:32:00:WU00:FS00:Requesting new work unit for slot 00: READY cpu:24 from 13.90.152.57
21:32:00:WU00:FS00:Connecting to 13.90.152.57:8080
Socket listing for that connection, some 15 hours later:
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
tcp ESTAB 0 0 10.176.100.154:38750 13.90.152.57:8080 users:(("FAHClient",pid=30164,fd=12))
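If I am reading the listing right, both Recv-Q and Send-Q are 0, so the kernel has no unacknowledged data to retransmit and, unless keepalive is enabled, nothing that would ever make it notice a dead peer. A small Python sketch for picking such sockets out of a listing in this format (the helper names are hypothetical, and the parser only handles the column layout shown above):

```python
def parse_ss_line(line):
    """Parse one data row of the listing into a dict; return None for other lines."""
    fields = line.split()
    # Data rows have numeric queue sizes in columns 3 and 4; the header does not.
    if len(fields) < 6 or not (fields[2].isdigit() and fields[3].isdigit()):
        return None
    return {
        "netid": fields[0],
        "state": fields[1],
        "recv_q": int(fields[2]),
        "send_q": int(fields[3]),
        "local": fields[4],
        "peer": fields[5],
        "process": " ".join(fields[6:]),
    }

def is_idle_established(row):
    """True for an ESTAB socket with nothing queued in either direction."""
    return (
        row is not None
        and row["state"] == "ESTAB"
        and row["recv_q"] == 0
        and row["send_q"] == 0
    )
```

An idle ESTAB socket is normal on its own; it only becomes suspicious when it stays that way for hours while the client is supposedly mid-request.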
Here's the full queue-info output:
{"id": "00", "state": "DOWNLOAD", "error": "NO_ERROR", "project": 0, "run": 0, "clone": 0, "gen": 0, "core": "unknown", "unit": "0x00000000000000000000000000000000", "percentdone": "0.00%", "eta": "0.00 secs", "ppd": "0", "creditestimate": "0", "waitingon": "", "nextattempt": "0.00 secs", "timeremaining": "unknown time", "totalframes": 0, "framesdone": 0, "assigned": "<invalid>", "timeout": "<invalid>", "deadline": "<invalid>", "ws": "13.90.152.57", "cs": "0.0.0.0", "attempts": 0, "slot": "00", "tpf": "0.00 secs", "basecredit": "0"}
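For anyone who wants to watch for this condition, here is a rough Python heuristic based only on the fields visible in the entry above. The function names are hypothetical, and a slot that has just started a normal download will presumably look the same for a moment, so this is only meaningful if the same result persists across checks several minutes apart.

```python
import json

def looks_stalled(entry):
    """Heuristic: slot is in DOWNLOAD with no unit assigned and no retry scheduled."""
    return (
        entry.get("state") == "DOWNLOAD"
        and int(entry.get("unit", "0x0"), 16) == 0  # all-zero unit id
        and entry.get("attempts") == 0
        and entry.get("nextattempt") == "0.00 secs"
    )

def stalled_slots(queue_info_json):
    """Return the slot ids of entries that match the heuristic.

    Accepts either a single JSON object or a JSON list of objects.
    """
    data = json.loads(queue_info_json)
    entries = data if isinstance(data, list) else [data]
    return [e.get("slot") for e in entries if looks_stalled(e)]
```

Feeding it the queue-info text captured at two different times and comparing the results would be one way to tell a stuck slot from one that is merely busy.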