128.59.74.4 in Reject

Posted: Tue Feb 24, 2009 2:54 pm
by Mactin
I have two WUs trying to upload results to 128.59.74.4, which is in Reject mode. The associated CS will not accept them either. I have three more coming due this morning for the same server.

Project: 3856 (Run 240, Clone 0, Gen 22)

Code:

[13:12:18] + Attempting to send results [February 24 13:12:18 UTC]
[13:12:18] - Reading file work/wuresults_01.dat from core
[13:12:18]   (Read 1356362 bytes from disk)
[13:12:18] Connecting to http://128.59.74.4:8080/
[13:12:19] - Couldn't send HTTP request to server
[13:12:19] + Could not connect to Work Server (results)
[13:12:19]     (128.59.74.4:8080)
[13:12:19] + Retrying using alternative port
[13:12:19] Connecting to http://128.59.74.4:80/
[13:12:20] - Couldn't send HTTP request to server
[13:12:20] + Could not connect to Work Server (results)
[13:12:20]     (128.59.74.4:80)
[13:12:20] - Error: Could not transmit unit 01 (completed February 24) to work server.
[13:12:20] - 675 failed uploads of this unit.


[13:12:20] + Attempting to send results [February 24 13:12:20 UTC]
[13:12:20] - Reading file work/wuresults_01.dat from core
[13:12:20]   (Read 1356362 bytes from disk)
[13:12:20] Connecting to http://171.65.103.100:8080/
[13:12:20] - Couldn't send HTTP request to server
[13:12:20]   (Got status 503)
[13:12:20] + Could not connect to Work Server (results)
[13:12:20]     (171.65.103.100:8080)
[13:12:20] + Retrying using alternative port
[13:12:20] Connecting to http://171.65.103.100:80/
[13:12:23] Posted data.
[13:12:23] Initial: 0000; - Server reports packet it received specified a data size of 0.
[13:12:23]   (May be due to corruption during network transmission or a corrupted file.)
[13:12:23]   Could not transmit unit 01 to Collection server; keeping in queue.
[13:12:23] + Sent 0 of 1 completed units to the server
[13:12:23] - Failed to send all units to server
[13:12:23] ***** Got a SIGTERM signal (2)
[13:12:23] Killing all core threads
and
Project: 3859 (Run 7985, Clone 0, Gen 8)

Code:

[13:10:44] + Attempting to send results [February 24 13:10:44 UTC]
[13:10:44] - Reading file work/wuresults_01.dat from core
[13:10:44]   (Read 871488 bytes from disk)
[13:10:44] Connecting to http://128.59.74.4:8080/
[13:10:45] - Couldn't send HTTP request to server
[13:10:45] + Could not connect to Work Server (results)
[13:10:45]     (128.59.74.4:8080)
[13:10:45] + Retrying using alternative port
[13:10:45] Connecting to http://128.59.74.4:80/
[13:10:46] - Couldn't send HTTP request to server
[13:10:46] + Could not connect to Work Server (results)
[13:10:46]     (128.59.74.4:80)
[13:10:46] - Error: Could not transmit unit 01 (completed February 24) to work server.
[13:10:46] - 748 failed uploads of this unit.


[13:10:46] + Attempting to send results [February 24 13:10:46 UTC]
[13:10:46] - Reading file work/wuresults_01.dat from core
[13:10:46]   (Read 871488 bytes from disk)
[13:10:46] Connecting to http://171.65.103.100:8080/
[13:10:46] - Couldn't send HTTP request to server
[13:10:46]   (Got status 503)
[13:10:46] + Could not connect to Work Server (results)
[13:10:46]     (171.65.103.100:8080)
[13:10:46] + Retrying using alternative port
[13:10:46] Connecting to http://171.65.103.100:80/
[13:10:46] - Couldn't send HTTP request to server
[13:10:46]   (Got status 503)
[13:10:46] + Could not connect to Work Server (results)
[13:10:46]     (171.65.103.100:80)
[13:10:46]   Could not transmit unit 01 to Collection server; keeping in queue.
[13:10:46] + Sent 0 of 1 completed units to the server
[13:10:46] - Failed to send all units to server
[13:10:46] ***** Got a SIGTERM signal (2)
[13:10:46] Killing all core threads
After 675 and 748 attempts to send overnight, I stopped them when I came into the office.
Username for these WUs is "Martin_UM_P4", team 96377 "Mactin".
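
For anyone who wants to tally which servers and ports a client has been failing against, here is a minimal sketch in Python. It assumes the log keeps the format shown above, where each failed connection is followed by a line containing only the timestamp and "(ip:port)". The function name count_failed_endpoints and the SAMPLE text are made up for illustration and are not part of the FAH client.

```python
import re
from collections import Counter

# Matches the endpoint line the client prints after a failed connection,
# e.g. "[13:12:19]     (128.59.74.4:8080)". Lines such as
# "(Got status 503)" or "(Read ... bytes from disk)" do not match.
ENDPOINT_RE = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\]\s+\((\d{1,3}(?:\.\d{1,3}){3}):(\d+)\)\s*$"
)


def count_failed_endpoints(log_text):
    """Return a Counter mapping (ip, port) to the number of failure lines."""
    failures = Counter()
    for line in log_text.splitlines():
        match = ENDPOINT_RE.match(line)
        if match:
            failures[(match.group(1), int(match.group(2)))] += 1
    return failures


# Hypothetical sample, copied from the first log excerpt in this post.
SAMPLE = """\
[13:12:18] Connecting to http://128.59.74.4:8080/
[13:12:19] - Couldn't send HTTP request to server
[13:12:19] + Could not connect to Work Server (results)
[13:12:19]     (128.59.74.4:8080)
[13:12:19] + Retrying using alternative port
[13:12:19] Connecting to http://128.59.74.4:80/
[13:12:20] - Couldn't send HTTP request to server
[13:12:20] + Could not connect to Work Server (results)
[13:12:20]     (128.59.74.4:80)
"""

if __name__ == "__main__":
    for (ip, port), n in sorted(count_failed_endpoints(SAMPLE).items()):
        print(f"{ip}:{port} failed {n} time(s)")
```

To run it against a real log, read the file's contents into a string and pass that to count_failed_endpoints instead of SAMPLE.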

BTW, I have been having trouble obtaining WUs for the past few weeks. Nothing dramatic, but it can take some time to reach the AS and, once assigned, it can take some more time to get a WU from busy servers. This behaviour was only seen with classic WUs until yesterday, when it spread to my SMP machine at home.

Thanks

Re: 128.59.74.4 in Reject

Posted: Tue Feb 24, 2009 3:56 pm
by mrshirts
There's a filesystem problem. I didn't notice it earlier because it was still possible to log in (hence psummary worked), but no assignments are being accepted. I'm restarting and diagnosing it now.

Re: 128.59.74.4 in Reject

Posted: Tue Feb 24, 2009 4:47 pm
by mrshirts
Bad news. The RAID array may need to be rebuilt. I'll keep posting here, and I'll try to figure out the problem with the collection server as well.

Re: 128.59.74.4 in Reject

Posted: Tue Feb 24, 2009 7:36 pm
by toTOW
Vijay made an announcement about this server: viewtopic.php?f=24&t=8601

Re: 128.59.74.4 in Reject

Posted: Wed Feb 25, 2009 4:40 am
by Teddy
That would explain the 3 work units I can't return to that server.

Teddy

Re: 128.59.74.4 in Reject

Posted: Wed Feb 25, 2009 1:16 pm
by toTOW
Indeed.

Re: 128.59.74.4 in Reject

Posted: Thu Feb 26, 2009 1:51 pm
by JanQ
[11:36:07] Completed 130780 out of 300000 steps (43%)
[11:36:16] - Couldn't send HTTP request to server
[11:36:16] + Could not connect to Work Server (results)
[11:36:16] (128.59.74.4:8080)
[11:36:16] + Retrying using alternative port
[11:36:37] - Couldn't send HTTP request to server
[11:36:37] + Could not connect to Work Server (results)
[11:36:37] (128.59.74.4:80)
[11:36:37] - Error: Could not transmit unit 02 (completed February 24) to work server.


[11:36:37] + Attempting to send results [February 26 11:36:37 UTC]
[11:36:53] - Couldn't send HTTP request to server
[11:36:53] + Could not connect to Work Server (results)
[11:36:53] (171.65.103.100:8080)
[11:36:53] + Retrying using alternative port
[11:37:09] - Couldn't send HTTP request to server
[11:37:09] + Could not connect to Work Server (results)
[11:37:09] (171.65.103.100:80)
[11:37:09] Could not transmit unit 02 to Collection server; keeping in queue.

:( :( :(

Re: 128.59.74.4 in Reject

Posted: Thu Feb 26, 2009 6:22 pm
by bruce
JanQ wrote:[11:36:37] (128.59.74.4:80)
[11:36:37] - Error: Could not transmit unit 02 (completed February 24) to work server.
[11:37:09] (171.65.103.100:80)
[11:37:09] Could not transmit unit 02 to Collection server; keeping in queue.
Yes, that's the issue mentioned in the announcement referenced above. There's nothing you can do except be patient.

Re: 128.59.74.4 in Reject

Posted: Thu Feb 26, 2009 6:29 pm
by Mactin
As a short-term measure, would it be possible to enable the CSs to receive these WUs?

Re: 128.59.74.4 in Reject

Posted: Thu Feb 26, 2009 8:32 pm
by bruce
Mactin wrote:As a short-term measure, would it be possible to enable the CSs to receive these WUs?
If you check http://fah-web.stanford.edu/serverstat.html you'll see that the CS (.100) is enabled but extremely busy. Unfortunately, the WU that has been assigned to you can only be uploaded to a specific CS, hence my "be patient" comment. It's impossible to tell how many WUs are waiting to upload to the same overtaxed CS, but I'd guess it's a lot. There's nothing that can be done except wait.

Re: 128.59.74.4 in Reject

Posted: Mon Mar 02, 2009 6:09 pm
by BABackman
toTOW wrote:Vijay made an announcement about this server: viewtopic.php?f=24&t=8601
The Feb 24 message said a RAID rebuild would take "a few days." Any word on the patient's progress or prognosis? Hopefully, it's not as grim as it looks?

Re: 128.59.74.4 in Reject

Posted: Wed Mar 04, 2009 1:51 pm
by Robby_Firefox
I hope the repairs to this server are going well. Any news on when it will be returned to service?

I am using the classic client 6.23 on Windows XP Pro at this time, and have Project: 3859, Run 8092 waiting. About every six hours, it attempts to send this completed project back to FAH. Will 3859 stay in the client's queue until it gets successfully sent to FAH?

Thanks,
Robby / Team Firefox

Re: 128.59.74.4 in Reject

Posted: Wed Mar 04, 2009 2:16 pm
by toTOW
Yes, it will stay in the queue until the server gets back online or the final deadline passes.

Re: 128.59.74.4 in Reject

Posted: Thu Mar 05, 2009 5:10 am
by mrshirts
Update on 128.59.74.4:
The good news: all the data (2 TB) is safe. I was able to rebuild and mount the RAID. The bad news: the server won't boot normally. Since it's actually at Columbia (where I don't work anymore), I'm a bit at the mercy of the IT support staff there in terms of getting it up and running again. The current plan is therefore to copy off the data that is needed to continue the projects, and try to relay the IP to a different machine, putting it into accept-only mode. The timeline is probably going to be about a week, unfortunately.

Re: 128.59.74.4 in Reject

Posted: Thu Mar 05, 2009 6:23 am
by kelliegang
Thanks for the update, mrshirts, really appreciate it :) Glad nothing was lost, too.