trouble Getting more work


fredex
Posts: 48
Joined: Thu Apr 01, 2010 1:17 am
Location: stoneham, ma, us

trouble Getting more work

Post by fredex »

For the last day and a half (or maybe two days) one of my clients (I'm running two on a dual-core box) has been unable to get more work, while the other one just keeps on truckin'.

Here are some log entries:
    [18:17:27] + Attempting to get work packet
    [18:17:27] - Connecting to assignment server
    [18:17:27] - Successful: assigned to (129.74.85.15).
    [18:17:27] + News From Folding@Home: Welcome to Folding@Home
    [18:17:28] Loaded queue successfully.
    [18:17:28] - Attempt #40 to get work failed, and no other work to do.
    Waiting before retry.
    [19:05:39] + Attempting to get work packet
    [19:05:39] - Connecting to assignment server
    [19:05:39] - Successful: assigned to (129.74.85.15).
    [19:05:39] + News From Folding@Home: Welcome to Folding@Home
    [19:05:39] Loaded queue successfully.
    [19:05:41] - Attempt #41 to get work failed, and no other work to do.
    Waiting before retry.
Occasionally (not often) it logs the same entries for the address 171.64.65.111 instead.
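For anyone who wants to tally this from a long log rather than eyeball it, here is a minimal sketch that counts failed work requests per assigned server. It assumes the log lines look like the excerpt above; the function name and the sample text are only for illustration, and nothing here talks to the client or the servers.

```python
import re
from collections import Counter

# Patterns based on the log excerpt in this post.
ASSIGNED = re.compile(r"assigned to \((\d+\.\d+\.\d+\.\d+)\)")
FAILED = re.compile(r"Attempt #\d+ to get work failed")

def failed_attempts_by_server(log_text):
    """Count 'get work failed' lines, keyed by the most recently assigned server."""
    failures = Counter()
    last_server = None
    for line in log_text.splitlines():
        match = ASSIGNED.search(line)
        if match:
            last_server = match.group(1)
        elif FAILED.search(line) and last_server is not None:
            failures[last_server] += 1
            last_server = None  # wait for the next assignment before counting again
    return failures

sample = """\
[18:17:27] - Successful: assigned to (129.74.85.15).
[18:17:28] - Attempt #40 to get work failed, and no other work to do.
[19:05:39] - Successful: assigned to (129.74.85.15).
[19:05:41] - Attempt #41 to get work failed, and no other work to do.
"""
print(failed_attempts_by_server(sample))
```

Run against the full log, this would show whether the failures really are concentrated on one work server or spread across both addresses.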

Looking at the server status page for 129.74.85.15, it is shown as "accepting" and as having 39130 WUs available.

Not sure why it would be refusing to hand out new work, with that much sitting there available.

Anyone know how I can "trick" the client into trying a different server, or maybe give that one a smack to wake it up? :)

The OTHER client (the one that works) appears to be getting WUs from the same server, and it's not getting refused:
    [15:19:12] - Preparing to get new work unit...
    [15:19:12] + Attempting to get work packet
    [15:19:12] - Connecting to assignment server
    [15:19:13] - Successful: assigned to (129.74.85.15).
    [15:19:13] + News From Folding@Home: Welcome to Folding@Home
    [15:19:13] Loaded queue successfully.
    [15:19:14] + Closed connections
    [15:19:14]
    [15:19:14] + Processing work unit
    [15:19:14] Core required: FahCore_b4.exe
    [15:19:14] Core found.
    [15:19:14] Working on Unit 03 [April 10 15:19:14]
    [15:19:14] + Working ...
    [15:19:14] *********************** Log Started 10/Apr/2010 15:19:14 ***********************
    [15:19:14] ************************** ProtoMol Folding@Home Core **************************
    [15:19:14] Version: 23
    [15:19:14] Type: 180
    [15:19:14] Core: ProtoMol
    [15:19:14] Website: http://folding.stanford.edu/
    [15:19:14] Copyright: (c) 2009 Stanford University
    [15:19:14] Author: Joseph Coffland <[email protected]>
    [15:19:14] Args: -dir work/ -suffix 03 -checkpoint 15 -lifeline 5435 -version 602
    [15:19:14] ************************************ Build *************************************
I've no clue. Suggestions welcome. Thanks!
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: trouble Getting more work

Post by bruce »

Are both clients running the same version of FAH with the same options? Make sure they're both configured for the same sized WUs, with the same amount of RAM if you used that setting, and both with or without -advmethods. The only "trick" that I'm aware of is to restart the client, but there's no certainty that it will matter.
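One quick way to check that the two clients really are configured alike is to diff their configuration files. Here is a rough sketch using Python's standard difflib. The option names in the example are made up for illustration and are not the client's real keys; in practice you would read the text from each client's own config file.

```python
import difflib

def config_differences(text_a, text_b, name_a="client1", name_b="client2"):
    """Return unified-diff lines between two config file contents (empty if identical)."""
    return list(difflib.unified_diff(
        text_a.splitlines(),
        text_b.splitlines(),
        fromfile=name_a,
        tofile=name_b,
        lineterm="",
        n=0,  # no context lines, only the differences
    ))

# Hypothetical contents; real option names will differ.
cfg_a = "packet_size=big\nmemory_mb=768\n"
cfg_b = "packet_size=big\nmemory_mb=512\n"
for line in config_differences(cfg_a, cfg_b):
    print(line)
```

An empty result means the two files match line for line. Anything printed with a leading - or + is a setting worth a second look.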
fredex
Posts: 48
Joined: Thu Apr 01, 2010 1:17 am
Location: stoneham, ma, us

Re: trouble Getting more work

Post by fredex »

Yes, both clients are running the same version with the same options: both are set for big WUs, in both cases I told 'em 768M (the machine actually has 4 gigs), and neither uses -advmethods.

I haven't yet tried restarting the clients, but I don't suppose it'll harm anything to try.
fredex
Posts: 48
Joined: Thu Apr 01, 2010 1:17 am
Location: stoneham, ma, us

Re: trouble Getting more work

Post by fredex »

Restarting the clients doesn't seem to have made any difference.

One of 'em cranks away; the other one tries and fails to get new work.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: trouble Getting more work

Post by bruce »

I just came on-line so I'm not sure what's been happening today, but for most of the last couple weeks there has been a shortage of WUs for the uniprocessor clients. It comes and goes depending on many factors so it's not always easy to tell. That doesn't explain why one client consistently gets WUs while the other one does not, but that could be just randomness.

Adding the -advmethods flag will give you access to a wider variety of WUs, but if that happens to give you projects that don't run on your machine, it's not a good idea. You'll have to decide for yourself.
fredex
Posts: 48
Joined: Thu Apr 01, 2010 1:17 am
Location: stoneham, ma, us

Re: trouble Getting more work

Post by fredex »

So, trying to figure out why I couldn't get more work on one client of two, I tried renaming the work directory for the "bad" one. Voilà! Now it starts up and gets work:


[14:28:35] Work directory not found. Creating...
[14:28:35] Loaded queue successfully.
[14:28:35] - Preparing to get new work unit...
[14:28:35] + Attempting to get work packet
[14:28:35] - Connecting to assignment server
[14:28:35] - Successful: assigned to (129.74.85.15).
[14:28:35] + News From Folding@Home: Welcome to Folding@Home
[14:28:36] Loaded queue successfully.
[14:28:39] + Closed connections
[14:28:39] 
[14:28:39] + Processing work unit
[14:28:39] Core required: FahCore_b4.exe
[14:28:39] Core found.
[14:28:39] Working on Unit 04 [April 12 14:28:39]
[14:28:39] + Working ...

When it started it left this at the end of the log file:


[14:28:39] Project: 10017 (Run 2811, Clone 0, Gen 1)
[14:28:39] Reading tar file par_all27_prot_lipid.inp
[14:28:39] Reading tar file scpismQuartic.inp
[14:28:39] Reading tar file ww.pdb
[14:28:39] Reading tar file ww.psf
[14:28:39] Reading tar file checkpt
[14:28:39] Reading tar file ww.71.pos
[14:28:39] Reading tar file ww.71.vel
[14:28:39] Reading tar file protomol.conf
[14:28:39] Reading tar file core.xml
[14:28:39] ERROR: fah/os/Thread.cpp:169:starter: Exception: In thread 12: @ fah/net/Socket.cpp:128:bind 0: Could not bind socket to 127.0.0.1:52753: Address already in use
[14:28:39] Completed 0 out of 499375 steps (0%)
I'm going to assume that the socket error was transient, and that it isn't causing a problem since it seems to have found work to do.
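In case it helps anyone else, the rename described at the top of this post can be scripted. This is only a sketch: it assumes the client is stopped first, that the directory is named work (as in the -dir work/ argument in the logs), and that nothing in it is still needed, since any unfinished WU would be set aside along with it. The function name and the backup suffix are my own choices.

```python
import os
import tempfile
import time

def set_aside_work_dir(client_dir, work_name="work"):
    """Rename <client_dir>/<work_name> to a timestamped backup.

    Returns the new path, or None if there was no such directory.
    """
    src = os.path.join(client_dir, work_name)
    if not os.path.isdir(src):
        return None
    dst = f"{src}.bak-{time.strftime('%Y%m%d-%H%M%S')}"
    os.rename(src, dst)
    return dst

# Demonstration in a throwaway directory, not a real client install.
with tempfile.TemporaryDirectory() as client_dir:
    os.mkdir(os.path.join(client_dir, "work"))
    moved = set_aside_work_dir(client_dir)
    print(os.path.basename(moved))
    print(os.path.isdir(os.path.join(client_dir, "work")))  # False
```

Renaming rather than deleting keeps the old directory around in case something in it turns out to matter.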
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: trouble Getting more work

Post by bruce »

The socket error can be ignored. There's a limitation that restricts the viewer to the first client that starts, and you only see the message when you have two clients running at the same time. Since few people use the viewer, few people even care.
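For the curious, "Address already in use" is the ordinary operating-system response when a second socket tries to bind an address and port that another socket already holds. Here is a small sketch that reproduces it on the loopback interface. It uses TCP sockets and an arbitrary free port, both assumptions chosen for illustration; it is not the client's actual code.

```python
import errno
import socket

def second_bind_fails():
    """Bind two sockets to the same loopback port; report whether the second gets EADDRINUSE."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))      # port 0 lets the OS pick a free port
        port = first.getsockname()[1]
        first.listen(1)
        try:
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        first.close()
        second.close()

print(second_bind_fails())
```

The first socket plays the role of the client that started first; the second one hits the same kind of error that shows up in the log.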