Saved checkpoints not found

sashawe
Posts: 10
Joined: Wed Aug 21, 2019 3:05 pm

Post by sashawe »

Hi!
I hope this is the right place to post this!
When I turn my computer off, the F@H client always starts at 0% on a new project after reboot.
Such a waste of computation! Could it be that the saved checkpoints are not found after reboot?
Does anyone know how to fix it?
I installed the software a few days ago on a machine running Ubuntu 18.

Code: Select all

*********************** Log Started 2019-08-21T17:17:20Z ***********************
17:17:20:************************* Folding@home Client *************************
17:17:20:      Website: https://foldingathome.org/
17:17:20:    Copyright: (c) 2009-2018 foldingathome.org
17:17:20:       Author: Joseph Coffland <[email protected]>
17:17:20:         Args: --child --lifeline 1771 /etc/fahclient/config.xml --run-as
17:17:20:               fahclient --pid-file=/var/run/fahclient.pid --daemon
17:17:20:       Config: /etc/fahclient/config.xml
17:17:20:******************************** Build ********************************
17:17:20:      Version: 7.5.1
17:17:20:         Date: May 11 2018
17:17:20:         Time: 19:59:04
17:17:20:   Repository: Git
17:17:20:     Revision: 4705bf53c635f88b8fe85af7675557e15d491ff0
17:17:20:       Branch: master
17:17:20:     Compiler: GNU 6.3.0 20170516
17:17:20:      Options: -std=gnu++98 -O3 -funroll-loops
17:17:20:     Platform: linux2 4.14.0-3-amd64
17:17:20:         Bits: 64
17:17:20:         Mode: Release
17:17:20:******************************* System ********************************
17:17:20:          CPU: AMD Phenom(tm) II X6 1035T Processor
17:17:20:       CPU ID: AuthenticAMD Family 16 Model 10 Stepping 0
17:17:20:         CPUs: 6
17:17:20:       Memory: 7.76GiB
17:17:20:  Free Memory: 6.79GiB
17:17:20:      Threads: POSIX_THREADS
17:17:20:   OS Version: 4.15
17:17:20:  Has Battery: false
17:17:20:   On Battery: false
17:17:20:   UTC Offset: 2
17:17:20:          PID: 1776
17:17:20:          CWD: /var/lib/fahclient
17:17:20:           OS: Linux 4.15.0-58-generic x86_64
17:17:20:      OS Arch: AMD64
17:17:20:         GPUs: 1
17:17:20:        GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:1 GT218 [GeForce 210]
17:17:20:CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:6.5
17:17:20:       OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
17:17:20:               libOpenCL.so: cannot open shared object file: No such file or
17:17:20:               directory
17:17:20:***********************************************************************
17:17:20:<config>
17:17:20:  <!-- Client Control -->
17:17:20:  <fold-anon v='true'/>
17:17:20:
17:17:20:  <!-- Folding Core -->
17:17:20:  <checkpoint v='30'/>
17:17:20:
17:17:20:  <!-- Folding Slot Configuration -->
17:17:20:  <cause v='PARKINSONS'/>
17:17:20:  <gpu v='false'/>
17:17:20:
17:17:20:  <!-- Network -->
17:17:20:  <proxy v=':8080'/>
17:17:20:
17:17:20:  <!-- User Information -->
17:17:20:  <passkey v='********************************'/>
17:17:20:  <user v='sasha_we*****'/>
17:17:20:
17:17:20:  <!-- Folding Slots -->
17:17:20:  <slot id='0' type='CPU'/>
17:17:20:</config>
17:17:20:Switching to user fahclient
17:17:20:Trying to access database...
17:17:20:Successfully acquired database lock
17:17:20:Enabled folding slot 00: READY cpu:5
17:17:20:WU01:FS00:Starting
17:17:20:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/Linux/AMD64/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 1776 -checkpoint 30 -np 5
17:17:20:WU01:FS00:Started FahCore on PID 1818
17:17:20:WU01:FS00:Core PID:1825
17:17:20:WU01:FS00:FahCore 0xa7 started
17:17:20:WARNING:WU01:FS00:FahCore returned: BAD_FRAME_CHECKSUM (112 = 0x70)
17:17:20:WARNING:WU01:FS00:Fatal error, dumping
17:17:20:WU01:FS00:Sending unit results: id:01 state:SEND error:DUMPED project:13827 run:804 clone:0 gen:4 core:0xa7 unit:0x0000000580fccb095c9f8371fc9ed360
17:17:20:WU01:FS00:Connecting to 128.252.203.9:8080
17:17:21:WU00:FS00:Connecting to 65.254.110.245:8080
17:19:31:WARNING:WU01:FS00:WorkServer connection failed on port 8080 trying 80
17:19:31:WU01:FS00:Connecting to 128.252.203.9:80
17:19:31:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': Failed to connect to 65.254.110.245:8080: Connection timed out
17:19:31:WU00:FS00:Connecting to 18.218.241.186:80
17:19:31:WU00:FS00:Assigned to work server 155.247.166.219
17:19:31:WU00:FS00:Requesting new work unit for slot 00: READY cpu:5 from 155.247.166.219
17:19:31:WU00:FS00:Connecting to 155.247.166.219:8080
17:19:31:WU01:FS00:Server responded WORK_ACK (400)
17:19:31:WU01:FS00:Cleaning up
17:19:32:WU00:FS00:Downloading 294.13KiB
17:19:33:WU00:FS00:Download complete
17:19:33:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:14153 run:16 clone:359 gen:203 core:0xa7 unit:0x000000df0002894b5c6fca8897911ef3
17:19:33:WU00:FS00:Starting
17:19:33:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/Linux/AMD64/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 1776 -checkpoint 30 -np 5
17:19:33:WU00:FS00:Started FahCore on PID 3288
17:19:33:WU00:FS00:Core PID:3292
17:19:33:WU00:FS00:FahCore 0xa7 started
17:19:33:WU00:FS00:0xa7:*********************** Log Started 2019-08-21T17:19:33Z ***********************
17:19:33:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
17:19:33:WU00:FS00:0xa7:       Type: 0xa7
17:19:33:WU00:FS00:0xa7:       Core: Gromacs
17:19:33:WU00:FS00:0xa7:    Website: https://foldingathome.org/
17:19:33:WU00:FS00:0xa7:  Copyright: (c) 2009-2018 foldingathome.org
17:19:33:WU00:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
17:19:33:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 3288 -checkpoint 30 -np 5
17:19:33:WU00:FS00:0xa7:     Config: <none>
17:19:33:WU00:FS00:0xa7:************************************ Build *************************************
17:19:33:WU00:FS00:0xa7:    Version: 0.0.17
17:19:33:WU00:FS00:0xa7:       Date: Apr 27 2018
17:19:33:WU00:FS00:0xa7:       Time: 19:09:25
17:19:33:WU00:FS00:0xa7: Repository: Git
17:19:33:WU00:FS00:0xa7:   Revision: 21359963583d09ec2063ef946399441c4df4ccd7
17:19:33:WU00:FS00:0xa7:     Branch: master
17:19:33:WU00:FS00:0xa7:   Compiler: GNU 6.3.0 20170516
17:19:33:WU00:FS00:0xa7:    Options: -std=gnu++98 -O3 -funroll-loops
17:19:33:WU00:FS00:0xa7:   Platform: linux2 4.14.0-3-amd64
17:19:33:WU00:FS00:0xa7:       Bits: 64
17:19:33:WU00:FS00:0xa7:       Mode: Release
17:19:33:WU00:FS00:0xa7:       SIMD: sse2
17:19:33:WU00:FS00:0xa7:************************************ System ************************************
17:19:33:WU00:FS00:0xa7:        CPU: AMD Phenom(tm) II X6 1035T Processor
17:19:33:WU00:FS00:0xa7:     CPU ID: AuthenticAMD Family 16 Model 10 Stepping 0
17:19:33:WU00:FS00:0xa7:       CPUs: 6
17:19:33:WU00:FS00:0xa7:     Memory: 7.76GiB
17:19:33:WU00:FS00:0xa7:Free Memory: 4.69GiB
17:19:33:WU00:FS00:0xa7:    Threads: POSIX_THREADS
17:19:33:WU00:FS00:0xa7: OS Version: 4.15
17:19:33:WU00:FS00:0xa7:Has Battery: false
17:19:33:WU00:FS00:0xa7: On Battery: false
17:19:33:WU00:FS00:0xa7: UTC Offset: 2
17:19:33:WU00:FS00:0xa7:        PID: 3292
17:19:33:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
17:19:33:WU00:FS00:0xa7:         OS: Linux 4.15.0-58-generic x86_64
17:19:33:WU00:FS00:0xa7:    OS Arch: AMD64
17:19:33:WU00:FS00:0xa7:********************************************************************************
17:19:33:WU00:FS00:0xa7:Project: 14153 (Run 16, Clone 359, Gen 203)
17:19:33:WU00:FS00:0xa7:Unit: 0x000000df0002894b5c6fca8897911ef3
17:19:33:WU00:FS00:0xa7:Reading tar file core.xml
17:19:33:WU00:FS00:0xa7:Reading tar file frame203.tpr
17:19:33:WU00:FS00:0xa7:Digital signatures verified
17:19:33:WU00:FS00:0xa7:Reducing thread count from 5 to 4 to avoid domain decomposition by a prime number > 3
17:19:33:WU00:FS00:0xa7:Calling: mdrun -s frame203.tpr -o frame203.trr -cpt 30 -nt 4
17:19:33:WU00:FS00:0xa7:Steps: first=1015000000 total=5000000
17:19:33:WU00:FS00:0xa7:Completed 1 out of 5000000 steps (0%)
17:28:39:WU00:FS00:0xa7:Completed 50000 out of 5000000 steps (1%)
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Saved checkpoints not found

Post by bruce »

It looks like you may be experiencing a permissions problem. Are the files in /var/lib/fahclient owned by the same user that is starting FAHClient?
The checkpoint files should have been written in /var/lib/fahclient/work/?? and should be readable when FAHClient is started in /var/lib/fahclient as it appears you have done to create that log.
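
One way to check the ownership question, as a minimal sketch: the helper below lists anything under a directory that is not owned by a given user. The function name not_owned_by is made up for illustration; the path and the fahclient user are the ones that appear in the log above. No output means everything found is owned by that user.

```shell
#!/bin/sh
# Hypothetical helper: print every file or directory under DIR
# that is NOT owned by USER. Prints nothing if ownership is consistent.
not_owned_by() {
    dir=$1
    user=$2
    # "! -user" negates find's owner test.
    find "$dir" ! -user "$user" -print
}

# Example, using the data directory and user from the log above
# (sudo may be needed if some directories are not readable):
#   not_owned_by /var/lib/fahclient fahclient
```

Anything it prints would be a candidate for the kind of permissions problem described here.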
sashawe
Posts: 10
Joined: Wed Aug 21, 2019 3:05 pm

Re: Saved checkpoints not found

Post by sashawe »

Thanks for the speedy reply bruce!
I checked the permissions in those folders now.
It seems to me that the work folder is accessible to all. I don't know if I'm missing something.
This is what it looked like:

Code: Select all

sasha@sashaspc:/var/lib/fahclient$ ls -la
total 32
drwxrwxr-x  6 fahclient root 4096 aug 22 08:51 .
drwxr-xr-x 86 root      root 4096 aug 17 20:27 ..
drwxrwxrwx  2 fahclient root 4096 aug 21 16:52 configs
drwxrwxrwx  3 fahclient root 4096 aug 17 20:27 cores
drwxrwxrwx  2 fahclient root 4096 aug 22 08:51 logs
-rw-r--r--  1 fahclient root 6224 aug 22 09:05 log.txt
drwxrwxrwx  3 fahclient root 4096 aug 22 08:51 work

sasha@sashaspc:/var/lib/fahclient/work$ ls -la
total 64
drwxrwxrwx 3 fahclient root  4096 aug 22 08:51 .
drwxrwxr-x 6 fahclient root  4096 aug 22 08:51 ..
drwxrwxrwx 3 fahclient root  4096 aug 22 09:05 00
-rw-r--r-- 1 fahclient root 32768 aug 22 09:05 client.db
-rw-r--r-- 1 fahclient root 16928 aug 22 09:05 client.db-journal
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Saved checkpoints not found

Post by bruce »

Well, that looks right, provided you're starting FAHClient with the script provided (which logs on as user "fahclient").

If the WU that is active (or was when FAHClient was shut down) happens to be WU00, the checkpoints should be inside work/00.
sashawe
Posts: 10
Joined: Wed Aug 21, 2019 3:05 pm

Re: Saved checkpoints not found

Post by sashawe »

Ok. So it isn't a permissions problem then?
I just have the regular ubuntu install, no special scripts or anything.
In my /etc/passwd file this line has been added, so at least I know the user fahclient exists:

fahclient:x:128:65534:Folding@home Client:/var/lib/fahclient:/usr/sbin/nologin
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Saved checkpoints not found

Post by bruce »

The script should have been installed when FAHClient was installed. /etc/init.d/FAHClient start should start FAHClient as a service running as the user fahclient, who owns those files. You don't start it under your own userid. Then you (as your normal userid) use FAHControl to manage what it's doing.
sashawe
Posts: 10
Joined: Wed Aug 21, 2019 3:05 pm

Re: Saved checkpoints not found

Post by sashawe »

Yes, I can confirm there is a script /etc/init.d/FAHClient.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Saved checkpoints not found

Post by bruce »

Kill FAHClient if it appears in "top" but it's not working. (It never uses very much CPU, so look for one or more FAHCore* processes, which either are or are not using CPU resources.) Run sudo /etc/init.d/FAHClient start, then check whether things are running. Ultimately, does it find the checkpoints?
sashawe
Posts: 10
Joined: Wed Aug 21, 2019 3:05 pm

Re: Saved checkpoints not found

Post by sashawe »

bruce wrote:Kill FAHClient if it appears in "top" but it's not working. (It never uses very much CPU, so look for one or more FAHCore* processes, which either are or are not using CPU resources.) Run sudo /etc/init.d/FAHClient start, then check whether things are running. Ultimately, does it find the checkpoints?
I don't understand why I should do this. I can already see the client is running. I have FAHControl installed and it works fine, except for saving checkpoints. I haven't seen any lines about saving checkpoints in the recent logs, although I thought I saw some earlier. But maybe that was just me confusing checkpoints with the progress of the WUs.

What about these lines in my log? I think this could be the problem:

Code: Select all

17:17:20:       OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
17:17:20:               libOpenCL.so: cannot open shared object file: No such file or
17:17:20:               directory
Joe_H
Site Admin
Posts: 7995
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
Location: W. MA

Re: Saved checkpoints not found

Post by Joe_H »

You are probably confusing them with the progress messages. Writing a message to the log you see in FAHControl whenever a checkpoint has been written has been suggested as an enhancement, but currently that does not happen. Some of the cores write that information to one of the files kept in the work folder, but I have not looked at the various files recently for all current cores, so I cannot say whether that happens for all folding cores.

As for the message about OpenCL, it does not matter as you are not doing GPU folding on the unsupported video card in your system.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Saved checkpoints not found

Post by bruce »

libOpenCL.so is (obviously) part of the OpenCL installation and FAH needs it to run the GPUs. It actually comes in different versions which may depend on which version of Linux and/or which version of OpenCL is installed. We can spend some time trying to debug that ... when you're no longer feeling overwhelmed by the troubles we've been working on.

The only way I know to tell when a checkpoint is written is to watch the date-time of the files in fahclient/work/0?/*. The checkpoints from a CPU WU will happen every 15 minutes unless you've set them for something else. The checkpoints from a GPU WU happen at a frequency that's set by the project owner, and you can't change that except by installing a faster/slower GPU.
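
A minimal sketch of that check: list the files under a work-unit folder that were modified within the last N minutes. The function name recently_modified is hypothetical, and -mmin is a GNU find extension (available on Ubuntu) rather than POSIX. It does not know which file is the checkpoint; it only shows what has been written recently.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under DIR modified less than
# MINUTES minutes ago (default 30), with their size and modification time.
recently_modified() {
    dir=$1
    minutes=${2:-30}
    # -mmin -N matches files changed less than N minutes ago (GNU find).
    find "$dir" -type f -mmin "-$minutes" -exec ls -l {} +
}

# Example, assuming the active work unit lives in work/00:
#   recently_modified /var/lib/fahclient/work/00 30
```

If nothing is listed after the client has been folding for longer than the checkpoint interval, that would suggest no checkpoint is being written.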
sashawe wrote:I don't understand why I should do this. I can already see the client is running...
Because of the permissions issue. If the files are created by user X when started by method A and by user Y when started by method B, you'll create a permissions problem.
sashawe
Posts: 10
Joined: Wed Aug 21, 2019 3:05 pm

Re: Saved checkpoints not found

Post by sashawe »

Ok, I see, thanks Joe and bruce!
Kill FAHClient if it appears in "top" but it's not working.
Ok. So I did ps -a in the terminal, and I found this line:

Code: Select all

 
PID TTY          TIME CMD
...
9400 tty1     00:00:09 FAHControl
But I could not see any FAHClient at all.
Run sudo /etc/init.d/FAHClient start, then check whether things are running

Code: Select all

sasha@sashaspc:~$ sudo /etc/init.d/FAHClient start
Starting fahclient ... FAILED
fahclient seems to be already running with PID 1721
sasha@sashaspc:~$ sudo kill 1721
sasha@sashaspc:~$ sudo /etc/init.d/FAHClient start
Starting fahclient ... OK
I checked in FAHControl that the client was really dead after killing it, and that it was alive again after restarting it. It worked, and the funny thing is that it resumed the same project as before, at 20.01%, and when I killed it it was at 20.29%. So that seems to work well! But maybe it's just different when you turn the whole computer off.
I tried this several times. The first time gave the above result, but all the following tries gave this output instead: Starting fahclient ... FAIL.
But the practical outcome was still the same.
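
A likely reason FAHClient did not show up in that listing: ps -a only lists processes attached to a terminal, and the log shows the client being started with --daemon, so it has none. As a sketch, the filter below keeps only Folding@home client and core processes from a full process listing. The function name fah_only is made up for illustration.

```shell
#!/bin/sh
# Hypothetical filter: keep only lines that mention FAHClient or a FahCore
# (case-insensitive), e.g. from "ps" output piped in on stdin.
# FAHControl is deliberately not matched.
fah_only() {
    grep -iE 'fah(client|core)'
}

# Example: -e selects every process, including daemons without a terminal;
# the user column shows which account each process runs as.
#   ps -e -o pid=,user=,args= | fah_only
```

The user column is the interesting part here: it shows whether the client and its cores run as fahclient or as some other account.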
JimboPalmer
Posts: 2522
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: Saved checkpoints not found

Post by JimboPalmer »

[I am not a Linux user, so forgive my inexperience]
Does that mean there is some difference between you starting the client and the init script starting the client? That does sound like a permissions or environment issue.
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Saved checkpoints not found

Post by bruce »

Take a look at the files in fahclient/work/ while FAHClient is processing a WU. Stop FAHClient, restart it using the other method, and let it get started. (Presumably it will not find the checkpoint and will download a new WU.) Did the ownership of those files change?

When you turn your computer off, I presume you're doing a normal shutdown, not killing it with the power switch or unplugging it.

We are NOT talking about FAHControl. Ignore its status. It will run anywhere.
sashawe
Posts: 10
Joined: Wed Aug 21, 2019 3:05 pm

Re: Saved checkpoints not found

Post by sashawe »

JimboPalmer: yes, that is possible.

It seems that if I manually kill the client (as I did in my previous reply) and then turn off the computer, it works as it should. The same project is resumed when I start the computer again. So killing means a checkpoint is saved, but otherwise not?
Ok, so while the client is working, it looks like this:

Code: Select all

sasha@sashaspc:/var/lib/fahclient/work/01$ ls -la
total 14260
drwxrwxrwx 3 fahclient root    4096 aug 25 11:07 .
drwxrwxrwx 3 fahclient root    4096 aug 25 10:51 ..
drwxrwxrwx 2 fahclient root    4096 aug 25 11:06 01
-rw-r--r-- 1 fahclient root    2177 aug 25 01:00 logfile_01-20190825-085121.txt
-rw-r--r-- 1 fahclient root    1880 aug 25 11:07 logfile_01.txt
-rw-r--r-- 1 fahclient root  472012 aug 25 10:51 viewerFrame0.json
-rw-r--r-- 1 fahclient root  471990 aug 25 11:07 viewerFrame10.json
-rw-r--r-- 1 fahclient root  471908 aug 25 00:28 viewerFrame1.json
-rw-r--r-- 1 fahclient root  471953 aug 25 00:33 viewerFrame2.json
-rw-r--r-- 1 fahclient root  471951 aug 25 00:39 viewerFrame3.json
-rw-r--r-- 1 fahclient root  472045 aug 25 00:44 viewerFrame4.json
-rw-r--r-- 1 fahclient root  472045 aug 25 00:50 viewerFrame5.json
-rw-r--r-- 1 fahclient root  471966 aug 25 00:55 viewerFrame6.json
-rw-r--r-- 1 fahclient root  472012 aug 25 10:51 viewerFrame7.json
-rw-r--r-- 1 fahclient root  471950 aug 25 10:56 viewerFrame8.json
-rw-r--r-- 1 fahclient root  471950 aug 25 11:02 viewerFrame9.json
-rw-r--r-- 1 fahclient root  810710 aug 25 10:51 viewerTop.json
-rw-r--r-- 1 fahclient root 8535040 aug 25 00:15 wudata_01.dat
-rw-rw-rw- 1 fahclient root       5 aug 25 10:51 wudata_01.lock
-rw-r--r-- 1 fahclient root     512 aug 25 11:07 wuinfo_01.dat
So all the files are owned by fahclient. That's good I think. When I do as you said, the ownership of the files does not change after restarting the client.
Now I don't know what other method for stopping the client you are referring to, bruce, but I tried the sudo /etc/init.d/FAHClient stop, and that worked the same way as with a sudo kill <PID>. Meaning, the same project was resumed again when I ran sudo /etc/init.d/FAHClient start. So I don't know how to do this comparison of file ownership.

But maybe the problem is not the way it is started, but the way it is shut down when the computer is turned off.
And yes, that assumption was correct, bruce :)

For now, I suppose I could manually turn the client off every time I shut down the computer.
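
That manual step could be sketched as below: ask the init script to stop the client, then wait until its process has really gone before powering off. The function name wait_for_exit and the 60-second timeout are assumptions for illustration; the PID file path is the one shown in the client's startup arguments in the first log.

```shell
#!/bin/sh
# Hypothetical helper: wait up to TIMEOUT seconds (default 60) for process
# PID to exit. Returns 0 once it is gone, 1 if it is still running.
wait_for_exit() {
    pid=$1
    timeout=${2:-60}
    waited=0
    # kill -0 sends no signal; it only tests whether the process exists.
    # Run this as root or as the process owner, otherwise the test fails
    # with a permission error and the process looks as if it had exited.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example (as root), assuming the PID file named in the log above:
#   pid=$(cat /var/run/fahclient.pid)
#   /etc/init.d/FAHClient stop
#   wait_for_exit "$pid" 60 && shutdown -h now
```

This only automates the workaround; it does not explain why a normal shutdown loses the progress.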