16926 - Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 3:16 am
by Knish
I first hit a problem at 2352Z: the client kept looping on a CPU WU. I paused/unpaused the slot and also tried rebooting, but I still get the log below:

Code:

*********************** Log Started 2020-11-30T03:10:47Z ***********************
03:10:47:Trying to access database...
03:10:48:Successfully acquired database lock
03:10:48:Downloading GPUs.txt from assign1.foldingathome.org:80
03:10:48:Connecting to assign1.foldingathome.org:80
03:10:48:Read GPUs.txt
03:10:48:Enabled folding slot 00: READY cpu:4
03:10:50:Enabled folding slot 01: PAUSED gpu:0:GV100GL [Tesla V100 PCIe 16GB] M 14028 (by user)
03:10:50:****************************** FAHClient ******************************
03:10:50:        Version: 7.6.13
03:10:50:         Author: Joseph Coffland <[email protected]>
03:10:50:      Copyright: 2020 foldingathome.org
03:10:50:       Homepage: https://foldingathome.org/
03:10:50:           Date: Apr 28 2020
03:10:50:           Time: 04:20:16
03:10:50:       Revision: 5a652817f46116b6e135503af97f18e094414e3b
03:10:50:         Branch: master
03:10:50:       Compiler: GNU 8.3.0
03:10:50:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
03:10:50:                 -funroll-loops -fno-pie
03:10:50:       Platform: linux2 4.19.0-5-amd64
03:10:50:           Bits: 64
03:10:50:           Mode: Release
03:10:50:           Args: --child /etc/fahclient/config.xml --run-as fahclient
03:10:50:                 --pid-file=/var/run/fahclient.pid --daemon
03:10:50:         Config: /etc/fahclient/config.xml
03:10:50:******************************** CBang ********************************
03:10:50:           Date: Apr 25 2020
03:10:50:           Time: 00:07:53
03:10:50:       Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
03:10:50:         Branch: master
03:10:50:       Compiler: GNU 8.3.0
03:10:50:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
03:10:50:                 -funroll-loops -fno-pie -fPIC
03:10:50:       Platform: linux2 4.19.0-5-amd64
03:10:50:           Bits: 64
03:10:50:           Mode: Release
03:10:50:******************************* System ********************************
03:10:50:            CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
03:10:50:         CPU ID: GenuineIntel Family 6 Model 79 Stepping 1
03:10:50:           CPUs: 6
03:10:50:         Memory: 110.17GiB
03:10:50:    Free Memory: 109.39GiB
03:10:50:        Threads: POSIX_THREADS
03:10:50:     OS Version: 4.19
03:10:50:    Has Battery: false
03:10:50:     On Battery: false
03:10:50:     UTC Offset: 0
03:10:50:            PID: 651
03:10:50:            CWD: /var/lib/fahclient
03:10:50:             OS: Linux 4.19.0-12-cloud-amd64 x86_64
03:10:50:        OS Arch: AMD64
03:10:50:           GPUs: 1
03:10:50:          GPU 0: Bus:0 Slot:0 Func:0 NVIDIA:7 GV100GL [Tesla V100 PCIe 16GB] M
03:10:50:                 14028
03:10:50:  CUDA Device 0: Platform:0 Device:0 Bus:0 Slot:0 Compute:7.0 Driver:11.0
03:10:50:OpenCL Device 0: Platform:0 Device:0 Bus:0 Slot:0 Compute:1.2 Driver:450.80
03:10:50:******************************* libFAH ********************************
03:10:50:           Date: Apr 15 2020
03:10:50:           Time: 21:43:24
03:10:50:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
03:10:50:         Branch: master
03:10:50:       Compiler: GNU 8.3.0
03:10:50:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
03:10:50:                 -funroll-loops -fno-pie
03:10:50:       Platform: linux2 4.19.0-5-amd64
03:10:50:           Bits: 64
03:10:50:           Mode: Release
03:10:50:***********************************************************************
03:10:50:<config>
03:10:50:  <!-- Client Control -->
03:10:50:  <fold-anon v='true'/>
03:10:50:
03:10:50:  <!-- Folding Slot Configuration -->
03:10:50:  <cpus v='4'/>
03:10:50:
03:10:50:  <!-- HTTP Server -->redacted
03:10:50:
03:10:50:  <!-- Folding Slots -->
03:10:50:  <slot id='0' type='CPU'/>
03:10:50:  <slot id='1' type='GPU'>
03:10:50:    <paused v='true'/>
03:10:50:  </slot>
03:10:50:</config>
03:10:50:WU01:FS00:Starting
03:10:50:WU01:FS00:Removing old file 'work/01/logfile_01-20201130-023821.txt'
03:10:50:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 651 -checkpoint 15 -np 4
03:10:50:WU01:FS00:Started FahCore on PID 768
03:10:50:WU01:FS00:Core PID:776
03:10:50:WU01:FS00:FahCore 0xa8 started
03:10:50:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
03:10:51:WU01:FS00:Starting
03:10:51:WU01:FS00:Removing old file 'work/01/logfile_01-20201130-023921.txt'
03:10:51:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 651 -checkpoint 15 -np 4
03:10:51:WU01:FS00:Started FahCore on PID 1107
03:10:51:WU01:FS00:Core PID:1111
03:10:51:WU01:FS00:FahCore 0xa8 started
03:10:51:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
03:11:51:WU01:FS00:Starting
03:11:51:WU01:FS00:Removing old file 'work/01/logfile_01-20201130-024021.txt'
03:11:51:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 651 -checkpoint 15 -np 4
03:11:51:WU01:FS00:Started FahCore on PID 1148
03:11:51:WU01:FS00:Core PID:1152
03:11:51:WU01:FS00:FahCore 0xa8 started
03:11:51:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
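
The pattern in this log (the same WU starting, then the core exiting with INTERRUPTED or EARLY_UNIT_END almost immediately, over and over) can be spotted mechanically. Below is a minimal Python sketch, assuming the log line layout shown above. The names `count_core_exits` and `looks_stuck` and the threshold of 3 are illustrative choices, not part of FAHClient, and an occasional non-finishing exit can have other causes, so this only flags repeated ones.

```python
import re
from collections import Counter

# Matches lines such as:
#   03:10:50:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
#   03:10:51:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
RETURN_RE = re.compile(
    r"^\d\d:\d\d:\d\d:(?:WARNING:)?(WU\d+):(FS\d+):FahCore returned: (\w+)"
)


def count_core_exits(log_lines):
    """Count FahCore exit statuses per (work unit, slot, status)."""
    counts = Counter()
    for line in log_lines:
        match = RETURN_RE.match(line.strip())
        if match:
            wu, slot, status = match.groups()
            counts[(wu, slot, status)] += 1
    return counts


def looks_stuck(log_lines, threshold=3):
    """Return (WU, slot) pairs with at least `threshold` non-finishing exits.

    The threshold is an arbitrary illustration, not a FAHClient setting.
    """
    bad = Counter()
    for (wu, slot, status), n in count_core_exits(log_lines).items():
        if status != "FINISHED_UNIT":
            bad[(wu, slot)] += n
    return sorted(key for key, n in bad.items() if n >= threshold)


# Lines copied from the logs in this thread.
sample = [
    "03:10:50:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
    "03:10:51:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)",
    "03:11:51:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
    "22:44:34:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)",
]
print(looks_stuck(sample))
```

To check a whole log, pass the open file object (for example `looks_stuck(open("log.txt"))`, where the path is whatever your install uses).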

Re: Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 3:18 am
by Knish
This WU: https://apps.foldingathome.org/wu#proje ... =491&gen=5

The same thing happened on this machine too:

Code:

22:41:55:WU00:FS00:0xa8:Completed 49500000 out of 50000000 steps (99%)
22:41:56:WU01:FS00:Connecting to assign1.foldingathome.org:80
22:41:56:WU01:FS00:Assigned to work server 129.32.209.204
22:41:56:WU01:FS00:Requesting new work unit for slot 00: RUNNING cpu:5 from 129.32.209.204
22:41:56:WU01:FS00:Connecting to 129.32.209.204:8080
22:41:56:WU01:FS00:Downloading 49.00KiB
22:41:56:WU01:FS00:Download complete
22:41:56:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16926 run:28 clone:177 gen:7 core:0xa8 unit:0x0000000d8120d1cc5fbd3bf8d609f936
22:44:34:WU00:FS00:0xa8:Completed 50000000 out of 50000000 steps (100%)
22:44:34:WU00:FS00:0xa8:Saving result file ../logfile_01.txt
22:44:34:WU00:FS00:0xa8:Saving result file frame3.gro
22:44:34:WU00:FS00:0xa8:Saving result file frame3.xtc
22:44:34:WU00:FS00:0xa8:Saving result file md.log
22:44:34:WU00:FS00:0xa8:Saving result file science.log
22:44:34:WU00:FS00:0xa8:Saving result file state.cpt
22:44:34:WU00:FS00:0xa8:Folding@home Core Shutdown: FINISHED_UNIT
22:44:34:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
22:44:34:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:16926 run:81 clone:72 gen:3 core:0xa8 unit:0x000000068120d1cc5fbd37672fb1556d
22:44:34:WU00:FS00:Uploading 972.00KiB to 129.32.209.204
22:44:34:WU00:FS00:Connecting to 129.32.209.204:8080
22:44:34:WU01:FS00:Starting
22:44:34:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 639 -checkpoint 15 -np 5
22:44:34:WU01:FS00:Started FahCore on PID 19584
22:44:34:WU01:FS00:Core PID:19588
22:44:34:WU01:FS00:FahCore 0xa8 started
22:44:34:WU00:FS00:Upload complete
22:44:35:WU00:FS00:Server responded WORK_ACK (400)
22:44:35:WU00:FS00:Final credit estimate, 15250.00 points
22:44:35:WU00:FS00:Cleaning up
22:44:35:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
22:44:35:WU01:FS00:Starting
22:44:35:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 639 -checkpoint 15 -np 5
22:44:35:WU01:FS00:Started FahCore on PID 19592
22:44:35:WU01:FS00:Core PID:19596
22:44:35:WU01:FS00:FahCore 0xa8 started
22:44:36:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
22:45:35:WU01:FS00:Starting
22:45:35:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 639 -checkpoint 15 -np 5
22:45:35:WU01:FS00:Started FahCore on PID 19606
22:45:35:WU01:FS00:Core PID:19610
22:45:35:WU01:FS00:FahCore 0xa8 started
22:45:36:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)

Re: Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 5:00 am
by Nuitari
Linux (Ubuntu 18.04)
Project 16926 (42, 459, 6)

Code:

04:57:28:WU02:FS00:Starting
04:57:28:WARNING:WU02:FS00:AS lowered CPUs from 11 to 10
04:57:28:WU02:FS00:Removing old file 'work/02/logfile_01-20201130-042527.txt'
04:57:28:WU02:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 02 -suffix 01 -version 706 -lifeline 773 -checkpoint 15 -np 10
04:57:28:WU02:FS00:Started FahCore on PID 14598
04:57:28:WU02:FS00:Core PID:14602
04:57:28:WU02:FS00:FahCore 0xa8 started
04:57:29:WU02:FS00:FahCore returned: INTERRUPTED (102 = 0x66)

Re: Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 5:53 am
by Maddog
Had the same problem with 16926 (13 711 5). Pausing and rebooting did not work. I had to remove the work file to dump it. :(

Re: Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 8:23 am
by Knish
OK, another machine is stuck on this too, which makes three machines in total now, all from WS 129.32.209.204, with these WUs:
16926 9 491 5
16926 49 797 4
16926 28 177 7

I guess I'll have to dump these too.
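
Reports in this thread give the project/run/clone/gen (PRCG) numbers in several notations: bare numbers, parenthesised tuples, and the labeled form from the client log. Below is a minimal Python sketch for normalising them so affected WUs can be collected in one list. `PRCG` and `parse_prcg` are hypothetical names, and the accepted formats are only the ones seen in this thread.

```python
import re
from typing import NamedTuple, Optional


class PRCG(NamedTuple):
    project: int
    run: int
    clone: int
    gen: int


# Labeled form, e.g. "project:16926 run:28 clone:177 gen:7"
# or "Project: 16926 (Run 40, Clone 0, Gen 1)".
LABELED_RE = re.compile(
    r"project:?\s*(\d+)\D+?run:?\s*(\d+)\D+?clone:?\s*(\d+)\D+?gen:?\s*(\d+)",
    re.IGNORECASE,
)


def parse_prcg(text: str) -> Optional[PRCG]:
    """Parse one PRCG report; return None if the text is not recognised."""
    match = LABELED_RE.search(text)
    if match:
        return PRCG(*map(int, match.groups()))
    # Bare form, e.g. "16926 9 491 5" or "16926 (42, 459, 6)":
    # accept it only when the text contains exactly four integers.
    numbers = re.findall(r"\d+", text)
    if len(numbers) == 4:
        return PRCG(*map(int, numbers))
    return None


reports = [
    "16926 9 491 5",
    "16926 (42, 459, 6)",
    "project:16926 run:28 clone:177 gen:7",
    "Project: 16926 (Run 40, Clone 0, Gen 1)",
]
for report in reports:
    print(parse_prcg(report))
```

The labeled pattern is tried first so that a full log line, which also contains other numbers such as the unit id, still yields the right four values.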

Re: 16926 - Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 1:26 pm
by elblat
Seeing the same thing on two separate Ubuntu 20.04 machines:
16926 (4, 746, 4)
16926 (87, 735, 6)

Re: 16926 - Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 3:50 pm
by samcarboni
Me too, with one Ubuntu 20.04 Desktop and one Ubuntu 20.04 Server:
16926 (50, 429, 4) &
16926 (99, 821, 4), respectively

UPDATE:
I dumped 16926 (50, 429, 4) & got 16926 (74, 613, 0) which is running fine
I dumped 16926 (99, 821, 4) & got 17410 (0, 520, 233) which is running fine

Re: 16926 - Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 4:19 pm
by Joe_H
You should see fewer of these issues if you change your system so that it does not request CPU:5 or a multiple of 5. I will pass the problem on to the researcher; a server setting that avoids 5 and its multiples may be missing. CPU settings of 4, 6 or 9 should be okay.

There is also a bug in the Linux client and core processing that results in this looping; the Windows client will drop the WU after a few retries. As I understand it, the current version of the client, 7.6.21, should handle this properly.
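
As a rough illustration of the CPU-count suggestion, here is a minimal Python sketch that steps a slot's thread count down to the nearest value that is not a multiple of 5. The function name `nearest_non_multiple_of_five` is hypothetical, and the code encodes only the rule of thumb stated above, not the assignment server's actual constraints.

```python
def nearest_non_multiple_of_five(cpus: int) -> int:
    """Step down from `cpus` to the closest count that is not a multiple of 5."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    # A positive multiple of 5 is at least 5, so subtracting 1 never
    # takes the result below 4.
    while cpus % 5 == 0:
        cpus -= 1
    return cpus


for requested in (4, 5, 6, 9, 10):
    print(requested, "->", nearest_non_multiple_of_five(requested))
```

The resulting value would go into the slot's CPU setting, which appears in the client configuration as an entry like `<cpus v='4'/>`.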

Re: 16926 - Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 4:22 pm
by aetch
Ubuntu 20.04 Desktop
I dumped project:16926 run:17 clone:661 gen:2

I did notice the download was unusually small at 49KB; I'm used to WUs being measured in MB. I wonder if the server is corrupting the WUs.

Re: 16926 - Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 4:37 pm
by elblat
Joe_H wrote: You should see less of these issues if you change your system from requesting CPU:5 or multiples of 5. I will pass on the problem to the researcher, a setting on the server may be missing that would avoid use of 5 and its multiples. CPU settings of 4, 6 or 9 should be okay.

There is also a bug in the Linux client and core processing that results in this looping, the Windows client will drop the WU after a few retries. As I understand it, the current version of the client - 7.6.21 - should handle this properly.
Sorry I didn't post my log files, but I do have CPU:3 and CPU:9, so in my case that's not the issue.

Sounds like it's time to catch up the boxes to 7.6.21 though.

Re: 16926 - Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 4:52 pm
by gilbertmc
Having the same issue here with the same WU.

Dumping the WU did the trick.

Re: 16926 - Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 4:54 pm
by Maddog
Dumped 2 more WUs on different machines, one with 6 cores and one with 8 cores, so the core count is not the issue here either.
I have had quite a few of these WUs before with no problems, and really good PPD as well.

Re: 16926 - Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 4:59 pm
by mgetz
I also had to dump. The core count shouldn't be the issue on the Linux box I was running it on, as it's a quad-core i5 (4c/4t), so five threads shouldn't be possible. Apologies for not having a log. This WU should really be pulled until the researcher can fix it.

Re: 16926 - Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 5:16 pm
by Joe_H
aetch wrote: I did notice the download was unusually small at 49KB, I'm used to WUs being measured in MB. I wonder if the server is corrupting the WUs.
These are simulating a very small protein system, only 1,314 atoms. So the download will be on the small side.

Constraints on the server should have been fixed to avoid assigning to 5 and 10. I am not familiar enough with the history and status of this project to guess much further on potential issues. The person running this project said he would monitor this topic.

Re: 16926 - Some sort of loops with this CPU WU

Posted: Mon Nov 30, 2020 7:09 pm
by aetch
Just dumped another from my Linux machine
Project: 16926 (Run 40, Clone 0, Gen 1)
My i7-5930K is configured to fold on only 8 of its threads.