16926 - Some sort of loops with this CPU WU

Knish
Posts: 222
Joined: Tue Mar 17, 2020 5:20 am

16926 - Some sort of loops with this CPU WU

Post by Knish »

I first noticed a problem at 2352Z: the client kept looping on a CPU WU. I paused/unpaused the slot and also tried rebooting, but I still get the log below:

Code:

*********************** Log Started 2020-11-30T03:10:47Z ***********************
03:10:47:Trying to access database...
03:10:48:Successfully acquired database lock
03:10:48:Downloading GPUs.txt from assign1.foldingathome.org:80
03:10:48:Connecting to assign1.foldingathome.org:80
03:10:48:Read GPUs.txt
03:10:48:Enabled folding slot 00: READY cpu:4
03:10:50:Enabled folding slot 01: PAUSED gpu:0:GV100GL [Tesla V100 PCIe 16GB] M 14028 (by user)
03:10:50:****************************** FAHClient ******************************
03:10:50:        Version: 7.6.13
03:10:50:         Author: Joseph Coffland <[email protected]>
03:10:50:      Copyright: 2020 foldingathome.org
03:10:50:       Homepage: https://foldingathome.org/
03:10:50:           Date: Apr 28 2020
03:10:50:           Time: 04:20:16
03:10:50:       Revision: 5a652817f46116b6e135503af97f18e094414e3b
03:10:50:         Branch: master
03:10:50:       Compiler: GNU 8.3.0
03:10:50:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
03:10:50:                 -funroll-loops -fno-pie
03:10:50:       Platform: linux2 4.19.0-5-amd64
03:10:50:           Bits: 64
03:10:50:           Mode: Release
03:10:50:           Args: --child /etc/fahclient/config.xml --run-as fahclient
03:10:50:                 --pid-file=/var/run/fahclient.pid --daemon
03:10:50:         Config: /etc/fahclient/config.xml
03:10:50:******************************** CBang ********************************
03:10:50:           Date: Apr 25 2020
03:10:50:           Time: 00:07:53
03:10:50:       Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
03:10:50:         Branch: master
03:10:50:       Compiler: GNU 8.3.0
03:10:50:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
03:10:50:                 -funroll-loops -fno-pie -fPIC
03:10:50:       Platform: linux2 4.19.0-5-amd64
03:10:50:           Bits: 64
03:10:50:           Mode: Release
03:10:50:******************************* System ********************************
03:10:50:            CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
03:10:50:         CPU ID: GenuineIntel Family 6 Model 79 Stepping 1
03:10:50:           CPUs: 6
03:10:50:         Memory: 110.17GiB
03:10:50:    Free Memory: 109.39GiB
03:10:50:        Threads: POSIX_THREADS
03:10:50:     OS Version: 4.19
03:10:50:    Has Battery: false
03:10:50:     On Battery: false
03:10:50:     UTC Offset: 0
03:10:50:            PID: 651
03:10:50:            CWD: /var/lib/fahclient
03:10:50:             OS: Linux 4.19.0-12-cloud-amd64 x86_64
03:10:50:        OS Arch: AMD64
03:10:50:           GPUs: 1
03:10:50:          GPU 0: Bus:0 Slot:0 Func:0 NVIDIA:7 GV100GL [Tesla V100 PCIe 16GB] M
03:10:50:                 14028
03:10:50:  CUDA Device 0: Platform:0 Device:0 Bus:0 Slot:0 Compute:7.0 Driver:11.0
03:10:50:OpenCL Device 0: Platform:0 Device:0 Bus:0 Slot:0 Compute:1.2 Driver:450.80
03:10:50:******************************* libFAH ********************************
03:10:50:           Date: Apr 15 2020
03:10:50:           Time: 21:43:24
03:10:50:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
03:10:50:         Branch: master
03:10:50:       Compiler: GNU 8.3.0
03:10:50:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
03:10:50:                 -funroll-loops -fno-pie
03:10:50:       Platform: linux2 4.19.0-5-amd64
03:10:50:           Bits: 64
03:10:50:           Mode: Release
03:10:50:***********************************************************************
03:10:50:<config>
03:10:50:  <!-- Client Control -->
03:10:50:  <fold-anon v='true'/>
03:10:50:
03:10:50:  <!-- Folding Slot Configuration -->
03:10:50:  <cpus v='4'/>
03:10:50:
03:10:50:  <!-- HTTP Server -->redacted
03:10:50:
03:10:50:  <!-- Folding Slots -->
03:10:50:  <slot id='0' type='CPU'/>
03:10:50:  <slot id='1' type='GPU'>
03:10:50:    <paused v='true'/>
03:10:50:  </slot>
03:10:50:</config>
03:10:50:WU01:FS00:Starting
03:10:50:WU01:FS00:Removing old file 'work/01/logfile_01-20201130-023821.txt'
03:10:50:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 651 -checkpoint 15 -np 4
03:10:50:WU01:FS00:Started FahCore on PID 768
03:10:50:WU01:FS00:Core PID:776
03:10:50:WU01:FS00:FahCore 0xa8 started
03:10:50:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
03:10:51:WU01:FS00:Starting
03:10:51:WU01:FS00:Removing old file 'work/01/logfile_01-20201130-023921.txt'
03:10:51:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 651 -checkpoint 15 -np 4
03:10:51:WU01:FS00:Started FahCore on PID 1107
03:10:51:WU01:FS00:Core PID:1111
03:10:51:WU01:FS00:FahCore 0xa8 started
03:10:51:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
03:11:51:WU01:FS00:Starting
03:11:51:WU01:FS00:Removing old file 'work/01/logfile_01-20201130-024021.txt'
03:11:51:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 651 -checkpoint 15 -np 4
03:11:51:WU01:FS00:Started FahCore on PID 1148
03:11:51:WU01:FS00:Core PID:1152
03:11:51:WU01:FS00:FahCore 0xa8 started
03:11:51:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
Last edited by Knish on Mon Nov 30, 2020 9:38 am, edited 1 time in total.
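
For anyone who wants to spot this loop without watching the log, here is a minimal Python sketch that counts consecutive failed core starts per WU and slot. It assumes the log line format shown in the post above; the function name `find_looping_units`, the threshold of 3, and the choice of which return codes count as failures are illustrative assumptions, not part of FAHClient.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   03:10:50:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
#   03:10:51:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
RETURN_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:(?:WARNING:)?(WU\d+):(FS\d+):FahCore returned: (\w+)"
)

# Assumption: these are the only failure codes treated as part of a loop,
# chosen because they are the ones that appear in the logs in this thread.
FAILED = {"INTERRUPTED", "EARLY_UNIT_END"}


def find_looping_units(log_lines, threshold=3):
    """Return {(wu, slot): longest_failure_streak} for streaks >= threshold."""
    streak = defaultdict(int)   # current run of consecutive failures
    longest = defaultdict(int)  # longest run seen so far
    for line in log_lines:
        match = RETURN_RE.match(line.strip())
        if not match:
            continue
        wu, slot, status = match.groups()
        key = (wu, slot)
        if status in FAILED:
            streak[key] += 1
            longest[key] = max(longest[key], streak[key])
        else:
            # Any other return code (e.g. FINISHED_UNIT) ends the streak.
            streak[key] = 0
    return {key: n for key, n in longest.items() if n >= threshold}
```

Feeding it the lines of a log file (for example `open(path).read().splitlines()`, with `path` pointing at your own log) returns the stuck WU/slot pairs. Note that the client reuses IDs such as WU01 over time, so a long log may need to be split per work unit first.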
Knish
Posts: 222
Joined: Tue Mar 17, 2020 5:20 am

Re: Some sort of loops with this CPU WU

Post by Knish »

This WU: https://apps.foldingathome.org/wu#proje ... =491&gen=5

It happened on this machine too:

Code:

22:41:55:WU00:FS00:0xa8:Completed 49500000 out of 50000000 steps (99%)
22:41:56:WU01:FS00:Connecting to assign1.foldingathome.org:80
22:41:56:WU01:FS00:Assigned to work server 129.32.209.204
22:41:56:WU01:FS00:Requesting new work unit for slot 00: RUNNING cpu:5 from 129.32.209.204
22:41:56:WU01:FS00:Connecting to 129.32.209.204:8080
22:41:56:WU01:FS00:Downloading 49.00KiB
22:41:56:WU01:FS00:Download complete
22:41:56:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16926 run:28 clone:177 gen:7 core:0xa8 unit:0x0000000d8120d1cc5fbd3bf8d609f936
22:44:34:WU00:FS00:0xa8:Completed 50000000 out of 50000000 steps (100%)
22:44:34:WU00:FS00:0xa8:Saving result file ../logfile_01.txt
22:44:34:WU00:FS00:0xa8:Saving result file frame3.gro
22:44:34:WU00:FS00:0xa8:Saving result file frame3.xtc
22:44:34:WU00:FS00:0xa8:Saving result file md.log
22:44:34:WU00:FS00:0xa8:Saving result file science.log
22:44:34:WU00:FS00:0xa8:Saving result file state.cpt
22:44:34:WU00:FS00:0xa8:Folding@home Core Shutdown: FINISHED_UNIT
22:44:34:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
22:44:34:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:16926 run:81 clone:72 gen:3 core:0xa8 unit:0x000000068120d1cc5fbd37672fb1556d
22:44:34:WU00:FS00:Uploading 972.00KiB to 129.32.209.204
22:44:34:WU00:FS00:Connecting to 129.32.209.204:8080
22:44:34:WU01:FS00:Starting
22:44:34:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 639 -checkpoint 15 -np 5
22:44:34:WU01:FS00:Started FahCore on PID 19584
22:44:34:WU01:FS00:Core PID:19588
22:44:34:WU01:FS00:FahCore 0xa8 started
22:44:34:WU00:FS00:Upload complete
22:44:35:WU00:FS00:Server responded WORK_ACK (400)
22:44:35:WU00:FS00:Final credit estimate, 15250.00 points
22:44:35:WU00:FS00:Cleaning up
22:44:35:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
22:44:35:WU01:FS00:Starting
22:44:35:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 639 -checkpoint 15 -np 5
22:44:35:WU01:FS00:Started FahCore on PID 19592
22:44:35:WU01:FS00:Core PID:19596
22:44:35:WU01:FS00:FahCore 0xa8 started
22:44:36:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
22:45:35:WU01:FS00:Starting
22:45:35:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 639 -checkpoint 15 -np 5
22:45:35:WU01:FS00:Started FahCore on PID 19606
22:45:35:WU01:FS00:Core PID:19610
22:45:35:WU01:FS00:FahCore 0xa8 started
22:45:36:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
Nuitari
Posts: 78
Joined: Sun Jun 09, 2019 4:03 am
Hardware configuration: 1x Nvidia 1050ti
1x Nvidia 1660Super
1x Nvidia GTX 660
1x Nvidia 1060 3gb
1x AMD rx570
2x AMD rx560
1x AMD Ryzen 7 PRO 1700
1x AMD Ryzen 7 3700X
1x AMD Phenom II
1x AMD A8-9600
1x Intel i5-4590S

Re: Some sort of loops with this CPU WU

Post by Nuitari »

Linux (Ubuntu 18.04)
Project 16926 (42, 459, 6)

Code:

04:57:28:WU02:FS00:Starting
04:57:28:WARNING:WU02:FS00:AS lowered CPUs from 11 to 10
04:57:28:WU02:FS00:Removing old file 'work/02/logfile_01-20201130-042527.txt'
04:57:28:WU02:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.9/Core_a8.fah/FahCore_a8 -dir 02 -suffix 01 -version 706 -lifeline 773 -checkpoint 15 -np 10
04:57:28:WU02:FS00:Started FahCore on PID 14598
04:57:28:WU02:FS00:Core PID:14602
04:57:28:WU02:FS00:FahCore 0xa8 started
04:57:29:WU02:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
Maddog
Posts: 15
Joined: Wed Sep 30, 2020 2:06 pm

Re: Some sort of loops with this CPU WU

Post by Maddog »

Had the same problem with 16926 (13, 711, 5). Pausing and rebooting did not work. I had to remove the work files to dump it. :(
Knish
Posts: 222
Joined: Tue Mar 17, 2020 5:20 am

Re: Some sort of loops with this CPU WU

Post by Knish »

OK, another machine is stuck on this too, for 3 machines total now, all from WS 129.32.209.204, with these WUs:
16926 9 491 5
16926 49 797 4
16926 28 177 7

I guess I'll have to dump these too.
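
To collect the project/run/clone/gen numbers for a report like the list above, a small Python sketch can pull them out of the "Received Unit" or "Sending unit results" log lines. It assumes the `project:… run:… clone:… gen:…` field layout seen in the logs in this thread; the name `extract_prcg` is hypothetical.

```python
import re

# Field layout as it appears in the client log lines quoted in this thread.
PRCG_RE = re.compile(r"project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)")


def extract_prcg(line):
    """Return (project, run, clone, gen) as ints, or None if not present."""
    match = PRCG_RE.search(line)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())
```

Running it over every line of a log and dropping the `None` results gives the list of WUs a machine has handled.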
elblat
Posts: 15
Joined: Sun Mar 29, 2020 5:18 pm

Re: 16926 - Some sort of loops with this CPU WU

Post by elblat »

Seeing the same thing on two separate Ubuntu 20.04 machines:
16926 (4, 746, 4)
16926 (87, 735, 6)
5800X + 4090 + Win11 | 5600X+ 3070 + 3070 + 2060 + Ubuntu 20.04 | 5600X + 3080 Ti + 3060 Ti + Ubuntu 20.04
samcarboni
Posts: 8
Joined: Tue May 19, 2020 8:12 pm

Re: 16926 - Some sort of loops with this CPU WU

Post by samcarboni »

Me too with 1xUbuntu 20.04 Desktop & 1xUbuntu 20.04 Server:
16926 (50, 429, 4) &
16926 (99, 821, 4), respectively

UPDATE:
I dumped 16926 (50, 429, 4) & got 16926 (74, 613, 0) which is running fine
I dumped 16926 (99, 821, 4) & got 17410 (0, 520, 233) which is running fine
Joe_H
Site Admin
Posts: 7929
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 16926 - Some sort of loops with this CPU WU

Post by Joe_H »

You should see fewer of these issues if you change your system from requesting CPU:5 or multiples of 5. I will pass on the problem to the researcher; a setting on the server may be missing that would avoid use of 5 and its multiples. CPU settings of 4, 6 or 9 should be okay.

There is also a bug in the Linux client and core processing that results in this looping; the Windows client will drop the WU after a few retries. As I understand it, the current version of the client - 7.6.21 - should handle this properly.
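
The rule of thumb about 5 and its multiples can be sketched in Python as below. This encodes only the advice in this post, not the client's or the assignment server's actual logic; the function names and the "step down by one" strategy are assumptions for illustration. Other posters in this thread report the loop at other CPU counts as well, so a passing check does not rule the problem out.

```python
def is_multiple_of_five(cpus):
    """True if the slot's CPU count is 5 or a multiple of 5."""
    return cpus > 0 and cpus % 5 == 0


def suggest_cpu_count(cpus):
    """Step down from `cpus` until the count is no longer a multiple of 5.

    Hypothetical helper: it only reflects the 'avoid 5 and its multiples'
    advice and ignores any other constraints a project may have.
    """
    count = cpus
    while count > 1 and is_multiple_of_five(count):
        count -= 1
    return count
```

For example, a slot configured with 5 threads would be stepped down to 4, and one with 10 to 9, which matches the settings named as okay above.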

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
aetch
Posts: 436
Joined: Thu Jun 25, 2020 3:04 pm
Location: Between chair and keyboard

Re: 16926 - Some sort of loops with this CPU WU

Post by aetch »

Ubuntu 20.04 Desktop
I dumped project:16926 run:17 clone:661 gen:2

I did notice the download was unusually small at 49KB; I'm used to WUs being measured in MB. I wonder if the server is corrupting the WUs.
Folding Rigs - None (25-Jun-2022)

elblat
Posts: 15
Joined: Sun Mar 29, 2020 5:18 pm

Re: 16926 - Some sort of loops with this CPU WU

Post by elblat »

Joe_H wrote: You should see fewer of these issues if you change your system from requesting CPU:5 or multiples of 5. I will pass on the problem to the researcher; a setting on the server may be missing that would avoid use of 5 and its multiples. CPU settings of 4, 6 or 9 should be okay.

There is also a bug in the Linux client and core processing that results in this looping; the Windows client will drop the WU after a few retries. As I understand it, the current version of the client - 7.6.21 - should handle this properly.

Sorry I didn't post my log files, but I do have CPU:3 and CPU:9, so in my case that's not the issue.

Sounds like it's time to catch up the boxes to 7.6.21 though.
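
A quick way to check which boxes still need the update is to read the client version from the log banner. The Python sketch below assumes the `Version: 7.6.13` banner format shown in the first log of this thread and uses 7.6.21 as the minimum only because that is the version mentioned above; both function names are hypothetical.

```python
import re

# Banner line as printed at client start, e.g. "03:10:50:        Version: 7.6.13"
VERSION_RE = re.compile(r"Version:\s*(\d+)\.(\d+)\.(\d+)")


def parse_client_version(line):
    """Return the version as a tuple of ints, or None if the line has none."""
    match = VERSION_RE.search(line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def needs_update(version, minimum=(7, 6, 21)):
    """True if `version` is older than `minimum` (tuple comparison)."""
    return version < minimum
```

Comparing tuples of integers avoids the string-comparison trap where "7.6.9" would sort after "7.6.21".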
gilbertmc
Posts: 4
Joined: Thu May 14, 2020 12:47 pm

Re: 16926 - Some sort of loops with this CPU WU

Post by gilbertmc »

Having the same issue here with the same WU.

Dumping the WU did the trick.
Last edited by gilbertmc on Mon Nov 30, 2020 5:06 pm, edited 2 times in total.
Maddog
Posts: 15
Joined: Wed Sep 30, 2020 2:06 pm

Re: 16926 - Some sort of loops with this CPU WU

Post by Maddog »

Dumped 2 more WUs on different machines, one with 6 cores and one with 8 cores, so the core count is not the issue here either.
I have had quite a few of these WUs before with no problems, and really good PPD as well.
mgetz
Posts: 57
Joined: Tue Aug 11, 2020 6:23 pm

Re: 16926 - Some sort of loops with this CPU WU

Post by mgetz »

I also had to dump. The core count shouldn't be an issue on the Linux box I was running it on, as it's a quad-core i5 (4c/4t), so five threads shouldn't be possible. Apologies for lacking a log. This WU should really be pulled until the researcher can fix it.
Joe_H
Site Admin
Posts: 7929
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 16926 - Some sort of loops with this CPU WU

Post by Joe_H »

aetch wrote: I did notice the download was unusually small at 49KB; I'm used to WUs being measured in MB. I wonder if the server is corrupting the WUs.

These are simulating a very small protein system, only 1,314 atoms, so the download will be on the small side.

Constraints on the server should have been fixed to avoid assigning to 5 and 10. I am not familiar enough with the history and status of this project to guess much further on potential issues. The person running this project said he would monitor this topic.
aetch
Posts: 436
Joined: Thu Jun 25, 2020 3:04 pm
Location: Between chair and keyboard

Re: 16926 - Some sort of loops with this CPU WU

Post by aetch »

Just dumped another from my Linux machine
Project: 16926 (Run 40, Clone 0, Gen 1)
My i7-5930K is configured to fold on only 8 of its threads.
