Decomposition problem - continuous loop

Moderators: Site Moderators, FAHC Science Team

DaveHarper
Posts: 4
Joined: Wed Jun 17, 2020 1:24 pm
Location: Fairview, TX

Decomposition problem - continuous loop

Post by DaveHarper »

Yesterday I opened FAHControl and it appeared that nothing was being done. The log file showed a fatal domain decomposition error. I've only been using FAH for a couple of months (and it's been great up to now), so I was a little unsure of what to do. I started digging and came across troubleshooting information indicating that I should change the CPU value. I changed it from -1 to 32, but that appeared to have no effect. I also exited and restarted FAH (and even rebooted), but it simply picked up again where it left off. My question at this point is how to terminate the loop it seems to be stuck in. System information and a repeating section of the log are shown below. Thanks for any help.
Dave

System Information:
O.S.: Linux Mint v19.3
Hardware: Home Brew
Processor: Intel Core i9-7900X (10 cores, 20 threads)
Motherboard: ASUS PRIME X299-DELUXE
Memory: 64GB
Storage: 6T (RAID 10)

08:19:59:WU02:FS00:Starting
08:19:59:WU02:FS00:Removing old file 'work/02/logfile_01-20200617-074758.txt'
08:19:59:WU02:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 02 -suffix 01 -version 706 -lifeline 2969 -checkpoint 15 -np 20
08:19:59:WU02:FS00:Started FahCore on PID 13055
08:19:59:WU02:FS00:Core PID:13059
08:19:59:WU02:FS00:FahCore 0xa7 started
08:20:00:WU02:FS00:0xa7:*********************** Log Started 2020-06-17T08:19:59Z ***********************
08:20:00:WU02:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
08:20:00:WU02:FS00:0xa7:       Type: 0xa7
08:20:00:WU02:FS00:0xa7:       Core: Gromacs
08:20:00:WU02:FS00:0xa7:       Args: -dir 02 -suffix 01 -version 706 -lifeline 13055 -checkpoint 15 -np
08:20:00:WU02:FS00:0xa7:             20
08:20:00:WU02:FS00:0xa7:************************************ CBang *************************************
08:20:00:WU02:FS00:0xa7:       Date: Nov 5 2019
08:20:00:WU02:FS00:0xa7:       Time: 06:06:57
08:20:00:WU02:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
08:20:00:WU02:FS00:0xa7:     Branch: master
08:20:00:WU02:FS00:0xa7:   Compiler: GNU 8.3.0
08:20:00:WU02:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
08:20:00:WU02:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:20:00:WU02:FS00:0xa7:       Bits: 64
08:20:00:WU02:FS00:0xa7:       Mode: Release
08:20:00:WU02:FS00:0xa7:************************************ System ************************************
08:20:00:WU02:FS00:0xa7:        CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
08:20:00:WU02:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 85 Stepping 4
08:20:00:WU02:FS00:0xa7:       CPUs: 20
08:20:00:WU02:FS00:0xa7:     Memory: 62.59GiB
08:20:00:WU02:FS00:0xa7:Free Memory: 24.07GiB
08:20:00:WU02:FS00:0xa7:    Threads: POSIX_THREADS
08:20:00:WU02:FS00:0xa7: OS Version: 4.15
08:20:00:WU02:FS00:0xa7:Has Battery: false
08:20:00:WU02:FS00:0xa7: On Battery: false
08:20:00:WU02:FS00:0xa7: UTC Offset: -5
08:20:00:WU02:FS00:0xa7:        PID: 13059
08:20:00:WU02:FS00:0xa7:        CWD: /var/lib/fahclient/work
08:20:00:WU02:FS00:0xa7:******************************** Build - libFAH ********************************
08:20:00:WU02:FS00:0xa7:    Version: 0.0.18
08:20:00:WU02:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
08:20:00:WU02:FS00:0xa7:  Copyright: 2019 foldingathome.org
08:20:00:WU02:FS00:0xa7:   Homepage: https://foldingathome.org/
08:20:00:WU02:FS00:0xa7:       Date: Nov 5 2019
08:20:00:WU02:FS00:0xa7:       Time: 06:13:26
08:20:00:WU02:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
08:20:00:WU02:FS00:0xa7:     Branch: master
08:20:00:WU02:FS00:0xa7:   Compiler: GNU 8.3.0
08:20:00:WU02:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
08:20:00:WU02:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:20:00:WU02:FS00:0xa7:       Bits: 64
08:20:00:WU02:FS00:0xa7:       Mode: Release
08:20:00:WU02:FS00:0xa7:************************************ Build *************************************
08:20:00:WU02:FS00:0xa7:       SIMD: avx_256
08:20:00:WU02:FS00:0xa7:********************************************************************************
08:20:00:WU02:FS00:0xa7:Project: 14524 (Run 589, Clone 2, Gen 34)
08:20:00:WU02:FS00:0xa7:Unit: 0x0000003580fccb0a5e459b9f0ada19d2
08:20:00:WU02:FS00:0xa7:Reading tar file core.xml
08:20:00:WU02:FS00:0xa7:Reading tar file frame34.tpr
08:20:00:WU02:FS00:0xa7:Digital signatures verified
08:20:00:WU02:FS00:0xa7:Calling: mdrun -s frame34.tpr -o frame34.trr -x frame34.xtc -cpt 15 -nt 20
08:20:00:WU02:FS00:0xa7:Steps: first=8500000 total=250000
08:20:00:WU02:FS00:0xa7:ERROR:
08:20:00:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
08:20:00:WU02:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:20:00:WU02:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
08:20:00:WU02:FS00:0xa7:ERROR:
08:20:00:WU02:FS00:0xa7:ERROR:Fatal error:
08:20:00:WU02:FS00:0xa7:ERROR:There is no domain decomposition for 16 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
08:20:00:WU02:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
08:20:00:WU02:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
08:20:00:WU02:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:20:00:WU02:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:20:00:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
08:20:04:WU02:FS00:0xa7:WARNING:Unexpected exit() call
08:20:04:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
08:20:04:WU02:FS00:0xa7:Saving result file ../logfile_01.txt
08:20:04:WU02:FS00:0xa7:Saving result file md.log
08:20:05:WU02:FS00:0xa7:Saving result file science.log
08:20:05:WU02:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
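
For anyone scanning a long log for this failure, here is a minimal Python sketch that pulls the rank count and minimum cell size out of the GROMACS error line shown above. The function name `find_decomposition_errors` and the regular expression are my own illustration, not part of FAHClient or GROMACS, and the pattern is only matched against the wording in this particular log. Note that the error names 16 ranks even though the core was started with -np 20; the log excerpt does not say why.

```python
import re

# Pattern based only on the error wording in the log above (an assumption:
# other GROMACS versions may phrase the message differently).
DD_ERROR = re.compile(
    r"There is no domain decomposition for (\d+) ranks "
    r".*minimum cell size of ([\d.]+) nm"
)


def find_decomposition_errors(log_text: str) -> list[tuple[int, float]]:
    """Return (ranks, min_cell_size_nm) for each decomposition error line."""
    results = []
    for line in log_text.splitlines():
        match = DD_ERROR.search(line)
        if match:
            results.append((int(match.group(1)), float(match.group(2))))
    return results


sample = (
    "08:20:00:WU02:FS00:0xa7:ERROR:There is no domain decomposition for "
    "16 ranks that is compatible with the given box and a minimum cell "
    "size of 1.4227 nm"
)
print(find_decomposition_errors(sample))  # [(16, 1.4227)]
```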
Joe_H
Site Admin
Posts: 7926
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Decomposition problem - continuous loop

Post by Joe_H »

You need to pause the WU long enough for it to stop. Then change the CPU thread count and re-enable folding. Use only values from 1 up to the number of CPU threads your processor supports (20 in your case); do not use higher numbers.

The log section you provided is not enough to determine the CPU thread setting this WU was originally downloaded with, but it was probably 20. I would recommend trying 12 or 18 and avoiding multiples of 5. Also skip 11, 13, 17 and 19.
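
As a rough illustration of that rule of thumb, the Python sketch below lists thread counts up to a given maximum that are not multiples of 5 and are not primes of 11 or more. The function names are hypothetical, and the filter only encodes the advice in this post. It is not how GROMACS actually decides whether a decomposition exists, so a value that passes may still fail on a particular WU. Values such as 7 and 14 pass the filter only because the advice above does not mention them.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small thread counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def candidate_thread_counts(max_threads: int) -> list[int]:
    """Thread counts to try, highest first, following the rule of thumb:
    skip multiples of 5 and skip primes of 11 or more."""
    return [
        n
        for n in range(max_threads, 0, -1)
        if n % 5 != 0 and not (n >= 11 and is_prime(n))
    ]


print(candidate_thread_counts(20))
# [18, 16, 14, 12, 9, 8, 7, 6, 4, 3, 2, 1]
```

On a 20-thread processor the first suggestion is 18, and 12 also appears in the list.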

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
DaveHarper

Re: Decomposition problem - continuous loop

Post by DaveHarper »

@Joe_H: It took a while, but I appear to be back up and running now. I tried 20, but it still seemed to be stuck in the loop. Finally I tried 12, and that seems to be working. Thanks for the help.
Joe_H

Re: Decomposition problem - continuous loop

Post by Joe_H »

This is a project that does not run on 20 threads (a multiple of 5) for many of its Runs. They have been having some issues getting the filter on the server set correctly, and getting the setting to stick, so that the project doesn't get assigned to clients set to 20 or one of the other problematic numbers. They think they have the settings right now, so you shouldn't get another WU for this project when set to 20.
DaveHarper

Re: Decomposition problem - continuous loop

Post by DaveHarper »

I successfully completed the WU when set at 12 and have now configured for 20 and am running a new WU. I'll keep an eye on things and update this post if I hit any further issues. I appreciate the information and the help on getting this resolved.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Decomposition problem - continuous loop

Post by bruce »

18 would be a better default setting than 20 simply because it's not divisible by 5.
DaveHarper

Re: Decomposition problem - continuous loop

Post by DaveHarper »

It actually ran for about 2 months with the default setting of -1 and, every time I checked system utilization, all 20 threads were maxed out. I'll be watching closely to see whether the new server filter settings are effective and, if there's a problem, I'll try 18 next.