Some computers getting work, others not?

ccgllc
Posts: 20
Joined: Sun Apr 05, 2020 5:09 am

Post by ccgllc »

Hi all.

I have a pair of i5s that are receiving CPU work (no GPU installed).
I have an i5 (6) GPU box that keeps 2-6 CPUs busy.
I have an i7 that is receiving CPU work (no GPU installed).
I have a Xeon E3-1220 quad-core processor that only gets messages like:

Code:

******************************* Date: 2020-04-16 *******************************
15:13:41:WU00:FS00:Connecting to 65.254.110.245:8080
15:14:13:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': Failed to connect to 65.254.110.245:8080: Connection timed out
15:14:13:WU00:FS00:Connecting to 18.218.241.186:80
15:14:45:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': Failed to connect to 18.218.241.186:80: Connection timed out
15:14:45:ERROR:WU00:FS00:Exception: Could not get an assignment
20:35:41:WU00:FS00:Connecting to 65.254.110.245:8080
20:36:13:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': Failed to connect to 65.254.110.245:8080: Connection timed out
20:36:13:WU00:FS00:Connecting to 18.218.241.186:80
20:36:45:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': Failed to connect to 18.218.241.186:80: Connection timed out
20:36:45:ERROR:WU00:FS00:Exception: Could not get an assignment
******************************* Date: 2020-04-17 *******************************
And I have a dual Xeon E5-2680 V4 system that only gets:

Code:

06:54:19:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
06:54:19:WU01:FS00:0xa7:ERROR:
06:54:19:WU01:FS00:0xa7:ERROR:Fatal error:
06:54:19:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 40 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
06:54:19:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
06:54:19:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
06:54:19:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
06:54:19:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
06:54:19:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
06:54:24:WU01:FS00:0xa7:WARNING:Unexpected exit() call
06:54:24:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
06:54:24:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
06:54:24:WU01:FS00:0xa7:Saving result file md.log
06:54:24:WU01:FS00:0xa7:Saving result file science.log
06:54:24:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
Any idea why the i5s and i7s work as expected, but the Xeons do not?
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud
Contact:

Re: Some computers getting work, others not?

Post by PantherX »

For the Xeon E3-1220 quad-core processor, we will need to see how many CPUs are configured for it to fold. Those messages mean that, for some reason, you can't connect to the Assignment Server. Please review this topic for troubleshooting steps: viewtopic.php?f=18&t=17794

For the dual Xeon E5-2680 V4 system, reduce the CPU count from 40 to 32 and see what happens. The domain decomposition message means that the assigned WU can't be successfully subdivided among the 40 CPUs, so it is unable to fold. Reducing the value helps in most cases.
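
To illustrate why 40 can fail where 32 succeeds, here is a rough sketch of the idea behind the error message. It is not how GROMACS actually chooses its decomposition (the real code also sets aside ranks for other work and applies further constraints). The box size `BOX_NM` is a made-up value, since the real box is not shown in the log; only the minimum cell size of 1.4227 nm comes from the error text. The function names are hypothetical as well.

```python
def grids(ranks):
    """Return every ordered (nx, ny, nz) with nx * ny * nz == ranks."""
    out = []
    for nx in range(1, ranks + 1):
        if ranks % nx:
            continue
        rest = ranks // nx
        for ny in range(1, rest + 1):
            if rest % ny == 0:
                out.append((nx, ny, rest // ny))
    return out


def feasible_grids(ranks, box_nm, min_cell_nm):
    """Keep only grids whose cells are at least min_cell_nm along every axis."""
    return [
        g for g in grids(ranks)
        if all(length / n >= min_cell_nm for length, n in zip(box_nm, g))
    ]


BOX_NM = (6.0, 6.0, 6.0)  # ASSUMPTION: hypothetical box, not taken from the log
MIN_CELL_NM = 1.4227      # from the error message

for ranks in (40, 32):
    print(ranks, feasible_grids(ranks, BOX_NM, MIN_CELL_NM))
```

With this assumed box, at most 4 cells fit along each axis, so 40 (which needs a factor of 5 somewhere) has no valid grid, while 32 can be laid out as 2 x 4 x 4.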
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Some computers getting work, others not?

Post by Neil-B »

If the 40 to 32 reduction works (it should; I rarely, if ever, get issues with my 32-thread slot), then you could fire up a second slot using the remaining 8 CPU threads.
ccgllc
Posts: 20
Joined: Sun Apr 05, 2020 5:09 am

Re: Some computers getting work, others not?

Post by ccgllc »

On the dual Xeon, I changed the slot type from CPU to SMP and set the number to 32 (the machine actually supports 56 threads). It got work instantly.

The E3-1220 is being less obvious. I checked the network with a ping to www.google.com (i.e., name resolution and the network itself are working).
The machine is behind a different firewall than the others, and "wget 65.254.110.245:8080" times out there, whereas it works fine from any machine on the other network. I've been experiencing weird network issues like this on that network, so off to solve it! Good news: at least I have a solid test case now!
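
For anyone who wants to repeat the same reachability test without wget, a minimal sketch in Python follows. The two endpoints are copied from the log at the top of the thread; the function name `can_connect` and the 5-second timeout are arbitrary choices for illustration. It only checks that a TCP connection opens, which is enough to tell a firewall block apart from a client problem.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False


# Endpoints taken from the log earlier in this thread.
SERVERS = [("65.254.110.245", 8080), ("18.218.241.186", 80)]

if __name__ == "__main__":
    for host, port in SERVERS:
        status = "reachable" if can_connect(host, port) else "unreachable"
        print(f"{host}:{port} {status}")
```

Running it on a machine from each network should show the difference: if the addresses are reachable from one network and not the other, the firewall or routing on the failing network is the likely culprit.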
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Some computers getting work, others not?

Post by bruce »

Some projects have discovered that 40 doesn't work and have excluded assignments to machines seeking work for 40 threads. I don't know if this is one of those projects. 32 + 8 is less likely to have trouble, or maybe 16 + 24.
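
As a rough way to list candidate splits like these, the sketch below enumerates two-slot combinations of a given thread total where both slot sizes have no prime factors other than 2 and 3. That filter is only a rule of thumb assumed here for illustration, not an official F@H or GROMACS rule, and the function names are hypothetical.

```python
def is_smooth(n):
    """True if n has no prime factor other than 2 or 3."""
    for p in (2, 3):
        while n % p == 0:
            n //= p
    return n == 1


def two_slot_splits(total):
    """Pairs (a, b) with a >= b >= 1 and a + b == total, both passing is_smooth."""
    return [
        (total - b, b)
        for b in range(1, total // 2 + 1)
        if is_smooth(b) and is_smooth(total - b)
    ]


print(two_slot_splits(40))  # [(36, 4), (32, 8), (24, 16)]
```

For 40 threads this yields 36 + 4, 32 + 8, and 24 + 16, which includes both splits mentioned above.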
ccgllc
Posts: 20
Joined: Sun Apr 05, 2020 5:09 am

Re: Some computers getting work, others not?

Post by ccgllc »

Thanks. In my case, I'm running 32 for FAH and letting BOINC's Rosetta suck up the rest.
Jesse_V
Site Moderator
Posts: 2850
Joined: Mon Jul 18, 2011 4:44 am
Hardware configuration: OS: Windows 10, Kubuntu 19.04
CPU: i7-6700k
GPU: GTX 970, GTX 1080 TI
RAM: 24 GB DDR4
Location: Western Washington

Re: Some computers getting work, others not?

Post by Jesse_V »

Another option is to add a second CPU slot and set its number of cores to 8. That way you have a 32+8 setup, as Bruce was saying, though I'm certain Rosetta appreciates the eight cores too. Either way works.