Some computers getting work, others not?

Posted: Fri Apr 17, 2020 6:55 am
by ccgllc
Hi all.

I have a pair of I5s that are receiving CPU work (no GPU installed).
I have an I5 (6) GPU box that keeps 2-6 CPUs busy.
I have an I7 that is receiving CPU work (no GPU installed).
I have a Xeon E3-1220 quad-core processor that only gets messages like:

Code:

******************************* Date: 2020-04-16 *******************************
15:13:41:WU00:FS00:Connecting to 65.254.110.245:8080
15:14:13:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': Failed to connect to 65.254.110.245:8080: Connection timed out
15:14:13:WU00:FS00:Connecting to 18.218.241.186:80
15:14:45:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': Failed to connect to 18.218.241.186:80: Connection timed out
15:14:45:ERROR:WU00:FS00:Exception: Could not get an assignment
20:35:41:WU00:FS00:Connecting to 65.254.110.245:8080
20:36:13:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': Failed to connect to 65.254.110.245:8080: Connection timed out
20:36:13:WU00:FS00:Connecting to 18.218.241.186:80
20:36:45:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': Failed to connect to 18.218.241.186:80: Connection timed out
20:36:45:ERROR:WU00:FS00:Exception: Could not get an assignment
******************************* Date: 2020-04-17 *******************************
And I have a dual Xeon E5-2680 V4 system that only gets:

Code:

06:54:19:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
06:54:19:WU01:FS00:0xa7:ERROR:
06:54:19:WU01:FS00:0xa7:ERROR:Fatal error:
06:54:19:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 40 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
06:54:19:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
06:54:19:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
06:54:19:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
06:54:19:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
06:54:19:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
06:54:24:WU01:FS00:0xa7:WARNING:Unexpected exit() call
06:54:24:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
06:54:24:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
06:54:24:WU01:FS00:0xa7:Saving result file md.log
06:54:24:WU01:FS00:0xa7:Saving result file science.log
06:54:24:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
Any idea why the I5s and I7s work as expected, but the Xeons do not?

Re: Some computers getting work, others not?

Posted: Fri Apr 17, 2020 7:07 am
by PantherX
For the Xeon E3-1220 quad-core processor, we will need to see how many CPUs are configured for it to fold. Those messages mean that, for some reason, you can't connect to the Assignment Server. Please review this topic for troubleshooting steps: viewtopic.php?f=18&t=17794

For the dual Xeon E5-2680 V4 system, reduce the CPU count from 40 to 32 and see what happens. The domain decomposition message means that the assigned WU can't be successfully subdivided among the 40 CPUs, so it is unable to fold. Reducing the value helps in most cases.
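To illustrate why 40 can fail where 32 succeeds, here is a minimal sketch of the idea behind the error message. It assumes a simplified model in which the ranks must form an nx x ny x nz grid and every cell edge must be at least the minimum cell size. The box dimensions are hypothetical (the log does not give them), and the real GROMACS logic is more involved, so treat this as an illustration only.

```python
from math import floor

def grid_shapes(ranks):
    """Yield every (nx, ny, nz) with nx * ny * nz == ranks."""
    for nx in range(1, ranks + 1):
        if ranks % nx:
            continue
        rest = ranks // nx
        for ny in range(1, rest + 1):
            if rest % ny == 0:
                yield nx, ny, rest // ny

def fits(ranks, box_nm, min_cell_nm):
    """True if some grid keeps every cell edge >= min_cell_nm."""
    # Largest number of cells that fits along each box edge.
    limits = [floor(edge / min_cell_nm) for edge in box_nm]
    return any(
        all(n <= limit for n, limit in zip(shape, limits))
        for shape in grid_shapes(ranks)
    )

BOX_NM = (6.0, 6.0, 6.0)  # hypothetical box size, NOT taken from the log
MIN_CELL_NM = 1.4227      # minimum cell size quoted in the error message

if __name__ == "__main__":
    for ranks in (40, 32, 8):
        print(ranks, fits(ranks, BOX_NM, MIN_CELL_NM))
```

With this assumed box, at most 4 cells fit along each edge. Every way of splitting 40 needs a factor of 5 or more on some axis, so no grid fits, while 32 can be laid out as 2 x 4 x 4.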

Re: Some computers getting work, others not?

Posted: Fri Apr 17, 2020 10:04 am
by Neil-B
If the 40 to 32 reduction works (it should; I rarely if ever get issues with my 32-thread slot), then you could fire up a second slot using the remaining 8 CPU threads.

Re: Some computers getting work, others not?

Posted: Fri Apr 17, 2020 6:19 pm
by ccgllc
On the dual Xeon, I changed the slot type from CPU to SMP and set the number to 32 (the machine actually supports 56 threads). It got work instantly.

The E3-1220 is being less obvious. I checked the network with a ping to www.google.com (i.e. name services are working and the network is working).
The machine is behind a different firewall than the others, and "wget 65.254.110.245:8080" times out, whereas it works fine on any machine on the other network. I've been experiencing weird network issues like this on that network, so off to solve it! Good news, at least I have a solid test case now!
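For anyone who wants to repeat the same check without wget, a rough sketch of a TCP reachability test follows. The two addresses and ports are the ones shown in the log above and may no longer be live; the function names `can_connect` and `report` are made up for this example.

```python
import socket

# Endpoints copied from the client log earlier in this thread.
ASSIGNMENT_SERVERS = [("65.254.110.245", 8080), ("18.218.241.186", 80)]

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False

def report(endpoints, timeout=5.0):
    """Map 'host:port' to a reachability flag for each endpoint."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout)
        for host, port in endpoints
    }
```

Calling `report(ASSIGNMENT_SERVERS)` on a machine from each network and comparing the results should show whether the firewall is the difference. A False here while ping to a public host succeeds points at port filtering and away from DNS or general connectivity.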

Re: Some computers getting work, others not?

Posted: Fri Apr 17, 2020 6:58 pm
by bruce
Some projects have discovered that 40 doesn't work and have excluded assignments to machines seeking work for 40. I don't know if this is one of those projects. 32 + 8 is less likely to have trouble, or maybe 16 + 24.
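As a rough way to list candidate splits like these, the sketch below enumerates two-slot combinations in which both thread counts have no prime factor larger than 3. That cutoff is an assumption chosen for illustration, not an official rule, and the function names are hypothetical.

```python
def largest_prime_factor(n):
    """Largest prime factor of n (returns 1 for n == 1)."""
    factor, largest = 2, 1
    while factor * factor <= n:
        while n % factor == 0:
            largest, n = factor, n // factor
        factor += 1
    return n if n > 1 else largest

def two_slot_splits(total, max_prime=3):
    """Pairs (big, small) summing to total, each with prime factors <= max_prime."""
    return [
        (total - small, small)
        for small in range(1, total // 2 + 1)
        if largest_prime_factor(small) <= max_prime
        and largest_prime_factor(total - small) <= max_prime
    ]

if __name__ == "__main__":
    print(two_slot_splits(40))  # [(36, 4), (32, 8), (24, 16)]
```

For 40 threads this yields 36 + 4, 32 + 8 and 24 + 16, which includes both splits mentioned above.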

Re: Some computers getting work, others not?

Posted: Fri Apr 17, 2020 8:55 pm
by ccgllc
Thanks. In my case, I'm running 32 for FAH and letting BOINC's Rosetta suck up the rest.

Re: Some computers getting work, others not?

Posted: Fri Apr 17, 2020 9:25 pm
by Jesse_V
Another option is to add another CPU slot and set the number of cores to 8. That way you have a 32+8 setup, as Bruce was saying, though I'm certain Rosetta appreciates the eight cores too. Either way works.