Project: 16451 (Run 48, Clone 1, Gen 56) Domain
Moderators: Site Moderators, FAHC Science Team
-
- Posts: 336
- Joined: Fri Jun 26, 2009 4:34 am
Project: 16451 (Run 48, Clone 1, Gen 56) Domain
Project 16451 caused a domain decomposition error with 48 CPUs. As r2w_ben has suggested for other problem children in the past, I tried it at 45, and it runs perfectly with 45 CPUs. I searched for 16451 reports and found nothing but announcements. I'm not sure how this made it to gen 56 without triggering an error. Just letting y'all know. I can post the appropriate logs if anyone needs the specifics.
Re: Project: 16451 (Run 48, Clone 1, Gen 56) Domain
The sample I have of p16451 would run on 48 threads. There are rare work units where the atoms have moved enough that the box changes shape or the estimated PME load changes.
For 4x4x3 projects, 48 works when PME load is around 0.18. Once it drops towards 0.17, these CPU counts no longer work: 24, 30, 36, 48, 54, and 60. Temporarily decreasing to 21, 27, 32, or 45 in those scenarios should allow the work unit to finish.
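To make the fallback rule above concrete, here is a minimal sketch in Python. It only encodes the thread counts quoted in this post for 4x4x3 projects at a PME load near 0.17. The helper name `pick_fallback` is hypothetical, and nothing here reflects how GROMACS or the FAH client actually chooses a decomposition.

```python
# Thread counts quoted in the post above for 4x4x3 projects once the
# estimated PME load drops toward 0.17. These lists come from the forum
# post, not from GROMACS itself.
FAILING_COUNTS = {24, 30, 36, 48, 54, 60}
FALLBACK_COUNTS = [21, 27, 32, 45]


def pick_fallback(requested: int) -> int:
    """Return the thread count to use (hypothetical helper).

    If the requested count is not on the failing list, keep it.
    Otherwise return the largest quoted fallback below it.
    """
    if requested not in FAILING_COUNTS:
        return requested
    lower = [n for n in FALLBACK_COUNTS if n < requested]
    if not lower:
        raise ValueError(f"no quoted fallback below {requested}")
    return max(lower)
```

With these lists, `pick_fallback(48)` gives 45, which matches the workaround reported in the first post.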
-
- Posts: 2040
- Joined: Sat Dec 01, 2012 3:43 pm
- Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441
Re: Project: 16451 (Run 48, Clone 1, Gen 56) Domain
Maybe a workaround for the problem in some work units would be for the FAHCore itself to scale down the number of threads when this error occurs?
-
- Site Admin
- Posts: 7926
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2 - Location: W. MA
Re: Project: 16451 (Run 48, Clone 1, Gen 56) Domain
Yes, that has been suggested and is being investigated. It is not just a change to the core, though, but also to the FAHCoreWrapper process it runs within. The wrapper would have to capture the domain decomposition error and restart the CPU core with the changed core count. It would also have to do this in a way that does not trigger the max error threshold and cause the WU to be returned as faulty.
But as mentioned, the bounding box can change in size during folding. That, together with a possible shift in the split between regular processing threads and PME threads at thread counts over 18-20, can make it a bit complicated.
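As a rough illustration of the capture-and-restart idea described above, here is a minimal Python sketch. Everything in it is an assumption for illustration: `DomainDecompositionError`, `run_with_fallback`, and `fake_core` are hypothetical names, and this is not how FAHCoreWrapper or the core is actually implemented. The point is only that decomposition failures are tracked separately, so they would not count toward an error threshold.

```python
class DomainDecompositionError(RuntimeError):
    """Hypothetical stand-in for the core's domain decomposition failure."""


def run_with_fallback(run_core, thread_counts):
    """Try each thread count in order until one succeeds.

    run_core is a hypothetical callable that runs the work unit with a
    given thread count and raises DomainDecompositionError on failure.
    Decomposition failures are recorded in `skipped` and are deliberately
    not treated as work-unit errors.
    """
    skipped = []
    for threads in thread_counts:
        try:
            result = run_core(threads)
        except DomainDecompositionError:
            skipped.append(threads)
            continue
        return threads, result, skipped
    raise DomainDecompositionError(f"all thread counts failed: {skipped}")


# Toy stand-in for the core: fails at 48 threads, succeeds otherwise.
def fake_core(threads):
    if threads == 48:
        raise DomainDecompositionError("cannot decompose at 48 threads")
    return "finished"


used, status, skipped = run_with_fallback(fake_core, [48, 45])
print(used, status, skipped)  # 45 finished [48]
```

A real implementation would also need to handle the case where the box shape or PME load changes mid-run, which this sketch ignores.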
-
- Posts: 336
- Joined: Fri Jun 26, 2009 4:34 am
Re: Project: 16451 (Run 48, Clone 1, Gen 56) Domain
_r2W_ben and Joe_H, 16451 behaved normally after this one work unit. I picked up eight more 16451s over the last two days and all ran perfectly with 48 threads. The domain decomposition problem is going to be a tough one to solve. Thank goodness it does not happen constantly - hats off to the beta crew and staffers for insulating us from this for the most part.