Fatal Error with WU

Moderators: Site Moderators, FAHC Science Team

_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: Fatal Error with WU

Post by _r2w_ben »

jaos wrote:Another bad WU
Can you also include the following lines in your reports? (The first line will always be present but the second one is rare.)

Code:

12:57:58:WU03:FS00:Requesting new work unit for slot 00: RUNNING cpu:21 from 128.252.203.4
01:30:04:WARNING:WU02:FS00:AS lowered CPUs from 21 to 18
Then it's clear how many CPUs the assignment server based its decision on and whether it instructed the client to use fewer. Thanks!
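If it helps when putting a report together, here is a minimal sketch for pulling those two lines out of a saved log. The regular expressions are based only on the two sample lines quoted above, so other client versions may word the messages differently; the function name `extract_cpu_lines` is just an illustrative choice.

```python
import re

# Patterns derived from the two sample lines in this post (an assumption:
# other client versions may format these messages differently).
REQUEST_RE = re.compile(
    r"WU(\d+):FS(\d+):Requesting new work unit for slot \d+: \w+ cpu:(\d+) from (\S+)"
)
LOWERED_RE = re.compile(
    r"WARNING:WU(\d+):FS(\d+):AS lowered CPUs from (\d+) to (\d+)"
)


def extract_cpu_lines(log_text):
    """Return (requests, lowered): the CPU-related lines found in a client log."""
    requests, lowered = [], []
    for line in log_text.splitlines():
        match = REQUEST_RE.search(line)
        if match:
            requests.append({
                "wu": match.group(1),
                "slot": match.group(2),
                "cpus": int(match.group(3)),
                "server": match.group(4),
            })
            continue
        match = LOWERED_RE.search(line)
        if match:
            lowered.append({
                "wu": match.group(1),
                "slot": match.group(2),
                "from": int(match.group(3)),
                "to": int(match.group(4)),
            })
    return requests, lowered
```

An empty `lowered` list for a failing WU would mean the assignment server never told the client to use fewer CPUs.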
uyaem
Posts: 219
Joined: Sat Mar 21, 2020 7:35 pm
Location: Esslingen, Germany

Re: Fatal Error with WU

Post by uyaem »

I had the same failure on the same project, but the 2nd line isn't present.
Should it be on default log level?

On the upside, the client discards the WU very quickly, so I'd guess it will be available to another user soon.
For the sake of completeness, here's my log (the German message "Der Prozess kann nicht ..." translates to "The process cannot access the file because it is being used by another process", but the file does get cleaned up eventually):

Code:

22:39:20:WU00:FS00:0xa7:*********************** Log Started 2020-06-09T22:39:19Z ***********************
22:39:20:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
22:39:20:WU00:FS00:0xa7:       Type: 0xa7
22:39:20:WU00:FS00:0xa7:       Core: Gromacs
22:39:20:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 706 -lifeline 15928 -checkpoint 15 -np
22:39:20:WU00:FS00:0xa7:             21
22:39:20:WU00:FS00:0xa7:************************************ CBang *************************************
22:39:20:WU00:FS00:0xa7:       Date: Oct 26 2019
22:39:20:WU00:FS00:0xa7:       Time: 01:38:25
22:39:20:WU00:FS00:0xa7:   Revision: c46a1a011a24143739ac7218c5a435f66777f62f
22:39:20:WU00:FS00:0xa7:     Branch: master
22:39:20:WU00:FS00:0xa7:   Compiler: Visual C++ 2008
22:39:20:WU00:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
22:39:20:WU00:FS00:0xa7:   Platform: win32 10
22:39:20:WU00:FS00:0xa7:       Bits: 64
22:39:20:WU00:FS00:0xa7:       Mode: Release
22:39:20:WU00:FS00:0xa7:************************************ System ************************************
22:39:20:WU00:FS00:0xa7:        CPU: AMD Ryzen 9 3900X 12-Core Processor
22:39:20:WU00:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
22:39:20:WU00:FS00:0xa7:       CPUs: 24
22:39:20:WU00:FS00:0xa7:     Memory: 31.95GiB
22:39:20:WU00:FS00:0xa7:Free Memory: 19.42GiB
22:39:20:WU00:FS00:0xa7:    Threads: WINDOWS_THREADS
22:39:20:WU00:FS00:0xa7: OS Version: 6.2
22:39:20:WU00:FS00:0xa7:Has Battery: false
22:39:20:WU00:FS00:0xa7: On Battery: false
22:39:20:WU00:FS00:0xa7: UTC Offset: 2
22:39:20:WU00:FS00:0xa7:        PID: 21004
22:39:20:WU00:FS00:0xa7:        CWD: C:\Users\X\AppData\Roaming\FAHClient\work
22:39:20:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
22:39:20:WU00:FS00:0xa7:    Version: 0.0.18
22:39:20:WU00:FS00:0xa7:     Author: Joseph Coffland <[email protected]>
22:39:20:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
22:39:20:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
22:39:20:WU00:FS00:0xa7:       Date: Oct 26 2019
22:39:20:WU00:FS00:0xa7:       Time: 01:52:30
22:39:20:WU00:FS00:0xa7:   Revision: c1e3513b1bc0c16013668f2173ee969e5995b38e
22:39:20:WU00:FS00:0xa7:     Branch: master
22:39:20:WU00:FS00:0xa7:   Compiler: Visual C++ 2008
22:39:20:WU00:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
22:39:20:WU00:FS00:0xa7:   Platform: win32 10
22:39:20:WU00:FS00:0xa7:       Bits: 64
22:39:20:WU00:FS00:0xa7:       Mode: Release
22:39:20:WU00:FS00:0xa7:************************************ Build *************************************
22:39:20:WU00:FS00:0xa7:       SIMD: avx_256
22:39:20:WU00:FS00:0xa7:********************************************************************************
22:39:20:WU00:FS00:0xa7:Project: 14524 (Run 553, Clone 3, Gen 19)
22:39:20:WU00:FS00:0xa7:Unit: 0x0000001e80fccb0a5e781bdd6f4762b6
22:39:20:WU00:FS00:0xa7:Reading tar file core.xml
22:39:20:WU00:FS00:0xa7:Reading tar file frame19.tpr
22:39:20:WU00:FS00:0xa7:Digital signatures verified
22:39:20:WU00:FS00:0xa7:Calling: mdrun -s frame19.tpr -o frame19.trr -x frame19.xtc -cpt 15 -nt 21
22:39:20:WU00:FS00:0xa7:Steps: first=4750000 total=250000
22:39:20:WU00:FS00:0xa7:ERROR:
22:39:20:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
22:39:20:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
22:39:20:WU00:FS00:0xa7:ERROR:Source code file: C:\build\fah\core-a7-avx-release\windows-10-64bit-core-a7-avx-release\gromacs-core\build\gromacs\src\gromacs\mdlib\domdec.c, line: 6902
22:39:20:WU00:FS00:0xa7:ERROR:
22:39:20:WU00:FS00:0xa7:ERROR:Fatal error:
22:39:20:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 16 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
22:39:20:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
22:39:20:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
22:39:20:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
22:39:20:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
22:39:20:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
22:39:24:WU00:FS00:0xa7:WARNING:Unexpected exit() call
22:39:24:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
22:39:24:WU00:FS00:0xa7:Saving result file ..\logfile_01.txt
22:39:24:WU00:FS00:0xa7:Saving result file md.log
22:39:24:WU00:FS00:0xa7:Saving result file science.log
22:39:24:WU00:FS00:0xa7:WARNING:While cleaning up: boost::filesystem::remove: Der Prozess kann nicht auf die Datei zugreifen, da sie von einem anderen Prozess verwendet wird: "01/md.log"
22:39:24:WU00:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
22:39:25:WU02:FS00:Upload 42.28%
22:39:25:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
22:39:25:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:14524 run:553 clone:3 gen:19 core:0xa7 unit:0x0000001e80fccb0a5e781bdd6f4762b6
CPU: Ryzen 9 3900X (1x21 CPUs) ~ GPU: nVidia GeForce GTX 1660 Super (Asus)
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: Fatal Error with WU

Post by PantherX »

The line about AS lowered CPUs will be present in your log file at a default logging level.

Can you please show the log file where it requested that WU from the Server?
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: Fatal Error with WU

Post by _r2w_ben »

uyaem wrote:I had the same failure on the same project, but the 2nd line isn't present.
Should it be on default log level?
Since this project is not supposed to be assigned to 21 threads, the second line being present would mean that the servers realised that and said, "You can have this work unit, but please run it on 18 threads because that number is allowed." If it was working as expected, the message would be there at the default log level.
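As a rough illustration of that decision, here is a minimal sketch. It assumes the server simply picks the largest permitted thread count that does not exceed the request; that rule, the function name, and the example list are all hypothetical, since the real per-project list and server logic are not shown in this thread.

```python
def lowered_cpu_count(requested, allowed):
    """Largest allowed thread count that does not exceed the requested one.

    This selection rule is an assumption made for illustration only.
    """
    candidates = [n for n in allowed if n <= requested]
    if not candidates:
        raise ValueError("no allowed thread count at or below the request")
    return max(candidates)


# HYPOTHETICAL allowed list; the actual list for this project is unknown.
EXAMPLE_ALLOWED = [1, 2, 3, 4, 6, 8, 12, 16, 18, 24]

print(lowered_cpu_count(21, EXAMPLE_ALLOWED))  # 18
```

With that made-up list, a request for 21 comes back as 18, which is the kind of change the "AS lowered CPUs from 21 to 18" message reports.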
uyaem
Posts: 219
Joined: Sat Mar 21, 2020 7:35 pm
Location: Esslingen, Germany

Re: Fatal Error with WU

Post by uyaem »

Posting everything up to the line posted previously. I'm keeping it complete, even if it contains the download of the next WU, just so we don't miss anything.
Please keep in mind that this log is a few days old... :)

Code:

22:38:48:WU00:FS00:Connecting to assign1.foldingathome.org:80
22:38:48:WU00:FS00:Assigned to work server 128.252.203.10
22:38:48:WU00:FS00:Requesting new work unit for slot 00: RUNNING cpu:21 from 128.252.203.10
22:38:48:WU00:FS00:Connecting to 128.252.203.10:8080
22:38:49:WU00:FS00:Downloading 1.06MiB
22:38:50:WU00:FS00:Download complete
22:38:50:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:14524 run:553 clone:3 gen:19 core:0xa7 unit:0x0000001e80fccb0a5e781bdd6f4762b6
22:39:17:WU02:FS00:0xa7:Completed 250000 out of 250000 steps (100%)
22:39:19:WU02:FS00:0xa7:Saving result file ..\logfile_01.txt
22:39:19:WU02:FS00:0xa7:Saving result file dhdl.xvg
22:39:19:WU02:FS00:0xa7:Saving result file frame101.trr
22:39:19:WU02:FS00:0xa7:Saving result file md.log
22:39:19:WU02:FS00:0xa7:Saving result file pullf.xvg
22:39:19:WU02:FS00:0xa7:Saving result file pullx.xvg
22:39:19:WU02:FS00:0xa7:Saving result file science.log
22:39:19:WU02:FS00:0xa7:Saving result file traj_comp.xtc
22:39:19:WU02:FS00:0xa7:Folding@home Core Shutdown: FINISHED_UNIT
22:39:19:WU02:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
22:39:19:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14722 run:38 clone:0 gen:101 core:0xa7 unit:0x0000007c9bf7a4d65ea0712cb6852f9b
22:39:19:WU02:FS00:Uploading 6.80MiB to 155.247.164.214
22:39:19:WU00:FS00:Starting
22:39:19:WU02:FS00:Connecting to 155.247.164.214:8080
22:39:19:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\X\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah/FahCore_a7.exe -dir 00 -suffix 01 -version 706 -lifeline 8288 -checkpoint 15 -np 21
22:39:19:WU00:FS00:Started FahCore on PID 15928
22:39:19:WU00:FS00:Core PID:21004
22:39:19:WU00:FS00:FahCore 0xa7 started
22:39:20:WU00:FS00:0xa7:*********************** Log Started 2020-06-09T22:39:19Z ***********************
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Location: Land Of The Long White Cloud

Re: Fatal Error with WU

Post by PantherX »

Thanks for the log file, uyaem. I have notified the researcher about this.

FYI, I personally use this:
<next-unit-percentage v='100'/>
I use it because I don't want the downloaded WU to sit waiting while the last 1% of the current WU finishes. Thus, I do gain a tiny amount of points :)
uyaem
Posts: 219
Joined: Sat Mar 21, 2020 7:35 pm
Location: Esslingen, Germany

Re: Fatal Error with WU

Post by uyaem »

PantherX wrote:Thanks for the log file, uyaem. I have notified the researcher about this.

FYI, I personally use this:
<next-unit-percentage v='100'/>
I use it because I don't want the downloaded WU to sit waiting while the last 1% of the current WU finishes. Thus, I do gain a tiny amount of points :)
Maybe I shouldn't be asking this, but are you sure this is correct?

Code:

  next-unit-percentage <integer=99>
    Pre-download the next work unit when the current one is this far along.
The way I understand this is that the next WU will be downloaded once the current one is at X percent.
Wouldn't a setting of 100 mean that the download only starts after the current one completes?
I don't understand how this would gain you extra points.
Joe_H
Site Admin
Posts: 7937
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Fatal Error with WU

Post by Joe_H »

uyaem wrote:The way I understand this is that the next WU will be downloaded once the current one is at X percent.
Wouldn't a setting of 100 mean that the download only starts after the current one completes?
I don't understand how this would gain you extra points.
There are two parts to this. The bonus is based on the download and upload times for the WU. Depending on the TPF for the current WU, a download at 99% can be sitting on your computer waiting for 1 minute, or 20 minutes. That reduces the bonus slightly.

When a WU gets to 100% there is still some post-processing to be completed before it gets sent in, but the download starts right then. On anything but a slow internet connection most WUs can download before the post-processing finishes and start immediately.

You can see this in the log file entries: 100% is reached, a download starts, there are some messages about the just-completed WU being prepared for return, and then the folding core exits. Depending on its size, the download should also show as completing somewhere in that message sequence, and as soon as the folding core exits for the just-completed WU, the new one is started.

Personally I leave the setting at the default of 99%. I am on a DSL connection, so some of the time a WU download has not finished before the upload starts. An upload and a download happening at the same time have a very negative effect on my connection.
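To put rough numbers on the waiting time described above, here is a minimal sketch. It assumes every 1% of progress takes one TPF, that the download starts exactly at the threshold, and it ignores the post-processing that follows 100%; the function and parameter names are only illustrative.

```python
def idle_wait_seconds(tpf_seconds, next_unit_percentage, download_seconds):
    """Estimate how long a pre-downloaded WU sits idle before it can start.

    Simplifying assumptions: each 1% of the current WU takes tpf_seconds,
    the download begins exactly at next_unit_percentage, and post-processing
    after 100% is ignored.
    """
    remaining = (100 - next_unit_percentage) * tpf_seconds
    return max(0.0, remaining - download_seconds)


# A TPF of 1 minute versus 20 minutes, with a 2-second download at 99%.
print(idle_wait_seconds(60, 99, 2))     # 58
print(idle_wait_seconds(1200, 99, 2))   # 1198
# With the setting at 100, this estimate has no idle wait at all.
print(idle_wait_seconds(1200, 100, 2))  # 0.0
```

The estimate is only as good as the TPF you feed it, and on a slow connection the download time can exceed the remaining folding time, in which case the wait is zero and the new WU simply starts late.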

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
uyaem
Posts: 219
Joined: Sat Mar 21, 2020 7:35 pm
Location: Esslingen, Germany

Re: Fatal Error with WU

Post by uyaem »

Ah of course, gotcha. :)

EDIT: If only it wasn't an integer value, I'd be min/maxing to 99.8 ;)
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Fatal Error with WU

Post by bruce »

WUs come in different sizes, which means the data consolidation step can take a varying amount of time. In the log above, it took from 22:39:17 to 22:39:19 ... 2 seconds. From what I'm reading about unreleased cores, the sizes of the upload package and the download package are growing, though there are still plenty of small WUs. They're adding compression to the sequence, which will add a certain amount of time to that 2 seconds.

The new WU was requested at 22:38:48 and started at 22:39:19, 31 seconds later (the download itself completed at 22:38:50). So in this case, the early download might have cost you about 29 seconds. Hardly worth worrying about, either way, given that the processing generally takes many hours.
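For anyone who wants to check the arithmetic against their own log, here is a minimal sketch that computes the gaps between the HH:MM:SS timestamps quoted from the log above. The helper name `seconds_between` is just illustrative, and it assumes the two timestamps are less than 24 hours apart.

```python
from datetime import datetime


def seconds_between(start, end):
    """Seconds from start to end, both given as HH:MM:SS log timestamps.

    The modulo handles a pair that crosses midnight; intervals of 24 hours
    or more cannot be distinguished from shorter ones.
    """
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds()) % 86400


# Timestamps taken from the log posted earlier in this thread.
print(seconds_between("22:39:17", "22:39:19"))  # 2  (100% reached -> core shutdown)
print(seconds_between("22:38:48", "22:39:19"))  # 31 (WU requested -> WU started)
print(seconds_between("22:38:48", "22:38:50"))  # 2  (WU requested -> download complete)
```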
Rel25917
Posts: 303
Joined: Wed Aug 15, 2012 2:31 am

Re: Fatal Error with WU

Post by Rel25917 »

The next-unit-percentage setting was much more useful before everyone had super-fast broadband connections, but it can still be useful for some people.