Fresh install on Windows fails with FileOpenError `Must be unbuffered` when attempting to open 'xxxx-0.fdq' file

When attempting to install fdb 5.1.7 on a newly installed win10 machine, we got this weird error:

<Event Severity="40" Time="1528381129.831068" Type="FileOpenError" Machine="127.0.0.1:4500" ID="0000000000000000" Reason="Must be unbuffered" Flags="131074" File="C:\ProgramData\foundationdb\data\4500\storage-811f51d3f2dd5414739706e38748a17b-0.fdq" logGroup="default" Backtrace=""/>
<Event Severity="40" Time="1528381129.831068" Type="DiskQueueShutdownError" Machine="127.0.0.1:4500" ID="811f51d3f2dd5414" Reason="unknown" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" logGroup="default" Backtrace=""/>

This is right after installing via the setup (that pre-configures the db using the memory storage engine).

When connecting to fdbcli, the cluster was unavailable. Attemtping to “configure single ssd” did not seem to work at first. Then after restarting everything, it now works fine, and it is using the ssd-2 storage engine. The last logged error for the node is DiskQueueShutdownError

I think that for some reason, the memory engine did try to open a file on windows without specifying the OPEN_UNBUFFERED that is tested in AsyncFileWinASIO::open(..) but I have difficulty, looking at the code, to find why.

After the restart, my guess is that the storage was successfully converted from memory to ssd-2 and the error went away.

The file name in the error is xxxx-0.fdq which seems to be only used by the Memory engine, so maybe only this path may forget to set the flag OPEN_UNBUFFERED in some cases?

The equivalent AsyncFileKAIO::open() for linux seem to check the flag via ASSERT( flags & OPEN_UNBUFFERED ); while the Widows version is more explicit in tracing and throwing an error… maybe there is difference in behavior there?

So this is indeed a DiskQueue file from a KeyValueStoreMemory. However, I’m baffled that the code to open this DiskQueue is the exact same code used to open a DiskQueue for the transaction logs, which appears to have succeeded in your case.

So I’m kind of more baffled about where this request is coming from. You wouldn’t happen to be able to offer a symbolicated backtrace, would you?

Sorry no, it was a long time ago and using the officially released binaries (we do not build our own binaries yet). And all recent install were done with 5.2.5 which did not seem to reproduce this particular issue.

I’ve had other problems with the Windows setup or linux packages creating a new cluster using the memory storage engine, and then having to transition to ssd-2 engine, or worse, stop the service, delete the data, and then restart just so we can join an existing cluster (we’ve had a recent customer issue were a chain of events starting from this brought down the entire cluster for several hours until we were able to diagnose the root cause).

As a side note: It would be really cool if the setup would NOT automatically create a new cluster (or at least a checkbox in the setup to do it or not).

We do have a set of basic tests that involve installing our windows package on … some version of windows, so I think we’d catch if creating a memory storage engine was horrifically broken. If you happen to have it occur again, and get any sort of information as to how/why, I’d be happy to take a look. I’ll at minimum flip the windows behavior to match the linux behavior, as I’m not aware of any reason that they should be different.

Yeah, the dockerization attempts has revealed this to be quite an annoyance. I’d be happy to see it removed, but unfortunately, there’s no one around that seems to have a solid understanding of our packaging scripts to be able to make them do the right thing easily.

I getting the same kind of errors again, on Windows, with v6.0.15, but this time it seems to happen while the files are being closed and then re-opened (right when doing a “configure double” after adding more nodes to a cluster)

see XML log below

The code that trigger the “must be unbuffered” error is in AsynFileWinASIO::open() (-> here)

The corresponding code for linux KAIO does an ASSERT(...) (-> here) while the Windows version does an actual if (..) { ... return io_error(); }, so maybe the bug is not visible with Linux release builds, but will blow up on Windows release builds?

EDIT: the configure double failure is unrelated, it’s a different issue that has been solved, see Unable to fake multiple "machine_id" on the same host for testing ). Though maybe the I/O error sometimes happening at the same time are a contributing factors…

Example log from one of the nodes:

<Event Severity="10" Time="1545415114.576911" Type="DQDestroy" ID="6a277c88dd62d671" LastPoppedSeq="745557" PoppedSeq="745557" NextPageSeq="815104" File0Name="C:\ProgramData\foundationdb\data\4501\logqueue-6a277c88dd62d6719e1e3995266d60ba-0.fdq" Machine="10.10.0.173:4501" LogGroup="default" Roles="RV,SS" />
<Event Severity="10" Time="1545415114.576911" Type="DiskQueueShutdownDeleting" ID="6a277c88dd62d671" File0="C:\ProgramData\foundationdb\data\4501\logqueue-6a277c88dd62d6719e1e3995266d60ba-0.fdq" File1="C:\ProgramData\foundationdb\data\4501\logqueue-6a277c88dd62d6719e1e3995266d60ba-1.fdq" Machine="10.10.0.173:4501" LogGroup="default" Roles="RV,SS" />
<Event Severity="40" Time="1545415114.576911" Type="FileOpenError" ID="0000000000000000" Reason="Must be unbuffered" Flags="131074" File="C:\ProgramData\foundationdb\data\4501\logqueue-6a277c88dd62d6719e1e3995266d60ba-0.fdq" Backtrace="" Machine="10.10.0.173:4501" LogGroup="default" Roles="RV,SS" />
<Event Severity="40" Time="1545415114.576911" Type="DiskQueueShutdownError" ID="6a277c88dd62d671" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" Reason="unknown" Backtrace="" Machine="10.10.0.173:4501" LogGroup="default" Roles="RV,SS" />
<Event Severity="10" Time="1545415114.576911" Type="TLogTerminated" ID="6a277c88dd62d671" Error="worker_removed" ErrorDescription="Normal worker shut down" ErrorCode="1202" Machine="10.10.0.173:4501" LogGroup="default" Roles="RV,SS" />
...
<Event Severity="10" Time="1545415114.588058" Type="KVWriterDestroying" ID="6a277c88dd62d671" Machine="10.10.0.173:4501" LogGroup="default" Roles="RV,SS" />
<Event Severity="10" Time="1545415114.588058" Type="KVWriterDestroyed" ID="6a277c88dd62d671" Machine="10.10.0.173:4501" LogGroup="default" Roles="RV,SS" />
<Event Severity="40" Time="1545415114.595561" Type="FileOpenError" ID="0000000000000000" Reason="Must be unbuffered" Flags="131074" File="C:\ProgramData\foundationdb\data\4501\log-6a277c88dd62d6719e1e3995266d60ba.sqlite" Backtrace="" Machine="10.10.0.173:4501" LogGroup="default" Roles="RV,SS" />
<Event Severity="40" Time="1545415114.595561" Type="KVDoCloseError" ID="6a277c88dd62d671" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" Reason="unknown" Backtrace="" Machine="10.10.0.173:4501" LogGroup="default" Roles="RV,SS" />
<Event Severity="10" Time="1545415114.595561" Type="KVClosed" ID="6a277c88dd62d671" Machine="10.10.0.173:4501" LogGroup="default" Roles="RV,SS" />

I see the same things on several of the processes, and they are happening for log-*, logqueue-*, storage-* files, so bascially any file.

Meng posted PR #1232 to make windows and linux match in their error handling here. Looking through the code again, I still can’t see how a disk queue can get opened without UNBUFFERED, and the fact that windows appears to be missing stack traces is very unfortunate. :confused:

2 Likes

Right now, we just make the sanity check consistent both in Linux and Windows.
Hopefully, the weird error can surface in the Linux environment and preferably in the simulation test later.

Thanks for fixing this!

I also tried to find where UNBUFFERED comes from, but without any luck.

The issue seems to happen randomly (on several identical VM, only one showed the problem) so I suspect some sort of race condition?

I would say if the problem only exists in VM, it may be some kind of subtle bug in hypervisor (or VM monitor).

It happened on both Hyper-V (ours), VMWare (customer), and also one time on a brand new physical machine during installation (top message of this thread). It was also accross multiple versions of fdb (though this part of the code probably did not change).

It’s doing it again! Everytime I start the service on this particular machine (4 processes), at least one or two processes complain about Must be unbuffered. I tried stopping the service, removing all previous data, restarting, doing configure new single ssd and everytime I restart the services, I get the same result.

The pattern I see is a StorageServerTerminated event, followed by a bunch of CoroStop, then a AllAlternativesFailed and finally FileOpenError/KVDoCloseError. This time it’s on a .sqlite file.

v6.0.15, 4 processes on single machine, data split between two disks (C: and D:)

Config file:

[fdbmonitor]
restart_delay = 20

[general]
cluster_file=C:\ProgramData\foundationdb\fdb.cluster

[fdbserver]
public_address = auto:$ID
listen_address = public
parentpid = $PID
command=C:\Program Files\foundationdb\bin\fdbserver.exe
datadir=C:\ProgramData\foundationdb\data\$ID
logdir=C:\ProgramData\foundationdb\logs

[fdbserver.4500]

[fdbserver.4501]

[fdbserver.4502]
datadir=D:\foundationdb\data\$ID

[fdbserver.4503]
datadir=D:\foundationdb\data\$ID

[backup-agent]
command=C:\Program Files\foundationdb\bin\backup_agent.exe
logdir=C:\ProgramData\foundationdb\logs

[backup-agent.1]

Log extract:

...
<Event Severity="10" Time="1553945442.597377" Type="TLogJoinedMe" ID="c512472fafc50a3d" TLog="56659ccf2ca43a92" Address="10.0.1.26:4501" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.597377" Type="CoroStop" ID="0000000000000000" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.597377" Type="KVWriterDestroying" ID="b60c68f39e5a077c" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.597377" Type="KVWriterDestroyed" ID="b60c68f39e5a077c" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.600754" Type="KVClose" ID="d9a605321616b006" Del="1" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.600754" Type="WorkPool_Stop" ID="0000000000000000" Workers="1" Idle="1" Work="0" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.600754" Type="WorkPool_Stop" ID="0000000000000000" Workers="64" Idle="64" Work="0" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.600754" Type="StorageServerTerminated" ID="d9a605321616b006" Error="worker_removed" ErrorDescription="Normal worker shut down" ErrorCode="1202" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.605091" Type="CommitAttachID" ID="4ba4f4861037259a" To="8c45ffdb86141495" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.605091" Type="CommitDebug" ID="8c45ffdb86141495" Location="TLog.tLogCommit.BeforeWaitForVersion" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.605091" Type="CommitDebug" ID="8c45ffdb86141495" Location="TLog.tLogCommit.Before" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.605091" Type="CommitDebug" ID="8c45ffdb86141495" Location="TLog.tLogCommit.AfterTLogCommit" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.606858" Type="CommitDebug" ID="8c45ffdb86141495" Location="TLog.tLogCommit.After" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.606858" Type="CoroStop" ID="0000000000000000" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.606858" Type="CoroStop" ID="0000000000000000" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.606858" Type="CoroStop" ID="0000000000000000" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.606858" Type="CoroStop" ID="0000000000000000" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
... same event repeated 10 times....
<Event Severity="10" Time="1553945442.608516" Type="CoroStop" ID="0000000000000000" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.608516" Type="CoroStop" ID="0000000000000000" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.608782" Type="CodeCoverage" ID="0000000000000000" File="DiskQueue.actor.cpp" Line="655" Condition="contents.size() &amp;&amp; pushedPageCount()" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.608782" Type="CoroStop" ID="0000000000000000" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
... removed 30+ times same event...
<Event Severity="10" Time="1553945442.609032" Type="CoroStop" ID="0000000000000000" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.609032" Type="CoroStop" ID="0000000000000000" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.609032" Type="CoroStop" ID="0000000000000000" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.609290" Type="AllAlternativesFailed" ID="0000000000000000" Interval="1" Alternatives="bebba47071638ebd" Delay="0.05" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="40" Time="1553945442.609290" Type="FileOpenError" ID="0000000000000000" Reason="Must be unbuffered" Flags="131074" File="D:\foundationdb\data\4503\storage-b60c68f39e5a077c0c974b6d0118c118.sqlite" Backtrace="" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="40" Time="1553945442.609290" Type="KVDoCloseError" ID="b60c68f39e5a077c" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" Reason="unknown" Backtrace="" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.609290" Type="KVClosed" ID="b60c68f39e5a077c" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.609290" Type="Role" ID="b60c68f39e5a077c" Error="success" ErrorDescription="Success" ErrorCode="0" Transition="End" As="StorageServer" Reason="Error" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1553945442.609290" Type="WorkPool_Stop" ID="0000000000000000" Workers="1" Idle="0" Work="0" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.609290" Type="WorkPool_Stop" ID="0000000000000000" Workers="64" Idle="0" Work="0" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.612864" Type="TransactionAttachID" ID="10188f69184bbdef" To="edd39d8b61f68931" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.612864" Type="TransactionDebug" ID="edd39d8b61f68931" Location="TLogServer.TLogConfirmRunningRequest" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.612864" Type="CoroStop" ID="0000000000000000" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.612864" Type="KVWriterDestroying" ID="d9a605321616b006" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.612864" Type="KVWriterDestroyed" ID="d9a605321616b006" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="40" Time="1553945442.616773" Type="FileOpenError" ID="0000000000000000" Reason="Must be unbuffered" Flags="131074" File="D:\foundationdb\data\4503\storage-d9a605321616b0064898aff70659a5a5.sqlite" Backtrace="" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="40" Time="1553945442.616773" Type="KVDoCloseError" ID="d9a605321616b006" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" Reason="unknown" Backtrace="" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.616773" Type="KVClosed" ID="d9a605321616b006" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.616773" Type="Role" ID="d9a605321616b006" Error="success" ErrorDescription="Success" ErrorCode="0" Transition="End" As="StorageServer" Reason="Error" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" TrackLatestType="Original" />
<Event Severity="10" Time="1553945442.616773" Type="WorkPool_Stop" ID="0000000000000000" Workers="1" Idle="0" Work="0" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.616773" Type="WorkPool_Stop" ID="0000000000000000" Workers="64" Idle="0" Work="0" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.618938" Type="SSTag" ID="ff08bffb954efca7" MyTag="0:2" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
<Event Severity="10" Time="1553945442.618938" Type="StorageServerStartingCore" ID="ff08bffb954efca7" TimeTaken="1.34955" Machine="10.0.1.26:4503" LogGroup="default" Roles="MS,SS,TL" />
...

Noticed that I forgot to remove data files on second disk while recreating a new cluster, so wiped everything, retried. On first two restart no issues, but on 3rd attempt, one of the process had the issue, this time on a tlog file. Here is extract of the log:

On all occurrences I’ve seen so that, it’s always after a role (storage, tlog, …) is terminated…?

<Event Severity="10" Time="1553946541.844301" Type="TLogError" ID="0c07d4b07d4cbefb" Error="worker_removed" ErrorDescription="Normal worker shut down" ErrorCode="1202" Machine="10.0.1.26:4502" LogGroup="default" Roles="MP,RV,SS" />
<Event Severity="10" Time="1553946541.844301" Type="Role" ID="0c07d4b07d4cbefb" Transition="End" As="SharedTLog" Reason="Error" Machine="10.0.1.26:4502" LogGroup="default" Roles="MP,RV,SS" TrackLatestType="Original" />
<Event Severity="10" Time="1553946541.844301" Type="KVClose" ID="0c07d4b07d4cbefb" Del="1" Machine="10.0.1.26:4502" LogGroup="default" Roles="MP,RV,SS" />
<Event Severity="10" Time="1553946541.844301" Type="WorkPool_Stop" ID="0000000000000000" Workers="1" Idle="0" Work="7" Machine="10.0.1.26:4502" LogGroup="default" Roles="MP,RV,SS" />
<Event Severity="10" Time="1553946541.844301" Type="WorkPool_Stop" ID="0000000000000000" Workers="64" Idle="64" Work="0" Machine="10.0.1.26:4502" LogGroup="default" Roles="MP,RV,SS" />
<Event Severity="10" Time="1553946541.844301" Type="DQDestroy" ID="0c07d4b07d4cbefb" LastPoppedSeq="753749" PoppedSeq="753749" NextPageSeq="761856" File0Name="D:\foundationdb\data\4502\logqueue-0c07d4b07d4cbefb4f64550649c1f256-1.fdq" Machine="10.0.1.26:4502" LogGroup="default" Roles="MP,RV,SS" />
<Event Severity="10" Time="1553946541.844301" Type="DiskQueueShutdownDeleting" ID="0c07d4b07d4cbefb" File0="D:\foundationdb\data\4502\logqueue-0c07d4b07d4cbefb4f64550649c1f256-0.fdq" File1="D:\foundationdb\data\4502\logqueue-0c07d4b07d4cbefb4f64550649c1f256-1.fdq" Machine="10.0.1.26:4502" LogGroup="default" Roles="MP,RV,SS" />
<Event Severity="40" Time="1553946541.844301" Type="FileOpenError" ID="0000000000000000" Reason="Must be unbuffered" Flags="131074" File="D:\foundationdb\data\4502\logqueue-0c07d4b07d4cbefb4f64550649c1f256-0.fdq" Backtrace="" Machine="10.0.1.26:4502" LogGroup="default" Roles="MP,RV,SS" />
<Event Severity="40" Time="1553946541.844301" Type="DiskQueueShutdownError" ID="0c07d4b07d4cbefb" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" Reason="unknown" Backtrace="" Machine="10.0.1.26:4502" LogGroup="default" Roles="MP,RV,SS" />
<Event Severity="10" Time="1553946541.844301" Type="TLogTerminated" ID="0c07d4b07d4cbefb" Error="worker_removed" ErrorDescription="Normal worker shut down" ErrorCode="1202" Machine="10.0.1.26:4502" LogGroup="default" Roles="MP,RV,SS" />
<Event Severity="10" Time="1553946541.844301" Type="Role" ID="0c07d4b07d4cbefb" Error="success" ErrorDescription="Success" ErrorCode="0" Transition="End" As="SharedTLog" Reason="Error" Machine="10.0.1.26:4502" LogGroup="default" Roles="MP,RV,SS" TrackLatestType="Original" />
<Event Severity="10" Time="1553946541.844301" Type="CodeCoverage" ID="0000000000000000" File="storageserver.actor.cpp" Line="3236" Condition="self-&gt;logSystem" Machine="10.0.1.26:4502" LogGroup="default" Roles="MP,RV,SS" />

Looking at the code, it lead me to the deletion of old files:

I see the trace for DiskQueueShutdownDeleting but not for DiskQueueShutdownComplete and instead we get an error DiskQueueShutdownError which cause an error on the DiskQueue actor?

This points to incrementalDeleteHelper() that attempts to open the file (to get it’s size) but does not specify the OPEN_UNBUFFERED flag, triggering the failure in AsyncFileWinASIO::open()!

This explains why it does not happen on first boot (no files yet) or after a full cleanup (no files to delete), but when restarting the service OR when changing from one storage engine to the other.

On Windows, before the recent change (and in the binary that I’m running), this was a hard error instead of an ASSERT, but then why is it not ASSERTing on both linux and windows when this code runs in debug mode?

I’ve opened an issue for this: https://github.com/apple/foundationdb/issues/1380

Fixed in https://github.com/apple/foundationdb/pull/1396