FoundationDB

Fresh install on Windows fails with FileOpenError `Must be unbuffered` when attempting to open 'xxxx-0.fdq' file


(Christophe Chevalier) #1

When attempting to install fdb 5.1.7 on a newly installed win10 machine, we got this weird error:

<Event Severity="40" Time="1528381129.831068" Type="FileOpenError" Machine="127.0.0.1:4500" ID="0000000000000000" Reason="Must be unbuffered" Flags="131074" File="C:\ProgramData\foundationdb\data\4500\storage-811f51d3f2dd5414739706e38748a17b-0.fdq" logGroup="default" Backtrace=""/>
<Event Severity="40" Time="1528381129.831068" Type="DiskQueueShutdownError" Machine="127.0.0.1:4500" ID="811f51d3f2dd5414" Reason="unknown" Error="io_error" ErrorDescription="Disk i/o operation failed" ErrorCode="1510" logGroup="default" Backtrace=""/>

This is right after installing via the setup (that pre-configures the db using the memory storage engine).

When connecting with fdbcli, the cluster was unavailable. Attempting to “configure single ssd” did not seem to work at first. Then, after restarting everything, it now works fine and is using the ssd-2 storage engine. The last logged error for the node is DiskQueueShutdownError.
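For reference, this is roughly the sequence of fdbcli commands involved (a sketch only; status output is omitted and will vary):

$ fdbcli
fdb> status
fdb> configure single ssd
fdb> exit

then restart the FoundationDB service and re-check with:

$ fdbcli --exec "status"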

I think that, for some reason, the memory engine tried to open a file on Windows without specifying the OPEN_UNBUFFERED flag that is tested in AsyncFileWinASIO::open(..), but looking at the code, I have difficulty finding why.

After the restart, my guess is that the storage was successfully converted from memory to ssd-2 and the error went away.

The file name in the error is xxxx-0.fdq, which seems to be used only by the memory engine, so maybe only this code path forgets to set the OPEN_UNBUFFERED flag in some cases?

The equivalent AsyncFileKAIO::open() for Linux seems to check the flag via ASSERT( flags & OPEN_UNBUFFERED ); while the Windows version is more explicit, tracing and throwing an error… maybe there is a difference in behavior there?
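To make the contrast concrete, here is a minimal sketch of the two validation styles being compared. The signatures, flag value, and error text are simplified placeholders, not the actual FoundationDB source; only the shape of the check (hard assert on Linux vs. trace-and-throw on Windows) is the point:

#include <cassert>
#include <iostream>
#include <stdexcept>
#include <string>

namespace example {

// Hypothetical flag value, purely for illustration.
constexpr int OPEN_UNBUFFERED = 1 << 1;

// Linux-style check (in the spirit of AsyncFileKAIO::open): a hard assert.
// In a debug build a missing flag aborts immediately; in a release build
// the check may compile out entirely.
void openKAIOStyle(const std::string& filename, int flags) {
    assert(flags & OPEN_UNBUFFERED);
    (void)filename;
    // ... would proceed to open the file with O_DIRECT ...
}

// Windows-style check (in the spirit of AsyncFileWinASIO::open): report and
// throw, so a missing flag surfaces at runtime as a FileOpenError with the
// reason "Must be unbuffered", like the event in the original report.
void openWinASIOStyle(const std::string& filename, int flags) {
    if (!(flags & OPEN_UNBUFFERED)) {
        // In FoundationDB this would be a TraceEvent; a plain exception stands in here.
        throw std::runtime_error("FileOpenError (Must be unbuffered): " + filename);
    }
    // ... would proceed to open the file with FILE_FLAG_NO_BUFFERING ...
}

} // namespace example

int main() {
    // Opening without OPEN_UNBUFFERED: the Windows-style path throws at
    // runtime, while the Linux-style path would only trip the assert in a
    // debug build.
    try {
        example::openWinASIOStyle("storage-xxxx-0.fdq", /*flags=*/0);
    } catch (const std::exception& e) {
        std::cerr << e.what() << "\n";
    }
    return 0;
}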


(Alex Miller) #2

So this is indeed a DiskQueue file from a KeyValueStoreMemory. However, I’m baffled that the code to open this DiskQueue is the exact same code used to open a DiskQueue for the transaction logs, which appears to have succeeded in your case.

So I’m kind of more baffled about where this request is coming from. You wouldn’t happen to be able to offer a symbolicated backtrace, would you?


(Christophe Chevalier) #3

Sorry, no; it was a long time ago, using the officially released binaries (we do not build our own binaries yet). All recent installs were done with 5.2.5, which did not seem to reproduce this particular issue.

I’ve had other problems with the Windows setup and Linux packages creating a new cluster using the memory storage engine: we then have to transition to the ssd-2 engine, or worse, stop the service, delete the data, and restart just so we can join an existing cluster. (We had a recent customer issue where a chain of events starting from this brought down the entire cluster for several hours until we were able to diagnose the root cause.)

As a side note: it would be really cool if the setup did NOT automatically create a new cluster (or at least offered a checkbox to choose whether to do so).


(Alex Miller) #4

We do have a set of basic tests that involve installing our Windows package on … some version of Windows, so I think we’d catch it if creating a memory storage engine was horrifically broken. If you happen to have it occur again, and get any sort of information as to how/why, I’d be happy to take a look. I’ll at minimum flip the Windows behavior to match the Linux behavior, as I’m not aware of any reason they should be different.

Yeah, the dockerization attempts have revealed this to be quite an annoyance. I’d be happy to see it removed, but unfortunately there’s no one around who seems to have a solid enough understanding of our packaging scripts to make them do the right thing easily.