We (Wavefront) were debugging a data-loss situation that we mostly blame on application code (we don’t have the smoking gun yet), but the issue largely manifested as a large swath of keys being cleared in FDB (via a txn.clear() from Java code). We ruled out a key-by-key clear(), since operating space recovered almost instantly, so it must have been a single erroneous clear that spanned a ton of rows.
This led us to ask:
a) Should we have SSL on all the time? There is a CRC in the RPC logic, AFAICT, so bit flips on the network shouldn’t be a culprit unless it’s a super rare occurrence. For those who wonder how often bit flips happen: since we operate a plaintext protocol for metrics at Wavefront, we see them all the time in public and private clouds. Essentially, we think that if a bit flip occurs very early in the “to” key of the clear, it could accidentally clear a ton of things, e.g.
clear \x02c... \x02c... becomes
clear \x02c... \x82c... if, say, the high bit of the first byte of the “to” key flips.
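(To make the failure mode concrete — the exact keys above are elided, so the bytes below are hypothetical: FDB compares keys as unsigned byte strings, so a single flipped high bit in the first byte of the end key pushes it past the entire \x02 subspace.)

```java
import com.apple.foundationdb.tuple.ByteArrayUtil;

public class BitFlipDemo {
    public static void main(String[] args) {
        byte[] intendedEnd = { 0x02, 'c' };        // what the app meant
        byte[] flippedEnd  = { (byte) 0x82, 'c' }; // 0x02 with its high bit flipped
        // Unsigned comparison: the corrupted end key sorts after every key
        // that starts with \x02, so clear(begin, flippedEnd) would wipe out
        // the whole \x02 subspace and then some.
        System.out.println(ByteArrayUtil.compareUnsigned(intendedEnd, flippedEnd) < 0); // true
    }
}
```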
b) Should FDB have a mode where clears can only happen to ranges that have been read, rejecting the transaction if this condition doesn’t hold (either at the client level or at the proxy)? Yes, there are times when you’re at the CLI and you need to clear, say, the entire \x02c\x00 ... \x02c\x01 range without reading it, but for most application code, a mode that requires read conflict ranges to be present before a clear can be issued for that range (or a subset of it) would protect against 1) application bugs and 2) any other unknown transport/library bugs. Essentially, most application code is going to read a region of keys, pick a sub-region to clear (or the entire read range), and commit. If this were a transaction option, the odd case of having to blind-clear another range could still be accommodated. If this is something that makes sense, it’s something we might look into; a rough client-side sketch follows.
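For concreteness, here’s a minimal sketch of the client-side variant, assuming a thin wrapper around the Java bindings’ Transaction. The GuardedTransaction name and the coverage check are ours, not an FDB API:

```java
import com.apple.foundationdb.KeyValue;
import com.apple.foundationdb.Transaction;
import com.apple.foundationdb.async.AsyncIterable;
import com.apple.foundationdb.tuple.ByteArrayUtil;

import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper: clears are only allowed over ranges that this
// transaction has already read through the wrapper.
final class GuardedTransaction {
    private final Transaction tr;
    private final List<byte[][]> readRanges = new ArrayList<>();

    GuardedTransaction(Transaction tr) { this.tr = tr; }

    // Record the range alongside the actual read.
    AsyncIterable<KeyValue> getRange(byte[] begin, byte[] end) {
        readRanges.add(new byte[][] { begin, end });
        return tr.getRange(begin, end);
    }

    // Reject any clear that is not a subset of a previously read range.
    void clearRange(byte[] begin, byte[] end) {
        for (byte[][] r : readRanges) {
            if (ByteArrayUtil.compareUnsigned(r[0], begin) <= 0
                    && ByteArrayUtil.compareUnsigned(end, r[1]) <= 0) {
                tr.clear(begin, end);
                return;
            }
        }
        throw new IllegalStateException(
            "clear(" + ByteArrayUtil.printable(begin) + ", "
                     + ByteArrayUtil.printable(end) + ") not covered by a prior read");
    }
}
```

A proxy-side version of the same check (against the transaction’s read conflict ranges) would also catch corruption between client and cluster, which a client-side wrapper can’t.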
c) Should FDB have a way to log large clears, or even block them? If there were a way to say that clearing more than 100k or even 1M rows is not a valid operation, or to prevent a clear from affecting more than 2 shards, that might limit the damage. Obviously the transaction layer doesn’t know what the actual keyspace looks like, so at best I think you can query the boundary keys for a clear, see if it’s hitting more than X shards, and perhaps log a message. It still requires keeping these logs around for a long time to find the occurrence forensically, though.
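Along those lines, a rough pre-flight check seems possible today with LocalityUtil.getBoundaryKeys from the Java bindings; the shard count is only an estimate, and the ClearBudget name and threshold here are hypothetical:

```java
import com.apple.foundationdb.Database;
import com.apple.foundationdb.LocalityUtil;
import com.apple.foundationdb.async.CloseableAsyncIterator;

// Hypothetical guard: estimate how many shards a clear would span and
// refuse it (or just log) when it exceeds a budget.
final class ClearBudget {
    // Counts shard-start keys inside [begin, end); the range spans
    // roughly that many shards (one more if begin isn't a shard start).
    static int shardsSpanned(Database db, byte[] begin, byte[] end) {
        CloseableAsyncIterator<byte[]> it = LocalityUtil.getBoundaryKeys(db, begin, end);
        try {
            int boundaries = 0;
            while (it.hasNext()) {
                it.next();
                boundaries++;
            }
            return boundaries + 1;
        } finally {
            it.close();
        }
    }

    static void checkedClear(Database db, byte[] begin, byte[] end, int maxShards) {
        int shards = shardsSpanned(db, begin, end);
        if (shards > maxShards) {
            // In production this could log loudly instead of throwing.
            throw new IllegalArgumentException(
                "clear would span ~" + shards + " shards (budget: " + maxShards + ")");
        }
        db.run(tr -> {
            tr.clear(begin, end);
            return null;
        });
    }
}
```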
Obviously backups and application-level redundancy saved us in this case, but having some protection against this would make us sleep better at night.