Preventing Large Range Clears

We (Wavefront) were debugging a data-loss incident that we mostly attribute to application code (we don't have the smoking gun yet). The issue manifested as a large swath of keys being cleared in FDB (via a txn.clear() from Java code). We ruled out a key-by-key clear(), since operating space recovered almost instantly, so it must have been a single erroneous clear that spanned a huge number of rows.

This led us to wonder:

a) Should we have SSL on all the time? There is a CRC in the RPC logic, AFAICT, so bit flips on the network shouldn't be the culprit unless it's a super rare occurrence. For those who wonder how often bit flips happen: since we operate a plaintext protocol for metrics at Wavefront, we see them all the time in both public and private clouds. Essentially, we think that if a bit flip occurs very early in the "to" key of the clear, it could accidentally clear a ton of things, e.g. clear(\x02c..., \x02c...) becomes clear(\x02c..., \x02d...).
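To illustrate the failure mode: FDB compares keys as unsigned bytes lexicographically, so one flipped bit early in the "to" key silently widens the cleared range. A standalone sketch (plain Java, not FDB code; the key values are made up):

```java
import java.util.Arrays;

public class BitFlipDemo {
    // True iff key falls in [begin, end) under lexicographic unsigned order,
    // which is how FDB range clears select keys.
    public static boolean inRange(byte[] begin, byte[] end, byte[] key) {
        return Arrays.compareUnsigned(begin, key) <= 0
            && Arrays.compareUnsigned(key, end) < 0;
    }

    public static void main(String[] args) {
        byte[] begin = {0x02, 'c', 0x00};
        byte[] end   = {0x02, 'c', 0x01};   // intended: a tiny range
        byte[] flipped = end.clone();
        flipped[1] ^= 0x04;                 // one bit flip: 'c' (0x63) -> 'g' (0x67)

        byte[] victim = {0x02, 'f', 0x42};  // far outside the intended range
        System.out.println(inRange(begin, end, victim));     // false: safe
        System.out.println(inRange(begin, flipped, victim)); // true: wiped out
    }
}
```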

b) Should FDB have a mode where clears can only happen to ranges that have been read, rejecting the transaction if this condition doesn't hold (either at the client level or at the proxy)? Yes, there are times when you're at the CLI and need to clear, say, the entire \x02c\x00 ... \x02c\x01 range without reading it, but for most application code, a mode that requires read conflict ranges to be present before a clear can occur for that range (or a subset of it) would protect against 1) application bugs and 2) any other unknown transport/library bugs. Essentially, most application code is going to read a region of keys, pick a sub-region to clear (or the entire read range), and commit. If this were a transaction option, the odd case of having to blind-clear another range could still be accommodated. If this is something that makes sense, it's something we might look into.
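A minimal client-side sketch of what such a guard could look like — a per-transaction tracker that only permits a clear fully contained in a previously read range. All names here are hypothetical; this is not an existing FDB API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical per-transaction guard: a range clear is only allowed if the
// transaction previously read a range that fully contains it.
public class ReadGuard {
    private final List<byte[][]> readRanges = new ArrayList<>();

    // Record a range read [begin, end) performed earlier in the transaction
    // (a real wrapper would call this from its getRange equivalent).
    public void recordRead(byte[] begin, byte[] end) {
        readRanges.add(new byte[][] { begin, end });
    }

    // True iff [begin, end) is a subrange of some previously read range.
    public boolean clearAllowed(byte[] begin, byte[] end) {
        for (byte[][] r : readRanges) {
            if (Arrays.compareUnsigned(r[0], begin) <= 0
                    && Arrays.compareUnsigned(end, r[1]) <= 0) {
                return true;
            }
        }
        return false;
    }
}
```

A real wrapper would consult `clearAllowed` before delegating to `Transaction.clear(begin, end)` and abort otherwise; enforcing the equivalent check at the proxy against the transaction's read conflict ranges would additionally cover corruption between client and server.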

c) Should FDB have a way to log large clears, or even block them? If there were a way to say that clearing more than 100k or even 1M rows is not a valid operation, or to prevent a clear from affecting more than 2 shards, that might limit the damage. Obviously the transaction layer doesn't know what the actual keyspace looks like, so at best, I think, you can query the shard boundaries for a clear, see whether it's hitting more than X shards, and perhaps log a message. That still requires keeping these logs around for a long time to do forensics on an occurrence, though.

Obviously backups and application-level backups/redundancy saved us in this case but having some protection against this would make us sleep better at night. :slight_smile:

A crc32c was added to network messages in… 5.2 IIRC. It’s disabled if you use TLS, as TLS will offer the same check. You should be pretty protected against random bit flips either way.

(b) and (c) I have no strong feelings about, because I don't think we've run into this problem. (b) sounds like requesting that blind range clears be removed from the API, with range clears replaced by a range read and point deletes, which I think actually makes life easier for FDB storage engines and concurrency control, but not for users. (c) is a thing that I'd probably suggest implementing in the client rather than in FDB, but maybe that wouldn't give you the level of safety you're looking for.

You could do this by wrapping the range clear API on your client with a check that consulted the locality API first to see if the storage servers for the begin and end of the range were different, correct? This would have “false positive” rejections if a single logical record spanned multiple keys as those could be on different storage servers, but I think that would be the case regardless of where this logic were implemented.
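The shard-counting part of such a wrapper can be sketched as pure logic. In a real client the boundary keys would come from the locality API (`LocalityUtil.getBoundaryKeys` in the Java bindings); here they are passed in as a plain list so the check is standalone, and the shard threshold is an assumed parameter:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: reject a clear that would straddle too many shards. Boundary
// keys would come from the locality API in a real client; each boundary
// strictly inside [begin, end) means the range crosses into another shard.
public class ShardGuard {
    public static boolean allowClear(byte[] begin, byte[] end,
                                     List<byte[]> shardBoundaries, int maxShards) {
        int crossings = 0;
        for (byte[] b : shardBoundaries) {
            if (Arrays.compareUnsigned(begin, b) < 0
                    && Arrays.compareUnsigned(b, end) < 0) {
                crossings++;  // the range straddles this shard boundary
            }
        }
        return crossings + 1 <= maxShards;  // shards touched ~= crossings + 1
    }
}
```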

For (b), it would be a transaction option that's off by default (although we have been talking about a cluster-level option in the client, so it could be something enforced everywhere), but it would be asserted at the proxy so that we are absolutely sure the network or the library couldn't have tampered with the range clear. Since it's not usually possible to read millions of keys in a single transaction, unless you change the read conflict ranges manually, it would protect against buggy code. Then again, we actually manage read conflict ranges ourselves in a number of scenarios, so that wouldn't have helped =p. But if you tamper with read conflict ranges, you're probably on your own.

(c) in the client should be easy to do, as you said, since shard boundaries can be heavily cached (and a large range clear by definition takes time to build, so sudden boundary changes should be safe to ignore). Not fool-proof, though.

Alternatively, if there were a way to specify the expected number of rows to be cleared, the client API could assert that: check the local cache to see if we have that row range and verify the row count matches. If the cache doesn't have the range, it can do a read (which would also take a read conflict range) and assert against that.
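A sketch of that "clear with expected row count" idea, with the local cache modeled as an ordered map; the class and method names are hypothetical, and a real wrapper would delegate to the transaction's clear on success:

```java
import java.util.Arrays;
import java.util.TreeMap;

// Hypothetical guard: count the rows we believe are in the range (from a
// local cache, or from a range read that also takes a read conflict range)
// and refuse the clear on a mismatch.
public class ExpectedCountGuard {
    private final TreeMap<byte[], byte[]> cache =
            new TreeMap<>(Arrays::compareUnsigned);

    public void put(byte[] key, byte[] value) { cache.put(key, value); }

    // Returns false (i.e. would abort the transaction) if the range holds
    // a different number of rows than the caller expected.
    public boolean clearRangeExpecting(byte[] begin, byte[] end, int expected) {
        var rows = cache.subMap(begin, true, end, false);
        if (rows.size() != expected) return false;
        rows.clear();  // in a real wrapper: txn.clear(begin, end)
        return true;
    }
}
```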

I think an extra parameter to the range clear that must be a strict superset of the range would be useful. For example, say you have a compound key like k1/k2/k3/[timestamp] and you want to clear a timestamp range within k1/k2/k3; you could do that by specifying it like:

clearRange(k1/k2/k3/t1, k1/k2/k3/t2, k1/k2/k3)

At least that way, if one of your start/end keys is wrong, it would fail because it wouldn't fall under k1/k2/k3.
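A minimal sketch of that guarded clear, modeling the guard as a key prefix as in the example above. This is a hypothetical API, not part of the FDB bindings:

```java
import java.util.Arrays;

// Sketch of the proposed guarded clear: both endpoints must start with
// the guard prefix (e.g. k1/k2/k3), otherwise the clear is rejected.
public class GuardedClear {
    static boolean startsWith(byte[] key, byte[] prefix) {
        return key.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(key, prefix.length), prefix);
    }

    public static boolean clearRange(byte[] begin, byte[] end, byte[] guardPrefix) {
        if (!startsWith(begin, guardPrefix) || !startsWith(end, guardPrefix)) {
            return false;  // a corrupted endpoint fell outside the guard: reject
        }
        // ... a real wrapper would delegate to txn.clear(begin, end) here ...
        return true;
    }
}
```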

That's what we are adding to our code right now, but agreed that sending that guard to the server would mean a lesser chance of bit flips (or whatever) causing an issue. It could even go all the way to the storage servers, but then the question is how to handle it there, I guess; perhaps it could only log a message.