Behavior when all replicas of some shards are missing?

gaurav · October 19, 2019, 4:08am

I wanted to check what the expected behavior of FDB is when all replicas of some shards are missing (because some of the SS nodes are unavailable):

1. NO reads/writes are possible across the entire DB?
2. Only reads are possible from available shards but no writes are possible to any shard?
3. Reads/writes to non-missing shards are possible, but reads/writes to missing shards are not possible?
4. Reads for keys on missing shard will be treated as if those keys were unset and therefore return ‘null’ kv row without any exception?

How would shard splitting and merging work in such a situation? Given that some shards are not available, how would a shard merge/split proceed if it needs to touch part of the key range associated with the missing shards?

Are there any other aspects of FDB that would not function while such a situation continues?

Is there a setting to automatically make the entire database “un-available” (no reads and no writes) until at least one replica is available for each shard?

alexmiller · November 13, 2019, 2:06am

I finally looked at my todo list to ask Evan about this while Evan was still sitting behind me, as both AJ and I were not entirely certain without doing a good amount of testing/code reading.

The answer for reads and writes is 3.5: Reads and writes to non-missing shards will continue as normal. Writes to missing shards will be possible, but will buffer indefinitely on TLogs and could eventually fill up their disk space. Reads from missing shards will hang indefinitely.

We believe that data distribution will probably get stuck on some of the servers, so it’s likely that splitting and merging for servers that shared a team with the now-missing shard will have trouble adapting to new load. Ratekeeper should continue as is, as it only monitors things that are alive.

I’m not aware of any knob or client setting that currently exists that would make FDB act as unavailable until one replica of each shard is present.

gaurav · November 15, 2019, 1:46pm

Thanks Alex!
Could you clarify this a bit more:

Which thread/component will hang here? Is it the SS main event loop thread? Or the fdb_client network thread? Or will the transaction time out with some specific error code?

ajbeamon · November 15, 2019, 4:36pm

The read will block on the client, though you could set a timeout if you want it to terminate after some amount of time. This doesn’t cause the client itself to hang unless you block on the result of the read.
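As a rough illustration of the timeout approach, here is a minimal sketch assuming the official Python bindings. The helper name `read_with_timeout` and the 5000 ms default are made up for this example; `set_timeout` is the per-transaction timeout option, and 1031 is the `transaction_timed_out` error code. To keep the sketch importable without the `fdb` package it checks the exception's `code` attribute; with the real bindings you would normally catch `fdb.FDBError` directly.

```python
# Sketch only: assumes the official `fdb` Python bindings and a reachable cluster.
TRANSACTION_TIMED_OUT = 1031  # FDB error code for transaction_timed_out


def read_with_timeout(db, key, timeout_ms=5000):
    """Read `key`, raising TimeoutError if no result arrives within `timeout_ms`.

    `read_with_timeout` is a hypothetical helper, not part of the bindings.
    """
    tr = db.create_transaction()
    # Without a timeout, a read from a shard with no live replica would wait forever.
    tr.options.set_timeout(timeout_ms)

    future = tr.get(key)  # returns right away; nothing blocks yet
    try:
        # Only this call blocks the calling thread until a value or an error arrives.
        return future.wait()
    except Exception as e:
        # With the real bindings this would be `except fdb.FDBError as e`.
        if getattr(e, "code", None) == TRANSACTION_TIMED_OUT:
            raise TimeoutError(f"read of {key!r} timed out after {timeout_ms} ms") from e
        raise


# Usage against a real cluster (requires the `fdb` package; the API version
# below is an assumption, pick the one matching your installation):
#   import fdb
#   fdb.api_version(620)
#   db = fdb.open()
#   value = read_with_timeout(db, b"some-key", timeout_ms=2000)
```

The split between `tr.get(...)` and `.wait()` mirrors the point above: issuing the read does not block anything, only waiting on its result does.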

gaurav · November 15, 2019, 6:00pm

Thanks. That makes it clear!
