Simulating FDB data center failure

Hi,

As discussed during the FoundationDB summit event, my deployment ran into some issues when I experimented with DC failure. I am going to describe them in detail in this post.

The FDB Java client version I used for this setup is 6.0.15, and the FDB server version is 6.0.18.

My FDB cluster deployment was 3 DCs across 2 regions. Initially, DC1 was the primary DC, DC2 was the satellite DC, and DC3 was the DC in the other region. In each region, I have some service pods that serve requests and generate traffic to the FDB cluster. At the starting point, throughput was stable and comparable between the two regions.

Note that the requests from the clients are read-only (no writes).

Then, I followed these steps to simulate data center failures:

Step 1. First, I configured DC3 to become the primary DC by running the fdbcli command fileconfigure inverse_3dc_pre_usable_region.json, where the content of inverse_3dc_pre_usable_region.json is:

"regions":[{
    "datacenters":[{
        "id":"dc1",
        "priority": -1
    },{
        "id":"dc2",
        "priority": 0,
        "satellite": 1
    }],
    "satellite_redundancy_mode":"one_satellite_double",
    "satellite_logs": 20
    },
    {
    "datacenters":[{
        "id": "dc3",
        "priority": 2
       }]
    }]
}
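
For reference, the file is loaded through fdbcli's fileconfigure command; from a shell it can also be applied in one shot like this (the path below is just an example):

fdbcli --exec "fileconfigure /path/to/inverse_3dc_pre_usable_region.json"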

After this step, the throughput at each region did not change much. This is what I expected.

Step 2. I simulated a DC1 failure by issuing this command to the FDB cluster: configure usable_regions=1

Result: After this command, the service throughput in the region that has DC1 dropped about 90% (from 4k to 400 TPS). The service throughput in the region that has DC3 dropped only slightly (it stayed nearly the same). This is also expected behavior, since my service pods in the DC1 region now direct their requests to DC3.

Step 3. I tried to bring back DC1 by issuing this command: configure usable_regions=2

Result: After this command, I saw a full data movement from DC3 to DC1 (around 2 TB). The throughput of DC1 remained very low (~400 TPS), while the throughput of DC3 dropped because of the data movement in the background (~3k -> 2.9k after 10 minutes). Fundamentally, FDB considers that DC1 has no data and tries to move the data from DC3 to DC1. The migration is undesirable since our workload is read-only, and I expected DC1 to come back without data movement.

Step 4. I waited for the data migration to complete (which took several hours) and hoped that the throughput of DC1 would return to what it was before.

Result: Unfortunately, the throughput of DC1 remained very low, while the throughput of DC3 was around 3k (the DC3 throughput is expected, but the DC1 throughput is not). These numbers are similar to the ones from when the migration started (at the end of Step 3). I did not expect this: shouldn't the service pods at DC1 recognize that the FDB pods at DC1 are back and direct their requests there?

Step 5. I switched DC1 back to being the primary DC.

Result: The throughput of DC1 dropped slightly (whereas I expected it to go up about 10x, back to its original level).
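
Switching the primary back amounts to flipping the datacenter priorities in the region configuration and applying it with fileconfigure again; roughly a file like this (illustrative, with DC1 given the highest priority):

{
    "regions":[{
        "datacenters":[{
            "id":"dc1",
            "priority": 2
        },{
            "id":"dc2",
            "priority": 0,
            "satellite": 1
        }],
        "satellite_redundancy_mode":"one_satellite_double",
        "satellite_logs": 20
    },{
        "datacenters":[{
            "id": "dc3",
            "priority": 0
        }]
    }]
}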

Step 6. I performed a rolling update on my service pods.

Result: The throughput of DC1 increased back to normal (~4k TPS).
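
For context, the rolling update itself was nothing FDB-specific; with a Kubernetes Deployment it can be triggered with something like the following (the deployment name is just a placeholder):

kubectl rollout restart deployment/my-service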

My questions are:

  1. Is configure usable_regions=1 a correct way to simulate a DC failure? Is it equivalent to killing all pods in that DC? Is there another approach to simulating a DC failure?
  2. Why did the FDB Java client not recognize that DC1 was back and direct the traffic there? Was it an FDB client issue, or did I do something wrong? Since the throughput recovered after I did a rolling update on my service pods, I suspect this is an FDB Java client issue.
  3. Would a newer version of FDB and the FDB Java client resolve the issues mentioned above?

Using usable_regions=1 to simulate a DC failure causes FDB to tell all of the storage servers in DC1 to die and throw their data away. This is why a full data movement is required from DC3 to DC1 once you configure back to two usable regions. If you killed all the DC1 fdbserver processes, and left their files in place, and then later restarted the DC1 processes, then there wouldn’t have been a full data movement.
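
For example, if the DC1 processes are run from the standard packages under fdbmonitor, something like this on each DC1 host would simulate the outage while leaving the data files on disk (adjust for your Kubernetes setup, e.g. by scaling the DC1 fdbserver pods down and back up):

sudo service foundationdb stop    # fdbmonitor and its fdbserver children exit; data files stay on disk
# ... let the cluster fail over to dc3, run your measurements ...
sudo service foundationdb start   # dc1 rejoins and catches up without a full re-copy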

Steps 4 and 5 sound weird to me too; that's not the behavior I would expect. If data distribution for DC1 hadn't actually caught up yet, this could make more sense? How were you monitoring for when the background data movement completed? But in any case, I don't have an answer for why doing a rolling update of your service pods would have changed anything. Pinging @ajbeamon for his wisdom.
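
For reference, one way to check is the "Moving data" figure in fdbcli status, or machine-readably via status json (a rough sketch, assuming jq is available):

fdbcli --exec "status" | grep -i "moving data"
fdbcli --exec "status json" | jq '.cluster.data.moving_data'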

It sounds a little bit like the clients have cached the locations of the data in the cluster and failed to update when new options became available. If so, restarting the processes would have reset the cache and resolved the problem. This isn’t supposed to happen, but I’ll also note that this multi-region aspect of the product was much less mature in 6.0. Let me toy around with it a bit and see if I can confirm whether this is what happened.

I did a simple test in 6.2 and got what appears to be the correct behavior. Are you able to try this on a newer version and see if it reproduces for you?

It turns out 6.2.10 has a bug with similar symptoms that can cause clients to not update their storage cache. This is fixed in 6.2.11, so it may be best to wait until that version is available for download to do a test.