Simulating FDB data center failure

Hi,

As discussed during the FoundationDB summit event, my deployment got some issue when I experimented with DC failure. I am going to describe it in detail at this post.

The FDB java client version I used for this setup is 6.0.15, and FDB database version is 6.0.18.

My FDB cluster deployment was 3 DCs, 2 regions. Initially, DC1 is the primary DC. DC2 is satellite DC, and DC3 is the one in a different region. At each region, I have some service pods that serve requests, resulting in traffic to the FDB cluster. At the starting point, throughputs are stable and comparable between the two regions.

Note that the requests from the clients are read-only requests (no-write).

Then, I followed these steps to simulate data center failures:

Step 1. First, I configured DC3 to become the primary DC by using this command fileconfigure inverse_3dc_pre_usable_region.json, where the content of inverse_3dc_pre_usable_region.json is:

"regions":[{
    "datacenters":[{
        "id":"dc1",
        "priority": -1
    },{
        "id":"dc2",
        "priority": 0,
        "satellite": 1
    }],
    "satellite_redundancy_mode":"one_satellite_double",
    "satellite_logs": 20
    },
    {
    "datacenters":[{
        "id": "dc3",
        "priority": 2
       }]
    }]
}

After this step, the throughput at each region did not change much. This is what I expected.

Step 2. I simulated DC1 failure by issuing this command to FDB cluster: configure usable_regions=1

Result: After this command, the service throughput at the region that has DC1 drops about 90% (from 4k to 400 TPS). The service throughput at the region that has DC3 reduces a little bit (nearly the same). This is also an expected behavior since my service pods at DC1 direct the requests to DC3.

Step 3. I tried to bring back DC1 by issuing this command: configure usable_regions=2

Result. After this command, I saw a full data movement from DC3 to DC1 (around ~2 TB). The throughput of DC1 remains very low (400), while the throughput of DC3 drops because of data movement in the background (~3k --> 2.9k after 10 mins). Fundamentally, FDB considers DC1 has no data and try to move the data from DC3 to DC1. The migration is undesirable since our workload is read-only and I expected that DC1 should come back without data movement.

Step 4. I waited until the data migration to complete (which took several hours) and hoped that the throughput of DC1 will be back to as before.

Result: Unfortunately, the throughput of DC1 remained very low, while the throughput of DC3 was around 3k (The throughput of DC3 is expected but not the throughput of DC1). These numbers are similar to the ones when the migration started (at the end of Step 3). I did not expect this because shouldn’t service pods at DC1 recognize that FDB pods at DC1 are back and direct the requests there?

Step 5. I switched to make DC1 the primaryDC.

Result: Throughput of DC1 slightly dropped (while I expected it should go up 10 times to resume what it is originally).

Step 6. I performed a rolling update on my service pods.

Result: The throughput of DC1 increased back to normal (~4k TPS).

My questions are:

  1. Is configure usable_regions=1 a correct way to simulate dc failure? Are they equivalent to killing all pods at that dc? Is there any other approach to simulate dc failure?
  2. Why the FDB java client did not recognize that DC1 was back and directed the traffic there? Was it an FDB client issue or did I do something wrong? Since the throughput can be resumed after I did a rolling update on my service pods, I suspect that this is an FDB java client issue.
  3. Would the newer version of FDB and FDB java client resolve the issue mentioned above?