Killing 10% of Storage Servers Leads to Unknown Read Rate in FDB Status Report

jltz · June 10, 2019, 9:39am

I have an FDB configuration in one DC, say DC1, running on Kubernetes. This DC has 9 stateless pods (each running one FDB server process), 24 storage pods (each running two FDB server processes), and 10 transaction pods (each running one FDB server process). The full setup spans 3 DCs: DC3 mirrors DC1, and DC2 is a log-store-only DC, following the FDB architecture’s 2-region configuration. In each DC, data replication is configured as “triple”.

For my resiliency testing, I focused only on DC1 as described above. In the test, I killed 1 of the 9 stateless pods, 3 of the 24 storage pods, and 1 of the 10 transaction pods. To “kill” a pod, I replace its image with a dummy image, so all active FDB server and FDB monitor processes in the patched pod are killed immediately.

Here is what I observed in the fdbcli status report one or two minutes after the pods were killed:

Data:
  Replication health - HEALING: Only two replicas remain of some data
  Moving data - 11.136 GB
  Sum of key-value sizes - 270.241 GB
  Disk space used - 15.245 TB

Workload:
  Read rate - unknown
  Write rate - 1034 Hz
  Transactions started - 9271 Hz
  Transactions committed - 15 Hz
  Conflict rate - 0 Hz

Notice that in the “Workload” section, the read rate is reported as unknown. A large amount of data movement is also reported. The status.json correctly showed that the number of FDB storage server processes dropped by 6 (3 pods with 2 processes each).
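For anyone who wants to repeat the process-count check, here is a minimal sketch of how the storage processes could be counted from a parsed status.json (for example, the output of `fdbcli --exec "status json"`). It assumes the usual layout where each entry under `cluster.processes` carries a `roles` list whose items have a `role` field; the sample document and the function name are made up for illustration and are heavily trimmed.

```python
def count_storage_processes(status):
    """Count processes in a parsed status.json that hold a storage role.

    Assumes the layout cluster.processes.<id>.roles[].role; missing keys
    are treated as "no processes" rather than raising.
    """
    processes = status.get("cluster", {}).get("processes", {})
    return sum(
        1
        for proc in processes.values()
        if any(r.get("role") == "storage" for r in proc.get("roles", []))
    )


# Hypothetical, heavily trimmed sample document (not real cluster output).
sample_status = {
    "cluster": {
        "processes": {
            "proc-a": {"roles": [{"role": "storage"}]},
            "proc-b": {"roles": [{"role": "storage"}]},
            "proc-c": {"roles": [{"role": "log"}]},
            "proc-d": {"roles": []},
        }
    }
}

storage_count = count_storage_processes(sample_status)
```

Comparing this count before and after the pods are killed should show the drop of 6 described above.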

The “read rate unknown” status kept being reported until I recovered the pods (by patching back the original, correct FDB server image) and the storage pods became active and rejoined the FDB cluster (no IP addresses changed across the image patching).

Correspondingly, status.json did not report any read-related metrics, such as read operations per second or read requests per second, during the incident.
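To make the missing-metric case concrete, here is a small sketch of how a monitoring script might read the rates from a parsed status.json. It assumes the rates live under `cluster.workload.operations.<name>.hz` (for example `reads` and `writes`); the helper name and both sample documents are hypothetical, and the healthy read rate is an invented number, not a measurement from this cluster.

```python
def workload_rate(status, op):
    """Return the reported rate in Hz for a workload operation, or None.

    None means the metric is absent from the status document, which is
    what fdbcli appears to render as "unknown". It does not by itself
    mean the rate is zero.
    """
    ops = status.get("cluster", {}).get("workload", {}).get("operations", {})
    entry = ops.get(op)
    if not isinstance(entry, dict):
        return None
    return entry.get("hz")


# Hypothetical trimmed documents: one healthy, one with reads missing.
healthy = {
    "cluster": {"workload": {"operations": {
        "reads": {"hz": 5000.0},   # invented value for illustration
        "writes": {"hz": 1034.0},
    }}}
}
degraded = {
    "cluster": {"workload": {"operations": {
        "writes": {"hz": 1034.0},
    }}}
}

healthy_reads = workload_rate(healthy, "reads")
degraded_reads = workload_rate(degraded, "reads")
degraded_writes = workload_rate(degraded, "writes")
```

Returning `None` instead of `0` keeps "metric not reported" distinct from "no read traffic", which is exactly the ambiguity in question here.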

On the other hand, the write-related metrics in status.json show that write traffic was normal and not affected by killing the pods.

So my question is: what does “read rate unknown” really mean? Does it mean that there was no read traffic during the incident? If there really was no read traffic going through, how can killing 3 of 24 storage pods (that is, 1/8 of capacity) have such a big impact on the FDB cluster?

ajbeamon · June 10, 2019, 2:54pm

It sounds like you are running into a status reporting bug that should be fixed as of 6.1. Are you running an older version, and if so are you able to try 6.1 to see if it works better?

jltz · June 10, 2019, 5:56pm

We are still running 6.0.15. I checked the 6.1 release notes, https://apple.github.io/foundationdb/release-notes.html, and have not found a fix related to what I reported above. The closest one is:

Status could report an incorrect reason for ongoing data movement.

But I checked the JIRA task, and it is related to the team tracker.

Could you point me to the actual JIRA task related to the “read rate unknown” issue I reported earlier?

ajbeamon · June 10, 2019, 6:59pm

It’s this line:

Read workload status metrics would disappear when a storage server was missing. (PR #1348)