I have an FDB cluster running on Kubernetes. In one DC, say DC1, there are 9 stateless pods (each running one FDB server process), 24 storage pods (each running two FDB server processes) and 10 transaction pods (each running one FDB server process). The full setup spans 3 DCs: DC3 mirrors DC1, and DC2 is a log-store-only DC, following the FDB architecture’s 2-region configuration. In each DC, data replication is configured as “triple”.
For my resiliency testing, I focused only on DC1 as described above. In my test, I killed 1 out of 9 stateless pods, 3 out of 24 storage pods, and 1 out of 10 transaction pods. My way of “killing” a pod is to replace its image with a dummy image, so that all of the active FDB server/FDB monitor processes in the patched pod are killed immediately.
Here is what I observed in the fdbcli status report, one or two minutes after the above pods were killed:
Data:
Replication health - HEALING: Only two replicas remain of some data
Moving data - 11.136 GB
Sum of key-value sizes - 270.241 GB
Disk space used - 15.245 TB
Workload:
Read rate - unknown
Write rate - 1034 Hz
Transactions started - 9271 Hz
Transactions committed - 15 Hz
Conflict rate - 0 Hz
Notice that in the “Workload” section, the read rate has become unknown. In addition, a large amount of data movement is reported. The status.json correctly reported that the number of FDB storage server processes dropped by 6.
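For reference, the drop of 6 storage processes matches the pod layout described above. A minimal sketch of the arithmetic in Python (the variable names are only for illustration):

```python
# Storage layout of DC1 as described in this post.
storage_pods = 24
processes_per_storage_pod = 2
killed_storage_pods = 3

# Each killed storage pod takes both of its FDB server processes down.
total_storage_processes = storage_pods * processes_per_storage_pod
killed_storage_processes = killed_storage_pods * processes_per_storage_pod
fraction_lost = killed_storage_processes / total_storage_processes

print(killed_storage_processes, total_storage_processes, fraction_lost)
```

So 6 of the 48 storage processes in DC1 were gone, which is 1/8 of the storage processes.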
The “read rate unknown” kept being reported until I recovered the pods (by patching them back to the original, correct FDB server image) and the storage pods became active and rejoined the FDB cluster (the IP addresses were the same before and after the image patching).
Correspondingly, status.json did not report any read-related metrics, such as read operations per second and read requests per second, during the incident.
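To make the check concrete, here is a rough sketch of how the read and write rates can be pulled out of a status.json document in Python. The key paths (`cluster.workload.operations` with `reads`, `read_requests` and `writes`, each carrying an `hz` field) are my assumption about the layout and should be verified against the actual output of `status json`. The `workload_rates` helper and the sample document are hypothetical; the sample is trimmed down to mimic the incident, with the write rate taken from the status report above.

```python
import json


def workload_rates(status):
    """Return the per-second rate for each operation type, or None if it is absent."""
    # Assumed layout: cluster -> workload -> operations -> <name> -> hz
    ops = status.get("cluster", {}).get("workload", {}).get("operations", {})
    return {
        name: ops.get(name, {}).get("hz")
        for name in ("reads", "read_requests", "writes")
    }


# Hypothetical sample: write metrics present, read metrics missing.
sample = json.loads(
    '{"cluster": {"workload": {"operations": {"writes": {"hz": 1034}}}}}'
)

rates = workload_rates(sample)
print(rates)
```

A `None` for `reads` or `read_requests` here only means that the metric is missing from the document. It does not by itself show that the read traffic was zero.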
On the other hand, according to the write-related metrics in status.json, the write traffic was normal and was not impacted by killing the pods.
So my question is: what does “read rate unknown” really mean? Does it mean that there is no read traffic during this incident? And if there really is no read traffic going through, how can killing 3 out of 24 storage pods (that is, 1/8 of the storage capacity) have such a big impact on the FDB cluster?