A FoundationDB layer for apiserver as an alternative to etcd

I don’t remember exactly how the watch cache works, but I remember that it was causing a lot of issues.

I left OVHcloud before digging into it.

On another layer built with the Record Layer, I used the setSplitLongRecords option to handle large values. With this option, the Record Layer spreads a value over multiple key/value pairs. It is not yet used in this layer.
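For reference, enabling this is a one-line option on the Record Layer's metadata builder. A minimal sketch, assuming a generated protobuf class (here the hypothetical `MyRecordsProto`) describing the record types:

```java
import com.apple.foundationdb.record.RecordMetaData;
import com.apple.foundationdb.record.RecordMetaDataBuilder;

public class MetaDataConfig {
    // "MyRecordsProto" is a hypothetical protobuf-generated class; substitute
    // the file descriptor of your own record definitions.
    static RecordMetaData buildMetaData() {
        RecordMetaDataBuilder builder = RecordMetaData.newBuilder()
                .setRecords(MyRecordsProto.getDescriptor());
        // Let the Record Layer split records larger than FDB's ~100 kB
        // value limit across multiple key/value pairs.
        builder.setSplitLongRecords(true);
        return builder.getRecordMetaData();
    }
}
```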

I found it easier to implement this using the Record Layer than forking Kubernetes :grinning_face_with_smiling_eyes:
Also, it is a side project I used to discover FDB. That is definitely another road that could be taken, but merging it upstream would be difficult IMO (see DynamoDb Support for API Server · Issue #53162 · kubernetes/kubernetes · GitHub)

You are right: if we had an ETCD that scales well under a “high” number of queries per second, I guess we could remove the cache. From what I saw at my previous company, high QPS for ETCD is around:

  • ~2k ranges/s
  • ~800 txn/s
  • ~1.6k msg/s sent through Watch

Pushing further was (is?) triggering a lot of pain for the SRE team.

These numbers can be seen as really high or really low depending on your background/experience with distributed systems. For me it is both: low for a distributed system, but not that bad for a single shard/region.

You can view ETCD as a single-group Raft, where the whole keyspace must be held by every ETCD member (this is why the ETCD storage limit is pretty low). The keyspace cannot be split into several regions/shards to spread the load. If you overload the Raft group, you are forced to boot another ETCD cluster, split your keys, and operate two clusters. Or 3, or 4, or many, many, many clusters to handle customer growth :laughing:
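The manual key-splitting described above boils down to operators maintaining a static routing table from key ranges to clusters by hand, which is exactly the lifecycle work a database like FDB does itself. A toy sketch (all endpoint names are hypothetical):

```java
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of static key-range routing across several ETCD clusters.
// Operators must edit this mapping by hand whenever load shifts; FDB
// splits, moves, and merges shards on its own.
public class EtcdRouter {
    // Maps the start key of a range to the cluster that owns it.
    private final TreeMap<String, String> rangeToCluster = new TreeMap<>();

    public void addRange(String startKey, String clusterEndpoint) {
        rangeToCluster.put(startKey, clusterEndpoint);
    }

    // Route a key to the cluster owning the range it falls into.
    public String clusterFor(String key) {
        Map.Entry<String, String> e = rangeToCluster.floorEntry(key);
        if (e == null) {
            throw new IllegalArgumentException("no cluster owns key " + key);
        }
        return e.getValue();
    }

    public static void main(String[] args) {
        EtcdRouter router = new EtcdRouter();
        router.addRange("/registry/a", "etcd-cluster-1:2379");
        router.addRange("/registry/n", "etcd-cluster-2:2379");
        System.out.println(router.clusterFor("/registry/configmaps/x")); // etcd-cluster-1:2379
        System.out.println(router.clusterFor("/registry/pods/y"));       // etcd-cluster-2:2379
    }
}
```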

I admit this is not a mainstream issue, but everyone who has either a very large K8S cluster or is trying to reuse ETCD across multiple apiservers is hitting this.

I truly think that the lifecycle of shards/regions should be left to the database itself: operators should not have to manually split/move/merge key ranges. Yet you are forced to do exactly that when you spawn multiple ETCD clusters, for example to spread/balance the keys and the load across your clusters.

This opinion is also backed by years of on-call duty around HBase 1.X, where the region lifecycle is a mess: we always had to run some hbck to fix things :laughing:

I have not yet operated an FDB cluster, but I have great hopes for it in terms of correctness and scalability :rocket:
