A FoundationDB layer for apiserver as an alternative to etcd

I don’t remember exactly how the watch cache works, but I remember that it was causing a lot of issues.

I left OVHcloud before digging into it.

On another layer built with the Record Layer, I used the setSplitLongRecords option to handle large values. With this option, the Record Layer spreads a value over multiple key/value pairs. It is not yet used in this layer.
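For reference, enabling this is a one-line option on the Record Layer's metadata builder. A minimal sketch, assuming a generated protobuf class (here the hypothetical `MyRecordsProto`) describing the record types:

```java
import com.apple.foundationdb.record.RecordMetaData;
import com.apple.foundationdb.record.RecordMetaDataBuilder;

public class MetaDataConfig {
    // "MyRecordsProto" is a hypothetical protobuf-generated class; substitute
    // the file descriptor of your own record definitions.
    static RecordMetaData buildMetaData() {
        RecordMetaDataBuilder builder = RecordMetaData.newBuilder()
                .setRecords(MyRecordsProto.getDescriptor());
        // Let the Record Layer split records larger than FDB's ~100 kB
        // value limit across multiple key/value pairs.
        builder.setSplitLongRecords(true);
        return builder.getRecordMetaData();
    }
}
```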

I found it easier to implement this using the Record Layer than forking Kubernetes :grinning_face_with_smiling_eyes:
Also, it is a side project I used to discover FDB. That is definitely another road that could be taken, but merging it upstream would be difficult IMO (see DynamoDb Support for API Server · Issue #53162 · kubernetes/kubernetes · GitHub)

You are right: if we had an ETCD that scales well under a “high” number of queries per second, I guess we could remove the cache. From what I saw at my previous company, high QPS for ETCD is around:

  • ~2k ranges/s
  • ~800 txn/s
  • ~1.6k msg/s sent through Watch

Pushing further was (is?) triggering a lot of pain for the SRE team.

These numbers can be seen as really high or really low depending on your background/experience with distributed systems. For me it is both: low for a distributed system, but not that bad for a single shard/region.

You can view ETCD as a single-group Raft, where the whole keyspace must be held by every ETCD member (this is why the ETCD storage limit is pretty low). The keyspace cannot be split into several regions/shards to spread the load. If you overload the Raft group, you are forced to boot another ETCD cluster, split your keys, and operate two clusters. Or 3, or 4, or many, many, many clusters to handle customer growth :laughing:
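The manual key-splitting described above boils down to operators maintaining a static routing table from key ranges to clusters by hand, which is exactly the lifecycle work a database like FDB does itself. A toy sketch (all endpoint names are hypothetical):

```java
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of static key-range routing across several ETCD clusters.
// Operators must edit this mapping by hand whenever load shifts; FDB
// splits, moves, and merges shards on its own.
public class EtcdRouter {
    // Maps the start key of a range to the cluster that owns it.
    private final TreeMap<String, String> rangeToCluster = new TreeMap<>();

    public void addRange(String startKey, String clusterEndpoint) {
        rangeToCluster.put(startKey, clusterEndpoint);
    }

    // Route a key to the cluster owning the range it falls into.
    public String clusterFor(String key) {
        Map.Entry<String, String> e = rangeToCluster.floorEntry(key);
        if (e == null) {
            throw new IllegalArgumentException("no cluster owns key " + key);
        }
        return e.getValue();
    }

    public static void main(String[] args) {
        EtcdRouter router = new EtcdRouter();
        router.addRange("/registry/a", "etcd-cluster-1:2379");
        router.addRange("/registry/n", "etcd-cluster-2:2379");
        System.out.println(router.clusterFor("/registry/configmaps/x")); // etcd-cluster-1:2379
        System.out.println(router.clusterFor("/registry/pods/y"));       // etcd-cluster-2:2379
    }
}
```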

I admit this is not a mainstream issue, but everyone who has either a very large K8S cluster or is trying to reuse ETCD across multiple apiservers is hitting this.

I truly think that the lifecycle of shards/regions should be left to the database itself: operators should not have to manually split/move/merge key ranges. Yet you are forced to do exactly that when you spawn multiple ETCD clusters, for example to spread/balance the keys and the load across your clusters.

This opinion is also backed by years of on-call duty around HBase 1.X, where the region lifecycle is a mess: we always had to run some hbck to fix things :laughing:

I have not yet operated an FDB cluster, but I have great hopes for it in terms of correctness and scalability :rocket:
