A FoundationDB layer for apiserver as an alternative to etcd

Hi guys,

I have a question regarding the possibility of replacing a k8s cluster's etcd with FoundationDB by adding an extra layer.

To support the etcd API, this layer needs to provide at least Txn(), Put(), Range(), and Watch().

I know FoundationDB supports watches, but the watch it provides is different from the one available in etcd. We need range watch support.

To implement MVCC in a way that fits the apiserver's needs, we would need to add a KV table holding an ever-increasing number, which can cause a read hotspot in a multi-Raft database. Would this happen on FoundationDB too?

Wilson

1 Like

You might be interested in this thread that @PierreZ started a while ago about the etcd API: Multiple questions about Indexes, functions and watches to implement etcd-layer

Yeah, that’s right. What FDB calls “watch” is a feature that lets you know if a specific key has changed, but it doesn’t tell you what it was changed to. The etcd “watch” API gives you a list of all changes to the specified key range (until the watch is ended). I think, with the right indexes, you could probably use FDB watches to implement etcd watches, but I’m glossing over a lot of implementation details there.
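For reference, this is roughly what an FDB watch looks like from a client. A minimal sketch using the Go bindings (the key name is just an example, and error handling is elided):

// Minimal sketch of an FDB watch: the future fires when the key changes,
// but carries no information about what it was changed to.
package main

import (
    "fmt"

    "github.com/apple/foundationdb/bindings/go/src/fdb"
)

func main() {
    fdb.MustAPIVersion(630)
    db := fdb.MustOpenDefault()

    // Register the watch inside a transaction; it becomes active at commit.
    watch, _ := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        return tr.Watch(fdb.Key("some-key")), nil
    })

    // Block until the key is modified; the layer still has to re-read the key
    // (or an index) to learn the new value.
    watch.(fdb.FutureNil).MustGet()
    fmt.Println("some-key changed; re-read it to see the new value")
}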

Hmm, I’m not sure I completely understand the question. It’s certainly possible to create read hot spots in FDB, and if the idea is to implement the “version” in MVCC by including a single key which is incremented every time a change is made, then yes, that key would quickly become a hotspot. I think most proposals for putting something like etcd on top of FDB would be to leverage FDB’s existing “version” in some way. Each FDB cluster already maintains a monotonically increasing 64-bit version that it uses for MVCC (though note that this MVCC history is not persisted; it is only kept in memory for about 5 seconds, so it can only be queried that far back), and it also exposes operations to include the version in keys and values (which is equivalent to, say, including the “commit timestamp” in a column in an alternative system). Then I think you could build persistent MVCC into the layer’s data model with those tools. But it would require some data modeling work to get right.
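To illustrate the “include the version in keys” part: a minimal sketch, assuming a hypothetical "changelog" subspace, of appending a change record whose key embeds FDB's own commit version via a versionstamped operation (Go bindings):

// Hypothetical layout, for illustration only:
// ("changelog", <commit version>, user_key) -> value
package main

import (
    "log"

    "github.com/apple/foundationdb/bindings/go/src/fdb"
    "github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

func recordChange(db fdb.Database, userKey, userValue []byte) error {
    _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        // Incomplete versionstamp: filled in with the 10-byte commit version
        // (plus a 2-byte user version) when the transaction commits.
        vs := tuple.IncompleteVersionstamp(0)
        key, err := tuple.Tuple{"changelog", vs, string(userKey)}.PackWithVersionstamp(nil)
        if err != nil {
            return nil, err
        }
        tr.SetVersionstampedKey(fdb.Key(key), userValue)
        return nil, nil
    })
    return err
}

func main() {
    fdb.MustAPIVersion(630)
    db := fdb.MustOpenDefault()
    if err := recordChange(db, []byte("foo"), []byte("bar")); err != nil {
        log.Fatal(err)
    }
}

Because keys written this way sort by commit version, a range read over the subspace returns the change history in order, which is roughly what the etcd revision gives you.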

2 Likes

Thanks for sharing my little side project, @alloc :smile: You can find the etcd-layer here, and I will add some additional information on top of what @alloc already covered :smiley:

Under the hood, ETCD stores things in bbolt (a KV store), and the key is the revision, not the user key as we would expect. There is also another ETCD shim called Kine which does the same thing. Given what ETCD and other ETCD shims are doing, I'm rather happy with my current implementation, which keeps every operation (until compaction) and turns Watch into a scheduled query that, every 100ms, uses an index on the read_version to fetch missed events. You can find the code here (not merged yet).

It depends on the design of the layer. My initial goal was to use the versionstamp as the revision, to achieve better concurrency on Put/Txn. It turns out the library I'm using to build the etcd-layer does not support versionstamps as primary keys yet :tongue: As a workaround, I'm extracting the read version during the transaction and using it to set fields like create_revision or mod_revision (again, far from optimal).
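To make that workaround concrete, a rough sketch of the idea using the Go bindings (the "kv" prefix and the value encoding are made up for illustration; this is not the layer's actual code):

// Stamp the transaction's read version into the record as mod_revision.
// Coarser than a commit versionstamp, hence "far from optimal".
package main

import (
    "github.com/apple/foundationdb/bindings/go/src/fdb"
    "github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

func putWithReadVersionRevision(db fdb.Database, key, value []byte) (int64, error) {
    rev, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        readVersion, err := tr.GetReadVersion().Get()
        if err != nil {
            return nil, err
        }
        tr.Set(fdb.Key(tuple.Tuple{"kv", string(key)}.Pack()),
            tuple.Tuple{readVersion, string(value)}.Pack())
        return readVersion, nil
    })
    if err != nil {
        return 0, err
    }
    return rev.(int64), nil
}

func main() {
    fdb.MustAPIVersion(630)
    db := fdb.MustOpenDefault()
    _, _ = putWithReadVersionRevision(db, []byte("foo"), []byte("bar"))
}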

This layer was my first application using FDB, so there's a ton of improvements to be made, which are not yet tracked as GitHub issues; I will take the time to do so soon. I'm rather focused on implementing the Directory layer in the Rust bindings for now, but I will come back to the layer to improve it soon :+1:

A few more Kubernetes-related notes about my layer:

  • I would love to have some automated tests that spawn a fake apiserver to exercise the layer :heart: Do you have an idea of how I could achieve this?
  • The layer is multi-tenant by design, relying on user/password, but this is not finished yet, and I'm not sure whether the apiserver implements password-based auth. I remember only seeing cert-based auth.

Also, feel free to ask any questions you may have :smile:

1 Like

Thank you guys for the informative replies. @alloc @PierreZ

Yeah, I have read Kine's code. Kine builds another MVCC layer on top of the backend DB; that's why I am asking about the monotonically increasing ID. Using such an ID, we can use two tables to achieve something similar to what etcd does: one table is a key → ID mapping, and the other holds the ID → value mapping. Of course, to better support Watch and to improve performance, the final design may differ from this.
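A rough sketch of what those two "tables" could look like as FDB subspaces, using the Go bindings (the "idx"/"log" names and the encoding are illustrative only, not a finished design):

// Hypothetical layout: ("idx", user_key) -> latest revision, and
// ("log", revision, user_key) -> value, which is what a Watch would scan.
package main

import (
    "github.com/apple/foundationdb/bindings/go/src/fdb"
    "github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

func put(db fdb.Database, key, value []byte, revision int64) error {
    _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        // "Table" 1: latest revision for each user key.
        tr.Set(fdb.Key(tuple.Tuple{"idx", string(key)}.Pack()),
            tuple.Tuple{revision}.Pack())
        // "Table" 2: full history keyed by revision.
        tr.Set(fdb.Key(tuple.Tuple{"log", revision, string(key)}.Pack()), value)
        return nil, nil
    })
    return err
}

func main() {
    fdb.MustAPIVersion(630)
    db := fdb.MustOpenDefault()
    _ = put(db, []byte("foo"), []byte("bar"), 42)
}

Here the revision is passed in; in practice it would come from FDB itself (a versionstamp or the read version, as discussed above) rather than from a counter key.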

Making FDB a usable etcd replacement in a k8s cluster is not an easy task. There are many issues we need to resolve:

  1. Performance needs to be better than etcd.
  2. Horizontally scalable, while Watch events must not be lost or reordered.
  3. No single point of failure.

Performance requires scalability, and scalability requires distributed processing. However, distributed processing makes it hard to guarantee ordering and no lost events. That's why I feel it is really hard to achieve all of these goals.

Kine could not resolve all of the above issues, which makes it not a feasible alternative in our enterprise environment. The good thing is that FDB supports Watch; but, as discussed above, the Watch support in FDB is also quite limited.

Disclaimer: I worked a lot around ETCD in my previous job at OVHcloud (I left two weeks ago :tongue:), especially around this issue. OVHcloud is using one ETCD for hundreds of apiservers and it was a pain for the SRE team (more details here). I wrote the layer during the first French lockdown, just before joining the K8S team, to learn both FDB and ETCD at the same time. A few months after joining the team, we decided to test some ETCD shims. Our goal was to avoid stacking ETCD clusters like they do now and to have something designed to scale nicely as we add new customers. I talked with Darren Shepherd a lot about Kine and I was hyped about it. They replaced their ETCDs in Rancher with Kine and are really happy about it. But the tradeoff is that they are using Amazon-managed SQL products.

You are right! The monotonic ID is interesting, but it only works when running a non-distributed datastore. We tried something like CockroachDB below Kine to distribute the tables, but we ran into too many constraints on both the SQL layer and the SEQUENCE to get good performance.

Well, to be fair, we had better performance than ETCD, but it did not scale enough, as we were seeing a lot of transaction restarts with the ReadWithinUncertaintyInterval error.

I also tried forking Kine to handle revisions internally in the ETCD shim instead of relying on the database to generate them, but I left OVHcloud before adding tests to it.

After writing and playing with several ETCD shims, I feel like my first approach was the right one: FoundationDB's interface is insanely good at helping you carefully design a data service, thanks to features like the byte-ordered key-value space, transactions, Versionstamps, Tuples/Subspaces/Directories and so on :sun_with_face:

I'm not scared of 1 and 3, given how badly ETCD behaves under “high” QPS :smiley:
2 is completely tied to the design of the layer.

I feel like it is quite easy to design the layer with that in mind. I prefer to pay a bit of additional cost during GetRange to gain a perfectly handled Watch.

FDB's watches are only the tip of the iceberg when implementing ETCD's watches. At the beginning of the layer I was like you: “I should use the Watches directly!!” Then I realized that they are two different beasts:

  • ETCD's watch is a stream of mutations,
  • FDB's watch is a notification of a key change.

The first item means that ETCD needs to keep an exact history of its keyspace, like this:

// ...
revision 432: added key foo
revision 433: deleted bar key
// ...

so that a Watch is simply a scheduled query retrieving any new revisions. ETCD, Kine and my layer are designed like this, with the revision as the key. This is the only way to be sure that your Watches see every mutation.
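As a rough illustration of that scheduled query, assuming the revision-keyed ("log", revision, user_key) layout from the earlier sketch (Go bindings, names illustrative only, compaction and per-range filtering left out):

// Every 100ms, range-read the revision-keyed history for entries newer than
// the last revision this watcher has already streamed.
package main

import (
    "fmt"
    "log"
    "time"

    "github.com/apple/foundationdb/bindings/go/src/fdb"
    "github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

func pollWatch(db fdb.Database, lastRevision int64, emit func(fdb.KeyValue)) error {
    for {
        time.Sleep(100 * time.Millisecond)
        events, err := db.ReadTransact(func(rtr fdb.ReadTransaction) (interface{}, error) {
            r, err := fdb.PrefixRange(tuple.Tuple{"log"}.Pack())
            if err != nil {
                return nil, err
            }
            // Everything strictly newer than the last revision we streamed.
            r.Begin = fdb.Key(tuple.Tuple{"log", lastRevision + 1}.Pack())
            return rtr.GetRange(r, fdb.RangeOptions{}).GetSliceWithError()
        })
        if err != nil {
            return err
        }
        for _, kv := range events.([]fdb.KeyValue) {
            emit(kv)
            t, err := tuple.Unpack(kv.Key)
            if err != nil {
                return err
            }
            lastRevision = t[1].(int64) // ("log", revision, user_key)
        }
    }
}

func main() {
    fdb.MustAPIVersion(630)
    db := fdb.MustOpenDefault()
    log.Fatal(pollWatch(db, 0, func(kv fdb.KeyValue) { fmt.Println("event:", kv.Key) }))
}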

Well, revision-as-key is not the only way: you could use something like Kafka or Pulsar as the storage layer of an ETCD shim, but that shifts the design completely, as the Watches become simple while the kv interface gives you extra work.

I also feel like FDB's watches have an impact on production clusters, but that is just a feeling; no production experience here :laughing:

1 Like

Nice comment! Let me think about this and post more details later. I finally found the right place and the right person to discuss these headaches now. :joy:

1 Like

Take your time, and feel free to ask any questions :smiley:

Hi Pierre,

Due to incoming work from other, more urgent tasks, we are going to put this aside for a few weeks. I think this will be a long-term task for us, and a lot of evaluation is needed before making a decision.

Thank you for the very helpful comments, and let's discuss this later, after I have collected more information.

It may be worth a try to see how Kubernetes behaves when skipping all intermediate versions at watch triggers. In my limited understanding of Kubernetes, these stale states are only necessary to update all the layers of the watch cache incrementally (Delta FIFO)?

The bad thing is that, I think, on large clusters some k/v pairs can get huge and exceed etcd's max value size, so there may be tricks that decrease the amount of streamed data by relying on etcd's intermediate values, but I cannot tell.

Anyhow, it may be a better idea to fork the Kubernetes store and client-go to support tweaked watch semantics, rather than trying to reproduce etcd's watch behavior exactly on top of FDB?

In my view, the point of overcoming etcd's scalability issues is to be able to get rid of all the etcd cache layers on the Kubernetes side?

I don’t remember exactly how the watch cache works, but I remember that it was causing a lot of issues.

I left OVHcloud before digging into it.

On another layer using the Record Layer, I used the setSplitLongRecords option to handle large values. With this option, the Record Layer will spread the value over multiple k/v pairs. It is not yet used in this layer.
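For illustration only (and in Go rather than the Java Record Layer), the underlying idea is to chunk a large value across several adjacent keys so each chunk stays under FDB's ~100 KB value limit; the "kv" prefix and chunk size here are arbitrary:

// ("kv", user_key, chunk_index) -> chunk; a range read over the prefix
// reassembles the full value in order.
package main

import (
    "github.com/apple/foundationdb/bindings/go/src/fdb"
    "github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

const chunkSize = 90000 // stay below the 100,000-byte value limit

func putSplit(db fdb.Database, key string, value []byte) error {
    _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        for i, off := 0, 0; off < len(value); i, off = i+1, off+chunkSize {
            end := off + chunkSize
            if end > len(value) {
                end = len(value)
            }
            tr.Set(fdb.Key(tuple.Tuple{"kv", key, int64(i)}.Pack()), value[off:end])
        }
        return nil, nil
    })
    return err
}

func main() {
    fdb.MustAPIVersion(630)
    db := fdb.MustOpenDefault()
    _ = putSplit(db, "big", make([]byte, 300000))
}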

I found it easier to implement this using the Record Layer than forking Kubernetes :grinning_face_with_smiling_eyes:
Also, this is a side project I used to discover FDB. Forking Kubernetes is definitely another road that could be taken, but merging it upstream will be difficult IMO (see DynamoDb Support for API Server · Issue #53162 · kubernetes/kubernetes · GitHub).

You are right: if we had an ETCD that scaled well under “high” queries per second, we could remove the cache, I guess. From what I saw at my previous company, high QPS for ETCD is around:

  • ~2k ranges/s
  • ~800 txn/s
  • ~1.6k msg/s sent through Watch

Pushing further was (is?) triggering a lot of pain for the SRE team.

These numbers can be seen as really high or really low, depending on your background/experience with distributed systems. For me it is both: it is low for a distributed system, but it is not that bad for a single shard/region.

You can view ETCD as a single-group Raft, where the whole keyspace must be held by every ETCD member (this is why ETCD's storage limit is pretty low). The keyspace cannot be split into several regions/shards to spread the load. If you are overloading the Raft group, you are forced to boot another ETCD cluster, split your keys, and operate two clusters. Or 3, or 4, or many, many, many clusters to handle customer growth :laughing:

I admit this is not a mainstream issue, but everyone who either has a very large K8S cluster or is trying to reuse ETCD across multiple apiservers is hitting it.

I truly think that the lifecycle of shards/regions should be left to the database itself: operators should not have to manually split/move/merge key ranges. You are forced to do this if you are spawning multiple ETCD clusters, for example to spread/balance the keys and the load across them.

This opinion is also backed by years of on-call duty around HBase 1.X, where the region lifecycle is a mess: we always had to run some hbck to fix things :laughing:

I have not yet operated an FDB cluster, but I have great hopes for it, in terms of correctness and scalability :rocket:

1 Like