I have a question regarding the possibility of replacing a k8s cluster's etcd with FoundationDB by adding an extra layer.
To support etcd, this layer needs to provide at least Txn(), Put(), Range(), and Watch().
I know FoundationDB supports watches, but the watch it provides is different from the one available in etcd. We need range watch support.
To implement MVCC support that fits the apiserver's needs, we need to add a KV table with an ever-increasing number, which might cause read hotspots in a multi-raft database. Will this happen on FoundationDB too?
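To make the concern concrete, the naive pattern I am worried about looks roughly like this; this is only a sketch in Go against the FDB bindings, with made-up key names, not something we have built:

```go
package sketch

import (
	"encoding/binary"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
)

// putWithGlobalCounter derives the MVCC revision from one global counter key.
// Every writer reads and rewrites the same "revision" key, so reads concentrate
// on a single shard and writers conflict with each other.
func putWithGlobalCounter(db fdb.Database, key, value []byte) (uint64, error) {
	rev, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		raw := tr.Get(fdb.Key("revision")).MustGet() // the hotspot read
		var next uint64
		if len(raw) == 8 {
			next = binary.BigEndian.Uint64(raw)
		}
		next++
		buf := make([]byte, 8)
		binary.BigEndian.PutUint64(buf, next)
		tr.Set(fdb.Key("revision"), buf) // and the conflicting write
		tr.Set(fdb.Key(key), value)
		return next, nil
	})
	if err != nil {
		return 0, err
	}
	return rev.(uint64), nil
}
```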
Yeah, that’s right. What FDB calls “watch” is a feature that lets you know if a specific key has changed, but it doesn’t tell you what it was changed to. The etcd “watch” API gives you a list of all changes to the specified key range (until the watch is ended). I think, with the right indexes, you could probably use FDB watches to implement etcd watches, but I’m glossing over a lot of implementation details there.
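For context, using an FDB watch looks roughly like this in the Go bindings (a sketch; note that the future carries no value, so you have to read the key again, and intermediate versions can be missed):

```go
package sketch

import "github.com/apple/foundationdb/bindings/go/src/fdb"

// watchOnce sets a watch on key, blocks until it fires, then re-reads the key.
// The watch only signals "this key changed"; it does not say what it changed to.
func watchOnce(db fdb.Database, key fdb.Key) ([]byte, error) {
	fut, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		return tr.Watch(key), nil // the watch becomes active when the transaction commits
	})
	if err != nil {
		return nil, err
	}
	if err := fut.(fdb.FutureNil).Get(); err != nil { // blocks until the key changes
		return nil, err
	}
	val, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		return tr.Get(key).MustGet(), nil // separate read to learn the new value
	})
	if err != nil {
		return nil, err
	}
	return val.([]byte), nil
}
```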
Hmm, I’m not sure I completely understand the question. It’s certainly possible to create read hotspots in FDB, and if the idea is to implement the “version” in MVCC by including a single key which is incremented every time a change is made, then yes, that key would quickly become a hotspot. I think most proposals for putting something like etcd on top of FDB would leverage FDB’s existing “version” in some way. Each FDB cluster already maintains a monotonically increasing 64-bit version that it uses for MVCC (though note that this MVCC history is not persisted, so it can only be queried for the past 5 seconds, i.e., while the history is still available in memory), and it also exposes operations to include the version in keys and values (which is equivalent to, say, including the “commit timestamp” in a column in an alternative system). With those tools, I think you could build persistent MVCC into the layer’s data model, but it would require some data modeling work to get right.
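As a rough sketch of what I mean by leveraging the existing version, here is one way to do it with the Go bindings and the tuple layer; the subspace layout is purely illustrative:

```go
package sketch

import (
	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

// putVersioned writes value under ("hist", key, <versionstamp>). The 10-byte
// versionstamp placeholder is filled in by FDB with the transaction's commit
// version, so revisions are monotonically increasing without a hot counter key.
func putVersioned(db fdb.Database, key string, value []byte) error {
	_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		k, err := tuple.Tuple{"hist", key, tuple.IncompleteVersionstamp(0)}.PackWithVersionstamp(nil)
		if err != nil {
			return nil, err
		}
		tr.SetVersionstampedKey(fdb.Key(k), value)
		return nil, nil
	})
	return err
}
```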
Thanks for sharing my little side project, @alloc. You can find the etcd-layer here, and I will add some additional information besides everything @alloc already commented on.
It depends on the design of the layer. My initial goal was to use a versionstamp as the revision to achieve better concurrency on Put/Txn. It turns out the library I'm using to build the etcd-layer does not yet support versionstamps as primary keys. As a workaround, I'm extracting the read version during the transaction and using it to set fields like create_revision or mod_revision (again, far from optimal).
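In pseudo-code, that workaround looks something like this (a sketch with the Go bindings, not the actual layer code; the field names are just etcd's):

```go
package sketch

import "github.com/apple/foundationdb/bindings/go/src/fdb"

// putWithReadVersionRevision uses the transaction's read version as the revision
// stored next to the value. As noted above, this is far from optimal: unlike a
// versionstamp, the read version is obtained before commit, so two concurrent
// writers can end up with the same revision.
func putWithReadVersionRevision(db fdb.Database, key fdb.Key, value []byte) (int64, error) {
	rev, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		rv := tr.GetReadVersion().MustGet() // would become mod_revision / create_revision
		tr.Set(key, value)
		return rv, nil
	})
	if err != nil {
		return 0, err
	}
	return rev.(int64), nil
}
```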
This layer was my first application using FDB, so there's a ton of improvements to be made here, which are not yet tracked as GitHub issues; I will take the time to do so soon. I'm focused on implementing the Directory layer in the Rust bindings for now, but I will come back to the layer and improve it soon.
A few more Kubernetes-related notes about my layer:
I would love to have some automated tests that spawn a fake apiserver to exercise the layer. Do you have an idea of how I could achieve this?
The layer is multi-tenant by design, relying on user/password, but this is not finished yet, and I'm not sure whether the apiserver implements password-based auth. I remember only seeing cert auth.
Thank you both for the informative replies, @alloc @PierreZ.
Yeah, I have read Kine's code. Kine builds another MVCC layer on top of the backend DB. That's why I am asking about the monotonically increasing ID. Using such an ID, we can use two tables to achieve something similar to what etcd does: one table holds the key → ID mapping, and another holds the ID → value mapping. Of course, the final design may differ from this in order to better support Watch and to improve performance.
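Roughly, the two-table idea could look like this on FDB (a sketch in Go; the subspace names are placeholders, and the revision would really come from a versionstamp or similar rather than being passed in):

```go
package sketch

import (
	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/subspace"
	"github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

// Two "tables" modeled as subspaces: byKey maps key -> latest revision ID,
// byRev maps (revision ID, key) -> value.
var (
	byKey = subspace.Sub("by_key")
	byRev = subspace.Sub("by_rev")
)

// put stores value under a new revision and points the key at it.
func put(db fdb.Database, key string, rev int64, value []byte) error {
	_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		tr.Set(byKey.Pack(tuple.Tuple{key}), tuple.Tuple{rev}.Pack())
		tr.Set(byRev.Pack(tuple.Tuple{rev, key}), value)
		return nil, nil
	})
	return err
}

// get resolves key -> revision -> value.
func get(db fdb.Database, key string) ([]byte, error) {
	val, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		raw := tr.Get(byKey.Pack(tuple.Tuple{key})).MustGet()
		if raw == nil {
			return []byte(nil), nil // key not found
		}
		t, err := tuple.Unpack(raw)
		if err != nil {
			return nil, err
		}
		rev := t[0].(int64)
		return tr.Get(byRev.Pack(tuple.Tuple{rev, key})).MustGet(), nil
	})
	if err != nil {
		return nil, err
	}
	return val.([]byte), nil
}
```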
Making FDB a usable etcd replacement in a k8s cluster is not an easy task. There are many issues we need to resolve:
1. Performance needs to be better than etcd's.
2. Horizontal scalability, but Watch events must not be lost or reordered.
3. No single points of failure.
Performance requires scalability. Scalability requires distributed processing. However, distributed processing makes it hard to guarantee ordering and no lost events. That's why I feel these goals are really hard to achieve.
Kine cannot resolve all of the above issues, which makes it infeasible as an alternative in our enterprise environment. The good thing is that FDB supports watches, but FDB's watch support is also quite limited.
Disclaimer: I worked a lot around ETCD in my previous job at OVHcloud (I left two weeks ago), especially around this issue. OVHcloud is using one ETCD for hundreds of apiservers, and it was a pain for the SRE team (more details here). I wrote the layer during the first French lockdown, just before joining the K8S team, to learn both FDB and ETCD at the same time. After joining the team, a few months later, we decided to test some ETCD shims. Our goal was to avoid stacking ETCD clusters like they are doing now and to have something designed to scale nicely as we add new customers. I talked with Darren Shepherd a lot about Kine and I was hyped about it. They replaced their ETCDs in Rancher with Kine and they are really happy about it. But the tradeoff is that they are using Amazon's managed SQL products.
You are right! The monotonic ID is interesting, but it only works when running a non-distributed datastore. We tried putting something like CockroachDB below Kine to distribute the tables, but we were hitting too many constraints on both the SQL layer and the SEQUENCE to get good performance.
Well, to be fair, we had better performance than ETCD, but it did not scale enough, as we were getting a lot of transaction restarts with the ReadWithinUncertaintyInterval error.
I also tried forking Kine to handle revisions internally in the ETCD shim instead of relying on the database to generate them, but I left OVHcloud before adding tests for it.
After writing and playing with several ETCD shims, I feel like my first approach was the right one: FoundationDB's interface is insanely good at helping you carefully design a data service, thanks to features like the byte-ordered key-value store, transactions, Versionstamps, Tuples/Subspaces/Directories, and so on.
I'm not scared by 1 and 3, given how badly ETCD behaves under "high" QPS. 2 is completely tied to the design of the layer.
I feel like it is quite easy to design the layer with that in mind. I prefer to pay a bit of additional cost during GetRange to gain a perfectly handled Watch.
FDB's watches are only the tip of the iceberg when it comes to implementing ETCD's watches. At the beginning of the layer I was like you: "I should use the Watches directly!" Then I realized that they are two different beasts:
ETCD's watch is a stream of mutations,
FDB's watch is a notification of a key change.
The first item means that ETCD needs to keep an exact history of its keyspace, keyed by revision, so that a Watch is simply a scheduled query retrieving any new revisions. ETCD, Kine, and my layer are designed like this, with the revision as a key. This is the only way to be sure that your Watches see every mutation.
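To make that concrete, a Watch over a revision-keyed history boils down to something like this (a Go sketch; the layout and the polling interval are made up):

```go
package sketch

import (
	"time"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/subspace"
	"github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

// history maps (revision, key) -> mutation. The layout is illustrative.
var history = subspace.Sub("history")

// pollWatch repeatedly reads every mutation with a revision strictly greater
// than lastRev and forwards it, so no mutation is skipped and ordering follows
// the revision.
func pollWatch(db fdb.Database, lastRev int64, out chan<- fdb.KeyValue) error {
	for {
		kvs, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
			_, end := history.FDBRangeKeys()
			r := fdb.KeyRange{
				Begin: history.Pack(tuple.Tuple{lastRev + 1}),
				End:   end,
			}
			return tr.GetRange(r, fdb.RangeOptions{}).GetSliceOrPanic(), nil
		})
		if err != nil {
			return err
		}
		for _, kv := range kvs.([]fdb.KeyValue) {
			t, err := history.Unpack(kv.Key)
			if err != nil {
				return err
			}
			lastRev = t[0].(int64)
			out <- kv
		}
		time.Sleep(100 * time.Millisecond) // or wake up on an FDB watch instead of sleeping
	}
}
```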
To be fair, a revision-keyed store is not the only way: you could use something like Kafka or Pulsar as the storage layer of an ETCD shim, but that shifts the design completely, as the Watches become simple while the kv interface gives you extra work.
I also feel like FDB's watches could have an impact on production clusters, but that is just a feeling; I have no production experience here.
Nice comment! Let me think about this and post more details later. I have finally found the right place and the right people to discuss these headaches with.
Due to incoming work from other, more urgent tasks, we are going to put this aside for a few weeks. I think this will be a long-term task for us, and a lot of evaluation is needed before making a decision.
Thank you for the very helpful comments; let's discuss this further after I collect more information.
It may be worth a try to see how Kubernetes behaves when skipping all intermediate versions at watch triggers. In my limited understanding of Kubernetes, these stale states are only necessary to update all the layers of the watch cache incrementally (DeltaFIFO)?
The bad thing is that, I think, on large clusters some k/v pairs can get huge and exceed etcd's max value size, so there may be some tricks to decrease the amount of streamed data by using etcd's intermediate values, but I cannot tell.
Anyhow, it may be a better idea to fork the Kubernetes store and client-go to support tweaked watch semantics rather than trying to reproduce etcd's watch behavior exactly on top of FDB?
In my view, the point of overcoming etcd's scalability issues is to be able to get rid of all the etcd cache layers on the Kubernetes side?
On another layer using the Record Layer, I used the setSplitLongRecords option to handle large values. With this option, the Record Layer will spread the value over multiple k/v pairs (the general idea is sketched below). It is not yet used in this layer.
I found it easier to implement this using the Record Layer than forking Kubernetes.
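The general splitting idea (which the Record Layer handles for you via setSplitLongRecords) can be sketched by hand like this in Go; the chunk size and subspace are arbitrary:

```go
package sketch

import (
	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/subspace"
	"github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

const chunkSize = 90_000 // stay under FDB's 100 KB value limit

var blobs = subspace.Sub("blobs") // illustrative subspace

// setLargeValue splits value into chunks stored under (key, 0), (key, 1), ...
// Note that the whole write must still fit in one transaction (10 MB limit).
func setLargeValue(tr fdb.Transaction, key string, value []byte) {
	tr.ClearRange(blobs.Sub(key)) // drop chunks left over from a previous, longer value
	for i := 0; len(value) > 0; i++ {
		n := chunkSize
		if len(value) < n {
			n = len(value)
		}
		tr.Set(blobs.Pack(tuple.Tuple{key, int64(i)}), value[:n])
		value = value[n:]
	}
}

// getLargeValue reassembles the chunks with a single range read.
func getLargeValue(tr fdb.Transaction, key string) ([]byte, error) {
	kvs, err := tr.GetRange(blobs.Sub(key), fdb.RangeOptions{}).GetSliceWithError()
	if err != nil {
		return nil, err
	}
	var out []byte
	for _, kv := range kvs {
		out = append(out, kv.Value...)
	}
	return out, nil
}
```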
Also, it is a side project I used to discover FDB. Forking is definitely another road that can be taken, but merging it upstream will be difficult IMO (see DynamoDb Support for API Server · Issue #53162 · kubernetes/kubernetes · GitHub).
You are right: if we had an ETCD that scaled well under "high" queries per second, we could remove the cache, I guess. From what I saw at my previous company, high QPS for ETCD is around:
~2k ranges/s
~800 txn/s
~1.6k msg/s sent through Watch
Pushing further was (is?) causing a lot of pain for the SRE team.
These numbers can be seen as really high or really low, depending on your background/experience with distributed systems. For me it is both: it is low for a distributed system, but not that bad for a single shard/region.
You can view ETCD as a single Raft group, where the whole keyspace must be held by every ETCD member (this is why ETCD's storage limit is pretty low). The keyspace cannot be split into several regions/shards to spread the load. If you are overloading the Raft group, you are forced to boot another ETCD cluster, split your keys, and operate two clusters. Or 3, or 4, or many, many, many clusters to handle customer growth.
I admit this is not a mainstream issue, but everyone who either has a very large K8S cluster or is trying to reuse ETCD across multiple apiservers is hitting it.
I truly think that the lifecycle of shards/regions should be left to the database itself: operators should not have to manually split/move/merge key ranges. You are forced to do this if you are spawning multiple ETCD clusters, for example to spread/balance the keys and the load across your clusters.
This opinion is also backed by years of on-call duty around HBase 1.x, where the region lifecycle is a mess: we always had to run hbck to fix things.
I have not yet operated an FDB cluster, but I have great hopes for it in terms of correctness and scalability.