We are planning to deploy FDB clusters in Kubernetes using a StatefulSet. We ran some tests with success, but we’re posting this here to double check that what we’re doing makes sense, before moving to production.
Our setup supports N nodes and K coordinators with the following configuration (a rough sketch of the corresponding manifests follows this list):
- PersistentVolumes
- PersistentVolume-0, PersistentVolume-1 … PersistentVolume-N
- these are all manually created, and assigned a static awsElasticBlockStore.volumeID which points to an existing EBS volume
- StatefulSet
- storage-0, storage-1 … storage-N
- out of the N pods, we consider the first K to be the coordinators
- Headless Service
- named “storage”
- selecting all pods of the StatefulSet
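For reference, here is a rough sketch of what these manifests look like (the resource names follow the naming above; the EBS volumeID, capacity, labels, and port are placeholders, not our real values):

```yaml
# Sketch only: one pre-created PV (repeated for each of the N volumes)
# plus the headless Service that gives pods their storage-N.storage DNS names.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: persistentvolume-0
spec:
  capacity:
    storage: 100Gi                      # placeholder size
  accessModes: ["ReadWriteOnce"]
  awsElasticBlockStore:
    volumeID: vol-0123456789abcdef0     # placeholder; points to an existing EBS volume
    fsType: ext4
---
apiVersion: v1
kind: Service
metadata:
  name: storage
spec:
  clusterIP: None                       # headless, so each pod gets a stable DNS record
  selector:
    app: storage                        # placeholder label carried by the StatefulSet pods
  ports:
    - port: 4500                        # fdbserver port (placeholder)
```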
Generating the cluster file (fdb.cluster) and configuration file (foundationdb.conf)
We run our own process named “storage” side by side with fdbserver and fdbmonitor in each pod.
This process is responsible for generating both fdb.cluster and foundationdb.conf files.
The process takes as a parameter K, the number of coordinators, and resolves the IP of each coordinator via its pod DNS name in the Kubernetes cluster: storage-0.storage, storage-1.storage, … storage-(K-1).storage.
We run this resolution in a loop, and whenever the IPs change we generate a new fdb.cluster file and update foundationdb.conf to point to it; fdbmonitor then automatically restarts the fdbserver processes.
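For illustration, here is a minimal sketch of that resolver loop (the description:ID value, file path, port, and K below are placeholders, not our real values):

```python
import socket
import time

K = 3                                  # number of coordinators (passed as a parameter)
CLUSTER_DESC = "storage:storage"       # placeholder description:ID; we never change it
CLUSTER_FILE = "/var/fdb/fdb.cluster"  # placeholder path
FDB_PORT = 4500                        # placeholder port

def resolve_coordinators(k):
    """Resolve storage-0.storage ... storage-(k-1).storage to pod IPs."""
    return [socket.gethostbyname(f"storage-{i}.storage") for i in range(k)]

previous = None
while True:
    try:
        ips = resolve_coordinators(K)
    except socket.gaierror:
        time.sleep(5)                  # a pod is not resolvable yet; retry
        continue
    if ips != previous:
        # fdb.cluster format: description:ID@ip:port,ip:port,...
        contents = CLUSTER_DESC + "@" + ",".join(f"{ip}:{FDB_PORT}" for ip in ips)
        with open(CLUSTER_FILE, "w") as f:
            f.write(contents + "\n")
        # The real process also rewrites foundationdb.conf to reference the new
        # cluster file, so fdbmonitor restarts the fdbserver processes.
        previous = ips
    time.sleep(5)
```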
NOTE that we never run the “coordinators” command, and we never change the cluster description or id.
This has worked successfully in a test environment but we want to make sure we won’t end up with a bad configuration where we lose coordinator sync.
Sample Kubernetes yaml files:
- pvs.yml, stateful-set.yml, service.yml - Github Link
Open Questions:
- Since we never use the “coordinators” command, could we end up in a locked state when adding/removing coordinators? Pods are restarted serially, so technically there will be inconsistent fdb.cluster files co-existing: one for the old coordinator set and one for the new coordinator set.
- We want to be able to survive a full cluster outage, since we keep all pods on a single node. So far, deleting and re-creating the StatefulSet has been successful, but I wonder if this will become a problem in the future.