There’s been some talk on recent threads about the management of fdb.cluster files, but here’s a question I have: would it be possible for the client APIs to support a mode where, rather than reading the initial cluster information from a file, it could simply be passed as a string, and any “rewrites” of that file just happen in memory? And, along with that, the ability to retrieve that information back from the client?
Here’s a more concrete use case with a few more specifics (though you don’t need to focus on them too much): because the FoundationDB client library is, for all realistic purposes, single-threaded in just about every language binding, a multi-process approach with stateless processes is often necessary for clients to scale out reads/writes when deploying a service on a single machine. In my theoretical scenario, I have a server built on FDB that offers something like an HTTP API. Upon startup, it forks itself into multiple processes to scale out; in this specific case, I’m using SO_REUSEPORT on Linux 3.x to round-robin load balance incoming TCP connections across a set of processes all listening on the same port (SO_REUSEPORT ensures the balancing is fair).
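For context, the per-worker listener setup is just the usual SO_REUSEPORT pattern; a minimal sketch of what I mean (the listen_reuseport name and port are mine, error handling omitted):

/* Minimal sketch: each forked worker creates its own listening socket on the
 * same port; with SO_REUSEPORT set on all of them, the kernel spreads incoming
 * connections across the workers. Error handling omitted for brevity. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int listen_reuseport(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, (struct sockaddr*)&addr, sizeof(addr));
    listen(fd, 128);
    return fd;
}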
Now, each process needs to connect to the cluster, but in order to do so it must go through an fdb.cluster file. This seems like a waste, because:
- I now need to write out a file for each client somewhere (having multiple processes synchronize on one fdb.cluster file may be safe, but it still gives me the spooks). So I need something to copy it, or maybe copy it before fork()ing in my own code, or something else…
- You often already have completely fixed connection information anyway. In a lot of orchestration tools, you’re going to do something like allocate fixed IPs to your FoundationDB coordinators, ensuring that the coordinator IPs never change even if the underlying machines do. The durability of the fdb.cluster file seems much less useful in a scenario like this, because A) I will always know the exact coordination/DB info, and B) an ops person is (hopefully) going to step in if something like a coordinator explodes. Actual permanent changes to coordinator topology seem like relatively controlled events. In the case of something like a rack exploding and taking down a coordinator, I’d presumably just migrate the elastic IP address to a new server and move on, not actually rewrite the cluster file on-the-spot.
So really in a case like this, having a durable cluster file at the client level is just kind of annoying. Ideally, my machines running client applications would be nearly 100% stateless, and just do something like pull the cluster connection information out of e.g. EC2 instance metadata. This can be done directly in my application, and I can do something like (in pseudo-Java):
void go() {
    FDB fdb = FDB.selectAPIVersion(...);
    String clusterInfo = grabEc2Metadata("fdb.cluster.info"); // HTTP request to EC2 instance metadata IP
    // clusterInfo = "xxxxxxxx:yyyyyyyy@192.168.xxx.yyy,..."
    try (Database db = fdb.connect(clusterInfo)) {
        // network thread started, connects, opens database
    }
}
In this case, if the cluster configuration changes (say my ops person deploys another coordinator), then, assuming the above client is actively connected in a healthy manner, it will just rewrite its in-memory copy of the information. In the event I started a new process – well, no fixes needed (assuming my ops person also updated the EC2 metadata!)
In fact, on Linux you can actually kind of get around this pretty easily: you use open("/tmp", O_TMPFILE | O_EXCL | O_RDWR | O_CLOEXEC) to get a file descriptor (say fd = 20) for a file that never appears in the physical file system and is closed and unlinked when the process exits. While the process is running, though, there is a “logical” path to it at /proc/self/fd/20, which can be read from that process and its children (hence O_CLOEXEC, to avoid leaking the descriptor itself into exec’d children as well). So you could wrap all this up in a 20-line function that looks something like this (pseudo-ish):
fdb_database_t connect(const char* cluster_info) {
    // private file: read and write, cannot be linked into the filesystem (O_EXCL),
    // and is _not_ shared with exec'd children (O_CLOEXEC) -- they may outlive the
    // parent, leaking an fd, and/or open their own private files
    int fd = open("/tmp", O_TMPFILE | O_RDWR | O_EXCL | O_CLOEXEC, 0600);
    write(fd, cluster_info, strlen(cluster_info)); // assume write(2) succeeds

    char fd_path[PATH_MAX + 1] = {0};
    snprintf(fd_path, sizeof(fd_path), "/proc/self/fd/%d", fd);

    fdb_database_t database = fdb_open(fd_path); // open, using the hidden file
    // go on with life. the private fdb.cluster file will magically
    // disappear when the process exits.
    return database;
}
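To tie it together with the SO_REUSEPORT setup above, the parent would fetch the connection string once and then fork workers that each open their own private cluster file; still pseudo-ish, and grab_ec2_metadata() / serve_requests() are made-up placeholders here:

// pseudo-ish: the parent grabs the connection string once, forks N workers,
// and each worker opens its own hidden fdb.cluster file plus its own
// SO_REUSEPORT listener. grab_ec2_metadata() and serve_requests() are made up.
int main(void) {
    const char* cluster_info = grab_ec2_metadata("fdb.cluster.info");
    for (int i = 0; i < 4; i++) {
        if (fork() == 0) {
            fdb_database_t db = connect(cluster_info); // wrapper from above
            int listener = listen_reuseport(8080);     // sketch from earlier
            serve_requests(listener, db);
            _exit(0);
        }
    }
    for (int i = 0; i < 4; i++)
        wait(NULL); // parent just reaps the workers
    return 0;
}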
But I think something like this might be generally useful, given that you can often make coordinator info very “static”. Also, ideally you might want to be able to get the cluster configuration back, but I’m not sure if there’s a simple API call for this; you could of course query the \xff\xff/status/json key or whatever it is, I suppose…
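For what it’s worth, that special key can already be read with an ordinary transaction through the C API, as far as I know; a rough sketch (assuming an already-opened FDBDatabase* handle, with the API version define just a placeholder):

// Rough sketch: read the machine-readable status JSON through a normal
// transaction. Assumes `db` is an already-opened FDBDatabase* and that
// libfdb_c is available; error handling is mostly omitted.
#include <stdio.h>
#include <string.h>
#define FDB_API_VERSION 610 // placeholder -- use whatever version you target
#include <foundationdb/fdb_c.h>

void print_status_json(FDBDatabase* db) {
    FDBTransaction* tr = NULL;
    fdb_database_create_transaction(db, &tr);

    const char key[] = "\xff\xff/status/json";
    FDBFuture* f = fdb_transaction_get(tr, (const uint8_t*)key,
                                       (int)strlen(key), /*snapshot=*/0);
    fdb_future_block_until_ready(f);

    fdb_bool_t present = 0;
    const uint8_t* value = NULL;
    int value_length = 0;
    if (!fdb_future_get_value(f, &present, &value, &value_length) && present)
        printf("%.*s\n", value_length, (const char*)value);

    fdb_future_destroy(f);
    fdb_transaction_destroy(tr);
}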
Thoughts? I might be willing to write a patch for this if it doesn’t seem too hard.