FoundationDB

Allowing client APIs to use an "in-memory" fdb.cluster file?


(Austin Seipp) #1

There’s been some talk on recent threads about the management of fdb.cluster files, but here’s a question I have: would it be possible for the client APIs to support a mode where, rather than reading the initial cluster information from a file, it could simply be passed as a string, and any “rewrites” of that file just happen in-memory? As well as the ability to retrieve that information?

Here’s a more concrete use case with a few more specifics (though you don’t need to focus on them too much): because the FoundationDB client library is, for all realistic purposes, single threaded (in just about every language binding), a multi-processing approach with stateless processes is often necessary for clients to scale out reads/writes when deploying a service on a single machine. In my theoretical scenario, I have a server built on FDB that offers something like an HTTP API. Upon startup, it forks itself into multiple processes to scale out. (In this specific case, I’m using SO_REUSEPORT on Linux 3.9+ to round-robin load balance incoming TCP connections across a set of processes all listening on the same port – SO_REUSEPORT ensures the balancing is fair.)

Now, each process needs to connect to the cluster, but in order to do so it must do it through an fdb.cluster file. This seems like a waste, because:

  • I now need to write out a file for each client somewhere (having multiple processes synchronize on one fdb.cluster file may be safe, but gives me the spooks, still.) So I need something to copy it, or maybe copy it before fork()ing in my own code, or something else…

  • You often already have completely fixed connection information anyway. In a lot of orchestration tools, you’re going to do something like allocate fixed IPs to your FoundationDB coordinators, ensuring that the coordinator IPs never change, even if the underlying machines do. The durability of the fdb.cluster file seems much less useful in a scenario like this, because A) I will always know the exact coordination/DB info, and B) an ops person is (hopefully) going to step in if something like a coordinator explodes. Actual permanent changes to coordinator topology seem like relatively controlled events. In the case of something like a rack exploding and taking down a coordinator, I’d presumably just migrate the elastic IP address to a new server and move on, not actually rewrite it on-the-spot.

So really in a case like this, having a durable cluster file at the client level is just kind of annoying. Ideally, my machines running client applications would be nearly 100% stateless, and just do something like pull the cluster connection information out of e.g. EC2 instance metadata. This can be done directly in my application, and I can do something like (in pseudo-Java):

void go() {
  FDB fdb = FDB.selectAPIVersion(...);
  String clusterInfo = grabEc2Metadata("fdb.cluster.info"); // HTTP request to EC2 instance metadata IP
  // clusterInfo = "xxxxxxxx:yyyyyyyy@192.168.xxx.yyy,..."

  // proposed API (does not exist today): connect() takes the cluster string itself, not a file path
  try (Database db = fdb.connect(clusterInfo)) {
    // network thread started, connects, opens database
  }
}

In this case, if the cluster configuration changes (say my ops person deploys another coordinator), then, assuming the above client is actively connected in a healthy manner, it will just rewrite its in-memory copy of the information. In the event I started a new process – well, no fixes needed (assuming my ops person also updated the EC2 metadata!)


In fact, for Linux, you can actually kind of get around this pretty easily: you use open("/tmp", O_TMPFILE | O_EXCL | O_RDWR | O_CLOEXEC, 0600) to get a file descriptor (say fd = 20) for a file that doesn’t exist in the physical file system and is discarded when the descriptor is closed, which happens automatically on exit. While the process is running, there is a “logical” path to it under /proc/self/fd/20, which can only be read from that process and its children (hence O_CLOEXEC, which closes the descriptor on exec() so it doesn’t leak into programs the children run). So you could wrap all this up in a 20 line function that looks something like this (pseudo-ish):

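#define _GNU_SOURCE // for O_TMPFILE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// fdb_database_t and fdb_open() are pseudo-API, standing in for the real client calls
fdb_database_t connect(const char* cluster_info) {
  // private file, read and write, cannot be linked into filesystem (O_EXCL), and
  // closed automatically on exec() (O_CLOEXEC). fork()ed children still inherit
  // the fd until they exec; they may outlive the parent, leaking an fd, and/or
  // open their own private files
  int fd = open("/tmp", O_TMPFILE | O_RDWR | O_EXCL | O_CLOEXEC, 0600); // mode is required with O_TMPFILE

  write(fd, cluster_info, strlen(cluster_info)); // assume open(2) and write(2) succeed

  char fd_path[PATH_MAX] = {0};
  snprintf(fd_path, sizeof(fd_path), "/proc/self/fd/%d", fd);
  fdb_database_t database = fdb_open(fd_path); // open, using the hidden file

  // go on with life. the private fdb.cluster file will magically
  // disappear when the process exits.
  return database;
}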

But I think something like this might be generally useful given that you can often make coordinator info very “static”. Also, you ideally might want to get the cluster configuration back, but I’m not sure if there’s a simple API call for this; you could of course query the \xff\xff/status/json key, I suppose…

Thoughts? I might be willing to write a patch for this if it doesn’t seem too hard.


(Alec Grieser) #2

I believe there is an issue to that effect here, with no one assigned: https://github.com/apple/foundationdb/issues/567

There’s also some discussion about that in this forum post: Proposal: Don’t identify a coordinator based on its IP

I think something like that sounds reasonable enough for the reasons you’ve outlined above. I noticed that in your example, you are proposing connecting to some kind of config-vending solution rather than hard-coding the connection information into memory. I’m of two minds about this. The first is that it is probably necessary in many deployments where you might need to respond to a changing cluster file (when, for example, the coordinators change) but you also find that each time your instance is brought up, it might be on a random host from tabula rasa. But the second is that it also means that now your service is dependent on your config-vending solution being highly available, which seems suboptimal. Perhaps there is a middle ground where you have a default cluster configuration string that you fall back to if you cannot communicate with your config-vendor after some timeout?

But this is somewhat of a digression.

I think we’d probably not want to use temporary, in-memory files in the solution if possible (though maybe it’d be fine). In particular, I think the concern would be that it might not be easily portable to other systems. I will add that internally, most things already use what is known as the “cluster config string” (i.e., the cluster file contents) rather than the cluster file path (or file handle) as the way they connect to server processes. So I suspect that it probably wouldn’t be that hard to refactor the parts that do use the file to instead let you use a string and then most of the code probably doesn’t change that much.


(Austin Seipp) #3

> I think something like that sounds reasonable enough for the reasons you’ve outlined above. I noticed that in your example, you are proposing connecting to some kind of config-vending solution rather than hard-coding the connection information into memory. I’m of two minds about this. The first is that it is probably necessary in many deployments where you might need to respond to a changing cluster file (when, for example, the coordinators change) but you also find that each time your instance is brought up, it might be on a random host from tabula rasa. But the second is that it also means that now your service is dependent on your config-vending solution being highly available, which seems suboptimal. Perhaps there is a middle ground where you have a default cluster configuration string that you fall back to if you cannot communicate with your config-vendor after some timeout?

Right, there’s definitely the aspect that making your system require the interaction between complex components (such as EC2 metadata) dramatically increases your surface area and failure rate. Ideally I’d just get away with FoundationDB and nothing else, but in practice I still need to rely on things like S3 for backups, hosting APIs for querying information about my nodes (not even directly, but through tools like Terraform, etc), and all manner of stuff.

In a sense, given how rigorously FoundationDB is tested compared to most other software – databases or not – (which, in all my experience, seems to be quite evident, given how it’s impossible to get it “stuck”), I sort of think of it like: I’m already relying on provisioning tools, my cloud provider, etc, to be stable. Sometimes they aren’t. If the EC2 metadata service isn’t working, for example, a lot of my services are probably going to fall over (such as adjacent policy services, any kind of node bringup, etc). I was always going to risk that possibility by using EC2, so it’s not like I’m taking on substantially more risk than I would otherwise. FoundationDB doesn’t magically add “negative complexity”, but I feel it doesn’t add substantially more failure modes than I’d have to deal with had I chosen another datastore.

For example, as you suggest, a hard-coded cluster configuration put somewhere in the binary, on the filesystem, etc, as a backup in case the metadata service is down is probably something worth implementing! But I was always going to have to implement that, probably, regardless of which underlying data store I chose. (I just don’t have to worry about FoundationDB becoming inconsistent like other offerings…)
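To make the fallback idea concrete, here is a rough sketch in C. Everything in it is hypothetical: the fetch_fn callback type stands in for whatever talks to the config vendor (EC2 metadata or otherwise), the two fetch_* functions are stubs with made-up values, and the only sanity check on the fetched string is that it contains an '@'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fetcher: writes a NUL-terminated cluster string into buf and
 * returns 0, or returns nonzero on failure or timeout. */
typedef int (*fetch_fn)(char *buf, size_t buflen, int timeout_ms);

enum cluster_source { FROM_VENDOR = 0, FROM_FALLBACK = 1 };

/* Try the config vendor first; if it fails, times out, or returns something
 * that does not look like a cluster string, copy the fallback instead.
 * Returns where the string came from, or -1 if the fallback does not fit. */
int resolve_cluster_info(fetch_fn fetch, int timeout_ms, const char *fallback,
                         char *out, size_t outlen) {
  assert(fetch != NULL && fallback != NULL && out != NULL && outlen > 0);

  if (fetch(out, outlen, timeout_ms) == 0 && strchr(out, '@') != NULL)
    return FROM_VENDOR;

  if (strlen(fallback) >= outlen) return -1;
  strcpy(out, fallback);
  return FROM_FALLBACK;
}

/* Stub: a metadata service that cannot be reached. */
int fetch_unavailable(char *buf, size_t buflen, int timeout_ms) {
  (void)buf; (void)buflen; (void)timeout_ms;
  return -1;
}

/* Stub: a healthy metadata service returning a made-up cluster string. */
int fetch_ok(char *buf, size_t buflen, int timeout_ms) {
  (void)timeout_ms;
  const char *value = "prod:k3y@10.0.0.1:4500";
  if (strlen(value) >= buflen) return -1;
  strcpy(buf, value);
  return 0;
}
```

Returning the source alongside the string lets the caller log or alert when it is running on the baked-in default, since that default can go stale after a coordinator change.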

(Here’s another example, to maybe elaborate more. One major aspect of FoundationDB’s design – which I completely agree with – is the lack of any sort of sophisticated policy authorization at the database level. This is a good design choice, but it also means policy and authorization decisions often will get shoved into a layer above the database, where they’re handled best. But in my experience policy authorization almost always requires some sort of external integration, lest you replicate policy methods in a lot of places, and get it wrong to the detriment of your security and users. As a concrete example, I want to use Vault next to my FoundationDB cluster to control things like access to keys, which are managed by server administrators, not developers, and Vault can enforce access policies. So my layer will use a short-term, revocable access token, pass that to Vault, and get back a key, do its work, and throw the key away. Doing this in my own layer, or even writing a Vault clone, seems like a recipe for disaster unless you are extremely cautious. In a sense, even though Vault is not as well oiled as FoundationDB, and increases my surface area, not using it is likely to increase my failure rate even more and cause disastrous errors. It seems like many sorts of interactions like this are largely unavoidable.)

Anyway, I agree this is all a bit adjacent to the topic, but it is fun to think about a bit. :slight_smile:

> I think we’d probably not want to use temporary, in-memory files in the solution if possible (though maybe it’d be fine). In particular, I think the concern would be that it might not be easily portable to other systems. I will add that internally, most things already use what is known as the “cluster config string” (i.e., the cluster file contents) rather than the cluster file path (or file handle) as the way they connect to server processes. So I suspect that it probably wouldn’t be that hard to refactor the parts that do use the file to instead let you use a string and then most of the code probably doesn’t change that much.

Right, the Linux O_TMPFILE example was just to show doing it now is possible as a sort of gimmicky-workaround, not a suggestion that any such feature should actually use that trick, as you note.

I figured, but hadn’t checked, that the cluster file read/write paths were probably pretty limited and most of the codebase used the internal string instead as you mention, so the real code changes wouldn’t be very hard. Good to hear it directly from the horse’s mouth!


So then the real question is one of bikeshedding: what should the API look like? In particular it has to be added to every language binding in order to be useful at all.

I think connect() is an okay name but it’s not particularly unique or evocative (vs. open, for example). I’m open to suggestions here.

Also, ideally, it would be nice to retrieve the cluster information from a connected database. As I mentioned, if you have a handle to the DB, you can basically do this through some sifting of the \xff\xff/status/json key, but it would be nice if there was something like a getConnectedInfo on a database handle that returned this information directly to you in a nice way. So I’m guessing this piece would just be a matter of returning the cluster string. Or maybe it should return some kind of object with properties on it, which can easily be turned into the cluster string representation…
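As a sketch of what the "object with properties" option could look like, here is a small C struct that round-trips the description:ID@host:port,host:port form used in fdb.cluster files. The struct, the function names, and the size limits are all invented for illustration and are not part of any FoundationDB binding; the parser is deliberately strict and does not strip whitespace, comments, or trailing newlines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_COORDINATORS 16
#define FIELD_LEN 64

/* Hypothetical structured view of a cluster string. */
struct cluster_info {
  char description[FIELD_LEN];
  char id[FIELD_LEN];
  char coordinators[MAX_COORDINATORS][FIELD_LEN]; /* "host:port" entries */
  int n_coordinators;
};

/* Copy the non-empty span [begin, end) into dst as a NUL-terminated string. */
static int copy_span(char *dst, size_t dstlen, const char *begin, const char *end) {
  size_t n = (size_t)(end - begin);
  if (n == 0 || n >= dstlen) return -1;
  memcpy(dst, begin, n);
  dst[n] = '\0';
  return 0;
}

/* Parse "description:id@addr,addr,..." into out. Returns 0 or -1. */
int cluster_info_parse(const char *s, struct cluster_info *out) {
  assert(s != NULL && out != NULL);
  const char *at = strchr(s, '@');
  if (at == NULL) return -1;
  /* Only look for the description/id separator before the '@'. */
  const char *colon = memchr(s, ':', (size_t)(at - s));
  if (colon == NULL) return -1;
  if (copy_span(out->description, FIELD_LEN, s, colon) < 0) return -1;
  if (copy_span(out->id, FIELD_LEN, colon + 1, at) < 0) return -1;

  out->n_coordinators = 0;
  const char *p = at + 1;
  for (;;) {
    const char *comma = strchr(p, ',');
    const char *end = comma ? comma : p + strlen(p);
    if (out->n_coordinators == MAX_COORDINATORS) return -1;
    if (copy_span(out->coordinators[out->n_coordinators], FIELD_LEN, p, end) < 0)
      return -1;
    out->n_coordinators++;
    if (comma == NULL) break;
    p = comma + 1;
  }
  return 0;
}

/* Turn the struct back into its string form. Returns 0, or -1 if buf is too small. */
int cluster_info_format(const struct cluster_info *ci, char *buf, size_t buflen) {
  assert(ci != NULL && buf != NULL);
  int n = snprintf(buf, buflen, "%s:%s@", ci->description, ci->id);
  if (n < 0 || (size_t)n >= buflen) return -1;
  size_t used = (size_t)n;
  for (int i = 0; i < ci->n_coordinators; i++) {
    n = snprintf(buf + used, buflen - used, "%s%s", i ? "," : "", ci->coordinators[i]);
    if (n < 0 || (size_t)n >= buflen - used) return -1;
    used += (size_t)n;
  }
  return 0;
}
```

Whether a binding returns the plain string or something like this struct, keeping the two forms convertible in both directions means callers can pick whichever is more convenient.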


Anyway, I think I’ll take a stab at writing a patch for this if it seems useful. Minor API reworks/bikeshedding would probably be most of the work, given the patch seems like it should be relatively easy to author, otherwise, without any substantial internal changes.