What's the purpose of the Directory layer?

I am trying to understand what the Directory layer brings.

My understanding is that the Directory layer implements a hierarchy of directories somewhat like a filesystem, except that instead of files there are key-value pairs.
There is also the fact that another Layer that uses the Directory Layer can introspect its structure via DirectoryLayer.list.
Last but not least, it seems to me the implementation is optimized for online/on-the-fly/dynamic directory creation, i.e. it uses versionstamps to avoid contention and achieve high throughput while creating directories inside a given directory. Is that correct?

It seems Subspace and the Directory layer are mostly here to help avoid big mistakes. What are those mistakes?

The documentation is clear about the fact that client libraries should in most cases use the Directory Layer. The question is: in which case is it a bad idea to use the Directory Layer?

2 Likes

As suggested, the directory layer does provide something that resembles a file-system’s hierarchical directory structure on top of FoundationDB. I think, though, that instead of files being key-value pairs, files are more like “subspaces”. In other words, each logical directory path (e.g., “my-application/my-collection/my-table” in directory notation or (“my-application”, “my-collection”, “my-table”) in tuple notation) is translated into a physical byte string (e.g., \x16\x04\x03). The directory layer guarantees that no other directory layer path maps to that same physical string, so if you wanted to have multiple logical “tables” or “collections” (or whatever makes sense in your data model), you can assign each one a directory path and get a subspace in which you can put all of the data for each such table or collection (or whatever).
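For concreteness, here is a minimal Python sketch of that (the path, API version, and key contents are just illustrative):

import fdb

fdb.api_version(630)  # any reasonably recent API version; adjust to your installation
db = fdb.open()

# Translate a logical path into its short physical prefix (creating it if needed).
table_dir = fdb.directory.create_or_open(db, ('my-application', 'my-collection', 'my-table'))

@fdb.transactional
def write_row(tr, pk, value):
    # Every key packed through table_dir shares the directory's short prefix.
    tr[table_dir.pack((pk,))] = fdb.tuple.pack((value,))

write_row(db, 'row-1', 'hello')
print(table_dir.key())  # the physical prefix bytes; the exact value varies per cluster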

But, in theory, you could just use the subspace layer directly and get the same level of data isolation. For example, you could just create a subspace whose prefix tuple is the same as the directory layer path you were giving to the directory layer, and everything would still work. The differences are:

  • The directory layer generates much shorter prefixes. In particular, it will assign each path a tuple-encoded integer, which (because the tuple layer uses a variable length encoding) means that if you have fewer than 65,536 directories, you will only need 3 bytes per prefix at most. The subspace layer, by contrast, if given strings will produce a prefix which is (slightly) larger than the string itself.
  • The directory layer supports renaming prefixes without moving any of the underlying data. In particular, directory layer “moves” manipulate the logical-to-physical mapping data structures, but not the data themselves.
  • The directory layer can be queried. For example, it’s possible to check whether someone else has already used a given path. (The list method you mentioned is useful for this, and also for recursively walking down different directory paths; see the sketch after this list, which also shows a move.)
  • But, the directory layer requires an extra database read. In particular, before you can get the prefix, you have to ask the database what it is. You can cache this value…but then you can’t (safely) make use of the move or remove features of the directory layer unless you have a good solution to the cache invalidation problem.
  • There is no relationship between the physical locations of different subdirectories’ subspaces. This can be either good or bad. On the good side, it means that (1) your prefixes can be shorter and (2) you can move subdirectories around without moving data. On the bad side, it means that you can’t do something like issue a single range delete to remove all of the data in a directory and its subdirectories. You also can’t do a single range scan and get all of the data if you are interested in copying the data from one cluster to another.
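To make the “queryable” and “movable” points concrete, here is a hedged Python sketch continuing the illustrative paths from above:

import fdb

fdb.api_version(630)
db = fdb.open()

# Assumes the ('my-application', ...) paths from the sketch above already exist.
# list() returns the names of the immediate children of a path.
print(fdb.directory.list(db, ('my-application',)))
print(fdb.directory.list(db, ('my-application', 'my-collection')))

# A "move" only rewrites the logical-to-physical mapping; the data keeps its prefix.
fdb.directory.move(db,
                   ('my-application', 'my-collection', 'my-table'),
                   ('my-application', 'my-collection', 'my-renamed-table'))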

As to whether the directory layer is optimized for dynamic directory creation, I’d say it is. It doesn’t use versionstamps for high throughput, but it uses an internal class called the “high contention allocator” to generate short prefixes in a way that minimizes contention (though it’s not zero contention). Another member of the community actually wrote up a description of how the high-contention allocator works: https://www.activesphere.com/blog/2018/08/05/high-contention-allocator

That being said, you can also use the directory layer if you have only a few directories or if your directories are fairly static. Because of the extra DB reads, you have to be a little careful: since there is typically only one directory layer per cluster (though you can create additional ones), querying it too frequently can create “hot keys” in the directory layer’s subspace (by default, the \xfe subspace). So, if you are primarily using the directory layer as a way of producing short prefixes from long paths (which is a reasonable enough use case), my suggestion would be to heavily cache the directory layer results (and just accept that you will never be able to “remove” or “move” the path, which is fine for many use cases).
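As a rough illustration of that caching suggestion (the cache shape and names are invented, and this deliberately gives up safe move/remove):

import fdb

fdb.api_version(630)
db = fdb.open()

# Hypothetical process-wide cache of path -> DirectorySubspace.
# Only safe if you never move or remove the cached paths.
_dir_cache = {}

def cached_directory(path):
    subspace = _dir_cache.get(path)
    if subspace is None:
        # Only the first lookup per process reads the directory layer's \xfe subspace.
        subspace = fdb.directory.create_or_open(db, path)
        _dir_cache[path] = subspace
    return subspace

docs = cached_directory(('my-app', 'documents'))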


I guess the mistake that those two layers are trying to help you avoid is creating a data model where you end up using the entire keyspace, then need to put some more meta-data somewhere else, and have nowhere to put it, because you’ve already used the whole keyspace for the first part of your application. For example, you could imagine a simple data model where keys are some primary key and values are, say, named tuples (serialized using your favorite named-tuple serializer):

(key1) -> {a: 10, b: "val"}
(key2) -> {a: 66, c: "var"}

But then you decide you want an index on the “a” field of every named-tuple, something like:

10 -> (key1)
66 -> (key2)

But where does it go? If integers are valid keys, then it’s possible you will have keys that intersect with your index…and that’s no good. But if you used subspaces, you might do something like:

("primary", key1) -> {a: 10, b: "val"}
("primary", key2) -> {a: 66: c: "var"}
("secondary", "a_index", 10) -> (key1)
("secondary", "a_index", 66) -> (key2)

I guess I’d also note that even if it’s not necessarily a “mistake” people make, the subspace layer (assisted by the directory layer) is required to implement multi-tenancy in any sane way. If you have multiple users/applications sharing the same FoundationDB cluster, subspaces and directories are the easiest way to achieve data isolation (with each user getting their own directory/subspace).
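For example, a minimal per-tenant directory sketch (tenant names invented) might look like:

import fdb

fdb.api_version(630)
db = fdb.open()

# Each tenant gets its own directory and therefore its own disjoint short prefix.
tenants = fdb.directory.create_or_open(db, ('tenants',))
alice = tenants.create_or_open(db, ('alice',))
bob = tenants.create_or_open(db, ('bob',))

@fdb.transactional
def wipe_tenant(tr, tenant_dir):
    # Clears everything written directly under the tenant's prefix
    # (but not data in its subdirectories, which have their own prefixes).
    r = tenant_dir.range()
    tr.clear_range(r.start, r.stop)

wipe_tenant(db, bob)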


When is it a bad idea to use the directory layer?

Well, as mentioned above, if you only have a few, relatively static directories, it’s usually a bad idea to naïvely use the directory layer, because you might run into hot keys in a situation where the results could easily have been cached. (Caching also saves you a round trip or two.) I suppose you can also run into hot keys (or hot ranges) in general with the directory layer; for that reason, you might need to maintain multiple directory layers and load balance between them if you can’t cache.

Also, even though the directory layer is designed to handle multiple concurrent requests fairly well, if you create directories too frequently you can run into issues stemming from the fact that (1) as you create more directories, the prefixes must necessarily get larger (at a logarithmic rate) and (2) for every directory you create, you must store meta-data about the fact that it exists. So, for example, the “worst case” might be something like creating a new directory for every new key you store. Then, because of the book-keeping information, you are more than doubling the number of keys and the number of key bytes used to store that single key. Yuck.


I guess I’d propose this rule of thumb: the directory layer should be used to create keyspaces to distinguish different uses of the FoundationDB cluster, while the subspace layer should be used directly to separate concerns within one single usage.

I’m not sure that that rule makes any sense as I type it out, but let’s take the example of a simple data model from primary key to named tuple. In this instance, I’d probably say that keeping track of the primary-key data and the index information all falls within a single usage, so one should probably use the subspace layer directly (probably choosing short names for the primary keyspace and secondary index subspaces, perhaps the integers 1 and 2, maybe reserving 0 for meta-data?). But let’s say you wanted to run two copies of this simple system on the same cluster. Then maybe you’d give each copy its own directory prefix. Something like:

dir: "/use1", 1, key1 -> { a: 10, b: "val"}
dir: "/use1", 1, key2 -> { a: 66: c: "var"}
dir: "/use1", 2, "a", 10 -> key1
dir: "/use1", 2, "a", 66 -> key2

dir: "/use2", 1, key3 -> { a: 15 }
dir: "/use2", 1, key4 -> { a: 14, d: false }
dir: "/use2", 2, "a", 14 -> key4
dir: "/use2", 2, "a", 15 -> key3

(Another thing you could do is, for each usage, keep track of its own directory layer that maps, say, index or table names to short prefixes used within that copy of your system. That avoids one use case spilling into another and, say, filling up its directory layer. It also helps balance out the load rather than sending everything to the one-and-only default directory layer and creating hot keys.)
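If you go the per-usage directory layer route, I believe the Python bindings let you construct additional DirectoryLayer instances rooted at subspaces of your choosing via the node_subspace and content_subspace constructor arguments (double-check against your binding version); the paths and subspace names below are purely illustrative:

import fdb

fdb.api_version(630)
db = fdb.open()

# One directory per "usage", allocated by the default (global) directory layer.
use1 = fdb.directory.create_or_open(db, ('use1',))

# A private directory layer whose metadata and content both live inside use1,
# so heavy use of it does not hammer the global \xfe subspace.
use1_dirs = fdb.DirectoryLayer(node_subspace=use1['meta'],
                               content_subspace=use1['data'])

orders = use1_dirs.create_or_open(db, ('orders',))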


Hopefully this is at all cogent. If not, feel free to ask follow up questions.

7 Likes

Does this translate, in terms of Python code, into creating a Directory instance?

I just want to stress that Subspace alone already allows that.

In a “single usage” like, say, a MongoDB-like collection, there are 1) the actual documents, 2) the description of the indices (metadata), and 3) the actual indices. To me, none of those three subspaces needs to be a directory.

I think those are the two most important points.

It seems to me I can get by with the Subspace class and Enum classes that map application / collection / subspace names to short prefixes (sketched below). That avoids the hot-keys problem and the need to come up with caching machinery. I will have:

  • prefixes that are as short as possible
  • the ability to rename subspaces by changing the enum
  • the ability, with some heuristics, to query the structure via enum introspection

That said, it’s a more ‘static’ approach: I cannot create subspaces dynamically in the application, since they are statically ‘allocated’ in the enums.
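A rough sketch of that static enum approach (the enum members and helper are invented for illustration):

import enum
import fdb

fdb.api_version(630)
db = fdb.open()

class Collection(enum.IntEnum):
    # Statically allocated short prefixes; renaming a member is a purely logical
    # rename, since only the integer ends up in the keys.
    USERS = 1
    TWEETS = 2

class Section(enum.IntEnum):
    METADATA = 0
    DOCUMENTS = 1
    INDICES = 2

def subspace(collection, section):
    # Two small integers tuple-encode to a prefix of only a few bytes.
    return fdb.Subspace((int(collection), int(section)))

users_docs = subspace(Collection.USERS, Section.DOCUMENTS)

@fdb.transactional
def put(tr, space, key, value):
    tr[space.pack((key,))] = value

put(db, users_docs, 'user-1', b'...')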

Yeah, creating the directory instance (i.e., calling create, open, or create_or_open). Listing the directory layer also queries it.

Correct.

Yeah, that is loosely what I meant by a single “usage”, though there’s probably a better term for it. That said, different collections might each have their own directory (or different “databases”, where “database” is used to mean “a set of collections”).

Your approach here looks solid enough to me. Depending on the application, you might still be able to create new subspaces on the fly (for certain definitions of “on the fly”). I’m going to use your example of a document database with documents, indices, and meta-data about which indices exist. Let’s say you wanted to add a new index. If you include the (short) subspace prefix in the index meta-data, then you can effectively add to your enum on the fly. Now you have to somehow solve the problem of making sure all of your clients learn that the new index exists (but that shouldn’t be too hard, right?), but once you do, you get the subspace prefix virtually for free.
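A hedged sketch of that idea, with invented meta-data keys: the index’s short prefix is chosen once and recorded alongside its description, so any client that reads the meta-data learns the prefix for free.

import fdb

fdb.api_version(630)
db = fdb.open()

meta = fdb.Subspace((0,))     # index meta-data, following the 0/1/2 scheme above
indices = fdb.Subspace((2,))  # all index data lives under this subspace

@fdb.transactional
def add_index(tr, name, field):
    # Pick the next short integer prefix by counting existing indices; the read of
    # meta.range() conflicts with a concurrent add, so two racing clients cannot
    # end up with the same prefix.
    prefix_id = len(list(tr[meta.range()])) + 1
    tr[meta.pack((name,))] = fdb.tuple.pack((field, prefix_id))
    return indices[prefix_id]

@fdb.transactional
def open_index(tr, name):
    # Any client can recover the index subspace by reading the stored meta-data
    # (the meta-data is tiny, so scanning it is cheap).
    for k, v in tr[meta.range()]:
        if meta.unpack(k) == (name,):
            field, prefix_id = fdb.tuple.unpack(v)
            return indices[prefix_id]
    return None

a_index = add_index(db, 'a_index', 'a')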

But yeah, a static approach would sound reasonable if you have some notion of what the subspaces might look like a priori, which you probably do.

1 Like

@bbc @alloc @amirouche I am new to FDB. I am coming from the RDBMS and Cassandra world of tables. Help me with a “rule of thumb”. My use case is simple: a user table with id and name, and a tweet table with id, userId, and content. In FDB, should user and tweet each be a DirectorySubspace created with create_or_open, or should they be Subspaces? Also, how about id, name, content, etc.?

My approach for MY project is to avoid the Directory layer, but that is a performance optimization. Otherwise said, Directory might work for you. Also, it is a good idea to read the code, at least of the subspace layer, to get a feeling for how things work in FDB.

@alloc @ajbeamon - Could you please confirm whether my understanding of subspaces, as described below, is correct?

When we have multiple levels of subspace under the same prefix “datasource” (see the example below), do they all share the same keyspace, with the result that there is no tenant-level isolation? In other words, is there no difference between using a single string as the subspace and using a subspace tuple (with the same prefix string)? All keys in nested subspaces within the parent subspace are going to pound the same start and end keyrange mapped to the prefix subspace, and this can lead to hot-spotting a single storage server. Is this understanding correct?

Subspace Tuple: (datasource, tenant1) => Key: k1, Value: v1
Subspace Tuple: (datasource, tenant2) => Key: k2, Value: v2
Subspace Tuple: (datasource, tenant3) => Key: k3, Value: v3

It is a bit confusing, because the keys are going to have tenant information in them, which makes me think each tenant is going to get a different keyspace and writes will happen in parallel.

For example, tenant1’s keyrange runs from \x02datasource\x00\x02tenant1\x00 to \x02datasource\x00\x02tenant1\xff.

tenant2’s keyrange runs from \x02datasource\x00\x02tenant2\x00 to \x02datasource\x00\x02tenant2\xff.

When I think more about this, since FDB sorts keys lexicographically, the keys from the different tenants (in my data model above, with the same prefix subspace) will end up near each other, within the same giant keyrange from \x02datasource\x00 to \x02datasource\xff. So it doesn’t provide tenant isolation, though all keys belonging to one tenant will be grouped together.

That’s correct, keys in a subspace will all share the subspace’s prefix, even if you nest multiple subspaces underneath it. The same is not true for directories, where a subdirectory’s data is disjoint from its parent.
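A quick way to see the difference for yourself (the printed prefixes for the directories will vary from cluster to cluster):

import fdb

fdb.api_version(630)
db = fdb.open()

parent_ss = fdb.Subspace(('datasource',))
print(parent_ss.key())             # b'\x02datasource\x00'
print(parent_ss['tenant1'].key())  # b'\x02datasource\x00\x02tenant1\x00' -- shares the prefix

parent_dir = fdb.directory.create_or_open(db, ('datasource',))
child_dir = parent_dir.create_or_open(db, ('tenant1',))
print(parent_dir.key(), child_dir.key())  # two short prefixes, neither contained in the other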

1 Like

@ajbeamon - Follow-up question: does pre-splitting a single subspace into multiple keyranges (by randomizing the key prefix within the subspace) help prevent writes into this subspace from hotspotting a single storage server (similar to pre-splitting a table in HBase to avoid hotspotting a region server)?