Most common issues or annoyances when using the Directory Layer

I just finished a big refactoring of the Directory Layer implementation in the .NET binding. The goals were to add metadata caching (to reduce latency) and to make it easier to integrate with other layers that have their own metadata caching.

I should preface by saying that I heavily use the Directory Layer for everything, and use a lot of directory partitions, usually nested in other partitions, and so the paths are usually long and require one or two “hops” between partitions to resolve. Without any caching, this makes transactions very slow (DL by itself would require up to a dozen serialized reads just to know where the keys are!)

All the points below are either annoyances that have bitten me one too many times (some of them since 2014!) or what I think are real issues that can compound with other bugs and cascade into outages or data corruption.

Directory paths are untyped

Currently, most implementations of the Directory Layer take paths as vectors of strings, usually with a custom overload that takes a single string.

Ex:

  • x.TryOpen("Foo") takes a string
  • x.TryOpen(new string[] { "Foo", "Bar", "Baz" }) array/vector
  • x.TryOpen("Foo", "Bar", "Baz") for languages that support the equivalent of params string[]
  • x.TryOpen(basePath.Append("Baz")) very common where layer is initialized with a base path, and will populate sub-directories as required.

But in the mind of developers, directory paths are very similar to “disk” folder paths, and they like to represent them as “/Foo/Bar/Baz”.

It is very easy for someone to call x.TryOpen("/Foo/Bar/Baz") and end up with a single directory whose name is literally "/Foo/Bar/Baz", because ‘/’ is not recognized as a separator.

One way to deal with it is to not provide methods that take a single string, and force the caller to split the path into arrays/tuples.

But for languages that support the “params string” notation (compiler implicitly creating the array), calling x.TryOpen("/Foo/Bar/Baz") would end up compiled as x.TryOpen(new [] { "/Foo/Bar/Baz" }), and we are back to square one.
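To make the pitfall concrete, here is a tiny sketch in Python (not the .NET binding), where `*segments` plays the role of `params string[]`. The function `try_open` is a hypothetical stand-in that only returns the segments it would look up.

```python
def try_open(*segments: str) -> tuple:
    """Hypothetical stand-in for a variadic TryOpen: returns the segments as received."""
    return tuple(segments)


# Intended call: three nested directories.
nested = try_open("Foo", "Bar", "Baz")

# Accidental call: the language wraps the single string into a 1-element vector,
# so the Directory Layer sees ONE directory whose name contains the slashes.
accident = try_open("/Foo/Bar/Baz")

print(len(nested), len(accident))
```

Both calls compile and run without any warning, which is why removing the single-string overload alone does not help.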

I think that the Directory Layer API should have its own “Path” type that models the notion of a path as a list of “segments”, with a set of factory methods on that type:

class FdbPath { 
   Segments[]  // holds the "segments"
   FullName    // return the serialized path, like '/Foo/Bar/Baz'
   Name        // return the last segment, like 'Baz'
   [n]         // get the Nth segment (0-based)
   Parent      // return the parent path (or nothing/error if top-level)
   Append(..)  // return a new path with extra segments
   Slice(..)   // return a sub-section of the path
   ...
   static Parse("/Foo/Bar/Baz")
   static Create("Foo", "Bar", "Baz")
}
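
A minimal sketch of what such a type could look like, written in Python to keep it short. The member names mirror the outline above, but the details are my assumptions: in particular, `create` rejects any segment containing ‘/’ (no escaping is attempted here), and `parent` raises when the path is already the root.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class FdbPath:
    segments: tuple[str, ...] = ()

    @staticmethod
    def create(*names: str) -> FdbPath:
        # Assumption of this sketch: a segment may not contain '/'.
        for name in names:
            if "/" in name:
                raise ValueError(f"segment {name!r} contains '/'; use FdbPath.parse")
        return FdbPath(tuple(names))

    @staticmethod
    def parse(text: str) -> FdbPath:
        # Split on '/', ignoring the leading slash and any empty segment.
        return FdbPath(tuple(s for s in text.split("/") if s))

    @property
    def full_name(self) -> str:
        return "/" + "/".join(self.segments)

    @property
    def name(self) -> str:
        return self.segments[-1] if self.segments else ""

    @property
    def parent(self) -> FdbPath:
        if not self.segments:
            raise ValueError("the root path has no parent")
        return FdbPath(self.segments[:-1])

    def __getitem__(self, index: int) -> str:
        return self.segments[index]

    def append(self, *names: str) -> FdbPath:
        return FdbPath(self.segments + FdbPath.create(*names).segments)

    def slice(self, start: int, stop: int | None = None) -> FdbPath:
        return FdbPath(self.segments[start:stop])


path = FdbPath.parse("/Foo/Bar/Baz")
print(path.full_name, path.name, path.parent.full_name)
```

With this shape, a method that only accepts `FdbPath` forces the caller to choose between `parse` and `create`, and `FdbPath.create("/Foo/Bar/Baz")` fails loudly instead of creating a strangely named directory.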

Relative vs Absolute paths

There are two main components that drive the Directory Layer:

  • the DirectoryLayer instance itself, which contains the code that reads/mutates the \xFE subspace.
    • This instance takes “absolute paths” and walks the tree of nodes to find their prefix and layer id.
  • the DirectorySubspace instances that are returned, on the other hand, have methods that take relative paths

Examples:

var folder = DirectoryLayer.CreateOrOpen("/Foo/Bar") // takes an absolute path
Assert(folder.Path == "/Foo/Bar")
var child = folder.CreateOrOpen("Baz")  // takes a relative path
Assert(child.Path == "/Foo/Bar/Baz")

The current implementation of a directory path is a vector of strings, which does not encode the notion of “absolute” vs “relative”. In regular “disk” paths, we usually add a leading ‘/’ to distinguish between the two (“/Foo/Bar” vs “Foo/Bar”).

When refactoring the Directory Layer to implement caching (and optimizing some of the internal implementation), this caused many issues, like double concatenation that produces things like “/Foo/Bar/Foo/Bar/Baz” instead of “/Foo/Bar/Baz”.

note: these are very similar to the kind of bugs one ends up with when attempting to combine file paths “by hand”

I think that directory paths should be extended to have a flag “Absolute” vs “Relative”, in order to distinguish them. When represented as a single string, it would simply add the leading ‘/’ or not.

In the above Path type, we would add an “IsAbsolute” boolean flag, which would match the presence of the leading ‘/’.

All methods for combining paths would make sure that the type is preserved and that some operations are forbidden (like appending an absolute path to a relative path, …)

Then, all methods of the Directory Layer API could check the type of the path: the Directory Layer would only accept absolute paths, and the instance methods of directory subspaces would only accept relative paths OR absolute paths that are children of the subspace.
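
A rough sketch of these rules, again in Python and under my own assumptions: `DirPath` and `resolve` are hypothetical names, and `resolve` stands for what an instance method of a directory subspace located at `base` could do with the path it receives.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class DirPath:
    segments: tuple[str, ...]
    is_absolute: bool

    @staticmethod
    def parse(text: str) -> DirPath:
        # The leading '/' is what marks a path as absolute.
        return DirPath(tuple(s for s in text.split("/") if s), text.startswith("/"))

    def __str__(self) -> str:
        return ("/" if self.is_absolute else "") + "/".join(self.segments)

    def append(self, other: DirPath) -> DirPath:
        # Forbidden: appending an absolute path to anything.
        if other.is_absolute:
            raise ValueError(f"cannot append absolute path {other} to {self}")
        return DirPath(self.segments + other.segments, self.is_absolute)

    def is_child_of(self, base: DirPath) -> bool:
        n = len(base.segments)
        return (self.is_absolute == base.is_absolute
                and len(self.segments) > n
                and self.segments[:n] == base.segments)


def resolve(base: DirPath, path: DirPath) -> DirPath:
    """Turn the argument of a subspace method into an absolute path."""
    if not base.is_absolute:
        raise ValueError("a directory subspace must have an absolute path")
    if not path.is_absolute:
        return base.append(path)      # relative: resolved against the subspace
    if path.is_child_of(base):
        return path                   # absolute and below the subspace: used as-is
    raise ValueError(f"{path} is not located under {base}")


base = DirPath.parse("/Foo/Bar")
print(resolve(base, DirPath.parse("Baz")))
print(resolve(base, DirPath.parse("/Foo/Bar/Baz")))
```

Both calls resolve to /Foo/Bar/Baz, so the double concatenation cannot happen, and an absolute path outside of the subspace is rejected instead of being silently re-rooted.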

Introducing this change would likely reveal a lot of bugs, like code that “works” because - by coincidence - it is passing relative paths to some object whose base path is ‘/’. This code would work until an admin changes the base path to ‘/Foo’.

Paths do not carry the layer id of the parents

Paths to directories are only vectors of strings, and do not carry the “LayerId” of the corresponding folders.

This is a big issue when recursively creating parent directories that are located in partitions or that use custom layer ids!

ex: CreateOrOpen("/TopPartition/Some/Path/To/SomethingWithALayerId/Foo", "LayerForFoo")

Here, “TopPartition” is a directory partition (layerId = “partition”) and “SomethingWithALayerId” has layer Id “MyAwesomeLayer”.

If some maintenance operation ended up deleting the whole TopPartition partition, and some other tool then wanted to create the Foo nested sub-directory, the code would recursively attempt to create all the parents. But since the method call only provides the layer id for the leaf, it would create both TopPartition and SomethingWithALayerId as regular folders.

Then other code attempting to open these folders by specifying the correct layer id will fail (since these directories have an empty layer id). This is even more of a problem for partitions, because it completely changes the expectations of the administrator (key prefix isolation!)

To fix this:

  1. Either we have to NOT recursively create missing parents, and throw if the parent is missing (the caller has to ensure the path from the root is valid before creating a leaf)
  2. OR we change the meaning of the directory path to be a vector of pairs (Name, LayerId), one for each segment:
    • ex: { ("TopPartition", "partition"), ("Some",""), ("Path", ""), ("To", ""), ("SomethingWithALayerId", "MyAwesomeLayer"), ("Foo", "LayerForFoo") }
    • that way, whenever we want to access or create any node traversed by the path, we know its layerId and can use that to create it if it is missing.
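
To illustrate the second option, here is a simplified sketch in Python that uses an in-memory dictionary in place of the real node subspace. Everything in it is an assumption made for the example: `create_or_open` and `LayerMismatch` are hypothetical names, and a real partition would also get its own node subspace, which is ignored here.

```python
from __future__ import annotations


class LayerMismatch(Exception):
    """Raised when an existing directory has a different layer id than expected."""


def create_or_open(tree: dict, path: list) -> None:
    """Walk `path` (a list of (name, layer_id) pairs) from the root.

    `tree` maps a tuple of names to the layer id stored for that directory.
    Missing directories are created with the layer id declared in the path.
    """
    names = ()
    for name, layer_id in path:
        names += (name,)
        existing = tree.get(names)
        if existing is None:
            tree[names] = layer_id    # missing: created with the declared layer id
        elif existing != layer_id:
            raise LayerMismatch(f"{'/'.join(names)}: expected {layer_id!r}, found {existing!r}")


path = [("TopPartition", "partition"), ("Some", ""), ("Path", ""), ("To", ""),
        ("SomethingWithALayerId", "MyAwesomeLayer"), ("Foo", "LayerForFoo")]

tree = {}                   # simulates the state after TopPartition was deleted
create_or_open(tree, path)
print(tree[("TopPartition",)])
```

After the call, TopPartition is recreated with the "partition" layer id instead of as a regular folder, and opening the same path again with a wrong layer id for any segment raises instead of silently succeeding.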

Solution 1 is the simplest to do for the binding, but may create issues at runtime for application developers: they may forget to check the parent, or a parent could be removed at any time.

Solution 2 seems to address the issue properly, but requires the application developer to know all the layer ids of all the parent folders.

This may not be an issue, because for example in most of my code, layers are passed a “base” folder, derived from the config, and will append their own sub-path to it, adding their own layer id when required.

Since most people want to represent a path as a string, we can’t simply join the segments with /, like /TopPartition/Some/Path/To/SomethingWithALayerId/Foo, because it would lose the layer ids.

We could maybe represent it like /TopPartition[partition]/Some/Path/To/SomethingWithALayerId[MyAwesomeLayer]/Foo[LayerForFoo], and add rules for escaping [ if it is part of the name.

There are cases when some code wants to traverse a path without bothering to check the layer ids, so maybe if ‘[…]’ is present it means “check that the layer id is equal to this”, but when omitted it means “don’t check”
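
Here is one possible sketch of that string form, in Python. The escaping rule is purely an assumption for the example (a backslash in front of ‘/’, ‘[’, ‘]’ and ‘\’), as are the names `format_path`, `parse_path` and `layer_matches`. A layer of `None` stands for “no brackets”, meaning “don’t check”.

```python
import re

_SPECIAL = "\\/[]"
_TOKEN = r"(?:\\.|[^\\/\[\]])"          # an escaped char, or any non-special char
_SEGMENT = re.compile(rf"/({_TOKEN}+)(?:\[({_TOKEN}*)\])?")


def _escape(text):
    return "".join("\\" + c if c in _SPECIAL else c for c in text)


def _unescape(text):
    return re.sub(r"\\(.)", r"\1", text)


def format_path(segments):
    """Serialize a list of (name, layer_or_None) pairs into '/name[layer]/...'."""
    parts = []
    for name, layer in segments:
        part = _escape(name)
        if layer is not None:
            part += "[" + _escape(layer) + "]"
        parts.append(part)
    return "/" + "/".join(parts)


def parse_path(text):
    """Parse '/name[layer]/...' back into a list of (name, layer_or_None) pairs."""
    if text == "/":
        return []
    segments, pos = [], 0
    while pos < len(text):
        match = _SEGMENT.match(text, pos)
        if match is None:
            raise ValueError(f"invalid path at offset {pos}: {text!r}")
        name, layer = match.group(1), match.group(2)
        segments.append((_unescape(name), None if layer is None else _unescape(layer)))
        pos = match.end()
    return segments


def layer_matches(expected, actual):
    # No brackets in the path (None) means "don't check the layer id".
    return expected is None or expected == actual


parsed = parse_path("/TopPartition[partition]/Some/Path/To"
                    "/SomethingWithALayerId[MyAwesomeLayer]/Foo[LayerForFoo]")
print(parsed[0], parsed[1])
```

Note that with this convention `/Some` (don’t check) and `/Some[]` (check that the layer id is empty) are different paths, which may or may not be what callers expect.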

Open Questions:

  • What is the best way to encode the layer, alongside the name, in a path segment? Right now I’m using name[layer] with rules to encode any [ or ] present in either the name or the layer, but this seems a bit weird. I thought of using other separators, like ':' or ';' or '!', but there are legitimate uses of these in applications, for example when using IPv6 addresses as folder names (::1).
  • It feels like the layer id is a mix between a content type (application/foo) and a file extension (xxx.jpg). I thought of using the . as a separator (name.layer) but again the dot is present in natural keys like IPv4 addresses, etc…

List / TryList do not return the full path

This is a minor annoyance that I lived with for a long time, but the List/TryList methods usually only return the name of the sub-folders, not their path. They also don’t return the layer id of the directory.

The caller can simply add back the parent path to the resulting array, but this is more code and more memory allocations.

I think that List/TryList should return the list of the absolute ‘paths’ of the children, and by extension include the layer ids of these.
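
As a sketch of the shape I have in mind, here is a Python version over an in-memory dictionary that maps each directory (a tuple of names) to its layer id. The name `list_children` and the data are made up for the example; a real implementation would read the children from the node subspace.

```python
def list_children(tree, parent):
    """Return the (absolute path, layer id) of each direct child of `parent`, sorted by path."""
    depth = len(parent) + 1
    return sorted(
        (path, layer_id)
        for path, layer_id in tree.items()
        if len(path) == depth and path[:-1] == parent
    )


tree = {
    ("Foo",): "",
    ("Foo", "Bar"): "partition",
    ("Foo", "Baz"): "MyAwesomeLayer",
    ("Foo", "Bar", "Deep"): "",
    ("Other",): "",
}

children = list_children(tree, ("Foo",))
for path, layer_id in children:
    print("/" + "/".join(path), repr(layer_id))
```

The caller gets paths it can pass straight back to the Directory Layer, together with the layer ids, without rebuilding them from the parent path.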