FoundationDB

Is possible set a value as a reference to another subspace?


(Wolf Dan) #1

Hello,

I’ve been playing around with fbd for a few days, right now I’m trying to create a graph database just to learn more about fbd

I’ve designed the graph looks like this:

(uid (a global id), property_name, value), this for the node (edge) properties, but when I want to reference another node is where I’m having issues

So far the only two ways I’ve think of are:

  • set the value to the tuple as the global uid, so I need to get the property values and then get the subspace properties based on that uid stored
  • set a reference to that subspace and get all in one single query, this is what I’m asking for right now if it is possible

I hope you can help me, and I’m glad if can give me better ideas to implement what I’m trying to do

Greetings!


(Alec Grieser) #2

I don’t know if I’ve totally understood your question, but maybe I can try to answer it.

So, as FDB only has byte-arrays for keys and byte-arrays for values, subspaces are only fixed byte prefixes that are added to all keys as a prefix within a subspace (nothing more, nothing less). So, if you wanted to reference the other subspace, you might include that byte prefix in your tuple. (Or, like, if your subspace is constructed from packing a tuple, you might place the value that will be packed to make the subspace in the tuple directly, if that makes sense.)

But it won’t be an automatically de-referenced reference, i.e., there isn’t a single query you can give FDB that will read everything in a single operation that also reads everything that is in the referenced subspace. You will instead need to do something like read the range yourself, parse out the keys that are referencing other nodes, and then issue the appropriate reads to read those keys.


(Amirouche) #3

FWIW, I find the Subspace implementation in Python bindings rather complex. That being said, the documentation is clear and should not hinder people from using them. My understanding is that Subspace is lower level than Directory.

@WolfDan First, I implemented a graph database using multiple schema in ajgudb project in Python. Look in the history for leveldb and bsddb.

There is multiple implementation are possible on top ordered key-value store AFAIK. The initial design used two “tables” or two Directory in FDB parlance and then I added more. Eventually I moved to Entity-Attribute-Value model because the implemention is simpler (but not without caveats).

In the following, I summarize what I have experimented with.

Method 1: Multipe Directory

Like I said, the initial design used to two directories, the first Directory stores vertices (nodes) and the second Directory stores edges (links). The first table vertex looks simply like the following:

database = [
    Key(b'P4X432'), Value(b"{'field_one': 'value one', 'field_two': 'value two'}")
]

In that Directory the Key is the unique identifier of the Vertex and the Value is the utf8 encoded json serialized JSObject / mapping or dictionary (depending on your prefered language vocabulary).

The edge is similar except that instead of only storing the associated properties (or fields if you prefer) it also store the start and end vertices references (or unique identifiers).

The Value of the edge `Directory is something that looks like that:

Value(
    json.dumps(
        dict(
            start='P4X432', 
            end='P4X432', 
            properties=dict(other_field_one='other value one', other_field_two='other value two')
)).encode('utf-8'))

The above defines an edge that starts at the example vertex from further above and ends also at the same vertex forming a loop.

With that you already have a graph database. The problem with that implementation is about querying. Imagine you have a vertex, you know a vertex so well, that you know its vertex identifier. Say, P4X432 using FDB API you can easily ask for the (key, value) inside the vertex Directory that has b'P4X432' as identifier since it’s the entry with b'P4X432' as Key. That’s what FDB does very exact match lookup of Key (or range lookups but leave that for later).

Now, you have the famous vertex, you want to know what connect to it. There is only one place this information can be found and it’s in the second Directory. That said, it’s stored as a Value so you can not simply get it using FDB. One way to go, with that particular schema would be to do a full scan of the Directory and deserialize every Value and see if it has P4X432 as start. NEVER DO A FULL SCAN except for debug :slight_smile:

So, to avoid that memory and CPU heavy query, we will denormalize the data and repeat in another Directory the information that edges connects to a given “start vertex”. For doing that, mind the fact that we need to connect the “start vertex” to the edge. So the relation is reversed. The edge Directory connects an edge unique identifier to some edge value that knows about which vertex it connects. Here we want the reverse relationship: «what connects to a given vertex». Indeed the relation is inverted that why what we are going to use is called an index. So there is a third Directory that implements the relationship start vertex unique identifier → edge vertex unique idenfier. That one is very simply describe as

edges_starting_at = [ 
    (Key('P4X432'), Value('SOME-EDGE-IDENTIFIER'))
]

So now, with that Directory with a simple transaction.get(b'P4X432') we know what connect to P4X432.

I made a mistake. Since the Key must be unique, there can only be a single edge connecting to a given edge. No problem, let’s make the Key unique. We could use a random bytes value, but it leads to other problems. We could use transaction timestamp but there is simpler solution which is to stuff the edge unique identifier inside the Key. For instance:

edge_starts_at = [
    (Key(pack(('P4X432', 'SOME-EDGE-IDENTIFIER')), Value(''))
]

We can leave the value empty. That Directory can be queried using a slice or range (depending on your language vocabulary) for (key, value) that starts with the beloved vertex unique identifier P4X432.

That way there can be several entries in that Directory that starts with the vertex identifier that you are looking for. Then given the key, you unpack it retrieve the edge identifier from the tuple and hop inside the edge Directory for that (key, value) pair. you retrieve the related edge value and get back the full information about that edge that is:

Value(
    json.dumps(
        dict(
            start='P4X432', 
            end='P4X432', 
            properties=dict(other_field_one='other value one', other_field_two='other value two')
)).encode('utf-8'))

With that information, you learn that 'P4X432 is connected to P4X432 forming a beautiful single edge loop.

Method 2: A single Directory

The second method use a single directory (and second index Directory if you want to query edge incoming or outgoing from a given vertex).

It rely on EAV pattern.

The Directory schema looks like the following:

Key(pack(Entity, Attribute)) → Value(pack(Value))

That is the key part is stuffed with Entity which is a fancy way to say “unique identifier” and Attribute which is a way to say “field” or “property” and Value contains the value associated to that Attribute for the given Entity. For instance:

Key(pack(('P4X432', 'name'))) → Value('amirouche')
Key(pack(('P4X432', 'age'))) → Value(33)

Since the Key is unique in a given Directory there is a single entry that contains P4X432, "name" and 'amirouche'.

Since, FDB can quickly query entries that have the same prefix through range queries you can quickly build the following document from the above database description:

p4x432 = dict(
    name='amirouche',
    age=33
)

That Directory builds a layer in FDB parlance or abstraction if you prefer. A document store layer. On top of that layer you can build different kinds of documents one that have kind property as vertex and ones that have kind property as edge.

To be able to query for incoming and outgoings edges, you must build another directory (again repeating the same information that in the first table). And shuffle the (key, vaue) pair to look like:

Key(pack(Attribute, Value)) → Value(Entity)

That way if you do a range query over pack((‘start’, 'P4X432')) you will know what entities have P4X432 has start property. Otherwise said, what starts at P4X432 then you can get SOME-EDGE-IDENTIFIER and re-use the previous Directory to retrieve the document representing SOME-EDGE-IDENTIFIER.

Conclusion

The EAV pattern (method 2) is easier to grasp I think. I did not fully explain how to create the second Directory in the method 2. Let me know if it works for you.

If you are not confortable with Subspace and Directory just use come up with something that works tuple.pack and tuple.unpack and come back to Subspace when you have better understanding of how FDB works.


(Amirouche) #4

Ok, my answer is somewhat offtopic. @alloc answered the question AFAIU.


(Wolf Dan) #5

Sorry for the late response guys

Yeah sorry for being so confusing, I’m not a native english speaker ^^’ and yeah is exactly the answer I was looking for thank you so much

@amirouche I was exactly taking a look to your project before I created this post ^^ amazing work, I decided to go with the single directory method is easier to implement imo, but I changed it a bit, right now I’m using 4 subdirectories

  • The property directory where I do the Key(pack(Entity, Attribute)) → Value(pack(Value)) pattern
  • An index where I can get the nodes by the property fixed values
  • An in and out nodes subdirectories, where I reference the other nones keys

I’m still looking to implement another things like text search, pagination and queries like in WHERE x.value > n and things like that, this is the place where I’m having more problems to implement since I dont find any way to do it ^^’

Thank you guys for your help


(Amirouche) #6

It’s somewhat off topic, but I finally understood the interest of Directories I think but I can’t explain it.


(Amirouche) #7

That’s what is called EAV table

That’s AVE index.

That’s new to me! I am wondering why you do that? Is that some form of static typing?


(Wolf Dan) #8

Thank you for the term names, I’m not good at technically speaking ^^’

It was just a test but it didn’t worked as I wanted, here is the code I’m working on if you’re interested ^^ https://github.com/OkamiIO/Nomure/tree/python (Expect a bit messy code, I’m not too used to code in python)