FWIW, I find the Subspace implementation in Python bindings rather complex. That being said, the documentation is clear and should not hinder people from using them. My understanding is that Subspace
is lower level than Directory
.
@WolfDan First, I implemented a graph database using multiple schema in ajgudb project in Python. Look in the history for leveldb and bsddb.
There is multiple implementation are possible on top ordered key-value store AFAIK. The initial design used two “tables” or two Directory
in FDB parlance and then I added more. Eventually I moved to Entity-Attribute-Value model because the implemention is simpler (but not without caveats).
In the following, I summarize what I have experimented with.
Method 1: Multipe Directory
Like I said, the initial design used to two directories, the first Directory
stores vertices (nodes) and the second Directory
stores edges (links). The first table vertex looks simply like the following:
database = [
Key(b'P4X432'), Value(b"{'field_one': 'value one', 'field_two': 'value two'}")
]
In that Directory
the Key
is the unique identifier of the Vertex
and the Value
is the utf8 encoded json serialized JSObject
/ mapping or dictionary (depending on your prefered language vocabulary).
The edge is similar except that instead of only storing the associated properties (or fields if you prefer) it also store the start
and end
vertices references (or unique identifiers).
The Value
of the edge `Directory is something that looks like that:
Value(
json.dumps(
dict(
start='P4X432',
end='P4X432',
properties=dict(other_field_one='other value one', other_field_two='other value two')
)).encode('utf-8'))
The above defines an edge that starts at the example vertex from further above and ends also at the same vertex forming a loop.
With that you already have a graph database. The problem with that implementation is about querying. Imagine you have a vertex, you know a vertex so well, that you know its vertex identifier. Say, P4X432
using FDB API you can easily ask for the (key, value)
inside the vertex Directory
that has b'P4X432'
as identifier since it’s the entry with b'P4X432'
as Key
. That’s what FDB does very exact match lookup of Key (or range lookups but leave that for later).
Now, you have the famous vertex, you want to know what connect to it. There is only one place this information can be found and it’s in the second Directory
. That said, it’s stored as a Value
so you can not simply get it using FDB. One way to go, with that particular schema would be to do a full scan of the Directory and deserialize every Value and see if it has P4X432
as start
. NEVER DO A FULL SCAN except for debug
So, to avoid that memory and CPU heavy query, we will denormalize the data and repeat in another Directory
the information that edges connects to a given “start vertex”. For doing that, mind the fact that we need to connect the “start vertex” to the edge. So the relation is reversed. The edge Directory
connects an edge unique identifier to some edge value that knows about which vertex it connects. Here we want the reverse relationship: «what connects to a given vertex». Indeed the relation is inverted that why what we are going to use is called an index. So there is a third Directory
that implements the relationship start vertex unique identifier → edge vertex unique idenfier
. That one is very simply describe as
edges_starting_at = [
(Key('P4X432'), Value('SOME-EDGE-IDENTIFIER'))
]
So now, with that Directory
with a simple transaction.get(b'P4X432')
we know what connect to P4X432
.
I made a mistake. Since the Key
must be unique, there can only be a single edge connecting to a given edge. No problem, let’s make the Key
unique. We could use a random bytes value, but it leads to other problems. We could use transaction timestamp but there is simpler solution which is to stuff the edge unique identifier inside the Key. For instance:
edge_starts_at = [
(Key(pack(('P4X432', 'SOME-EDGE-IDENTIFIER')), Value(''))
]
We can leave the value empty. That Directory can be queried using a slice or range (depending on your language vocabulary) for (key, value)
that starts with the beloved vertex unique identifier P4X432
.
That way there can be several entries in that Directory
that starts with the vertex identifier that you are looking for. Then given the key, you unpack it retrieve the edge identifier from the tuple and hop inside the edge Directory
for that (key, value)
pair. you retrieve the related edge value
and get back the full information about that edge that is:
Value(
json.dumps(
dict(
start='P4X432',
end='P4X432',
properties=dict(other_field_one='other value one', other_field_two='other value two')
)).encode('utf-8'))
With that information, you learn that 'P4X432
is connected to P4X432
forming a beautiful single edge loop.
Method 2: A single Directory
The second method use a single directory (and second index Directory
if you want to query edge incoming or outgoing from a given vertex).
It rely on EAV pattern.
The Directory schema looks like the following:
Key(pack(Entity, Attribute)) → Value(pack(Value))
That is the key part is stuffed with Entity
which is a fancy way to say “unique identifier” and Attribute
which is a way to say “field” or “property” and Value contains the value associated to that Attribute
for the given Entity
. For instance:
Key(pack(('P4X432', 'name'))) → Value('amirouche')
Key(pack(('P4X432', 'age'))) → Value(33)
Since the Key is unique in a given Directory
there is a single entry that contains P4X432
, "name"
and 'amirouche'
.
Since, FDB can quickly query entries that have the same prefix through range queries you can quickly build the following document from the above database description:
p4x432 = dict(
name='amirouche',
age=33
)
That Directory builds a layer in FDB parlance or abstraction if you prefer. A document store layer. On top of that layer you can build different kinds of documents one that have kind
property as vertex
and ones that have kind
property as edge
.
To be able to query for incoming and outgoings edges, you must build another directory (again repeating the same information that in the first table). And shuffle the (key, vaue)
pair to look like:
Key(pack(Attribute, Value)) → Value(Entity)
That way if you do a range query over pack((
‘start’, 'P4X432'))
you will know what entities have P4X432
has start
property. Otherwise said, what starts at P4X432
then you can get SOME-EDGE-IDENTIFIER
and re-use the previous Directory to retrieve the document representing SOME-EDGE-IDENTIFIER
.
Conclusion
The EAV pattern (method 2) is easier to grasp I think. I did not fully explain how to create the second Directory in the method 2. Let me know if it works for you.
If you are not confortable with Subspace
and Directory
just use come up with something that works tuple.pack
and tuple.unpack
and come back to Subspace
when you have better understanding of how FDB works.