How to reduce data size?

Apache · April 2, 2023, 7:50pm

Say, I have these two tables:

Course{uuid, name, description}
Student{uuid, name, address}

I am new to FDB, so as per my understanding, the design of these two tables in FDB will be as follows

For Course table:

tuple(uuid, "name")
tuple(uuid, "description")

For Student table:

tuple(uuid, "name")
tuple(uuid, "address")

Please correct me if I am wrong in my understanding.

If my understanding is correct, then my question is, what is the way to avoid repeating the value of the uuid in each key of the record? As the uuid is 16 bytes, prepending it to each field of the record increases the size. For few records this may not be a problem, but if there are thousands of record entries, then this adds up very quickly to the overall size of the DB storage. What is a way to reduce the size of the record entry and hence the DB storage space in this case?

markus.pilman · April 3, 2023, 7:46pm

FDB doesn’t support tables, it just a key-value store (the tuple layer is shipped with the client, but there’s no obligation to use it).

So if you want to store relational data in FDB you have to come up with a scheme how you want to translate this to keys and values – and there’s many ways of doing this. A simple one would be to use an existing layer (like the record layer). Or you can come up with something yourself.

if you only want to use the tuple layer, you would probably use the uuid as your key and a tuple (name, description) as the value. However, this means you can only query for the uuid. So you’d also need to set up an index. So you might also have a set of kv-pairs for name → uuid. If the name isn’t unique, you’d do something like (name, uuid,) → "" and then you can use a range query to find all records of a given name.

You also might want to use the directory layer to efficiently store these things in different subranges so you don’t accidentally interpret a uuid as a name and vice-versa.

Now for size reduction: you could use an int64 and use FDB to increment it atomically. Or you could use a versionstamp. However, it is probably not going to me a huge difference. Keep in mind that if you do the same in a relational database it will usually also repeat some key for each index.

Apache · April 3, 2023, 9:32pm

FDB doesn’t support tables, it just a key-value store (the tuple layer is shipped with the client, but there’s no obligation to use it).

Yes, I am aware of that. I used the tables illustration to depict my mental model of the data.

So if you want to store relational data in FDB you have to come up with a scheme how you want to translate this to keys and values – and there’s many ways of doing this.

The Tuple structure I mentioned in my original post is my understanding, after reading the documentation and Tuple implementation on GitHub, of how to translate the mental model of a relational table into key-value pairs. So I will elaborate on my original example below with the representation of a table entry into key-value pairs. Taking the example of a Student table:

Student{id:uuid, name:string, address:string}

An entry in FDB for this table could be as follows(for illustration purpose I am using descriptive names for the API rather than the actual names below):

S = directory("student")
S.add_key_value(tuple{uuid, "name"}, name_value)
S.add_key_value(tuple{uuid, "address"}, address_value)

As per my understanding, this will create two keys directory_id+uuid+"name" and directory_id+uuid+"address" with the values name_value and address_value respectively. And I think, now to create an index to find what courses the student is taking, additional key-value pairs could be added per course taken:

S.add_key_value(tuple{uuid, "course", course_uuid}, "")

This will create the key directory_id+uuid+"course"+course_uuid with empty value. Then to find all the courses taken by a student with id uuid, all the keys with prefix directory_id+uuid+"course" should be fetched.

Similarly, the table Course could be translated into key-value pair with the index entries for students that have taken that particular course.

This is my understanding of how a mental model of a relational table can be translated to key-value pairs in FDB. Using this understanding, my question was whether there is a way to compress the size of the keys(and hence the overall storage space).

markus.pilman · April 3, 2023, 11:40pm

I think you’re making this a bit more complicated than it needs to be. You can use the tuple value also for values. So your first example could simply be:

S = directory("student")
S.add_key_value(uuid, tuple(name_value, address_value))

Now this means that the value won’t have a schema – so when you read a record you need to deserialize the values in the same order as you serialized them and you need to know the first one is a name and the second is an address.

If you need or want a schema you could use something like protobuf or flatbuffers to serialize the value (or json – though json won’t be very compact).

Generally you don’t want to put more information into your key than you need. With my above scheme, if you wanted some kind of m:n relationship between students and courses, you would do something like this:

C = directory("courses")
C.add_key_value(uuid, tuple(course_title_value, course_description_value))
# only necessary if you need to answer questions like "who is taking this course"
CS = directory("courses_students")
CS.add_key_value(course_uuid, student_uuid)
# only necessary if you need to answer questions like "what courses is this student taking"
SC = directory("students_courses")
SC.add_key_value(student_uuid, course_uuid)

Basically the rules would be:

For a 1:1 relationship you include the value in the tuple (student name ↔ student address)
For a N:1 relationship you include the uuid of the foreign object into the object (for example if you had like major and minor for each student you could put uuids into the student tuple and have a separate directory you would reference)
For M:N relationships you need at least one additional “table” to store the relationship

In a relational database (like some SQL database) the DBMS would do many of these things for you. In FDB you have to explicitly create all indexes etc yourself (unless you use a layer that does some of that work for you).

For any kind of serious application I would highly recommend to either use an existing layer or write something yourself. Just using the tuple layer will probably not scale well. You will need to think about schema evolution etc and with the tuple layer alone you will be quite limited.

Apache · April 4, 2023, 8:15am

Thank you for the detailed explanation!

So now the challenge will be, how to query for the records(which you alluded to in your first reply); and also how to discern what change caused the watch to fire.

markus.pilman · April 4, 2023, 6:34pm

Yes, well you have to do everything yourself. So yes in order to query your data you need to build indexes (which is why a uuid might or might not be the best key value – if the thing you usually use for querying is unique it can just be the index). And you have the choice whether you want to have indirect indices or whether you want to include some information in the keys etc.

data lineage is the same as almost everything else: if you want this feature you need to build it yourself

Apache · April 5, 2023, 9:09am

data lineage is the same as almost everything else: if you want this feature you need to build it yourself

Back to drawing board for me then

Topic		Replies	Views
How to efficiently query data from FDB? Using FoundationDB	3	717	April 19, 2023
Record size limit Using FoundationDB	3	278	February 19, 2023
Considerations for key and value sizes Using FoundationDB	2	1989	November 28, 2018
Duplication of key data? Using FoundationDB	2	441	April 19, 2023
Impact of encoding on range read performance Using FoundationDB performance	5	1075	May 10, 2019

How to reduce data size?

Related topics