Duplication of key data?

mAdkins23 · April 18, 2023, 3:47pm

I’m trying to figure out how much duplication of data is actually in the DB. I’ve been poking around the web but I haven’t stumbled across a page that answers my question.

For this example Go program:

package main

import (
	"log"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/directory"
	"github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
	"github.com/google/uuid"
)

func main() {
	fdb.MustAPIVersion(630)
	db := fdb.MustOpenDefault()

	ds, err := directory.CreateOrOpen(db, []string{"records"}, nil)
	if err != nil {
		log.Fatalf("Unable to create/open directory: %s\n", err)
	}

	field := map[string]string{
		"one":   "alpha",
		"two":   "bravo",
		"three": "charlie",
		"four":  "delta",
		"five":  "echo",
		"six":   "foxtrot",
		"seven": "golf",
	}

	for i := 0; i < 2; i++ {
		id, err := uuid.NewRandom()
		if err != nil {
			log.Fatalf("Unable to get random ID: %s\n", err)
		}

		_, _ = db.Transact(func(tr fdb.Transaction) (interface{}, error) {
			for key, value := range field {
				tr.Set(ds.Pack([]tuple.TupleElement{id.String(), key}), []byte(value))
			}
			return nil, nil
		})
	}
}

I get the following from fdbcli:

fdb> getrange "" "\xFF"

Range limited to 25 keys
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02five\x00' is `echo'
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02four\x00' is `delta'
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02one\x00' is `alpha'
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02seven\x00' is `golf'
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02six\x00' is `foxtrot'
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02three\x00' is `charlie'
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02two\x00' is `bravo'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02five\x00' is `echo'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02four\x00' is `delta'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02one\x00' is `alpha'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02seven\x00' is `golf'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02six\x00' is `foxtrot'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02three\x00' is `charlie'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02two\x00' is `bravo'
`\xfe\x01\x15\x12\x00\x01layer\x00' is `'
`\xfe\x01\xfe\x00\x01hca\x00\x14\x14' is `\x01\x00\x00\x00\x00\x00\x00\x00'
`\xfe\x01\xfe\x00\x01hca\x00\x15\x01\x15\x12' is `'
`\xfe\x01\xfe\x00\x01version\x00' is `\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
`\xfe\x01\xfe\x00\x14\x02records\x00' is `\x15\x12'

I can see a lot of repeated string data here. Is this repeated in the underlying data store?

alloc · April 19, 2023, 8:43am

Is the question essentially “is there any kind of compression to avoid storing duplicate data multiple times?”

The answer to that question is “not currently with most configurations”, though the answer does depend on which storage engine the database is configured with. The most common storage engines (“ssd”, which is based on sqlite, and “memory”, which requires everything to fit in memory but makes data durable to a circular queue) do not have any kind of compression, and so the common data at the beginning of each key will be stored multiple times.

There are a few experimental/beta storage engines that seek to address this shortcoming. Redwood (a custom B+tree implementation) has prefix compression as one of its primary features (see more about that here: Redwood Storage Engine Update - Steve Atherton, Apple - YouTube). There’s also a “memory-radixtree-beta” storage engine built on a trie-type data structure, though I’m actually not sure if that only materializes the radix trie in memory, or if there are also on-disk savings. Finally, there’s an in-the-works RocksDB-based storage engine which would have both prefix compression and full-file compression to try and limit the impact of duplicative key data.

The other technique you can use here is to dictionary compress your key prefixes so that common strings are stored in a more compact way. The directory layer included in each language binding offers a way to map paths of strings to short prefixes, for example. There are a few different reasons one might want to do this, but shortening a long prefix into a shorter prefix for space savings reasons is one of them. In your example, note that the text “records” is not repeated in each key, but instead there’s one key \xfe\x01\xfe\x00\x14\x02records\x00 that maps to the prefix \x15\x12, and then all of the other keys begin with \x15\x12 instead of \x02records\x00. The Record Layer expands upon that, and adds some things like caching and hierarchy management, to support directory layer encoding elements of a longer path while maintaining prefix hierarchy (that is, encoding longer paths so that if one path is a prefix of another path, the on-disk encoding of the former is the prefix of the on-disk encoding of the latter) in the “key space path” abstraction, where one can use a DirectoryLayerDirectory to shorten individual path elements.

mAdkins23 · April 19, 2023, 2:47pm

Thank you for that clear and complete answer. Kind of what I expected.

Topic		Replies	Views
Directory removal leaves stuff in database Using FoundationDB	1	390	April 23, 2023
How to tell a single key exists or not using golang binding Using FoundationDB	2	2024	May 9, 2018
Request for feedback: Full export, Go binding usage, Future plans Using FoundationDB	2	442	March 29, 2021
Using the same keys and values across different languages Using FoundationDB	6	514	January 24, 2021
Looking for feedback on a Go FoundationDB blob store FoundationDB Layers	2	557	March 29, 2023

Duplication of key data?

Related topics