Duplication of key data?

I’m trying to figure out how much duplication of data is actually in the DB. I’ve been poking around the web but I haven’t stumbled across a page that answers my question.

For this example Go program:

package main

import (
	"log"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
	"github.com/apple/foundationdb/bindings/go/src/fdb/directory"
	"github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
	"github.com/google/uuid"
)

func main() {
	fdb.MustAPIVersion(630)
	db := fdb.MustOpenDefault()

	ds, err := directory.CreateOrOpen(db, []string{"records"}, nil)
	if err != nil {
		log.Fatalf("Unable to create/open directory: %s\n", err)
	}

	field := map[string]string{
		"one":   "alpha",
		"two":   "bravo",
		"three": "charlie",
		"four":  "delta",
		"five":  "echo",
		"six":   "foxtrot",
		"seven": "golf",
	}

	for i := 0; i < 2; i++ {
		id, err := uuid.NewRandom()
		if err != nil {
			log.Fatalf("Unable to get random ID: %s\n", err)
		}

		_, _ = db.Transact(func(tr fdb.Transaction) (interface{}, error) {
			for key, value := range field {
				tr.Set(ds.Pack([]tuple.TupleElement{id.String(), key}), []byte(value))
			}
			return nil, nil
		})
	}
}

I get the following from fdbcli:

fdb> getrange "" "\xFF"

Range limited to 25 keys
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02five\x00' is `echo'
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02four\x00' is `delta'
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02one\x00' is `alpha'
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02seven\x00' is `golf'
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02six\x00' is `foxtrot'
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02three\x00' is `charlie'
`\x15\x12\x02052f6dad-1076-4696-a246-3d7524245dd6\x00\x02two\x00' is `bravo'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02five\x00' is `echo'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02four\x00' is `delta'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02one\x00' is `alpha'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02seven\x00' is `golf'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02six\x00' is `foxtrot'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02three\x00' is `charlie'
`\x15\x12\x02c723601e-ec4c-4e43-a536-6b2e72cfb504\x00\x02two\x00' is `bravo'
`\xfe\x01\x15\x12\x00\x01layer\x00' is `'
`\xfe\x01\xfe\x00\x01hca\x00\x14\x14' is `\x01\x00\x00\x00\x00\x00\x00\x00'
`\xfe\x01\xfe\x00\x01hca\x00\x15\x01\x15\x12' is `'
`\xfe\x01\xfe\x00\x01version\x00' is `\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
`\xfe\x01\xfe\x00\x14\x02records\x00' is `\x15\x12'

I can see a lot of repeated string data here. Is this repeated in the underlying data store?

Is the question essentially “is there any kind of compression to avoid storing duplicate data multiple times?”

The answer to that question is “not currently with most configurations”, though the answer does depend on which storage engine the database is configured with. The most common storage engines (“ssd”, which is based on sqlite, and “memory”, which requires everything to fit in memory but makes data durable to a circular queue) do not have any kind of compression, and so the common data at the beginning of each key will be stored multiple times.

There are a few experimental/beta storage engines that seek to address this shortcoming. Redwood (a custom B+tree implementation) has prefix compression as one of its primary features (see more about that here: Redwood Storage Engine Update - Steve Atherton, Apple - YouTube). There’s also a “memory-radixtree-beta” storage engine built on a trie-type data structure, though I’m actually not sure if that only materializes the radix trie in memory, or if there are also on-disk savings. Finally, there’s an in-the-works RocksDB-based storage engine which would have both prefix compression and full-file compression to try and limit the impact of duplicative key data.

The other technique you can use here is to dictionary compress your key prefixes so that common strings are stored in a more compact way. The directory layer included in each language binding offers a way to map paths of strings to short prefixes, for example. There are a few different reasons one might want to do this, but shortening a long prefix into a shorter prefix for space savings reasons is one of them. In your example, note that the text “records” is not repeated in each key, but instead there’s one key \xfe\x01\xfe\x00\x14\x02records\x00 that maps to the prefix \x15\x12, and then all of the other keys begin with \x15\x12 instead of \x02records\x00. The Record Layer expands upon that, and adds some things like caching and hierarchy management, to support directory layer encoding elements of a longer path while maintaining prefix hierarchy (that is, encoding longer paths so that if one path is a prefix of another path, the on-disk encoding of the former is the prefix of the on-disk encoding of the latter) in the “key space path” abstraction, where one can use a DirectoryLayerDirectory to shorten individual path elements.

Thank you for that clear and complete answer. Kind of what I expected.