Online Indexing on large databases

VibhutiD · October 15, 2020, 6:03am

Hi,
Background: I have a cluster running in triple redundancy mode on SSDs, with 16TB of data (sum of K-V sizes) and 65TB of disk space. It holds upwards of 1.4B records.
Now that the records and their corresponding indexes have been migrated, I want to build another index on this record store. What would be the most performant way of building an index on such a store?

Example for reference of the metadata:

    message foo {
        int32 foo1 = 1;    // Primary Key
        int32 foo2 = 2;    // Already existing index
        int32 foo3 = 3;    // index to be built
    }

alloc · October 16, 2020, 5:32pm

Good question. This is somewhat of a hole in our documentation, so I’ve submitted an issue about making the docs in this area better: https://github.com/FoundationDB/fdb-record-layer/issues/1045

There are a few things you might want to do to more performantly build this index.

1. The OnlineIndexer is deliberately rate limited to avoid overwhelming “foreground” traffic. You can make a single indexer faster by adjusting some of its parameters, such as the “limit”, which controls the number of records it will index in a single transaction, and the “records per second”, which controls its target rate of indexing. (In newer versions of the Record Layer, there are also controls over how many bytes to allow in a single transaction. This is set very high by default and is mainly included as a protection against large transactions (which degrade FDB performance), so it probably shouldn’t be adjusted, but having it in place does make it safer to set a higher limit.) If you are too aggressive with these values, you can hurt the cluster’s performance, so this is something you may need to tune. If you want to start an indexer, see if it helps, and then increase the limit, it is perfectly safe to terminate the indexer and restart it; you won’t lose any progress if you do so.
2. You can try building different parts of the index in parallel. In particular, the OnlineIndexer has a method called splitIndexBuildRange, which looks at the distribution of keys in FDB to produce (roughly) equal ranges of records to index. The steps here are a bit manual, as the feature is still experimental, but if you create an index builder, call splitIndexBuildRange, and then create an additional indexer for each range, you can run those index builds in parallel.
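
To get a feel for how these two knobs interact, here is a rough back-of-envelope sketch in plain Java. It does not call the Record Layer at all. It only does the arithmetic, under the assumptions that every indexer sustains its target rate and that the ranges are equally sized. The class and method names are made up for illustration, and the rate and limit values are hypothetical, not recommendations.

```java
import java.time.Duration;

// Hypothetical helper: plain arithmetic only, no Record Layer calls.
final class IndexBuildEstimate {

    /** Seconds needed if every indexer sustains its target rate over equal-sized ranges. */
    static long estimateSeconds(long totalRecords, int recordsPerSecond, int parallelIndexers) {
        long aggregateRate = (long) recordsPerSecond * parallelIndexers;
        return (totalRecords + aggregateRate - 1) / aggregateRate; // ceiling division
    }

    /** Number of transactions needed when each one indexes at most `limit` records. */
    static long estimateTransactions(long totalRecords, int limit) {
        return (totalRecords + limit - 1) / limit; // ceiling division
    }

    public static void main(String[] args) {
        long totalRecords = 1_400_000_000L; // record count from the question
        int recordsPerSecond = 10_000;      // hypothetical target rate per indexer
        int limit = 100;                    // hypothetical records per transaction

        for (int indexers : new int[] {1, 4, 8}) {
            Duration d = Duration.ofSeconds(
                    estimateSeconds(totalRecords, recordsPerSecond, indexers));
            System.out.printf("%d indexer(s): about %d h %d min%n",
                    indexers, d.toHours(), d.toMinutes() % 60);
        }
        System.out.println("transactions: " + estimateTransactions(totalRecords, limit));
    }
}
```

In practice the real numbers will be worse than this estimate, since the cluster also has to serve foreground traffic and the ranges are only roughly equal, but it shows why raising the rate on one indexer and splitting the work across several both matter at this scale.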

Another tip: you probably want to call setProgressLogIntervalMillis on the indexer when you create it, which will cause it to write a log message at the specified interval with its progress. That way, you can monitor how the index build is going. If you call setTrackProgress, then the number of records indexed also gets written to the database so you can then see how many records have been indexed in total by looking at the index state.
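
Once you have a progress figure (from the log messages or from the tracked count), you can extrapolate a rough time remaining. The following is a minimal sketch in plain Java, assuming you have copied the "records indexed so far" number out by hand. The class name and the sample numbers are hypothetical, and it assumes at least one record has been indexed and the rate stays steady.

```java
import java.time.Duration;

// Hypothetical helper: extrapolates from an observed progress figure.
final class IndexBuildProgress {

    /** Observed indexing rate in records per second. */
    static double observedRate(long recordsIndexed, Duration elapsed) {
        return recordsIndexed / (elapsed.toMillis() / 1000.0);
    }

    /** Remaining time, extrapolating the observed rate to the records not yet indexed. */
    static Duration remaining(long recordsIndexed, long totalRecords, Duration elapsed) {
        double rate = observedRate(recordsIndexed, elapsed);
        long seconds = (long) Math.ceil((totalRecords - recordsIndexed) / rate);
        return Duration.ofSeconds(seconds);
    }

    public static void main(String[] args) {
        long total = 1_400_000_000L;            // record count from the question
        long indexed = 36_000_000L;             // hypothetical progress after one hour
        Duration elapsed = Duration.ofHours(1);

        Duration left = remaining(indexed, total, elapsed);
        System.out.printf("observed rate: %.0f records/s, about %d h %d min remaining%n",
                observedRate(indexed, elapsed), left.toHours(), left.toMinutes() % 60);
    }
}
```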

See the docs on OnlineIndexer.Builder for the complete list of options: https://javadoc.io/static/org.foundationdb/fdb-record-layer-core-pb3/2.10.141.0/com/apple/foundationdb/record/provider/foundationdb/OnlineIndexer.Builder.html
And on splitIndexBuildRange: https://javadoc.io/static/org.foundationdb/fdb-record-layer-core-pb3/2.10.141.0/com/apple/foundationdb/record/provider/foundationdb/OnlineIndexer.html#splitIndexBuildRange(int,int)
