Really excited to see this bulkload feature released! I’m currently testing it on a cluster with 4 SSD nodes and 8 processes (1 stateless + 1 transaction + 6 storage).
fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - single
  Storage engine         - ssd-rocksdb-v1
  Log engine             - ssd-2
  Encryption at-rest     - disabled
  Coordinators           - 1
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 8
  Zones                  - 1
  Machines               - 1
  Memory availability    - 30.9 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 04/18/25 11:40:31

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 24.493 GB
  Disk space used        - 8.888 GB

Operating space:
  Storage server         - 502.9 GB free on most full server
  Log server             - 505.9 GB free on most full server

Workload:
  Read rate              - 15 Hz
  Write rate             - 0 Hz
  Transactions started   - 6 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 04/18/25 11:40:31
I bulkdumped the existing data from FDB to local storage (approximately 15 GB), then reloaded it back into FDB using bulkload. The bulkload process took about 11 minutes.
Job fe0bcec23eec9d38955ebf36c07791ed submitted at 1744946248.635041 for range { begin= end=\xff }. The job has 858 tasks. The job ran for 11.368590 mins and exited with status Complete.
I’m wondering whether this speed meets expectations. Also, are there any performance benchmarks available for this feature? Any other configuration suggestions? Thanks!
Thanks for your interest in BulkLoad/Dump. How many SSes are you using? What values did you set for the knobs DD_BULKDUMP_PARALLELISM, DD_BULKLOAD_PARALLELISM, and MANIFEST_COUNT_MAX_PER_BULKLOAD_TASK in this run?
Currently, you can tune these three knobs to maximize bulkload/dump throughput for your cluster setting.
DD_BULKDUMP_PARALLELISM is the maximum number of parallel bulkdump tasks at any time, and DD_BULKLOAD_PARALLELISM is the maximum number of parallel bulkload tasks at any time. You may want to increase these two knobs to fully leverage your cluster’s parallelism (i.e., the number of SSes). Currently, the load-balancing policy of bulk loading is not optimal, so I recommend setting the parallelism knobs larger than the number of SSes.
In the bulkload mechanism, DD dispatches tasks to SSes in batches, and MANIFEST_COUNT_MAX_PER_BULKLOAD_TASK is the batch size. If you see that DD is very busy but the SSes are not, consider increasing this knob.
In general, to maximize bulkload throughput, we want to increase the parallelism as much as possible so that all SSes are loading data at any given time, while keeping DD from getting too busy so that this centralized role does not become the bottleneck of the mechanism.
In my test, the SS count was 100 in a single DC with SQLite. I increased the two parallelism knobs to 1000 and set MANIFEST_COUNT_MAX_PER_BULKLOAD_TASK to 10.
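For reference, assuming the usual knob_<name> = <value> convention under the [fdbserver] section of foundationdb.conf, the settings from my test would look roughly like this (treat the values as a starting point for tuning, not a universal recommendation):

[fdbserver]
knob_dd_bulkdump_parallelism = 1000
knob_dd_bulkload_parallelism = 1000
knob_manifest_count_max_per_bulkload_task = 10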
Thank you for your guidance! Following your suggestions, I expanded my cluster to 42 storage servers (SSes) and set the three configuration knobs discussed above.
The bulkload time for the original 15 GB dataset has been reduced to 5 minutes. I then generated a 150 GB dataset, which completed bulkloading in 10 minutes. This suggests that larger datasets may better utilize disk I/O performance.
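(For reference, that works out to roughly 15 GB / 5 min ≈ 50 MB/s versus 150 GB / 10 min ≈ 250 MB/s of logical key-value bytes, so per-second throughput improved about 5x with the larger dataset.)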
Also, I’m considering converting raw data files (like CSV or TXT) directly into a bulkload-compatible format, bypassing FDB’s bulkdump step. This would let me use bulkload directly for faster data initialization. Would this approach be viable? I’d appreciate any advice you might have. Thank you!
Yes! We can use the RocksDB SST file writer to create the SST files. The key is to build a directory on S3 with an organization compatible with the bulkload mechanism.
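For example, a minimal sketch of writing one sorted chunk with rocksdb::SstFileWriter might look like this (the function name and the key-value source are placeholders, and error handling is reduced to asserts):

#include <cassert>
#include <string>
#include <utility>
#include <vector>
#include <rocksdb/options.h>
#include <rocksdb/sst_file_writer.h>

// Write one sorted chunk of key-value pairs into a single SST data file.
// Keys must be added in strictly increasing order, or Put() returns an error.
void writeChunkToSst(const std::vector<std::pair<std::string, std::string>>& sortedKVs,
                     const std::string& outPath) { // e.g. a <version>-data.sst path
    rocksdb::Options options;
    rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
    rocksdb::Status s = writer.Open(outPath);
    assert(s.ok());
    for (const auto& kv : sortedKVs) {
        s = writer.Put(kv.first, kv.second);
        assert(s.ok());
    }
    s = writer.Finish(); // seals the file and writes the SST footer
    assert(s.ok());
}

Each such file would become the data file of one child folder in the layout described below.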
In general, you can check the definitions in foundationdb/fdbclient/include/fdbclient/BulkLoading.h to get an overview of the organization of the dataset folder.
The dataset folder name is a UID::toString(). In the folder, there is a single global “job-manifest.txt” file. This file records the file path of the dataset files for each subrange. Note that the subranges should cover the entire key space; I will explain the empty sub-range case a bit later. For the file format, please check the comments in BulkLoadJobManifestFileHeader::toString() and BulkLoadJobFileManifestEntry::toString().
The folder can contain multiple sub-folders. Each sub-folder name is a UID::toString(). Inside, the child folders are named by index, i.e., 0, 1, 2, 3, … Each child folder contains a std::to_string(Version)-manifest.txt; for the manifest content format, please check BulkLoadManifest::toString(). In addition to the manifest file, there is a data file named std::to_string(Version)-data.sst. You can fake a version for the file name, but note that the Version value must be the same within a given child folder. The byte sample file is optional.
Note that if the sub-range is empty, you still need the manifest file, but the data file is omitted. In the manifest file, the keyCount and the bytes should be 0 for the empty subrange.
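To put this together, the layout would look roughly like the following (the UIDs, indices, and versions in angle brackets are made-up placeholders):

<jobUID>/                      <- dataset folder, named by UID::toString()
  job-manifest.txt             <- global manifest covering the whole key space
  <batchUID>/                  <- sub-folder, named by UID::toString()
    0/
      <version>-manifest.txt
      <version>-data.sst
    1/                         <- empty sub-range: manifest only, no data file
      <version>-manifest.txt
    ...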
When you load this dataset, you can simply pass the faked global UID as the bulkload input.
Please let me know if you have more questions. Thanks!
If the raw data is unsorted, I need to sort all of it first (since SST files can’t have overlapping key ranges).
Then I would split the sorted data into chunks and create the SST/manifest files for bulk loading.
My biggest hurdle right now is efficiently performing global sorting on extremely large files.
Also, I appreciate your trust! I’ll share my progress as I work on this. Once the code is stable, I’ll submit a PR. Thanks!
Off the top of my head, for huge files you can do a parallel merge sort.
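A very rough sketch of that idea in C++ (sort fixed-size chunks in parallel, then do a k-way merge of the sorted runs with a min-heap; the one-record-per-line format and the run-file naming are assumptions for illustration):

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <future>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Sort one chunk of records and write it to a temporary "run" file.
static void sortAndSpill(std::vector<std::string> chunk, const std::string& runPath) {
    std::sort(chunk.begin(), chunk.end());
    std::ofstream out(runPath);
    for (const auto& line : chunk) out << line << '\n';
}

// Phase 1: split the input into chunks, sort them in parallel, spill to disk.
// Phase 2: k-way merge the sorted runs with a min-heap into one sorted output.
// In practice you would bound the number of concurrent sort jobs and size the
// chunks based on available memory.
void externalMergeSort(const std::string& inPath, const std::string& outPath,
                       size_t linesPerChunk = 1'000'000) {
    std::ifstream in(inPath);
    std::vector<std::string> runPaths;
    std::vector<std::future<void>> jobs;
    std::vector<std::string> chunk;
    std::string line;
    while (std::getline(in, line)) {
        chunk.push_back(line);
        if (chunk.size() == linesPerChunk) {
            runPaths.push_back(outPath + ".run" + std::to_string(runPaths.size()));
            jobs.push_back(std::async(std::launch::async, sortAndSpill,
                                      std::move(chunk), runPaths.back()));
            chunk.clear();
        }
    }
    if (!chunk.empty()) {
        runPaths.push_back(outPath + ".run" + std::to_string(runPaths.size()));
        jobs.push_back(std::async(std::launch::async, sortAndSpill,
                                  std::move(chunk), runPaths.back()));
    }
    for (auto& j : jobs) j.get();

    // Min-heap of (current line, run index); pop the smallest, refill from that run.
    std::vector<std::ifstream> runs;
    for (const auto& p : runPaths) runs.emplace_back(p);
    using Entry = std::pair<std::string, size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (size_t i = 0; i < runs.size(); ++i)
        if (std::getline(runs[i], line)) heap.emplace(line, i);
    std::ofstream out(outPath);
    while (!heap.empty()) {
        auto [smallest, idx] = heap.top();
        heap.pop();
        out << smallest << '\n';
        if (std::getline(runs[idx], line)) heap.emplace(line, idx);
    }
}

The sorted output (or the individual sorted runs, if you choose chunk boundaries that match your target subranges) can then be fed to the SST writer shown earlier.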
Note that when creating SST files, you may want to keep each SST file relatively small. The bulkload creates FDB shards to load data, so the resulting shard size equals the size of a single SST file multiplied by knob_manifest_count_max_per_bulkload_task. In my test, each SST file was around 15 MB. Since my knob_manifest_count_max_per_bulkload_task was 10, each shard was about 150 MB after the bulkload, which does not trigger any extra shard-boundary-change data movement after the bulkload, given my default FDB target shard size. A nice SST file size gives us the flexibility to tune knob_manifest_count_max_per_bulkload_task and avoid those extra data movements.