Data set size of the fdb server process save to disk

Cliicy · February 27, 2020, 4:27am

Hi,
Good day! My question is about how multiple fdb server processes decide how much data to write to disk.
Let's say I plan to write 202 GB of data (about 300,000,000 records, 512 bytes per record) to disk, and the data directory is /opt/data/fdb-6.2/. When using 1 fdb server process, the data under /opt/data/fdb-6.2 is about 202 GB, which is what I expect. Please see the info below:
202 /opt/data/fdb-6.2/data/4500

Then I increased the number of fdb server processes to 8 and loaded the same data (about 300,000,000 records, 512 bytes per record), just as I did with a single fdb server process. This time the final total data set is about 404 GB. Please refer to the info below:
49 /opt/data/fdb-6.2/data/4501
50 /opt/data/fdb-6.2/data/4503
50 /opt/data/fdb-6.2/data/4505
51 /opt/data/fdb-6.2/data/4502
52 /opt/data/fdb-6.2/data/4507
51 /opt/data/fdb-6.2/data/4506
52 /opt/data/fdb-6.2/data/4504
52 /opt/data/fdb-6.2/data/4500
404 /opt/data/fdb-6.2/data
404 /opt/data/fdb-6.2

Could you let me know how multiple fdb server processes decide how many records each one saves to disk?

Thanks,
Luo

alexmiller · February 27, 2020, 4:42am

Did you mean to clear the database between runs? It seems rather suspicious that 404GB = 202GB * 2.

I could see an argument that data distribution competing with the incoming writes, combined with a skewed workload, could have left the raw file sizes on disk unbalanced, as we only slowly do the work required to give space back to the OS, but being a perfect 2x multiplier seems like too much of a coincidence.

Cliicy · February 27, 2020, 7:00am

No, in my test we don't clean up the database between runs.

For my pressure test, the test data size must not exceed the disk capacity, so I want to estimate how many processes to use based on a fixed total number of records. For example, I plan to load about 202 GB of data; when I use 1 fdb server process, the total db size after loading is 202 GB.
But when I use 8 fdb server processes, the total db size after loading is 404 GB. That looks like 202 * 2, but it is not always exactly a 2x multiplier.

Here is another test.
I plan to load about 286 GB of data (using YCSB, 300,000,000 records, 1 KB per record). In the following results, 762 is 20 GB more than 742 (742 = 371 * 2).

Using 1 fdb server process, after loading, the total db size is 286 GB:
371 /opt/data/4500
371 /opt/data
Using 10 fdb server processes, after loading, the total db size is 762 GB:
76 /opt/data/4506
76 /opt/data/4502
77 /opt/data/4509
76 /opt/data/4500
76 /opt/data/4507
76 /opt/data/4503
74 /opt/data/4501
80 /opt/data/4504
77 /opt/data/4508
78 /opt/data/4505
762 /opt/data

So do you have any clue what rule determines how many records each fdb server is assigned when inserting the 300,000,000 records (1 KB each)? Does every fdb server have the same chance to do the inserting? I found that the data size is nearly the same (74 GB to 80 GB) across the storage paths (/opt/data/4500, /opt/data/4501, … /opt/data/4509), and the total db size is about 76.6 * 10 = 766 (76.6 is the average and 10 is the number of fdb server processes). So how does every fdb server know that it only needs to load about 74 GB to 80 GB?
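
The arithmetic in this post can be checked with a small Python sketch. The sizes are copied from the du listings above; the variable names are only illustrative. The per-directory figures are rounded, which is presumably why their sum differs slightly from the 762 that du reports for the parent directory.

```python
# Per-process data directory sizes in GB, copied from the 10-process du listing.
sizes_gb = {
    4500: 76, 4501: 74, 4502: 76, 4503: 76, 4504: 80,
    4505: 78, 4506: 76, 4507: 76, 4508: 77, 4509: 77,
}
# Size on disk reported by du for the single-process run.
single_process_gb = 371

total_gb = sum(sizes_gb.values())                            # sum over all processes
mean_gb = total_gb / len(sizes_gb)                           # average per process
spread_gb = max(sizes_gb.values()) - min(sizes_gb.values())  # largest minus smallest
ratio = total_gb / single_process_gb                         # multi-process vs single-process

print(f"total={total_gb} GB, mean={mean_gb:.1f} GB, "
      f"spread={spread_gb} GB, ratio={ratio:.2f}x")
```

This only compares file sizes on disk. It says nothing about how many of those bytes are live key-value data.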

thanks,

SteavedHams · February 27, 2020, 7:07am

Size on disk is not the same as bytes used; there can be free space in those files, particularly after deletes due to data movement. Look at FDB’s status to find out how many bytes on disk are used.
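
One way to pull those numbers out programmatically is to parse the machine-readable status. The sketch below is a minimal example, not an official tool. It assumes the JSON has `cluster.data.total_kv_size_bytes`, `cluster.data.total_disk_used_bytes`, and a `moving_data` object with `in_flight_bytes` and `in_queue_bytes`; these field names are an assumption and should be checked against the status output of your FDB version. The sample document is hypothetical, not real cluster output.

```python
import json

def summarize_data_usage(status_json_text):
    """Compare logical key-value bytes with bytes used on disk.

    The field names are assumed; verify them against your FDB version.
    """
    status = json.loads(status_json_text)
    data = status.get("cluster", {}).get("data", {})
    kv_bytes = data.get("total_kv_size_bytes")
    disk_bytes = data.get("total_disk_used_bytes")
    moving = data.get("moving_data", {})
    # Assumption: "moving data" is taken as in-flight plus queued bytes.
    moving_bytes = moving.get("in_flight_bytes", 0) + moving.get("in_queue_bytes", 0)
    ratio = disk_bytes / kv_bytes if kv_bytes and disk_bytes else None
    return {
        "kv_bytes": kv_bytes,
        "disk_bytes": disk_bytes,
        "moving_bytes": moving_bytes,
        "disk_to_kv_ratio": ratio,
    }

# Hypothetical sample document with made-up round numbers.
sample = json.dumps({
    "cluster": {
        "data": {
            "total_kv_size_bytes": 100,
            "total_disk_used_bytes": 250,
            "moving_data": {"in_flight_bytes": 10, "in_queue_bytes": 20},
        }
    }
})
summary = summarize_data_usage(sample)
print(summary)
```

In practice the input text would come from something like `fdbcli --exec "status json"`, captured to a string or a file.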

Cliicy · February 27, 2020, 8:40am

Hi, thanks for your explanation,
here is the status of my FDB:

Data:
Replication health - Healthy (Repartitioning.)
Moving data - 241.930 GB
Sum of key-value sizes - 688.661 GB
Disk space used - 1.869 TB

So does “Sum of key-value sizes” mean the total size that all the records actually occupy on disk?
What does “Moving data” mean? I am not in a cluster environment, just on a single workstation.
I am using 12 fdb server processes: one for log, one for stateless, and the others for storage. The fdb server version is 5.2.5.

Thanks,

alexmiller · February 27, 2020, 9:45am

When you’re inserting data, you’re growing the amount of data in one shard, and data distribution will split the shard into two in order to maintain a maximum shard size. “Moving data” is the amount of data that FDB currently thinks it will need to move between servers in order to balance the cluster. This redistribution of data can cause some imbalance in the total disk space used between different processes, as FDB is (intentionally) relatively slow in giving space back to the OS.

“Sum of Key-Value Sizes” is the logical amount of key-value bytes that are stored. This should exactly match the calculation of what you’ve written to the database. What you’re looking at is the actual file size on disk, which includes overhead, and slack space within the file itself.
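
To make the gap concrete, here is a rough Python sketch that parses the three status lines posted above and compares logical bytes with disk usage. It assumes the status output uses decimal units (1 GB = 10^9 bytes), and `parse_status_line` is a hypothetical helper written for this example, not part of any FDB tooling.

```python
import re

# Assumption: status sizes use decimal (SI) units.
UNIT_BYTES = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

def parse_status_line(line):
    """Split a line like 'Disk space used - 1.869 TB' into (label, bytes)."""
    match = re.match(r"\s*(.+?)\s+-\s+([\d.]+)\s*([KMGT]?B)\s*$", line)
    if match is None:
        raise ValueError(f"unrecognized status line: {line!r}")
    label, value, unit = match.groups()
    return label, float(value) * UNIT_BYTES[unit]

# Lines copied from the status output earlier in the thread.
status_lines = [
    "Moving data - 241.930 GB",
    "Sum of key-value sizes - 688.661 GB",
    "Disk space used - 1.869 TB",
]
sizes = dict(parse_status_line(line) for line in status_lines)

kv_bytes = sizes["Sum of key-value sizes"]
disk_bytes = sizes["Disk space used"]
ratio = disk_bytes / kv_bytes                    # disk bytes per logical byte
non_kv_gb = (disk_bytes - kv_bytes) / 10**9      # overhead plus slack, in GB

print(f"disk/logical = {ratio:.2f}x, overhead + slack = {non_kv_gb:.0f} GB")
```

The difference lumps together everything that is not logical key-value bytes (file overhead, slack space inside the files, and anything else counted in disk space used), so it is an upper bound on reclaimable slack rather than a precise figure.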

Cliicy · February 27, 2020, 9:55am

I see what you mean.
Thanks very much to all of you!

Have a nice day!
