On-disk data size with multiple fdb server processes

Hi,
Good day! My question is about how multiple fdb server processes decide how much data each one writes to disk.
Let's say I plan to write 202 GB of data (about 300,000,000 records, 512 bytes per record) to disk, and the data directory is /opt/data/fdb-6.2/. When using 1 fdb server process, the data size under /opt/data/fdb-6.2 is about 202 GB, which is what I expect. Please see the info below:
202 /opt/data/fdb-6.2/data/4500

Then I increased the number of fdb server processes to 8 and loaded the same data (about 300,000,000 records, 512 bytes per record), just as I did with a single fdb server process. This time the final total data size is about 404 GB. Please refer to the info below:
49 /opt/data/fdb-6.2/data/4501
50 /opt/data/fdb-6.2/data/4503
50 /opt/data/fdb-6.2/data/4505
51 /opt/data/fdb-6.2/data/4502
52 /opt/data/fdb-6.2/data/4507
51 /opt/data/fdb-6.2/data/4506
52 /opt/data/fdb-6.2/data/4504
52 /opt/data/fdb-6.2/data/4500
404 /opt/data/fdb-6.2/data
404 /opt/data/fdb-6.2

Could you let me know how multiple fdb server processes decide how many records each one saves to disk?

Thanks,
Luo

Did you mean to clear the database between runs? It seems rather suspicious that 404GB = 202GB * 2.

I could see an argument that data distribution competing with the incoming writes, combined with a skewed workload, could have left the raw file sizes on disk unbalanced, as we only slowly do the work required to give space back to the OS. But a perfect 2x multiplier seems like too much of a coincidence.

No, in my test we don’t clean up the database between runs.

For my pressure test, the test data size should not exceed the disk capacity, so I want to estimate how many processes I can use for a fixed total number of records. For example: I plan to load about 202 GB of data; with 1 fdb server process, the total db size after loading is 202 GB.
But with 8 fdb server processes, the total db size after loading is 404 GB. That looks like 202 * 2, but it is not always exactly a 2x multiplier.

Here is another test.
I plan to load about 286 GB of data (using YCSB, 300,000,000 records, 1k per record). In the results below, 762 is 20 GB more than 742 (742 = 371 * 2).

  1. Using 1 fdb server process: after loading 286 GB, the total size on disk is 371 GB:
    371 /opt/data/4500
    371 /opt/data

  2. Using 10 fdb server processes: after loading, the total db size is 762 GB:
    76 /opt/data/4506
    76 /opt/data/4502
    77 /opt/data/4509
    76 /opt/data/4500
    76 /opt/data/4507
    76 /opt/data/4503
    74 /opt/data/4501
    80 /opt/data/4504
    77 /opt/data/4508
    78 /opt/data/4505
    762 /opt/data

So do you have any clue what the rule is for how many records each fdb server is assigned when inserting those 300,000,000 records (1 KB each)? Does every fdb server have the same chance to do the inserting? I found that the data size is nearly the same (74 GB to 80 GB) across the storage paths (/opt/data/4500, /opt/data/4501 … /opt/data/4509). The total db size is about 76.6 * 10 = 766 GB (76.6 is the average and 10 is the number of fdb server processes). So how does each fdb server know that it only needs to load about 74 GB to 80 GB?
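
For anyone who wants to recheck the arithmetic in this post, here is a minimal Python sketch. It uses only the du figures quoted above; the variable names are made up for illustration, and the last calculation assumes that "1k" means 1024 bytes of value per record (keys ignored) and that the 286 figure is in GiB.

```python
# Per-process directory sizes (GB) copied from the 10-process du listing above.
per_process_gb = [76, 76, 77, 76, 76, 76, 74, 80, 77, 78]
# Total reported by du for the 1-process run.
single_process_gb = 371

total_gb = sum(per_process_gb)
mean_gb = total_gb / len(per_process_gb)
ratio = total_gb / single_process_gb

# Assumption: 300,000,000 records of 1024 bytes each, expressed in GiB.
logical_gib = 300_000_000 * 1024 / 2**30

print(f"sum of per-process sizes: {total_gb} GB")
print(f"mean per process:         {mean_gb:.1f} GB")
print(f"ratio vs. single process: {ratio:.2f}x")
print(f"expected logical size:    {logical_gib:.1f} GiB")
```

The per-directory sum (766) is slightly larger than the du total for the parent directory (762), presumably because each line is rounded separately.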

thanks,

Size on disk is not the same as bytes used: there can be free space in those files, particularly after deletes due to data movement. Look at FDB’s status to find out how many bytes on disk are used.
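
If you would rather read these numbers programmatically, a rough sketch follows. It assumes the machine-readable status (as printed by `status json` in fdbcli) exposes `total_kv_size_bytes`, `total_disk_used_bytes` and `moving_data` under `cluster.data`; please check those field names against the version you are running. The helper name `summarize_data_usage` and the sample values are made up for illustration.

```python
import json


def summarize_data_usage(status: dict) -> dict:
    """Pull logical size, disk usage and pending data movement out of a status document."""
    data = status["cluster"]["data"]
    moving = data.get("moving_data", {})
    kv_bytes = data["total_kv_size_bytes"]
    disk_bytes = data["total_disk_used_bytes"]
    return {
        "kv_bytes": kv_bytes,
        "disk_bytes": disk_bytes,
        # Data queued for movement plus data currently being moved.
        "moving_bytes": moving.get("in_flight_bytes", 0) + moving.get("in_queue_bytes", 0),
        "disk_to_kv_ratio": disk_bytes / kv_bytes if kv_bytes else None,
    }


# Made-up sample document, NOT real cluster output.
sample = json.loads("""
{"cluster": {"data": {
    "total_kv_size_bytes": 100,
    "total_disk_used_bytes": 250,
    "moving_data": {"in_flight_bytes": 10, "in_queue_bytes": 20}
}}}
""")

summary = summarize_data_usage(sample)
print(summary)
```

To use it on a real database, you could capture the output of `fdbcli --exec "status json"`, pass it through `json.loads`, and hand the result to the helper.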

Hi, thanks for your explanation.
Here is the status of my FDB:

Data:
  Replication health     - Healthy (Repartitioning.)
  Moving data            - 241.930 GB
  Sum of key-value sizes - 688.661 GB
  Disk space used        - 1.869 TB

So does “Sum of key-value sizes” mean the total size that all the records actually occupy on disk?
What does “Moving data” mean? I am not running a cluster, just a single workstation.
I am using 12 fdb server processes: one for log, one for stateless, and the rest for storage. The fdb server version is 5.2.5.

Thanks,

When you’re inserting data, you’re growing the amount of data in one shard, and data distribution will split the shard into two in order to maintain a maximum shard size. “Moving data” is the amount of data that FDB currently thinks it will need to move between servers in order to balance the cluster. This redistribution of data can cause some imbalance in the total disk space used between different processes, as FDB is (intentionally) relatively slow in giving space back to the OS.
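
As a rough illustration of the splitting behaviour described above, here is a toy Python model. It is not how FDB implements data distribution: the function name, the byte limit and the "always write to the last shard" workload are all made up to show the idea of a shard growing and being split in two.

```python
def grow_and_split(shards, index, nbytes, max_shard_bytes):
    """Add nbytes to one shard; split it into two halves if it exceeds the limit."""
    shards[index] += nbytes
    if shards[index] > max_shard_bytes:
        left = shards[index] // 2
        right = shards[index] - left
        # Replace the oversized shard with its two halves.
        shards[index:index + 1] = [left, right]
    return shards


MAX_SHARD_BYTES = 500  # made-up limit for the toy model
shards = [0]           # start with a single empty shard

# Sequential inserts: every write lands in the last shard.
for _ in range(20):
    grow_and_split(shards, len(shards) - 1, 100, MAX_SHARD_BYTES)

print(shards)
print("total bytes:", sum(shards), "in", len(shards), "shards")
```

In the real system the resulting shards are then candidates for being moved to other servers, which is where the "Moving data" figure comes from.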

“Sum of Key-Value Sizes” is the logical amount of key-value bytes that are stored. This should exactly match the calculation of what you’ve written to the database. What you’re looking at is the actual file size on disk, which includes overhead, and slack space within the file itself.
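
To put numbers on the difference, here is a small Python sketch using the status output quoted earlier in the thread. It assumes the status figures use decimal units (1 TB = 1000 GB); the variable names are only for illustration.

```python
# Figures copied from the status output quoted earlier in the thread.
sum_kv_gb = 688.661          # "Sum of key-value sizes"
disk_used_gb = 1.869 * 1000  # "Disk space used", assuming 1 TB = 1000 GB

# How much larger the files on disk are than the logical key-value bytes.
overhead_ratio = disk_used_gb / sum_kv_gb
# Space in the files that is not logical key-value bytes (overhead and slack).
non_kv_gb = disk_used_gb - sum_kv_gb

print(f"disk used / key-value bytes: {overhead_ratio:.2f}x")
print(f"space not holding key-value bytes: {non_kv_gb:.3f} GB")
```

So for capacity planning, it is the "Disk space used" style of number (or du) that has to fit on the disk, while the key-value sum is what to compare against the amount loaded.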

I got what you mean.
Thanks very much to all of you!

Have a nice day!