I’d like to start relatively small, around 100GB of data, and depending on the requirements, single or double redundancy mode using SSD storage.
I don’t understand how the math in the storage space requirements works. I would like to know how much SSD space is required per machine (in the worst case) if I need to store 100GB of data.
The basic idea is you need to account for the following things:
Overhead per byte
This depends a bit on the size of your keys and values, but should be somewhat less than 2x for sufficiently sized key-value pairs. For small pairs, it may be higher. I don’t have an exact formula for you right now, but let’s use 1.7x as an example. You can probably determine a more accurate number for your data empirically, but note that the b-tree tends to be more efficient right after inserting data and uses a bit more space as it settles in and undergoes mutations.
Replication
1x for single, 2x for double, 3x for triple
Over-provisioning
The documentation mentions that some SSDs benefit performance-wise from not being completely full. It’s also advisable to leave some extra space around to be able to tolerate a machine failing (all of the data from that machine would be replicated to other machines). If we wanted to keep the disks less than 2/3 full, for example, then we’d choose 1.5x.
To get the total amount of disk capacity we’d need, we multiply the different overheads together. So for a single-replicated cluster, we’d want our disk capacity factor to be 1.7 * 1.5 = 2.55x, or 255GB for 100GB of data. For double replication, that would be 1.7 * 2 * 1.5 = 5.1x (510GB), and for triple you’d have 1.7 * 3 * 1.5 = 7.65x (765GB).
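As a small illustrative sketch (the 1.7x overhead and the 1.5x over-provisioning factor are just the assumed figures from above, not exact numbers), the combined factor can be computed like this:

```python
# Rough capacity estimate: multiply the per-byte overhead, the replication
# factor, and the over-provisioning factor. OVERHEAD and OVERPROVISION are
# the assumed values from this thread, not measured numbers.
DATA_GB = 100
OVERHEAD = 1.7        # assumed storage overhead per byte of key-value data
OVERPROVISION = 1.5   # keep disks less than 2/3 full

for mode, replication in [("single", 1), ("double", 2), ("triple", 3)]:
    factor = OVERHEAD * replication * OVERPROVISION
    print(f"{mode}: {factor:.2f}x -> {DATA_GB * factor:.0f}GB of raw disk")
```

This prints 2.55x (255GB), 5.10x (510GB), and 7.65x (765GB) for the three modes, matching the arithmetic above.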
I understand that 3 machines with 100GB doesn’t end up being 300GB available in the whole cluster.
Let’s reverse the computation. If I need 100GB with double replication, how many machines do I need to be able to survive the failure of one machine (edit: survive in the sense of still being able to read all the data and write to the database), and how many GB per machine do I need?
I think for that setup you would want three machines with 200 GB each. Double replication requires two machines to be up for the database to be available, so you will need three machines to be able to sustain the loss of one machine without losing availability. When you are down a machine, the two remaining machines will be responsible for holding the entire data set, which per A.J.'s estimates above would be about 340 GB total, or 170 GB per machine. Having 200 GB on each machine will allow you to handle that failure mode while still having about 15% of your disk free, which should be a reasonable amount of free space to keep things healthy during failure scenarios. The 1.7x overhead isn’t precise, and the amount of free space you’re willing to accept on your disk is something that you may need to experiment with in your setup.
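To make that reverse computation concrete, here’s a small sketch under the same assumptions (1.7x overhead, double replication, three machines, and a target of roughly 15% free space on the surviving machines after a failure):

```python
# Reverse computation: how much disk does each machine need so that the
# remaining machines can still hold everything after losing one machine?
# All constants are the assumed figures from this thread.
DATA_GB = 100
OVERHEAD = 1.7       # assumed per-byte overhead
REPLICATION = 2      # double replication
MACHINES = 3         # minimum to survive one failure with double replication
TARGET_FREE = 0.15   # aim to keep ~15% of each disk free after a failure

on_disk_total = DATA_GB * OVERHEAD * REPLICATION               # ~340GB across the cluster
per_machine_after_failure = on_disk_total / (MACHINES - 1)     # ~170GB on each survivor
per_machine_capacity = per_machine_after_failure / (1 - TARGET_FREE)  # ~200GB

print(f"total on disk: {on_disk_total:.0f}GB")
print(f"per surviving machine: {per_machine_after_failure:.0f}GB")
print(f"suggested capacity per machine: {per_machine_capacity:.0f}GB")
```

This reproduces the 340GB total, 170GB per surviving machine, and roughly 200GB of capacity per machine described above; adjust the overhead and free-space targets once you’ve measured your own data.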