The basic idea is that you need to account for the following factors:
- Overhead per byte
- This depends a bit on the size of your keys and values, but should be somewhat less than 2x for sufficiently large key-value pairs; for small pairs it may be higher. I don’t have an exact formula for you right now, but let’s use 1.7x as an example. You can probably determine a more accurate number for your data empirically, but note that the b-tree tends to be more space-efficient right after inserting data and uses a bit more space as it settles in and undergoes mutations.
- Replication
- 1x for single, 2x for double, 3x for triple
- Over-provisioning
- The documentation mentions that some SSDs perform better when they aren’t completely full. It’s also advisable to leave some extra space to be able to tolerate a machine failing (all of the data from that machine would need to be re-replicated to the other machines). If we wanted to keep the disks less than 2/3 full, for example, then we’d choose 1.5x.
To get the total disk capacity, we multiply these factors together. For a single-replicated cluster, the disk capacity factor would be 1.7 * 1 * 1.5 = 2.55x, or 255 GB for 100 GB of data. For double replication it would be 1.7 * 2 * 1.5 = 5.1x (510 GB), and for triple replication 1.7 * 3 * 1.5 = 7.65x (765 GB).
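If it helps, here’s a minimal sketch of that multiplication in Python. The function name and default values are just for illustration; the 1.7x storage overhead is the example figure from above, so plug in whatever you measure for your own data.

```python
def required_disk_capacity_gb(logical_data_gb: float,
                              storage_overhead: float = 1.7,
                              replication_factor: int = 3,
                              overprovision: float = 1.5) -> float:
    """Estimate the raw disk capacity (GB) needed for a given amount of
    logical key-value data, multiplying the per-byte storage overhead,
    the replication factor, and the over-provisioning factor."""
    return logical_data_gb * storage_overhead * replication_factor * overprovision

# 100 GB of data, disks kept under ~2/3 full:
print(required_disk_capacity_gb(100, replication_factor=1))  # 255.0
print(required_disk_capacity_gb(100, replication_factor=2))  # 510.0
print(required_disk_capacity_gb(100, replication_factor=3))  # 765.0
```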