In our testing, we found that data is not evenly distributed across the disks. Some disks become excessively utilized, which puts the cluster into an abnormal state. I want to know whether adding more storage would make data writes distribute more evenly across the disks, or whether there are other ways to avoid this issue.
Our deployment consists of a single Availability Zone (AZ) with 10 nodes. Each node has 10 processes, including 6 storage, 3 stateless, and 1 log process. Below is a screenshot of the monitoring.
What storage engine are you using and how are you calculating disk utilization? Does each storage process have its own disk?
We are using the SSD-2 storage engine and calculating disk utilization using fdb-exporter. Each group of three storage processes shares one disk.
I’m not familiar with that tool so I don’t know what fields it is actually using for this.
The metric that FDB should balance across StorageServers is the total logical KV bytes each one holds replicas for. This is reported in two places:
- Status JSON, as `stored_bytes` for a `role=storage` role in a process
- Trace log, in the `Type=StorageMetrics` trace events, as `BytesStored` for each StorageServer
Check if this metric is balanced across your Storage Servers.
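One way to run that check is to pull `stored_bytes` out of the status JSON and compare the values. Below is a minimal Python sketch, not an official tool. It assumes the status document nests processes under `cluster.processes`, with each process carrying a `roles` list and an `address`; adjust the paths if your output differs. The function names and the sample data are made up for illustration.

```python
from statistics import mean


def storage_stored_bytes(status):
    """Collect (process address, stored_bytes) for every storage role.

    Assumes the layout status["cluster"]["processes"][<id>]["roles"],
    which is an assumption about the status JSON schema.
    """
    entries = []
    processes = status.get("cluster", {}).get("processes", {})
    for proc_id, proc in processes.items():
        for role in proc.get("roles", []):
            if role.get("role") == "storage" and "stored_bytes" in role:
                # Fall back to the process id if no address is present.
                entries.append((proc.get("address", proc_id), role["stored_bytes"]))
    return sorted(entries)


def balance_summary(entries):
    """Summarize how evenly stored_bytes is spread across storage roles."""
    values = [stored for _, stored in entries]
    if not values:
        return None
    avg = mean(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": avg,
        # A ratio close to 1.0 means the largest server is near the average.
        "max_over_mean": max(values) / avg if avg else float("nan"),
    }


# Made-up sample in the assumed shape; real values come from your cluster.
SAMPLE_STATUS = {
    "cluster": {
        "processes": {
            "p1": {"address": "10.0.0.1:4500",
                   "roles": [{"role": "storage", "stored_bytes": 100}]},
            "p2": {"address": "10.0.0.2:4500",
                   "roles": [{"role": "storage", "stored_bytes": 300}]},
            "p3": {"address": "10.0.0.3:4500",
                   "roles": [{"role": "log"}]},
        }
    }
}

summary = balance_summary(storage_stored_bytes(SAMPLE_STATUS))
print(summary)
```

To try this on a real cluster, you could save the output of `fdbcli --exec "status json"` to a file, load it with `json.load`, and pass the result to `storage_stored_bytes` in place of `SAMPLE_STATUS`.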
If you look at disk utilization or file sizes instead, there are several reasons they will not match this metric. A storage server that holds X bytes of logical data will use more than X bytes of disk space because of overhead, internal fragmentation, and internal reusable free space.