We added 15 more transaction logs hoping to see the disk busyness on the transaction logs go down, but it did not help. Each of the old and new transaction logs still shows roughly the same disk busyness as before.
I don’t see any increase in traffic that can explain the behavior.
I noticed that Section 2.4.3 of the FDB paper says:
“After a Proxy decides to commit a transaction, the log message is broadcast to all LogServers. As illustrated in Figure 2, the Proxy first consults its in-memory shard map to determine the StorageServers responsible for the modified key range. Then the Proxy attaches StorageServer tags 1, 4, and 6 to the mutation, where each tag has a preferred LogServer for storage. In this example, tags 1 and 6 have the same preferred LogServer. Note the mutation is only sent to the preferred LogServers (1 and 4) and an additional LogServer 3 to meet the replication requirements. All other LogServers receive an empty message body. The log message header includes both LSN and the previous LSN obtained from the Sequencer, as well as the known committed version (KCV) of this Proxy.”
To my understanding, besides the 3 LogServers that persist the mutation, all the other LogServers also receive an empty message body. I suspect these LogServers still need to persist the message header’s details to disk. Is my understanding correct? If so, would it explain the behavior above? In our system each write transaction is small, storing only a few small key-value pairs per transaction in FDB.
Also, if that is the case, does it mean we cannot scale write throughput any further once a transaction log server hits its disk bottleneck?
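For what it’s worth, here is how I read that passage, as a rough sketch in Python (this is just my mental model of the routing, not actual FDB code; the names and the replication constant are mine):

```python
# Rough model of Section 2.4.3 as I understand it -- NOT actual FDB code.
# Illustrates why every LogServer receives a message for every commit,
# even though only a few of them get the mutation body.

REPLICATION = 3  # assumed number of LogServers that must store the body

def route_commit(mutation, tags, preferred_log_for_tag, all_log_servers):
    """Return {log_server: message} for one committed mutation."""
    # LogServers that must store the body: the preferred ones for the tags...
    body_recipients = {preferred_log_for_tag[t] for t in tags}
    # ...plus additional LogServers until the replication requirement is met.
    for ls in all_log_servers:
        if len(body_recipients) >= REPLICATION:
            break
        body_recipients.add(ls)

    messages = {}
    for ls in all_log_servers:
        header = {"lsn": mutation["lsn"],
                  "prev_lsn": mutation["prev_lsn"],
                  "kcv": mutation["kcv"]}
        body = mutation["body"] if ls in body_recipients else None  # None = empty body
        messages[ls] = {"header": header, "body": body}
    return messages

# Example shaped like Figure 2: tags 1 and 6 share a preferred LogServer.
logs = ["LS1", "LS2", "LS3", "LS4", "LS5", "LS6"]
preferred = {1: "LS1", 4: "LS4", 6: "LS1"}
mutation = {"lsn": 101, "prev_lsn": 100, "kcv": 99, "body": b"set k=v"}
print(route_commit(mutation, [1, 4, 6], preferred, logs))
```

In this model, adding LogServers spreads the bodies more thinly, but every LogServer still receives one message per commit, which is exactly the part I am unsure about.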
what exactly is the metric you’re using here? is this disk busyness reported by FDB? If so, this might be a non-issue.
What this is telling you is simply that there’s work in the queue for the disk at most times. This is not something that is inherently bad. Instead you need to look at IOPS (optimally through metrics provided by the cloud provider).
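If cloud metrics are not handy, a rough IOPS number can also be pulled from /proc/diskstats directly; here is a quick sketch (the device name is an example, substitute the device backing the tlog volume):

```python
# Rough IOPS estimate from /proc/diskstats (Linux).
# "nvme0n1" is only an example device name -- replace it with the device
# that actually backs your tlog data directory.
import time

DEV = "nvme0n1"

def completed_ios(dev):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                # fields[3] = reads completed, fields[7] = writes completed
                return int(fields[3]) + int(fields[7])
    raise RuntimeError(f"device {dev} not found in /proc/diskstats")

INTERVAL = 10
before = completed_ios(DEV)
time.sleep(INTERVAL)
after = completed_ios(DEV)
print(f"~{(after - before) / INTERVAL:.0f} IOPS over the last {INTERVAL}s")
```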
FDB tries to strike a balance between latency and throughput here by controlling how large each commit batch should be. Smaller commit batches will result in lower latencies, larger batches give you more throughput. So as long as the disks can keep up FDB will keep the batches small. When the tlog disks start to slow down, the commit proxies will measure a higher latency and react to that by making the batches larger. You can observe this by measuring tail commit latencies.
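One easy way to watch that is to time commits from a client and look at the tail. A minimal sketch with the Python bindings (API version, key prefix, and sample count are placeholders, adjust for your setup):

```python
# Minimal commit-latency probe using the FDB Python bindings.
# API version, key prefix, and sample count are placeholders.
import time
import fdb

fdb.api_version(710)
db = fdb.open()  # default cluster file

samples = []
for i in range(1000):
    tr = db.create_transaction()
    tr[b"latency_probe/%d" % i] = b"x"  # small write, like the workload described above
    start = time.monotonic()
    tr.commit().wait()
    samples.append(time.monotonic() - start)

samples.sort()
p50 = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
print(f"commit latency: p50={p50 * 1000:.1f} ms, p99={p99 * 1000:.1f} ms")
```

If the tail stays flat while disk busyness sits near 100%, the batching is keeping up and the busyness number alone is probably not telling you much.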
Thanks for pointing out that write throughput is governed and improved by the commit proxies tuning the batch size based on the commit latency they measure. In fact, we did not see significantly higher write latency than normal when the disk busyness reached close to 1 in our experiment.
But what we would like to understand better is this: when the number of transaction log servers is doubled, the non-empty mutation logs distributed to each transaction log server should be roughly halved (due to sharding), and hence its disk busyness should drop to some degree (if not by 2x). That is not what we observed. So we suspect that the empty mutation logs (carrying only the commit timestamp) broadcast to all transaction log servers might be why the disk busyness does not go down: each log server always receives as many messages (empty and non-empty) as there are committed transactions.
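To make the concern concrete, here is a back-of-the-envelope model (the commit rate is a made-up number and I am assuming a replication factor of 3):

```python
# Back-of-the-envelope: per-tlog load vs. number of tlogs.
# The commit rate and replication factor are assumed example values.
COMMITS_PER_SEC = 10_000
REPLICATION = 3

for n_tlogs in (15, 30):
    # Every tlog receives one message per commit, empty or not.
    messages_per_tlog = COMMITS_PER_SEC
    # Only ~REPLICATION of the n tlogs receive the mutation body for a given commit.
    bodies_per_tlog = COMMITS_PER_SEC * REPLICATION / n_tlogs
    print(f"{n_tlogs} tlogs: {messages_per_tlog} msgs/s per tlog, "
          f"~{bodies_per_tlog:.0f} non-empty bodies/s per tlog")
```

If the per-message overhead dominates for small transactions, doubling the tlogs would leave the per-tlog message rate unchanged, which would match what we are seeing.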
what exactly is the metric you’re using here? is this disk busyness reported by FDB? If so, this might be a non-issue.
Can’t speak for the OP, but lately we have been seeing a similar issue with stateless processes: they somehow report fdb_disk_busy close to 100%.
Which is kind of weird for a stateless agent.
What this is telling you is simply that there’s work in the queue for the disk at most times. This is not something that is inherently bad. Instead you need to look at IOPS (optimally through metrics provided by the cloud provider).
Do you mean that this metric reflects the physical queue of the disk? If so, which one? We are running FDB in k8s and are using volumes for /var/fdb/data and /var/log-fdb-trace-logs.
FDB tries to strike a balance between latency and throughput here by controlling how large each commit batch should be. Smaller commit batches will result in lower latencies, larger batches give you more throughput. So as long as the disks can keep up FDB will keep the batches small. When the tlog disks start to slow down, the commit proxies will measure a higher latency and react to that by making the batches larger. You can observe this by measuring tail commit latencies.
What about stateless processes? What could be leading to the disk being busy there?
Yes, the “disk busyness” metric comes from querying the /proc filesystem to get the kernel measurement of how often the disk queue is non-empty. The disk queried comes from taking the --datadir option given to fdbserver and figuring out which block device that directory is on. The stats are at the block device level and are not directory-specific.
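If you want to cross-check the number outside FDB, something along these lines reproduces the idea (a sketch only; on k8s with device-mapper or overlay volumes the st_dev major/minor may point at a mapper device rather than the physical disk):

```python
# Approximate the same reading by hand: find the block device a directory
# lives on, then watch the io_ticks counter in /proc/diskstats.
import os
import time

def io_ticks_ms(datadir):
    st = os.stat(datadir)
    major, minor = os.major(st.st_dev), os.minor(st.st_dev)
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if int(fields[0]) == major and int(fields[1]) == minor:
                # fields[12] = io_ticks: total ms during which I/Os were in flight
                return int(fields[12])
    raise RuntimeError(f"no /proc/diskstats entry for {major}:{minor}")

DATADIR = "/var/fdb/data"  # the same directory fdbserver gets via --datadir
INTERVAL = 10

before = io_ticks_ms(DATADIR)
time.sleep(INTERVAL)
after = io_ticks_ms(DATADIR)
print(f"disk busy fraction over {INTERVAL}s: {(after - before) / (INTERVAL * 1000):.2f}")
```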