We ran a test yesterday where we were reading 1 GByte/s and writing 80 MBytes/s on a 100-storage-server cluster.
The status json kept showing that batch performance was limited by log_server_write_queue, and DataDistribution was continuously moving ~300GB of data at priority 1000 (the highest, to the best of my understanding).
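For anyone reproducing this, here is a minimal sketch of pulling those two signals out of status json via the Python bindings; the qos and moving_data field names are as I understand them in recent versions, so verify them against yours:

```python
import json
import fdb

fdb.api_version(710)  # assumption: match your client/cluster API version
db = fdb.open()

tr = db.create_transaction()
# \xff\xff/status/json returns the same document as `status json` in fdbcli.
status = json.loads(bytes(tr[b'\xff\xff/status/json']))

qos = status['cluster']['qos']
moving = status['cluster']['data']['moving_data']
print('batch limited by   :', qos['batch_performance_limited_by']['name'])
print('normal limited by  :', qos['performance_limited_by']['name'])
print('DD in-flight bytes :', moving['in_flight_bytes'])
print('DD in-queue bytes  :', moving['in_queue_bytes'])
print('DD highest priority:', moving['highest_priority'])
```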
From what I could tell the log servers were not actually saturated on I/O or CPU, but when I looked more closely at the individual shards I noticed that, out of 5000 shards, 520 were used for backup logs, e.g.:
b'\xff\x02/blog/>l?6\x93\xd1\x88\xa9\t~\x00\xd1\x1f,\xfe\xa7\xbd\x00\x00\x00\xc98\xd3@' ['5ab86def80a483a1d9c992cfec4ae9c2', 'a9e203628a9336e815391eba3e37c179', 'aebb827b83585f537c8abe0134879ad4', '2a353d989c13cf2dc27b2804cb52b36e']
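A minimal sketch of how one can count the shard boundaries under that backup log prefix with the Python bindings' locality API (assuming \xff\x02/blog0 as the end of the prefix range):

```python
import fdb

fdb.api_version(710)  # assumption: match your client/cluster API version
db = fdb.open()

# Backup mutation log keyspace seen above; b'0' (0x30) is the byte after
# b'/' (0x2f), so this range covers every key with the \xff\x02/blog/ prefix.
begin = b'\xff\x02/blog/'
end = b'\xff\x02/blog0'

# get_boundary_keys returns the shard boundary keys inside the range, so the
# count approximates how many shards DD has carved out for backup logs.
boundaries = list(fdb.locality.get_boundary_keys(db, begin, end))
print(len(boundaries), 'shard boundaries inside the backup log keyspace')
```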
We had a backup lag of ~40s, which seems decent and seems to explain why 3GB or so were continuously moving, since it means we were roughly storing that much in the system range: 40s * 80MBytes/s = 3200MBytes.
I suspect that, because backup logs are written in the system range, that range becomes quite write-hot, so DD wants to split and move the range into more shards to make the load more uniform, but the somewhat ephemeral nature of backup logs causes shards to quickly flip from hot to cold and vice versa.
I’m wondering if there is any tuning that could be applied, either through range config or knobs, to somehow tame this.
The range config feature can’t tell DD to not split or to not move a key range. It also can’t force the replication factor of any key range to be lower than the configured replication factor for the cluster, only higher.
I don’t think any of the backup-related knobs can meaningfully improve the situation. Also, be aware that some of the backup/restore knobs are effectively stateful, in that they determine parts of the KV log schema, so changing them once backup data exists can silently corrupt a running backup and/or silently prevent restore from correctly using older backup data.
If you are writing to a blobstore:// destination, then you might be able to shave some seconds off of your backup lag by tweaking the blobstore URL parameters documented here, but I don’t have any specific suggestions. The current log flushing logic reads from FDB in parallel and pipelines output to the destination, which in the blobstore:// case is further pipelined by doing a multi-part upload and sending multiple parts at once.
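To make that concrete, here is a hedged sketch of what tweaking the URL could look like; the parameter names (secure_connection, concurrent_uploads, multipart_max_part_size, requests_per_second) are my recollection of the blob store URL docs, so double-check them against your FDB release:

```python
# Hypothetical blobstore:// destination URL with a few throughput-related
# parameters; verify the parameter names against your FDB version's docs.
from urllib.parse import urlencode

params = {
    'bucket': 'fdb-backups',              # example bucket name
    'secure_connection': 1,
    'concurrent_uploads': 20,             # more multi-part upload parallelism
    'multipart_max_part_size': 20000000,  # bigger parts per upload request
    'requests_per_second': 200,           # raise the client-side rate limit
}
url = 'blobstore://KEY:SECRET@s3.example.com:443/my_backup?' + urlencode(params)
print(url)
# Pass the resulting URL as the -d argument to `fdbbackup start`.
```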
Backup has a better logging mode, --partitioned-log-experimental, which reads logs directly from the transaction log roles, but I can’t recommend using it: as far as I am aware, the open source version has several critical backup bugs, and the current version of restore can’t use the partitioned logs because they were meant for a new restore project that was abandoned when it did not deliver the expected performance improvement.