We’ve got two identical clusters that we’re doing DR replication between. For most of the day this runs absolutely fine; replication lag hovers around 5 seconds or less. These are on full VMs, not on K8s using the FDB operator.
We’ve also got a set of jobs that run over a large swathe of our keyspace, reading ranges to do analysis and writing the results to other keys. They kick off at 00:00 each day and take about 30–35 minutes.
(N.B. Ideally these jobs would run at batch priority, but they don’t currently, and making the whole distributed system priority-aware so they can run fully at batch priority is a big chunk of work. The engineering team who would be responsible are pushing back unless we can point to specific data showing that this is the problem and that moving from default to batch priority would fix it.)
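For what it’s worth, the per-transaction change itself looks small; something like this (Python bindings, sketch only, `analyse_range` is a made-up stand-in for our job code) is what I have in mind. The hard part is plumbing the decision through the rest of the system, not the option itself:

```python
import fdb

fdb.api_version(710)   # match whatever API version the cluster is actually on
db = fdb.open()

@fdb.transactional
def analyse_range(tr, begin, end):
    # Mark this transaction as batch priority, so Ratekeeper throttles it
    # before default-priority traffic when the cluster is under pressure.
    tr.options.set_priority_batch()
    derived = []
    for k, v in tr.get_range(begin, end):
        derived.append((k, v))   # placeholder for the real analysis
    # ... write result keys here ...
    return derived
```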
During the running of these jobs, our DR replication lag steadily and linearly climbs to about 3-4 mins. Then the background jobs complete and the replication lag rapidly (within a minute or so) returns to the normal sub-5 second level.
During the time of these background jobs, there is nothing else to indicate that the cluster is being overloaded at all. All FDB queues stay normal and we see no change in durability lag. Data lag actually goes down (from 50–80 ms to 10–15 ms), which I would assume is because more data being written means whatever buffer sits in front of the writes is being filled faster. Client requests to our API (which is backed by FDB) don’t see any increase in response times or similar.
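In case it helps, this is roughly how we’re sampling those numbers: a small script that reads the machine-readable status and pulls out the `qos` section. Treat it as a sketch; the exact field names are from memory and may differ by FDB version:

```python
import fdb
import json

fdb.api_version(710)
db = fdb.open()

@fdb.transactional
def read_status(tr):
    # \xff\xff/status/json is the machine-readable "status json" special key.
    raw = tr[b'\xff\xff/status/json']
    return json.loads(raw[:])   # slicing forces the future and gives plain bytes

qos = read_status(db)["cluster"]["qos"]
print("limited by:          ", qos["performance_limited_by"]["name"])
print("tlog queue bytes:    ", qos["worst_queue_bytes_log_server"])
print("storage queue bytes: ", qos["worst_queue_bytes_storage_server"])
print("durability lag (s):  ", qos["worst_durability_lag_storage_server"]["seconds"])
print("data lag (s):        ", qos["worst_data_lag_storage_server"]["seconds"])
```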
For additional context, the dr-agent processes are currently hosted on the same machines as the FDB log server processes. We have 9 server processes with class log, and at any one time (except during re-election or similar) only 4 of them are actively hosting the log role; the other 5 are just there to allow takeover in minimal time and with minimal performance drop. There are 18 storage-class nodes, all of which have the storage role.
During the jobs, CPU goes from ~5% to ~25% specifically on the log nodes that are hosting the log role (so I surmise it’s related to the role’s needs, not the dr-agent). It stays constant on the ‘hot spare’ log nodes and on the storage nodes. Memory use doesn’t change on any nodes. Disk write I/O and throughput on the log and storage nodes show a significant increase, but nowhere near saturating the capabilities of the disks. Network throughput, both in and out, on the log nodes with active log roles shows a similar increase, but again isn’t saturating the network. Any additional network traffic on the storage nodes is small enough to be lost in the regular background traffic (a combination of them doing more as a baseline and there being more of them to spread the traffic across, I believe).
So I’m at a bit of a loss. At the moment we have an alarm that fires every night when our background jobs run, and short of artificially restricting their throughput beyond whatever backpressure FDB’s read-version issuing and commit mechanism would apply (Ratekeeper, I think, is the part that starts artificially introducing slowness when the system is being overloaded, but I forget the details off the top of my head), I don’t know how to resolve it. I also don’t know what we’d do if the level of traffic those background jobs apply (which is apparently well within what the FDB cluster can cope with) were our baseline, which it could easily be as we grow.
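If we do end up artificially throttling the jobs ourselves, I was imagining something as blunt as a client-side token bucket around the per-chunk commits. Pure sketch, all names and numbers made up:

```python
import time

class TokenBucket:
    """Crude client-side throttle: allow at most `rate` units per second."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def take(self, n=1):
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, capped at the burst size.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.rate)

# Hypothetical use in the nightly jobs: cap commits (or bytes) per second.
limiter = TokenBucket(rate=200, burst=400)   # 200 commits/sec, made-up number
# for chunk in key_range_chunks:
#     limiter.take()
#     process_and_commit(chunk)   # existing job logic
```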
Can I see how ‘loaded’ each dr-agent process is, or something similar? Should I be expecting to spin up more than just 9 of them? The very minimal CPU/memory/network throughput on the 5 ‘hot spare’ log nodes, which also run active dr-agent processes, leads me to believe either that the work isn’t being distributed to them for some reason, or that adding more processes wouldn’t help because the bottleneck is elsewhere.
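The closest thing to visibility I’ve found so far is the layer status the agents publish, which shows up under `cluster.layers` in status json (and which I believe is what `fdbdr status` reads). Dumping it at least shows tag-level progress/lag, but nothing obviously per-process. Reusing the `read_status` helper from the snippet above; the exact contents vary by version:

```python
import json

# Whatever the backup/DR agents have published ends up under cluster.layers.
layers = read_status(db).get("cluster", {}).get("layers", {})
print(json.dumps(layers, indent=2))
```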