We have a number of workers, each operating on an individual set of keys. Each worker periodically runs a transaction where it reads and updates its own keys.
We’ve encountered high latencies on these transactions, reaching 2 seconds, when running 1000 workers with a period of 200 ms (5 writes per second per worker). The strange thing is that when we remove the delay between transactions (i.e. write as fast as we can), latencies drop back to 30–130 ms.
The first panel shows the latency of executing a single transaction including retries.
The second panel shows the overall rate of transactions among all workers per second.
The third panel shows the overall rate of transaction retries per second.
Here’s the rate of conflicts that we got from status json:
When running without delays, we observed stable low latencies despite the increased overall write rate. We also saw ~8 times fewer transaction retries.
At the same time, status json showed ~8 times more transaction conflicts:
We monitored the reasons for the retries, and in all cases, it was error code 1009 (Request for future version).
What may have caused this? It’s surprising that increased load leads to lower latencies. It’s also strange that there are fewer retries despite more transaction conflicts.
We’re using FoundationDB v6.2.7.
Our app is built on node-foundationdb with FDB client v6.2.7.
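For illustration, each worker’s loop is conceptually something like the sketch below. This is not our actual code (which uses node-foundationdb); the key layout and names here are made up purely to show the shape of the workload:

```python
# Conceptual sketch of one worker (illustrative only; the real app uses
# node-foundationdb, and the key names / subspace layout here are invented).
import random
import time

import fdb

fdb.api_version(620)
db = fdb.open()

WORKER_ID = 42          # hypothetical worker identifier
PERIOD_SECONDS = 0.2    # 200 ms between transactions -> ~5 writes/s per worker

@fdb.transactional
def touch_own_keys(tr, worker_id):
    # Read and update keys that belong only to this worker.
    prefix = fdb.tuple.pack(("worker", worker_id))
    for _key, _value in tr.get_range_startswith(prefix, limit=10):
        pass  # read our own state
    tr[prefix + fdb.tuple.pack((random.randint(0, 9),))] = b"updated"

while True:
    start = time.monotonic()
    touch_own_keys(db, WORKER_ID)
    # Sleep out the remainder of the period before the next transaction.
    time.sleep(max(0.0, PERIOD_SECONDS - (time.monotonic() - start)))
```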
future_version is an error that you receive when a transaction gets a read version from the proxy and then uses it to do a read at a storage server that hasn’t caught up to that version within 1 second. Typically, receiving this error means that your storage servers are lagging behind. You’d want to check max(cluster.processes.<process_id>.roles[n].data_lag.seconds) from status json across the storage roles. In most cases, this means that your write throughput is higher (or more focused on one key) than what your current storage servers can handle. If this is a sustained, regular workload for you, then you should add additional storage servers to the cluster. But it’d be good if you’re able to check storage server CPU and disk utilization as well.
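For example, something along these lines (purely illustrative; it assumes fdbcli is on the PATH and uses the default cluster file) will pull the worst storage data_lag out of status json:

```python
# Illustrative only: report the worst storage-role data_lag from `status json`.
import json
import subprocess

status = json.loads(
    subprocess.run(
        ["fdbcli", "--exec", "status json"],
        capture_output=True, text=True, check=True,
    ).stdout
)

worst_lag = 0.0
for process in status["cluster"]["processes"].values():
    for role in process.get("roles", []):
        if role.get("role") == "storage":
            worst_lag = max(worst_lag, role.get("data_lag", {}).get("seconds", 0.0))

print(f"max storage data_lag: {worst_lag:.2f}s")
```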
I’m confused about where your conflicts are coming from. The workload that you describe makes it sound like you’ve already partitioned all of your workers apart in the keyspace, and therefore they shouldn’t be conflicting at all?
The graph of transaction conflicts is somewhat less meaningful without knowing how many transaction successes there were. Our graphs track this as a “percentage of transactions that conflict”, which then lets you know if 100 conflicts is a lot or a little.
Combined with your other graphs, it sounds like the most likely explanation is that somehow you’ve acquired a read version (i.e. did a read) and then did the wait? Otherwise, is your wait happening in such a way that it stacks up multiple conflicting workers to start at about the same time? I could see an argument that a continuous stream of transactions would cause clients to settle into a mutually avoidant pattern, but random waits could cause unpredictable p99 latency.
It’s strange to me that your graphs show some stretches of minutes as being fine and low latency, and then some minutes where you’ve completely fallen off a latency cliff. I have fewer good suggestions off the top of my head for this one, but I shall ponder a bit longer…
We currently have 12 storage processes, none of which use more than 20% of one CPU core. Disk utilization (cluster.processes.<pid>.disk.busy) is less than 10%.
Anyway, if the cause were resource saturation, how would the cluster be less saturated under a heavier workload?
The number of transactions is about the same as in the Write Rate graph, so the percentage of conflicts is very small in both cases. However, there are many more retries than conflicts. The Transaction Conflicts graph is taken from status json, and the Transaction Retries graph comes from our app and is measured using a counter that is incremented whenever a transaction commit is rejected.
No, we only wait after a transaction successfully completes. I’ll try experimenting with randomized delays, though I think the transactions should already be sufficiently spread out in time due to variance in latencies.
I tried to add ±50% randomization for delays, and it fixed the issue! Latencies are consistently low and there are no retries at all.
But I still don’t understand how to interpret this. Could it be an issue with node-foundationdb? Or is it expected for any client?
In our test, we run 100 workers per instance, with 10 instances. Could it be that there’s some sort of contention inside one client resulting in 1009 errors?
I don’t think any slowdown or stall on the client side would generate future_version errors. If anything, it would manifest as past_version errors and retries, because the read version of the transaction would be obsolete once the client comes back to life and resumes reading.
Maybe the client stalls just as a lot of transactions are about to commit, then all of a sudden sends all the commits at once in a burst, saturating the cluster for that brief moment and causing future_version errors, and then for some reason starts a new batch of transactions and stalls again before they commit? This could create very brief spikes of activity followed by idle periods, manifesting as low average CPU usage.
I can’t see any considerable difference in CPU or disk usage; neither exceeds 40%. However, it may be that spikes fall between metric scrapes, so they aren’t visible. The scrape interval is currently 10 seconds.
Is there something else I can monitor in status json that indicates saturation?
As far as I understand, it represents a queue of data already received by the storage process but not yet written to disk. However, I can’t see disk or CPU saturation:
Hmm, I’m seeing exactly the same symptoms (some 1009 errors, and a strangely high p100 response time on a very lightly loaded cluster). Checking my data and durability lags, I have:
Attached are some status jsons. It’s hard to say exactly when this issue is occurring since it’s somewhat sporadic, and we (unfortunately) don’t have super good telemetry on it, but I did notice some of these errors bubble up to our API around this time. If these aren’t super helpful, happy to instrument things a bit more.
One weird thing I noticed with the status jsons above is that the limiting_durability_lag_storage_server is 14.334199999999999, but if you examine the durability_lag on all the storage servers they’re all around 5.5s
You probably have a proxy or resolver on the same process as a storage server, so it’s draining all the CPU from that storage server. Your status jsons confirm this, see e.g. process 10.67.246.24:4504. Your cluster will become much happier if you add 4 stateless processes to it.
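(For reference, adding stateless processes is usually just a few extra [fdbserver.<port>] stanzas with class = stateless in foundationdb.conf on machines with spare CPU; the ports below are purely illustrative.)

```ini
# foundationdb.conf (excerpt) -- illustrative ports; place on hosts with spare CPU
[fdbserver.4510]
class = stateless

[fdbserver.4511]
class = stateless

[fdbserver.4512]
class = stateless

[fdbserver.4513]
class = stateless
```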
Ahh, thanks heaps for the quick reply @alexmiller! Got it, sorry about the silly configuration; I tried to follow along with the Cluster tuning cookbook but I think I got a bit lost on the correct setup.
I just discovered this recently as well. I think what’s happening is that the “limiting” statistic you get comes from ratekeeper, which is using smoothers on the source data. It uses a 1s smoother on the latest version and a 10s smoother on the durable version, and this makes the apparent difference between the two larger.
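To make that concrete, here’s a back-of-the-envelope sketch (my own numbers, assuming the usual ~1 million versions per second and a first-order exponential smoother, whose output trails a steadily increasing input by roughly rate × time constant):

```python
# Rough sketch (not from the thread) of why a 1s smoother on the latest version
# and a 10s smoother on the durable version inflate the apparent durability lag
# by about 9s when versions advance at a steady rate.
VERSIONS_PER_SECOND = 1_000_000  # assumed version rate
TRUE_LAG_SECONDS = 5.5           # durability_lag actually seen on the storage servers

def steady_state_smoother_output(rate, time_constant, true_value):
    # For a ramp input, an exponential smoother settles at (true value - rate * tau).
    return true_value - rate * time_constant

t = 100.0  # some point in steady state
latest_version = VERSIONS_PER_SECOND * t
durable_version = VERSIONS_PER_SECOND * (t - TRUE_LAG_SECONDS)

smoothed_latest = steady_state_smoother_output(VERSIONS_PER_SECOND, 1.0, latest_version)
smoothed_durable = steady_state_smoother_output(VERSIONS_PER_SECOND, 10.0, durable_version)

apparent_lag = (smoothed_latest - smoothed_durable) / VERSIONS_PER_SECOND
print(f"apparent durability lag ~= {apparent_lag:.1f}s vs true {TRUE_LAG_SECONDS}s")
```

With a true lag of ~5.5 s that lands at roughly 14.5 s of apparent lag, which is in the same ballpark as the 14.33 you saw.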
After digging some more, I’ve found that 1009 errors don’t always correlate to high durability_lag.
In my latest test, I do around 500 writes per second. Starting on an empty database, everything is fine for some time, but after a while I start getting retries:
I don’t see any considerable change in data_lag or durability_lag.
When I restart the writer, I get retries from the start.
Surprisingly, when I add another writer, effectively doubling the load, all retries go away:
Do you have sample code that can be tested by others on their setup?
I started writing a hypothesis along the lines of:
…if you are doing back-to-back writes and reads, it may result in clients requesting a version from a storage server (SS) that the SS hasn’t yet pulled in from the tLogs. And when you increase the write load, it could be that the client or proxy introduces some additional latency (like batching) that gives the SS enough time to pull the latest mutations from the tLogs…
…but then I got confused about how the SS keeps in sync with the tLogs. I am not sure why we do not see many more of these future_version errors in our code.
Typically, workloads are always doing many concurrent transactions that involve both reads and writes. So, in practice, the SS should always be behind the tLogs by at least a few milliseconds (in the best case), and up to a few seconds (in the average case)?
If that is so, then one should expect to see these errors much more frequently; but we don’t see them in practice. So clearly I am missing something important here.
@alexmiller @ajbeamon could you please help with the question above? (Apologies if this is already covered somewhere in an architecture doc or post.)
There are a few factors at play here that could result in you not seeing frequent future_version errors. First, in order for a client to attempt to read at a version, it will need to request and receive a read version from the proxies and then issue the read. During this time, the data will be moving from the logs to the storage servers, and if they are keeping up well, the data could be there before you want to read it.
Second, a storage server won’t fail with a future version immediately if it gets a request for a read that is in the future. Instead, it will wait for 1 second to see if the requested version shows up. This could introduce latency to your early reads if they are arriving before the data, but it will result in the read succeeding.
Last, the future version error is a retryable one, so depending on how you run your transactions you might not even see it when it’s happening. The default retry loops, for example, will not give you any indication that they retried due to a future version error. You can, however, see a count of future version errors in the client trace logs in the TransactionMetrics event. You could also write your own retry loop that logs errors in some way so that you could see them.
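For example, a hand-rolled retry loop that logs the error code before retrying would make future_version (1009) retries visible instead of being swallowed. The sketch below uses the Python bindings purely for illustration; the same idea applies to any client, including node-foundationdb:

```python
# Illustrative sketch: a manual retry loop that logs which error caused each retry.
import fdb

fdb.api_version(620)
db = fdb.open()

def run_with_logging(db, body):
    tr = db.create_transaction()
    while True:
        try:
            result = body(tr)
            tr.commit().wait()
            return result
        except fdb.FDBError as e:
            # 1009 here would indicate a future_version retry.
            print(f"retrying transaction after error {e.code}")
            tr.on_error(e).wait()  # re-raises if the error is not retryable

# Hypothetical usage:
# run_with_logging(db, lambda tr: tr.set(b"some-key", b"some-value"))
```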
@ajbeamon I was looking through the code and noticed the 1 second sleep. This explains this graph of our read/write and read/read transactions, where the p100 response time for a transaction is a few tens of milliseconds over 1 second: