High rate of transaction retries with error code 1009 (Request for future version)

We have a number of workers, each operating on its own set of keys. Each worker periodically runs a transaction in which it reads and updates those keys.
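For reference, each worker does roughly the following (a simplified sketch on top of node-foundationdb; the key layout and the `WORKER_ID`/`PERIOD_MS` names are illustrative placeholders, not our actual code):

```ts
import * as fdb from 'foundationdb'

fdb.setAPIVersion(620)
const db = fdb.open()

const WORKER_ID = process.argv[2] ?? '0'  // illustrative: each worker owns its own key prefix
const PERIOD_MS = 200                     // fixed pause between transactions

const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms))

async function workerLoop() {
  while (true) {
    // Read and update only this worker's own keys in a single transaction.
    await db.doTransaction(async tn => {
      const key = `worker/${WORKER_ID}/counter`
      const current = await tn.get(key)
      const next = (current ? parseInt(current.toString(), 10) : 0) + 1
      tn.set(key, String(next))
    })
    await sleep(PERIOD_MS)  // wait only after the transaction has committed
  }
}

workerLoop()
```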

We’ve encountered high latencies on these transactions, reaching 2 seconds, when running 1000 workers with a 200 ms period (5 writes per second per worker). The strange thing is that when we remove the delay between transactions (i.e. write as fast as we can), latencies drop back to 30-130 ms.

Below are the metrics when running with delays:

The first panel shows the latency of executing a single transaction including retries.
The second panel shows the overall rate of transactions among all workers per second.
The third panel shows the overall rate of transaction retries per second.

Here’s the rate of conflicts that we got from status json:

[graph: transaction conflict rate from status json]

When running without delays, we observed stable low latencies despite the increased overall write rate. We also had ~8 times fewer transaction retries.

At the same time, status json showed ~8 times more transaction conflicts:

[graph: transaction conflict rate from status json, without delays]

We monitored the reasons for the retries, and in all cases, it was error code 1009 (Request for future version).

What may have caused this? It’s surprising that increased load leads to lower latencies. It’s also strange that there are fewer retries despite more transaction conflicts.

We’re using FoundationDB v6.2.7.
Our app is built on node-foundationdb with FDB client v6.2.7.

future_version is an error that you receive when a transaction gets a read version from the proxy and then uses it to do a read at a storage server that hasn’t caught up to that version within 1 second. Typically, receiving this error means that your storage servers are lagging behind. You’d want to check max(cluster.processes.<process_id>.roles[n].data_lag.seconds) from status json for storage roles. In most cases, this would mean that your write throughput is higher (or more focused on one key) than what your current storage servers can handle. If this is a sustained, regular workload for you, then you should add additional storage servers to the cluster. But it’d be good if you’re able to check storage server CPU and disk utilization as well.
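For example, something like this (an untested sketch that shells out to fdbcli, assuming it’s on PATH and pointed at your cluster file) would pull out the worst storage data lag:

```ts
import { execSync } from 'child_process'

// Fetch `status json` via fdbcli and report the worst data_lag across storage roles.
const status = JSON.parse(execSync("fdbcli --exec 'status json'").toString())

let maxLagSeconds = 0
for (const proc of Object.values(status.cluster?.processes ?? {}) as any[]) {
  for (const role of proc.roles ?? []) {
    if (role.role === 'storage' && role.data_lag) {
      maxLagSeconds = Math.max(maxLagSeconds, role.data_lag.seconds)
    }
  }
}
console.log(`max storage data_lag: ${maxLagSeconds.toFixed(2)} seconds`)
```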

I’m confused about where your conflicts are coming from. The workload that you describe makes it sound like you’ve already partitioned all of your workers apart in the keyspace, and therefore they shouldn’t be conflicting at all?

The graph of transaction conflicts is somewhat less meaningful without knowing how many transaction successes there were. Our graphs track this as a “percentage of transactions that conflict”, which then lets you know if 100 conflicts is a lot or a little.

Combined with your other graphs, it sounds like the most likely explanation would be that somehow you’ve acquired a read version (i.e. did a read), and then did the wait? Otherwise, is your wait happening in such a way that it stacks up multiple conflicting workers to start at about the same time? I could see an argument that a continuous stream of transactions would cause clients to settle into a mutually avoidant pattern, but random waits could cause unpredictable p99 latency.
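To be concrete, the kind of pattern I’m asking about would look roughly like this (a purely hypothetical sketch, not a claim about your code), with the delay landing inside the transaction body after a read has already acquired a read version:

```ts
import * as fdb from 'foundationdb'

fdb.setAPIVersion(620)
const db = fdb.open()
const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms))

// Hypothetical anti-pattern: the first read acquires the read version, and the
// transaction then sits on that version for the whole delay before committing.
async function suspectIteration(key: string, periodMs: number) {
  await db.doTransaction(async tn => {
    const current = await tn.get(key)  // read version acquired here
    await sleep(periodMs)              // delay *inside* the transaction body
    const next = (current ? parseInt(current.toString(), 10) : 0) + 1
    tn.set(key, String(next))
  })
  // The delay belongs out here instead, after doTransaction has resolved.
}
```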

It’s strange to me that your graphs show some stretches of minutes as being fine and low latency, and then some minutes where you’ve completely fallen off a latency cliff. I don’t have great suggestions off the top of my head for this, but I shall ponder a bit longer…

Thanks for the quick response.

Here’s the data lag from the same time period when running 1000 workers with 200ms delays:


(Don’t mind the different timestamps on the graphs; we just moved to a different timezone.)

Here’s the data lag without delays:

We currently have 12 storage processes, none of which use more than 20% of one CPU core. Disk utilization (cluster.processes.<pid>.disk.busy) is less than 10%.

Anyway, if the reason were resource saturation, how would the cluster be less saturated under a heavier workload?

The number of transactions is about the same as in the Write Rate graph, so the percentage of conflicts is very small in both cases. However, there are many more retries than conflicts. The Transaction Conflicts graph is taken from status json, while Transaction Retries comes from our app and is measured using a counter that is incremented whenever a transaction commit is rejected.

No, we only wait after a transaction successfully completes. I’ll try experimenting with randomized delays, though I’d think the transactions should already be spread out enough in time due to the variance in latencies.

I tried adding ±50% randomization to the delays, and it fixed the issue! Latencies are consistently low and there are no retries at all.
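Concretely, the only change was jittering the pause after each successful commit, roughly like this (sketch):

```ts
const BASE_DELAY_MS = 200

// ±50% jitter: instead of sleeping exactly 200 ms after every commit, sleep a
// uniformly random duration in [100 ms, 300 ms), which de-synchronizes the workers.
function jitteredDelayMs(): number {
  return BASE_DELAY_MS * (0.5 + Math.random())
}

// used as: await sleep(jitteredDelayMs())
```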

But I still don’t understand how to interpret this. Could it be an issue with node-foundationdb, or is this expected behavior for any client?

In our test, we run 100 workers per instance, with 10 instances. Could it be that there’s some sort of contention inside a single client process that results in these 1009 errors?

I don’t think any slowdown or stall on the client side would generate future_version errors. If anything, it would manifest as past_version errors and retries, because the read version of the transaction would be obsolete once the client comes back to life and resumes reading.

Maybe the client stalls just as a lot of transactions are about to commit, then suddenly sends all the commits at once in a burst, saturating the cluster for that brief moment and causing future_version errors, and then for some reason starts a new batch of transactions and stalls again before they commit? That could create very brief spikes of activity followed by idle periods, which would show up as low average CPU usage.

I can’t see a considerable difference in CPU or disk usage; neither exceeds 40%. However, it may be that the spikes fall between metric scrapes, so they aren’t visible. The scrape interval is currently 10 seconds.

Is there something else I can monitor in status json that indicates saturation?

I’ve tracked it down to growing durability_lag.
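For reference, I’m pulling it out of status json roughly like this (a sketch that shells out to fdbcli and assumes durability_lag sits under the storage role entries the same way data_lag does):

```ts
import { execSync } from 'child_process'

// Print data_lag and durability_lag per storage role from `status json`.
const status = JSON.parse(execSync("fdbcli --exec 'status json'").toString())

for (const [pid, proc] of Object.entries(status.cluster?.processes ?? {}) as [string, any][]) {
  for (const role of proc.roles ?? []) {
    if (role.role === 'storage') {
      console.log(`${pid}: data_lag=${role.data_lag?.seconds}s durability_lag=${role.durability_lag?.seconds}s`)
    }
  }
}
```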

As far as I understand, it represents data that has already been received by the storage process but not yet made durable on disk. However, I can’t see disk or CPU saturation:

The disk is used exclusively by this storage process. What else could be the bottleneck here?