Best strategy to handle client overload

From time to time, under load spikes, our client app gets overloaded with work and we start getting errors that the transaction is too old (since it exceeds 5 seconds). That is bad in itself, but there is also no easy way to exit this state, since the client retries, retries and retries. Today I saw a transaction that had been retrying for over 30 minutes.
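One generic way to keep a retry loop from running for 30 minutes is to give it an explicit budget. The sketch below is not FoundationDB code: `RetryableError`, `RetryBudgetExceeded` and `run_with_budget` are hypothetical names, and `work` stands in for whatever function runs one transaction attempt. The FoundationDB bindings also expose `timeout` and `retry_limit` transaction options that serve a similar purpose inside the built-in retry loop.

```python
import time


class RetryableError(Exception):
    """Hypothetical stand-in for a retryable error such as 'transaction too old'."""


class RetryBudgetExceeded(Exception):
    """Raised when the loop gives up instead of retrying forever."""


def run_with_budget(work, max_attempts=5, deadline_s=10.0, base_delay_s=0.01,
                    clock=time.monotonic, sleep=time.sleep):
    """Call work() until it succeeds, attempts run out, or the deadline would pass."""
    start = clock()
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return work()
        except RetryableError as exc:
            elapsed = clock() - start
            # Give up on the last attempt, or if the next backoff would cross the deadline.
            if attempt == max_attempts or elapsed + delay > deadline_s:
                raise RetryBudgetExceeded(
                    f"gave up after {attempt} attempts and {elapsed:.2f}s") from exc
            sleep(delay)
            delay = min(delay * 2, 1.0)  # exponential backoff, capped at 1 second
```

When the budget is exhausted the caller gets an error it can surface to the user or shed, rather than holding a thread and adding more load to an already overloaded client.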

Since it is quite hard to predict the workload, we are starting to look at lower-level throttling to avoid these errors.

Are there any best practices for this? Throttling the start of a transaction? Limiting the number of parallel transactions?
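For reference, limiting the number of parallel transactions could look roughly like the sketch below. It is a minimal illustration using only the Python standard library; `TransactionLimiter`, `Overloaded` and the chosen limits are hypothetical and not part of any FoundationDB API.

```python
import threading
from contextlib import contextmanager


class Overloaded(Exception):
    """Raised when no transaction slot becomes free in time (hypothetical name)."""


class TransactionLimiter:
    """Caps the number of transactions in flight and rejects callers that wait too long."""

    def __init__(self, max_in_flight, max_wait_s):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._max_wait_s = max_wait_s

    @contextmanager
    def slot(self):
        # Wait for a free slot, but only for a bounded time, so queued work
        # fails fast instead of piling up behind an overloaded client.
        if not self._slots.acquire(timeout=self._max_wait_s):
            raise Overloaded("no transaction slot free")
        try:
            yield
        finally:
            self._slots.release()
```

A caller would wrap each transaction in `with limiter.slot(): ...`, so the transaction only starts once a slot is held. Rejecting after `max_wait_s` is a design choice: the alternative, an unbounded queue, tends to turn a short spike into a long backlog.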

What operations is the transaction trying to get done that make it take longer than 5 seconds?

Is it taking longer because the reads are taking much longer? Or is it because the transaction is reading many more rows under certain conditions? If the latter, can the reads go on in parallel?

Depending on consistency guarantee requirements, maybe the transaction can be broken down into smaller transactions that each finish much quicker…

If you could provide more details, maybe someone can suggest something specific.

I also think that, if FDB is being saturated, you might be seeing a lot of other errors sooner than the transaction too old error, such as the future version error.

In one of the projects, we do catch such errors and throttle down the load on cluster to get it out of this state.
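A rough sketch of what catching such errors and throttling down can look like is an additive-increase, multiplicative-decrease (AIMD) concurrency limit. This is an illustration of the general idea, not the code from that project; `AdaptiveLimit` and its default values are hypothetical.

```python
import threading


class AdaptiveLimit:
    """AIMD concurrency limit: grow slowly on success, cut sharply on overload errors."""

    def __init__(self, initial=32, minimum=1, maximum=256, backoff=0.5):
        self._limit = float(initial)
        self._minimum = minimum
        self._maximum = maximum
        self._backoff = backoff
        self._lock = threading.Lock()

    def on_success(self):
        # Additive increase: roughly +1 to the limit per "limit" successful transactions.
        with self._lock:
            self._limit = min(self._maximum, self._limit + 1.0 / self._limit)

    def on_overload(self):
        # Multiplicative decrease: called when an overload-type error is caught.
        with self._lock:
            self._limit = max(self._minimum, self._limit * self._backoff)

    def current(self):
        """Number of transactions the application should allow in flight right now."""
        with self._lock:
            return int(self._limit)
```

The application would call `on_overload()` whenever it catches an error it treats as a saturation signal (for example transaction too old) and `on_success()` otherwise, then use `current()` as the cap on in-flight transactions.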

But are you certain that it is the FDB cluster itself that is being saturated, or is something else going on on the client side with respect to the amount of work being done per transaction?

This is a very heterogeneous workload, and I am explicitly asking about a generic solution, not a specific one.

Most of the time it is either the client network thread or the application thread that is overloaded. Sometimes there are too many reads, sometimes too many transactions. Sometimes it is just an unexpected spike in user activity.

The problem is that there is virtually zero isolation between transactions, and one faulty code path could bring down the entire cluster.

I believe a generic solution exists in the Record Layer, or at least some best practices for achieving this.

In one of the projects, we do catch such errors and throttle down the load on cluster to get it out of this state.

Our cluster is fine and well over-provisioned; the issue is on the client side only. But there could be other cases that could bring down FDB itself.

Will the tag-throttling feature (https://github.com/apple/foundationdb/issues/2432) be a generic enough solution to the problem of a client overloading an entire cluster?

The feature lets a client tag its transactions. The FDB cluster tracks the traffic of tagged transactions, detects whether a tag has too much (read) traffic, and throttles transactions with that tag if the tag is detected as hot.

This feature will come in 6.3 for read-hot transactions. The feature for write-hot transactions is currently under active development and should come in 7.0.

Wow, this is awesome! I am not sure this would solve all of my issues, since my problem is mostly on the client side rather than the server side, but it might work.

The tag-throttling feature actually throttles the client on the client side.

The throttling is enforced on the client side, but it is based on server-side saturation. Saturation on the client side is not detected or throttled, so if that’s the problem you’re running into, tag throttling probably won’t have a big impact.

On the other hand, you mention bringing down the entire cluster, which would suggest a server-side issue that might be manageable with tag throttling.

@ex3ndr Say you have a 2-machine cluster and 200 clients. Is the overload happening on the client side (i.e., where fdbclient runs, and the client runs out of resources such as CPU) or on the server side (i.e., some fdbserver process is running out of resources)?

Well, I think the problem is on the client side. The entire cluster is a client cluster: if the same request hits every node, it saturates both the client and the application code. This is not a problem for most other DBs, since they have no 5-second limit, but with FDB all transactions start to time out once the client is overloaded.

I suspect that this is mostly the client, not our application code. I have actually found that our application code can be very performant with FDB, and it is usually the client that saturates before our application does (for example, the bindings seem to be very slow for a simple operation).