The API get_range_split_points always reports "timed out"

I am using the API fdb_transaction_get_range_split_points(). It works most of the time, but sometimes it returns 1031 (transaction_timed_out). Once 1031 has been returned, every subsequent call returns 1031 as well. My test program calls this interface every 3 seconds. Other interfaces always work well.

future = fdb_transaction_get_range_split_points(...);
err = fdb_future_block_until_ready(future);

In /etc/foundationdb/foundationdb.conf, I configured 4 processes like this:





However, if I use only one process, the timeout problem does not occur.

I tested FoundationDB versions 7.1.9, 7.1.19 and 7.1.27, and all of them have this issue.

I added a log line to the fdb client, in the file

ACTOR Future<Standalone<VectorRef<KeyRef>>> getRangeSplitPoints(Reference<TransactionState> trState,
                                                                KeyRange keys,
                                                                int64_t chunkSize,
                                                                Version version) { 
} catch (Error& e) {
			if (e.code() == error_code_wrong_shard_server || e.code() == error_code_all_alternatives_failed) {
				TraceEvent(SevWarn, "===>Client getRangeSplitPoints error").detail("errcode", e.code());
				trState->cx->invalidateCache(locations[0].tenantEntry.prefix, keys);
				wait(delay(CLIENT_KNOBS->WRONG_SHARD_SERVER_DELAY, TaskPriority::DataDistribution));
			} else if (e.code() == error_code_unknown_tenant) {
				wait(delay(CLIENT_KNOBS->UNKNOWN_TENANT_RETRY_DELAY, trState->taskID));
			} else {
				TraceEvent(SevError, "GetRangeSplitPoints").error(e);

The errcode is 1001 (error_code_wrong_shard_server). I think there may be something wrong in my fdb server configuration. Could anyone help me?

I found that fdbserver throws wrong_shard_server() for the request made in the above code, and that the shard state is NotAssigned. How can I resolve this?

@Andrew Noyes

In my test, with multiple storage server processes, the interface fdb_transaction_get_range_split_points() doesn’t always work. With only one storage server it is fine.

error_code_wrong_shard_server is sent by a storage server when it does not host the requested key range. This is a retryable error: getRangeSplitPoints() will call getKeyRangeLocations to get the locations (i.e., storage servers) of the range and then send requests to these storage servers.

When you have only one storage server, all shards are there so you won’t get error_code_wrong_shard_server. For multiple storage servers, this error code should be transient.
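To make the retry behavior described above concrete, here is a minimal, self-contained sketch in C. It is a toy model, not the FoundationDB client: `ToyCluster`, `toy_lookup`, `toy_request` and `toy_get_split_points` are hypothetical names invented for illustration, the retry delay passed in is an arbitrary number rather than the real knob value, and the transaction timeout is modeled as a simple elapsed-time budget. Only the two error codes (1001 and 1031) are taken from this thread.

```c
#include <assert.h>

/* Error codes quoted in this thread. */
enum {
    ERR_OK = 0,
    ERR_WRONG_SHARD_SERVER = 1001,
    ERR_TRANSACTION_TIMED_OUT = 1031
};

/* Hypothetical toy cluster (not an FDB type). */
typedef struct {
    int owner;         /* server that really hosts the key range */
    int cached;        /* client's cached location, -1 if invalidated */
    int stale_server;  /* server returned while lookups are still stale */
    int stale_lookups; /* number of lookups that still return stale_server */
} ToyCluster;

/* Stand-in for a location lookup (getKeyRangeLocations in the thread). */
static int toy_lookup(ToyCluster *c) {
    if (c->stale_lookups > 0) {
        c->stale_lookups--;
        return c->stale_server;
    }
    return c->owner;
}

/* Stand-in for a storage server answering the split-points request. */
static int toy_request(const ToyCluster *c, int server) {
    return server == c->owner ? ERR_OK : ERR_WRONG_SHARD_SERVER;
}

/* Retry loop: on wrong_shard_server, invalidate the cached location,
   wait (simulated by adding to elapsed_ms), look the range up again. */
static int toy_get_split_points(ToyCluster *c, int retry_delay_ms,
                                int timeout_ms, int *attempts) {
    int elapsed_ms = 0;
    assert(retry_delay_ms > 0); /* guarantees the loop terminates */
    *attempts = 0;
    for (;;) {
        if (c->cached < 0)
            c->cached = toy_lookup(c); /* refresh the location cache */
        (*attempts)++;
        int err = toy_request(c, c->cached);
        if (err != ERR_WRONG_SHARD_SERVER)
            return err;
        c->cached = -1;               /* invalidate the cache entry */
        elapsed_ms += retry_delay_ms; /* simulated retry delay */
        if (elapsed_ms >= timeout_ms)
            return ERR_TRANSACTION_TIMED_OUT;
    }
}
```

In this model, a cluster whose only server owns the range succeeds on the first attempt. A client holding one stale cache entry succeeds on the second attempt, after the refresh. If the lookup keeps returning a server that does not host the range, the loop retries until the time budget runs out and the caller sees 1031, even though every individual failure was the "transient" 1001.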

Thank you! However, getRangeSplitPoints() often times out while retrying in a multi-SS environment (the timeout is set to 30s or longer). This happens regardless of whether perpetual_storage_wiggle is enabled or disabled, and regardless of whether there are any read or write requests. The wrong_shard_server error is supposed to be transient, but it lasts too long. Does this mean that the API fdb_transaction_get_range_split_points() is unstable? And how can I avoid this issue?