The API get_range_split_points always reports "timed out"

mdianjun · February 7, 2023, 3:15am

I use the API fdb_transaction_get_range_split_points(). It works most of the time, but sometimes it returns 1031 (transaction_timed_out). Once 1031 is returned, every subsequent call also returns 1031. My test program calls this interface every 3 seconds. Other interfaces always work fine.

future = fdb_transaction_get_range_split_points(...);
err = fdb_future_block_until_ready(future);

In /etc/foundationdb/foundationdb.conf, I configured 4 processes like this:

[fdbserver.4500]

[fdbserver.4501]

[fdbserver.4502]

[fdbserver.4503]

However, if I use only one process, the timeout does not occur.

I tested FoundationDB versions 7.1.9, 7.1.19, and 7.1.27; all of them have this issue.

mdianjun · February 7, 2023, 10:04am

I added logging to the fdb client, in the file NativeAPI.actor.cpp:

ACTOR Future<Standalone<VectorRef<KeyRef>>> getRangeSplitPoints(Reference<TransactionState> trState,
                                                                KeyRange keys,
                                                                int64_t chunkSize,
                                                                Version version) {
.....
		} catch (Error& e) {
			if (e.code() == error_code_wrong_shard_server || e.code() == error_code_all_alternatives_failed) {
				TraceEvent(SevWarn, "===>Client getRangeSplitPoints error").detail("errcode", e.code());
				trState->cx->invalidateCache(locations[0].tenantEntry.prefix, keys);
				wait(delay(CLIENT_KNOBS->WRONG_SHARD_SERVER_DELAY, TaskPriority::DataDistribution));
			} else if (e.code() == error_code_unknown_tenant) {
				ASSERT(trState->tenant().present());
				trState->cx->invalidateCachedTenant(trState->tenant().get());
				wait(delay(CLIENT_KNOBS->UNKNOWN_TENANT_RETRY_DELAY, trState->taskID));
			} else {
				TraceEvent(SevError, "GetRangeSplitPoints").error(e);
				throw;
			}
		}

The errcode is 1001 (error_code_wrong_shard_server). I think there may be something wrong with my fdb server configuration. Could anyone help me?

mdianjun · February 7, 2023, 1:59pm

I found that the wrong_shard_server() error handled in the code above is thrown by fdbserver, and the shard state is NotAssigned. How can I resolve this?

@Andrew Noyes

mdianjun · February 8, 2023, 5:08am

In my test, with multiple storage server processes, the interface fdb_transaction_get_range_split_points() doesn’t always work. With only one storage server, it is fine.

jzhou · February 14, 2023, 4:43am

error_code_wrong_shard_server is sent by the storage server when it does not host the requested key range. This is a retryable error: getRangeSplitPoints() in NativeAPI.actor.cpp will call getKeyRangeLocations to get the locations (i.e., storage servers) of the range again and then send requests to these storage servers.

When you have only one storage server, all shards are there so you won’t get error_code_wrong_shard_server. For multiple storage servers, this error code should be transient.
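
The retry path described above can be sketched as a toy model in plain C. This is only an illustration under stated assumptions, not the NativeAPI.actor.cpp implementation: the shard table, the location cache, and the names locate, storage_server_request, request_with_retry, and move_shard are all hypothetical, and only the error code 1001 comes from this thread.

```c
#include <assert.h>

/* 1001 is the wrong_shard_server code quoted in the thread.
 * Everything else in this file is a hypothetical toy model. */
enum { ERR_OK = 0, ERR_WRONG_SHARD_SERVER = 1001 };

#define NUM_SHARDS 4

/* Authoritative shard -> storage server assignment (what the cluster knows). */
static int owner[NUM_SHARDS] = {0, 1, 2, 3};

/* Client-side location cache; it can go stale when a shard moves. */
static int cache[NUM_SHARDS] = {0, 1, 2, 3};
static int cache_valid[NUM_SHARDS] = {1, 1, 1, 1};

/* A storage server rejects requests for shards it does not host. */
static int storage_server_request(int server, int shard) {
    assert(shard >= 0 && shard < NUM_SHARDS);
    return owner[shard] == server ? ERR_OK : ERR_WRONG_SHARD_SERVER;
}

/* Stand-in for a location lookup: serve from the cache when it is valid,
 * otherwise refresh the entry from the authoritative assignment. */
static int locate(int shard) {
    assert(shard >= 0 && shard < NUM_SHARDS);
    if (!cache_valid[shard]) {
        cache[shard] = owner[shard];
        cache_valid[shard] = 1;
    }
    return cache[shard];
}

/* Simulate data movement: the shard is now hosted by another server,
 * but the client cache is not told about it. */
static void move_shard(int shard, int new_server) {
    assert(shard >= 0 && shard < NUM_SHARDS);
    owner[shard] = new_server;
}

/* Returns the number of attempts used, or -1 if the request gave up. */
static int request_with_retry(int shard, int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        int server = locate(shard);
        int err = storage_server_request(server, shard);
        if (err == ERR_OK)
            return attempt;
        if (err != ERR_WRONG_SHARD_SERVER)
            return -1; /* not retryable in this model */
        /* Retryable: drop the stale cache entry so the next attempt
         * looks the location up again. The real client also waits for
         * a short delay before retrying. */
        cache_valid[shard] = 0;
    }
    return -1;
}
```

In this model, after move_shard(2, 0) the first request for shard 2 hits the stale cache entry and gets 1001, and the second attempt succeeds after the refresh. The model assumes the refreshed location is always correct, so it does not reproduce retries that keep failing until the transaction times out.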

mdianjun · February 14, 2023, 5:14am

Thank you! However, getRangeSplitPoints() often times out while retrying in a multiple-SS environment (the timeout is set to 30s or longer). This happens regardless of whether perpetual_storage_wiggle is enabled or disabled, and regardless of whether there are read and write requests. The wrong_shard_server error is supposed to be transient, but it lasts too long. Does this mean that the API fdb_transaction_get_range_split_points() is unstable? And how can I avoid this issue?
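
For readers reproducing the 30s timeout mentioned above: the C API takes integer option values as an 8-byte little-endian buffer, and the transaction timeout is given in milliseconds. The sketch below shows only that encoding step. The helper name encode_option_int64 is hypothetical, and the fdb_transaction_set_option call appears in a comment because it needs the FoundationDB headers and a live transaction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode an integer option value as the 8-byte
 * little-endian buffer that the FoundationDB C API expects. */
static void encode_option_int64(int64_t value, uint8_t out[8]) {
    assert(out != NULL);
    uint64_t bits = (uint64_t)value;
    for (size_t i = 0; i < 8; i++) {
        out[i] = (uint8_t)(bits >> (8 * i)); /* least significant byte first */
    }
}

/* With the real client library the buffer would be passed like this
 * (not compiled here; tr is an existing FDBTransaction*):
 *
 *   uint8_t buf[8];
 *   encode_option_int64(30000, buf);   // 30 s expressed in milliseconds
 *   fdb_error_t err =
 *       fdb_transaction_set_option(tr, FDB_TR_OPTION_TIMEOUT, buf, 8);
 */
```

According to the documentation of the timeout option, once the timeout elapses, pending and future uses of that transaction fail until it is reset, so reusing one transaction object across polls is worth ruling out when 1031 keeps repeating.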
