High rate of transaction retries with error code 1009 (Request for future version)

The 1 second delay here is how long a storage server will wait for a requested version to be available before failing. If you try to read from a storage server and the version is already available, there will be no wait. If the version becomes available at some point before that second has elapsed, then it will proceed with the operation once the version becomes available. If the version fails to become available, it will send an error back to the client once the 1 second has elapsed (but not before). In a default retry loop, the client will perform a 10ms delay in response to that error and then try again.

Ah got ya, thanks for the explanation.

I’m starting to get a lot of future_version in a specific unit test (that creates a lot of contention), and I noticed something strange: basically any time a read fails with future_version (1009), the next attempt reusing the transaction (after calling on_error) while see the same read version as the previous attempt, instead of being given a new one. For other error codes, the read version is changed, but not for future_version errors.

FDB version is 6.2.7

Description of the test

I have a test that checks the impact of a naive implementation of a queue that generates tons of conflicts on pops.

The test start 4 threads that will all try to pop from the same queue concurrently: They do a get_range with limit=1 on the head of the queue, and then clear the key to β€œpop” it. This obviously creates a lot of conflicts, and I do get a lot of not_committed errors, which is expected in this test.

But a test that tries to push/pop about 400 items concurrently will almost always timeout after ~30 sec. The test is run locally on a very fast server with 10 processes spread on 4 dedicated NVME drives. I expect the test to complete in less than a couple of seconds, but it randomly takes between 20sec (a pass but very slow) up to 30sec+ which is the execution timeout of the test.

I enabled the transaction logs and found something that happens whenever there is a future_version error: the get_range takes about 1 sec to fail (which is expected given the explanation above), and then the retry loop calls on_error.

But when the transaction starts again, the read version has NOT changed! The transaction will retry using the same read version it got about 1 second before. If the transaction fails with a not_committed then the next call to get_read_version returns a NEW value.

I’ve seen multiple consecutive future_version errors (each with a 1 sec penalty) all retrying using the same old read version until I get a not_committed error and only then the next retry will have a fresh read version.

If the transaction stalled for 1 sec or due to a future_version errors β€œretries” using the same old read version, there is no way it will ever be able to commit if there is conflicts, meaning that there must be at least N retries with the last being something other than future_version, before the transaction has a real chance of completing.

When the test runs, there is absolutely no other activity on the cluster. All other benchmarks that I’ve run that don’t generate a lot of conflicts are very fast (reads take about 300-500Β΅s in practice).

see next comment for the rest, the log exceeds the max message size in the forum

Sample log

Below is a sample transaction that really struggles to make progress (and takes 15 sec in total for 19 attempts)

You can see attempts #1 and #2 fail with not_committed (expecting given the test), but they always get a new read version (7704868432291, 7704868445168) after the retry.

But then, attempt #3 (which got rv 7704868461488) fails with future_version and a 1 sec penalty. attempt #4 gets the same read version, and also fail with future_version + 1 sec penalty. Attempt #5 is still stuck at rv 7704868461488 but then it fails with a not_commited. Attempt #6 now finally gets a new read version 7704870508521 and is able to make progress… only to conflict again.

There is the same sequence for attempts #8, #9, #10 and #11 that all get the same read version 7704872269950 with an elapsed time of 3+ sec between them (meaning that attempt #11 has less than 2 sec of life remaining, and must pray that nobody conflicted with him in the last 3 seconds, which has probability 0 given this test).

This chain of future_version/not_committed error continues until the transaction aborts after a timeout.

note: the code is using my experimental version of the Directory Layer which uses \xFF/metadataVersion and its own metadata key to perform caching, but all calls are cache HITs.

Transaction #476 (read/write, 168 operations, '#' = 500.0 ms, started 18:00:22.3291433Z, ended 18:00:37.3301214Z)
β”Œ  oper. ┬────────────────────────────────┬──── start ──── end ── duration ──┬─ sent  recv ┐
β”‚ 0   // β”‚                                β”‚ T+  0.002                        β”‚     -     - β”‚ // Pop()
β”‚:0   rv β”‚ `                              β”‚ T+  0.006 ~   0.757 (    751 Β΅s) β”‚             β”‚ GetReadVersion => 7704868432291
β”‚ 1   G  β”‚ `                              β”‚ T+  0.769 ~   0.819 (     50 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 2   // β”‚ `                              β”‚ T+  0.823                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:2   rc β”‚ `                              β”‚ T+  0.827 ~   0.829 (      2 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 3   // β”‚ `                              β”‚ T+  0.833                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:3   R  β”‚ `                              β”‚ T+  0.840 ~   1.045 (    205 Β΅s) β”‚    10    24 β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => 1 result(s), has_more
β”‚ 4   c  β”‚ `                              β”‚ T+  1.049 ~   1.054 (      5 Β΅s) β”‚    18       β”‚ Clear (284, 187, @7704868082010-1#1)
β”‚ 5  !CoΒ°β”‚ `                              β”‚ T+  1.062 ~   3.861 (  2,799 Β΅s) β”‚             β”‚ Commit => [NotCommitted] Transaction not committed due to conflict with another transaction
β”‚ 6   ErΒ°β”‚ `                              β”‚ T+  3.908 ~   9.533 (  5,625 Β΅s) β”‚             β”‚ OnError NotCommitted (1020)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #2 ==
β”‚ 7   // β”‚ `                              β”‚ T+  9.538                        β”‚     -     - β”‚ // Pop()
β”‚:7   rvΒ°β”‚ `                              β”‚ T+  9.542 ~  10.667 (  1,125 Β΅s) β”‚             β”‚ GetReadVersion => 7704868445168
β”‚ 8   G  β”‚ `                              β”‚ T+ 10.681 ~  10.704 (     23 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 9   // β”‚ `                              β”‚ T+ 10.709                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:9   rc β”‚ `                              β”‚ T+ 10.713 ~  10.714 (      1 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 10  // β”‚ `                              β”‚ T+ 10.720                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:10  R  β”‚ `                              β”‚ T+ 10.724 ~  10.897 (    173 Β΅s) β”‚    10    24 β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => 1 result(s), has_more
β”‚ 11  c  β”‚ `                              β”‚ T+ 10.900 ~  10.902 (      2 Β΅s) β”‚    18       β”‚ Clear (284, 187, @7704868091146-1#1)
β”‚ 12 !CoΒ°β”‚ `                              β”‚ T+ 10.908 ~  13.815 (  2,908 Β΅s) β”‚             β”‚ Commit => [NotCommitted] Transaction not committed due to conflict with another transaction
β”‚ 13  Er*β”‚ `                              β”‚ T+ 13.841 ~  31.711 ( 17,870 Β΅s) β”‚             β”‚ OnError NotCommitted (1020)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #3 ==
β”‚ 14  // β”‚ `                              β”‚ T+ 31.717                        β”‚     -     - β”‚ // Pop()
β”‚:14  rv β”‚ `                              β”‚ T+ 31.724 ~  32.694 (    970 Β΅s) β”‚             β”‚ GetReadVersion => 7704868461488
β”‚ 15  G  β”‚ `                              β”‚ T+ 32.705 ~  32.723 (     18 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 16  // β”‚ `                              β”‚ T+ 32.729                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:16  rc β”‚ `                              β”‚ T+ 32.733 ~  32.735 (      2 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 17  // β”‚ `                              β”‚ T+ 32.741                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:17 !R *β”‚ &#.                            β”‚ T+ 32.745 ~ 1033.888 (1,001,143 Β΅s) β”‚    10       β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => [FutureVersion] Request for future version
β”‚ 18  Er*β”‚ __`                            β”‚ T+1033.960 ~ 1044.018 ( 10,058 Β΅s) β”‚             β”‚ OnError FutureVersion (1009)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #4 ==
β”‚ 19  // β”‚ °°`                            β”‚ T+1044.020                        β”‚     -     - β”‚ // Pop()
β”‚:19  rv β”‚ °°`                            β”‚ T+1044.022 ~ 1044.221 (    199 Β΅s) β”‚             β”‚ GetReadVersion => 7704868461488
β”‚ 20  G  β”‚ °°`                            β”‚ T+1044.241 ~ 1044.281 (     40 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 21  // β”‚ °°`                            β”‚ T+1044.284                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:21  rc β”‚ °°`                            β”‚ T+1044.287 ~ 1044.292 (      6 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 22  // β”‚ °°`                            β”‚ T+1044.297                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:22 !R *β”‚ °°&#.                          β”‚ T+1044.301 ~ 2045.250 (1,000,949 Β΅s) β”‚    10       β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => [FutureVersion] Request for future version
β”‚ 23  Er*β”‚ °°__`                          β”‚ T+2045.537 ~ 2055.783 ( 10,247 Β΅s) β”‚             β”‚ OnError FutureVersion (1009)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #5 ==
β”‚ 24  // β”‚ °°°°`                          β”‚ T+2055.818                        β”‚     -     - β”‚ // Pop()
β”‚:24  rvΒ°β”‚ °°°°`                          β”‚ T+2055.854 ~ 2056.970 (  1,116 Β΅s) β”‚             β”‚ GetReadVersion => 7704868461488
β”‚ 25  G  β”‚ °°°°`                          β”‚ T+2057.126 ~ 2057.257 (    131 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 26  // β”‚ °°°°`                          β”‚ T+2057.356                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:26  rc β”‚ °°°°`                          β”‚ T+2057.426 ~ 2057.433 (      6 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 27  // β”‚ °°°°`                          β”‚ T+2057.575                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:27  R *β”‚ °°°°`                          β”‚ T+2057.655 ~ 2071.915 ( 14,260 Β΅s) β”‚    10    24 β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => 1 result(s), has_more
β”‚ 28  c  β”‚ °°°°`                          β”‚ T+2071.941 ~ 2071.956 (     14 Β΅s) β”‚    18       β”‚ Clear (284, 187, @7704868098954-1#1)
β”‚ 29 !CoΒ°β”‚ °°°°`                          β”‚ T+2072.013 ~ 2075.024 (  3,011 Β΅s) β”‚             β”‚ Commit => [NotCommitted] Transaction not committed due to conflict with another transaction
β”‚ 30  ErΒ°β”‚ °°°°`                          β”‚ T+2075.129 ~ 2082.753 (  7,624 Β΅s) β”‚             β”‚ OnError NotCommitted (1020)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #6 ==
β”‚ 31  // β”‚ °°°°`                          β”‚ T+2082.762                        β”‚     -     - β”‚ // Pop()
β”‚:31  rvΒ°β”‚ °°°°`                          β”‚ T+2082.775 ~ 2083.944 (  1,170 Β΅s) β”‚             β”‚ GetReadVersion => 7704870508521
β”‚ 32  G  β”‚ °°°°`                          β”‚ T+2083.978 ~ 2084.030 (     53 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 33  // β”‚ °°°°`                          β”‚ T+2084.057                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:33  rc β”‚ °°°°`                          β”‚ T+2084.083 ~ 2084.090 (      6 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 34  // β”‚ °°°°`                          β”‚ T+2084.110                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:34 !R *β”‚ °°°°$#:                        β”‚ T+2084.132 ~ 3084.683 (1,000,550 Β΅s) β”‚    10       β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => [FutureVersion] Request for future version
β”‚ 35  Er*β”‚ °°°°__`                        β”‚ T+3084.786 ~ 3095.425 ( 10,639 Β΅s) β”‚             β”‚ OnError FutureVersion (1009)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #7 ==
β”‚ 36  // β”‚ °°°°°°`                        β”‚ T+3095.433                        β”‚     -     - β”‚ // Pop()
β”‚:36  rv β”‚ °°°°°°`                        β”‚ T+3095.442 ~ 3095.863 (    422 Β΅s) β”‚             β”‚ GetReadVersion => 7704870508521
β”‚ 37  G  β”‚ °°°°°°`                        β”‚ T+3095.882 ~ 3095.930 (     48 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 38  // β”‚ °°°°°°`                        β”‚ T+3095.955                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:38  rc β”‚ °°°°°°`                        β”‚ T+3095.976 ~ 3095.984 (      7 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 39  // β”‚ °°°°°°`                        β”‚ T+3095.998                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:39  R *β”‚ °°°°°°$x                       β”‚ T+3096.018 ~ 3781.200 (685,182 Β΅s) β”‚    10    24 β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => 1 result(s), has_more
β”‚ 40  c  β”‚ °°°°°°_`                       β”‚ T+3781.216 ~ 3781.221 (      5 Β΅s) β”‚    18       β”‚ Clear (284, 187, @7704868098954-2#1)
β”‚ 41 !CoΒ°β”‚ °°°°°°_`                       β”‚ T+3781.279 ~ 3784.630 (  3,351 Β΅s) β”‚             β”‚ Commit => [NotCommitted] Transaction not committed due to conflict with another transaction
β”‚ 42  Er*β”‚ °°°°°°_:                       β”‚ T+3784.698 ~ 3864.088 ( 79,391 Β΅s) β”‚             β”‚ OnError NotCommitted (1020)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #8 ==
β”‚ 43  // β”‚ °°°°°°°`                       β”‚ T+3864.094                        β”‚     -     - β”‚ // Pop()
β”‚:43  rvΒ°β”‚ °°°°°°°`                       β”‚ T+3864.098 ~ 3865.240 (  1,142 Β΅s) β”‚             β”‚ GetReadVersion => 7704872269950
β”‚ 44  G  β”‚ °°°°°°°`                       β”‚ T+3865.251 ~ 3865.269 (     18 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 45  // β”‚ °°°°°°°`                       β”‚ T+3865.273                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:45  rc β”‚ °°°°°°°`                       β”‚ T+3865.278 ~ 3865.280 (      2 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 46  // β”‚ °°°°°°°`                       β”‚ T+3865.286                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:46 !R *β”‚ °°°°°°°;#X                     β”‚ T+3865.290 ~ 4865.755 (1,000,466 Β΅s) β”‚    10       β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => [FutureVersion] Request for future version
β”‚ 47  Er*β”‚ °°°°°°°__`                     β”‚ T+4865.841 ~ 4876.536 ( 10,695 Β΅s) β”‚             β”‚ OnError FutureVersion (1009)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #9 ==
β”‚ 48  // β”‚ °°°°°°°°°`                     β”‚ T+4876.547                        β”‚     -     - β”‚ // Pop()
β”‚:48  rvΒ°β”‚ °°°°°°°°°`                     β”‚ T+4876.557 ~ 4877.756 (  1,200 Β΅s) β”‚             β”‚ GetReadVersion => 7704872269950
β”‚ 49  G  β”‚ °°°°°°°°°`                     β”‚ T+4877.782 ~ 4877.825 (     43 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 50  // β”‚ °°°°°°°°°`                     β”‚ T+4877.834                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:50  rc β”‚ °°°°°°°°°`                     β”‚ T+4877.841 ~ 4877.844 (      3 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 51  // β”‚ °°°°°°°°°`                     β”‚ T+4877.854                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:51 !R *β”‚ °°°°°°°°°:#$                   β”‚ T+4877.862 ~ 5878.525 (1,000,663 Β΅s) β”‚    10       β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => [FutureVersion] Request for future version
β”‚ 52  Er*β”‚ °°°°°°°°°__`                   β”‚ T+5878.599 ~ 5889.372 ( 10,773 Β΅s) β”‚             β”‚ OnError FutureVersion (1009)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #10 ==
β”‚ 53  // β”‚ °°°°°°°°°°°`                   β”‚ T+5889.377                        β”‚     -     - β”‚ // Pop()
β”‚:53  rvΒ°β”‚ °°°°°°°°°°°`                   β”‚ T+5889.391 ~ 5890.618 (  1,227 Β΅s) β”‚             β”‚ GetReadVersion => 7704872269950
β”‚ 54  G  β”‚ °°°°°°°°°°°`                   β”‚ T+5890.641 ~ 5890.676 (     35 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 55  // β”‚ °°°°°°°°°°°`                   β”‚ T+5890.688                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:55  rc β”‚ °°°°°°°°°°°`                   β”‚ T+5890.696 ~ 5890.700 (      4 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 56  // β”‚ °°°°°°°°°°°`                   β”‚ T+5890.710                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:56 !R *β”‚ °°°°°°°°°°°:#$                 β”‚ T+5890.719 ~ 6891.518 (1,000,799 Β΅s) β”‚    10       β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => [FutureVersion] Request for future version
β”‚ 57  Er*β”‚ °°°°°°°°°°°__`                 β”‚ T+6891.600 ~ 6902.279 ( 10,678 Β΅s) β”‚             β”‚ OnError FutureVersion (1009)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #11 ==
β”‚ 58  // β”‚ °°°°°°°°°°°°°`                 β”‚ T+6902.285                        β”‚     -     - β”‚ // Pop()
β”‚:58  rvΒ°β”‚ °°°°°°°°°°°°°`                 β”‚ T+6902.290 ~ 6903.534 (  1,244 Β΅s) β”‚             β”‚ GetReadVersion => 7704872269950
β”‚ 59  G  β”‚ °°°°°°°°°°°°°`                 β”‚ T+6903.545 ~ 6903.565 (     20 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 60  // β”‚ °°°°°°°°°°°°°`                 β”‚ T+6903.570                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:60  rc β”‚ °°°°°°°°°°°°°`                 β”‚ T+6903.575 ~ 6903.577 (      2 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 61  // β”‚ °°°°°°°°°°°°°`                 β”‚ T+6903.586                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:61  R *β”‚ °°°°°°°°°°°°°:#;               β”‚ T+6903.590 ~ 7643.556 (739,966 Β΅s) β”‚    10    24 β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => 1 result(s), has_more
β”‚ 62  c  β”‚ °°°°°°°°°°°°°__`               β”‚ T+7643.565 ~ 7643.573 (      8 Β΅s) β”‚    18       β”‚ Clear (284, 187, @7704868116986-0#1)
β”‚ 63 !CoΒ°β”‚ °°°°°°°°°°°°°__`               β”‚ T+7643.587 ~ 7647.423 (  3,836 Β΅s) β”‚             β”‚ Commit => [NotCommitted] Transaction not committed due to conflict with another transaction
β”‚ 64  Er*β”‚ °°°°°°°°°°°°°__:               β”‚ T+7647.450 ~ 7758.562 (111,112 Β΅s) β”‚             β”‚ OnError NotCommitted (1020)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #12 ==
β”‚ 65  // β”‚ °°°°°°°°°°°°°°°`               β”‚ T+7758.573                        β”‚     -     - β”‚ // Pop()
β”‚:65  rvΒ°β”‚ °°°°°°°°°°°°°°°`               β”‚ T+7758.583 ~ 7760.595 (  2,012 Β΅s) β”‚             β”‚ GetReadVersion => 7704876080098
β”‚ 66  G  β”‚ °°°°°°°°°°°°°°°`               β”‚ T+7760.609 ~ 7760.641 (     32 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 67  // β”‚ °°°°°°°°°°°°°°°`               β”‚ T+7760.649                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:67  rc β”‚ °°°°°°°°°°°°°°°`               β”‚ T+7760.655 ~ 7760.658 (      2 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 68  // β”‚ °°°°°°°°°°°°°°°`               β”‚ T+7760.667                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:68 !R *β”‚ °°°°°°°°°°°°°°°=#=             β”‚ T+7760.672 ~ 8761.649 (1,000,977 Β΅s) β”‚    10       β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => [FutureVersion] Request for future version
β”‚ 69  Er*β”‚ °°°°°°°°°°°°°°°__`             β”‚ T+8761.729 ~ 8772.444 ( 10,715 Β΅s) β”‚             β”‚ OnError FutureVersion (1009)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #13 ==
β”‚ 70  // β”‚ °°°°°°°°°°°°°°°°°`             β”‚ T+8772.455                        β”‚     -     - β”‚ // Pop()
β”‚:70  rvΒ°β”‚ °°°°°°°°°°°°°°°°°`             β”‚ T+8772.466 ~ 8773.655 (  1,189 Β΅s) β”‚             β”‚ GetReadVersion => 7704876080098
β”‚ 71  G  β”‚ °°°°°°°°°°°°°°°°°`             β”‚ T+8773.679 ~ 8773.714 (     36 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 72  // β”‚ °°°°°°°°°°°°°°°°°`             β”‚ T+8773.724                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:72  rc β”‚ °°°°°°°°°°°°°°°°°`             β”‚ T+8773.731 ~ 8773.734 (      3 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 73  // β”‚ °°°°°°°°°°°°°°°°°`             β”‚ T+8773.743                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:73  R *β”‚ °°°°°°°°°°°°°°°°°=#+           β”‚ T+8773.754 ~ 9706.064 (932,309 Β΅s) β”‚    10    24 β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => 1 result(s), has_more
β”‚ 74  c  β”‚ °°°°°°°°°°°°°°°°°__`           β”‚ T+9706.070 ~ 9706.078 (      7 Β΅s) β”‚    18       β”‚ Clear (284, 187, @7704868116986-1#1)
β”‚ 75 !CoΒ°β”‚ °°°°°°°°°°°°°°°°°__`           β”‚ T+9706.092 ~ 9709.799 (  3,707 Β΅s) β”‚             β”‚ Commit => [NotCommitted] Transaction not committed due to conflict with another transaction
β”‚ 76  Er*β”‚ °°°°°°°°°°°°°°°°°__x           β”‚ T+9709.821 ~ 9995.059 (285,238 Β΅s) β”‚             β”‚ OnError NotCommitted (1020)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #14 ==
β”‚ 77  // β”‚ °°°°°°°°°°°°°°°°°°°`           β”‚ T+9995.070                        β”‚     -     - β”‚ // Pop()
β”‚:77  rvΒ°β”‚ °°°°°°°°°°°°°°°°°°°`           β”‚ T+9995.079 ~ 9996.248 (  1,169 Β΅s) β”‚             β”‚ GetReadVersion => 7704878161149
β”‚ 78  G  β”‚ °°°°°°°°°°°°°°°°°°°`           β”‚ T+9996.274 ~ 9996.352 (     78 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 79  // β”‚ °°°°°°°°°°°°°°°°°°°`           β”‚ T+9996.362                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:79  rc β”‚ °°°°°°°°°°°°°°°°°°°`           β”‚ T+9996.374 ~ 9996.380 (      6 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 80  // β”‚ °°°°°°°°°°°°°°°°°°°`           β”‚ T+9996.390                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:80 !R *β”‚ °°°°°°°°°°°°°°°°°°°`##         β”‚ T+9996.398 ~ 10997.169 (1,000,772 Β΅s) β”‚    10       β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => [FutureVersion] Request for future version
β”‚ 81  Er*β”‚ °°°°°°°°°°°°°°°°°°°__``        β”‚ T+10997.253 ~ 11007.901 ( 10,647 Β΅s) β”‚             β”‚ OnError FutureVersion (1009)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #15 ==
β”‚ 82  // β”‚ °°°°°°°°°°°°°°°°°°°°°°`        β”‚ T+11007.906                        β”‚     -     - β”‚ // Pop()
β”‚:82  rvΒ°β”‚ °°°°°°°°°°°°°°°°°°°°°°`        β”‚ T+11007.915 ~ 11014.543 (  6,627 Β΅s) β”‚             β”‚ GetReadVersion => 7704878161149
β”‚ 83  G  β”‚ °°°°°°°°°°°°°°°°°°°°°°`        β”‚ T+11014.557 ~ 11014.581 (     25 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 84  // β”‚ °°°°°°°°°°°°°°°°°°°°°°`        β”‚ T+11014.591                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:84  rc β”‚ °°°°°°°°°°°°°°°°°°°°°°`        β”‚ T+11014.597 ~ 11014.599 (      2 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 85  // β”‚ °°°°°°°°°°°°°°°°°°°°°°`        β”‚ T+11014.609                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:85 !R *β”‚ °°°°°°°°°°°°°°°°°°°°°°##`      β”‚ T+11014.614 ~ 12015.916 (1,001,302 Β΅s) β”‚    10       β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => [FutureVersion] Request for future version
β”‚ 86  Er*β”‚ °°°°°°°°°°°°°°°°°°°°°°__`      β”‚ T+12015.998 ~ 12026.704 ( 10,707 Β΅s) β”‚             β”‚ OnError FutureVersion (1009)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #16 ==
β”‚ 87  // β”‚ °°°°°°°°°°°°°°°°°°°°°°°°`      β”‚ T+12026.715                        β”‚     -     - β”‚ // Pop()
β”‚:87  rvΒ°β”‚ °°°°°°°°°°°°°°°°°°°°°°°°`      β”‚ T+12026.724 ~ 12034.234 (  7,510 Β΅s) β”‚             β”‚ GetReadVersion => 7704878161149
β”‚ 88  G  β”‚ °°°°°°°°°°°°°°°°°°°°°°°°`      β”‚ T+12034.254 ~ 12034.312 (     58 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 89  // β”‚ °°°°°°°°°°°°°°°°°°°°°°°°`      β”‚ T+12034.321                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:89  rc β”‚ °°°°°°°°°°°°°°°°°°°°°°°°`      β”‚ T+12034.329 ~ 12034.333 (      4 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 90  // β”‚ °°°°°°°°°°°°°°°°°°°°°°°°`      β”‚ T+12034.344                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:90 !R *β”‚ °°°°°°°°°°°°°°°°°°°°°°°°&#.    β”‚ T+12034.351 ~ 13035.306 (1,000,955 Β΅s) β”‚    10       β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => [FutureVersion] Request for future version
β”‚ 91  Er*β”‚ °°°°°°°°°°°°°°°°°°°°°°°°__`    β”‚ T+13035.392 ~ 13046.062 ( 10,670 Β΅s) β”‚             β”‚ OnError FutureVersion (1009)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #17 ==
β”‚ 92  // β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°`    β”‚ T+13046.069                        β”‚     -     - β”‚ // Pop()
β”‚:92  rvΒ°β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°`    β”‚ T+13046.081 ~ 13054.302 (  8,221 Β΅s) β”‚             β”‚ GetReadVersion => 7704878161149
β”‚ 93  G  β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°`    β”‚ T+13054.322 ~ 13054.357 (     35 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 94  // β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°`    β”‚ T+13054.368                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:94  rc β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°`    β”‚ T+13054.376 ~ 13054.380 (      4 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 95  // β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°`    β”‚ T+13054.389                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:95  R *β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°&.   β”‚ T+13054.397 ~ 13552.921 (498,524 Β΅s) β”‚    10    24 β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => 1 result(s), has_more
β”‚ 96  c  β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°_`   β”‚ T+13552.930 ~ 13552.935 (      6 Β΅s) β”‚    18       β”‚ Clear (284, 187, @7704868124964-0#1)
β”‚ 97 !CoΒ°β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°_`   β”‚ T+13552.951 ~ 13559.356 (  6,406 Β΅s) β”‚             β”‚ Commit => [NotCommitted] Transaction not committed due to conflict with another transaction
β”‚ 98  Er*β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°_+   β”‚ T+13559.382 ~ 13754.705 (195,323 Β΅s) β”‚             β”‚ OnError NotCommitted (1020)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #18 ==
β”‚ 99  // β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°`   β”‚ T+13754.716                        β”‚     -     - β”‚ // Pop()
β”‚:99  rvΒ°β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°`   β”‚ T+13754.726 ~ 13755.952 (  1,226 Β΅s) β”‚             β”‚ GetReadVersion => 7704882051983
β”‚ 100 G  β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°`   β”‚ T+13755.967 ~ 13756.001 (     34 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 101 // β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°`   β”‚ T+13756.011                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:101 rc β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°`   β”‚ T+13756.018 ~ 13756.021 (      3 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 102 // β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°`   β”‚ T+13756.030                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:102 R *β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°.   β”‚ T+13756.037 ~ 13784.236 ( 28,199 Β΅s) β”‚    10    24 β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => 1 result(s), has_more
β”‚ 103 c  β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°`   β”‚ T+13784.245 ~ 13784.252 (      8 Β΅s) β”‚    18       β”‚ Clear (284, 187, @7704868143910-0#1)
β”‚ 104!CoΒ°β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°`   β”‚ T+13784.268 ~ 13788.198 (  3,931 Β΅s) β”‚             β”‚ Commit => [NotCommitted] Transaction not committed due to conflict with another transaction
β”‚ 105 Er*β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°++  β”‚ T+13788.243 ~ 14179.097 (390,854 Β΅s) β”‚             β”‚ OnError NotCommitted (1020)
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ == Attempt #19 ==
β”‚ 106 // β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°°`  β”‚ T+14179.108                        β”‚     -     - β”‚ // Pop()
β”‚:106 rvΒ°β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°°`  β”‚ T+14179.118 ~ 14180.315 (  1,198 Β΅s) β”‚             β”‚ GetReadVersion => 7704882220744
β”‚ 107 G  β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°°`  β”‚ T+14180.340 ~ 14180.380 (     40 Β΅s) β”‚    17    10 β”‚ Get <FF>/metadataVersion => @7704867927687-0
β”‚ 108 // β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°°`  β”‚ T+14180.391                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) cache context still valid (GMV hit @7704867927687-0)
β”‚:108 rc β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°°`  β”‚ T+14180.399 ~ 14180.403 (      4 Β΅s) β”‚    29       β”‚ AddConflictRange Read! (|Directory|, `<FE>`, `metadata`) <= k < (|Directory|, `<FE>`, `metadata`, null)
β”‚ 109 // β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°°`  β”‚ T+14180.415                        β”‚     -     - β”‚ // DirectoryLayer(<empty>) subspace HIT for Tests/Fdb/MUTSUMI/queue: DirectorySubspace(path=Tests/Fdb/MUTSUMI/queue, prefix=(284, 187))
β”‚:109!R *β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°°x# β”‚ T+14180.423 ~ 15000.464 (820,041 Β΅s) β”‚    10       β”‚ GetRange fGE{(284, 187)} <= k < fGE{(284, 188)}, limit(1), Exact => [TransactionTimedOut] Operation aborted because the transaction timed out
β”‚ 110!Er β”‚ °°°°°°°°°°°°°°°°°°°°°°°°°°°°_` β”‚ T+15000.549 ~ 15000.706 (    157 Β΅s) β”‚             β”‚ OnError TransactionTimedOut (1031) => [TransactionTimedOut] Operation aborted because the transaction timed out
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
> Read 382 bytes and Committed 65 bytes in 15,000.751 ms and 19 attempt(s)

Pointed the test at a different cluster (8 hosts, 64 processes, 16 disks) and got mostly the same behavior, though slightly faster: test usually completes in 2 sec to 15 seconds, depending on if the issue occurs. There is a high volatility in the total time, and the median completion time seems to be at 6-7 sec for this server, vs 15-20 sec for the other.

I’ve added more timestamps to the logs, and it looks like there are multiple occurrence of a 3 to 4 sec long freeze were the versionstamp does not appear to move at all in the cluster, and were every transaction in flight at that time gets β€œfuture version” errors, stall for 1 sec, reset, get the same rv, stall again, etc… for 3 or 4 times until they all unfreeze.

I’m also seeing the the delta of read version, when these β€œfreeze” occurs, is still corelated to the elapsed time: for example a transaction that stalls 3 times with a future_version, followed by a not_committed, will see read versions (N, N, N, N + 3 millions), so the value is still incremented in the background, but the transaction will see the same value for more than 3 seconds across 3 retries.

I’m very skeptical that there is a slow down due to network (400 keys of a dozen bytes each), cpu (24c/48t for first server, 8 x 4c/8t for the second), disks (4 to 16 NVME drives) or server OS (windows and linux).

The things in common here seem to be multiple transactions involved on the same read and write hotspot with β€œa lot” of conflicts (only 4 threads), the same application+binding code, the use of versionstamped keys, and the same client os (windows).

I don’t think this would have anything to do with what onError does in response to the future_version error, as the behavior should be the same as with transaction_too_old. Possibly what’s happening here is that the non-advancing versions being handed out by the proxy are somehow correlated with the storage servers also not advancing their versions.

Do you see any signs of temporary saturation on any processes, or maybe some long SlowTask events that would help identify where things get stuck?

Sorry for not replying sooner.

I’ve re-run the test by deleting all the logs, starting all the process, running the code, stopping all the process, and looking for severity 30 events.
But it seems that all these events are happening on the restart of each process, before the test code was started, so I don’t thunk they are related… ?

<Event Severity="10" Time="1578948415.064290" Type="Net2Starting" ID="0000000000000000" Machine="127.0.0.1:4502" LogGroup="default" />
...
<Event Severity="30" Time="1578948415.510560" Type="SlowTask" ID="0000000000000000" TaskID="20001" MClocks="1651.31" Duration="0.434533" SampleRate="1" NumYields="0" Machine="127.0.0.1:4500" LogGroup="default" />
<Event Severity="30" Time="1578948415.221846" Type="SlowTask" ID="0000000000000000" TaskID="7010" MClocks="1300.23" Duration="0.342147" SampleRate="1" NumYields="0" Machine="127.0.0.1:4501" LogGroup="default" Roles="TL" />
<Event Severity="30" Time="1578948415.106004" Type="SlowTask" ID="0000000000000000" TaskID="7010" MClocks="1340.82" Duration="0.35283" SampleRate="1" NumYields="0" Machine="127.0.0.1:4502" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1578948415.110392" Type="SlowTask" ID="0000000000000000" TaskID="7010" MClocks="1466.41" Duration="0.385879" SampleRate="1" NumYields="0" Machine="127.0.0.1:4503" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1578948415.123878" Type="SlowTask" ID="0000000000000000" TaskID="7010" MClocks="1492.39" Duration="0.392713" SampleRate="1" NumYields="0" Machine="127.0.0.1:4504" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1578948415.122717" Type="SlowTask" ID="0000000000000000" TaskID="7010" MClocks="1488.06" Duration="0.391574" SampleRate="1" NumYields="0" Machine="127.0.0.1:4505" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1578948415.135932" Type="SlowTask" ID="0000000000000000" TaskID="7010" MClocks="1471.37" Duration="0.387184" SampleRate="1" NumYields="0" Machine="127.0.0.1:4506" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1578948415.135375" Type="SlowTask" ID="0000000000000000" TaskID="7010" MClocks="1483.58" Duration="0.390396" SampleRate="1" NumYields="0" Machine="127.0.0.1:4507" LogGroup="default" Roles="SS" />
<Event Severity="30" Time="1578948415.531135" Type="SlowTask" ID="0000000000000000" TaskID="20001" MClocks="1565.85" Duration="0.412045" SampleRate="1" NumYields="0" Machine="127.0.0.1:4508" LogGroup="default" />

Looking further in the logs for events with severity 20, I see a bunch of SlowSSLoopx100 events but all with an elapsed time of about 100ms, while the code seems to consider 50ms the limit…

<Event Severity="20" Time="1578948419.327296" Type="SlowSSLoopx100" ID="f651c7fb927716a2" Elapsed="0.10207" Machine="127.0.0.1:4502" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1578948439.310225" Type="SlowSSLoopx100" ID="f651c7fb927716a2" Elapsed="0.102088" Machine="127.0.0.1:4502" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1578948443.278692" Type="SlowSSLoopx100" ID="f651c7fb927716a2" Elapsed="0.109086" Machine="127.0.0.1:4502" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1578948450.021337" Type="SlowSSLoopx100" ID="f651c7fb927716a2" Elapsed="0.102057" Machine="127.0.0.1:4502" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1578948469.098325" Type="SlowSSLoopx100" ID="f651c7fb927716a2" Elapsed="0.099087" Machine="127.0.0.1:4502" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1578948475.092963" Type="SlowSSLoopx100" ID="f651c7fb927716a2" Elapsed="0.106093" Machine="127.0.0.1:4502" LogGroup="default" Roles="SS" />
<Event Severity="20" Time="1578948484.339430" Type="N2_ReadProbeError" ID="d10605e22177f333" SuppressedEventCount="0" Message="10054" Machine="127.0.0.1:4502" LogGroup="default" Roles="SS" />

I also see one N2_ReadProbeError and a bunch of RecruiteStorageNotAvailable events every few seconds on the process with roles CC, CD, MP.

<Event Severity="20" Time="1578948415.514550" Type="ClusterWatchDatabaseRetrying" ID="768e2bbd7976ffa5" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD" />
<Event Severity="20" Time="1578948416.890623" Type="RecruitStorageNotAvailable" ID="768e2bbd7976ffa5" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP" />
<Event Severity="20" Time="1578948417.146816" Type="RecruitStorageNotAvailable" ID="768e2bbd7976ffa5" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP" />
<Event Severity="20" Time="1578948424.570995" Type="RecruitStorageNotAvailable" ID="768e2bbd7976ffa5" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP" />
<Event Severity="20" Time="1578948425.062570" Type="RecruitStorageNotAvailable" ID="768e2bbd7976ffa5" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP" />
<Event Severity="20" Time="1578948425.062570" Type="RecruitStorageNotAvailable" ID="768e2bbd7976ffa5" Error="no_more_servers" ErrorDescription="Not enough physical servers available" ErrorCode="1008" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP" />
<Event Severity="20" Time="1578948484.347929" Type="N2_ReadProbeError" ID="83ef98ea5bafa99b" SuppressedEventCount="0" Message="10054" Machine="127.0.0.1:4500" LogGroup="default" Roles="CC,CD,MP" />

Here is a sample of the ...Metrics events for one SS process in the middle of the test, I’m not seeing anything obvious:

<Event Severity="10" Time="1578948455.068136" Type="DiskMetrics" ID="f651c7fb927716a2" ReadOps="463" WriteOps="18" ReadQueue="0" WriteQueue="0" GlobalSQLiteMemoryHighWater="0" Machine="127.0.0.1:4502" LogGroup="default" Roles="SS" />
<Event Severity="10" Time="1578948455.068136" Type="SpringCleaningMetrics" ID="f651c7fb927716a2" SpringCleaningCount="39" LazyDeletePages="0" VacuumedPages="0" SpringCleaningTime="0" LazyDeleteTime="0" VacuumTime="0" Machine="127.0.0.1:4502" LogGroup="default" Roles="SS" />
<Event Severity="10" Time="1578948455.068136" Type="TransactionMetrics" ID="7edb5ada183c4e3d" Elapsed="5.00078" Cluster="" Internal="1" ReadVersions="0 0 0" LogicalUncachedReads="0 0 0" PhysicalReadRequests="0 0 0" CommittedMutations="0 0 1" CommittedMutationBytes="0 0 548" CommitStarted="0 0 3" CommitCompleted="0 0 1" TooOld="0 0 0" FutureVersions="0 0 0" NotCommitted="0 0 1" MaybeCommitted="0 0 0" ResourceConstrained="0 0 0" ProcessBehind="0 0 0" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0" MedianGRVLatency="0" MaxGRVLatency="0" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" Machine="127.0.0.1:4502" LogGroup="default" Roles="SS" />
<Event Severity="10" Time="1578948455.111179" Type="ProcessMetrics" ID="0000000000000000" Elapsed="5.00086" CPUSeconds="0.046875" MainThreadCPUSeconds="0.046875" UptimeSeconds="40.0468" Memory="53260288" ResidentMemory="55898112" UnusedAllocatedMemory="393216" MbpsSent="0.154527" MbpsReceived="0.210018" DiskTotalBytes="499971518464" DiskFreeBytes="496622583808" DiskQueueDepth="0" DiskIdleSeconds="4.93317" DiskReads="4.00035" DiskWrites="60.0053" DiskReadsCount="0" DiskWritesCount="0" DiskWriteSectors="504.044" DiskReadSectors="0" FileWrites="0" FileReads="0" CacheReadBytes="24576" CacheFinds="42" CacheWritesBlocked="24" CacheReadsBlocked="6" CachePageReadsMerged="0" CacheWrites="42" CacheReads="3738" CacheHits="3764" CacheMisses="4" CacheEvictions="0" ZoneID="b1668c8918dc7f532e7cdefbdc90dbf6" MachineID="b1668c8918dc7f532e7cdefbdc90dbf6" AIOSubmitCount="0" AIOCollectCount="0" AIOSubmitLag="0" AIODiskStall="0" CurrentConnections="6" ConnectionsEstablished="0.199966" ConnectionsClosed="0" Connec...
<Event Severity="10" Time="1578948455.111179" Type="MemoryMetrics" ID="0000000000000000" TotalMemory16="131072" ApproximateUnusedMemory16="0" ActiveThreads16="1" TotalMemory32="1441792" ApproximateUnusedMemory32="0" ActiveThreads32="1" TotalMemory64="3276800" ApproximateUnusedMemory64="0" ActiveThreads64="2" TotalMemory96="393120" ApproximateUnusedMemory96="0" ActiveThreads96="1" TotalMemory128="262144" ApproximateUnusedMemory128="0" ActiveThreads128="1" TotalMemory256="262144" ApproximateUnusedMemory256="0" ActiveThreads256="1" TotalMemory512="131072" ApproximateUnusedMemory512="0" ActiveThreads512="1" TotalMemory1024="131072" ApproximateUnusedMemory1024="0" ActiveThreads1024="1" TotalMemory2048="131072" ApproximateUnusedMemory2048="0" ActiveThreads2048="1" TotalMemory4096="5242880" ApproximateUnusedMemory4096="0" ActiveThreads4096="1" TotalMemory8192="786432" ApproximateUnusedMemory8192="393216" ActiveThreads8192="1" HugeArenaMemory="37356" Machine="127.0.0.1:4502" LogGroup="default" Roles=...
<Event Severity="10" Time="1578948455.111179" Type="NetworkMetrics" ID="0000000000000000" Elapsed="5.00086" CantSleep="0" WontSleep="0" Yields="9" YieldCalls="5177" YieldCallsTrue="4258" SlowTaskSignals="0" YieldBigStack="4256" RunLoopIterations="1452" TimersExecuted="1640" TasksExecuted="11606" ASIOEventsProcessed="1501" ReadCalls="1268" WriteCalls="627" ReadProbes="633" WriteProbes="0" PacketsRead="678" PacketsGenerated="672" WouldBlock="633" LaunchTime="0" ReactTime="0.0066216" SlowTask2M="4" SlowTask4M="1" SlowTask16M="6" SlowTask32M="2" PriorityBusy1="0.059958" SumOfSquaredPriorityBusy1="0.000228099" PriorityBusy2050="0.059958" SumOfSquaredPriorityBusy2050="0.000228099" PriorityBusy3050="0.0317917" SumOfSquaredPriorityBusy3050="2.15078e-06" PriorityBusy4050="0.0317917" SumOfSquaredPriorityBusy4050="2.15078e-06" PriorityBusy4950="0.0317917" SumOfSquaredPriorityBusy4950="2.15078e-06" PriorityBusy5050="0.00868034" SumOfSquaredPriorityBusy5050="3.29945e-07" PriorityBusy7050="0.00837016" SumO...
<Event Severity="10" Time="1578948455.111179" Type="MachineMetrics" ID="0000000000000000" Elapsed="5.00086" MbpsSent="1.9806" MbpsReceived="0.657204" OutSegs="27397" RetransSegs="7" CPUSeconds="0.0799677" TotalMemory="68609708032" CommittedMemory="35140792320" AvailableMemory="33468915712" ZoneID="b1668c8918dc7f532e7cdefbdc90dbf6" MachineID="b1668c8918dc7f532e7cdefbdc90dbf6" Machine="127.0.0.1:4502" LogGroup="default" Roles="SS" TrackLatestType="Original" />

And finally, searching for β€œFutureVersions="” in the TransactionMetrics events, I see a few of them with non-zero values

<Event Severity="10" Time="1578948481.652617" Type="TransactionMetrics" ID="bd09d7cf99bc8eb3" Elapsed="5.00059" Cluster="" Internal="1" ReadVersions="0.999881 2.3143 65" LogicalUncachedReads="2.99964 6.94289 195" PhysicalReadRequests="2.99964 6.94148 198" CommittedMutations="0 0 0" CommittedMutationBytes="0 0 0" CommitStarted="0 0 0" CommitCompleted="0 0 0" TooOld="0 0 0" FutureVersions="0.199976 0.378995 1" NotCommitted="0 0 0" MaybeCommitted="0 0 0" ResourceConstrained="0 0 0" ProcessBehind="0 0 0" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0.00102258" MedianGRVLatency="0.00113964" MaxGRVLatency="0.00121284" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" Machine="127.0.0.1:4508" LogGroup="default" Roles="DD,MP,MS,RK" />
<Event Severity="10" Time="1578948481.654633" Type="TransactionMetrics" ID="b5b48910a6b5e716" Elapsed="5.0006" Cluster="" Internal="1" ReadVersions="2.59969 5.44851 157" LogicalUncachedReads="4.59945 9.63968 295" PhysicalReadRequests="4.59945 9.64091 298" CommittedMutations="0 0 1" CommittedMutationBytes="0 0 64" CommitStarted="0 0 1" CommitCompleted="0 0 1" TooOld="0 0 0" FutureVersions="0.399952 0.890914 2" NotCommitted="0 0 0" MaybeCommitted="0 0 0" ResourceConstrained="0 0 0" ProcessBehind="0 0 0" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0.268664" MedianRowReadLatency="0.00029707" MaxRowReadLatency="0.708174" MeanGRVLatency="0.00132126" MedianGRVLatency="0.00114918" MaxGRVLatency="0.00207305" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" Machine="127.0.0.1:4508" LogGroup="default" Roles="DD,MP,MS,RK" />

<Event Severity="10" Time="1578948464.832542" Type="TransactionMetrics" ID="a99219476246e1d3" Elapsed="5.00009" Cluster="691107463418911801231975" Internal="1" ReadVersions="0.999982 1.99889 95" LogicalUncachedReads="2.79995 4.30692 422" PhysicalReadRequests="2.79995 4.31631 422" CommittedMutations="0 0 4" CommittedMutationBytes="0 0 2016" CommitStarted="0 0 3" CommitCompleted="0 0 3" TooOld="0 0 0" FutureVersions="0.599989 2.23546 9" NotCommitted="0 0 0" MaybeCommitted="0 0 0" ResourceConstrained="0 0 0" ProcessBehind="0 0 0" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0.00407577" MedianGRVLatency="0.00131536" MaxGRVLatency="0.00855803" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" Machine="127.0.0.1:54576" LogGroup="default" />
<Event Severity="10" Time="1578948469.832963" Type="TransactionMetrics" ID="a99219476246e1d3" Elapsed="5.00042" Cluster="691107463418911801231975" Internal="1" ReadVersions="0.799933 1.34061 99" LogicalUncachedReads="2.79976 3.85447 436" PhysicalReadRequests="2.79976 3.8534 436" CommittedMutations="0 0 4" CommittedMutationBytes="0 0 2016" CommitStarted="0 0 3" CommitCompleted="0 0 3" TooOld="0 0 0" FutureVersions="0.399966 1.33325 11" NotCommitted="0 0 0" MaybeCommitted="0 0 0" ResourceConstrained="0 0 0" ProcessBehind="0 0 0" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0.00141251" MedianGRVLatency="0.00119948" MaxGRVLatency="0.00204897" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" Machine="127.0.0.1:54576" LogGroup="default" />
<Event Severity="10" Time="1578948474.833692" Type="TransactionMetrics" ID="a99219476246e1d3" Elapsed="5.00073" Cluster="691107463418911801231975" Internal="1" ReadVersions="1.19982 2.24428 105" LogicalUncachedReads="2.79959 4.9792 450" PhysicalReadRequests="2.79959 5.00177 450" CommittedMutations="0 0 4" CommittedMutationBytes="0 0 2016" CommitStarted="0 0 3" CommitCompleted="0 0 3" TooOld="0 0 0" FutureVersions="0.799883 2.0941 15" NotCommitted="0 0 0" MaybeCommitted="0 0 0" ResourceConstrained="0 0 0" ProcessBehind="0 0 0" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0.00276411" MedianGRVLatency="0.00120831" MaxGRVLatency="0.0105598" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" Machine="127.0.0.1:54576" LogGroup="default" />
<Event Severity="10" Time="1578948479.833904" Type="TransactionMetrics" ID="a99219476246e1d3" Elapsed="5.00021" Cluster="691107463418911801231975" Internal="1" ReadVersions="0.799966 1.45078 109" LogicalUncachedReads="2.79988 3.88567 464" PhysicalReadRequests="2.79988 3.88551 464" CommittedMutations="0 0 4" CommittedMutationBytes="0 0 2016" CommitStarted="0 0 3" CommitCompleted="0 0 3" TooOld="0 0 0" FutureVersions="0.399983 1.43976 17" NotCommitted="0 0 0" MaybeCommitted="0 0 0" ResourceConstrained="0 0 0" ProcessBehind="0 0 0" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0.00121075" MedianGRVLatency="0.00120711" MaxGRVLatency="0.0012269" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" Machine="127.0.0.1:54576" LogGroup="default" />
<Event Severity="10" Time="1578948484.834269" Type="TransactionMetrics" ID="a99219476246e1d3" Elapsed="5.00037" Cluster="691107463418911801231975" Internal="1" ReadVersions="0.999927 1.75605 114" LogicalUncachedReads="2.7998 3.60065 478" PhysicalReadRequests="2.7998 3.6008 478" CommittedMutations="0 0 4" CommittedMutationBytes="0 0 2016" CommitStarted="0 0 3" CommitCompleted="0 0 3" TooOld="0 0 0" FutureVersions="0.399971 0.9148 19" NotCommitted="0 0 0" MaybeCommitted="0 0 0" ResourceConstrained="0 0 0" ProcessBehind="0 0 0" MeanLatency="0" MedianLatency="0" Latency90="0" Latency98="0" MaxLatency="0" MeanRowReadLatency="0" MedianRowReadLatency="0" MaxRowReadLatency="0" MeanGRVLatency="0.00119448" MedianGRVLatency="0.00119472" MaxGRVLatency="0.00120544" MeanCommitLatency="0" MedianCommitLatency="0" MaxCommitLatency="0" MeanMutationsPerCommit="0" MedianMutationsPerCommit="0" MaxMutationsPerCommit="0" MeanBytesPerCommit="0" MedianBytesPerCommit="0" MaxBytesPerCommit="0" Machine="127.0.0.1:54576" LogGroup="default" />

Updating from 6.2.7 to 6.2.10 does not resolve the issue.

I’ve noticed that all the FutureVersions="...." traces that are not 0 0 0 are happening in either the backup agent process (127.0.0.1:54576 above), in one of the stateless process that has roles DD,MP,MS,RK (127.0.0.1:4508 above), or in the client traces if I enable them. In the other cluster process traces, they are always 0 0 0.

By the way, I don’t see anything obvious in the client traces when enabled, except a bunch of metrics every now and then. I’m not sure what option I can use to log more details about the error codes encountered by the client ?

OK, while playing with the transaction options to setup logging in the client traces (did not find anything more than what I already had with my own custom logging), I just tried randomly setting random transaction options, and just found out that setting the commit_on_first_proxy transaction option makes all the future_version errors vanish, and now the test completes in about a couple of seconds with zero 1009 errors. Without this option, the test randomly fails after 20 or 30 sec and tons of future_version followed by 1 sec stalls. … ?

Any idea what exact impact this option could have? I’m running on a multi-process local cluster which, to my knowledge, only has a single proxy process anyway?

This options is used only in this place in the code: https://github.com/apple/foundationdb/blob/8a065b9da4114e921f31c9145afa7033315b2e1c/fdbclient/NativeAPI.actor.cpp#L2646

This makes the code not call loadBalance() on the result of getMasterProxies() and apparently does not trigger the stall?

As a reminder, this happens a lot locally, on a 24c/48t fast server with 4 dedicated local NVME ssd and on Windows, and somewhat frequently remotely on a 8 node linux cluster as well, and impacts all connected clients (the client running the test, but also the backup agent process which observes the same 1009 errors in its own log).

1 Like

Can you confirm this is the case? One effect of this option is to limit commits to a single proxy, and knowing that there was only one option anyway could reduce the problem space.

My mistake, I do in fact have 3 proxies in this cluster, that are all assigned to a different stateless process.

When the future version occurs, I see cpu and disk activity in the low single digit %, and I don’t see any role jumping from process to process, the cluster looks idle and stable from the outside.

I tried adding an additional stateless process, which stays with no roles assigned the entire time.

This is a snapshot of the cluster activity in the middle of a failing test:

foundationdb.conf

[fdbmonitor]
restart_delay = 20

[general]
cluster_file=C:\ProgramData\foundationdb\fdb.cluster

[fdbserver]
public_address = auto:$ID
listen_address = public
parentpid = $PID
command=C:\Program Files\foundationdb\bin\fdbserver.exe
datadir=C:\ProgramData\foundationdb\data\$ID
logdir=C:\ProgramData\foundationdb\logs
knob_min_free_space_ratio=0.001  # get around 5% min free space limitation

[backup-agent]
command=C:\Program Files\foundationdb\bin\backup_agent.exe
logdir=C:\ProgramData\foundationdb\logs

[backup-agent.1]

[fdbserver.4500]
class=stateless

[fdbserver.4501]
class=log
datadir=K:\FoundationDB\Data\$ID

[fdbserver.4502]
class=storage
datadir=H:\FoundationDB\Data\$ID

[fdbserver.4503]
class=storage
datadir=H:\FoundationDB\Data\$ID

[fdbserver.4504]
class=storage
datadir=I:\FoundationDB\Data\$ID

[fdbserver.4505]
class=storage
datadir=I:\FoundationDB\Data\$ID

[fdbserver.4506]
class=storage
datadir=J:\FoundationDB\Data\$ID

[fdbserver.4507]
class=storage
datadir=J:\FoundationDB\Data\$ID

[fdbserver.4508]
class=stateless

[fdbserver.4509]
class=stateless

[fdbserver.4510]
class=stateless

Your fdbtop images show low read and write rates (37 Hz and 9 Hz, respectively). Is that an accurate characterization of your test, or is this something that only happens in the failing case?

This only happens when it fails. The test spawns 4 threads all attempting to pop about 100 keys each as fast as possible, but all conflicting with each other. So about 400ish transaction spread over 20-30 sec (abnormal) fits with the rates we are seeing. There are no other activity in the cluster while running the test (appart from fdbtop that polls every seconds).

This is the test in question: https://github.com/Doxense/foundationdb-dotnet-client/blob/1e8d626923e7469cd06494e5941433e6c21da8f4/FoundationDB.Layers.Common.Tests/QueuesFacts.cs#L218

It runs with K = 4, and NUM = 100.

All the β€œpush” threads completes almost immediately (they just set a bunch of versionstamped keys).

All the β€œpop” threads wait a bit and then concurrently pop the queue. If I set the commit_on_first_proxy option in line 286 just before calling PopAsync(…) then the test completes in about 2 seconds. Without, it takes 20-30 sec (and usually timeouts because I have a 15 sec timeout per transaction) with about 1/4 to a 1/3 of the queue remaining.

The Push method: https://github.com/Doxense/foundationdb-dotnet-client/blob/1e8d626923e7469cd06494e5941433e6c21da8f4/FoundationDB.Layers.Common/Collections/FdbQueue`1.cs#L98

The PopAsync method: https://github.com/Doxense/foundationdb-dotnet-client/blob/1e8d626923e7469cd06494e5941433e6c21da8f4/FoundationDB.Layers.Common/Collections/FdbQueue`1.cs#L113

Ah, and may or may not be of interest: I recently started using a lot the \xff/metadataVersion key (used by the Directory Layer cache), so all transaction will at least read that key as well as the metadata version.

I think I’m reproducing this locally, so let me poke around with it a bit and see what I can find out.

1 Like

Something new just happened: I had to reboot a few times for hardware maintenance, and now the future version are not happening without setting the first proxy option, but this time I only have a single proxy process!

Comparing with the previous screenshots, the cluster seem to have opted to assign a single proxy role on the new stateless process that I added before (4510) ?

So:

  • If there is a single proxy in the cluster, no issue
  • If there are multiple proxies and using the commit_on_first_proxy option, no issue
  • If there are multiple proxies, default options, issue happens.

How was I able to end up with 3 proxies in the first time, if after a reboot I only have one? Can it be a symptom of having enough process to assign all the roles, vs not ?

We’ve found a problem where it seems the storage server can fail to properly advance its version as far as it should if it gets an empty commit (in your case, due to conflicts), and it will get stuck that way until a new commit succeeds. Transactions that do reads from the lagging storage server will then not be able to commit because of the future version problem. Without any write-only transactions, the only thing that can advance the version is the proxy committing a new empty batch, which by default it does after 2 seconds of no commits.

I’m not entirely certain why this gets better with only one proxy. My best guess is that the specific conditions required to trigger this bug are more likely when multiple proxies are submitting commits due to timing effects, etc.

We are testing a fix for this now, hopefully it can go out in our next patch release.

One thing that may explain the sudden appearance of this issue for that particular test (that exists since 2014-15?) is that previously this test was using write-only transactions and a constant prefix, while now it is using cached directory subspaces, so all transactions will read at least one key (metadata version) and create a read conflict range before commit.

Here’s the PR with the proposed fix:

1 Like

I just wanna say that I finally got around to deploying this change (I bumped from 6.2.15 -> 6.2.19) and after 15 minutes of observing the cluster it looks waaay happier. Here’s a graph of p50/p95/p100 latencies on some read only transactions on 6.2.15:

And here’s a graph after 6.2.19:

That’s a reduction from 3.8s to 383ms!

1 Like