Identifying shards associated with replica unavailability; shard selection when dropping redundancy levels

Hello all! I have a few questions for the community.

First, when encountering compromised shard replication, such as a state like

HEALING: Only one replica remains of some data

is it possible to determine the placement of the imperiled shard(s) to drive operational decision making? I understand that it’s FoundationDB’s job to manage data, but we’ve got a staging cluster in this state that’s absolutely refusing to move data / heal, and I’d like to learn more about the replica distribution.

Secondly, how intelligently are replicas chosen when dropping redundancy levels? We hit this n=1 replica state after dropping from triple to double while the cluster was reporting n=2 replicas, which was a surprise (there was plenty of other activity that may have affected the observation). This leaves me somewhat concerned about how replicas are chosen. Will FDB ensure that a replica is healthy/live before selecting it to “survive” a redundancy drop?

To get out of this stuck state I’m thinking of either adding more storage processes, or new machines with storage processes, to provide healthy storage roles as healing targets. But we’ve also considered dropping to single mode so we can exclude the problematic storage processes and then bring the redundancy back up / let it heal.

Best,
Kyle

Any thoughts?

We’ve expanded the staging cluster with three new storage processes in three new zones and are happy to see a bump up to an n=2 replica state. We plan on pursuing process excludes to encourage further healing, excluding processes whose storage actors exhibit huge data lag (e.g. 2075604805895 versions), i.e. just not functioning.

The high-level idea of FDB’s replication policy is similar to the Copysets paper: https://www.usenix.org/conference/atc13/technical-sessions/presentation/cidon
The actual implementation is more sophisticated than the paper’s.

By default, we choose a selective set of k-replica groups (we call them server teams) whose members have different zoneIds.
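
To make the idea concrete, here is a toy sketch (not FDB’s actual data-distribution code, and the server/zone names are made up): for replication factor k, any valid server team is k storage servers drawn from k distinct zoneIds.

import random

# Toy illustration of the server-team idea only -- not FDB's implementation.
# A valid "server team" for replication factor k is k storage servers whose
# zoneIds are all distinct.
servers = {
    "ss1": "zoneA", "ss2": "zoneA", "ss3": "zoneA",   # hypothetical ids/zones
    "ss4": "zoneB", "ss5": "zoneB", "ss6": "zoneB",
    "ss7": "zoneC", "ss8": "zoneC", "ss9": "zoneC",
}

def build_team(k):
    zones = random.sample(sorted(set(servers.values())), k)   # k distinct zones
    return [random.choice([s for s, z in servers.items() if z == zone])
            for zone in zones]

print(build_team(3))   # e.g. ['ss2', 'ss6', 'ss7'] -- one server per zone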

I actually don’t understand the question below. Would you mind elaborating a bit, like giving a scenario?

Secondly, how intelligently are replicas chosen when dropping redundancy levels?

Hello @mengxu!

Yes, allow me to give an update and elaborate on lowering the redundancy level.

Redundancy level question

During the operational troubleshooting for this cluster, at one point the cluster was in triple redundancy mode, but reporting “Only two replicas remain of some data.”

At this point, one of our operators lowered the cluster’s redundancy level to double, expecting FoundationDB to choose the two live replicas, and basically wind up in a “normal” double redundancy situation.

Instead we found FoundationDB reporting “Only one replica remains of some data.”! So my question is: how intelligent is FoundationDB in choosing which replicas “survive” a lowering of redundancy modes, e.g. triple → double?

Current Update

At this point, a few days later in troubleshooting, I can tell there is some serious issue preventing this data set from healing. This may have been part of the problem leading to the situation I described above.

We expanded the cluster with three new machines/zones as I mentioned and found it did bring us back to an n=2 replica state. However, some of the storage processes are very clearly unhealthy. Here is an example of the data lag, extracted from status json.

role     actor_id          process_address          process_id                        durable_version  data_lag
storage  2ac9f1823a4d5f67  10.95.111.214:4504:tls   d0b26295ff5c6fc497e0a7d514794cf3  None             None                                                                                                                                                                           
storage  6f04ac5764ebccd4  10.95.111.214:4503:tls   0f44b61c2c7ab209eab5ffa04d7906ea  None             None                                                                                                                                                                           
storage  721e454b96e75d0c  10.95.111.214:4505:tls   b7615bd9bc0e1c4bc796636626e1a539  None             None                                                                                                                                                                           
storage  8b52db5932066c2d  10.185.175.43:4504:tls   e2632a0f8c0a76bfad6a660fc9284693  8889482016494    2164815125363                                                                                                                                                                  
storage  b873c12c848af3f1  10.185.175.43:4505:tls   2587c383a9044b046790f372fc3e83c3  8889482016494    2164815125363                                                                                                                                                                  
storage  d487eab27f823aa3  10.185.175.43:4503:tls   89317bb27bf21eb2be6e6c3de371d5ff  8889482016494    2164815125363                                                                                                                                                                  
storage  a78d2fc8068c6bc8  10.220.237.150:4504:tls  75262da7e16ad452fa18f654e9deca0c  8889041958540    2165150183317                                                                                                                                                                  
storage  db7c8c1449e77cb4  10.185.115.215:4503:tls  92907d8b67f512ef0f289da88e3761ec  8889041958540    2165150183317                                                                                                                                                                  
storage  c47540138fb4f3e4  10.95.111.214:4505:tls   b7615bd9bc0e1c4bc796636626e1a539  11054299296554   0                                                                                                                                                                              
storage  d14f692f7e8cb4bb  10.95.111.219:4503:tls   c68295ef3a1f825b5c6dd6f5c488b1aa  11054297141857   0                                                                                                                                                                              
storage  159a3452afb39ba4  10.95.111.219:4504:tls   13fd0ef40e1ea16876f9d6034d0416e1  11054295420150   1656540                                                                                                                                                                        
storage  1c84cbe629e664da  10.221.166.13:4505:tls   d50db70f76c64da3b46d67031b4b7b3f  11054295420150   1656540                                                                                                                                                                        
storage  1cf91d84d0405637  10.221.166.13:4503:tls   8c039c3fa145652c939b9e76a29d9105  11054295420150   1656540                                                                                                                                                                        
storage  37cbbce93187ccf1  10.208.190.181:4505:tls  66b70754318016036b391bcc8daeefca  11054295420150   1656540                                                                                                                                                                        
storage  6e2d6a04fe7fcdcf  10.185.175.43:4505:tls   2587c383a9044b046790f372fc3e83c3  11054295420150   1656540                                                                                                                                                                        
storage  71be057fa064dfa0  10.185.115.215:4505:tls  e62d4245342bbfe886acb1d61b0d7a42  11054295420150   1656540                                                                                                                                                                        
storage  7c3f5de634d89332  10.208.190.181:4504:tls  355fc729e9a1104be046cf9c11c71510  11054295420150   1656540                                                                                                                                                                        
storage  86f067a7b14892a6  10.220.237.150:4505:tls  d4b5f11f0832b88def06c3dffabcf5b7  11054295420150   1656540                                                                                                                                                                        
storage  a614df1c88908d1e  10.220.237.150:4505:tls  d4b5f11f0832b88def06c3dffabcf5b7  11054295420150   1656540                                                                                                                                                                        
storage  d090dee9dc188f6c  10.95.111.219:4505:tls   8608598d11e6c794191731ab4308e550  11054295420150   1656540                                                                                                                                                                        
storage  d207c53565bad221  10.208.190.181:4503:tls  99c95c04859c90dbc692228ae1ce2be7  11054295420150   1656540                                                                                                                                                                        
storage  f9f479eddee32ff3  10.221.166.13:4504:tls   9bed6336457eb511f5d3d090a53440c3  11054295420150   1656540                                                                                                                                                                        
storage  09a972450d63c006  10.185.175.85:4503:tls   52238f1eba411da9833f75ecd5868bd2  11054297141857   1721707                                                                                                                                                                        
storage  0a7c5e51d14f6223  10.220.237.150:4503:tls  fd75186a0c9989f34ec503bce4615a0e  11054297141857   1721707                                                                                                                                                                        
storage  1053cd090ac8a252  10.220.237.150:4504:tls  75262da7e16ad452fa18f654e9deca0c  11054297141857   1721707                                                                                                                                                                        
storage  10acef9c859fc163  10.185.175.85:4504:tls   25fabd28ce65f76a15cfbd20badc9592  11054297141857   1721707                                                                                                                                                                        
storage  5d9ae63025eb37b0  10.95.111.214:4504:tls   d0b26295ff5c6fc497e0a7d514794cf3  11054297141857   1721707                                                                                                                                                                        
storage  73a6404856fb4c13  10.95.111.214:4503:tls   0f44b61c2c7ab209eab5ffa04d7906ea  11054297141857   1721707                                                                                                                                                                        
storage  e6a418c607217be0  10.185.115.215:4504:tls  7fdb2d6ba9e1a72457ccfdbc5d81e65c  11054297141857   1721707                                                                                                                                                                        
storage  f4d63771128ea9af  10.185.175.85:4505:tls   b022af6271b7ab4fbe0e0fea69cd6286  11054297141857   1721707                                                                                                                                                                        
storage  2217d81e47c74b1a  10.185.175.43:4503:tls   89317bb27bf21eb2be6e6c3de371d5ff  11054293763610   2036983                                                                                                                                                                        
storage  2f0e9b7a6922fc2f  10.185.175.43:4504:tls   e2632a0f8c0a76bfad6a660fc9284693  11054293763610   2036983                                                                                                                                                                        
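
(For reference, something like the following reproduces that table from status json. It assumes the 6.2-era status json layout, where each entry in cluster.processes carries a roles array and storage roles expose data_lag.versions and durable_version, and that fdbcli emits just the JSON.)

import json
import subprocess

# Rough sketch of extracting per-storage-role data lag from status json.
# Field names assume the 6.2-era layout; adjust if your version differs.
status = json.loads(subprocess.check_output(["fdbcli", "--exec", "status json"]))

rows = []
for pid, proc in status["cluster"]["processes"].items():
    for role in proc.get("roles", []):
        if role.get("role") != "storage":
            continue
        rows.append((
            role.get("id"),                              # actor id
            proc.get("address"),                         # ip:port[:tls]
            pid,                                         # process id
            role.get("durable_version"),
            (role.get("data_lag") or {}).get("versions"),
        ))

# Sort by data lag, with missing values last.
for row in sorted(rows, key=lambda r: (r[4] is None, r[4])):
    print("\t".join(str(c) for c in row))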

Now, we’ve tried to exclude these large data lag processes in the past. However data never moves off of them. I tried something more brutal – I brought down FoundationDB on the 10.185.175.43 machine/zone entirely in the hopes of seeing data redistribution. I saw the moving data value float up and down a few megabytes and then just flatline again.

I was able to line up logs with this action and notice a few standout warnings. I’ve reproduced them below with some minor sanitization.


We saw a steady rate of these which lines up with the lack of actual data movement.

Jul 20 11:20:20 fdb8 trace.10.208.190.181.4505.1594654180.VmrXDJ.1.82.json { "Severity": "30", "Time": "1595265619.425258", "Type": "FetchKeysTooLong", "ID": "0000000000000000", "Duration": "611400", "Phase": "0", "Begin": "\x15\x1c\x15\x0f\x01\x16\x12\xb8\x00\x15\x14…", "End": "\x15\x1c\x15\x0f\x01\x16\x12\xc9\x00\x15\x15\…", "Machine": "10.208.190.181:4505", "LogGroup": "default", "Roles": "SS" }

Then I brought down FoundationDB on the previously mentioned machine and saw these, which is expected. We also dropped from n=2 to n=1 replicas, which we had expected was possible.

Jul 20 11:21:28 fdb4 trace.10.95.111.214.4503.1593164519.NZ7sl7.2.281.json { "Severity": "30", "Time": "1595265688.623143", "Type": "TooManyConnectionsClosed", "ID": "0000000000000000", "SuppressedEventCount": "19", "PeerAddr": "10.185.175.43:4502:tls", "Machine": "10.95.111.214:4503", "LogGroup": "default", "Roles": "SS" }

Finally we see the cluster trying to handle data movement? Very confusingly I can’t find any of these source/destination (storage process?) ids in the status JSON.

Jul 20 11:21:35 fdb7 trace.10.185.175.85.4501.1594654180.i2uZdD.2.339.json { "Severity": "30", "Time": "1595265694.457242", "Type": "RelocateShardTooLong", "ID": "0000000000000000", "Error": "operation_cancelled", "ErrorDescription": "Asynchronous operation cancelled", "ErrorCode": "1101", "Duration": "611298", "Dest": "0a7c5e51d14f6223d201b73c5ec0b4a2,1c84cbe629e664da7e217a4513f35b14,7c3f5de634d893329e38da967a22e63e", "Src": "0a7c5e51d14f6223d201b73c5ec0b4a2,e6a418c607217be0522f0b5b96d02396", "Machine": "10.185.175.85:4501", "LogGroup": "default", "Roles": "DD,MP" }

More data movement woes:

Jul 20 13:50:14 fdb5 trace.10.185.115.215.4500.1593170776.4NB1xR.2.467.json { "Severity": "30", "Time": "1595274613.655347", "Type": "FinishMoveKeysTooLong", "ID": "0000000000000000", "Duration": "8400", "Servers": "09a972450d63c0069b109876b29ab61d,1cf91d84d04056372cf765ee452ed15e,d207c53565bad2213792a5d0f6255f16", "Machine": "10.185.115.215:4500", "LogGroup": "default", "Roles": "CD,DD,MS,RK" }

I can’t even find these ids in the status json – related?

kyle@fdb1:~$ fdbcli --exec 'status json' | grep -e 09a972450d63c0069b109876b29ab61d -e 1cf91d84d04056372cf765ee452ed15e -e d207c53565bad2213792a5d0f6255f16
kyle@fdb1:~$
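
One guess on my part (purely an assumption): the Src/Dest/Servers ids in those trace events look like full 32-character UIDs, while status json appears to list 16-character short ids, so matching on the 16-character prefix might be the right check. Something like:

import subprocess

# Guess: trace-event ids are full 32-char UIDs, while status json shows 16-char
# short ids, so look for the 16-char prefix instead of the full string.
trace_ids = [
    "09a972450d63c0069b109876b29ab61d",
    "1cf91d84d04056372cf765ee452ed15e",
    "d207c53565bad2213792a5d0f6255f16",
]

status_text = subprocess.check_output(["fdbcli", "--exec", "status json"]).decode()

for full_id in trace_ids:
    prefix = full_id[:16]
    found = prefix in status_text
    print(f"{full_id}: prefix {prefix} {'found' if found else 'not found'} in status json")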

Those and FetchKeysTooLong are now the primary severity >20 logs I see at this point.

I do see the occasional

Jul 20 14:00:38 fdb5 trace.10.185.115.215.4500.1593170776.4NB1xR.2.468.json { "Severity": "30", "Time": "1595275238.168315", "Type": "TraceEventThrottle_BgDDMountainChopper", "ID": "0000000000000000", "SuppressedEventCount": "47", "Machine": "10.185.115.215:4500", "LogGroup": "default", "Roles": "CD,DD,MS,RK" }

But moving data has been fairly static for hours – sometimes increasing. You can see the moving data is now greater in size than the entire logical data set.

Data:
  Replication health     - HEALING: Only one replica remains of some data
  Moving data            - 21.887 GB
  Sum of key-value sizes - 20.820 GB
  Disk space used        - 61.606 GB

6/23 is incident start.

Finally, the cluster is under almost no load.

ip                port    cpu%  mem%  iops  net  class          roles                                                                                                                                                                                                                 
----------------  ------  ----  ----  ----  ---  -------------  ------------------------------------------------                                                                                                                                                                      
 10.185.115.215    4500    6     5     -     6    stateless      coordinator,data_distributor,master,ratekeeper                                                                                                                                                                       
                   4501    3     4     -     4    stateless      cluster_controller                                                                                                                                                                                                   
                   4502    1     7     4     0    transaction    log                                                                                                                                                                                                                  
                   4503    2     52    7     1    storage        storage                                                                                                                                                                                                              
                   4504    1     19    9     0    storage        storage                                                                                                                                                                                                              
                   4505    1     18    7     0    storage        storage                                                                                                                                                                                                              
----------------  ------  ----  ----  ----  ---  -------------  ------------------------------------------------                                                                                                                                                                      
 10.185.175.85     4500    1     3     -     0    stateless      resolver                                                                                                                                                                                                             
                   4501    4     4     -     8    stateless      proxy                                                                                                                                                                                                                
                   4502    1     2     7     0    transaction    log                                                                                                                                                                                                                  
                   4503    1     12    11    0    storage        storage                                                                                                                                                                                                              
                   4504    1     15    11    0    storage        storage                                                                                                                                                                                                              
                   4505    1     10    11    0    storage        storage                                                                                                                                                                                                              
----------------  ------  ----  ----  ----  ---  -------------  ------------------------------------------------                                                                                                                                                                      
 10.208.190.181    4500    0     2     -     0    stateless                                                                                                                                                                                                                           
                   4501    0     4     -     0    stateless                                                                                                                                                                                                                           
                   4502    1     3     7     0    transaction    log                                                                                                                                                                                                                  
                   4503    1     5     11    1    storage        storage                                                                                                                                                                                                              
                   4504    1     14    11    0    storage        storage                                                                                                                                                                                                              
                   4505    1     5     11    0    storage        storage                                                                                                                                                                                                              
----------------  ------  ----  ----  ----  ---  -------------  ------------------------------------------------                                                                                                                                                                      
 10.220.237.150    4500    2     4     -     1    stateless      coordinator                                                                                                                                                                                                          
                   4501    0     4     -     0    stateless                                                                                                                                                                                                                           
                   4502    2     5     6     0    transaction    log                                                                                                                                                                                                                  
                   4503    1     15    18    0    storage        storage                                                                                                                                                                                                              
                   4504    2     35    13    1    storage        storage,storage                                                                                                                                                                                                      
                   4505    1     14    11    0    storage        storage,storage                                                                                                                                                                                                      
----------------  ------  ----  ----  ----  ---  -------------  ------------------------------------------------                                                                                                                                                                      
 10.221.166.13     4500    0     3     -     0    stateless                                                                                                                                                                                                                           
                   4501    0     4     -     0    stateless                                                                                                                                                                                                                           
                   4502    1     2     -     0    transaction                                                                                                                                                                                                                         
                   4503    1     9     10    0    storage        storage                                                                                                                                                                                                              
                   4504    1     13    10    0    storage        storage                                                                                                                                                                                                              
                   4505    1     7     10    0    storage        storage                                                                                                                                                                                                              
----------------  ------  ----  ----  ----  ---  -------------  ------------------------------------------------                                                                                                                                                                      
 10.95.111.214     4500    1     4     -     0    stateless      coordinator                                                                                                                                                                                                          
                   4501    0     4     -     0    stateless                                                                                                                                                                                                                           
                   4502    1     5     4     0    transaction    log                                                                                                                                                                                                                  
                   4503    1     8     13    0    storage        storage,storage                                                                                                                                                                                                      
                   4504    1     9     13    0    storage        storage,storage                                                                                                                                                                                                      
                   4505    1     7     10    0    storage        storage,storage                                                                                                                                                                                                      
----------------  ------  ----  ----  ----  ---  -------------  ------------------------------------------------                                                                                                                                                                      
 10.95.111.219     4500    0     4     -     0    stateless                                                                                                                                                                                                                           
                   4501    0     4     -     0    stateless                                                                                                                                                                                                                           
                   4502    1     5     -     0    transaction                                                                                                                                                                                                                         
                   4503    1     19    12    1    storage        storage                                                                                                                                                                                                              
                   4504    1     21    12    0    storage        storage                                                                                                                                                                                                              
                   4505    1     15    12    0    storage        storage   

Frankly we’re all a bit frustrated, as obviously something is preventing FoundationDB from doing one of its best things – managing the data!

At this point I’m wondering if the problem is actually with reading from these problematic shards, given we have stood up many new storage processes in new zones that would be suitable healing targets.

I’ll end with re-asking a question from earlier – is it possible to tell which actors/processes are holding the n=2, n=1 etc. shards? This might help drive operational decision making.

I’ll also recap what put us into this state – a series of too-fast machine reboots which seems to have bounced FoundationDB too brutally, combined with an accidental mis-targeting of the maintenance command (invalid zones specified). So likely some data movement was triggered during a rolling reboot across the cluster, where I imagine the reboot process was something like 30s per machine.

Ah finally, I forgot to mention, I’ve now excluded all processes with exorbitant data lag.

fdb> exclude
There are currently 3 servers or processes being excluded from the database:
  10.185.115.215:4503
  10.185.175.43    # We know this is the entire machine
  10.220.237.150:4504

I have confirmed that all mounted storage devices across the cluster are in rw mode and that no SSD reports being in read-only mode.

Mmm, perhaps this is interesting. Some of the storage actors’ sqlite files haven’t been touched in weeks.

kyle@fdb2:~$ du -h --time /srv/sdc/foundationdb/*/*
4.0K    2020-03-16 10:55        /srv/sdc/foundationdb/4503/processId
301M    2020-07-20 17:21        /srv/sdc/foundationdb/4503/storage-2217d81e47c74b1a3a7041b1a95332b2.sqlite
21M     2020-07-20 17:21        /srv/sdc/foundationdb/4503/storage-2217d81e47c74b1a3a7041b1a95332b2.sqlite-wal
6.0G    2020-06-26 11:28        /srv/sdc/foundationdb/4503/storage-d487eab27f823aa3183ed5634979408c.sqlite
48M     2020-06-22 16:16        /srv/sdc/foundationdb/4503/storage-d487eab27f823aa3183ed5634979408c.sqlite-wal
4.0K    2020-07-08 03:47        /srv/sdc/foundationdb/4504/fitness
4.0K    2020-03-05 10:21        /srv/sdc/foundationdb/4504/processId
3.9G    2020-07-20 17:21        /srv/sdc/foundationdb/4504/storage-2f0e9b7a6922fc2f348a9bec52c02383.sqlite
52M     2020-07-20 17:21        /srv/sdc/foundationdb/4504/storage-2f0e9b7a6922fc2f348a9bec52c02383.sqlite-wal
5.8G    2020-06-26 11:28        /srv/sdc/foundationdb/4504/storage-8b52db5932066c2d2c79e5e1ee1cbc61.sqlite
48M     2020-06-22 16:16        /srv/sdc/foundationdb/4504/storage-8b52db5932066c2d2c79e5e1ee1cbc61.sqlite-wal
4.0K    2020-07-20 16:43        /srv/sdc/foundationdb/4505/fitness
4.0K    2020-03-16 10:55        /srv/sdc/foundationdb/4505/processId
301M    2020-07-20 17:21        /srv/sdc/foundationdb/4505/storage-6e2d6a04fe7fcdcfed5fb870ea245162.sqlite
32M     2020-07-20 17:21        /srv/sdc/foundationdb/4505/storage-6e2d6a04fe7fcdcfed5fb870ea245162.sqlite-wal
6.1G    2020-06-26 11:28        /srv/sdc/foundationdb/4505/storage-b873c12c848af3f1c199091c8e2c49c7.sqlite
48M     2020-06-22 16:16        /srv/sdc/foundationdb/4505/storage-b873c12c848af3f1c199091c8e2c49c7.sqlite-wal

On this machine in particular several actors were reported with massive data lag.

role     actor_id          process_address          process_id                        durable_version  data_lag
storage  8b52db5932066c2d  10.185.175.43:4504:tls   e2632a0f8c0a76bfad6a660fc9284693  8889482016494    2164815125363                                                                                                                                                                  
storage  b873c12c848af3f1  10.185.175.43:4505:tls   2587c383a9044b046790f372fc3e83c3  8889482016494    2164815125363                                                                                                                                                                  
storage  d487eab27f823aa3  10.185.175.43:4503:tls   89317bb27bf21eb2be6e6c3de371d5ff  8889482016494    2164815125363

I’m not sure – perhaps this is expected if the storage actors are not accepting updates from the tlogs, or perhaps it’s a useful clue if we’d expect these files to be updated regularly. Permissions look fine.


I really want to resist the temptation to hand-wave, but I will in case it spurs thoughts in more knowledgeable folks. Is it possible that during the incident-start reboots data distribution created shards which were then “half-orphaned” as the incident unfolded?

Enough that FDB detects e.g. “Only one replica remains of some data…” even though the shards are not actually usable?

I see logs such as the following associated with these storage actors:

... lots of SlowSSLoopx10; again there's no load on the cluster/disks
Jul 20 11:19:25 fdb2 trace.10.185.175.43.4503.1593170931.2mlq6Z.2.320.json {  "Severity": "20", "Time": "1595265564.490685", "Type": "SlowSSLoopx100", "ID": "d487eab27f823aa3", "Elapsed": "0.069077", "Machine": "10.185.175.43:4503", "LogGroup": "default", "Roles": "SS" }
Jul 20 11:21:35 fdb5 trace.10.185.115.215.4500.1593170776.4NB1xR.2.459.json {  "Severity": "20", "Time": "1595265694.556601", "Type": "UndesiredStorageServer", "ID": "3a854a4963363f69", "Server": "2217d81e47c74b1a", "Address": "10.185.175.43:4503:tls", "OtherServer": "d487eab27f823aa3", "NumShards": "15", "OtherNumShards": "238", "Machine": "10.185.115.215:4500", "LogGroup": "default", "Roles": "CD,DD,MS,RK" }
Jul 20 13:18:17 fdb5 trace.10.185.115.215.4500.1593170776.4NB1xR.2.465.json {  "Severity": "20", "Time": "1595272696.583177", "Type": "UndesiredStorageServer", "ID": "3a854a4963363f69", "Server": "d487eab27f823aa3", "Excluded": "10.185.175.43", "Machine": "10.185.115.215:4500", "LogGroup": "default", "Roles": "CD,DD,MS,RK" }

Yes, the special key space const KeyRangeRef keyServersKeys( LiteralStringRef("\xff/keyServers/"), LiteralStringRef("\xff/keyServers0") ); has the key-to-server mapping. If you see a shard mapped to more than k servers, that’s because the shard is in the process of moving from one server team to another.

You can read this key space and decode it like this https://github.com/xumengpanda/foundationdb/blob/ba54508c477dc90db1d261ca481663eea0a6220c/fdbclient/SystemData.cpp#L51-L59
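
If you want to poke at that range from a client, a rough sketch with the Python bindings looks like the following (you have to enable system-key reads on the transaction; decoding the values still needs to follow the decodeKeyServersValue code linked above):

import fdb

fdb.api_version(620)   # the cluster here is on 6.2.x
db = fdb.open()        # default cluster file

@fdb.transactional
def dump_key_servers(tr, limit=20):
    # \xff/keyServers/ maps each shard's begin key to the storage servers
    # (src, plus dest while a move is in flight) responsible for it.
    tr.options.set_read_system_keys()
    prefix = b"\xff/keyServers/"
    for kv in tr.get_range(prefix, b"\xff/keyServers0", limit=limit):
        shard_begin = kv.key[len(prefix):]
        # kv.value holds the server UIDs; decode it per SystemData.cpp.
        print(shard_begin, "->", len(kv.value), "byte value")

dump_key_servers(db)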

FDB should be wise enough to handle this situation. We have tested it in simulation and in the real world before.

I’m suspicious of the locality configuration of the SSes. If one zone has too few processes, those processes may be grouped with too many other processes and have their disks nearly filled up. When FDB DD detects this, it stops rebalancing data.

Do you have a list of the zoneId and processId of each SS in the cluster?
In particular, I’m looking at how many zoneIds are in the cluster, and how many processIds are in each zoneId (assuming each SS uses the same disk size).

Another trace event that is interesting is TeamCollectionInfo. Could you also paste that?

We run three storage processes per zone (one storage disk, three processes per disk). We do not specify the zone ids, so they are randomly generated and boil down to one zone per machine. All machines/zones have three storage processes. There are now 8 zones/machines and 24 total storage processes.

TeamCollectionInfo examples: https://gist.github.com/ksnavely/bb334f04e91939c102758b11366793db
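
(For anyone wanting to pull the same events, something like this scans JSON-format trace logs for them; the /var/log/foundationdb path is just the usual default and may differ per install.)

import glob
import json

# Scan JSON-format trace logs for TeamCollectionInfo events.
# The directory is an assumption (a common fdbmonitor log location); adjust as needed.
for path in sorted(glob.glob("/var/log/foundationdb/trace.*.json")):
    with open(path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except ValueError:
                continue
            if event.get("Type") == "TeamCollectionInfo":
                print(path, event.get("Time"), json.dumps(event))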

We are aware that it’s better to specify the zone ids explicitly for failure planning. We run three_data_hall mode in most other clusters, using datacenters as the halls, so it’s not as big a concern for us there compared to getting the halls right.

Note the storage disks are not even close to full; they are ~3.5 TB disks. All machines have identical chassis, RAM, disk sizes/models, and NICs at this time.

I will mention for context that at this point I still have one machine’s FDB service purposely down; that machine is excluded. There are two additional lagging storage actors which are excluded. There are 8 machines total. The cluster is in triple mode.

I’ll bring the downed service back up, as taking it down had no effect on the cluster afaict.

[EDIT: done, though I’ve allowed the tlog and stateless processes to be included while leaving the storage processes excluded.]

fdb> exclude
There are currently 5 servers or processes being excluded from the database:
  10.185.115.215:4503
  10.185.175.43:4503
  10.185.175.43:4504
  10.185.175.43:4505
  10.220.237.150:4504
To find out whether it is safe to remove one or more of these
servers from the cluster, type `exclude <addresses>'.
To return one of these servers to the cluster, type `include <addresses>'.

This cluster is running FoundationDB 6.2.10 BTW.

The UndesiredStorageServer event says there are two SSes on the same IP:Port. The Server (field) marks itself as unhealthy because OtherServer has more shards than itself: OtherNumShards > NumShards. So OtherServer should eventually be selected as the destination to host the data.

(I’m kind of out of ideas for now…)

I’ve re-included all processes in the cluster. I killed DD manually. I don’t see any real change happening to the cluster. Data lag has increased.

I want to stay open to others’ comments, including on disaster recovery. If the community comes up short, I’m leaning towards an fdbdr-based recovery. We have at least two replicas of all shards, so the data should be readable. If the DR can catch up, we can perform an fdbdr switch and ~simultaneously point our clients at the DR cluster. Afterwards we can wipe the original cluster and reverse the DR process to get things back to normal. (Or just carry on with the DR cluster… yadda yadda, internal infra management details for us :D)

Long story short, fdbdr actually worked great here. The incident onset remains a bit of a forever mystery due to the issues in staging, but we ran a thorough evaluation and remediation effort, taking a couple of weeks at staging priority, to use fdbdr and transfer all healthy key ranges to a replacement cluster.

It wasn’t satisfying to have to move on, but with a few puzzle pieces still missing and other priorities, I’m happy to say we turned the situation into a productive exercise and valuable hands-on experience with FoundationDB for the team.