Multi-DC replication fails during DR test

Setup:

  • Multi-DC setup with one Kubernetes cluster and three namespaces that map to three DCs

  • dc1 serves as the primary, dc3 as the secondary, and dc2 as a satellite
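
For reference, this layout is expressed in the FoundationDBCluster spec's databaseConfiguration roughly as follows (a sketch; the priority values are illustrative, while the datacenter IDs and satellite log counts match the status output below):

    # Sketch of databaseConfiguration for this region layout; priorities are
    # illustrative, IDs and satellite_logs match the status output below.
    databaseConfiguration:
      redundancy_mode: triple
      usable_regions: 2
      regions:
        - datacenters:
            - id: dc1                 # primary
              priority: 1
            - id: dc2                 # satellite
              priority: 1
              satellite: 1
            - id: dc3                 # satellite
              priority: 0
              satellite: 1
          satellite_logs: 3
        - datacenters:
            - id: dc3                 # remote
              priority: 0
            - id: dc2
              priority: 1
              satellite: 1
            - id: dc1
              priority: 0
              satellite: 1
          satellite_logs: 3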

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 9
  Desired Commit Proxies - 2
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 3
  Desired Remote Logs    - 3
  Desired Log Routers    - 3
  Usable Regions         - 2
  Regions: 
    Primary -
        Datacenter                    - dc1
        Satellite datacenters         - dc2, dc3
        Satellite Logs                - 3
    Remote -
        Datacenter                    - dc3
        Satellite datacenters         - dc2, dc1
        Satellite Logs                - 3

Cluster:
  FoundationDB processes - 81
  Zones                  - 47
  Machines               - 47
  Memory availability    - 8.0 GB per process on machine with least available
  Retransmissions rate   - 0 Hz
  Fault Tolerance        - 2 machines
  Server time            - 09/17/23 01:59:16

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 1.158 TB
  Disk space used        - 8.639 TB

The problem:

During a DR test, I simulated the primary dc1 going down, brought it back up, and let the data replicate back, but replication is currently stuck (see the errors below).

  • I would like to understand why this happened, how to recover, and how to prevent this type of situation.
  • I've noticed that the FDB operator added a storage pod in the dc1 namespace. From a Kubernetes perspective, how is the recovery process handled?

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 9
  Desired Commit Proxies - 2
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 3
  Desired Remote Logs    - 3
  Desired Log Routers    - 3
  Usable Regions         - 2
  Regions: 
    Remote -
        Datacenter                    - dc1
        Satellite datacenters         - dc2, dc3
        Satellite Logs                - 3
    Primary -
        Datacenter                    - dc3
        Satellite datacenters         - dc2, dc1
        Satellite Logs                - 3

Cluster:
  FoundationDB processes - 81 (less 0 excluded; 12 with errors)
  Zones                  - 47
  Machines               - 47
  Memory availability    - 8.0 GB per process on machine with least available
  Retransmissions rate   - 1 Hz
  Fault Tolerance        - -1 machines

  Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 253 begin: 752008734932 end: 752125253003, missing log interfaces(id,address): 55cfd4bc6e3b0a8f, 9ab3ab3fdefabd20, 91fa9bbe4783efbb, 
  Old log epoch: 250 begin: 751872585983 end: 752008734932, missing log interfaces(id,address): 055aa740d22950ef, 2bfa78e37a663789, bc810043cb1a881b, 
  Old log epoch: 248 begin: 751768423814 end: 751872585983, missing log interfaces(id,address): 60b3598fd80246f7, e841f74ebd3955bb, 0e8b32bcfc4a4ae2, 
  Old log epoch: 246 begin: 751657077872 end: 751768423814, missing log interfaces(id,address): 730ccd96ca1256b0, a77e054a4e8f1279, 0c79570a7b4707d0, 
  Old log epoch: 244 begin: 751552887228 end: 751657077872, missing log interfaces(id,address): 0ed852c5e99c862c, d28c7dae960cea90, 991deec06ebdd74d, 
  Old log epoch: 242 begin: 751440887046 end: 751552887228, missing log interfaces(id,address): 15e823d14fde4dfb, 3d716439331bb3be, c82bcf34b648311e, 
  Old log epoch: 240 begin: 751332177172 end: 751440887046, missing log interfaces(id,address): 68c1e97818bf5ef4, 0abc75e178f47ee7, 81991d47565c137a, 
  Old log epoch: 237 begin: 751187559126 end: 751332177172, missing log interfaces(id,address): 1493a5a1f23cf477, 0960fcaf02bd12e3, f76dfe8ef10992f5, 
  Old log epoch: 235 begin: 751050558192 end: 751187559126, missing log interfaces(id,address): 0b38863b2b4e2ceb, 725bcab59e5c9ea0, 1de0783cc2013219, 
  Old log epoch: 233 begin: 747960335254 end: 751050558192, missing log interfaces(id,address): 3c4efb7aa938ab14, f286a82bf78eeae2, c0ce84e3fedb16c5, 

  Server time            - 09/17/23 14:30:07

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 4.287 TB

Operating space:
  Storage server         - 1541.7 GB free on most full server
  Log server             - 1542.0 GB free on most full server

Workload:
  Read rate              - 56 Hz
  Write rate             - 0 Hz
  Transactions started   - 19 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.113.237.132:4501    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.133:4501    (  1% cpu;  2% machine; 0.002 Gbps;  0% disk IO; 5.7 GB / 8.0 GB RAM  )
  10.113.237.133:4503    (  2% cpu;  2% machine; 0.002 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.134:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 350 seconds.
  10.113.237.134:4503    (  0% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.135:4501    (  1% cpu;  1% machine; 0.001 Gbps;  4% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 342 seconds.
  10.113.237.135:4503    (  0% cpu;  1% machine; 0.001 Gbps;  4% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.136:4501    (  1% cpu;  1% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 340 seconds.
  10.113.237.136:4503    (  1% cpu;  1% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 280 seconds.
  10.113.237.137:4501    (  0% cpu;  1% machine; 0.001 Gbps;  2% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.137:4503    (  1% cpu;  1% machine; 0.001 Gbps;  2% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 280 seconds.
  10.113.237.138:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 280 seconds.
  10.113.237.138:4503    (  0% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.139:4501    (  0% cpu;  1% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.140:4501    (  1% cpu;  1% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 249 seconds.
  10.113.237.140:4503    (  1% cpu;  1% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 249 seconds.
  10.113.237.141:4501    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 250 seconds.
  10.113.237.141:4503    (  0% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.142:4501    (  0% cpu;  1% machine; 0.001 Gbps;  3% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 315 seconds.
  10.113.237.142:4503    (  1% cpu;  1% machine; 0.001 Gbps;  3% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 315 seconds.
  10.113.237.143:4501    (  0% cpu;  2% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.143:4503    (  0% cpu;  2% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 250 seconds.
  10.113.237.144:4501    (  0% cpu;  2% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.144:4503    (  0% cpu;  2% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.145:4501    (  0% cpu;  2% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.145:4503    (  0% cpu;  2% machine; 0.001 Gbps;  9% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.146:4501    (  1% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.147:4501    (  1% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.147:4503    (  0% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.148:4501    (  0% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.148:4503    (  0% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.149:4501    (  0% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.149:4503    (  0% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.150:4501    (  0% cpu;  1% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.150:4503    (  0% cpu;  1% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.151:4501    (  0% cpu;  1% machine; 0.001 Gbps;  7% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.151:4503    (  0% cpu;  1% machine; 0.001 Gbps;  7% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.152:4501    (  0% cpu;  1% machine; 0.001 Gbps;  9% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.152:4503    (  0% cpu;  1% machine; 0.001 Gbps;  9% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.153:4501    (  1% cpu;  5% machine; 0.001 Gbps;  2% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.154:4501    (  1% cpu;  6% machine; 0.017 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.155:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.156:4501    (  2% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.156:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.157:4501    (  1% cpu;  4% machine; 0.001 Gbps;  0% disk IO; 5.7 GB / 8.0 GB RAM  )
  10.113.237.157:4503    (  1% cpu;  4% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.158:4501    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.1 GB / 8.0 GB RAM  )
  10.113.237.158:4503    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.9 GB / 8.0 GB RAM  )
  10.113.237.159:4501    (  1% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.159:4503    (  1% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.160:4501    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.160:4503    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.161:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.161:4503    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.162:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.162:4503    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.163:4501    (  1% cpu;  2% machine; 0.017 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.163:4503    (  1% cpu;  2% machine; 0.017 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.164:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.6 GB / 8.0 GB RAM  )
  10.113.237.165:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.165:4503    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.166:4501    (  1% cpu;  5% machine; 0.002 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.166:4503    (  1% cpu;  5% machine; 0.002 Gbps;  0% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.167:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.167:4503    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.168:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 4.1 GB / 8.0 GB RAM  )
  10.113.237.168:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.169:4501    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.169:4503    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.170:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.170:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.171:4501    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.171:4503    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.172:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.172:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 4.9 GB / 8.0 GB RAM  )
  10.113.237.173:4501    (  2% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.6 GB / 8.0 GB RAM  )
  10.113.237.174:4501    (  7% cpu;  5% machine; 0.004 Gbps;  2% disk IO; 0.3 GB / 8.0 GB RAM  )
  10.113.237.175:4501    (  1% cpu;  6% machine; 0.001 Gbps;  2% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.176:4501    (  0% cpu;  5% machine; 0.001 Gbps; 18% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.177:4501    (  6% cpu;  5% machine; 0.006 Gbps;  1% disk IO; 0.3 GB / 8.0 GB RAM  )
  10.113.237.178:4501    (  3% cpu;  4% machine; 0.004 Gbps;  1% disk IO; 0.3 GB / 8.0 GB RAM  )

Coordination servers:
  10.113.237.132:4501  (reachable)
  10.113.237.133:4503  (reachable)
  10.113.237.139:4501  (reachable)
  10.113.237.146:4501  (reachable)
  10.113.237.147:4501  (reachable)
  10.113.237.154:4501  (reachable)
  10.113.237.155:4501  (reachable)
  10.113.237.164:4501  (reachable)
  10.113.237.173:4501  (reachable)

Client time: 09/17/23 14:30:04

Could you share some more details, e.g., how you simulated the failure of dc1? Are you able to share the logs of the Kubernetes operator(s)?

  • I've noticed that the FDB operator added a storage pod in the dc1 namespace. From a Kubernetes perspective, how is the recovery process handled?

In theory, if the operator is able to connect to the cluster, it should recreate the missing resources. If the issue was only a network partition, the operator has nothing to do and the FDB cluster itself performs the recovery.

In the current setup, every pod (FDB process) is placed on its own Kubernetes node (a cloud VM) via nodeSelectors and topologySpreadConstraints, and runs with hostNetwork; a sketch of the placement follows below. The pods/nodes belong to three namespaces (dc1, dc2, dc3), but from the cloud infrastructure perspective each namespace maps to an AZ. Simulating a dc1 failure means terminating all pods in the dc1 namespace (all VMs in AZ1) and bringing them back. The operator pod is free to move around and does not use hostNetwork.
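
Roughly, the placement looks like this in the process pod template (a sketch; the zone label value and the spread-constraint selector are illustrative, not the exact values we use):

    # Sketch of the per-DC placement: hostNetwork pods pinned to AZ1 nodes,
    # spread one per node. Label values here are illustrative.
    spec:
      processes:
        general:
          podTemplate:
            spec:
              hostNetwork: true
              nodeSelector:
                topology.kubernetes.io/zone: az1        # namespace dc1 -> AZ1
              topologySpreadConstraints:
                - maxSkew: 1
                  topologyKey: kubernetes.io/hostname
                  whenUnsatisfiable: DoNotSchedule
                  labelSelector:
                    matchLabels:
                      foundationdb.org/fdb-cluster-name: fdb-cluster-1
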
At the moment some processes are stuck, and the operator is trying to exclude them.

status:
  processGroups:
    - addresses:
        - 10.113.237.136
      processClass: log
      processGroupID: dc1-log-1
    - addresses:
        - 10.113.237.142
      processClass: log
      processGroupID: dc1-log-2
    - addresses:
        - 10.113.237.180
      processClass: stateless
      processGroupID: dc1-stateless-1
    - addresses:
        - 10.113.237.176
      processClass: stateless
      processGroupID: dc1-stateless-2
    - addresses:
        - 10.113.237.179
      processClass: stateless
      processGroupID: dc1-stateless-3
    - addresses:
        - 10.113.237.147
        - 10.113.237.149
        - 10.113.237.152
        - 10.113.237.155
        - 10.113.237.173
      processClass: storage
      processGroupConditions:
        - timestamp: 1694989929
          type: ResourcesTerminating          <<<<<<
      processGroupID: dc1-storage-1
      removalTimestamp: '2023-09-20T14:02:39Z'
    - addresses:
        - 10.113.237.141
      processClass: storage
      processGroupID: dc1-storage-10
    - addresses:
        - 10.113.237.148
      processClass: storage
      processGroupID: dc1-storage-11
    - addresses:
        - 10.113.237.154
      processClass: storage
      processGroupID: dc1-storage-12
    - addresses:
        - 10.113.237.147
      processClass: storage
      processGroupID: dc1-storage-13
    - addresses:
        - 10.113.237.151
      processClass: storage
      processGroupID: dc1-storage-14
    - addresses:
        - 10.113.237.138
      processClass: storage
      processGroupID: dc1-storage-15
    - addresses:
        - 10.113.237.139
      processClass: storage
      processGroupID: dc1-storage-16
    - addresses:
        - 10.113.237.145
      processClass: storage
      processGroupID: dc1-storage-17
    - addresses:
        - 10.113.237.137
      processClass: storage
      processGroupID: dc1-storage-18
    - addresses:
        - 10.113.237.155
      processClass: storage
      processGroupID: dc1-storage-19
    - addresses:
        - 10.113.237.150
      processClass: storage
      processGroupID: dc1-storage-2
    - addresses:
        - 10.113.237.135
      processClass: storage
      processGroupID: dc1-storage-3
    - addresses:
        - 10.113.237.144
      processClass: storage
      processGroupID: dc1-storage-4
    - addresses:
        - 10.113.237.149
      processClass: storage
      processGroupID: dc1-storage-5
    - addresses:
        - 10.113.237.152
      processClass: storage
      processGroupID: dc1-storage-6
    - addresses:
        - 10.113.237.140
      processClass: storage
      processGroupID: dc1-storage-7
    - addresses:
        - 10.113.237.143
      processClass: storage
      processGroupID: dc1-storage-8
    - addresses:
        - 10.113.237.135
        - 10.113.237.146
      processClass: storage
      processGroupConditions:
        - timestamp: 1695160643
          type: MissingPod                                       <<<<<<<<
      processGroupID: dc1-storage-9
      removalTimestamp: '2023-09-20T14:02:39Z'
  • Some operator pod logs:
"level":"info","ts":1695218561.8718004,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218563.9435072,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218563.948019,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.updateSidecarVersions"}
{"level":"info","ts":1695218563.948068,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.updatePodConfig"}
{"level":"info","ts":1695218563.9488695,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.updateLabels"}
{"level":"info","ts":1695218563.9493287,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.updateDatabaseConfiguration"}
{"level":"info","ts":1695218563.9493825,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218566.0212674,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218566.0240517,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.chooseRemovals"}
{"level":"info","ts":1695218566.0241134,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218568.0982757,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218568.1017754,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.excludeProcesses"}
{"level":"info","ts":1695218568.1018329,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218570.1727428,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218570.1753511,"logger":"controller","msg":"current exclusions","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"excludeProcesses","ex":["10.113.237.134:4501","10.113.237.147","10.113.237.149","10.113.237.152"]}
{"level":"info","ts":1695218570.17548,"logger":"fdbclient","msg":"Running command","namespace":"dc1","cluster":"fdb-cluster-1","path":"/usr/bin/fdb/7.1/fdbcli","args":["/usr/bin/fdb/7.1/fdbcli","--exec","exclude 10.113.237.155 10.113.237.173 10.113.237.135 10.113.237.146","-C","/tmp/4db61298-4806-40b6-9d18-07c5ca38f2c2","--log","--log","--trace_format","xml","--log-dir","/var/log/fdb","--timeout","10"]}
{"level":"error","ts":1695218572.3537273,"logger":"fdbclient","msg":"Error from FDB command","namespace":"dc1","cluster":"fdb-cluster-1","code":1,"stdout":"ERROR: Could not calculate the impact of this exclude on the total free space in the cluster.\nPlease try the exclude again in 30 seconds.\nType `exclude FORCE <ADDRESS...>' to exclude without checking free space.\n","stderr":"","error":"exit status 1","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).runCommandWithBackoff\n\t/workspace/fdbclient/admin_client.go:282\ngithub.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).ExcludeProcesses\n\t/workspace/fdbclient/admin_client.go:432\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.excludeProcesses.reconcile\n\t/workspace/controllers/exclude_processes.go:84\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:166\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"info","ts":1695218572.3538575,"logger":"controller","msg":"Delaying requeue for sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.excludeProcesses","message":"","error":"exit status 1"}
{"level":"info","ts":1695218572.3539171,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.changeCoordinators"}
{"level":"info","ts":1695218572.3539553,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218574.425527,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218574.4308026,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.bounceProcesses"}
{"level":"info","ts":1695218574.4308512,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218576.5043163,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218576.5072148,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.maintenanceModeChecker"}
{"level":"info","ts":1695218576.507269,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.updatePods"}
{"level":"info","ts":1695218576.508645,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.removeProcessGroups"}
{"level":"info","ts":1695218576.508712,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218578.5825465,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218578.5856998,"logger":"fdbclient","msg":"Filtering excluded processes","namespace":"dc1","cluster":"fdb-cluster-1","inProgress":["10.113.237.147","10.113.237.149","10.113.237.152"],"fullyExcluded":[],"notExcluded":["10.113.237.155","10.113.237.173","10.113.237.135","10.113.237.146"],"missingInStatus":[]}
{"level":"info","ts":1695218578.5857313,"logger":"controller","msg":"Exclusions to complete","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"removeProcessGroups","remainingServers":["10.113.237.155","10.113.237.173","10.113.237.135","10.113.237.146","10.113.237.147","10.113.237.149","10.113.237.152"]}
{"level":"info","ts":1695218578.5857813,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218580.6584988,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218580.6614013,"logger":"controller","msg":"Incomplete exclusion still present in removeProcessGroups step","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"removeProcessGroups","processGroupID":"dc1-storage-1","error":"process has missing address in exclusion results: 10.113.237.147"}
{"level":"info","ts":1695218580.661425,"logger":"controller","msg":"Incomplete exclusion still present in removeProcessGroups step","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"removeProcessGroups","processGroupID":"dc1-storage-9","error":"process has missing address in exclusion results: 10.113.237.135"}
{"level":"info","ts":1695218580.6614563,"logger":"controller","msg":"Reconciliation terminated early","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.removeProcessGroups","requeueAfter":0,"message":"Reconciliation needs to exclude more processes"}

Note: I performed several successful tests like this before, so I am not sure what went wrong.

Thanks for sharing the logs, I'll take a look tomorrow. Could you share the operator version you used for the test?

fdb: 7.1.26
operator: v1.16.0

Thanks for the help!

Following up on this issue:

  • I was able to reproduce the same issue in a different cluster.
  • I used the latest operator version and all pods/processes recovered; there is nothing of value in the logs, as everything is reconciled.
  • The data is not getting replicated back to the primary; I think this is happening at the FoundationDB layer.

What is the process to restart the tlog interfaces (manually)?

Using cluster file `/var/dynamic-conf/fdb.cluster'.

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 9
  Desired Commit Proxies - 2
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 3
  Desired Remote Logs    - 3
  Desired Log Routers    - 3
  Usable Regions         - 2
  Regions: 
    Primary -
        Datacenter                    - dc1
        Satellite datacenters         - dc2, dc3
        Satellite Logs                - 3
    Remote -
        Datacenter                    - dc3
        Satellite datacenters         - dc2, dc1
        Satellite Logs                - 3

Cluster:
  FoundationDB processes - 88
  Zones                  - 47
  Machines               - 47
  Memory availability    - 7.2 GB per process on machine with least available
  Retransmissions rate   - 1 Hz
  Fault Tolerance        - -1 machines

  Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 61 begin: 67747007429 end: 67862561162, missing log interfaces(id,address): 702bb3191d379a75, e2f31f2ec89c3ea8, 0e8867a3cedf9d49, 
  Old log epoch: 57 begin: 67633217945 end: 67747007429, missing log interfaces(id,address): b78ad3e928f598b3, 347462edcd47ca65, ef55f1c0644d9947, 
  Old log epoch: 54 begin: 67522181213 end: 67633217945, missing log interfaces(id,address): 53a5357d543c3689, d2226fd86c5e4438, 03d2166c63706f71, 
  Old log epoch: 52 begin: 67392885066 end: 67522181213, missing log interfaces(id,address): c115949861718c45, 6f94496c83dbfa4d, a29f67a398b280f0, 
  Old log epoch: 50 begin: 67276398632 end: 67392885066, missing log interfaces(id,address): 1ee2a591852d31f8, d55326815f9961fd, c0a8494a55bc0801, 
  Old log epoch: 47 begin: 67160817068 end: 67276398632, missing log interfaces(id,address): 7f5eee096b3696fd, d98c85335819c2c2, c222722bc97d9ba3, 
  Old log epoch: 45 begin: 67040025375 end: 67160817068, missing log interfaces(id,address): 444e5c6f3f8baeaa, 83bb01e32d493108, dac3f70601069d4b, 
  Old log epoch: 43 begin: 66922430237 end: 67040025375, missing log interfaces(id,address): 8c5201059109ec7a, eb9fdc998f238375, 9f9c019b466d372d, 
  Old log epoch: 41 begin: 66797519416 end: 66922430237, missing log interfaces(id,address): 1adc6b952c144d1e, 5df11c41622c8012, c74557cd9aafe021, 
  Old log epoch: 39 begin: 66689005581 end: 66797519416, missing log interfaces(id,address): 42cce254c5a04d12, 72265d0fb2500d78, ed0242fd8c8211ed, 
  Old log epoch: 37 begin: 66573370304 end: 66689005581, missing log interfaces(id,address): fea5c6457e5ecd86, c3e17575ac55487b, 064c7fd246c37ef7, 
  Old log epoch: 35 begin: 66405317743 end: 66573370304, missing log interfaces(id,address): c557b060d856f82f, 044bb1ff6e8ac834, 0342f98cd441be44, 
  Old log epoch: 33 begin: 66295640128 end: 66405317743, missing log interfaces(id,address): c4653c3267281519, f0bbc162395d2017, 75c42aad636f5f3e, 
  Old log epoch: 31 begin: 59531875452 end: 66295640128, missing log interfaces(id,address): e2539e896af33195, ec5a827750f277a1, 82791cad4b47b650, 

  Server time            - 09/22/23 01:49:50

Data:
  Replication health     - UNHEALTHY: No replicas remain of some data
  Moving data            - 807.670 GB
  Sum of key-value sizes - 1.158 TB
  Disk space used        - 6.196 TB

Operating space:
  Storage server         - 1744.2 GB free on most full server
  Log server             - 1746.4 GB free on most full server

Workload:
  Read rate              - 198 Hz
  Write rate             - 0 Hz
  Transactions started   - 44 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.113.237.132:4501    (  1% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.132:4503    (  1% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 0.4 GB / 8.0 GB RAM  )
  10.113.237.133:4501    (  1% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.9 GB / 8.0 GB RAM  )
  10.113.237.133:4503    (  7% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.134:4501    (  6% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.134:4503    (  2% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.135:4501    (  4% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.135:4503    (  4% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.136:4501    (  4% cpu;  7% machine; 0.120 Gbps;  5% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.136:4503    ( 20% cpu;  7% machine; 0.120 Gbps;  5% disk IO; 4.0 GB / 8.0 GB RAM  )
  10.113.237.137:4501    (  5% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.137:4503    (  5% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.138:4501    (  5% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.138:4503    (  6% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.139:4501    (  5% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.139:4503    (  6% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.140:4501    (  6% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.140:4503    (  2% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.141:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.3 GB / 8.0 GB RAM  )
  10.113.237.141:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.142:4501    (  4% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.142:4503    (  5% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.143:4501    (  1% cpu;  5% machine; 0.034 Gbps;  5% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.143:4503    ( 10% cpu;  5% machine; 0.034 Gbps;  5% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.144:4501    (  6% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.144:4503    (  6% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.145:4501    (  5% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.145:4503    (  5% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.146:4501    (  6% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.146:4503    (  2% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.147:4501    (  5% cpu;  7% machine; 0.188 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.147:4503    ( 10% cpu;  7% machine; 0.188 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.148:4501    (  4% cpu;  9% machine; 0.129 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.148:4503    ( 21% cpu;  9% machine; 0.129 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.149:4501    (  5% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.149:4503    (  5% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.150:4501    (  4% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.150:4503    (  4% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.151:4501    (  2% cpu;  4% machine; 0.001 Gbps;  3% disk IO; 9.4 GB / 8.0 GB RAM  )
  10.113.237.151:4503    (  9% cpu;  4% machine; 0.001 Gbps;  3% disk IO; 5.6 GB / 8.0 GB RAM  )
  10.113.237.152:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 2.5 GB / 8.0 GB RAM  )
  10.113.237.152:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 2.4 GB / 8.0 GB RAM  )
  10.113.237.154:4501    (  3% cpu;  3% machine; 0.002 Gbps;  3% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.154:4503    (  4% cpu;  3% machine; 0.002 Gbps;  3% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.155:4501    (  8% cpu;  4% machine; 0.001 Gbps;  7% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.155:4503    (  9% cpu;  4% machine; 0.001 Gbps;  7% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.156:4501    ( 18% cpu;  7% machine; 0.089 Gbps; 24% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.156:4503    ( 24% cpu;  7% machine; 0.089 Gbps; 22% disk IO; 9.4 GB / 8.0 GB RAM  )
  10.113.237.157:4501    (  9% cpu;  4% machine; 0.003 Gbps;  6% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.157:4503    ( 10% cpu;  4% machine; 0.003 Gbps;  6% disk IO; 6.9 GB / 8.0 GB RAM  )
  10.113.237.158:4501    (  5% cpu;  3% machine; 0.001 Gbps;  2% disk IO; 8.4 GB / 8.0 GB RAM  )
  10.113.237.158:4503    (  2% cpu;  3% machine; 0.001 Gbps;  2% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.159:4501    (  3% cpu;  4% machine; 0.057 Gbps; 12% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.159:4503    ( 19% cpu;  4% machine; 0.057 Gbps; 13% disk IO; 9.4 GB / 8.0 GB RAM  )
  10.113.237.160:4501    (  9% cpu;  4% machine; 0.001 Gbps;  5% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.160:4503    (  7% cpu;  4% machine; 0.001 Gbps;  5% disk IO; 6.9 GB / 8.0 GB RAM  )
  10.113.237.161:4501    (  5% cpu;  2% machine; 0.001 Gbps;  3% disk IO; 8.4 GB / 8.0 GB RAM  )
  10.113.237.161:4503    (  3% cpu;  2% machine; 0.001 Gbps;  4% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.162:4501    (  2% cpu;  3% machine; 0.040 Gbps;  4% disk IO; 6.7 GB / 8.0 GB RAM  )
  10.113.237.162:4503    ( 10% cpu;  3% machine; 0.040 Gbps;  4% disk IO; 6.7 GB / 8.0 GB RAM  )
  10.113.237.163:4501    ( 23% cpu;  7% machine; 0.170 Gbps; 19% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.163:4503    ( 18% cpu;  7% machine; 0.170 Gbps; 19% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.165:4501    (  8% cpu;  3% machine; 0.001 Gbps;  3% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.165:4503    (  2% cpu;  3% machine; 0.001 Gbps;  3% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.166:4501    ( 23% cpu;  4% machine; 0.001 Gbps; 10% disk IO; 5.9 GB / 8.0 GB RAM  )
  10.113.237.166:4503    (  2% cpu;  4% machine; 0.001 Gbps;  7% disk IO; 7.4 GB / 8.0 GB RAM  )
  10.113.237.167:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 2.5 GB / 8.0 GB RAM  )
  10.113.237.167:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 2.6 GB / 8.0 GB RAM  )
  10.113.237.168:4501    (  8% cpu;  3% machine; 0.002 Gbps;  4% disk IO; 9.4 GB / 8.0 GB RAM  )
  10.113.237.168:4503    (  2% cpu;  3% machine; 0.002 Gbps;  3% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.169:4501    (  8% cpu;  4% machine; 0.001 Gbps;  5% disk IO; 7.0 GB / 8.0 GB RAM  )
  10.113.237.169:4503    (  7% cpu;  4% machine; 0.001 Gbps;  4% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.170:4501    (  3% cpu;  4% machine; 0.100 Gbps;  9% disk IO; 7.1 GB / 8.0 GB RAM  )
  10.113.237.170:4503    ( 14% cpu;  4% machine; 0.100 Gbps;  9% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.171:4501    (  3% cpu;  3% machine; 0.001 Gbps;  3% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.171:4503    (  8% cpu;  3% machine; 0.001 Gbps;  3% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.172:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.172:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.173:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.173:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.174:4501    (  2% cpu; 12% machine; 0.001 Gbps;  3% disk IO; 0.3 GB / 7.2 GB RAM  )
  10.113.237.176:4501    (  4% cpu;  9% machine; 0.004 Gbps;  3% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.177:4501    (  1% cpu;  5% machine; 0.000 Gbps;  2% disk IO; 0.3 GB / 8.0 GB RAM  )
  10.113.237.178:4501    (  2% cpu;  5% machine; 0.000 Gbps;  2% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.179:4501    (  3% cpu;  7% machine; 0.000 Gbps;  2% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.180:4501    ( 13% cpu; 14% machine; 0.005 Gbps;  2% disk IO; 0.3 GB / 8.0 GB RAM  )
  10.113.237.181:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.181:4503    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )

Coordination servers:
  10.113.237.132:4501  (reachable)
  10.113.237.134:4503  (reachable)
  10.113.237.138:4503  (reachable)
  10.113.237.152:4501  (reachable)
  10.113.237.167:4503  (reachable)
  10.113.237.171:4501  (reachable)
  10.113.237.172:4503  (reachable)
  10.113.237.173:4503  (reachable)
  10.113.237.181:4503  (reachable)

Client time: 09/22/23 01:49:50

WARNING: A single process is both a transaction log and a storage server.
  For best performance use dedicated disks for the transaction logs by setting process classes.

Could you share the operator logs for your test run? I would be interested to see what the subreconcilers addPods and addPVCs are saying.

Simulating a dc1 failure means terminating all pods in the dc1 namespace (all VMs in AZ1) and bringing them back.

I assume you are running a “kubectl delete …”, correct? Are all of the deleted Pods actually created again? And the PVCs in this namespace are not touched?

What is the process to restart the tlog interfaces (manually)?

Those should be restarted automatically once the Pod is recreated and the corresponding PVC is mounted. You can manually restart a process by using the kill command from fdbcli, e.g. kill; kill <IP:Port>.
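
For example (a sketch; the address shown is one of the lagging storage servers from the status output, and the cluster-file path is the one from earlier in this thread):

    # Restart one fdbserver process from fdbcli. The first `kill` with no
    # arguments populates fdbcli's list of known processes; the second one
    # targets the given address.
    fdbcli -C /var/dynamic-conf/fdb.cluster --exec 'kill; kill 10.113.237.136:4501'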

Yes, the pods are created again and running…

  • Since pods map to nodes, I scale down the nodes in that particular AZ (I use an OCP MachineSet for this), and the pods stay in Pending until I scale back up.
  • I also have to delete the PVCs in the namespace, otherwise the pods do not start again.
  • I use the Local Storage Operator to create a StorageClass out of the VMs' NVMe disks, from which the PVCs are provisioned (the whole procedure is sketched below).
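
Roughly, the procedure is (a sketch; the MachineSet name, replica count, and label selector are placeholders):

    # Sketch of the AZ-failure simulation described above.
    oc -n openshift-machine-api scale machineset az1-workers --replicas=0
    # ... dc1 is now "down"; observe the failover, then bring the AZ back ...
    oc -n openshift-machine-api scale machineset az1-workers --replicas=16
    # Local-storage PVCs are pinned to the deleted nodes and must be removed
    # before the recreated pods can be scheduled again.
    kubectl -n dc1 delete pvc -l foundationdb.org/fdb-cluster-name=fdb-cluster-1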

I restarted the processes with the kubectl-fdb plugin and left it over the weekend… so that worked, but I have to test it more…
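
Roughly (a sketch; the pod names are placeholders and the plugin flags may differ between plugin versions):

    # Restart the affected fdbserver processes via the kubectl-fdb plugin
    # instead of raw fdbcli.
    kubectl fdb -n dc1 restart -c fdb-cluster-1 <storage-pod-1> <storage-pod-2>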

{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-11","state":"HasUnhealthyProcess","conditions":["MissingProcesses"]}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Not all process groups are reconciled","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","desiredProcessGroups":22,"reconciledProcessGroups":21}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","duration_seconds":0.087244812}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Cluster was not fully reconciled by reconciliation process","namespace":"dc1","cluster":"fdb-cluster-1","status":{"hasUnhealthyProcess":2},"CurrentGeneration":0,"OriginalGeneration":2,"DelayedRequeue":false}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Reconciliation run finished","namespace":"dc1","cluster":"fdb-cluster-1","duration_seconds":0.342980124,"cacheStatus":true}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Fetch machine-readable status for reconcilitation loop","namespace":"dc1","cluster":"fdb-cluster-1","cacheStatus":true}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Trying connection options","namespace":"dc1","cluster":"fdb-cluster-1","connectionString":["fdb_cluster_1:jBYTGxM4zAJ2fdbYTAWUctcWWs8bNS13@10.113.237.132:4501,10.113.237.134:4503,10.113.237.138:4503,10.113.237.152:4501,10.113.237.167:4503,10.113.237.171:4501,10.113.237.172:4503,10.113.237.173:4503,10.113.237.181:4503","fdb_cluster_1:VXpHlwDugQ4N9oTlUTQDIwDS1aZf18BP@10.113.237.144:4501,10.113.237.145:4501,10.113.237.150:4501,10.113.237.148:4501,10.113.237.136:4501"]}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to get connection string from cluster","namespace":"dc1","cluster":"fdb-cluster-1","connectionString":"fdb_cluster_1:jBYTGxM4zAJ2fdbYTAWUctcWWs8bNS13@10.113.237.132:4501,10.113.237.134:4503,10.113.237.138:4503,10.113.237.152:4501,10.113.237.167:4503,10.113.237.171:4501,10.113.237.172:4503,10.113.237.173:4503,10.113.237.181:4503"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd/coordinators"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd/coordinators"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Chose connection option","namespace":"dc1","cluster":"fdb-cluster-1","connectionString":"fdb_cluster_1:jBYTGxM4zAJ2fdbYTAWUctcWWs8bNS13@10.113.237.132:4501,10.113.237.134:4503,10.113.237.138:4503,10.113.237.152:4501,10.113.237.167:4503,10.113.237.171:4501,10.113.237.172:4503,10.113.237.173:4503,10.113.237.181:4503"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Disable taint feature","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","Disabled":true}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","duration_seconds":0.112349926}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateLockConfiguration"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateLockConfiguration","duration_seconds":0.000006051}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateConfigMap"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateConfigMap","duration_seconds":0.000147837}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.checkClientCompatibility"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.checkClientCompatibility","duration_seconds":0.000014101}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.deletePodsForBuggification"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.deletePodsForBuggification","duration_seconds":0.000503897}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceMisconfiguredProcessGroups"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceMisconfiguredProcessGroups","duration_seconds":0.001388085}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceFailedProcessGroups"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Check desired fault tolerance","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceFailedProcessGroups","expectedFaultTolerance":2,"maxZoneFailuresWithoutLosingData":-1,"maxZoneFailuresWithoutLosingAvailability":-1}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceFailedProcessGroups","duration_seconds":0.000012672}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addProcessGroups"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addProcessGroups","duration_seconds":0.000021395}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addServices"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addServices","duration_seconds":0.00002893}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPVCs"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPVCs","duration_seconds":0.000232452}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPods"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPods","duration_seconds":0.000455288}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.generateInitialClusterFile"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.generateInitialClusterFile","duration_seconds":0.00000533}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeIncompatibleProcesses"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeIncompatibleProcesses","duration_seconds":0.000011198}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateSidecarVersions"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateSidecarVersions","duration_seconds":0.000002623}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePodConfig"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePodConfig","duration_seconds":0.000850175}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateMetadata"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateMetadata","duration_seconds":0.000577766}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateDatabaseConfiguration"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateDatabaseConfiguration","duration_seconds":0.000055063}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.chooseRemovals"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.chooseRemovals","duration_seconds":0.000054176}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.excludeProcesses"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.excludeProcesses","duration_seconds":0.000020555}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.changeCoordinators"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.changeCoordinators","duration_seconds":0.000960452}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses","duration_seconds":0.000116449}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.maintenanceModeChecker"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.maintenanceModeChecker","duration_seconds":0.000003884}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePods"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePods","duration_seconds":0.001538568}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeProcessGroups"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeProcessGroups","duration_seconds":0.000048341}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeServices"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeServices","duration_seconds":0.000007047}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Disable taint feature","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","Disabled":true}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","duration_seconds":0.080311831}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Reconciliation complete","namespace":"dc1","cluster":"fdb-cluster-1","generation":2}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Reconciliation run finished","namespace":"dc1","cluster":"fdb-cluster-1","duration_seconds":0.362400171,"cacheStatus":true}


  • I also have to delete the PVCs in the namespace, otherwise the pods do not start again

That probably means all data on those PVCs is lost. The Pods are probably not being scheduled because the PVC cannot be bound to the previously existing PV (since you removed the node).
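
In that situation, one way to confirm and clean up is sketched below, assuming the local PVs went to a Released or Failed state when the node was deleted (the PVC name is illustrative):

# local PVs whose backing node is gone typically show Released or Failed
kubectl get pv -o wide

# PVCs in the dc1 namespace that can no longer bind stay Pending
kubectl -n dc1 get pvc

# deleting a stuck PVC lets the operator provision a fresh one (its data is gone anyway)
kubectl -n dc1 delete pvc fdb-cluster-1-storage-1-data   # illustrative PVC name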

  • I use the local storage operator to create a StorageClass (from which the PVCs are provisioned) out of the VMs' NVMe disks; a sketch of such a class follows below
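
For reference, a local StorageClass of this kind usually looks like the sketch below; the name is illustrative, while the no-provisioner setting and WaitForFirstConsumer binding are the standard choices for pre-provisioned local volumes:

kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme                            # illustrative name
provisioner: kubernetes.io/no-provisioner     # PVs are pre-created by the local storage operator
volumeBindingMode: WaitForFirstConsumer       # bind only once a consuming pod is scheduled
EOF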

Basically what you are testing is what happens when a whole DC gets all resources deleted, including the data. Is that the intention?

Yes, the intention is to test losing the entire AZ, including the VMs and their data, and then let the FDB cluster replicate the data back to the primary.

I see some storage servers lagging, and I assume this is related to networking, which we do not control in the cloud. Am I right in that assumption? How can I remediate it?

bash kubectl-fdb-plugin.sh "fdb analyze fdb-cluster-1"                  
Checking cluster: dc1/fdb-cluster-1
✔ Cluster is available
✔ Cluster is fully replicated
✔ Cluster is reconciled
✔ ProcessGroups are all in ready condition
✔ Pods are all running and available
Checking cluster: dc1/fdb-cluster-1 with auto-fix: false
✖ Process: dc1-storage-10 with address: 10.113.237.132:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-9 with address: 10.113.237.139:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-16 with address: 10.113.237.137:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-12 with address: 10.113.237.147:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-8 with address: 10.113.237.144:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-6 with address: 10.113.237.142:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-1 with address: 10.113.237.135:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-11 with address: 10.113.237.148:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-17 with address: 10.113.237.143:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-7 with address: 10.113.237.136:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-2 with address: 10.113.237.133:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-14 with address: 10.113.237.141:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-11 with address: 10.113.237.148:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-10 with address: 10.113.237.132:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-4 with address: 10.113.237.150:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-8 with address: 10.113.237.144:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-12 with address: 10.113.237.147:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-17 with address: 10.113.237.143:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-1 with address: 10.113.237.135:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-6 with address: 10.113.237.142:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-5 with address: 10.113.237.140:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-5 with address: 10.113.237.140:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-15 with address: 10.113.237.138:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-15 with address: 10.113.237.138:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-3 with address: 10.113.237.146:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-9 with address: 10.113.237.139:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-7 with address: 10.113.237.136:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-13 with address: 10.113.237.145:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-4 with address: 10.113.237.150:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-3 with address: 10.113.237.146:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-2 with address: 10.113.237.133:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-13 with address: 10.113.237.145:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-16 with address: 10.113.237.137:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-14 with address: 10.113.237.141:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
Error: 
found issues in status json for cluster fdb-cluster-1. Please check them
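
To put numbers behind the storage_server_lagging flags above, per-server lag can be read out of the machine-readable status; a sketch, assuming jq is available and using an illustrative pod name:

kubectl -n dc1 exec -it fdb-cluster-1-storage-1 -- fdbcli --exec 'status json' \
  | jq '[.cluster.processes[]
         | .address as $addr
         | .roles[]?
         | select(.role == "storage")
         | {address: $addr, lag_seconds: .data_lag.seconds}]
        | sort_by(-.lag_seconds) | .[:5]'     # the five most lagging storage servers

A lag that drains over time after a failover is expected; a lag that keeps growing generally points at disk or network throughput on the storage side rather than at the cluster configuration.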

Hi, can you please validate my multi_dc design? I am basically following https://github.com/FoundationDB/fdb-kubernetes-operator/tree/main/config/tests/multi_dc, with:

  • 1 k8s cluster
  • 3 FDB clusters running in 3 separate namespaces; each cluster is controlled by its own operator pod.

Q: Would it be better to deploy all FDB instances in one namespace and let a single operator pod manage all three, since the pods get scheduled onto different AZ nodes anyway?

@johscheuer the fix for my problem was adding killProcesses: true, which allows the operator to bounce fdbserver processes. The question is: how safe is it to use this feature in a production system for day-to-day operations? Thanks


That setting can be considered safe and is enabled by default. If it is set to false, the operator is not able to perform any upgrades on the FDB cluster and cannot roll out any new knobs.
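
For reference, a minimal sketch of applying that setting to the cluster from this thread; the field path assumes the automationOptions section of the FoundationDBCluster spec:

# allow the operator to bounce fdbserver processes again
kubectl -n dc1 patch foundationdbcluster fdb-cluster-1 --type merge \
  -p '{"spec":{"automationOptions":{"killProcesses":true}}}'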

The issue is back! Similar setup: when I fail dc1 and dc3 becomes the primary, the data distributor is stuck. The DB itself is operational, but I'm not sure how to get rid of these errors.

fdb> status details

WARNING: Long delay (Ctrl-C to interrupt)

Using cluster file `/var/dynamic-conf/fdb.cluster'.

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  10.113.181.133:4500:tls  (reachable)
  10.113.181.134:4500:tls  (reachable)
  10.113.181.135:4500:tls  (reachable)
  10.113.181.137:4506:tls  (unreachable)
  10.113.181.143:4512:tls  (unreachable)
  10.113.181.148:4508:tls  (unreachable)
  10.113.181.162:4506:tls  (reachable)
  10.113.181.186:4500:tls  (reachable)
  10.113.181.189:4506:tls  (reachable)

Unable to start batch priority transaction after 5 seconds.

105 client(s) reported: Cluster file contents do not match current cluster connection string. Verify the cluster file and its parent directory are writable and that the cluster file has not been overwritten externally.
  10.113.181.70:45960
  10.113.181.73:60212
  10.113.181.74:58594
  10.113.181.74:58616
  ...

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-redwood-1-experimental
  Coordinators           - 9
  Desired Commit Proxies - 6
  Desired GRV Proxies    - 6
  Desired Resolvers      - 1
  Desired Logs           - 12
  Desired Remote Logs    - -1
  Desired Log Routers    - -1
  Usable Regions         - 2
  Regions: 
    Remote -
        Datacenter                    - dc1
        Satellite datacenters         - dc2, dc3
        Satellite Redundancy Mode     - one_satellite_triple
        Satellite Logs                - 3
    Primary -
        Datacenter                    - dc3
        Satellite datacenters         - dc2, dc1
        Satellite Redundancy Mode     - one_satellite_triple
        Satellite Logs                - 3

Cluster:
  FoundationDB processes - 135
  Zones                  - 36
  Machines               - 36
  Memory availability    - 5.8 GB per process on machine with least available
  Retransmissions rate   - 753 Hz
  Fault Tolerance        - -1 machines

  Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 74 begin: 38874972420 end: 38994907062, missing log interfaces(id,address): 71335d2656850118,10.113.181.140:4504:tls 7f0fa163889b3e00, c3b211c348b085c4,10.113.181.140:4508:tls 9c0d9c63a6f092b5,10.113.181.152:4504:tls d7fd29f2559b7049, adc93d4fab98c806,10.113.181.152:4508:tls 57e0f3a0e00d2d10, ed123c246aa9cb12,10.113.181.140:4500:tls 7bab49aa66bc049a,10.113.181.140:4506:tls bc0118120edcce82,10.113.181.152:4502:tls 6116d2ca14060926, 75b5a139a3fa4571,10.113.181.152:4510:tls 
  Old log epoch: 71 begin: 38759330507 end: 38874972420, missing log interfaces(id,address): 71e712ddab78ff3a, 9a5e2eaaf838e1ee, cb39710dba26c2b5,10.113.181.152:4504:tls 24a88d75574553da, a1775ea3a4e65898,10.113.181.140:4502:tls e706bb456fbb358f, 865e548b416bf57c,10.113.181.140:4500:tls 3400e4833e512637, 0a31a2b25ce98081, 4001e2d5c0182fac,10.113.181.152:4510:tls 930a36b80f4daa62, 2559e06c8e8b70d5,10.113.181.140:4510:tls 
  Old log epoch: 68 begin: 38657521349 end: 38759330507, missing log interfaces(id,address): ed301cb66419898d, ec100e44aac75fe4, 0780f861bfb5bb77,10.113.181.140:4508:tls 39a9d64b24719a0c,10.113.181.152:4508:tls f91e5b8fbbfa5480, 6abc4eaf50905f6c, 3514be4e186985fe, fbbf7813c128e9cb,10.113.181.152:4502:tls 5cf3d8f449817cec, aeff656124c157a0, 5dd3274195146095, 793b8499f0e77246,10.113.181.140:4510:tls 
  Old log epoch: 64 begin: 38547485201 end: 38657521349, missing log interfaces(id,address): a26b6fc942e654dd, 30e160a7c612cd9c, 00168df61b4310f4,10.113.181.152:4512:tls 0c79121e8fb675c3,10.113.181.152:4506:tls ae01d56975be4178, 731de2c70562f2ad, 0380336b2d93d9a4,10.113.181.140:4506:tls ca69af6fd8a24635, 455ecb2874a31da6, 219bea2df920c7e8, 29965f9b060c10bf,10.113.181.140:4510:tls 27d9c3d2a84949b6, 
  Old log epoch: 62 begin: 38437398767 end: 38547485201, missing log interfaces(id,address): cd735a2edbd0915e, 9eb2f20f30b5c10d,10.113.181.140:4504:tls 203c7fa35261f1bf, c2bdce2f6ad30cfd,10.113.181.140:4512:tls afd50f34d42b3e33, 0380dbc2f92e6054, 4d191f730b681eac, d5ff43efcd0cbacb, db0fcc67bf4d9ba3, b76187797292a3ee,10.113.181.152:4502:tls 8fc59a925b7f4324, b609db2119fb4bc2, 
  Old log epoch: 58 begin: 38319854028 end: 38437398767, missing log interfaces(id,address): 05eb1b2b107845ea, 6e780dfeb94c9568, 35c38dda8db5374a,10.113.181.140:4508:tls 47b4390d6c070bf5, bfb3f5ea346c1633, 9fffb1ea8707c974, caf3389b790921a9, c18f79c615067707,10.113.181.140:4506:tls 9d44d19e8275536c, 03ae3212a5a2d1d4,10.113.181.152:4502:tls 55e6aa8e1ddcb76d, 4cf64ddffedd039d, 
  Old log epoch: 56 begin: 38203476708 end: 38319854028, missing log interfaces(id,address): 53a6bc5f8c5c3b3b, 49b33abba501770a,10.113.181.140:4504:tls 4d81f8137f0cc950, 3291b1ed30a811bd, ce211606b7bdd22c, 441741e98ee6250f, ddaa7f74d2f461c9, bb24ff7197041205, 806e6749a64041d8, 011e267e31727dae,10.113.181.152:4510:tls 52d4acba46e95479, 02957ee5ab13e727,10.113.181.140:4510:tls 
  Old log epoch: 54 begin: 38086066158 end: 38203476708, missing log interfaces(id,address): 3a7de37e7b7f0737, c4c4ab4ca437e297,10.113.181.152:4512:tls 5bd9dfc1cfc1cb3a, 1f7a31df652dbaa1, ea9173de13454aba, 60af4818c0462dc0,10.113.181.140:4502:tls be5d1bc8756d88d6, 92e3071dc4a887d5, 9d2f36841ffc5e0b, 966346dcdd75471f, 0ee6673b39c47f9c, fb6e4dd817a1ae00, 
  Old log epoch: 52 begin: 37970818507 end: 38086066158, missing log interfaces(id,address): 5278e37d48583a0f, 8a5948fcf1e832d6,10.113.181.140:4504:tls 27ad26fe80c5f36e, b5b700ec92444f4e, 13ded6ffe4fe8181, 2cf01597eabdc204, e00111808f94a8e8, 08cc6c1fba28e915, c2978afdfe3e032d, 4d3da7fd129d2521, b015a1306d3dad07, e37ca16b22f367fd, 
  Old log epoch: 49 begin: 37853757189 end: 37970818507, missing log interfaces(id,address): bd4004c3c562946c, 3bac3402b5e52fa0, 8ba9861df9511d7e, 1a0cfa8455e6d25d, e3d7c8680103efcc, ef9434cf7bd31a27, e73b7e98abcd4110, 3e6f13fb299fc22a, 0242830cbd9cc752, 8e61b95207e9d897,10.113.181.140:4510:tls 6fe6714d14906dc6, 0a94aafd3e05a15e, 
  Old log epoch: 46 begin: 37753602931 end: 37853757189, missing log interfaces(id,address): 766bf3a4c45549a6, 83801643eaeaa682, 6ec6e0fc3832481c,10.113.181.140:4508:tls 24dbfc38339f9654, 94cd143fa262beb9, 9c69a531f4b16207, baeafd7c130855e7, 84f23b4dd1e94613, ccd1512c952c9b01, ed1753d9adf76d29, 46ce8e83291c5a96, 70227e0d4a5b0719, 
  Old log epoch: 44 begin: 37639144151 end: 37753602931, missing log interfaces(id,address): 5763961e16e81da2, 8f25b5b3272473b2, 82de555ec4269fbd,10.113.181.140:4508:tls fc5bcb92201e3fe1, 8ee1a2dccc41e13e, c26794966bc8d46f, 3d9d649144fef4ba, 09b729bf9cd8efec, f9a5baa71b8d9806, 0e29836c9e792c30, 463c4c8c7527c86c, 5f92865a6ec0f7f5, 
  Old log epoch: 42 begin: 37513640716 end: 37639144151, missing log interfaces(id,address): d65510220b4be792, b36e2490e306e6d4, e9083b342281d112, 8cf3d37a8e2d2ccb, d34a14b6c88618ca, 799e3bae96a17241, 7dbe611e2282ac53, bc824e43a51c8445, 2a1eeab3e15694dd, 16a92180a3f9117d, 953d7efe36c21fb0, 47f8aec863759814, 
  Old log epoch: 39 begin: 37399909539 end: 37513640716, missing log interfaces(id,address): c3bc61b3684a39a8, ca869aa2bf6a86d3, d0c5345c5db884fc, 99ab6955d6890575, e3b57fb3db1aa6b9, b3464b5bb7fb1963, f98ed9d852de8b36, d2e3e1f57b5c9ac3, 7b521424256e70aa, a1d52f3b534883d3, 8eb1b2f70a80b19d, 9537003560d4af05, 
  Old log epoch: 36 begin: 37282570409 end: 37399909539, missing log interfaces(id,address): a97764d71df46cdf, 0886383d2d74662e, e37cc349278ef55d, 18a11496cb987f7c, fc53489f6c06c88f, 04506b4ab793a501, c270ba1afe3db622, e200e61a40becb79, 66722cf88ef8effa, c7fe0835c8a9f18d, c3f6af36a5fb6540, 8b123d48e6dd5d54, 
  Old log epoch: 34 begin: 37168705147 end: 37282570409, missing log interfaces(id,address): f503b989587d17d3, 0fadce5d18eb3861, 142cd478ad1e8f6d, 61ece9e71989949a, 81d46d4560a5a6ef, ec011a005decd4b3, 905af2a165b0fcc4, d7ac355ff88e14ef, f5c5aa3c06edd52d, 1f9f714cb87ed50f, a1c981818d6896c5, 7e88e526b78d2813, 
  Old log epoch: 32 begin: 37049090218 end: 37168705147, missing log interfaces(id,address): 1b058ce66a9c3a20, 3743ee1675075486, 27b435d9b78dee86, ffb88be7a9b7f347, eca5816ef1b0d711, 136de7fb91eb55f1, e7d5de4e6f72fed4, 1eae9db3c7798cb9, 75d98918317600b6, 34a98307057f398c, 7ac50e0faba07ba4, 6900ef39b80c4401, 
  Old log epoch: 30 begin: 36928933015 end: 37049090218, missing log interfaces(id,address): 7a0c664d36e63e09, 0a7983f242d979c9, 8b14658f7ea1846b, 90990fe9b80a6ba3, c8b0d1c5d1061c3e, 8333e49b48a86d9a, 59e214294ae73c5b, bd382390dfdb9d11, 4406ee34dbecb0e3, 357bd9522941aadb, 2267908b9a3011d0, 82b84e8c6c569867, 
  Old log epoch: 28 begin: 36808487925 end: 36928933015, missing log interfaces(id,address): 37a67f6865d0ab92, 744ebbb5e9188080, 0f12e31ebf7d2946, ec97c0912ee481dd, 6fa0acf88b05f74e, ff7323430aa35115, 791dc5ce76d8b15c, 9d0775d6ad644c7b, 72fc511441a04379, a70671158aad8adb, d3ccbe19d0995fcf, 5f3f17a08538d0a2, 
  Old log epoch: 26 begin: 36688408186 end: 36808487925, missing log interfaces(id,address): 85558b8735300770, 262135ed05a89419, c1ae2b5f0ca07b0e, 54df9f7206b70411, 5052a642c26b3331, c58ac6c3b2a93982, df6bd8fa3396cc9f, 601bf024a19418b9, c61cc8ab5c0c38b6, 33c08a7cfb98dae0, 9150fdd4ca49235e, a1bc3b75d8443669, 
  Old log epoch: 24 begin: 36567450619 end: 36688408186, missing log interfaces(id,address): 43a41e0c7e477e28, e2f4b9546a7eb502, 9075d6b9c2bcbf95, b94c6eee865e4b3f, 3a1ac6c2910ff4f6, f5f708a957db5836, f9520838c5c8c621, 5c29d8faf70683af, cc719760f454a957, 68cd6ac40879ba82, c2698ae510b65c44, 58f1b060203d8f38, 
  Old log epoch: 22 begin: 36446461954 end: 36567450619, missing log interfaces(id,address): 0f585d18b10f93ad, 9f25ee184f987179, e754e03e9a153bcb, 9b34d536d238919a, 5a2cc2a8f1aa22d6, f9aecd3db82672ec, be9c51c0b36f2d6d, 24877de37d59b139, 6d0f19dfc29ff6d5, 4aa6b0ec8beb4a82, 41f8fbcc0717b2a4, 387cea7717c33b9d, 
  Old log epoch: 20 begin: 36331579736 end: 36446461954, missing log interfaces(id,address): b34bbe757e1eb960, 9ab29bdaf049ebaa, 08e831dca4c3c89c, dac5c5d56ec4e470, adb9142b7aa54186, d7aa2db83b88a9b8, 0f0154e07929b766, 0b6d277c18f31e64, 17fece7b35d9f561, a6b847b5084a25d6, 51a39c808bae5c79, 47918044bf6bd13d, 
  Old log epoch: 18 begin: 36211817247 end: 36331579736, missing log interfaces(id,address): d84134d0c850ab33, dbd5296490a10c40, 530c48db7fa32273, e7c9a66f8c6e75d6, 17801f8907265aab, c70f80ace49b91a9, 3c9d0acf7c44ebe5, ecd339ace265a7d4, 04c10668a6b42574, 8eafc872aeda059c, cda96e2a2e56e8b0, 8c78e5dc7a1e5f79, 
  Old log epoch: 16 begin: 36089151863 end: 36211817247, missing log interfaces(id,address): b219dba859020dfb, e301a35693f2f369, aa3a1a9713859a4a, 128e3915f60b3ecd, 24f91b9556f70761, 632019f989957983, 309d4cad6535ac4f, a6a4e923503c5b30, 4c75abdb4b9ee74e, f160414f80e52c3f, 27f4b71b2bba1f27, 229573f9bc951605, 
  Old log epoch: 14 begin: 35966709485 end: 36089151863, missing log interfaces(id,address): 1023eb6c9a5d75b9, 03fa35250f0bd1c7, 7865bb56320fc903, f1a452941f9d4e40, 5c0f3b5db35f2c58, 0f7c6a6e2a317ed6, 3fb6d2c5b91623fc, f525f31a91cb7f48, bc116c1f4f0a805a, bbb1b45bbc9550ed, 304cd8e9cbc24af5, d58ac0f37e343103, 
  Old log epoch: 12 begin: 35850783780 end: 35966709485, missing log interfaces(id,address): 72d599c0fb3633fc, 21328ec9671b1edc, b699f8aa8b1b5d1e, 2191653427b609d8, 52799373235ffcb4, 8622332cd7a2adbb, 384297d43d067b54, 85d69debe728882f, b408de57c37bf600, 45990919c3449ec5, 527f08ab48ed9095, 79e89e09f179d489, 
  Old log epoch: 10 begin: 579171909 end: 35850783780, missing log interfaces(id,address): fd16724cfb1b1508, c6345ef9b64f15df, 6eb0c16f75b402b5, 968e586fce45f37d, 94d56e879689af20, d229b5c24a764068, f04f3b83b57f137c, 0bd478a569dd5406, dcc95c4cd4c772e5, bbb61ced3513fb3a, f0e7c3fa2d9e3478, 4f641f4baa4b6b77, 

  Server time            - 05/21/24 15:24:44

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 5.393 TB

Operating space:
  Storage server         - 1441.9 GB free on most full server
  Log server             - 824.7 GB free on most full server

Workload:
  Read rate              - 384707 Hz
  Write rate             - 34087 Hz
  Transactions started   - 3864 Hz
  Transactions committed - 3756 Hz
  Conflict rate          - 4 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0
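
The tlog warning above asks for the listed interfaces to be restarted. A sketch of doing that from fdbcli, with an illustrative pod name and addresses taken from the warning; the first kill only fetches the interface list, the second bounces the named processes, and coordinators auto can then replace the unreachable coordinators:

kubectl -n dc3 exec -it fdb-cluster-1-log-1 -- fdbcli
fdb> kill                                                   # populate the list of known process interfaces
fdb> kill 10.113.181.140:4504:tls 10.113.181.152:4504:tls   # bounce the tlogs named in the warning
fdb> coordinators auto                                      # re-pick coordinators to replace unreachable ones

Note that if the pods behind those interfaces were deleted together with their data, there is no process left to restart for them.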

I’m able to reproduce this with a test cluster that uses the triple configuration. I don’t think this is an operator issue; it may be something misreporting. @jzhou do you know why these messages could come up in a triple configuration? I used a similar configuration (one_satellite_double instead of one_satellite_triple, and the ssd-2 storage engine). This can be reproduced reliably with FDB version 7.1.57.

As you may be aware, we no longer use the DR feature, so fixing the issue is low on our list. If this is reproducible in simulation, it has a better chance of being fixed.

@jzhou - To clarify, the above issue was encountered when running the multi-region configuration with one_satellite_triple (and not the older DR configuration).