Multi-DC replication fails during DR test

Setup:

  • multi_dc setup with 1 k8s cluster and 3 namespaces that map to 3 DCs

  • dc1 serves as Primary; dc3 as Secondary; dc2 as Satellite
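
For reference, a minimal sketch of how this topology can be expressed per namespace, loosely following the operator's multi_dc example (the apiVersion, cluster name, priorities, and satellite redundancy mode below are assumptions, not the exact manifests used here):

  # Hypothetical dc1 manifest; the dc2/dc3 CRs would differ mainly in spec.dataCenter.
  kubectl -n dc1 apply -f - <<'EOF'
  apiVersion: apps.foundationdb.org/v1beta2
  kind: FoundationDBCluster
  metadata:
    name: fdb-cluster-1
  spec:
    dataCenter: dc1
    databaseConfiguration:
      redundancy_mode: triple
      usable_regions: 2
      regions:
        - datacenters:
            - id: dc1
              priority: 1        # preferred primary
            - id: dc2
              priority: 1
              satellite: 1
            - id: dc3
              priority: 0
              satellite: 1
          satellite_logs: 3
          satellite_redundancy_mode: one_satellite_double   # assumption
        - datacenters:
            - id: dc3
              priority: 0        # remote
            - id: dc2
              priority: 1
              satellite: 1
            - id: dc1
              priority: 0
              satellite: 1
          satellite_logs: 3
          satellite_redundancy_mode: one_satellite_double   # assumption
  EOF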

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 9
  Desired Commit Proxies - 2
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 3
  Desired Remote Logs    - 3
  Desired Log Routers    - 3
  Usable Regions         - 2
  Regions: 
    Primary -
        Datacenter                    - dc1
        Satellite datacenters         - dc2, dc3
        Satellite Logs                - 3
    Remote -
        Datacenter                    - dc3
        Satellite datacenters         - dc2, dc1
        Satellite Logs                - 3

Cluster:
  FoundationDB processes - 81
  Zones                  - 47
  Machines               - 47
  Memory availability    - 8.0 GB per process on machine with least available
  Retransmissions rate   - 0 Hz
  Fault Tolerance        - 2 machines
  Server time            - 09/17/23 01:59:16

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 1.158 TB
  Disk space used        - 8.639 TB

The problem:

During a DR test, I simulated the primary (dc1) going down, brought it back up, and let the data replicate back, but at the moment it is stuck (see the errors below).

  • I would like to understand why this happened, how to recover, and how to prevent this type of situation.
  • I noticed that the FDB operator added a storage pod in the dc1 namespace; from the k8s perspective, how is the recovery process handled?
Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 9
  Desired Commit Proxies - 2
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 3
  Desired Remote Logs    - 3
  Desired Log Routers    - 3
  Usable Regions         - 2
  Regions: 
    Remote -
        Datacenter                    - dc1
        Satellite datacenters         - dc2, dc3
        Satellite Logs                - 3
    Primary -
        Datacenter                    - dc3
        Satellite datacenters         - dc2, dc1
        Satellite Logs                - 3

Cluster:
  FoundationDB processes - 81 (less 0 excluded; 12 with errors)
  Zones                  - 47
  Machines               - 47
  Memory availability    - 8.0 GB per process on machine with least available
  Retransmissions rate   - 1 Hz
  Fault Tolerance        - -1 machines

  Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 253 begin: 752008734932 end: 752125253003, missing log interfaces(id,address): 55cfd4bc6e3b0a8f, 9ab3ab3fdefabd20, 91fa9bbe4783efbb, 
  Old log epoch: 250 begin: 751872585983 end: 752008734932, missing log interfaces(id,address): 055aa740d22950ef, 2bfa78e37a663789, bc810043cb1a881b, 
  Old log epoch: 248 begin: 751768423814 end: 751872585983, missing log interfaces(id,address): 60b3598fd80246f7, e841f74ebd3955bb, 0e8b32bcfc4a4ae2, 
  Old log epoch: 246 begin: 751657077872 end: 751768423814, missing log interfaces(id,address): 730ccd96ca1256b0, a77e054a4e8f1279, 0c79570a7b4707d0, 
  Old log epoch: 244 begin: 751552887228 end: 751657077872, missing log interfaces(id,address): 0ed852c5e99c862c, d28c7dae960cea90, 991deec06ebdd74d, 
  Old log epoch: 242 begin: 751440887046 end: 751552887228, missing log interfaces(id,address): 15e823d14fde4dfb, 3d716439331bb3be, c82bcf34b648311e, 
  Old log epoch: 240 begin: 751332177172 end: 751440887046, missing log interfaces(id,address): 68c1e97818bf5ef4, 0abc75e178f47ee7, 81991d47565c137a, 
  Old log epoch: 237 begin: 751187559126 end: 751332177172, missing log interfaces(id,address): 1493a5a1f23cf477, 0960fcaf02bd12e3, f76dfe8ef10992f5, 
  Old log epoch: 235 begin: 751050558192 end: 751187559126, missing log interfaces(id,address): 0b38863b2b4e2ceb, 725bcab59e5c9ea0, 1de0783cc2013219, 
  Old log epoch: 233 begin: 747960335254 end: 751050558192, missing log interfaces(id,address): 3c4efb7aa938ab14, f286a82bf78eeae2, c0ce84e3fedb16c5, 

  Server time            - 09/17/23 14:30:07

Data:
  Replication health     - (Re)initializing automatic data distribution
  Moving data            - unknown (initializing)
  Sum of key-value sizes - unknown
  Disk space used        - 4.287 TB

Operating space:
  Storage server         - 1541.7 GB free on most full server
  Log server             - 1542.0 GB free on most full server

Workload:
  Read rate              - 56 Hz
  Write rate             - 0 Hz
  Transactions started   - 19 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.113.237.132:4501    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.133:4501    (  1% cpu;  2% machine; 0.002 Gbps;  0% disk IO; 5.7 GB / 8.0 GB RAM  )
  10.113.237.133:4503    (  2% cpu;  2% machine; 0.002 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.134:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 350 seconds.
  10.113.237.134:4503    (  0% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.135:4501    (  1% cpu;  1% machine; 0.001 Gbps;  4% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 342 seconds.
  10.113.237.135:4503    (  0% cpu;  1% machine; 0.001 Gbps;  4% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.136:4501    (  1% cpu;  1% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 340 seconds.
  10.113.237.136:4503    (  1% cpu;  1% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 280 seconds.
  10.113.237.137:4501    (  0% cpu;  1% machine; 0.001 Gbps;  2% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.137:4503    (  1% cpu;  1% machine; 0.001 Gbps;  2% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 280 seconds.
  10.113.237.138:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 280 seconds.
  10.113.237.138:4503    (  0% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.139:4501    (  0% cpu;  1% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.140:4501    (  1% cpu;  1% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 249 seconds.
  10.113.237.140:4503    (  1% cpu;  1% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 249 seconds.
  10.113.237.141:4501    (  1% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 250 seconds.
  10.113.237.141:4503    (  0% cpu;  1% machine; 0.001 Gbps;  1% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.142:4501    (  0% cpu;  1% machine; 0.001 Gbps;  3% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 315 seconds.
  10.113.237.142:4503    (  1% cpu;  1% machine; 0.001 Gbps;  3% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 315 seconds.
  10.113.237.143:4501    (  0% cpu;  2% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.143:4503    (  0% cpu;  2% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
    Storage server lagging by 250 seconds.
  10.113.237.144:4501    (  0% cpu;  2% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.144:4503    (  0% cpu;  2% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.145:4501    (  0% cpu;  2% machine; 0.001 Gbps;  9% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.145:4503    (  0% cpu;  2% machine; 0.001 Gbps;  9% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.146:4501    (  1% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.147:4501    (  1% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.147:4503    (  0% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.148:4501    (  0% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.148:4503    (  0% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.149:4501    (  0% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.149:4503    (  0% cpu;  2% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.150:4501    (  0% cpu;  1% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.150:4503    (  0% cpu;  1% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.151:4501    (  0% cpu;  1% machine; 0.001 Gbps;  7% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.151:4503    (  0% cpu;  1% machine; 0.001 Gbps;  7% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.152:4501    (  0% cpu;  1% machine; 0.001 Gbps;  9% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.152:4503    (  0% cpu;  1% machine; 0.001 Gbps;  9% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.153:4501    (  1% cpu;  5% machine; 0.001 Gbps;  2% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.154:4501    (  1% cpu;  6% machine; 0.017 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.155:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.156:4501    (  2% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.156:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.157:4501    (  1% cpu;  4% machine; 0.001 Gbps;  0% disk IO; 5.7 GB / 8.0 GB RAM  )
  10.113.237.157:4503    (  1% cpu;  4% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.158:4501    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.1 GB / 8.0 GB RAM  )
  10.113.237.158:4503    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.9 GB / 8.0 GB RAM  )
  10.113.237.159:4501    (  1% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.159:4503    (  1% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.160:4501    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.160:4503    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.161:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.161:4503    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.162:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.162:4503    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.163:4501    (  1% cpu;  2% machine; 0.017 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.163:4503    (  1% cpu;  2% machine; 0.017 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.164:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.6 GB / 8.0 GB RAM  )
  10.113.237.165:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.165:4503    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.166:4501    (  1% cpu;  5% machine; 0.002 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.166:4503    (  1% cpu;  5% machine; 0.002 Gbps;  0% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.167:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.167:4503    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.168:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 4.1 GB / 8.0 GB RAM  )
  10.113.237.168:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.169:4501    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.169:4503    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.170:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.170:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.171:4501    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.171:4503    (  1% cpu;  1% machine; 0.000 Gbps;  0% disk IO; 5.5 GB / 8.0 GB RAM  )
  10.113.237.172:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 4.7 GB / 8.0 GB RAM  )
  10.113.237.172:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 4.9 GB / 8.0 GB RAM  )
  10.113.237.173:4501    (  2% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.6 GB / 8.0 GB RAM  )
  10.113.237.174:4501    (  7% cpu;  5% machine; 0.004 Gbps;  2% disk IO; 0.3 GB / 8.0 GB RAM  )
  10.113.237.175:4501    (  1% cpu;  6% machine; 0.001 Gbps;  2% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.176:4501    (  0% cpu;  5% machine; 0.001 Gbps; 18% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.177:4501    (  6% cpu;  5% machine; 0.006 Gbps;  1% disk IO; 0.3 GB / 8.0 GB RAM  )
  10.113.237.178:4501    (  3% cpu;  4% machine; 0.004 Gbps;  1% disk IO; 0.3 GB / 8.0 GB RAM  )

Coordination servers:
  10.113.237.132:4501  (reachable)
  10.113.237.133:4503  (reachable)
  10.113.237.139:4501  (reachable)
  10.113.237.146:4501  (reachable)
  10.113.237.147:4501  (reachable)
  10.113.237.154:4501  (reachable)
  10.113.237.155:4501  (reachable)
  10.113.237.164:4501  (reachable)
  10.113.237.173:4501  (reachable)

Client time: 09/17/23 14:30:04

Could you share some more details, e.g. how you simulated the failure of dc1? Are you able to share the logs of the k8s operator(s)?

  • I noticed that the FDB operator added a storage pod in the dc1 namespace; from the k8s perspective, how is the recovery process handled?

In theory, if the operator is able to connect to the cluster, it should recreate the resources. If the issue was only a network partition, the operator has nothing to do and the FDB cluster itself would be doing the recovery.

In the current setup, using nodeSelectors and topologySpreadConstraints, every pod (FDB process) is placed on its own k8s node (cloud VM) and uses hostNetwork. The pods/nodes belong to 3 namespaces (dc1, dc2, dc3), but from the cloud infrastructure perspective each namespace is mapped to an AZ. Simulating dc1 failure means terminating all pods in the dc1 namespace (all VMs in AZ1) and bringing them back. The operator pod is free to move around and is not on hostNetwork.
At the moment some processes are stuck and the operator is trying to exclude them.

status:
  processGroups:
    - addresses:
        - 10.113.237.136
      processClass: log
      processGroupID: dc1-log-1
    - addresses:
        - 10.113.237.142
      processClass: log
      processGroupID: dc1-log-2
    - addresses:
        - 10.113.237.180
      processClass: stateless
      processGroupID: dc1-stateless-1
    - addresses:
        - 10.113.237.176
      processClass: stateless
      processGroupID: dc1-stateless-2
    - addresses:
        - 10.113.237.179
      processClass: stateless
      processGroupID: dc1-stateless-3
    - addresses:
        - 10.113.237.147
        - 10.113.237.149
        - 10.113.237.152
        - 10.113.237.155
        - 10.113.237.173
      processClass: storage
      processGroupConditions:
        - timestamp: 1694989929
          type: ResourcesTerminating          <<<<<<
      processGroupID: dc1-storage-1
      removalTimestamp: '2023-09-20T14:02:39Z'
    - addresses:
        - 10.113.237.141
      processClass: storage
      processGroupID: dc1-storage-10
    - addresses:
        - 10.113.237.148
      processClass: storage
      processGroupID: dc1-storage-11
    - addresses:
        - 10.113.237.154
      processClass: storage
      processGroupID: dc1-storage-12
    - addresses:
        - 10.113.237.147
      processClass: storage
      processGroupID: dc1-storage-13
    - addresses:
        - 10.113.237.151
      processClass: storage
      processGroupID: dc1-storage-14
    - addresses:
        - 10.113.237.138
      processClass: storage
      processGroupID: dc1-storage-15
    - addresses:
        - 10.113.237.139
      processClass: storage
      processGroupID: dc1-storage-16
    - addresses:
        - 10.113.237.145
      processClass: storage
      processGroupID: dc1-storage-17
    - addresses:
        - 10.113.237.137
      processClass: storage
      processGroupID: dc1-storage-18
    - addresses:
        - 10.113.237.155
      processClass: storage
      processGroupID: dc1-storage-19
    - addresses:
        - 10.113.237.150
      processClass: storage
      processGroupID: dc1-storage-2
    - addresses:
        - 10.113.237.135
      processClass: storage
      processGroupID: dc1-storage-3
    - addresses:
        - 10.113.237.144
      processClass: storage
      processGroupID: dc1-storage-4
    - addresses:
        - 10.113.237.149
      processClass: storage
      processGroupID: dc1-storage-5
    - addresses:
        - 10.113.237.152
      processClass: storage
      processGroupID: dc1-storage-6
    - addresses:
        - 10.113.237.140
      processClass: storage
      processGroupID: dc1-storage-7
    - addresses:
        - 10.113.237.143
      processClass: storage
      processGroupID: dc1-storage-8
    - addresses:
        - 10.113.237.135
        - 10.113.237.146
      processClass: storage
      processGroupConditions:
        - timestamp: 1695160643
          type: MissingPod                                       <<<<<<<<
      processGroupID: dc1-storage-9
      removalTimestamp: '2023-09-20T14:02:39Z'
  • Some operator pod logs:
"level":"info","ts":1695218561.8718004,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218563.9435072,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218563.948019,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.updateSidecarVersions"}
{"level":"info","ts":1695218563.948068,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.updatePodConfig"}
{"level":"info","ts":1695218563.9488695,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.updateLabels"}
{"level":"info","ts":1695218563.9493287,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.updateDatabaseConfiguration"}
{"level":"info","ts":1695218563.9493825,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218566.0212674,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218566.0240517,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.chooseRemovals"}
{"level":"info","ts":1695218566.0241134,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218568.0982757,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218568.1017754,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.excludeProcesses"}
{"level":"info","ts":1695218568.1018329,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218570.1727428,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218570.1753511,"logger":"controller","msg":"current exclusions","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"excludeProcesses","ex":["10.113.237.134:4501","10.113.237.147","10.113.237.149","10.113.237.152"]}
{"level":"info","ts":1695218570.17548,"logger":"fdbclient","msg":"Running command","namespace":"dc1","cluster":"fdb-cluster-1","path":"/usr/bin/fdb/7.1/fdbcli","args":["/usr/bin/fdb/7.1/fdbcli","--exec","exclude 10.113.237.155 10.113.237.173 10.113.237.135 10.113.237.146","-C","/tmp/4db61298-4806-40b6-9d18-07c5ca38f2c2","--log","--log","--trace_format","xml","--log-dir","/var/log/fdb","--timeout","10"]}
{"level":"error","ts":1695218572.3537273,"logger":"fdbclient","msg":"Error from FDB command","namespace":"dc1","cluster":"fdb-cluster-1","code":1,"stdout":"ERROR: Could not calculate the impact of this exclude on the total free space in the cluster.\nPlease try the exclude again in 30 seconds.\nType `exclude FORCE <ADDRESS...>' to exclude without checking free space.\n","stderr":"","error":"exit status 1","stacktrace":"github.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).runCommandWithBackoff\n\t/workspace/fdbclient/admin_client.go:282\ngithub.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).ExcludeProcesses\n\t/workspace/fdbclient/admin_client.go:432\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.excludeProcesses.reconcile\n\t/workspace/controllers/exclude_processes.go:84\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBClusterReconciler).Reconcile\n\t/workspace/controllers/cluster_controller.go:166\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"info","ts":1695218572.3538575,"logger":"controller","msg":"Delaying requeue for sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.excludeProcesses","message":"","error":"exit status 1"}
{"level":"info","ts":1695218572.3539171,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.changeCoordinators"}
{"level":"info","ts":1695218572.3539553,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218574.425527,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218574.4308026,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.bounceProcesses"}
{"level":"info","ts":1695218574.4308512,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218576.5043163,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218576.5072148,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.maintenanceModeChecker"}
{"level":"info","ts":1695218576.507269,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.updatePods"}
{"level":"info","ts":1695218576.508645,"logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.removeProcessGroups"}
{"level":"info","ts":1695218576.508712,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218578.5825465,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218578.5856998,"logger":"fdbclient","msg":"Filtering excluded processes","namespace":"dc1","cluster":"fdb-cluster-1","inProgress":["10.113.237.147","10.113.237.149","10.113.237.152"],"fullyExcluded":[],"notExcluded":["10.113.237.155","10.113.237.173","10.113.237.135","10.113.237.146"],"missingInStatus":[]}
{"level":"info","ts":1695218578.5857313,"logger":"controller","msg":"Exclusions to complete","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"removeProcessGroups","remainingServers":["10.113.237.155","10.113.237.173","10.113.237.135","10.113.237.146","10.113.237.147","10.113.237.149","10.113.237.152"]}
{"level":"info","ts":1695218578.5857813,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218580.6584988,"logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":1695218580.6614013,"logger":"controller","msg":"Incomplete exclusion still present in removeProcessGroups step","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"removeProcessGroups","processGroupID":"dc1-storage-1","error":"process has missing address in exclusion results: 10.113.237.147"}
{"level":"info","ts":1695218580.661425,"logger":"controller","msg":"Incomplete exclusion still present in removeProcessGroups step","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"removeProcessGroups","processGroupID":"dc1-storage-9","error":"process has missing address in exclusion results: 10.113.237.135"}
{"level":"info","ts":1695218580.6614563,"logger":"controller","msg":"Reconciliation terminated early","namespace":"dc1","cluster":"fdb-cluster-1","subReconciler":"controllers.removeProcessGroups","requeueAfter":0,"message":"Reconciliation needs to exclude more processes"}

Note: I performed several successful tests like this, so I am not sure what went wrong.
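
For the exclude error in the logs above ("Could not calculate the impact of this exclude…"), the stuck exclusions can also be inspected by hand; running exclude without addresses lists the servers currently being excluded:

  # List the addresses currently marked for exclusion.
  fdbcli -C /var/dynamic-conf/fdb.cluster --exec 'exclude'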

Thanks for sharing the logs, I'll take a look tomorrow. Could you share the operator version you used for the test?

fdb: 7.1.26
operator: v1.16.0

Thanks for the help!

Following up on this issue:

  • I was able to replicate the same issue in a different cluster.
  • I used the latest operator version and all pods/processes recovered; nothing of value in the logs, as everything is reconciled.
  • The data is not getting replicated back to the primary; I think this is at the FoundationDB layer.

What is the process to restart TLog interfaces manually?

Using cluster file `/var/dynamic-conf/fdb.cluster'.

Configuration:
  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 9
  Desired Commit Proxies - 2
  Desired GRV Proxies    - 1
  Desired Resolvers      - 1
  Desired Logs           - 3
  Desired Remote Logs    - 3
  Desired Log Routers    - 3
  Usable Regions         - 2
  Regions: 
    Primary -
        Datacenter                    - dc1
        Satellite datacenters         - dc2, dc3
        Satellite Logs                - 3
    Remote -
        Datacenter                    - dc3
        Satellite datacenters         - dc2, dc1
        Satellite Logs                - 3

Cluster:
  FoundationDB processes - 88
  Zones                  - 47
  Machines               - 47
  Memory availability    - 7.2 GB per process on machine with least available
  Retransmissions rate   - 1 Hz
  Fault Tolerance        - -1 machines

  Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
  Old log epoch: 61 begin: 67747007429 end: 67862561162, missing log interfaces(id,address): 702bb3191d379a75, e2f31f2ec89c3ea8, 0e8867a3cedf9d49, 
  Old log epoch: 57 begin: 67633217945 end: 67747007429, missing log interfaces(id,address): b78ad3e928f598b3, 347462edcd47ca65, ef55f1c0644d9947, 
  Old log epoch: 54 begin: 67522181213 end: 67633217945, missing log interfaces(id,address): 53a5357d543c3689, d2226fd86c5e4438, 03d2166c63706f71, 
  Old log epoch: 52 begin: 67392885066 end: 67522181213, missing log interfaces(id,address): c115949861718c45, 6f94496c83dbfa4d, a29f67a398b280f0, 
  Old log epoch: 50 begin: 67276398632 end: 67392885066, missing log interfaces(id,address): 1ee2a591852d31f8, d55326815f9961fd, c0a8494a55bc0801, 
  Old log epoch: 47 begin: 67160817068 end: 67276398632, missing log interfaces(id,address): 7f5eee096b3696fd, d98c85335819c2c2, c222722bc97d9ba3, 
  Old log epoch: 45 begin: 67040025375 end: 67160817068, missing log interfaces(id,address): 444e5c6f3f8baeaa, 83bb01e32d493108, dac3f70601069d4b, 
  Old log epoch: 43 begin: 66922430237 end: 67040025375, missing log interfaces(id,address): 8c5201059109ec7a, eb9fdc998f238375, 9f9c019b466d372d, 
  Old log epoch: 41 begin: 66797519416 end: 66922430237, missing log interfaces(id,address): 1adc6b952c144d1e, 5df11c41622c8012, c74557cd9aafe021, 
  Old log epoch: 39 begin: 66689005581 end: 66797519416, missing log interfaces(id,address): 42cce254c5a04d12, 72265d0fb2500d78, ed0242fd8c8211ed, 
  Old log epoch: 37 begin: 66573370304 end: 66689005581, missing log interfaces(id,address): fea5c6457e5ecd86, c3e17575ac55487b, 064c7fd246c37ef7, 
  Old log epoch: 35 begin: 66405317743 end: 66573370304, missing log interfaces(id,address): c557b060d856f82f, 044bb1ff6e8ac834, 0342f98cd441be44, 
  Old log epoch: 33 begin: 66295640128 end: 66405317743, missing log interfaces(id,address): c4653c3267281519, f0bbc162395d2017, 75c42aad636f5f3e, 
  Old log epoch: 31 begin: 59531875452 end: 66295640128, missing log interfaces(id,address): e2539e896af33195, ec5a827750f277a1, 82791cad4b47b650, 

  Server time            - 09/22/23 01:49:50

Data:
  Replication health     - UNHEALTHY: No replicas remain of some data
  Moving data            - 807.670 GB
  Sum of key-value sizes - 1.158 TB
  Disk space used        - 6.196 TB

Operating space:
  Storage server         - 1744.2 GB free on most full server
  Log server             - 1746.4 GB free on most full server

Workload:
  Read rate              - 198 Hz
  Write rate             - 0 Hz
  Transactions started   - 44 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.113.237.132:4501    (  1% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.132:4503    (  1% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 0.4 GB / 8.0 GB RAM  )
  10.113.237.133:4501    (  1% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.9 GB / 8.0 GB RAM  )
  10.113.237.133:4503    (  7% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.134:4501    (  6% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.134:4503    (  2% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.135:4501    (  4% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.135:4503    (  4% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.136:4501    (  4% cpu;  7% machine; 0.120 Gbps;  5% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.136:4503    ( 20% cpu;  7% machine; 0.120 Gbps;  5% disk IO; 4.0 GB / 8.0 GB RAM  )
  10.113.237.137:4501    (  5% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.137:4503    (  5% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.138:4501    (  5% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.138:4503    (  6% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.139:4501    (  5% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.139:4503    (  6% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.140:4501    (  6% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.140:4503    (  2% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.141:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.3 GB / 8.0 GB RAM  )
  10.113.237.141:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.142:4501    (  4% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.142:4503    (  5% cpu;  3% machine; 0.001 Gbps;  0% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.143:4501    (  1% cpu;  5% machine; 0.034 Gbps;  5% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.143:4503    ( 10% cpu;  5% machine; 0.034 Gbps;  5% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.144:4501    (  6% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.144:4503    (  6% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.145:4501    (  5% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.145:4503    (  5% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.146:4501    (  6% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.146:4503    (  2% cpu;  4% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.147:4501    (  5% cpu;  7% machine; 0.188 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.147:4503    ( 10% cpu;  7% machine; 0.188 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.148:4501    (  4% cpu;  9% machine; 0.129 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.148:4503    ( 21% cpu;  9% machine; 0.129 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.149:4501    (  5% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.149:4503    (  5% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.150:4501    (  4% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.3 GB / 8.0 GB RAM  )
  10.113.237.150:4503    (  4% cpu;  3% machine; 0.001 Gbps;  1% disk IO; 3.8 GB / 8.0 GB RAM  )
  10.113.237.151:4501    (  2% cpu;  4% machine; 0.001 Gbps;  3% disk IO; 9.4 GB / 8.0 GB RAM  )
  10.113.237.151:4503    (  9% cpu;  4% machine; 0.001 Gbps;  3% disk IO; 5.6 GB / 8.0 GB RAM  )
  10.113.237.152:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 2.5 GB / 8.0 GB RAM  )
  10.113.237.152:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 2.4 GB / 8.0 GB RAM  )
  10.113.237.154:4501    (  3% cpu;  3% machine; 0.002 Gbps;  3% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.154:4503    (  4% cpu;  3% machine; 0.002 Gbps;  3% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.155:4501    (  8% cpu;  4% machine; 0.001 Gbps;  7% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.155:4503    (  9% cpu;  4% machine; 0.001 Gbps;  7% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.156:4501    ( 18% cpu;  7% machine; 0.089 Gbps; 24% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.156:4503    ( 24% cpu;  7% machine; 0.089 Gbps; 22% disk IO; 9.4 GB / 8.0 GB RAM  )
  10.113.237.157:4501    (  9% cpu;  4% machine; 0.003 Gbps;  6% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.157:4503    ( 10% cpu;  4% machine; 0.003 Gbps;  6% disk IO; 6.9 GB / 8.0 GB RAM  )
  10.113.237.158:4501    (  5% cpu;  3% machine; 0.001 Gbps;  2% disk IO; 8.4 GB / 8.0 GB RAM  )
  10.113.237.158:4503    (  2% cpu;  3% machine; 0.001 Gbps;  2% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.159:4501    (  3% cpu;  4% machine; 0.057 Gbps; 12% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.159:4503    ( 19% cpu;  4% machine; 0.057 Gbps; 13% disk IO; 9.4 GB / 8.0 GB RAM  )
  10.113.237.160:4501    (  9% cpu;  4% machine; 0.001 Gbps;  5% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.160:4503    (  7% cpu;  4% machine; 0.001 Gbps;  5% disk IO; 6.9 GB / 8.0 GB RAM  )
  10.113.237.161:4501    (  5% cpu;  2% machine; 0.001 Gbps;  3% disk IO; 8.4 GB / 8.0 GB RAM  )
  10.113.237.161:4503    (  3% cpu;  2% machine; 0.001 Gbps;  4% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.162:4501    (  2% cpu;  3% machine; 0.040 Gbps;  4% disk IO; 6.7 GB / 8.0 GB RAM  )
  10.113.237.162:4503    ( 10% cpu;  3% machine; 0.040 Gbps;  4% disk IO; 6.7 GB / 8.0 GB RAM  )
  10.113.237.163:4501    ( 23% cpu;  7% machine; 0.170 Gbps; 19% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.163:4503    ( 18% cpu;  7% machine; 0.170 Gbps; 19% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.165:4501    (  8% cpu;  3% machine; 0.001 Gbps;  3% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.165:4503    (  2% cpu;  3% machine; 0.001 Gbps;  3% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.166:4501    ( 23% cpu;  4% machine; 0.001 Gbps; 10% disk IO; 5.9 GB / 8.0 GB RAM  )
  10.113.237.166:4503    (  2% cpu;  4% machine; 0.001 Gbps;  7% disk IO; 7.4 GB / 8.0 GB RAM  )
  10.113.237.167:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 2.5 GB / 8.0 GB RAM  )
  10.113.237.167:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 2.6 GB / 8.0 GB RAM  )
  10.113.237.168:4501    (  8% cpu;  3% machine; 0.002 Gbps;  4% disk IO; 9.4 GB / 8.0 GB RAM  )
  10.113.237.168:4503    (  2% cpu;  3% machine; 0.002 Gbps;  3% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.169:4501    (  8% cpu;  4% machine; 0.001 Gbps;  5% disk IO; 7.0 GB / 8.0 GB RAM  )
  10.113.237.169:4503    (  7% cpu;  4% machine; 0.001 Gbps;  4% disk IO; 6.8 GB / 8.0 GB RAM  )
  10.113.237.170:4501    (  3% cpu;  4% machine; 0.100 Gbps;  9% disk IO; 7.1 GB / 8.0 GB RAM  )
  10.113.237.170:4503    ( 14% cpu;  4% machine; 0.100 Gbps;  9% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.171:4501    (  3% cpu;  3% machine; 0.001 Gbps;  3% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.171:4503    (  8% cpu;  3% machine; 0.001 Gbps;  3% disk IO; 4.8 GB / 8.0 GB RAM  )
  10.113.237.172:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.172:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.173:4501    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.173:4503    (  1% cpu;  2% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.174:4501    (  2% cpu; 12% machine; 0.001 Gbps;  3% disk IO; 0.3 GB / 7.2 GB RAM  )
  10.113.237.176:4501    (  4% cpu;  9% machine; 0.004 Gbps;  3% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.177:4501    (  1% cpu;  5% machine; 0.000 Gbps;  2% disk IO; 0.3 GB / 8.0 GB RAM  )
  10.113.237.178:4501    (  2% cpu;  5% machine; 0.000 Gbps;  2% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.179:4501    (  3% cpu;  7% machine; 0.000 Gbps;  2% disk IO; 0.2 GB / 8.0 GB RAM  )
  10.113.237.180:4501    ( 13% cpu; 14% machine; 0.005 Gbps;  2% disk IO; 0.3 GB / 8.0 GB RAM  )
  10.113.237.181:4501    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.1 GB / 8.0 GB RAM  )
  10.113.237.181:4503    (  1% cpu;  1% machine; 0.001 Gbps;  0% disk IO; 0.2 GB / 8.0 GB RAM  )

Coordination servers:
  10.113.237.132:4501  (reachable)
  10.113.237.134:4503  (reachable)
  10.113.237.138:4503  (reachable)
  10.113.237.152:4501  (reachable)
  10.113.237.167:4503  (reachable)
  10.113.237.171:4501  (reachable)
  10.113.237.172:4503  (reachable)
  10.113.237.173:4503  (reachable)
  10.113.237.181:4503  (reachable)

Client time: 09/22/23 01:49:50

WARNING: A single process is both a transaction log and a storage server.
  For best performance use dedicated disks for the transaction logs by setting process classes.

Could you share the operator logs for your test run? I would be interested to see what the addPods and addPVCs subreconcilers are saying.

Simulating dc1 failure means terminating all pods in the dc1 namespace (all VMs in AZ1) and bringing them back.

I assume you are running a “kubectl delete …”, correct? Are all the deleted Pods actually created again? And the PVCs in this namespace are not touched?

What is the process to restart TLog interfaces manually?

Those should be restarted automatically once the Pod is recreated and the corresponding PVC is mounted. You can manually restart a process by using the kill command from fdbcli, e.g. kill; kill <IP:Port>.
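
Concretely, with addresses taken from the "missing log interfaces" warnings, such a manual restart could look like the following (the addresses here are examples):

  # `kill` with no arguments populates fdbcli's list of kill targets; the second
  # `kill` then restarts the named fdbserver processes.
  fdbcli -C /var/dynamic-conf/fdb.cluster \
    --exec 'kill; kill 10.113.237.136:4501 10.113.237.142:4501'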

Yes, the pods are created again and running…

  • Since the pods map to nodes, I scale down the nodes in that particular AZ (I use an OCP MachineSet for this), and the pods stay in Pending until I scale back up.
  • I also have to delete the PVCs in the namespace, otherwise the pods do not start again.
  • I use the local storage operator to create a StorageClass (that the PVCs bind to) out of the VMs' NVMe disks (rough sketch of the whole drill below).
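
A rough sketch of the drill as described above (the MachineSet name, replica count, and PVC label are assumptions for this environment):

  # 1. Simulate AZ1 loss: scale the AZ1 MachineSet to zero; the dc1 pods go Pending.
  oc -n openshift-machine-api scale machineset az1-workers --replicas=0
  # 2. Bring the AZ back up.
  oc -n openshift-machine-api scale machineset az1-workers --replicas=16
  # 3. Delete the stale PVCs in dc1 so the recreated pods can bind fresh local PVs
  #    (label assumed; check with: kubectl -n dc1 get pvc --show-labels).
  kubectl -n dc1 delete pvc -l foundationdb.org/fdb-cluster-name=fdb-cluster-1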

I restarted the processes with the kubectl fdb plugin and left it over the weekend… so that worked, but I have to test it more…
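
For reference, the plugin-based restart roughly looks like this (the exact flags are assumptions; see kubectl fdb restart --help):

  # Restart fdbserver processes for specific process groups via the plugin;
  # -c selects the FoundationDBCluster, -n the namespace.
  kubectl fdb restart -n dc1 -c fdb-cluster-1 dc1-log-1 dc1-log-2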

{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Has unhealthy process group","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","processGroupID":"dc1-storage-11","state":"HasUnhealthyProcess","conditions":["MissingProcesses"]}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Not all process groups are reconciled","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","method":"CheckReconciliation","namespace":"dc1","cluster":"fdb-cluster-1","desiredProcessGroups":22,"reconciledProcessGroups":21}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","duration_seconds":0.087244812}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Cluster was not fully reconciled by reconciliation process","namespace":"dc1","cluster":"fdb-cluster-1","status":{"hasUnhealthyProcess":2},"CurrentGeneration":0,"OriginalGeneration":2,"DelayedRequeue":false}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Reconciliation run finished","namespace":"dc1","cluster":"fdb-cluster-1","duration_seconds":0.342980124,"cacheStatus":true}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Fetch machine-readable status for reconcilitation loop","namespace":"dc1","cluster":"fdb-cluster-1","cacheStatus":true}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Trying connection options","namespace":"dc1","cluster":"fdb-cluster-1","connectionString":["fdb_cluster_1:jBYTGxM4zAJ2fdbYTAWUctcWWs8bNS13@10.113.237.132:4501,10.113.237.134:4503,10.113.237.138:4503,10.113.237.152:4501,10.113.237.167:4503,10.113.237.171:4501,10.113.237.172:4503,10.113.237.173:4503,10.113.237.181:4503","fdb_cluster_1:VXpHlwDugQ4N9oTlUTQDIwDS1aZf18BP@10.113.237.144:4501,10.113.237.145:4501,10.113.237.150:4501,10.113.237.148:4501,10.113.237.136:4501"]}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to get connection string from cluster","namespace":"dc1","cluster":"fdb-cluster-1","connectionString":"fdb_cluster_1:jBYTGxM4zAJ2fdbYTAWUctcWWs8bNS13@10.113.237.132:4501,10.113.237.134:4503,10.113.237.138:4503,10.113.237.152:4501,10.113.237.167:4503,10.113.237.171:4501,10.113.237.172:4503,10.113.237.173:4503,10.113.237.181:4503"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd/coordinators"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd/coordinators"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Chose connection option","namespace":"dc1","cluster":"fdb-cluster-1","connectionString":"fdb_cluster_1:jBYTGxM4zAJ2fdbYTAWUctcWWs8bNS13@10.113.237.132:4501,10.113.237.134:4503,10.113.237.138:4503,10.113.237.152:4501,10.113.237.167:4503,10.113.237.171:4501,10.113.237.172:4503,10.113.237.173:4503,10.113.237.181:4503"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"fdbclient","msg":"Fetch values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"fdbclient","msg":"Done fetching values from FDB","namespace":"dc1","cluster":"fdb-cluster-1","key":"\ufffd\ufffd/status/json"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Disable taint feature","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","Disabled":true}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","duration_seconds":0.112349926}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateLockConfiguration"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateLockConfiguration","duration_seconds":0.000006051}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateConfigMap"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateConfigMap","duration_seconds":0.000147837}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.checkClientCompatibility"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.checkClientCompatibility","duration_seconds":0.000014101}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.deletePodsForBuggification"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.deletePodsForBuggification","duration_seconds":0.000503897}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceMisconfiguredProcessGroups"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceMisconfiguredProcessGroups","duration_seconds":0.001388085}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceFailedProcessGroups"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Check desired fault tolerance","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceFailedProcessGroups","expectedFaultTolerance":2,"maxZoneFailuresWithoutLosingData":-1,"maxZoneFailuresWithoutLosingAvailability":-1}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.replaceFailedProcessGroups","duration_seconds":0.000012672}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addProcessGroups"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addProcessGroups","duration_seconds":0.000021395}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addServices"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addServices","duration_seconds":0.00002893}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPVCs"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPVCs","duration_seconds":0.000232452}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPods"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.addPods","duration_seconds":0.000455288}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.generateInitialClusterFile"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.generateInitialClusterFile","duration_seconds":0.00000533}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeIncompatibleProcesses"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeIncompatibleProcesses","duration_seconds":0.000011198}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateSidecarVersions"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateSidecarVersions","duration_seconds":0.000002623}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePodConfig"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePodConfig","duration_seconds":0.000850175}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateMetadata"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateMetadata","duration_seconds":0.000577766}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateDatabaseConfiguration"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateDatabaseConfiguration","duration_seconds":0.000055063}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.chooseRemovals"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.chooseRemovals","duration_seconds":0.000054176}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.excludeProcesses"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.excludeProcesses","duration_seconds":0.000020555}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.changeCoordinators"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.changeCoordinators","duration_seconds":0.000960452}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.bounceProcesses","duration_seconds":0.000116449}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.maintenanceModeChecker"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.maintenanceModeChecker","duration_seconds":0.000003884}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePods"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updatePods","duration_seconds":0.001538568}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeProcessGroups"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeProcessGroups","duration_seconds":0.000048341}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeServices"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.removeServices","duration_seconds":0.000007047}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Attempting to run sub-reconciler","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus"}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Disable taint feature","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","Disabled":true}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Subreconciler finished run","namespace":"dc1","cluster":"fdb-cluster-1","reconciler":"controllers.updateStatus","duration_seconds":0.080311831}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Reconciliation complete","namespace":"dc1","cluster":"fdb-cluster-1","generation":2}
{"level":"info","ts":"2023-09-21T21:28:21Z","logger":"controller","msg":"Reconciliation run finished","namespace":"dc1","cluster":"fdb-cluster-1","duration_seconds":0.362400171,"cacheStatus":true}


  • I also have to delete the PVCs in the namespace, otherwise the pods do not start again.

That probably means all data on those PVCs is lost. The Pods are probably not being scheduled because the PVC cannot be bound to the previously existing PV (as you removed the node).
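
One way to confirm this is to check the binding state of the volumes in the namespace, e.g.:

  # Pending PVCs (or PVs with node affinity to the deleted nodes) explain why
  # the recreated pods cannot be scheduled after the AZ comes back.
  kubectl -n dc1 get pvc
  kubectl get pv -o wide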

  • I use the local storage operator to create a StorageClass (that the PVCs bind to) out of the VMs' NVMe disks.

Basically, what you are testing is what happens when a whole DC gets all of its resources deleted, including the data. Is that the intention?

Yes, the intention is to test losing the entire AZ, with the VMs and their data, and to let the FDB cluster replicate the data back to the primary.

I see some storage servers lagging, and I assume this is related to networking, which we do not control in the cloud. Am I right in my assumption? How can I remediate it?

kubectl-fdb-plugin.sh "fdb analyze fdb-cluster-1"
Checking cluster: dc1/fdb-cluster-1
✔ Cluster is available
✔ Cluster is fully replicated
✔ Cluster is reconciled
✔ ProcessGroups are all in ready condition
✔ Pods are all running and available
Checking cluster: dc1/fdb-cluster-1 with auto-fix: false
✖ Process: dc1-storage-10 with address: 10.113.237.132:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-9 with address: 10.113.237.139:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-16 with address: 10.113.237.137:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-12 with address: 10.113.237.147:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-8 with address: 10.113.237.144:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-6 with address: 10.113.237.142:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-1 with address: 10.113.237.135:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-11 with address: 10.113.237.148:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-17 with address: 10.113.237.143:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-7 with address: 10.113.237.136:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-2 with address: 10.113.237.133:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-14 with address: 10.113.237.141:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-11 with address: 10.113.237.148:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-10 with address: 10.113.237.132:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-4 with address: 10.113.237.150:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-8 with address: 10.113.237.144:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-12 with address: 10.113.237.147:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-17 with address: 10.113.237.143:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-1 with address: 10.113.237.135:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-6 with address: 10.113.237.142:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-5 with address: 10.113.237.140:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-5 with address: 10.113.237.140:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-15 with address: 10.113.237.138:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-15 with address: 10.113.237.138:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-3 with address: 10.113.237.146:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-9 with address: 10.113.237.139:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-7 with address: 10.113.237.136:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-13 with address: 10.113.237.145:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-4 with address: 10.113.237.150:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-3 with address: 10.113.237.146:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-2 with address: 10.113.237.133:4501 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-13 with address: 10.113.237.145:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-16 with address: 10.113.237.137:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
✖ Process: dc1-storage-14 with address: 10.113.237.141:4503 error: storage_server_lagging type: , time: 1970-01-01 00:00:00 +0000 UTC
Error: 
found issues in status json for cluster fdb-cluster-1. Please check them

Hi, can you please validate my multi_dc design? I am basically following https://github.com/FoundationDB/fdb-kubernetes-operator/tree/main/config/tests/multi_dc , with:

  • 1 k8s cluster
  • 3 FDB clusters running in 3 separate namespaces; each FDB cluster is controlled by its own operator pod.

Q: Would it be better to deploy all FDB instances in 1 namespace and let 1 operator pod manage all 3, given that the pods get scheduled onto nodes in different AZs anyway?

@johscheuer the fix for my problem was adding killProcesses: true, which allows the operator to bounce fdbserver processes. The question is: how safe is it to employ this feature in a production system for day-to-day operations? Thanks

That setting can be considered safe and is enabled by default. If this setting is set to false, the operator is not able to perform any upgrades on the FDB cluster and is not able to roll out any new knobs.
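
In the cluster spec this corresponds to automationOptions.killProcesses; as a minimal sketch (using the cluster and namespace names from this thread), it can be toggled with a patch:

  # killProcesses defaults to true; setting it explicitly documents the intent.
  kubectl -n dc1 patch foundationdbcluster fdb-cluster-1 --type merge \
    -p '{"spec":{"automationOptions":{"killProcesses":true}}}'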