Setup:
- multi_dc setup with 1 k8s cluster and 3 namespaces that map to 3 DCs
- dc1 serves as Primary; dc3 as Secondary; dc2 as Satellite
Configuration:
Redundancy mode - triple
Storage engine - ssd-2
Coordinators - 9
Desired Commit Proxies - 2
Desired GRV Proxies - 1
Desired Resolvers - 1
Desired Logs - 3
Desired Remote Logs - 3
Desired Log Routers - 3
Usable Regions - 2
Regions:
Primary -
Datacenter - dc1
Satellite datacenters - dc2, dc3
Satellite Logs - 3
Remote -
Datacenter - dc3
Satellite datacenters - dc2, dc1
Satellite Logs - 3
Cluster:
FoundationDB processes - 81
Zones - 47
Machines - 47
Memory availability - 8.0 GB per process on machine with least available
Retransmissions rate - 0 Hz
Fault Tolerance - 2 machines
Server time - 09/17/23 01:59:16
Data:
Replication health - Healthy
Moving data - 0.000 GB
Sum of key-value sizes - 1.158 TB
Disk space used - 8.639 TB
The problem:
During a DR test, I simulated the Primary (dc1) going down, then brought it back up and let the data replicate back, but replication is now stuck (see the errors in the status output below).
- I would like to understand why this happened, how to recover, and how to prevent this type of situation.
- I've noticed that the FDB operator added a storage pod in the dc1 namespace. From a k8s perspective, how is the recovery process handled?
Configuration:
Redundancy mode - triple
Storage engine - ssd-2
Coordinators - 9
Desired Commit Proxies - 2
Desired GRV Proxies - 1
Desired Resolvers - 1
Desired Logs - 3
Desired Remote Logs - 3
Desired Log Routers - 3
Usable Regions - 2
Regions:
Remote -
Datacenter - dc1
Satellite datacenters - dc2, dc3
Satellite Logs - 3
Primary -
Datacenter - dc3
Satellite datacenters - dc2, dc1
Satellite Logs - 3
Cluster:
FoundationDB processes - 81 (less 0 excluded; 12 with errors)
Zones - 47
Machines - 47
Memory availability - 8.0 GB per process on machine with least available
Retransmissions rate - 1 Hz
Fault Tolerance - -1 machines
Warning: the database may have data loss and availability loss. Please restart following tlog interfaces, otherwise storage servers may never be able to catch up.
Old log epoch: 253 begin: 752008734932 end: 752125253003, missing log interfaces(id,address): 55cfd4bc6e3b0a8f, 9ab3ab3fdefabd20, 91fa9bbe4783efbb,
Old log epoch: 250 begin: 751872585983 end: 752008734932, missing log interfaces(id,address): 055aa740d22950ef, 2bfa78e37a663789, bc810043cb1a881b,
Old log epoch: 248 begin: 751768423814 end: 751872585983, missing log interfaces(id,address): 60b3598fd80246f7, e841f74ebd3955bb, 0e8b32bcfc4a4ae2,
Old log epoch: 246 begin: 751657077872 end: 751768423814, missing log interfaces(id,address): 730ccd96ca1256b0, a77e054a4e8f1279, 0c79570a7b4707d0,
Old log epoch: 244 begin: 751552887228 end: 751657077872, missing log interfaces(id,address): 0ed852c5e99c862c, d28c7dae960cea90, 991deec06ebdd74d,
Old log epoch: 242 begin: 751440887046 end: 751552887228, missing log interfaces(id,address): 15e823d14fde4dfb, 3d716439331bb3be, c82bcf34b648311e,
Old log epoch: 240 begin: 751332177172 end: 751440887046, missing log interfaces(id,address): 68c1e97818bf5ef4, 0abc75e178f47ee7, 81991d47565c137a,
Old log epoch: 237 begin: 751187559126 end: 751332177172, missing log interfaces(id,address): 1493a5a1f23cf477, 0960fcaf02bd12e3, f76dfe8ef10992f5,
Old log epoch: 235 begin: 751050558192 end: 751187559126, missing log interfaces(id,address): 0b38863b2b4e2ceb, 725bcab59e5c9ea0, 1de0783cc2013219,
Old log epoch: 233 begin: 747960335254 end: 751050558192, missing log interfaces(id,address): 3c4efb7aa938ab14, f286a82bf78eeae2, c0ce84e3fedb16c5,
Server time - 09/17/23 14:30:07
Data:
Replication health - (Re)initializing automatic data distribution
Moving data - unknown (initializing)
Sum of key-value sizes - unknown
Disk space used - 4.287 TB
Operating space:
Storage server - 1541.7 GB free on most full server
Log server - 1542.0 GB free on most full server
Workload:
Read rate - 56 Hz
Write rate - 0 Hz
Transactions started - 19 Hz
Transactions committed - 1 Hz
Conflict rate - 0 Hz
Backup and DR:
Running backups - 0
Running DRs - 0
Process performance details:
10.113.237.132:4501 ( 1% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.133:4501 ( 1% cpu; 2% machine; 0.002 Gbps; 0% disk IO; 5.7 GB / 8.0 GB RAM )
10.113.237.133:4503 ( 2% cpu; 2% machine; 0.002 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.134:4501 ( 1% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
Storage server lagging by 350 seconds.
10.113.237.134:4503 ( 0% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.135:4501 ( 1% cpu; 1% machine; 0.001 Gbps; 4% disk IO; 0.2 GB / 8.0 GB RAM )
Storage server lagging by 342 seconds.
10.113.237.135:4503 ( 0% cpu; 1% machine; 0.001 Gbps; 4% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.136:4501 ( 1% cpu; 1% machine; 0.001 Gbps; 9% disk IO; 0.2 GB / 8.0 GB RAM )
Storage server lagging by 340 seconds.
10.113.237.136:4503 ( 1% cpu; 1% machine; 0.001 Gbps; 9% disk IO; 0.2 GB / 8.0 GB RAM )
Storage server lagging by 280 seconds.
10.113.237.137:4501 ( 0% cpu; 1% machine; 0.001 Gbps; 2% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.137:4503 ( 1% cpu; 1% machine; 0.001 Gbps; 2% disk IO; 0.2 GB / 8.0 GB RAM )
Storage server lagging by 280 seconds.
10.113.237.138:4501 ( 1% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
Storage server lagging by 280 seconds.
10.113.237.138:4503 ( 0% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.139:4501 ( 0% cpu; 1% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.140:4501 ( 1% cpu; 1% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM )
Storage server lagging by 249 seconds.
10.113.237.140:4503 ( 1% cpu; 1% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM )
Storage server lagging by 249 seconds.
10.113.237.141:4501 ( 1% cpu; 1% machine; 0.001 Gbps; 1% disk IO; 0.2 GB / 8.0 GB RAM )
Storage server lagging by 250 seconds.
10.113.237.141:4503 ( 0% cpu; 1% machine; 0.001 Gbps; 1% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.142:4501 ( 0% cpu; 1% machine; 0.001 Gbps; 3% disk IO; 0.2 GB / 8.0 GB RAM )
Storage server lagging by 315 seconds.
10.113.237.142:4503 ( 1% cpu; 1% machine; 0.001 Gbps; 3% disk IO; 0.2 GB / 8.0 GB RAM )
Storage server lagging by 315 seconds.
10.113.237.143:4501 ( 0% cpu; 2% machine; 0.001 Gbps; 9% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.143:4503 ( 0% cpu; 2% machine; 0.001 Gbps; 9% disk IO; 0.2 GB / 8.0 GB RAM )
Storage server lagging by 250 seconds.
10.113.237.144:4501 ( 0% cpu; 2% machine; 0.001 Gbps; 9% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.144:4503 ( 0% cpu; 2% machine; 0.001 Gbps; 9% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.145:4501 ( 0% cpu; 2% machine; 0.001 Gbps; 9% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.145:4503 ( 0% cpu; 2% machine; 0.001 Gbps; 9% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.146:4501 ( 1% cpu; 2% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.147:4501 ( 1% cpu; 2% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.147:4503 ( 0% cpu; 2% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.148:4501 ( 0% cpu; 2% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.148:4503 ( 0% cpu; 2% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.149:4501 ( 0% cpu; 2% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.149:4503 ( 0% cpu; 2% machine; 0.001 Gbps; 10% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.150:4501 ( 0% cpu; 1% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.150:4503 ( 0% cpu; 1% machine; 0.001 Gbps; 10% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.151:4501 ( 0% cpu; 1% machine; 0.001 Gbps; 7% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.151:4503 ( 0% cpu; 1% machine; 0.001 Gbps; 7% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.152:4501 ( 0% cpu; 1% machine; 0.001 Gbps; 9% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.152:4503 ( 0% cpu; 1% machine; 0.001 Gbps; 9% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.153:4501 ( 1% cpu; 5% machine; 0.001 Gbps; 2% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.154:4501 ( 1% cpu; 6% machine; 0.017 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.155:4501 ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.2 GB / 8.0 GB RAM )
10.113.237.156:4501 ( 2% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 5.5 GB / 8.0 GB RAM )
10.113.237.156:4503 ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.157:4501 ( 1% cpu; 4% machine; 0.001 Gbps; 0% disk IO; 5.7 GB / 8.0 GB RAM )
10.113.237.157:4503 ( 1% cpu; 4% machine; 0.001 Gbps; 0% disk IO; 5.5 GB / 8.0 GB RAM )
10.113.237.158:4501 ( 1% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 4.1 GB / 8.0 GB RAM )
10.113.237.158:4503 ( 1% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 4.9 GB / 8.0 GB RAM )
10.113.237.159:4501 ( 1% cpu; 3% machine; 0.001 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.159:4503 ( 1% cpu; 3% machine; 0.001 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.160:4501 ( 1% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.160:4503 ( 1% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.161:4501 ( 1% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 5.5 GB / 8.0 GB RAM )
10.113.237.161:4503 ( 1% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.162:4501 ( 1% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 5.5 GB / 8.0 GB RAM )
10.113.237.162:4503 ( 1% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 5.5 GB / 8.0 GB RAM )
10.113.237.163:4501 ( 1% cpu; 2% machine; 0.017 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.163:4503 ( 1% cpu; 2% machine; 0.017 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.164:4501 ( 1% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 0.6 GB / 8.0 GB RAM )
10.113.237.165:4501 ( 1% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 5.5 GB / 8.0 GB RAM )
10.113.237.165:4503 ( 1% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 5.5 GB / 8.0 GB RAM )
10.113.237.166:4501 ( 1% cpu; 5% machine; 0.002 Gbps; 0% disk IO; 5.5 GB / 8.0 GB RAM )
10.113.237.166:4503 ( 1% cpu; 5% machine; 0.002 Gbps; 0% disk IO; 4.8 GB / 8.0 GB RAM )
10.113.237.167:4501 ( 1% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.167:4503 ( 1% cpu; 1% machine; 0.001 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.168:4501 ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 4.1 GB / 8.0 GB RAM )
10.113.237.168:4503 ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.169:4501 ( 1% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.169:4503 ( 1% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.170:4501 ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 5.5 GB / 8.0 GB RAM )
10.113.237.170:4503 ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 5.5 GB / 8.0 GB RAM )
10.113.237.171:4501 ( 1% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.171:4503 ( 1% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 5.5 GB / 8.0 GB RAM )
10.113.237.172:4501 ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 4.7 GB / 8.0 GB RAM )
10.113.237.172:4503 ( 1% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 4.9 GB / 8.0 GB RAM )
10.113.237.173:4501 ( 2% cpu; 2% machine; 0.001 Gbps; 0% disk IO; 0.6 GB / 8.0 GB RAM )
10.113.237.174:4501 ( 7% cpu; 5% machine; 0.004 Gbps; 2% disk IO; 0.3 GB / 8.0 GB RAM )
10.113.237.175:4501 ( 1% cpu; 6% machine; 0.001 Gbps; 2% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.176:4501 ( 0% cpu; 5% machine; 0.001 Gbps; 18% disk IO; 0.1 GB / 8.0 GB RAM )
10.113.237.177:4501 ( 6% cpu; 5% machine; 0.006 Gbps; 1% disk IO; 0.3 GB / 8.0 GB RAM )
10.113.237.178:4501 ( 3% cpu; 4% machine; 0.004 Gbps; 1% disk IO; 0.3 GB / 8.0 GB RAM )
Coordination servers:
10.113.237.132:4501 (reachable)
10.113.237.133:4503 (reachable)
10.113.237.139:4501 (reachable)
10.113.237.146:4501 (reachable)
10.113.237.147:4501 (reachable)
10.113.237.154:4501 (reachable)
10.113.237.155:4501 (reachable)
10.113.237.164:4501 (reachable)
10.113.237.173:4501 (reachable)
Client time: 09/17/23 14:30:04
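In case it helps anyone looking into this, here is a minimal Python sketch for pulling the missing tlog interface IDs out of the "Old log epoch" lines above, so they can be matched against processes or pods. It assumes the lines look exactly like the ones in my status output. The function name `parse_missing_tlogs` and the `sample` text are just illustrations of mine, not part of any FDB tooling.

```python
import re

# Matches one "Old log epoch" line as it appears in the status output above.
EPOCH_RE = re.compile(
    r"Old log epoch: (\d+) begin: (\d+) end: (\d+), "
    r"missing log interfaces\(id,address\): (.*)"
)


def parse_missing_tlogs(status_text):
    """Return a list of (epoch, begin, end, [interface ids]) tuples."""
    result = []
    for line in status_text.splitlines():
        m = EPOCH_RE.search(line)
        if not m:
            continue
        epoch, begin, end, ids = m.groups()
        # Each line ends with a trailing comma, so drop empty entries.
        id_list = [i.strip() for i in ids.split(",") if i.strip()]
        result.append((int(epoch), int(begin), int(end), id_list))
    return result


# Two lines copied from the status output, used as sample input.
sample = """\
Old log epoch: 253 begin: 752008734932 end: 752125253003, missing log interfaces(id,address): 55cfd4bc6e3b0a8f, 9ab3ab3fdefabd20, 91fa9bbe4783efbb,
Old log epoch: 250 begin: 751872585983 end: 752008734932, missing log interfaces(id,address): 055aa740d22950ef, 2bfa78e37a663789, bc810043cb1a881b,
"""

if __name__ == "__main__":
    for epoch, begin, end, ids in parse_missing_tlogs(sample):
        # Print the epoch, the width of its version range, and the missing IDs.
        print(epoch, end - begin, ids)
```

Running this over the full output gives 10 old epochs with 3 missing interfaces each. The version ranges are contiguous: each epoch's `begin` equals the `end` of the older epoch listed below it.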