Hello @mengxu!
Yes, allow me to give an update and to elaborate on lowering the redundancy level.
Redundancy level question
During operational troubleshooting of this cluster, at one point the cluster was in triple redundancy mode but reporting “Only two replicas remain of some data.”
At that point, one of our operators lowered the cluster’s redundancy level to double, expecting FoundationDB to keep the two live replicas and essentially end up in a “normal” double redundancy situation.
Instead, we found FoundationDB reporting “Only one replica remains of some data.”! So my question is: how intelligent is FoundationDB in choosing which replicas “survive” a lowering of the redundancy mode, e.g. triple → double?
Current Update
At this point, a few days further into troubleshooting, I can tell there is some serious issue preventing this data set from healing. That may also have been part of the problem that led to the situation I described above.
We expanded the cluster with three new machines/zones as I mentioned, and that did bring us back to an n=2 replicas state. However, some of the storage processes are very clearly unhealthy. Here is an example of the data lag, extracted from status json (a rough sketch of the extraction follows the table).
role actor_id process_address process_id durable_version data_lag
storage 2ac9f1823a4d5f67 10.95.111.214:4504:tls d0b26295ff5c6fc497e0a7d514794cf3 None None
storage 6f04ac5764ebccd4 10.95.111.214:4503:tls 0f44b61c2c7ab209eab5ffa04d7906ea None None
storage 721e454b96e75d0c 10.95.111.214:4505:tls b7615bd9bc0e1c4bc796636626e1a539 None None
storage 8b52db5932066c2d 10.185.175.43:4504:tls e2632a0f8c0a76bfad6a660fc9284693 8889482016494 2164815125363
storage b873c12c848af3f1 10.185.175.43:4505:tls 2587c383a9044b046790f372fc3e83c3 8889482016494 2164815125363
storage d487eab27f823aa3 10.185.175.43:4503:tls 89317bb27bf21eb2be6e6c3de371d5ff 8889482016494 2164815125363
storage a78d2fc8068c6bc8 10.220.237.150:4504:tls 75262da7e16ad452fa18f654e9deca0c 8889041958540 2165150183317
storage db7c8c1449e77cb4 10.185.115.215:4503:tls 92907d8b67f512ef0f289da88e3761ec 8889041958540 2165150183317
storage c47540138fb4f3e4 10.95.111.214:4505:tls b7615bd9bc0e1c4bc796636626e1a539 11054299296554 0
storage d14f692f7e8cb4bb 10.95.111.219:4503:tls c68295ef3a1f825b5c6dd6f5c488b1aa 11054297141857 0
storage 159a3452afb39ba4 10.95.111.219:4504:tls 13fd0ef40e1ea16876f9d6034d0416e1 11054295420150 1656540
storage 1c84cbe629e664da 10.221.166.13:4505:tls d50db70f76c64da3b46d67031b4b7b3f 11054295420150 1656540
storage 1cf91d84d0405637 10.221.166.13:4503:tls 8c039c3fa145652c939b9e76a29d9105 11054295420150 1656540
storage 37cbbce93187ccf1 10.208.190.181:4505:tls 66b70754318016036b391bcc8daeefca 11054295420150 1656540
storage 6e2d6a04fe7fcdcf 10.185.175.43:4505:tls 2587c383a9044b046790f372fc3e83c3 11054295420150 1656540
storage 71be057fa064dfa0 10.185.115.215:4505:tls e62d4245342bbfe886acb1d61b0d7a42 11054295420150 1656540
storage 7c3f5de634d89332 10.208.190.181:4504:tls 355fc729e9a1104be046cf9c11c71510 11054295420150 1656540
storage 86f067a7b14892a6 10.220.237.150:4505:tls d4b5f11f0832b88def06c3dffabcf5b7 11054295420150 1656540
storage a614df1c88908d1e 10.220.237.150:4505:tls d4b5f11f0832b88def06c3dffabcf5b7 11054295420150 1656540
storage d090dee9dc188f6c 10.95.111.219:4505:tls 8608598d11e6c794191731ab4308e550 11054295420150 1656540
storage d207c53565bad221 10.208.190.181:4503:tls 99c95c04859c90dbc692228ae1ce2be7 11054295420150 1656540
storage f9f479eddee32ff3 10.221.166.13:4504:tls 9bed6336457eb511f5d3d090a53440c3 11054295420150 1656540
storage 09a972450d63c006 10.185.175.85:4503:tls 52238f1eba411da9833f75ecd5868bd2 11054297141857 1721707
storage 0a7c5e51d14f6223 10.220.237.150:4503:tls fd75186a0c9989f34ec503bce4615a0e 11054297141857 1721707
storage 1053cd090ac8a252 10.220.237.150:4504:tls 75262da7e16ad452fa18f654e9deca0c 11054297141857 1721707
storage 10acef9c859fc163 10.185.175.85:4504:tls 25fabd28ce65f76a15cfbd20badc9592 11054297141857 1721707
storage 5d9ae63025eb37b0 10.95.111.214:4504:tls d0b26295ff5c6fc497e0a7d514794cf3 11054297141857 1721707
storage 73a6404856fb4c13 10.95.111.214:4503:tls 0f44b61c2c7ab209eab5ffa04d7906ea 11054297141857 1721707
storage e6a418c607217be0 10.185.115.215:4504:tls 7fdb2d6ba9e1a72457ccfdbc5d81e65c 11054297141857 1721707
storage f4d63771128ea9af 10.185.175.85:4505:tls b022af6271b7ab4fbe0e0fea69cd6286 11054297141857 1721707
storage 2217d81e47c74b1a 10.185.175.43:4503:tls 89317bb27bf21eb2be6e6c3de371d5ff 11054293763610 2036983
storage 2f0e9b7a6922fc2f 10.185.175.43:4504:tls e2632a0f8c0a76bfad6a660fc9284693 11054293763610 2036983
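For reference, the table above came out of a quick script along the lines of the sketch below. The field names are from my memory of the status json layout, and it assumes fdbcli prints only the JSON document on stdout, so treat it as approximate rather than authoritative.

import json, subprocess

# Grab the machine-readable status from fdbcli (assumes the JSON lands on stdout).
status = json.loads(
    subprocess.run(
        ["fdbcli", "--exec", "status json"],
        capture_output=True, text=True, check=True,
    ).stdout
)

# Walk every process and print one row per storage role.
for process_id, proc in status["cluster"]["processes"].items():
    for role in proc.get("roles", []):
        if role.get("role") != "storage":
            continue
        print(
            role["role"],
            role.get("id"),                                # actor_id column above
            proc.get("address"),                           # process_address
            process_id,                                    # process_id
            role.get("durable_version"),                   # None when the role isn't reporting
            (role.get("data_lag") or {}).get("versions"),  # data_lag in versions
        )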
Now, we’ve tried to exclude these high data lag processes in the past; however, data never moves off of them. So I tried something more brutal: I brought down FoundationDB on the 10.185.175.43 machine/zone entirely, in the hope of seeing data redistribution. I saw the moving data value float up and down by a few megabytes and then just flatline again.
I was able to line up the logs with this action and noticed a few standout warnings. I’ve reproduced them below with some minor sanitization.
We saw a steady rate of these, which lines up with the lack of actual data movement:
Jul 20 11:20:20 fdb8 trace.10.208.190.181.4505.1594654180.VmrXDJ.1.82.json { "Severity": "30", "Time": "1595265619.425258", "Type": "FetchKeysTooLong", "ID": "0000000000000000", "Duration": "611400", "Phase": "0", "Begin": "\x15\x1c\x15\x0f\x01\x16\x12\xb8\x00\x15\x14…", "End": "\x15\x1c\x15\x0f\x01\x16\x12\xc9\x00\x15\x15\…", "Machine": "10.208.190.181:4505", "LogGroup": "default", "Roles": "SS" }
Then I brought down FoundationDB on the previously mentioned machine and saw these, which is expected. We also dropped from n=2 replicas to n=1 replicas, which we knew was a possibility.
Jul 20 11:21:28 fdb4 trace.10.95.111.214.4503.1593164519.NZ7sl7.2.281.json { "Severity": "30", "Time": "1595265688.623143", "Type": "TooManyConnectionsClosed", "ID": "0000000000000000", "SuppressedEventCount": "19", "PeerAddr": "10.185.175.43:4502:tls", "Machine": "10.95.111.214:4503", "LogGroup": "default", "Roles": "SS" }
Finally, we see the cluster trying to handle data movement? Very confusingly, I can’t find any of these source/destination (storage process?) IDs in the status JSON.
Jul 20 11:21:35 fdb7 trace.10.185.175.85.4501.1594654180.i2uZdD.2.339.json { "Severity": "30", "Time": "1595265694.457242", "Type": "RelocateShardTooLong", "ID": "0000000000000000", "Error": "operation_cancelled", "ErrorDescription": "Asynchronous operation cancelled", "ErrorCode": "1101", "Duration": "611298", "Dest": "0a7c5e51d14f6223d201b73c5ec0b4a2,1c84cbe629e664da7e217a4513f35b14,7c3f5de634d893329e38da967a22e63e", "Src": "0a7c5e51d14f6223d201b73c5ec0b4a2,e6a418c607217be0522f0b5b96d02396", "Machine": "10.185.175.85:4501", "LogGroup": "default", "Roles": "DD,MP" }
More data movement woes:
Jul 20 13:50:14 fdb5 trace.10.185.115.215.4500.1593170776.4NB1xR.2.467.json { "Severity": "30", "Time": "1595274613.655347", "Type": "FinishMoveKeysTooLong", "ID": "0000000000000000", "Duration": "8400", "Servers": "09a972450d63c0069b109876b29ab61d,1cf91d84d04056372cf765ee452ed15e,d207c53565bad2213792a5d0f6255f16", "Machine": "10.185.115.215:4500", "LogGroup": "default", "Roles": "CD,DD,MS,RK" }
I can’t even find these IDs in the status json – related?
kyle@fdb1:~$ fdbcli --exec 'status json' | grep -e 09a972450d63c0069b109876b29ab61d -e 1cf91d84d04056372cf765ee452ed15e -e d207c53565bad2213792a5d0f6255f16
kyle@fdb1:~$
Those and FetchKeysTooLong are now the primary severity >20 logs I see at this point.
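For what it’s worth, that impression comes from tallying event types across the trace files with a quick script along these lines (the log directory is assumed, and each line of a json-format trace file is treated as one event):

import collections, glob, json

counts = collections.Counter()
# Default log directory assumed; adjust the glob to wherever your trace files live.
for path in glob.glob("/var/log/foundationdb/trace.*.json"):
    with open(path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip partial or garbled lines
            if int(event.get("Severity", "0")) > 20:
                counts[event.get("Type", "unknown")] += 1

for event_type, count in counts.most_common(10):
    print(count, event_type)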
I do see the occasional
Jul 20 14:00:38 fdb5 trace.10.185.115.215.4500.1593170776.4NB1xR.2.468.json { "Severity": "30", "Time": "1595275238.168315", "Type": "TraceEventThrottle_BgDDMountainChopper", "ID": "0000000000000000", "SuppressedEventCount": "47", "Machine": "10.185.115.215:4500", "LogGroup": "default", "Roles": "CD,DD,MS,RK" }
But moving data has been fairly static for hours (sometimes even increasing). You can see that the moving data is now larger than the entire logical data set:
Data:
Replication health - HEALING: Only one replica remains of some data
Moving data - 21.887 GB
Sum of key-value sizes - 20.820 GB
Disk space used - 61.606 GB
For context, the incident started on 6/23.
Finally, the cluster is under almost no load.
ip port cpu% mem% iops net class roles
---------------- ------ ---- ---- ---- --- ------------- ------------------------------------------------
10.185.115.215 4500 6 5 - 6 stateless coordinator,data_distributor,master,ratekeeper
4501 3 4 - 4 stateless cluster_controller
4502 1 7 4 0 transaction log
4503 2 52 7 1 storage storage
4504 1 19 9 0 storage storage
4505 1 18 7 0 storage storage
---------------- ------ ---- ---- ---- --- ------------- ------------------------------------------------
10.185.175.85 4500 1 3 - 0 stateless resolver
4501 4 4 - 8 stateless proxy
4502 1 2 7 0 transaction log
4503 1 12 11 0 storage storage
4504 1 15 11 0 storage storage
4505 1 10 11 0 storage storage
---------------- ------ ---- ---- ---- --- ------------- ------------------------------------------------
10.208.190.181 4500 0 2 - 0 stateless
4501 0 4 - 0 stateless
4502 1 3 7 0 transaction log
4503 1 5 11 1 storage storage
4504 1 14 11 0 storage storage
4505 1 5 11 0 storage storage
---------------- ------ ---- ---- ---- --- ------------- ------------------------------------------------
10.220.237.150 4500 2 4 - 1 stateless coordinator
4501 0 4 - 0 stateless
4502 2 5 6 0 transaction log
4503 1 15 18 0 storage storage
4504 2 35 13 1 storage storage,storage
4505 1 14 11 0 storage storage,storage
---------------- ------ ---- ---- ---- --- ------------- ------------------------------------------------
10.221.166.13 4500 0 3 - 0 stateless
4501 0 4 - 0 stateless
4502 1 2 - 0 transaction
4503 1 9 10 0 storage storage
4504 1 13 10 0 storage storage
4505 1 7 10 0 storage storage
---------------- ------ ---- ---- ---- --- ------------- ------------------------------------------------
10.95.111.214 4500 1 4 - 0 stateless coordinator
4501 0 4 - 0 stateless
4502 1 5 4 0 transaction log
4503 1 8 13 0 storage storage,storage
4504 1 9 13 0 storage storage,storage
4505 1 7 10 0 storage storage,storage
---------------- ------ ---- ---- ---- --- ------------- ------------------------------------------------
10.95.111.219 4500 0 4 - 0 stateless
4501 0 4 - 0 stateless
4502 1 5 - 0 transaction
4503 1 19 12 1 storage storage
4504 1 21 12 0 storage storage
4505 1 15 12 0 storage storage
Frankly, we’re all a bit frustrated, as something is clearly preventing FoundationDB from doing one of the things it does best: managing the data!
At this point I’m wondering whether the issue is actually with reading from these problematic shards, given that we have stood up many new storage processes in new zones that would be suitable healing targets.
I’ll end by re-asking a question from earlier: is it possible to tell which actors/processes are holding the n=2, n=1, etc. shards? This might help drive operational decision making.
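For context, the closest I’ve gotten on my own is walking shard boundaries with the Python locality API; a rough sketch is below. It tells me which storage servers currently hold each shard, but not which shards are down to one or two replicas, which is the part I’m really after.

import fdb

fdb.api_version(620)  # assumed; pick the API version matching the installed client
db = fdb.open()       # uses the default cluster file


@fdb.transactional
def addresses_for_key(tr, key):
    # Storage server addresses currently responsible for this key.
    return fdb.locality.get_addresses_for_key(tr, key).wait()


# Shard boundaries across the normal keyspace, then the servers holding each shard.
boundaries = list(fdb.locality.get_boundary_keys(db, b"", b"\xff"))
for begin in boundaries:
    print(begin[:16].hex(), addresses_for_key(db, begin))

If there is a better way to get at per-shard replica counts, rather than just current shard locations, I’d love to hear it.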