FDB cluster freeze

We have experienced this issue twice in production: when a single node out of our 15 storage servers goes down, FDB starts rebalancing the data that was on that node, I believe to maintain the replication factor of 3.

As the status output below shows, FDB reports replication health as HEALING, and during that time all reads/writes on the cluster drop to zero. The application gets errors reading/writing data and retries, so more transactions get started on the DB, but none get processed.

Does anyone know what might be the reason for the entire cluster having trouble processing reads/writes, or what can be done about it? We have 21 log server processes, so they should be able to handle the transaction workload.

Based on the error/message in the workload section about log server space approaching 5%, I was thinking of setting knob_log_server_mutation_buffer_size=4GiB; my understanding is that the default is 2 GB, and we have 8 GB of physical memory on these log server boxes. Do you think that might help the cluster process the workload during rebalancing?

Status:

Unable to start batch priority transaction after 5 seconds.

Unable to retrieve all status information.

Configuration:
Redundancy mode - three_datacenter
Storage engine - ssd-2
Coordinators - 7
Exclusions - 4 (type `exclude' for details)
Desired Proxies - 5
Desired Resolvers - 7
Desired Logs - 12
Usable Regions - 1

Cluster:
FoundationDB processes - 239
Zones - 51
Machines - 51
Memory availability - 6.3 GB per process on machine with least available
Retransmissions rate - 1007 Hz
Fault Tolerance - 3 machines
Server time - 03/17/23 22:32:33

Data:
Replication health - HEALING: Restoring replication factor
Moving data - 402.283 GB
Sum of key-value sizes - 1.513 TB
Disk space used - 11.387 TB

Operating space:
Storage server - 5841.4 GB free on most full server
Log server - 1.1 GB free on most full server

Workload:
Read rate - 19868 Hz
Write rate - 7121 Hz
Transactions started - 367 Hz
Transactions committed - 7 Hz
Conflict rate - 0 Hz
Performance limited by process: Log server running out of space (approaching 5% limit).
Most limiting process: 10.49.73.106:4500

Backup and DR:
Running backups - 0
Running DRs - 0

Client time: 03/17/23 22:32:20

Is there any other parameter we can set to control the rebalancing after a node failure, so it doesn't bring the entire cluster to a standstill until rebalancing is finished?

Which FDB version is your cluster running? I remember there were fixes for some problems with batch transaction throttling. You might want to look at the Ratekeeper RkUpdate* events to see the limiting reasons.

I am using FDB version 6.3.15.

When I run status in fdbcli I don't see any information on the Ratekeeper. I also ran `fdbcli --exec "status json" | jq '.cluster.ratekeeper'` and it returns null. There is no mention of configuring Ratekeeper anywhere, so I assume it is enabled by default and taking care of slowing down the transaction rate.

Do you know of any setting that can help avoid the entire cluster coming to a standstill like this? Would giving the transaction servers more memory help, given the error I saw while the issue was going on?

Oh, you said "log server space approaching 5%". Is this disk space? If so, that's probably the reason Ratekeeper is throttling so much. If that is the reason, you need more free disk space.

Ratekeeper is not in that section. You can search through the status json doc to find something like:

            "4b1ae94df8f0eac2dbc320c3c3701ddc" : {
                "address" : "100.82.97.119:4501",
...
               "roles" : [
                    {
                        "id" : "0303036b22f13d8b",
                        "role" : "ratekeeper"
                    }
                ],

Then search that process’s log for RkUpdate* events.
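
If you don't want to eyeball the whole document, a jq query along these lines should find it. This is just a sketch, assuming the status json layout shown in the excerpt above, where cluster.processes is a map keyed by process id:

    fdbcli --exec "status json" \
      | jq '.cluster.processes | to_entries[]
            | select(any(.value.roles[]?; .role == "ratekeeper"))
            | {process_id: .key, address: .value.address}'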

Ratekeeper is automatically recruited and should always be there. Setting more memory for transaction servers may not help, because there could be many other reasons for throttling.

At the time of the issue, I did check the space on the server mentioned in the message below, and there was no disk space issue.

Performance limited by process: Log server running out of space (approaching 5% limit).
Most limiting process: 10.49.73.106:4500

I have just now replaced the failed storage node with a new one, and even when I added the new node, disk utilization shot up on all processes and reads/writes dropped. So whenever a node goes bad and FDB has to rebalance the data, or we add a new node to replace the failed one, we see this issue.

I don't see any disk space issue on any of the nodes during these incidents.


I think what happened is that when you exclude a problematic storage server (SS), the cluster tries to rebalance data off that SS, causing high IOPS on the other SSes. A side effect is that mutation data for the problematic SS is still kept on the tlogs. Until the SS is removed, tlog disk usage will keep increasing. So if the tlog disks are small, it's possible that free space drops to 5%. If that happens, Ratekeeper starts throttling traffic aggressively.

We usually choose the tlog disk size to be able to hold one day’s data. So this gives us plenty of time to react. Not sure how much disk space you’ve given to tlogs.

Since FDB 7.0, there is an `exclude failed` command, which tells the cluster to forget about the problematic SS and thus not keep mutation data for it on the tlogs. As a result, the tlogs won't run out of space and Ratekeeper won't throttle traffic.
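
For reference, the syntax looks roughly like this (the address is only a placeholder; depending on the safety checks you may also need the FORCE option):

    fdb> exclude failed 10.0.0.1:4500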


I checked the space on the server mentioned in the status output at the time of the issue, and there was enough space available.

We have 65 GB on each log server and we have 21 log servers. How can I figure out how much tlog space is needed for a day or two?

Also, when I added the new node today and rebalancing started again, the cluster froze (no read/write processing), yet status was showing healthy, so no space issue was reported.

This happens whenever cluster.data.moving_data.in_queue_bytes increases, i.e. a node fails or a new node is added to the cluster.
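
One way I am watching this while a rebalance is in flight (just the status json path mentioned above, polled with jq):

    watch -n 10 'fdbcli --exec "status json" | jq ".cluster.data.moving_data"'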

So far I have noticed that whenever a node goes down, or we replace a failed node with a new one, the resulting data rebalancing/data movement across nodes pushes disk busy per process to 100% and CPU utilization to 60-70%. The main issue is that read/write operations go to ZERO and only recover once rebalancing is completed (a very long time).

We have 15 storage servers, each running 11 storage processes on ports 4500-4511. We are using i3en.3xlarge EC2 instances. During the issue, IOPS was around 66k/s and throughput was around 500 MB/s, which is below what this EC2 instance type can handle (New – The Next Generation (I3en) of I/O-Optimized EC2 Instances | AWS News Blog), and CPU utilization was around 60-70%.

We have 21 log servers, with only one log process running per node, on m5ad.large instances. During the issue, 400 IOPS was the maximum, again well below the instance type's capacity, and CPU was low.

I think if one storage node goes down, FDB has to redistribute around 685 GB (that's the disk space used by the data mount). Is there any way to throttle the rebalancing so disk/CPU usage on the storage servers doesn't jump so much? If that's not possible, I am thinking of adding more storage servers to reduce the GB per node, which in turn reduces the bytes that have to move between nodes when a node goes down (say, from roughly 680 GB down to 300 GB by adding 15 more storage servers).

I hope the rebalancing can be throttled. I did find a few articles, but I am still not 100% sure what the exact command or process is to throttle rebalancing so reads/writes don't drop to zero.

As shown by the cluster status you pasted in the first post, the issue you are facing is that one or more of your log processes have run out of space. Your graphs may show that not to be the case (though you only pasted operations and busyness graphs), but FDB still believes that a transaction log has run out of space, and has halted operations in the cluster accordingly. So what you need to debug here is why you are running out of space when your graphs or manual checking suggest that you aren't, and how you can resolve the space issue (remove extraneous large files, fix a storage space limitation, expand the size of the disk, etc.).
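
It might also help to pull the disk numbers FDB itself sees, rather than checking the mounts by hand. A rough sketch (the field names are what I recall from the status json layout and may differ slightly between versions):

    fdbcli --exec "status json" \
      | jq '.cluster.processes | to_entries[]
            | {address:  .value.address,
               free_gb:  (.value.disk.free_bytes  / 1e9),
               total_gb: (.value.disk.total_bytes / 1e9)}'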

three_datacenter stores four copies of data on tlogs, so if you have $L$ logs, $S$ storage servers, $M$ MB of mutations per second and $D$ MB per tlog disk, then:

  • one storage server failing means you’ll have something around $D / (M / S)$ seconds until the tlog fills up
  • if an AZ is offline, then $L \cdot D / (4M)$ seconds until all logs fill up
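
As a rough worked example (the numbers are purely illustrative): with $D = 65{,}000$ MB per tlog disk and a per-SS mutation share of $M / S = 2$ MB/s, a single failed storage server gives about $65000 / 2 = 32{,}500$ seconds, i.e. roughly 9 hours, before a tlog fills up.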

Thanks Alex. How can I get the current or average MB of mutations per second? Also, is $D$ per tlog disk the space allocated on the transaction server? If yes, in our case it is 65 GB on each transaction server, and we have 21 of them.

I will need to rebuild the transaction servers with more space, as we are using ephemeral storage and the limit for the m5ad.large instance type is 75 GB, so I either have to change the instance type or go with another EC2 type and attach an EBS volume. Do you have any recommendation on which EC2 instance type would work for log servers, and can I attach an EBS volume instead of ephemeral storage if it gives better IOPS?

I think you can take a look at cluster → workload → bytes → written → hz in the fdbcli status json output.
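
As a back-of-the-envelope sketch, something like the following would turn that rate into a per-tlog size for one day of mutations. The ×4 factor assumes three_datacenter keeps roughly four copies of each mutation on the tlogs, as mentioned above; adjust for your configuration:

    # mutation write rate in bytes/sec, from the status json path above
    WRITE_BPS=$(fdbcli --exec "status json" | jq '.cluster.workload.bytes.written.hz')
    NUM_LOGS=21   # current tlog count in this cluster
    echo "$WRITE_BPS" | awk -v logs="$NUM_LOGS" \
      '{printf "per-tlog GB needed for ~1 day ≈ %.1f\n", $1 * 4 * 86400 / logs / 1e9}'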

Thanks Alex. I was capturing the stats but did not have a graph; I have now created one, and it clearly shows the log server space getting low during rebalancing (see the attached graph). I will work on replacing the transaction nodes with a larger EC2 type to increase the log space, as m5ad.large only gives a 75 GB ephemeral SSD disk.