Key-value sizes at DR source and destination have a big difference

I am doing DR tests from a static FDB cluster. After the initial sync finished, the key-value sizes at the source and destination differ significantly: about 500GB vs. 420GB.

The source has “Sum of key-value sizes - 504.576 GB” in status output.

The DR status says “The DR on tag `default' is a complete copy of the primary database.”

The destination has: Sum of key-value sizes - 420.455 GB

What is the true meaning of key-value sizes? Why such a big difference?

The difference is so big that we are concerned. How can we verify that the DR data transfer is 100% correct? Is there a verification mechanism or option we can use in DR?

We are using FDB v6.2.27 with triple replication. The source has a 3-DC/2-region config (so 6 total replicas), but the destination is only 1-DC. (DR from 3-DC to 3-DC didn’t work for us. See “Locking coordination state with DR”.)

I set up another DR from the source to another destination. When the DR completed, this second target had the same KV size as the first target:

  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 420.436 GB

I also noticed that the “Sum of key-value sizes” at the source had increased to 520GB (from about 505GB above).

I found a saved copy of the source cluster's status, generated on May 17. It showed 432GB, quite close to the targets' size. Before I saved that status on May 17, I had started a DR and it failed (see my May 17 post, mentioned above).

It seems to me that a DR job increases the KV size at the source, and that the size does not decrease even after the DR job finishes or fails.

Hi folks, have any of you encountered this phenomenon with FDB v6.2?

@osamarin @mengxu

My guess is that the difference comes from mutations stored in the system key space that were or will be applied to the destination. How much lag does the DR report when you run fdbdr status? In theory, the amount of space used shouldn’t be particularly large if the lag is small. Additionally, a full abort of the DR is supposed to clean up these mutations.

You could confirm that the extra data is in the system key space, and find where it is, by using the locality API to get the boundary keys. This gives you shard boundaries, and since shards are bounded in size you can get a sense of which ranges contain a lot of data. For example, you can do fdb.locality.get_boundary_keys(db, b"\xff", b"\xff\xff") in Python (the keys must be byte strings in Python 3).
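Here is a rough sketch of that idea in Python, in case it helps. It assumes the 6.2 Python bindings (API version 620) and a reachable cluster file. The helper names `count_shards_by_prefix` and `system_boundary_keys`, and the `prefix_len` parameter, are made up for illustration. Shard count is only a proxy for data volume, so treat the result as a hint about where to look rather than a measurement.

```python
from collections import Counter


def count_shards_by_prefix(boundary_keys, prefix_len=8):
    """Group shard boundary keys by their leading bytes.

    Returns (prefix, count) pairs, most frequent first. A prefix with
    many boundaries spans many shards and so likely holds a lot of data.
    prefix_len is an arbitrary illustrative choice, not an FDB setting.
    """
    counts = Counter(bytes(key[:prefix_len]) for key in boundary_keys)
    return counts.most_common()


def system_boundary_keys(cluster_file=None):
    """Fetch shard boundary keys for the system key space of a live cluster."""
    import fdb  # FoundationDB Python bindings; imported here so the helper above works without them

    fdb.api_version(620)  # assumption: matches the 6.2.x cluster in this thread
    db = fdb.open(cluster_file)
    return list(fdb.locality.get_boundary_keys(db, b"\xff", b"\xff\xff"))


if __name__ == "__main__":
    # Print the ten system-space prefixes that cover the most shards.
    for prefix, shard_count in count_shards_by_prefix(system_boundary_keys())[:10]:
        print(shard_count, repr(prefix))
```

Running this against both the source and the destination and comparing the top prefixes should show whether the extra ~80GB sits under a system-space prefix that exists only on the source.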

If the source cluster has run many full backups, it may also be accumulating metadata in the system key space, which would not be replicated to the destination cluster. More details in another thread.

AJ and Steve, thanks for your input. It sounds like the increase is due to backup or DR activity.
I deleted the source cluster a couple of days ago (for company compliance), so I can no longer query it. I’ll keep an eye on this when doing other backup and DR work in the future. Thanks.