Three_data_hall vs multi_dc


While running some performance tests on three_data_hall and multi_dc setups, I see the multi_dc scenario being about 10 times slower for both loading and reading. The storage and log counts are approximately the same in both setups. Increasing the number of logs, proxies, or log routers has no impact on performance; only reducing the number of storage servers seems to improve the numbers a little (nothing significant). Both scenarios use triple replication and reside in one k8s cluster across 3 namespaces mapped to 3 AZs (from the cloud provider). Is there a logical explanation for this type of behavior?

Could you share the multi_dc setup? I assume that one dc is mapped to one AZ? I’m not an expert on the three_data_hall setup, but there are some subtle differences between the three_data_hall and the multi_dc setups:

  • In three_data_hall there will be 3 replicas of a storage team, one per data hall.
  • In multi_dc there will be 6 replicas of a storage team, 3 per dc (primary and remote), spread across the fault domains.
  • In three_data_hall a commit is replicated 4 times, having 2 replicas per data hall. A commit has to wait until the data is persisted on all 4 log servers.
  • In multi_dc there will be 3 replicas in the main dc and, with one_satellite_double, 2 replicas in the satellite. The mutations for the remote side are then fetched from the satellite (adding some additional load). A commit has to wait until all 5 log servers have persisted the data.
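
To make the commit path above concrete, here is a toy model in Python. It only captures the "wait for all log servers" rule from the list; the latency numbers are made up for illustration and are not measurements, and the function name is hypothetical.

```python
def commit_latency_ms(log_persist_latencies_ms):
    """Toy model: a commit is acknowledged once every required log server
    has persisted it, so the slowest log server sets the latency."""
    if not log_persist_latencies_ms:
        raise ValueError("need at least one log server latency")
    return max(log_persist_latencies_ms)


# Hypothetical per-log-server persist latencies in milliseconds.
# three_data_hall: 4 log replicas, 2 per data hall.
three_data_hall_logs = [1.0, 1.2, 2.5, 2.7]
# multi_dc with one_satellite_double: 3 logs in the main dc + 2 in the satellite.
multi_dc_logs = [1.0, 1.1, 1.2, 2.6, 2.8]

print("three_data_hall:", commit_latency_ms(three_data_hall_logs), "ms")
print("multi_dc:", commit_latency_ms(multi_dc_logs), "ms")
```

In this simplified model the commit latency is driven by the slowest log server rather than by the number of log servers, so measuring the latency between the main dc and the satellite is a useful data point.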

When a client reads data from the remote side, the storage server there may have to wait until the corresponding version is available locally. That means the log routers on the remote side must first have fetched the new version/mutations from the satellite.

I simplified a few steps but the ideas should be correct. Have you checked what the limiting factor is? For example, is the RateKeeper throttling, and if so, what is the reason for the throttling? It would also be great to know the read/write ratio and whether reads are local or not.
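
One way to check for throttling is to look at the `qos` section of the `status json` output from `fdbcli`. Below is a minimal Python sketch that pulls out the relevant fields. The field names (`performance_limited_by`, `transactions_per_second_limit`, `released_transactions_per_second`) are as I remember them from the status schema, so please verify them against your FDB version; the helper name `throttling_summary` is hypothetical.

```python
import json


def throttling_summary(status):
    """Extract the RateKeeper-related fields from a parsed `status json` dict.

    Field names are assumed from the status schema; missing fields come back as None.
    """
    qos = status.get("cluster", {}).get("qos", {})
    limited = qos.get("performance_limited_by", {})
    return {
        "limited_by": limited.get("name"),
        "description": limited.get("description"),
        "tps_limit": qos.get("transactions_per_second_limit"),
        "released_tps": qos.get("released_transactions_per_second"),
    }


# Made-up sample shaped like the assumed schema, not real cluster output.
sample = json.loads("""
{"cluster": {"qos": {
    "performance_limited_by": {"name": "workload",
                               "description": "The database is not being saturated by the workload."},
    "transactions_per_second_limit": 100000.0,
    "released_transactions_per_second": 1200.0}}}
""")

print(throttling_summary(sample))
```

To run this against a real cluster, save the output of `fdbcli --exec 'status json'` to a file and pass the parsed JSON to the function. If `limited_by` is anything other than `workload`, the cluster itself is the bottleneck and the description should say which process is the limiting one.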