We’ve got some behaviour here that I’m trying to understand, partly for my own learning and partly so I can explain it to the rest of our engineers and think about how we can fix it.
We’re running in AWS, single region, 3 AZs, with the DB configured in three_data_hall mode and the locality IDs set up so that data should be replicated to storage nodes in each of the AZs.
We know we have a few ‘hot keys’/‘hot ranges’ in our application, and we’re working to rearchitect the system to prevent that. Cloud compute with remote block storage has pretty high latency and low I/O throughput compared to direct-attached NVMe, so when we run our load tests the hot spots show up pretty fast.
What’s odd is how it shows up. For a given load test we’ll see disk I/O at or near 100% on quite a few nodes, but only a single storage node’s durability lag spikes. If we keep running the test, that node ends up reporting tens of minutes of lag, while all the other nodes are stable in the 5-20s range.
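For anyone who wants to compare the same per-node numbers, here is a rough sketch of how the lag could be pulled out of `status json`. It assumes each entry under `cluster.processes` carries `address`, `locality.data_hall` and a `roles` list whose storage entries have `durability_lag.seconds` and `data_lag.seconds`; those field names are from memory of the status schema and should be checked against your FDB version. The function and helper names are made up for illustration.

```python
import json
import subprocess


def storage_lags(status):
    """List (address, data_hall, durability_lag_s, data_lag_s) per storage role, worst first.

    Field names are assumptions about the status json layout; missing fields become None.
    """
    rows = []
    processes = status.get("cluster", {}).get("processes", {})
    for proc in processes.values():
        hall = proc.get("locality", {}).get("data_hall")
        for role in proc.get("roles", []):
            if role.get("role") != "storage":
                continue
            rows.append((
                proc.get("address"),
                hall,
                role.get("durability_lag", {}).get("seconds"),
                role.get("data_lag", {}).get("seconds"),
            ))
    # Sort by durability lag, largest first; entries without a value go last.
    rows.sort(key=lambda r: r[2] if r[2] is not None else -1.0, reverse=True)
    return rows


def load_status_from_fdbcli():
    """Hypothetical helper: fetch status by shelling out to fdbcli on the local machine."""
    result = subprocess.run(
        ["fdbcli", "--exec", "status json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

Calling `storage_lags(load_status_from_fdbcli())` during a load test and printing the first few rows should make it obvious whether the lagging node shares a data hall or a shard team with the other busy nodes.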
But my understanding of three_data_hall is that writes will need to go to 3 nodes for replication, and reads should be load balanced between the replicas. So I don’t understand why we’re only seeing one hot node instead of three. What am I missing?
We only have a small number of clients (using the Java lib with JNI to invoke the C bindings), which serve our API frontend and can each handle a large volume of API requests. So does each client select and use a single read node for a shard, or something like that, so that you only get load-balanced reads when you have a large number of clients?