Reduce number of conflicts for detecting when counter reaches zero

I have the next case: Let’s say I have ~5k users that want to join a voice chat and with first join, we must spin up the media stack and with last one left close everything down.

Users came and leave in huge bursts, let’s say they come all within one second and leave the same way.

To detect the start/end of a call I have two keys: one for the total number (counter) of users and one is a flag if the voice has been started.

Starting a call is very easy - just check if started is false and then schedule work to the media server. If 100 users jump on call it would ‘only’ cost 199 transaction iterations. But when leaving we have to have read conflict with total counter meaning that we need to wait for each other to leave chat and execute every transaction sequently until the counter reaches zero. In my simple tests, it takes ~578 retries for 100 users to leave the chat.

I am wrapping my head for a long time with this issue. Is there any simple solution to avoid this conflicts?

It sounds like you’re already using atomic add for the counter, which makes sense. One option is to have a “background” worker watching/polling the counter. Once it reaches zero, then that worker can update started to false (or whatever else needs to happen).

To make sure that the transactions that change counter are idempotent I would recommend additionally writing some kind of user_id to a keyspace with the invariant that counter == number of user_id’s. Otherwise retrying on e.g. commit_unknown_result could cause the count to go negative or other weirdness.

Something like this in python-ish pseudocode:

Add user to voice chat:

if subspace["started"] is true or not present and subspace["users"][user_id] is not present:
    subspace["users"][user_id] = ''
    subspace["started"] = true

Remove user from voice chat:

if subspace["users"][user_id] is present:
    delete subspace["users"][user_id]

Background worker

    if subspace["started"] == true and subspace["count"] == 0:
        subspace["started"] = false # voice chat is over. Tear down and don't let users join anymore

It is an interesting problem. One way could be to shard the counter k ways. Each user would increment and decrement a pre-determined shard. Once all shards reach 0, cleanup can be done. This suggestion is based on assumption that given n users, #conflicts are O(n^2); splitting n should reduce the conflicts.

Another way could be that users can just decrement counter without a read conflict. And then issue a fresh read transaction to see if value is 0; if the value is 0, then they can set another boolean as false (with proper read write conflict), similar to ‘started’, and whichever client successfully does that can clean things up.

This is very rough skeletal idea, with need of careful idempotent operations and handling of failures in between two transactions.