Monitoring Watches

We hit one huge issue in our code that made our cluster unresponsive, and it turned out to be a problem with watches. We have had a bad experience with watches: we used to use them everywhere, but for some reason they are really hard on FDB. We switched to simple polling every 5 seconds, plus an event bus to speed up refreshes, and it has worked well for us so far. Some parts of the code still opt to use watches for transactional reasons.
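For reference, the poll-every-5-seconds pattern can be sketched roughly like this; `read_snapshot` is a hypothetical callable standing in for "read the keys of interest in one FDB transaction", and the change-detection helper is plain Python:

```python
import time

def diff_snapshots(prev, curr):
    """Return the set of keys whose values differ between two snapshots."""
    keys = set(prev) | set(curr)
    return {k for k in keys if prev.get(k) != curr.get(k)}

def poll_loop(read_snapshot, on_change, interval=5.0):
    """Poll every `interval` seconds; call on_change(changed_keys) on a diff.

    `read_snapshot` is a hypothetical callable that reads all keys of
    interest in a single FDB transaction and returns a dict key -> value.
    """
    prev = read_snapshot()
    while True:
        time.sleep(interval)
        curr = read_snapshot()
        changed = diff_snapshots(prev, curr)
        if changed:
            on_change(changed)  # e.g. publish a refresh message on the event bus
        prev = curr
```

In the setup described above, an event-bus message would presumably trigger an immediate extra poll rather than waiting for the next 5-second tick.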

I haven’t found any way to monitor how many watches are currently active via fdbcli. Is that possible?

I think the “StorageMetrics” event (traced periodically by storage servers) has “ActiveWatches” and “WatchBytes” fields, which might be interesting.
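If trace logging is enabled, one way to eyeball those fields is to grep the trace files directly. A sketch assuming the default XML trace format; the log directory is an assumption you would adjust to your cluster:

```shell
# Pull the most recent ActiveWatches / WatchBytes samples out of the
# StorageMetrics trace events. Adjust LOGDIR to your cluster's trace dir.
LOGDIR="${FDB_LOG_DIR:-/var/log/foundationdb}"
grep -h -o 'ActiveWatches="[0-9]*"' "$LOGDIR"/trace.*.xml | tail -n 5
grep -h -o 'WatchBytes="[0-9]*"' "$LOGDIR"/trace.*.xml | tail -n 5
```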

Do you have any more details about how the cluster became unresponsive?

My mental model is that starting a watch costs about the same as a get; in addition, each active watch costs a bit of storage CPU (I think it should be O(log(active watches)) per mutation the storage server processes) and storage memory.

Another consideration is that watches can fall back to polling if the server already has too many watches, which seems to amount to attempting to start a new watch every second. Also, the storage server seems to time out a watch after 900 seconds, and if the client is still waiting it will start a new watch.

(This is all based on me reading source code, not necessarily operational experience)


We also happen to use watches a lot in large-scale production settings, but instead of watching a ton of keys, we watch just a single key per class of changes, so that processes can go poll for the actual changes when that key fires. We’ve found this significantly reduces the number of keys a process needs to watch (and eliminates the concern that we could run out of watches). We also “space out” the polling after we detect a change, so that not all processes hit the system at the same moment (if that is something you can tolerate).


It just feels like we put a huge load on FDB when using watches for some reason; we are not sure what the problem was, the server just seemed much slower. Maybe it is the application code, maybe the client, maybe the server.

We keep getting this feeling lately, only to find out we are hitting physical limits and FDB is performing just fine. Maybe something weird happens when you have a lot of watches.

Something I just thought of: it appears that every time a watched key is written to, the storage server performs a get on that key to see whether it has changed. So watching a key and then repeatedly writing the same value to it could use a lot of CPU.

Edit: I think it performs this get once per watch.
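One way to sidestep repeated same-value writes to a watched key is to bump it with an atomic add, so its stored value genuinely changes on every write. A sketch assuming the `fdb` Python bindings; the key name is made up, and only the encoding helpers run without a cluster (FDB's ADD atomic op treats values as little-endian integers):

```python
import struct

def encode_counter(n):
    """Encode an integer the way FDB's ADD atomic op expects (little-endian)."""
    return struct.pack('<q', n)

def decode_counter(raw):
    """Decode a little-endian 8-byte counter value."""
    return struct.unpack('<q', raw)[0]

# Hypothetical signalling write (needs a live cluster):
#
#   import fdb
#   fdb.api_version(710)
#   db = fdb.open()
#
#   @fdb.transactional
#   def signal_change(tr):
#       # Atomic add: the stored value changes on every call, so a pending
#       # watch fires instead of seeing a no-op same-value write.
#       tr.add(b'signals/config', encode_counter(1))
```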
