Monitoring Watches

We hit one huge issue in our code that made our cluster unresponsive, and it turned out to be a problem with watches. We have had a bad experience with watches: we used to use them everywhere, but for some reason they are really hard on FDB. We switched to simple polling every 5 seconds, plus an event bus to speed up refreshes, and it has worked well for us so far. Some parts of the code still opt to use watches for transactional reasons.
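For reference, the poll-every-5-seconds pattern can be sketched roughly like this; `read_snapshot` is a hypothetical callable standing in for "read the keys of interest in one FDB transaction", and the change-detection helper is plain Python:

```python
import time

def diff_snapshots(prev, curr):
    """Return the set of keys whose values differ between two snapshots."""
    keys = set(prev) | set(curr)
    return {k for k in keys if prev.get(k) != curr.get(k)}

def poll_loop(read_snapshot, on_change, interval=5.0):
    """Poll every `interval` seconds; call on_change(changed_keys) on a diff.

    `read_snapshot` is a hypothetical callable that reads all keys of
    interest in a single FDB transaction and returns a dict key -> value.
    """
    prev = read_snapshot()
    while True:
        time.sleep(interval)
        curr = read_snapshot()
        changed = diff_snapshots(prev, curr)
        if changed:
            on_change(changed)  # e.g. publish a refresh message on the event bus
        prev = curr
```

In the setup described above, an event-bus message would presumably trigger an immediate extra poll rather than waiting for the next 5-second tick.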

I haven’t found any way to monitor how many watches are currently active via fdbcli. Is that possible?

I think the “StorageMetrics” event (traced periodically by storage servers) has “ActiveWatches” and “WatchBytes” fields, which might be interesting.
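If trace logging is enabled, one way to eyeball those fields is to grep the trace files directly. A sketch assuming the default XML trace format; the log directory is an assumption you would adjust to your cluster:

```shell
# Pull the most recent ActiveWatches / WatchBytes samples out of the
# StorageMetrics trace events. Adjust LOGDIR to your cluster's trace dir.
LOGDIR="${FDB_LOG_DIR:-/var/log/foundationdb}"
grep -h -o 'ActiveWatches="[0-9]*"' "$LOGDIR"/trace.*.xml | tail -n 5
grep -h -o 'WatchBytes="[0-9]*"' "$LOGDIR"/trace.*.xml | tail -n 5
```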

Do you have any more details about how the cluster became unresponsive?

My mental model is that starting a watch costs about the same as a get; in addition, each active watch costs a bit of storage CPU (I think it should be O(log(active watches)) per mutation the storage server processes) and storage memory.

Another consideration is that watches can fall back to polling if the server already has too many watches, which seems to amount to attempting to start a new watch every second. Also, the storage server seems to time out a watch after 900 seconds, and if the client is still waiting it will start a new watch.

(This is all based on me reading source code, not necessarily operational experience)


We also happen to use watches a lot in large-scale production settings, but instead of watching a ton of keys, we watch just a single key per class of changes, so that processes can go poll for the actual changes when that key fires. We’ve found this significantly reduces the number of keys a process needs to watch (and eliminates the concern that we could run out of watches). We also “space out” the polling after we detect a change, so that not all processes hit the system at the same moment (if that is something you can tolerate).


It just feels like we put a huge load on FDB when using watches for some reason; we are not sure what the problem was, the server just seemed much slower. Maybe it is the application code, maybe the client, maybe the server.

We keep getting this feeling lately, only to find out we are hitting physical limits and FDB is performing just fine. Maybe something weird happens when you have a lot of watches.

Something I just thought of: it appears that every time a watched key is written to, the storage server performs a get on that key to see whether it has changed. So watching a key and then repeatedly writing the same value to it could use a lot of CPU.

Edit: I think it performs this get once per watch.
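One way to sidestep repeated same-value writes to a watched key is to bump it with an atomic add, so its stored value genuinely changes on every write. A sketch assuming the `fdb` Python bindings; the key name is made up, and only the encoding helpers run without a cluster (FDB's ADD atomic op treats values as little-endian integers):

```python
import struct

def encode_counter(n):
    """Encode an integer the way FDB's ADD atomic op expects (little-endian)."""
    return struct.pack('<q', n)

def decode_counter(raw):
    """Decode a little-endian 8-byte counter value."""
    return struct.unpack('<q', raw)[0]

# Hypothetical signalling write (needs a live cluster):
#
#   import fdb
#   fdb.api_version(710)
#   db = fdb.open()
#
#   @fdb.transactional
#   def signal_change(tr):
#       # Atomic add: the stored value changes on every call, so a pending
#       # watch fires instead of seeing a no-op same-value write.
#       tr.add(b'signals/config', encode_counter(1))
```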
