We had an issue on our development cluster the other day where it got overloaded and several of the processes we had given class ‘storage’ fell over, causing FDB to recruit ‘log’ class processes to the SS role. Our log processes are on instances with much less disk, so they filled their disks overnight while no-one was watching (and, it being a non-prod cluster, we didn’t have out-of-hours alerting set up) and we came in in the morning to a non-responsive cluster.
We think we understand why they fell over, and we got everything back working, but I’m having trouble reconstructing the timeline of events. I know that all of the log class processes (both those with the TL role and those with no role, waiting as ‘hot standby’) got recruited to the SS role, because we’ve got a load of warning logs from basically all of them simultaneously about the backup URL being invalid (not sure why, since the config to launch the backup process via fdbmonitor only exists on the storage nodes, but that’s a separate issue).
But I’m not sure if that log is actually when they got recruited, or if they were all doing a periodic task that just happened to warn at that point, or what. At the moment our DataDog config drops "Severity": "10" logs from FDB to try and keep our log ingest costs reasonable. I’ve gone to the instances themselves and looked through our JSON logfiles, but there is so much output there I’m having trouble sorting through it.
Is there a specific field value I can look for that indicates “This process has just been recruited to role X” and if so, what is it? “This process has just been recruited to a new role, and here’s the list of all current roles” would also be fine. Bonus points for “This process has just lost role X” as well.
Once I’ve got that info, I can start working out the exclusion regex to log it to DataDog without all the other INFO logs, and then how to turn it into useful metrics and alerts that’ll tell me when processes get recruited to ‘unexpected’ roles.
fdbcli --exec "status json" should give you the information about which processes have which roles… it’s just a lot of data. My team has a custom metrics plugin that parses that JSON and publishes it to our Prometheus endpoint (though we don’t scrape per-process roles).
You could take a similar approach! I do wonder if there’s a log showing when role transitions occur, that would be much more direct.
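As a rough illustration, here’s a minimal sketch of pulling per-process roles out of a parsed status document. It assumes the `cluster.processes[...].roles` layout from the status json output, where each entry in `roles` is an object with a `role` key — adapt to whatever your FDB version actually emits:

```python
def roles_by_process(status):
    """Map each process address to (class, sorted role names), given a
    parsed 'status json' document from fdbcli."""
    out = {}
    for pid, proc in status["cluster"]["processes"].items():
        roles = sorted(r["role"] for r in proc.get("roles", []))
        out[proc.get("address", pid)] = (proc.get("class_type", "unknown"), roles)
    return out

# e.g. feed it:
#   json.loads(subprocess.check_output(["fdbcli", "--exec", "status json"]))
```

Diffing successive snapshots of that mapping would let you spot a ‘log’ class process suddenly holding the storage role.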
@danm All role changes on a process are logged by the process in a trace log event with a Type field of Role. Filtering the trace files on that event type (e.g. with a jq expression) will extract the most interesting fields.
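For example, a minimal Python sketch of that filter over the JSON trace files — the As (role name) and Transition field names are what I’d expect on Role events, but check a sample event from your own logs before building anything on them:

```python
import json

def role_transitions(lines):
    """Yield (time, machine, role_id, role, transition) tuples for each
    Type=Role event in an iterable of JSON trace-log lines.

    Assumes Role events carry As (role name) and Transition (e.g.
    Begin/End) fields -- verify against a sample event from your logs.
    """
    for line in lines:
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or garbled lines
        if ev.get("Type") == "Role":
            yield (ev.get("Time"), ev.get("Machine"), ev.get("ID"),
                   ev.get("As"), ev.get("Transition"))

# e.g.: for t in role_transitions(open("trace.10.0.0.1.4500.json")): print(t)
```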
The ID field is unique for a role instance, so you can use that to track the specific lifetime of a single execution of a role. Each new execution of a stateless role will have a unique ID which is retired when the execution ends, while stateful roles (such as StorageServer) will have an ID assigned at creation which is stored on disk and reused each time the disk files are used to start an execution of the role.
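To reconstruct a role’s lifetime from those events, a hypothetical sketch of pairing transitions by ID (again assuming Role events carry ID, As, Time, and Transition fields — verify against your own logs):

```python
def role_lifetimes(events):
    """Pair Begin/End transitions by role ID.

    events: iterable of dicts parsed from Type=Role trace events
    (assumed to carry ID, As, Time, and Transition fields).
    Returns {role_id: {"role": ..., "begin": ..., "end": ...}}.
    """
    lifetimes = {}
    for ev in events:
        entry = lifetimes.setdefault(
            ev["ID"], {"role": ev.get("As"), "begin": None, "end": None})
        if ev.get("Transition") == "Begin":
            entry["begin"] = ev.get("Time")
        elif ev.get("Transition") == "End":
            entry["end"] = ev.get("Time")
    return lifetimes
```

Note that because stateful roles reuse the same ID across executions, a real version would probably want to key on (ID, begin time) rather than ID alone.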
Yeah, we’re using the FDB plugin for DataDog and have a patch to their python script to collect metrics about backup and DR from the status JSON that they don’t collect by default. But I was hoping to track the lifetime of a single process and say “At timestamp X the process started, at X+n1 it gained the log role, at X+n2 it gained the storage role, at X+n3 it lost the storage role again”, etc., which seemed more of a log output thing than a metrics thing. Especially since the metrics from the status JSON are only collected by DataDog once per minute, IIRC.
Ace, thank you very much. I will hunt for those logs in our raw data, and once I’ve found them I can write a regex to include them in what ends up in DataDog.