The catch-clause here is a bit confusing to me, as it logs with SevError. My understanding was always that a Sev 40 would only be logged if the system can’t recover by itself, but imho that is not the case here. accept could throw an error if a connection fails while it is being accepted. Shouldn’t the listener simply retry if this happens? Or maybe be a bit more selective and only fail in certain cases (like when the system runs out of file descriptors)?
The reason this is problematic for us is that we grep for Sev 40s in production, and several people get woken up whenever one shows up. So before we change this (at least for our production binary), I would like to understand the rationale behind this Sev 40.
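For context, the pattern I’m referring to looks roughly like the sketch below. This is a paraphrase in Flow-actor style, not the exact source; names and details are approximate. The point is that any error escaping accept is logged at SevError and re-thrown, which takes the process down.

```cpp
// Paraphrased sketch of the listen/accept pattern in question (not the
// exact code; names are approximate).
ACTOR Future<Void> listen(NetworkAddress listenAddr) {
	state Reference<IListener> listener = INetworkConnections::net()->listen(listenAddr);
	try {
		loop {
			// Wait for the next incoming connection.
			Reference<IConnection> conn = wait(listener->accept());
			// ... hand the connection off to a reader ...
		}
	} catch (Error& e) {
		// Any error escaping accept() ends up here, is logged as Sev 40,
		// and is re-thrown, terminating the listener (and the process).
		TraceEvent(SevError, "ListenError").error(e);
		throw;
	}
}
```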
I think the guideline is a bit different than this. In fact, we expect the system to respond gracefully to many types of errors that could occur (here, I’m taking system to mean the cluster, or maybe even the fdbserver process + fdbmonitor). For example, a process running out of memory or even crashing often has little impact on the cluster but would log a severity 40 event. Another common example is IO errors, of which many can be safely tolerated by the cluster (although they may require intervention to restore all processes to the cluster). Of course in these cases there may be some root problem that needs to be investigated.
I think there are a few types of situations that we are representing in severity 40 events. Some examples include unexpected circumstances that lead to the death of the process (i.e. not the intentional death of a process, such as when issuing an fdbcli kill), an event that causes a process to stop functioning normally even if it’s still running, the cluster reaching a state where it’s not fully operable (e.g. KeyValueStoreMemory_OutOfSpace), corruption being detected, etc. In the case of accept, the SevError seems appropriate in the code as written because the error here would terminate the process.
Having said the above, I think this is the meat of your question, and I agree that some of accept’s error cases may not need such a heavy-handed approach as restarting the process. I don’t think we’ve seen an issue with that in practice, but if I were to change this I would probably enumerate the errors that can be safely retried, falling back to the throw for all other errors.
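To make that concrete, a minimal sketch of the idea is below. It’s hypothetical rather than a worked-out change: the set of errors that are actually safe to retry (here just connection_failed, as a placeholder) would need to be enumerated carefully.

```cpp
// Hypothetical sketch: retry accept() for errors tied to a single failed
// connection attempt, and keep the current SevError + throw for everything
// else (e.g. running out of file descriptors). The retryable set shown here
// is just a placeholder.
ACTOR Future<Void> listen(NetworkAddress listenAddr) {
	state Reference<IListener> listener = INetworkConnections::net()->listen(listenAddr);
	loop {
		try {
			Reference<IConnection> conn = wait(listener->accept());
			// ... hand the connection off as before ...
		} catch (Error& e) {
			if (e.code() != error_code_connection_failed) {
				// Unexpected error: keep the current behavior.
				TraceEvent(SevError, "ListenError").error(e);
				throw;
			}
			// A single failed accept: note it and go around the loop again.
			TraceEvent(SevWarnAlways, "ListenAcceptError").error(e);
		}
	}
}
```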
While it is important to keep tabs on Severity 40s in your clusters, I think it’s worth mentioning that they aren’t being used with paging in mind. As I stated above, there are classes of errors where we expect the cluster to be able to respond gracefully, and while it’s of course not wrong to respond to them immediately, it’s often fine to deal with them later (e.g. during business hours). Because we treat them with that mindset, it’s possible you may end up playing a little whack-a-mole trying to suppress certain error events.
So the way I would like to think about severity-levels in general is this:
SevError: Something happened that either should never happen (from that point on, behavior is undefined) or that the system can’t recover from on its own -> Page the on-call team.
SevWarnAlways: Something happened that the system can handle by itself but that a human will need to look at eventually -> Create a ticket.
Everything else will just be used for investigation purposes.
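In code terms, the mapping we would like to be able to rely on looks something like the illustrative snippet below. AlertAction and actionFor are made up for this example; the numeric values are the usual trace severities (40 = SevError, 30 = SevWarnAlways).

```cpp
// Purely illustrative: the severity-to-response mapping we'd like to rely on.
// AlertAction and actionFor() are invented for this example.
enum class AlertAction { PageOnCall, CreateTicket, LogOnly };

AlertAction actionFor(int severity) {
	if (severity >= 40) return AlertAction::PageOnCall;   // SevError
	if (severity >= 30) return AlertAction::CreateTicket; // SevWarnAlways
	return AlertAction::LogOnly;                          // everything else
}
```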
Now, from your response I conclude that your philosophy is different. However, this philosophy is contradicted in testing, since you fail tests if they log a Sev 40; uniformity between testing and production would be reasonable. Also, assertions are checked in production, and an assertion failure is something that should never happen there - it is a bug no matter where it occurs and could indicate something as bad as global data corruption, so it might need immediate attention. An example would be a trace line we saw in testing that would indicate metadata corruption:
Alerting on more global scenarios (Ratekeeper going below a certain rate, fdbcli being unable to connect, etc.) is obviously necessary, but imho it is not sufficient.
In general, I would like to have a way of creating an on-call page if certain events happen. Losing a process is usually not problematic, but losing a machine might be.
In our history of using fdb we have only ever seen two kinds of Sev 40s: one during disk failures, and the one I mentioned in the original question. We are fine with having a few false positives and fixing them as they come up. But, importantly, we would like to agree on a definition of the trace levels.
We have a mechanism for injecting faults into our tests that induces errors but for which we don’t fail the tests. I believe in these cases we’re trying to ensure that the error conditions don’t lead to unexpected behavior; an example would be IO errors. I think it’s also reasonable to fail a test for a condition that we don’t want, even if it’s something that doesn’t require immediate attention in a running production cluster.
Let me raise this idea with the team here. I’ve at times thought it might be useful to have a severity above Error (Critical?) that represented something that did need more urgent attention. I don’t know that it really would have changed much about any past incidents I’ve encountered (i.e. I don’t know if it would have given us much additional lead time on any issues), but at the very least I’ve thought it could be helpful in the triaging process. And as you point out, there are imaginable scenarios where it could indicate a problem that isn’t reflected through other monitoring channels.
Instead of three categories, it’s possible that everything could fit well into two, divided up in the way you describe. We’ve tended to use SevWarnAlways as a category of things that we should be aware of but that may not need attention. In other words, we treat SevWarnAlways as a monitorable but not necessarily actionable severity. At least at the moment, it tends to be a bit noisy (although I wish that weren’t the case), and there may be some events that don’t fit well into your SevError category but are also not a great fit for the noisier SevWarnAlways category.
I am not sure another warning level is needed. Or asked differently: what is the difference between SevWarn and SevWarnAlways? We treat them pretty much the same way (maybe I am missing something here).
All current SevWarnAlways events should be downgraded to SevWarn.
Some SevError events should be downgraded to SevWarnAlways. Currently I only see one, but an easy way to identify others would be for you to also grep your production logs and see what kinds of errors actually happen in production. I would imagine the set of Sev 40s in your logs is very small.
Use SevError only for cases where the system really is expected to need human help (and for assertions - these are bugs, and I am really, really afraid of undefined behavior).
But in general I don’t feel too strongly. The two main things we would like to have are an error level we can page on-call for and, maybe even more importantly, a clear definition of what the log levels actually mean and how they should be treated.
We use SevWarn as a non-monitored severity, but for things that are more significant than SevInfo. The distinction between the current SevWarn and SevInfo may not be significant enough to maintain, but the distinction between SevWarnAlways and SevError probably is, at least for now. The three possible monitoring classes I’m imagining are:
Things that require immediate attention
Things that require eventual attention
Things that are good to be aware of but may not require any additional attention
The two classes we have now roughly correspond to (#1+#2) and #3, and your proposal gives us #1 and #2 but not #3. Or, perhaps SevWarn could be #3, but I think pretty much everything currently in SevWarn would need to be downgraded.
If we cleaned things up enough that those things in #3 that didn’t actually require additional attention were moved down a severity, then we could probably combine #2 and #3. However that’s not super straightforward to do because a number of the things that show up there require extra context to determine whether any further action is warranted. As it currently stands, I think combining #2 and #3 would hide the SevError events in too much noise from the SevWarnAlways events.
I usually try to think of log levels solely in terms of alerting and limiting verbosity (that is what SevDebug is useful for), which means that I would prefer few log levels, with most messages simply at the info level. As soon as a process produces hundreds of warnings per hour, they are imho not useful anymore. This is already the case in fdb - we basically never use the log levels (except Sev 40) when investigating problems; instead we filter by event type to see, for example, whether there was a networking issue around the time of the incident.
But that said, I think this is more of a philosophical question and we could discuss it endlessly. I would be happy with what you are proposing as long as there are clear definitions for the levels and the two highest levels occur rarely enough. We can then treat all the warning levels the same, and you can treat them differently.
The intent of our current SevWarnAlways is that it doesn’t reach this level of verbosity, while SevWarn has no such limits. In practice, that isn’t always the case (which is part of what I alluded to with the noise at this level), but the SevWarnAlways events do still summarize pretty easily. I’ve actually found good value in keeping tabs on these events in FoundationDB, for what it’s worth. I’ve gotten less mileage out of SevWarn vs. SevInfo, but I have on occasion used these to limit the number of events that I’ve had to look at (sometimes looking at a subset of the events with higher severities can be helpful when you don’t have a good starting point in an investigation).
After some discussion about this proposal, I think the idea of having a more serious severity (whether that be a redefinition of SevError or a new SevCritical) could potentially make sense, but making it explicitly intended for paging an on-call person is much harder. In particular, we didn’t really envision this being something that we would want to use for paging ourselves, mainly because most such events are either already detected by other monitoring mechanisms in status or are not events worth paging about but could be hard to segregate (e.g. with many assertions). If there is some type of problem that represents an urgent issue and isn’t being detected by status, I think we’d want to address that by finding a way to get it into status.