How to prevent tlogs from overcommitting

This is something we see in 5.2.x where it’s possible for a cluster’s tlogs to overcommit and go into a state from which the cluster cannot recover:

10.42.2.48:4655 ( 4% cpu; 4% machine; 0.037 Gbps; 0% disk IO; 5.2 GB / 14.0 GB RAM )
Last logged error: KeyValueStoreMemory_OutOfSpace at Wed Sep 12 07:41:30 2018

This is a dedicated transaction process, and it seemingly wrote more than it should have and went into a bad state.

Ratekeeper is supposed to be helping with this, and probably did attempt to stop allowing normal priority transactions. If so, that would mean that either immediate priority transactions consumed the rest of the space or enough regular transactions were still allowed to leak through ratekeeper. Assuming you weren’t running immediate priority transactions yourself, this is probably worth filing a GitHub issue for.

By running with the memory storage engine, your TLogs are also using the memory storage engine to spill data out of the disk queue. The error you’ve pasted is from that memory storage engine (the one holding the spilled data) filling up.

I’ve filed “configure memory configures the TLog to spill to memory” (#768) and “Ratekeeper isn’t throttling when a TLog’s MemoryKeyValueStore is nearly full” (#769) about this.

To get out of this situation, you’ll need to temporarily give your TLogs more RAM by increasing their --storage_memory= flag value, so that your cluster can recover. Once your cluster has recovered, you can configure log_storage:=2 to have the TLogs use the ssd2 storage engine.
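
As a rough sketch of those steps (the storage_memory value here is illustrative, not a recommendation; size it for your machines):

  # foundationdb.conf on the tlog machines: temporarily raise storage_memory
  # enough for the spilled data to fit (the process-level memory limit may
  # need to be raised as well)
  [fdbserver]
  storage_memory = 24GiB

  # once the cluster has recovered, switch the TLogs to the ssd engine
  fdb> configure log_storage:=2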

Yeah, that’s SOP for us with 3.x (changing storage_memory). I just thought maybe it could have changed. In the past there wasn’t an error message; we just detected that the files were getting too big.

I did not know there was a configure log_storage option; I assume that’s a 5.x thing. We will probably need to understand the performance implications of running tlogs with ssd instead of memory (as AJ commented on the issue).

Oh, and we do not run system immediate transactions against the cluster. This is a pretty common thing in 3.x during recovery, when writes are still being attempted against the cluster (at normal priority).

This problem may not be specific to the memory storage engine, though given the relative sizes of memory and disk it’s probably more likely to be seen in the memory storage engine. If ratekeeper is leaking transactions when the log’s persistent data store is full, then in the ssd engine we’d probably just end up filling the disk all the way. That poses its own difficulties, because unlike the memory storage engine where you might be able to configure the process’s storage memory higher, when the disk is full you may need to either find something to delete from the disk or move the files to a larger disk.

I am trying to see in the code how the configure log_storage:=2 command works; it doesn’t seem to do what I would expect IO-wise against the drive after applying the change (we have dedicated volumes for our tlogs, so this is easy to observe). It’s still doing pure writes against the volume (which is MemoryKeyValueStore behavior), and the operating space for log servers still shows basically the storage_memory flag value (unlike on ssd clusters, where it shows the amount of disk space left).
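
(For reference, I’m watching this through the operating-space section of status; the output below is illustrative, not from this cluster:)

  fdb> status
  ...
  Operating space:
    Storage server       - 368.5 GB free on most full server
    Log server           - 5.9 GB free on most full server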

In fact, with configure log_storage:=2, it seems like we can get tlogs to run out of space almost immediately (on a brand-new cluster that we configured as memory and then applied the command).

Hm, maybe it’s got nothing to do with log_storage:=2. The moment we stood up a large cluster (with logs/proxies/resolvers configured, but without log_storage:=2 in this other test) and started blasting it with test traffic, the cluster locked up immediately: fdbcli said the CC was unresponsive, it cycled between reinit and unresponsive, log storage space dropped at the same time (from the 6 GB we configure down to 0 GB), and then the cluster died at “Initializing new transaction servers…”, which it will never get past because the tlogs are already overcommitted. It seems like RK should be clamping down, since it needs to redistribute the shards under a sudden onslaught of traffic, but it isn’t.

@ajbeamon is it possible that the fix I made to the memory storage engine’s operating-space reporting still isn’t doing the right thing for RK?

You’re referring to this one, correct? I don’t spot any obvious problems with it, though if it is somehow failing to report the correct number for available space, it could certainly cause ratekeeper to do the wrong thing. One way we could get a better sense of what’s happening is to look at the RkUpdate trace log events leading up to the point where the cluster locked up. They will indicate whether ratekeeper was trying to restrict transactions and why it was doing so. They also report the worst free space on both storage and log servers, so we can see whether there is a problem with the numbers coming back.
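
If it helps, something along these lines should pull those events out (assuming the default XML trace files; the log directory path is an assumption, and I believe in 5.2 ratekeeper runs inside the master process):

  # on the machine hosting the master, tail the most recent RkUpdate events
  $ grep 'Type="RkUpdate"' /var/log/foundationdb/trace.*.xml | tail -n 20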

A related interesting thing I just noticed is this line:

It seems that we’re intentionally ignoring cases where we exhaust all memory storage engine space in simulation, so it’s not too surprising that there are ways we could end up here. I’m not sure if there’s a legitimate reason for this (e.g. that some tests are using immediate priority transactions) or if perhaps it was determined too difficult to stop ratekeeper from leaking, but we may want to take a second look at whether we want simulation to be ignoring this scenario.

Alrighty, will check RK logs to see what’s going on.

@alexmiller any thoughts on using the ssd2 engine for tlogs while using memory for storage? Just wanted to give that a try.

Nothing stood out; we are testing commits from the latest 5.2 branch (including the memory rollback fix that was suggested) and tuning some knobs that we can find as well.

Setting tlogs to ssd2 has not done what we expected so far.

Is this completely blocking your transition to 5.x? I’ve marked this to take a look at after I try to close out an ongoing project, but that’s probably going to take another couple of weeks.

There are a number of things we are working to resolve:

  • TLogs overcommitting is a new thing (leading to KeyValueStoreMemory_OutOfSpace), and it requires upping storage_memory to resolve. This is not unheard of; in the past we would simply up storage_memory and it usually worked (back then it would simply get stuck at “Initializing new transaction servers…” rather than erroring).
  • Rollback issues: basically, killing a CC (kill -9) will almost always cause a long period of unavailability. We tried the idea posted on another thread of picking the highest version to roll back to, but that causes clusters to go into a “configuration_unavailable” state AFAICT. On a hunch, I am testing this: https://github.com/panghy/foundationdb/commit/20cffc03c18a578a98a0d18d7a44ddc6beb40fbe
  • Frequent moving of resolution ranges on the master. This seems to have been alleviated by changing STORAGE_METRICS_AVERAGE_INTERVAL to 10.0 from 120.0 (see the sketch after this list). My understanding is that it affects the metrics the resolvers are collecting, which in turn the master uses to decide whether to rebalance.
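
In case it’s useful, a sketch of how we’re applying that knob change (the knob_ prefix is how we understand knob overrides to be passed through; double-check before relying on it):

  # foundationdb.conf on the relevant processes
  [fdbserver]
  knob_storage_metrics_average_interval = 10.0

  # or equivalently on the fdbserver command line
  --knob_storage_metrics_average_interval=10.0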

Other than that, we are just about to roll out our custom 5.2.5 client alongside our custom 3.x client and activate multi-version clients, which readies the fleet for an upgrade once we know how to run 5.2.x at the same (or better) reliability as before.
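
(Roughly, the multi-version client piece of that is just pointing the application at a directory holding the other libfdb_c version; the path here is made up:)

  # environment of the application processes; network options can also be
  # set programmatically before the network is started
  export FDB_NETWORK_OPTION_EXTERNAL_CLIENT_DIRECTORY=/opt/fdb/client-libs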

So I believe I might have found a couple of interesting things. I followed up with another round of tests on another commit (https://github.com/panghy/foundationdb/commit/05bb4395ea1674926b261aca9b87be874e01cf7d), which essentially waits up to 30 seconds or until all tlogs are responsive (again, rollback for us is 15 minutes, so having up to 30s of unavailability while waiting for logs to come back isn’t too bad).

One interesting thing I noticed is that where the coordinators live affects how CC death is handled. Specifically, if the coordinators are excluded (exclude IP:port) from the cluster, killing the CC (sudo kill -9) doesn’t cause all memory storage processes to roll back at all, and the system recovers gracefully instead of exhibiting unavailability for N minutes (where N scales with the storage_memory of the memory nodes). Since I changed two things, I am not sure whether the improvement instead comes from the 30s wait added before figuring out knownCommittedVersion.
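
(For clarity, the coordinator part of that test was just excluding the coordinator processes so they carry no data or log roles, then killing the CC; addresses are made up:)

  fdb> exclude 10.42.2.61:4650 10.42.2.62:4650 10.42.2.63:4650
  # then, on the CC's host:
  $ sudo kill -9 <cluster_controller_pid>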

Another thing for memory processes: with 5.2.x, when memory processes are restarted rapidly, the cluster does seem to wait for them to recover from disk rather than going into healing. This just leads to tlogs filling up, as the storage servers are seemingly not dequeueing from the tlogs, and it requires a long wait until they can catch up. The fix here is seemingly to just keep the dead processes down for a bit longer so that they can be properly removed.
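
(What “keep them down for a bit longer” looks like for us, assuming the processes are managed by fdbmonitor, which watches foundationdb.conf for changes: comment out the dead process’s section, wait for the cluster to mark it failed and heal, then restore it. The section name is illustrative.)

  # foundationdb.conf
  # [fdbserver.4650]   <-- temporarily commented out
  # restore this section once the cluster has re-replicated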

I noticed that there are now dedicated classes for cluster_controller, master, and log (the last seems to mean that the process is most likely to take on tlogs only instead of also acting as a proxy/resolver). I wonder if there could be a dedicated class for coordinators.
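
(For reference, we pin those with the class parameter in foundationdb.conf; ports are illustrative:)

  [fdbserver.4500]
  class = cluster_controller

  [fdbserver.4501]
  class = master

  [fdbserver.4502]
  class = log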

The logic does seem to help, as you can tell below:

<Event Severity="10" Time="1538815785.209264" Type="TLogLocked" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" TLog="0c09b93d6932812d" end="27672122785" logGroup="default"/>
<Event Severity="10" Time="1538815785.209264" Type="DelayingMasterRecovery" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" StatusCode="2" Status="locking_old_transaction_servers" MissingIDs="b2a01e059992bff38bff75465ae563c8, 4059bc9019d0c3b086f720e232db82eb, 16b0804b7f2b76c138c6afd2c1fe8995, 7ca18acd06cb634d5866fa62ca3e36ac" logGroup="default" TrackLatestType="Original"/>
<Event Severity="10" Time="1538815786.209325" Type="DelayingMasterRecovery" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" StatusCode="2" Status="locking_old_transaction_servers" MissingIDs="b2a01e059992bff38bff75465ae563c8, 4059bc9019d0c3b086f720e232db82eb, 16b0804b7f2b76c138c6afd2c1fe8995, 7ca18acd06cb634d5866fa62ca3e36ac" logGroup="default" TrackLatestType="Original"/>
<Event Severity="10" Time="1538815786.733162" Type="ConnectionFrom" Machine="10.42.2.10:4683" ID="c4419d54e87a14ab" FromAddress="10.42.2.59:59396" SuppressedEventCount="35" logGroup="default"/>
<Event Severity="10" Time="1538815786.771376" Type="ConnectionEstablished" Machine="10.42.2.10:4683" ID="c4419d54e87a14ab" Peer="10.42.2.59:59396" ConnectionId="1" SuppressedEventCount="35" logGroup="default"/>
<Event Severity="10" Time="1538815786.771376" Type="IncomingConnection" Machine="10.42.2.10:4683" ID="c4419d54e87a14ab" FromAddr="10.42.2.59:59396" CanonicalAddr="10.42.2.59:4653" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1538815786.771376" Type="ConnectionClosed" Machine="10.42.2.10:4683" ID="0000000000000000" PeerAddr="10.42.2.59:4653" Error="operation_cancelled" ErrorDescription="Asynchronous operation cancelled" ErrorCode="1101" SuppressedEventCount="35" logGroup="default"/>
<Event Severity="10" Time="1538815786.771376" Type="TLogJoinedMeUnknown" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" TLog="ca88d97401ad13cf" Address="10.42.2.59:4653" logGroup="default"/>
<Event Severity="10" Time="1538815786.771376" Type="TLogJoinedMe" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" TLog="7ca18acd06cb634d" Address="10.42.2.59:4653" logGroup="default"/>
<Event Severity="10" Time="1538815786.772171" Type="TLogLocked" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" TLog="7ca18acd06cb634d" end="27672122785" logGroup="default"/>
<Event Severity="10" Time="1538815787.035086" Type="IncomingConnection" Machine="10.42.2.10:4683" ID="f57166692cca38e1" FromAddr="10.42.2.59:59398" CanonicalAddr="10.42.2.59:4650" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1538815787.036583" Type="TLogJoinedMe" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" TLog="b2a01e059992bff3" Address="10.42.2.59:4650" logGroup="default"/>
<Event Severity="10" Time="1538815787.036583" Type="TLogJoinedMeUnknown" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" TLog="a7b2380a1d47c806" Address="10.42.2.59:4650" logGroup="default"/>
<Event Severity="10" Time="1538815787.037139" Type="TLogLocked" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" TLog="b2a01e059992bff3" end="27672122785" logGroup="default"/>
<Event Severity="10" Time="1538815787.129738" Type="IncomingConnection" Machine="10.42.2.10:4683" ID="67e25797c8b3551f" FromAddr="10.42.2.59:59400" CanonicalAddr="10.42.2.59:4652" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1538815787.132086" Type="TLogJoinedMe" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" TLog="16b0804b7f2b76c1" Address="10.42.2.59:4652" logGroup="default"/>
<Event Severity="10" Time="1538815787.132086" Type="TLogJoinedMeUnknown" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" TLog="9e23d610a8de4849" Address="10.42.2.59:4652" logGroup="default"/>
<Event Severity="10" Time="1538815787.132872" Type="TLogLocked" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" TLog="16b0804b7f2b76c1" end="27672122785" logGroup="default"/>
<Event Severity="10" Time="1538815787.209444" Type="DelayingMasterRecovery" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" StatusCode="2" Status="locking_old_transaction_servers" MissingIDs="4059bc9019d0c3b086f720e232db82eb" logGroup="default" TrackLatestType="Original"/>
<Event Severity="10" Time="1538815787.296189" Type="IncomingConnection" Machine="10.42.2.10:4683" ID="7a609d4744a84de5" FromAddr="10.42.2.59:59402" CanonicalAddr="10.42.2.59:4651" IsPublic="1" logGroup="default"/>
<Event Severity="10" Time="1538815787.296522" Type="TLogJoinedMe" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" TLog="4059bc9019d0c3b0" Address="10.42.2.59:4651" logGroup="default"/>
<Event Severity="10" Time="1538815787.296522" Type="TLogJoinedMeUnknown" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" TLog="fe60a433f5e28635" Address="10.42.2.59:4651" logGroup="default"/>
<Event Severity="10" Time="1538815787.296953" Type="TLogLocked" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" TLog="4059bc9019d0c3b0" end="27672122785" logGroup="default"/>
<Event Severity="10" Time="1538815788.209522" Type="LogSystemRecovery" Machine="10.42.2.10:4683" ID="8bfb28dc474f8e23" Cycles="40" TotalServers="40" Present="40" Available="40" Absent="0" ServerState="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" ReplicationFactor="2" AntiQuorum="0" Policy="zoneid^2 x 1" TooManyFailures="0" LastVersion="-1" RecoveryVersion="27671204392" EndVersion="27671204392" SafeBegin="0" SafeEnd="2" NewSafeBegin="0" LogZones="a188c033405bf0c7e3e3f4ab4db7619d,5f207adfb8bce3482a137e74236271f9,08e17799c6882ca97ffd786f10738e2e,5f207adfb8bce3482a137e74236271f9,6034975fa591c43bbd18532021638a03,b98e961a7cd5aca9ccf113d9dc902674,0439c79a19ca33c549459b4190c83245,f1659782186ec6fd118af10c247de0b0,7b3a48e3dde447444009d7c0c54f88d1,b98e961a7cd5aca9ccf113d9dc902674,0439c79a19ca33c549459b4190c83245,0439c79a19ca33c549459b4190c83245,5f207adfb8bce3482a137e74236271f9,0439c79a19ca33c549459b4190c83245,08e17799c6882ca97ffd786f10738e2e,\..." LogDataHalls="[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset],[unset]" tLogs="40" oldTlogsSize="0" logSystemType="2" At="27671204392" AvailableServers="40" knownCommittedVersion="27671202579" logGroup="default"/>

Specifically, because of the delay, it waited until all logs had joined before attempting recovery, which meant it likely had a higher knownCommittedVersion than before (I should have logged that information when the code attempted the delay).

Obviously, I think for SSD storage engines this should not be the default (meaning the knob should default to zero), but for memory, since the trade-off is huge (we still had some storage servers roll back in our tests), it’s probably a good idea to wait a few seconds (up to 10, perhaps, based on the limited experiments I just did) in case the logs were just restarted or a temporary issue prevented them from responding.

Actually, I no longer believe that where the coordinators are (whether they are actual storage servers) affects whether rollback happens (enough tests show that the rollback can still happen). I am wondering (and testing) whether we can choose not to reload the memory KeyValueStore on rollback and instead just reuse the same instance. I have looked at the code a bit and I don’t see why that would break (the MVCC piece doesn’t exist in the IKeyValueStore; whatever is set(), clear()ed, or committed() is flushed to disk, and recovery just reads that back and adds a rollback).

Alright: https://github.com/apple/foundationdb/pull/821

This basically eliminated the dreaded N minutes of recovery when the memory storage servers are asked to roll back. The change is tested and the cluster comes back to life instantly.