Quick question on tlog disk space for large clusters

Hello all! This is Kyle over at IBM Cloudant.

We’re scaling out storage on an upcoming three_data_hall cluster with ~30.4TB of 3x replicated storage (so ~91.2TB of raw disk).

Now, we’ve been using a single SSD model after shaking down the performance of a few; it happens to be the smallest/cheapest/fastest we can purchase, at 960GB. For the scale-up we’ve chosen denser disks that are more expensive but just as fast.

Naively, I scaled up the tlog disks too, but Adam Kocoloski got me thinking: do we actually need much tlog disk space, generally speaking?

We have two good options to move forward:

  • scale up the tlogs: ~6.7TB of 4x replicated tlog space
  • keep the 960GB base SSDs: ~1.7TB of 4x replicated tlog space

We run 12 tlog processes: 7 active and 5 for redundancy. The sizes above count only the active processes, and each process gets its own disk.
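
As a quick sanity check on those aggregates (this is just me back-deriving per-disk sizes; the ~3.8TB figure for the denser disks is implied by the math, not a spec sheet number):

```python
# Back out the per-disk size from the aggregate tlog space quoted above,
# assuming "4x replicated tlog space" means logical space (raw / 4)
# spread across the 7 active tlog processes.
def per_disk_tb(logical_tb, replication=4, active_tlogs=7):
    return logical_tb * replication / active_tlogs

print(per_disk_tb(1.7))  # ~0.97 TB per disk -> the current 960GB SSDs
print(per_disk_tb(6.7))  # ~3.8 TB per disk  -> the denser scale-up option
```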

I looked through our historical metrics on test servers and haven’t seen the logs using much disk space, though that could easily be an artifact of how we test. I’m not seeing evidence pointing me to the larger option, so any guidance on this topic is welcome.

Cheers,
Kyle

If it’s not obvious, we’ve been using bare-metal machines to host FoundationDB. Out of curiosity I spec’d out a VM-based tier for our tlog and stateless processes. I can build out a nice three_data_hall setup with 525GB of 4x replicated tlog space, or even 263GB.

We’re going to transition to VMs for these non-storage processes in the long term, so I’m glad I checked it out. I thought I’d offer up these sizes as well, to contrast with the 30.4TB of 3x replicated storage.

The steady-state disk usage of a TLog will be ~8GB, IIRC. A smaller file is actually better here, because it means your SSD will wear-level better. You’re mainly purchasing IOPS for a TLog disk, not space.

… Except when things fail. Giving TLogs large disks is useful in the case of storage server failures (or region failures, but you’re not there yet). With the current (6.1+) TLog implementation, a TLog can’t discard data that a failed storage server hasn’t made durable yet, so it stops being able to recycle earlier pieces of its file and starts growing the file instead. Once data distribution removes the failed storage server from all storage teams, the TLogs throw all of the old data away and return to their ~8GB steady state.
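
If you want to watch this happen during a failure, something like the sketch below will print per-TLog disk usage. It shells out to fdbcli for `status json`; the field names are from memory, so double-check them against your FDB version.

```python
# Print disk utilization for every process currently running a "log" role,
# using the machine-readable status from fdbcli. Field names assumed from
# the status json schema; verify on your cluster before relying on this.
import json
import subprocess

status = json.loads(
    subprocess.check_output(["fdbcli", "--exec", "status json"])
)

for proc_id, proc in status["cluster"]["processes"].items():
    if any(r.get("role") == "log" for r in proc.get("roles", [])):
        disk = proc["disk"]
        used_pct = 100 * (1 - disk["free_bytes"] / disk["total_bytes"])
        print(f"{proc['address']}: tlog disk {used_pct:.1f}% used")
```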

Roughly, you’ll want to calculate what your per-TLog write rate will be (Client Write Rate in MB/s * Replication Factor / Number of TLogs), and then make sure that your TLog disks are large enough to survive O(hours) of that. You gave me everything but the client write rate, so assuming a moderately high 50 MB/s: (50 MB/s * 4 / 7) * 8 hours ≈ 822 GB. It looks like your current disks give you a bit over 8 hours for data distribution to finish, which should be sufficiently safe.
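
Spelled out as a small script, in case it’s easier to plug your real numbers in (the 50 MB/s client write rate is just my assumption):

```python
# Rough TLog disk sizing: how much each TLog needs to buffer if it has to
# spool writes for `hours_to_survive` while data distribution heals a failure.
client_write_rate_mb_s = 50   # assumed aggregate client write rate (MB/s)
replication_factor = 4        # log replication factor
num_tlogs = 7                 # active TLog processes
hours_to_survive = 8          # how long you want to ride out a failure

per_tlog_rate = client_write_rate_mb_s * replication_factor / num_tlogs
disk_needed_gb = per_tlog_rate * hours_to_survive * 3600 / 1000

print(f"per-TLog write rate: {per_tlog_rate:.1f} MB/s")                # ~28.6 MB/s
print(f"disk needed for {hours_to_survive}h: {disk_needed_gb:.0f} GB")  # ~823 GB
```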

Or, instead of doing the math, you could look at your TLogMetrics trace events and take a rough average of the first of the three elements of BytesInput. (The three elements are “difference variance total”.)
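
For example, something like this (assuming the default XML trace files in your fdbserver log directory) gives you that rough average:

```python
# Average the first ("difference") element of BytesInput across all
# TLogMetrics events in the trace logs. Adjust the glob for your log dir.
import glob
import re

event_re = re.compile(r'Type="TLogMetrics".*?BytesInput="([^"]+)"')
diffs = []
for path in glob.glob("/var/log/foundationdb/trace.*.xml"):
    with open(path) as f:
        for line in f:
            m = event_re.search(line)
            if m:
                diffs.append(float(m.group(1).split()[0]))

if diffs:
    avg = sum(diffs) / len(diffs)
    print(f"average BytesInput delta per metrics interval: {avg:,.0f} bytes")
```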

If you’re choosing between SSDs with and without supercapacitors, though, your TLog will love you far more if you pick the one with them. TLogs basically just call fsync() in a loop, so making those syncs essentially free is pretty nice.
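
If you want a quick way to compare candidate disks on exactly that pattern, a throwaway script along these lines (the path is a placeholder) times a write+fsync loop:

```python
# Time a loop of small appends, each followed by fsync(), which is roughly
# the TLog's write pattern. A crude sketch, not a replacement for fio.
import os
import time

PATH = "/mnt/candidate-ssd/fsync_test.bin"  # point at the disk under test
N = 1000
payload = b"x" * 4096

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
latencies = []
for _ in range(N):
    os.write(fd, payload)
    t0 = time.perf_counter()
    os.fsync(fd)
    latencies.append(time.perf_counter() - t0)
os.close(fd)
os.unlink(PATH)

latencies.sort()
print(f"median fsync: {latencies[N // 2] * 1e6:.0f} us, "
      f"p99: {latencies[int(N * 0.99)] * 1e6:.0f} us")
```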

Alex, what a wonderful breakdown. I’ll try to reason out a safe size as you did. I was curious about the failing-storage scenario, so I’m glad you covered it here.