FoundationDB

Help me understand this status output


(Caleb Spare) #1

I’m having trouble understanding a few aspects of my cluster status. I’m running 5.2.5. I have read https://apple.github.io/foundationdb/administration.html#monitoring-cluster-status.

Here’s my status:

fdb> status

Using cluster file `fdb.cluster'.

Configuration:
  Redundancy mode        - single
  Storage engine         - ssd-2
  Coordinators           - 5

Cluster:
  FoundationDB processes - 35
  Machines               - 5
  Memory availability    - 4.6 GB per process on machine with least available
  Retransmissions rate   - 1 Hz
  Fault Tolerance        - 0 machines
  Server time            - 10/03/18 17:52:49

Data:
  Replication health     - Healthy (Rebalancing)
  Moving data            - 0.040 GB
  Sum of key-value sizes - 47.517 GB
  Disk space used        - 184.154 GB

Operating space:
  Storage server         - 227.7 GB free on most full server
  Log server             - 1.0 GB free on most full server

Workload:
  Read rate              - 105 Hz
  Write rate             - 5 Hz
  Transactions started   - 6 Hz
  Transactions committed - 1 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 10/03/18 17:52:49

I originally created this as triple ssd. I wanted to try various benchmarks with various storage schemes, so at one point I tried changing it to double memory and, when that became unhappy, single memory. That still wasn’t happy so I changed to single ssd which is where I am now.

Many things about the status output confuse me:

  1. What is the relationship between “disk space used” and “sum of key-value sizes”? When I was on triple ssd the numbers were roughly the same as now (47 GB / 189 GB). I had assumed that “disk space used” would account for the data replication, but it didn’t drop when changing from triple to single. How much memory would I need in order to change from ssd to memory? I had assumed that it would be 47 GB + some overhead, but it seems like maybe I would need 180+ GB of memory?
  2. Why does it say that the operating space for the log server is “1.0 GB free on most full server”? All of the servers have plenty of free disk space (100s of GBs) on the volume where the log directory lives.
  3. When I refresh status, the “Replication health” tends to cycle between “Healthy” and “Healthy (Rebalancing)” (plus a small nonzero amount of “Moving data”). This is a cluster that is completely idle. What is it doing?

(A.J. Beamon) #2

Sum of key-value sizes is the logical size of your data, excluding overhead and replication. It’s an estimate of the sum of the sizes of your keys and values.

Disk space used is a measure of how much space the files for your data take on disk. This includes overhead and replication, as you suspected.

In order to switch from ssd to memory, you would need enough space in your memory storage engine to store all of your data. A prerequisite of this is that you have sufficient memory on the hosts, but by default each process will only use 1 GB of memory for storage in the memory storage engine. With 35 processes, there wouldn’t have been sufficient space using the default parameters, which is probably why it was unhappy.

To configure the amount of space each process is willing to store, use the --storage_memory argument (or storage_memory configuration parameter in foundationdb.conf). You’ll likely also want to increase the value of the --memory parameter (default is 8GB) to accommodate the extra memory being used.

The amount of space used in memory will include some overhead (in addition to replication), which I believe is currently something like 72 bytes per kv-pair. The relative amount of overhead, then, will depend on the size of your individual kv-pairs.

A common reason for the disk space used number being higher than you expect is that the data files don’t immediately shrink when data is removed from them. When data is deleted, it is disconnected from the rest of the B-Tree and is then slowly cleaned up over time. This is 2 stage process, and the 2nd stage is responsible for actually shrinking the file. Therefore you may see that the size used on disk doesn’t change for a while and then eventually it starts slowly decreasing. It should be noted that even though the file isn’t shrinking during the 1st stage, pages within the file are being made available for reuse by the storage engine.

It’s also sometimes possible for files to get orphaned (usually if some of the data files go missing or get corrupted). In that case, you would have to remove them manually after confirming that you don’t need them (i.e. that you aren’t missing any data).

This is a number that appears to be coming from the memory storage engine based on the 1.0 GB being equal to the default value of --storage_memory. This may very well be a bug given that you’ve switched back to ssd and the cluster seems to be quiescent, so we may have to look into that. If you restart all of the processes in your cluster (kill; kill all in fdbcli), does the operating space number look better?

The cluster is constantly monitoring the distribution of data to try to keep the amount assigned to each server relatively even. It often does small amounts of data movement in the background to try to achieve this, but if the database is idle for long enough I think it would eventually get to a point where it doesn’t need to do any more rebalancing.


(Caleb Spare) #3

Thanks – I had missed this from the configuration documentation.

The fact that the status output also contains a separate “Memory availability” metric (which is a different thing) made this more confusing.

Memory availability    - 4.6 GB per process on machine with least available

Yeah – 20 hours later, the status looks a lot better:

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 47.517 GB
  Disk space used        - 59.449 GB

I’m not sure when it fixed itself, but it went away sometime in the last 20 hours:

Operating space:
  Storage server         - 251.4 GB free on most full server
  Log server             - 251.4 GB free on most full server

Yeah, this has also stopped happening sometime in the last 20 hours.

I guess that the main point of confusion was how long things would take to reach quiescence after major configuration changes. I figured that because I had stopped querying the DB and waited 30 minutes and because I couldn’t see all that much disk or CPU activity that the state had finished changing as much as it was going to change. That turned out not to be the case.

Thanks for your help @ajbeamon!


(Caleb Spare) #4

Hmm, now I have more questions.

I have another cluster with more or less the same data in it. It’s on different instance types. It’s a single memory cluster.

It has been totally idle for a couple of days now and the status says:

Configuration:
  Redundancy mode        - single
  Storage engine         - memory
  Coordinators           - 1

Cluster:
  FoundationDB processes - 32
  Machines               - 4
  Memory availability    - 7.9 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 10/05/18 23:05:59

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 47.309 GB
  Disk space used        - 182.745 GB

Operating space:
  Storage server         - 4.3 GB free on most full server
  Log server             - 6.6 GB free on most full server

Workload:
  Read rate              - 17 Hz
  Write rate             - 0 Hz
  Transactions started   - 4 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Client time: 10/05/18 23:05:59

Why is “Disk space used” nearly 4x “Sum of key-value sizes”? And just to be clear, does “Disk space used” really mean “memory used” given that this is using the memory storage engine?

While it’s been idle, the “Disk space used” number has actually been increasing a little over time.


(Alex Miller) #5

The memory storage engine means that we serve out of memory, but it’s still durably stored on disk. The format in which it is stored is that for each mutation to the database that’s made, we log the mutation and an equally sized chunk of a snapshot of the data stored in memory. Once we have a complete snapshot, we start writing a new one interleaved with mutations. Once we finish the second, we delete the first, but we don’t shrink the files in which we circularly write this data.

two (snapshots + mutation) files = 2 * ( 2 * snapshot size ) = 4 * snapshot size

so 4x sounds about right to me.

Growing by… how much? There’s things like timekeeper that run in the background and write tiny amounts of data into your database as an implementation detail (and eventually clean it up also). Like, are we talking some MB or some GB of data?


(David Scherer) #6

so 4x sounds about right to me.

If the documentation can be believed, it can go up to about 8x

The memory storage engine is for when your disks are too slow, not for when they are too small :-)

Dave


(A.J. Beamon) #7

It’s also worth mentioning that the memory storage engine’s files never shrink. This means that they can grow to about 8x the size of your storage_memory parameter, but if you then clear out the database they will remain the same size. For this reason, it may not be useful to compare the size of your files against the KV size.


(gaurav) #8

I tried installing a fresh FDB install on a linux machine and changed the storage engine to ssd fdb> configure ssd single. However, the status still shows Log Server operating space to be 1.0 GB free.

I have tried kill;kill all and also manually stopping and starting fdbmonitor via service commands.

Is this a known issue with FDB 5.2 (just a reporting issue or more serious) ?

status json output: status json

fdb> status details

Using cluster file `fdb.cluster'.

Configuration:
  Redundancy mode        - single
  Storage engine         - ssd-2
  Coordinators           - 1

Cluster:
  FoundationDB processes - 1
  Machines               - 1
  Memory availability    - 7.4 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 11/06/18 08:56:08

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 0 MB
  Disk space used        - 0 MB

Operating space:
  Storage server         - 101.8 GB free on most full server
  Log server             - 1.0 GB free on most full server

Workload:
  Read rate              - 14 Hz
  Write rate             - 0 Hz
  Transactions started   - 4 Hz
  Transactions committed - 0 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  127.0.0.1:4500         (  2% cpu; 42% machine; 0.001 Gbps;  1% disk IO; 0.3 GB / 7.4 GB RAM  )

Coordination servers:
  127.0.0.1:4500  (reachable)

Client time: 11/06/18 08:56:08

(A.J. Beamon) #9

Hmm, someone else just reported something similar in Scripts to deploy, benchmark, and tinker with 1M operations/sec FoundationDB cluster on AWS. Off the top of my head I don’t think I’ve personally encountered this or know what the problem is, but let me see if I can reproduce.


(A.J. Beamon) #10

I can easily reproduce in both 5.2 and 6.0. It seems like it’s not actually converting the log’s persistent queue to the ssd storage engine, which means that you probably are legitimately limited to 1GB of data there. I’ll keep looking and let you know what I find.


Scripts to deploy, benchmark, and tinker with 1M operations/sec FoundationDB cluster on AWS
(A.J. Beamon) #11

We’ve found and fixed the issue in a commit on this PR, which will become a part of 6.0. In earlier versions, it’s possible to work around this by excluding the tlog processes running the wrong storage engine, which may necessitate first adding additional processes. If there is a strong desire for it, we could also probably backport the fix to 5.2.


(gaurav) #12

Thanks for fixing it. I don’t have an immediate need for it to be backported to 5.2.