FoundationDB

Extremely large logqueue file


#1

I’ve been running FDB in production since October, and just noticed a logqueue file this past weekend that was that was 120GB. I’m currently running two servers on one disk, with only one being a coordinator.

Status

fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - single
  Storage engine         - ssd-2
  Coordinators           - 1

Cluster:
  FoundationDB processes - 2
  Machines               - 1
  Memory availability    - 4.0 GB per process on machine with least available
  Fault Tolerance        - 0 machines
  Server time            - 01/23/19 17:36:25

Data:
  Replication health     - Healthy
  Moving data            - 0.000 GB
  Sum of key-value sizes - 57 MB
  Disk space used        - 82 MB

Operating space:
  Storage server         - 118.7 GB free on most full server
  Log server             - 118.7 GB free on most full server

Workload:
  Read rate              - 679 Hz
  Write rate             - 9 Hz
  Transactions started   - 337 Hz
  Transactions committed - 3 Hz
  Conflict rate          - 6 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

My workload is very read heavy, and a large majority of my write transactions never end up making writes to disk. My config file is default except for adding one more server and having the servers listen on the public interface.

## Default parameters for individual fdbserver processes
[fdbserver]
command = /usr/sbin/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
# logsize = 10MiB
# maxlogssize = 100MiB
# machine_id =
# datacenter_id =
# class =
# memory = 8GiB
# storage_memory = 1GiB
# metrics_cluster =
# metrics_prefix =

## An individual fdbserver process with id 4500
## Parameters set here override defaults from the [fdbserver] section
[fdbserver.4500]
[fdbserver.4501]

My coordinator server is 4500, but 4501 is the server with the ballooned logqueue.

Here is an ls of the problem server’s data directory

root@fdb-1:/var/lib/foundationdb/data/4501# ls -lh
total 122G
-rw---S--- 1 foundationdb foundationdb  16K Oct 19 07:21 log-51b11ce3e25f3a29e4107aa5fb2ed583.sqlite
-rw---S--- 1 foundationdb foundationdb 104K Oct 19 07:21 log-51b11ce3e25f3a29e4107aa5fb2ed583.sqlite-wal
-rw---S--- 1 foundationdb foundationdb 122G Jan 23 17:46 logqueue-51b11ce3e25f3a29e4107aa5fb2ed583-0.fdq
-rw---S--- 1 foundationdb foundationdb 1.7M Oct 19 07:13 logqueue-51b11ce3e25f3a29e4107aa5fb2ed583-1.fdq
-rw---S--- 1 foundationdb foundationdb 4.0K Oct  7 09:19 processId
-rw---S--- 1 foundationdb foundationdb  39M Jan 23 17:46 storage-60bcc85c37134f573bd38d20b4ecc825.sqlite
-rw---S--- 1 foundationdb foundationdb 3.1M Jan 23 17:46 storage-60bcc85c37134f573bd38d20b4ecc825.sqlite-wal

And the good one

root@fdb-1:/var/lib/foundationdb/data/4500# ls -lh
total 40M
-rw---S--- 1 foundationdb foundationdb  28K Oct 19 07:21 coordination-0.fdq
-rw---S--- 1 foundationdb foundationdb  16K Oct 19 07:21 coordination-1.fdq
-rw---S--- 1 foundationdb foundationdb  76K Jan 23 17:48 log-9a7b5ec05e8401f3bf2ce59043f298f6.sqlite
-rw---S--- 1 foundationdb foundationdb 100K Jan 23 17:48 log-9a7b5ec05e8401f3bf2ce59043f298f6.sqlite-wal
-rw---S--- 1 foundationdb foundationdb 964K Jan 23 17:47 logqueue-9a7b5ec05e8401f3bf2ce59043f298f6-0.fdq
-rw---S--- 1 foundationdb foundationdb 2.1M Jan 23 17:48 logqueue-9a7b5ec05e8401f3bf2ce59043f298f6-1.fdq
-rw---S--- 1 foundationdb foundationdb 4.0K Oct  7 09:18 processId
-rw---S--- 1 foundationdb foundationdb  35M Jan 23 17:48 storage-f4607c08210c6c967767a903992c691a.sqlite
-rw---S--- 1 foundationdb foundationdb 2.2M Jan 23 17:48 storage-f4607c08210c6c967767a903992c691a.sqlite-wal

Does anyone have any thoughts on this? It seems to be growing at about 1GB a day which doesn’t seem sustainable.


(Alex Miller) #2

That is… odd. I’m going to guess that you probably send about 1GB per day of writes to this cluster?

To debug this, I’d require the smaller file:
1.7M Oct 19 07:13 logqueue-51b11ce3e25f3a29e4107aa5fb2ed583-1.fdq
and the last 1MB or so of the large file:
122G Jan 23 17:46 logqueue-51b11ce3e25f3a29e4107aa5fb2ed583-0.fdq

because the end of the large file will give a pointer into the small file as to what commit it is that it can’t remove yet.

However, you mention that this is a production system, are you allowed to share the files (ie. they don’t contain corporate data or PII)? If not, I can try to put together a couple scripts that’d do the same checks that I’d end up doing by hand.

Or, if this situation becomes an emergency for you, I’d suspect that if you just kill -9 the process with the growing log queue file, the situation will eventually fix itself. FDB will have to read and process a 120GB file, which will take some time, so expect FDB to be unavailable for a few minutes after killing it, but it should eventually come back fine. I would greatly appreciate being able to first hunt down why this is happening for you, though.


#3

Nothing confidential, I’ll gladly send it to you. I doubled the disk’s size last weekend so it should be fine for a while as well. Do you have a preferred email I can send the files to?


(Alex Miller) #4

[Transfer of files was handled over private message and email.]

Okay, so. I think I’ve extracted about all I can from your files, and it didn’t turn out to be quite what I thought or expected…

Your 122GB file has ~1.5MB of actually “active” data in it. Your 1.7MB file is entirely 0x00 bytes, which suggests that it was cleared as part of recovering the database when the process last started, and then never written to again. It appears that something caused the logic of when to switch files to break, so one file is just infinitely growing instead. I fully expect that FDB should go back to being sane if you restarted the processes, but transaction logs currently never shrink their disk queue files, so this wouldn’t return your disk space.

To get your disk space back, you’ll need to:

  1. Add another process to your fdbmonitor.conf: e.g. fdbserver.4502
  2. fdbcli> exclude $IP:4501
  3. Wait for the exclude to finish
  4. Remove [fdbserver.4501] from fdbmonitor.conf
  5. fdbcli> include $IP:4501

Which basically is to request FDB to move to a new process, which thus creates a new disk queue for the new process, and removing the old process lets you remove the old files.

Our simulation tests don’t have much that enforces anything about disk queue file sizes, so I’m not actually certain that it would catch any issues that could cause this. I’ll go look into seeing if I can add something to our simulation tests that would catch this, and see if I can get them to figure out how this happens.