Mac version of FoundationDB seems unhealthy

(rob rodgers) #1

We run a single fdb instance on Mac OS X with storage engine = memory. Our development instances of fdb seem to get into an interesting state fairly frequently, and I’m trying to learn enough to understand what’s going on.

I have pasted status from fdbcli below and was hoping someone could provide some tips. The state does not resolve itself; the database is quite wedged at this point. Advice on how to unwedge it would be nice.

Sometimes, for reasons we haven’t figured out, the fdb instance becomes unhealthy. On occasion, waits on the FutureNil returned by tr.Watch(key) start failing with “FoundationDB error code 1009 (Request for future version)” errors. This is relatively easy to reproduce: it occurs after approximately 219 iterations of a test program at a very low rate of change (about 1 key write per second), and after that the database itself becomes unresponsive. I/O-wise, the laptop is idle.

fdb> status

WARNING: Long delay (Ctrl-C to interrupt)

Using cluster file `/usr/local/etc/foundationdb/fdb.cluster'.

Unable to start default priority transaction after 5 seconds.

Unable to start batch priority transaction after 5 seconds.

Unable to retrieve all status information.

Redundancy mode - single
Storage engine - memory
Coordinators - 1

FoundationDB processes - 1 (less 0 excluded; 1 with errors)
Machines - 1
Memory availability - 7.9 GB per process on machine with least available
Fault Tolerance - 0 machines
Server time - 08/08/18 17:53:46

Replication health - Healthy (Repartitioning.)
Moving data - 0.237 GB
Sum of key-value sizes - 544 MB
Disk space used - 1.292 GB

Operating space:
Storage server - 399.5 GB free on most full server
Log server - 399.5 GB free on most full server

Read rate - 1 Hz
Write rate - 0 Hz
Transactions started - 0 Hz
Transactions committed - 0 Hz
Conflict rate - 0 Hz
Performance limited by process: Storage server performance (storage queue).
Most limiting process:

Backup and DR:
Running backups - 0
Running DRs - 0

Client time: 08/08/18 17:53:38

(Alex Miller) #2

The in-memory storage engine defaults to being willing to store ~1GB of data before filling up. It looks like your continuous test runs have written more than 1GB, so your storage server is no longer accepting more data, and thus your FDB grinds to a halt.

You have a few options of how to get out of this:

  1. Edit /etc/foundationdb/foundationdb.conf (on macOS, /usr/local/etc/foundationdb/foundationdb.conf) to uncomment storage_memory = 1GiB, and change the 1 to a 2, or to something high enough to fit all the data for your tests.
  2. Change your tests to clear the database in between runs.
  3. Run configure ssd via fdbcli to switch to a disk-based storage engine rather than an in-memory one; then you only have to worry about disk size limitations.
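
As a rough sketch of option 1, the relevant part of foundationdb.conf might look like the fragment below after the change. The 2GiB value is only an illustration, and placing the setting in the general [fdbserver] section assumes the stock layout of the config file, so check it against your own copy.

```ini
[fdbserver]
# Shipped commented out as "# storage_memory = 1GiB"; uncommented and raised here.
# 2GiB is an illustrative value, not a recommendation: size it to your test data.
storage_memory = 2GiB
```

fdbmonitor normally notices changes to this file and restarts the affected fdbserver processes, so a manual restart should not be needed.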

None of them are really better than the others. If you just want to stop dealing with this problem, then configure ssd is probably the easiest (but I’ve forgotten whether you’ll need to unbreak FDB before changing the storage engine).

(rob rodgers) #3

Thanks much.

I will first reproduce it by driving the database to a reasonable amount of data, then switch to ssd.

The OOM behavior here is pretty amazing - everything falls apart. To recover you pretty much have to rototill the whole thing or it will be stuck in this state.

(Alex Miller) #4

Yeah, it’s unfortunately the roughest when you have just one process being a storage server, because then your only option is to feed the hungry storage server more memory.

Oh. I guess there’s also option 4) Add a [fdbserver.4501] line to /etc/foundationdb/foundationdb.conf to give it a second storage server, and thus another 1GB of RAM, to rebalance data across. And then clear your database.
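
A minimal sketch of what option 4 could look like in foundationdb.conf. It assumes the existing process is the default one on port 4500; that port is an assumption, since only 4501 is named above.

```ini
# Existing process (assumed to be on the default port 4500).
[fdbserver.4500]

# New, empty section: fdbmonitor starts a second fdbserver on port 4501,
# which inherits its settings from the general [fdbserver] section.
[fdbserver.4501]
```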

Edited: Removed wrong text and I should go read what different semantics high priority transactions have.

(A.J. Beamon) #5

What version are you running? There was a bug in 5.1 that made this behavior worse (I think it was fixed in 5.2). The expected behavior as the database fills up is to slow down and eventually stop accepting normal priority writes, but still allow high priority ones that can be used to clear space as needed. It leaves a small buffer of space to support these operations. Also, status would indicate that you are running out of space in this case.

The bug is that in the memory storage engine we were accounting for the amount of remaining space incorrectly, so the database would happily use all available space, at which point it basically locks up. It also prevents status from recognizing what’s happening until you’ve hit this state.

The status message you posted looks like it’s from the old version, and if that’s the case it would probably be worth upgrading.