WARNING: A single process is both a transaction log and a storage server

I know cluster setups have been covered a lot, but I'm running into an issue that seems illogical to me and was hoping to get some advice.

I drained our old cluster and replaced it with a larger one. Everything is working as it should, however I'm getting this warning:

WARNING: A single process is both a transaction log and a storage server.
  For best performance use dedicated disks for the transaction logs by setting process classes.

It's confusing, because the old cluster had similar settings and was fine. Is there a way to find out which process it is? I tried to poke around in the status JSON, but couldn't find it.

Here is the setup

24x 4cpu, 16GB RAM, 1.2TB space

  Redundancy mode        - triple
  Storage engine         - ssd-2
  Coordinators           - 3
  Exclusions             - 26 (type `exclude' for details)
  Desired Proxies        - 7
  Desired Resolvers      - 1
  Desired Logs           - 4
10.128.0.104:4500,stateless
10.128.0.104:4501,storage
10.128.0.104:4502,storage
10.128.0.104:4503,stateless
10.128.0.122:4500,stateless
10.128.0.122:4501,storage
10.128.0.122:4502,storage
10.128.0.122:4503,stateless
10.128.0.215:4500,stateless
10.128.0.215:4501,storage
10.128.0.215:4502,storage
10.128.0.215:4503,stateless
10.128.1.166:4500,stateless
10.128.1.166:4501,storage
10.128.1.166:4502,storage
10.128.1.166:4503,stateless
10.128.1.222:4500,stateless
10.128.1.222:4501,storage
10.128.1.222:4502,storage
10.128.1.222:4503,stateless
10.128.1.250:4500,stateless
10.128.1.250:4501,storage
10.128.1.250:4502,storage
10.128.1.250:4503,stateless
10.128.2.102:4500,stateless
10.128.2.102:4501,storage
10.128.2.102:4502,storage
10.128.2.102:4503,stateless
10.128.2.245:4500,stateless
10.128.2.245:4501,log
10.128.2.245:4502,log
10.128.2.245:4503,stateless
10.128.2.44:4500,stateless
10.128.2.44:4501,storage
10.128.2.44:4502,storage
10.128.2.44:4503,stateless
10.128.2.50:4500,stateless
10.128.2.50:4501,log
10.128.2.50:4502,log
10.128.2.50:4503,stateless
10.128.2.84:4500,stateless
10.128.2.84:4501,storage
10.128.2.84:4502,storage
10.128.2.84:4503,stateless
10.128.3.142:4500,stateless
10.128.3.142:4501,storage
10.128.3.142:4502,storage
10.128.3.142:4503,stateless
10.128.3.231:4500,stateless
10.128.3.231:4501,storage
10.128.3.231:4502,storage
10.128.3.231:4503,stateless
10.128.3.24:4500,stateless
10.128.3.24:4501,storage
10.128.3.24:4502,storage
10.128.3.24:4503,stateless
10.128.3.32:4500,stateless
10.128.3.32:4501,storage
10.128.3.32:4502,storage
10.128.3.32:4503,stateless
10.128.4.149:4500,stateless
10.128.4.149:4501,storage
10.128.4.149:4502,storage
10.128.4.149:4503,stateless
10.128.4.229:4500,stateless
10.128.4.229:4501,storage
10.128.4.229:4502,storage
10.128.4.229:4503,stateless
10.128.4.50:4500,stateless
10.128.4.50:4501,storage
10.128.4.50:4502,storage
10.128.4.50:4503,stateless
10.128.5.105:4500,stateless
10.128.5.105:4501,storage
10.128.5.105:4502,storage
10.128.5.105:4503,stateless
10.128.6.122:4500,stateless
10.128.6.122:4501,storage
10.128.6.122:4502,storage
10.128.6.122:4503,stateless
10.128.6.253:4500,stateless
10.128.6.253:4501,storage
10.128.6.253:4502,storage
10.128.6.253:4503,stateless
10.128.6.6:4500,stateless
10.128.6.6:4501,storage
10.128.6.6:4502,storage
10.128.6.6:4503,stateless
10.128.7.2:4500,stateless
10.128.7.2:4501,storage
10.128.7.2:4502,storage
10.128.7.2:4503,stateless
10.128.7.96:4500,stateless
10.128.7.96:4501,storage
10.128.7.96:4502,storage
10.128.7.96:4503,stateless

If I'm not completely blind, then we assign the log class to 4 processes across 2 separate servers, and the other 2 processes on each of those servers are kept stateless. Then we request 4 log processes in the cluster configuration.

Any pointers are greatly appreciated.

It seems your problem is that you only run tlogs on two machines:

10.128.2.245:4501,log
10.128.2.245:4502,log

and

10.128.2.50:4501,log
10.128.2.50:4502,log

However, you run with triple replication. Therefore fdb has to recruit tlogs on at least three machines - so it needs to recruit one on a storage server (which is not optimal but the only way to not violate the contract that the data is replicated across three machines).
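If you want to confirm which process picked up the extra role, something like this should work (a sketch, assuming jq is installed and the usual cluster.processes[].roles[].role layout of the status JSON):

# Dump the status JSON and list every process that currently holds
# both a log and a storage role (trim any non-JSON preamble if fdbcli prints one).
fdbcli --exec 'status json' > /tmp/status.json
jq -r '.cluster.processes | to_entries[]
       | select([.value.roles[]?.role] | (index("log") and index("storage")))
       | .value.address' /tmp/status.json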

With triple replication you probably want to run log processes on at least 3, better 5, machines. If you lose a machine and only run logs on three, you will find a tlog on a storage server again. With triple replication you can survive two machine failures, so 5 is probably a good number.

If you're starved for resources, I would do the following (see the sketch after this list):

  1. Set desired logs to 5
  2. Run a log process on 5 machines
  3. If a machine fails, you will run with 4 logs; if 2 fail, you will run with 3 (which is fine, and you can bring the machines back later and fdb will redistribute the log load).
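A minimal sketch of that layout (the port is hypothetical; the class goes into foundationdb.conf on each of the 5 machines you pick, and the desired count is set once through fdbcli):

## foundationdb.conf on each of the 5 chosen machines
## (port 4501 is just an example - use whichever process you dedicate to logs)
[fdbserver.4501]
class = log

Then, from any machine with access to the cluster:

fdbcli --exec 'configure logs=5'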

If resources are not an issue for you, I would do the following:

  1. Run 1 log on each machine
  2. Set desired logs to something reasonable (for this cluster size probably 12? not sure…)
  3. Keep the same configuration on each machine: for example 1 log, 1 stateless, 2 storage

I don’t know how your disk layout looks though - if not all machines are the same, you might want to use another topology.

Ah, that’s the missing piece.

It's 3 NVMe disks merged into RAID 0.

Wait, I thought logs and storage are not supposed to be mixed together.

Somehow my understanding was that it's better to keep the log processes separate and leave some breathing room. So 2 log, 2 stateless. But if 1 log, 1 storage and 2 stateless works better, I can certainly do that.

Having logs and storages share a process is a bad idea - however, having log and storage processes share a machine is completely fine (we have done that for years now on most of our clusters; on the ones where we didn't, the motivation had more to do with different VM types on AWS, which gives us more flexibility).

The main reason for not having them share a process is CPU: each fdbserver process is single-threaded. Both log and storage processes can consume quite a bit of CPU, and if a log process is starved for CPU (for example because you run some range queries) your commit latency will go up, which can have all kinds of undesired effects.

We also recommend not sharing disks between the logs and storage processes, as that can be a performance cost as well.
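If you later want the tlog data on its own disk while still sharing the machine, a per-process datadir override is one way to do it (a sketch with a hypothetical mount point; per-process sections override the [fdbserver] defaults):

## storage processes keep the default datadir on the storage volume;
## the log process writes to a hypothetical dedicated NVMe mounted at /mnt/fdb-tlog
[fdbserver.4500]
class = log
datadir = /mnt/fdb-tlog/$ID

[fdbserver.4501]
class = storage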

This is news to me. But I will adjust the cluster and see.

Hm, that will make things quite complicated. But will adjust to it.

While we are on the topic, what's the right size for a log disk and what kind of settings should we use? We generally use NVMe disks that are 375GB.

This is what we have in the settings:

command = /usr/sbin/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
logsize = 50MiB
maxlogssize = 10GiB
machine_id = {{machine_id}}
# datacenter_id =
# class =
memory = 13GiB
storage_memory = 1GiB
# metrics_cluster =
# metrics_prefix =

To determine the size of the disk, I would recommend estimating your expected write load in bytes per second and multiplying that by a time window that makes you comfortable (I would say at least 6 hours).

Basically, if you lose a machine, the logs need to be able to keep all writes destined for that lost machine on disk. You should also alert when this disk starts running full.
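As a rough worked example (the write rate is hypothetical, purely to illustrate the arithmetic):

  10 MB/s * 6 hours * 3600 s/hour = 216,000 MB ≈ 216 GB

so a 375GB NVMe leaves reasonable headroom for that (hypothetical) load.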

But your config looks wrong to me (or it looks like it doesn’t do what you want it to do):

This means that each process can use up to 13GiB of memory - but your machine only has 16GiB. I think what you want is for all processes together to be limited to 13GiB of memory. If that is the case, you need to divide this number by the number of processes.
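For example (a sketch of the division; the exact value is a judgment call):

## 4 fdbserver processes on a 16GiB machine:
## 13GiB budget / 4 processes ≈ 3.25GiB, rounded down to leave some headroom
[fdbserver]
memory = 3GiB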

Ah! It’s quite confusing as some settings are global, some are “local” per process.

Ok, I will keep it on 375GB NVME. That should be fine.

BTW here is our config. If you spot something we should be doing differently, please let me know. It's quite standard.

## foundationdb.conf
##
## Configuration file for FoundationDB server processes
## Full documentation is available at
## https://apple.github.io/foundationdb/configuration.html#the-configuration-file

[fdbmonitor]
user = foundationdb
group = foundationdb

[general]
restart_delay = 60
## by default, restart_backoff = restart_delay_reset_interval = restart_delay
# initial_restart_delay = 0
# restart_backoff = 60
# restart_delay_reset_interval = 60
cluster_file = /etc/foundationdb/fdb.cluster
# delete_envvars =
# kill_on_configuration_change = true

## Default parameters for individual fdbserver processes
[fdbserver]
command = /usr/sbin/fdbserver
public_address = auto:$ID
listen_address = public
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
logsize = 50MiB
maxlogssize = 10GiB
machine_id = {{ instance }}
# datacenter_id =
# class =
memory = 4GiB
storage_memory = 1GiB
# metrics_cluster =
# metrics_prefix =

## An individual fdbserver process with id 4500
## Parameters set here override defaults from the [fdbserver] section
[fdbserver.4500]
class=log
[fdbserver.4501]
class=storage
[fdbserver.4502]
class=storage
[fdbserver.4503]
class=stateless

[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent
logdir = /var/log/foundationdb

[backup_agent.1]

This looks good to me. The only thing I would recommend:

I would set restart_delay lower - I think we have it at 10. A 1 minute restart delay is quite long and will put some unnecessary load onto the cluster whenever a single process dies (and with only 4GiB of memory per process, a process will probably OOM on a semi-regular basis).
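That is, in the [general] section of foundationdb.conf (10 is just the value we use, not an official recommendation):

[general]
restart_delay = 10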

What's your memory setting per process?

BTW I can't recall the last time we ran into an OOM on any of the processes. And we have quite a heavy load on the cluster (500k transactions/sec).

It depends on the cluster - but we generally run with a lot of memory (up to 100GiB per process). We used to run with 8GiB per process and saw quite a few OOMs (although that was with FDB 3).

I think FoundationDB still has a weird bug somewhere (at least in 6.1 - I don't know whether this got accidentally fixed in 6.2) where a process will allocate memory like crazy (usually around 3GiB within a few seconds). It should be rare, but if you run into this issue, your process will OOM.

To be fair, you ran with a 13GiB limit. It is very well possible that one process used more than 4GiB at some point while the others didn't use up their memory…

It is possible that you won't see OOMs, but it is better to be prepared to see a few, which is why I suggested setting the restart delay to a lower value.

I think this is the correct observation, which I unintentionally dismissed. I will monitor it closely and see.

Just wow!

The 60 seconds will only kick in when you have two consecutive failures of a process (within the restart_delay_reset_interval, default 60s). With the parameters as they were set, a single process death will result in an immediate restart. I typically use the parameters as they were originally written (0s first restart, 60s second and subsequent, reset to 0 after 60s). There's also an option to back off more slowly from the initial delay to the max delay if you want to be more responsive during some sort of transient issue but avoid frequent repeated restarts.
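Spelled out in the [general] section, that schedule corresponds to the values that appear commented out in the config above:

[general]
## first failure: restart immediately
initial_restart_delay = 0
## second and subsequent consecutive failures: wait 60s before restarting
restart_delay = 60
## after 60s without a further failure, go back to the initial delay
restart_delay_reset_interval = 60
## restart_backoff (default = restart_delay) can be set lower to ramp up more gradually
# restart_backoff = 60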

oh - Today I learned something new…

It seems that these restart parameters aren’t even mentioned in our documentation with the exception of restart_delay, which has a slightly misleading description. That’s probably something we should fix.