Changing public address of (non-coordinator) fdbserver corrupted cluster

Hey, I lost my FDB cluster today when I tried to change the public address. I’m running 6.2.25. I have a cluster running on a single EC2 instance, and I’m about to add a second EC2 instance to run some fdbservers from there as well. Since the existing setup ran on just the single instance, the public_address in foundationdb.conf was 127.0.0.1. I changed this to the IP of the instance for two unset-class fdbservers, neither of which was a coordinator.
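
Roughly what the edit amounted to in foundationdb.conf (the section layout and ports here are illustrative; the gist is just the public_address value):

  [fdbserver.4510]
  # was: public_address = 127.0.0.1:$ID
  public_address = <instance-ip>:$ID

  [fdbserver.4511]
  # was: public_address = 127.0.0.1:$ID
  public_address = <instance-ip>:$ID

Then things failed: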

Process performance details:
  <new-ip>:4510:tls  ( 44% cpu; 30% machine; 0.024 Gbps; 80% disk IO; 1.2 GB / 19.2 GB RAM  )
    Last logged error: StorageServerFailed: internal_error at Fri Mar 12 03:40:56 2021
   <new-ip>:4511:tls  ( 71% cpu; 30% machine; 0.024 Gbps; 80% disk IO; 1.4 GB / 19.2 GB RAM  )
  127.0.0.1:4500:tls     ( 87% cpu; 30% machine; 0.024 Gbps; 51% disk IO; 7.3 GB / 19.1 GB RAM  )
 [...]

The logs for the failed storage server showed:

<Event Severity="40"
Time="1615520455.464246"
Type="DeliveredToNotAssigned"
ID="0000000000000000"
Version="24615219935000"
Mutation="code: SetValue param1: <some bytes> param2: <some bytes>"
Backtrace="addr2line -e fdbserver.debug -p -C -f -i [...]
Roles="MP,SS,TL" />

Looks like this means the database metadata was corrupted (see this earlier thread: Fdbserver error in a cluster with double redundancy). The all-zeros ID looks odd; is it supposed to be the ID of the fdbserver?

I changed the public address back to 127.0.0.1, but couldn’t get that storage server back up: it always hit the same DeliveredToNotAssigned error, even after we restarted everything. We had to wipe the DB since we’re running rf1.

Note that this change was made on two fdbservers, but only one was corrupted. If I shut off the corrupted fdbserver, everything else starts up fine (but data is missing).

A few questions:

  1. Can someone shed some light on this error? Also, the bytes in the error look like a regular write from a client - would there have been any way to just throw away this mutation to salvage the rest of the DB? And, why did this corrupt one fdbserver but not the other one?
  2. Most concerning: this happened on a non-Kubernetes cluster, but we also have a k8s setup where, on startup, our FDB wrapper service finds the IP of its pod and sets that as the public address. We’re not using the official FDB k8s operator. Is what we’re doing unsafe? Are we at risk of the same corruption there? The config change we made feels very similar to what we do there.
  3. What is a safe way to do this (change the public address from 127.0.0.1 to the IP of the host)? The corruption happened on our dev cluster where I was testing this, but we still need to do this on one of our production (non-kubernetes) databases.

Thanks in advance!

I looked into this a bit more; it seems like the error comes from the update path that’s hit on storage server startup: foundationdb/storageserver.actor.cpp at master · apple/foundationdb · GitHub

It seems like one of the mutations falls in a “not assigned” shard. Initially a single all-keys shard is added to each storage server, and it looks like shards are then split and assigned: foundationdb/storageserver.actor.cpp at master · apple/foundationdb · GitHub

I’m still having trouble seeing how a not-assigned shard would end up on a storage server, though; I’m guessing the corruption already existed before the bad mutation.

I think Building a Cluster — FoundationDB 6.2 is the currently recommended way of updating the IP from 127.0.0.1.
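
If I remember that page right, it boils down to running the packaged script, which rewrites the coordinator address in the cluster file; roughly (the before/after lines are illustrative):

  sudo /usr/lib/foundationdb/make_public.py
  # fdb.cluster before: description:ID@127.0.0.1:4500
  # fdb.cluster after:  description:ID@<public-ip>:4500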

Oh, I didn’t see that script before. Thanks! A few questions though:

  1. That looks like it’s for a single node, and we have a bit of an odd setup that mimics multiple FDB machines on our single instance. We have a microservice wrapping FDB, and each instance of that service runs in a separate Docker container (host network mode), running fdbmonitor with its own cluster file and multiple fdbserver processes. For example, in my dev stack I have 3 instances of the microservice, so 3 Docker containers each running 1 fdbmonitor process and 2 fdbserver processes, and the cluster file for each container is a different file with the same contents. I’m guessing I’d have to shut down all the fdbserver processes, run the script on each of the individual cluster files, and start them all back up?

  2. It seems like the script only changes the cluster file addresses, while I still have to update the public_address in all the foundationdb.conf files. Do you know whether that has to be done before, after, or at the same time as the cluster file change?

I’m OK with making this a downtime process; I’m just concerned because I don’t understand why this change caused corruption, and I want a safe way to do it.

Is it possible for you to use the auto:<port> syntax from Configuration — FoundationDB 7.1? I couldn’t find great documentation on what exactly it does, but here’s a reference to the source: https://github.com/apple/foundationdb/blob/df90cc89de67ea4748c8cadd18e6fc4ce7fda12e/fdbclient/AutoPublicAddress.cpp#L31. I think this is the recommended practice (disclaimer: I don’t actually administer FDB clusters myself very much).
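
For what it’s worth, I believe the stock packaged foundationdb.conf already uses that form (quoting from memory, so check against your install):

  [fdbserver]
  command = /usr/sbin/fdbserver
  public_address = auto:$ID
  listen_address = public
  datadir = /var/lib/foundationdb/data/$ID
  logdir = /var/log/foundationdb

As I read the linked source, auto makes the process advertise whichever local address it uses to reach a coordinator, so you never have to hard-code the host IP.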

I’m also not sure why or how this caused corruption. I’m going to try to repro locally.

Unfortunately I wasn’t able to repro. I had a foundationdb.conf with three 6.2.25 fdbservers with public_address = 127.0.0.1:$ID, then changed the two non-coordinator processes to public_address = <ip>:$ID, but the database kept working.
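
Concretely, the test conf looked roughly like this (ports illustrative; 4500 was the coordinator and was left alone):

  [fdbserver.4500]
  # coordinator, left on loopback
  public_address = 127.0.0.1:$ID

  [fdbserver.4501]
  # changed from 127.0.0.1:$ID
  public_address = <ip>:$ID

  [fdbserver.4502]
  # changed from 127.0.0.1:$ID
  public_address = <ip>:$ID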

I think the two machines part might be key here.

I’ve been trying to find some sort of story that looks like…

  • Assume you had two machines, A and B
  • A cluster is running on A with 127.0.0.1:4500 and 127.0.0.1:4501
  • Two fdbserver processes are started on B as 127.0.0.1:4500 and 127.0.0.1:4501
  • The cluster on A is changed to use public IPs
  • The two fdbserver processes on B are changed to point to the cluster on A
  • The public address of 127.0.0.1:4500 from B resolves to the local fdbserver process on A instead of the “correct” process on B

Or something like that? I’m not sure that exact chain of events would work, but I think something that revolves around 127.0.0.1 being ambiguous if you manage to introduce it into a multi-host cluster has to be the key here.
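
In other words, the exact same cluster file line would mean a different process depending on which host reads it; something like this (contents illustrative, and the # lines are just my annotations):

  # fdb.cluster, identical copy on A and B
  test:abcdefg12345@127.0.0.1:4500
  #
  # read on A: 127.0.0.1:4500 reaches A's local fdbserver
  # read on B: the same line reaches B's local fdbserver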

Hm, the short answer is we’re running them on the same instance (so same 127.0.0.1) and they all have different ports. What you’re describing sounds like a faulty setup that wouldn’t have worked to start with, right?

That being said, I don’t understand a few parts of what you mentioned that might be relevant here.

  • The cluster on A is changed to use public IPs

To clarify, you mean the processes on A switch their public_address in foundationdb.conf from 127.0.0.1 to the IP of the instance, right? (There isn’t some other boolean flag in the config stating whether this is a local vs. public IP?)

  • The two fdbserver processes on B are changed to point to the cluster on A

Is this a configuration change needed on B? I thought A would just start up, hit the coordinator, get identified as the same existing processes (due to having the same sqlite filenames), and then the cluster controller would tell B what the addresses of the A processes are?

A more in-depth description of what happened:

Initial setup:

  • service 1 is running fdbmonitor that runs 127.0.0.1:4500 and 127.0.0.1:4501
  • service 2 is running fdbmonitor that runs 127.0.0.1:4502 and 127.0.0.1:4503
  • […]
  • service 6 is running fdbmonitor that runs 127.0.0.1:4510 and 127.0.0.1:4511

The services are just instances of our wrapper service, and services 1-6 all run on the same host. All fdbservers are unset class. There’s one coordinator: 4500 on service 1.

Config change:

  • 4510 and 4511 (on service 6) change the public_address key in their foundationdb.conf from 127.0.0.1 to the IP of the instance
  • 4510 and 4511 restart

Corruption:

  • 4511 comes up healthy; 4510 does not (it hits the DeliveredToNotAssigned error). In status details, both of these processes are listed with the IP of the instance, rather than with 127.0.0.1 like all the other fdbservers.
  • There’s another error in 4503, on the resolver role.
  • All fdbservers are restarted, no change - same status details output, same errors
  • 4510 is turned off, all other fdbservers come up healthy (but we’re missing data now)

Would this timeline have caused an fdbserver to incorrectly resolve the address of another fdbserver?

Also, regarding multiple machines: we do plan on expanding to a second machine, but only after the public IP config change on the current machine is done.

One other thing that could be helpful: I’m not sure if this is normal, but after the config change (I didn’t check whether this was also the case before it) I saw that there were two storage servers running on the failing fdbserver, 4510. One was very small (judging by the sqlite file with the same name as the storage server ID) and started up fine; the big one is the one that failed. In status json, the storage role details/metrics showed up for the new storage role ID, but for the old storage role ID it just showed the error message. I saw trace events tagged with both storage server IDs. Is it possible that, because of the address change, the fdbserver was identified as a totally new one, and that’s why it spun up a new storage role?
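
This is roughly how I was lining up the role IDs with the on-disk files (the data dir path is from our setup, so treat it as illustrative, and the status json field names are from my reading of the output, so double-check them):

  # list every process with its role types and role IDs
  fdbcli --exec 'status json' \
    | jq '.cluster.processes[] | {address, roles: [.roles[] | {role, id}]}'

  # the storage role IDs should match the sqlite file names in the data dir
  ls /var/lib/foundationdb/data/4510/
  # storage-<old-ss-id>.sqlite  storage-<new-ss-id>.sqlite  ...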

And also, regardless of how it got into that state, is there anything actually incorrect about running multiple storage servers on one fdbserver (other than resource constraints and not being able to decommission one specific storage role)? My understanding is that shard assignment is tracked by storage server ID, not by IP?

Update: I read the read-write path doc a bit more (by the way, this doc is awesome! A similar doc on the startup path would also be super great) and I’m curious about the SS to primary TLog mapping. There should be a primary TLog for each of the two storage roles that showed up on 4510; is it possible the old SS was receiving mutations meant for the new SS (on the same fdbserver)? Is there something in that path that identifies processes by IP rather than by SS ID? We did, however, try removing the files for the new SS as a last resort to get the cluster working, and we still hit the same error.