Hey, I lost my FDB cluster today when I tried to change the public address. I'm running 6.2.25. I have a cluster running on a single EC2 instance, and I'm about to add a second EC2 instance to run some fdbservers there as well. Since the existing setup ran entirely on the single instance, the public_address in foundationdb.conf was 127.0.0.1. I changed this to the instance's IP for two unset-class fdbservers, neither of which was a coordinator. Then things failed:
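For concreteness, the edit was along these lines in foundationdb.conf (a sketch only: the section names, ports, and paths here are illustrative, not our exact file, and the stock default is `public_address = auto:$ID` rather than a pinned loopback address):

```ini
; /etc/foundationdb/foundationdb.conf (illustrative sketch)
; Per-process sections override the shared [fdbserver] options.

[fdbserver.4510]
; before: public_address = 127.0.0.1:4510:tls
public_address = <instance-ip>:4510:tls

[fdbserver.4511]
; before: public_address = 127.0.0.1:4511:tls
public_address = <instance-ip>:4511:tls
```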
```
Process performance details:
  <new-ip>:4510:tls    ( 44% cpu; 30% machine; 0.024 Gbps; 80% disk IO; 1.2 GB / 19.2 GB RAM )
    Last logged error: StorageServerFailed: internal_error at Fri Mar 12 03:40:56 2021
  <new-ip>:4511:tls    ( 71% cpu; 30% machine; 0.024 Gbps; 80% disk IO; 1.4 GB / 19.2 GB RAM )
  127.0.0.1:4500:tls   ( 87% cpu; 30% machine; 0.024 Gbps; 51% disk IO; 7.3 GB / 19.1 GB RAM )
  [...]
```
The logs for the failed storage server showed:
```
<Event Severity="40" Time="1615520455.464246" Type="DeliveredToNotAssigned" ID="0000000000000000" Version="24615219935000" Mutation="code: SetValue param1: <some bytes> param2: <some bytes>" Backtrace="addr2line -e fdbserver.debug -p -C -f -i [...]" Roles="MP,SS,TL" />
```
It looks like this means the database metadata was corrupted (see the earlier thread "Fdbserver error in a cluster with double redundancy"). The all-zero ID looks odd; is it supposed to be the ID of the fdbserver?
I changed the public address back to 127.0.0.1, but failed to get that storage server back up: it always hit the same DeliveredToNotAssigned error, even after we tried restarting everything. Since we're running rf1, we had to wipe the DB.
Note that this change was made on two fdbservers, but only one was corrupted. If I shut off the corrupted fdbserver, everything else starts up fine (but data is missing).
A few questions:
- Can someone shed some light on this error? The bytes in the error look like a regular write from a client. Would there have been any way to just throw away this mutation to salvage the rest of the DB? And why did this corrupt one fdbserver but not the other?
- Most concerningly: this happened on a non-Kubernetes cluster, but we also have a k8s setup where, on startup, our FDB wrapper service finds the IP of its pod and sets that as the public address. We're not using the official FDB k8s operator. Is what we're doing unsafe, and are we at risk of the same corruption there? The config change we made feels very similar to this.
- What is a safe way to change the public address from 127.0.0.1 to the IP of the host? The corruption happened on the dev cluster where I was testing this, but we still need to do this on one of our production (non-Kubernetes) databases.
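For reference, the k8s wrapper mentioned above does roughly the following at startup. This is a minimal sketch, not our actual wrapper: the file path, port, hard-coded IP, and the sed-based rewrite are all illustrative (in the real pod the IP comes from the environment rather than being fixed):

```shell
#!/bin/sh
# Sketch: discover the pod IP and rewrite public_address in a copy of
# foundationdb.conf before launching fdbmonitor (launch step omitted here).
CONF=/tmp/foundationdb.conf.demo

# Stand-in for the conf file as shipped, pinned to loopback:
cat > "$CONF" <<'EOF'
[fdbserver]
public_address = 127.0.0.1:4500:tls
EOF

# In k8s this would come from the downward API or `hostname -i`;
# hard-coded here so the sketch is self-contained.
POD_IP="10.0.0.7"

# Rewrite the public_address line in place, keeping the port and :tls suffix.
sed -i "s|^public_address = .*|public_address = ${POD_IP}:4500:tls|" "$CONF"
grep public_address "$CONF"   # -> public_address = 10.0.0.7:4500:tls
```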
Thanks in advance!