How to restore cluster after accidentally dropping coordinators

This isn't the first time I've replaced a cluster, but this time I completely forgot to switch coordinators before removing the old servers. The new cluster is up and running, however all 3 coordinators were on the old servers, which have been completely removed.

Is there a way to get things back up and running? I tried to point to the new servers via the coordinators setting, but nothing happens.

fdb> status details

Using cluster file `fdb.cluster'.

Unable to locate a cluster controller within 2 seconds.  Check that there are
server processes running.

Configuration:
  Redundancy mode        - unknown
  Storage engine         - unknown
  Coordinators           - unknown

Cluster:
  FoundationDB processes - unknown
  Machines               - unknown

Data:
  Replication health     - unknown
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - unknown

Operating space:
  Unable to retrieve operating space status

Workload:
  Read rate              - unknown
  Write rate             - unknown
  Transactions started   - unknown
  Transactions committed - unknown
  Conflict rate          - unknown

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:

Coordination servers:
  10.128.0.167:4500  (reachable)
  10.128.3.122:4500  (reachable)
  10.128.8.78:4500  (reachable)

Client time: 08/19/19 10:46:59

If I forcibly point the fdb.cluster file at the new cluster, this is what I get.

status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

The coordinator(s) have no record of this database. Either the coordinator
addresses are incorrect, the coordination state on those machines is missing, or
no database has been created.

  10.128.0.167:4500  (reachable)
  10.128.3.122:4500  (reachable)
  10.128.8.78:4500  (reachable)

Unable to locate the data distributor worker.

Unable to locate the ratekeeper worker.

1 client(s) reported: Cluster file contents do not match current cluster connection string. Verify the cluster file and its parent directory are writable and that the cluster file has not been overwritten externally.
  10.128.0.2:57625

I haven't found a solution yet, and nobody has responded here.

I spun up new servers on the old IP addresses, which helped only with the coordinators, nothing else.

status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  10.128.2.67:4500  (reachable)
  10.128.3.55:4500  (unreachable)
  10.128.8.4:4500  (reachable)

The coordinator(s) have no record of this database. Either the coordinator
addresses are incorrect, the coordination state on those machines is missing, or
no database has been created.

Unable to locate the data distributor worker.

Unable to locate the ratekeeper worker.

Is there a way to recover from this or do I have to rebuild the cluster from scratch?

Oof. In general, losing the data on a majority of your coordinators is data loss.

If you have one coordinator’s data, you could probably save the situation by handing out a new cluster file to everyone that lists only that one coordinator, and then, once the cluster recovers, reconfiguring to a new full set of coordinators. (This might cause data loss if that one coordinator’s data was stale.)
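To sketch that path (the description and ID below are placeholders; keep your cluster’s original description:ID prefix, since that identifies the database): a cluster file is a single line of the form description:ID@addr:port[,addr:port]..., so a hand-written file listing only the surviving coordinator would look like:

```
mydb:Gq1gVPz2@10.128.0.167:4500
```

Hand that same file to every server and client, restart them, and once the cluster recovers run coordinators auto (or list a full set explicitly) in fdbcli to get back to a fault-tolerant coordinator set.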

If you’ve lost all of your coordinators’ data, then… it’d be really hard to reconstruct the state. Theoretically this would be possible to do, but there’s nothing that exists to do so, and it’d probably take at least a few days of work for even us to achieve.

It looks like you’ve spawned coordinators back on the machines, but the data that they had is gone, and thus they don’t know anything about the database that once existed. If you can find a coordinator data directory by hand, and you didn’t drop other processes’ data, then you might be able to do the one-coordinator recovery path. Otherwise… hopefully you had a backup? :sweat_smile:
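For the “find a coordinator data directory by hand” step, a sketch assuming the stock Linux data path (adjust for your install): coordinators persist their replicated state in memory-store files named like coordination-0.fdq inside the process’s data directory.

```shell
# Search the default data directory for surviving coordination state files.
# /var/lib/foundationdb/data is the stock Linux location; adjust as needed.
find /var/lib/foundationdb/data -name 'coordination-*.fdq' 2>/dev/null || true
```

If any such files turn up intact, that directory’s process is a candidate for the one-coordinator recovery path.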

That’s not necessarily what happened, though.

I had cluster 1 and cluster 2. I excluded all cluster 1 IPs, including the coordinators (although FDB warned at the end that they are coordinators and can’t be moved).

Then, instead of changing the coordinators to the new IPs, I just deleted the old servers, including all the coordinators. So the data should be preserved in full on the new cluster, or am I wrong?

Coordinators are not relocated via exclusions, but I can see why that would seem sensible to expect. :confused:

Moving coordinators requires a separate command in fdbcli.

The reasoning behind this is that coordinators aren’t viewed as part of the cluster. Coordinators can, in fact, be shared between multiple clusters that don’t know anything about each other. This is why an exclude on one cluster doesn’t automatically relocate everyone’s coordinators. On the other hand, I’m not aware of anyone that actually uses coordinators in this way.
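For reference, the order that avoids this situation, sketched as an fdbcli session with the IPs from this thread standing in as placeholders: move the coordinator role to the new machines first, and only then exclude and remove the old ones.

```
fdb> coordinators 10.128.2.67:4500 10.128.3.55:4500 10.128.8.4:4500
fdb> exclude 10.128.0.167:4500 10.128.3.122:4500 10.128.8.78:4500
```

coordinators auto also works and lets FDB pick a suitable new set itself; either way, wait for the exclude to finish draining data before shutting the old machines down.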

Shajt. Well, gotta rebuild everything from scratch. Fun times.

It seems that you’re neither the first nor the last person who made this mistake. This tells me that we need to do a better job of protecting users from running into this issue. I created https://github.com/apple/foundationdb/issues/2022 to track this.

I do appreciate that.

I think the main disconnect is that the way Alex explained it is not how many of us think about it. To me, a coordinator is an entry point to the cluster, more like a proxy. Obviously I was completely wrong about that, but that’s how I thought of it, so when I mistakenly dropped the servers I was hoping I could somehow still connect to the data. Especially since we have triple replication and I dropped only 3 servers out of 17.

I think the most frustrating part is that the data is physically there on the disks, but I can’t do anything with it. It would be great if I could somehow “copy & paste” it into a new cluster or something.

I must say FDB is the most resilient storage system I have ever worked with. As such, it trained me to be far less careful around it than around other technologies. So your success also becomes a weakness, because it’s hard to protect the cluster from user stupidity.

Maybe you should consider adding a “roundabout”. In road systems, drivers perceive roundabouts as fairly dangerous; because of that perception they pay more attention, which causes fewer accidents and actually makes roundabouts safer.

Hey, I hit a similar situation. I have a cluster running on Kubernetes with 3 fdbservers and 1 coordinator (it’s a dev environment), and we accidentally did something that made the coordinator start up without its data files. We shut it down, put the data files back in the right place, and turned it back on, but it seems we totally lost the coordinator state. We’re getting the same “The coordinator(s) have no record of this database” error. status details does list our one coordinator as reachable.

I want to confirm that this scenario also counts as “losing the coordinator”. Is this recoverable at all? I don’t really understand why it wouldn’t be, since the coordinator files weren’t written to during the bad state. Wouldn’t this be similar to shutting down the system, copying all the files to a different machine, and turning the whole thing back on? (That’s essentially how our k8s setup works with PVs: the FDB pods get recreated but attach to the same volumes, and we have a wrapper service that watches for IP changes of the coordinator and updates the cluster files.)
