Data integrity resiliency on single node deployments

(gaurav) #1


We have a non-standard deployment model in which we want to use FDB in place of Postgres, but without any data replication (i.e. on a single node).

I wanted to check how reliable FDB itself is in such a setting with respect to maintaining the integrity of data on disk.

Assuming that the underlying storage itself is reliable and does not corrupt bits once fsync’d, is it reasonable to assume that FDB will be resilient to data corruption (at a level comparable to Postgres)? Abrupt machine reboots, one source of data corruption that I can think of, may not be uncommon in our environment.

It would be great to know some details of the data-write path and the checks or methods implemented to survive events like abrupt process kills and machine reboots.

Also, if the storage files do get corrupted for some reason (I do not know what kinds of corruption are possible with FDB storage files), are there any troubleshooting steps or tools to salvage data (to the extent possible) and get FDB back to a healthy state?


(Ben Collins) #2

Process (or machine) kills are handled in all cases in FoundationDB to avoid data loss or corruption. You can take a look at some of the documentation about testing to get a sense of how we make sure this is the case. To your question: this safety is the same for a single machine as for a cluster of machines. There should be no difference between the crash safety here and that of any non-distributed database, such as Postgres.

There are currently no tools for recovering data that was corrupted by the disk hardware. One thing you could consider is running a continuous backup of this DB: if there is a hardware fault, you can restore to a consistent copy of your data. If the backup is run in streaming mode, the delta between the live data and the data stored in the backup can be kept to a minimum.
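
As a rough illustration of the continuous backup suggestion, here is a minimal Python sketch that assembles and launches an `fdbbackup start` command with the `-z` (continuous) flag. It assumes that `fdbbackup` is on the PATH and that a `backup_agent` process is already running against the cluster. The helper names, the backup URL and the tag are hypothetical placeholders, not anything specific to your setup.

```python
import subprocess
from typing import List, Optional


def build_backup_start_cmd(
    backup_url: str,
    cluster_file: Optional[str] = None,
    tag: Optional[str] = None,
) -> List[str]:
    """Build an `fdbbackup start` command line for a continuous backup.

    backup_url is a placeholder such as "file:///mnt/backup_disk/fdb";
    it should point at storage other than the disk holding the live data.
    """
    # -d: backup destination URL, -z: keep backing up continuously
    cmd = ["fdbbackup", "start", "-d", backup_url, "-z"]
    if cluster_file is not None:
        # -C: cluster file to use instead of the default one
        cmd += ["-C", cluster_file]
    if tag is not None:
        # -t: tag that names this backup
        cmd += ["-t", tag]
    return cmd


def start_continuous_backup(
    backup_url: str,
    cluster_file: Optional[str] = None,
    tag: Optional[str] = None,
) -> None:
    """Run the command; raises CalledProcessError if fdbbackup fails."""
    subprocess.run(build_backup_start_cmd(backup_url, cluster_file, tag), check=True)
```

Calling `start_continuous_backup("file:///mnt/backup_disk/fdb")` would then submit the backup, and `fdbbackup status` can be used to check on it afterwards. Keeping the command construction separate from the `subprocess` call makes it easy to inspect the exact command line before running it.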

(gaurav) #3

Thank you Ben! This is very helpful and it gives a lot of confidence. I will keep the community updated should we notice any errors in this mode of deployment.