Hi, while running an FDB cluster on 5 machines (10 processes: 5 SSes on the 4500-id processes, 2 TLogs, 2 proxies, and 3 coordinators on the 4501-id processes), I ran into an unexpected state. It would be helpful if someone has any clues about it:
I am repeatedly bulk loading data into the cluster and then doing a range clear to empty the database, so that I can rerun the bulk load step (I am trying to fine-tune the bulk load process by running these experiments with different settings).
I noticed that if I exclude an SS process, its disk space is reclaimed immediately; otherwise, it takes much longer for the space to be returned to the OS.
To speed up the bulk load, I had also switched to ssd/single replication mode (and had planned to switch to double replication after the bulk load finished).
So, at some point between one bulk load and the next, I was trying to reclaim disk space by excluding one SS at a time and then including it back, before moving on to the next one.
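The per-SS reclaim cycle above was done with plain fdbcli commands, roughly as below (the address is just one example taken from the status output further down; `exclude` waits until the process is safe to remove before returning, so `include` was issued only after the exclusion finished):

```
fdb> exclude 10.45.0.25:4500
fdb> include 10.45.0.25:4500
```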
This worked fine for the first two SSes, but when I did it for a third SS, the cluster entered a weird state where it was "Unable to read database configuration". All processes were running fine, though.
I have attached the status details output at this stage below.
To get out of this state, I tried various things, like including the excluded process back, but nothing helped. Finally, I tried setting up an entirely new cluster with `configure new ssd single`, but the CLI rejected it, saying that the database already exists. So I tried another variation without `new`, i.e. `configure ssd single`, and that brought the cluster out of the bad state; it was back to normal.
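For completeness, this is the recovery sequence as entered in fdbcli (a sketch of the two commands described above; the first was rejected because the database already existed, the second is what cleared the state):

```
fdb> configure new ssd single
fdb> configure ssd single
```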
Given that there was no data I cared about in the cluster at this point, that was okay. But I do not understand what error I made here, or how I should have debugged this better. Also, is there any way to recreate the database configuration from the data files available on each SS in the cluster? This would be a useful tool for recovering from situations where the system configuration has somehow been corrupted; the "static" configuration items, like storage engine, replication mode, and coordinators, could be supplied by the user where needed.
Could it have happened that the database configuration was stored on the excluded SS, and that it was not copied over to another SS before the fdbcli command excluded it (recall that the replication mode was ssd/single at this point)?
I tried replicating this situation, but could not get it to happen again. I will retry after the next bulk load to see how it goes.
```
fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Unable to read database configuration.

Configuration:
  Redundancy mode        - unknown
  Storage engine         - unknown
  Coordinators           - unknown

Cluster:
  FoundationDB processes - 10
  Machines               - 5
  Memory availability    - 29.8 GB per process on machine with least available
WARNING: Long delay (Ctrl-C to interrupt)
  Retransmissions rate   - 1 Hz
  Server time            - 02/02/19 08:50:37

Data:
  Replication health     - unknown
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - unknown

Operating space:
  Unable to retrieve operating space status

Workload:
  Read rate              - unknown
  Write rate             - unknown
  Transactions started   - unknown
  Transactions committed - unknown
  Conflict rate          - unknown

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.45.0.16:4500   ( 1% cpu; 4% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 29.8 GB RAM )
  10.45.0.16:4501   ( 1% cpu; 4% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 29.8 GB RAM )
  10.45.0.25:4500   ( 1% cpu; 6% machine; 0.000 Gbps; 1% disk IO; 4.6 GB / 30.6 GB RAM )
  10.45.0.25:4501   ( 2% cpu; 6% machine; 0.000 Gbps; 1% disk IO; 0.3 GB / 30.6 GB RAM )
  10.45.0.80:4500   ( 2% cpu; 4% machine; 0.000 Gbps; 4% disk IO; 4.1 GB / 30.8 GB RAM )
  10.45.0.80:4501   ( 1% cpu; 4% machine; 0.000 Gbps; 4% disk IO; 0.3 GB / 30.8 GB RAM )
  10.45.0.82:4500   ( 1% cpu; 4% machine; 0.000 Gbps; 0% disk IO; 4.3 GB / 30.7 GB RAM )
  10.45.0.82:4501   ( 2% cpu; 4% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 30.7 GB RAM )
  10.45.0.217:4500  ( 2% cpu; 3% machine; 0.000 Gbps; 6% disk IO; 4.0 GB / 30.2 GB RAM )
  10.45.0.217:4501  ( 1% cpu; 3% machine; 0.000 Gbps; 6% disk IO; 0.2 GB / 30.2 GB RAM )

Coordination servers:
  10.45.0.16:4501  (reachable)
  10.45.0.25:4501  (reachable)
  10.45.0.80:4501  (reachable)

Client time: 02/02/19 08:50:32
```