Hi, while running an FDB cluster on 5 machines (10 processes, two per machine: 5 storage servers (SS) on the port-4500 processes, and 2 TLogs, 2 proxies, and 3 coordinators among the port-4501 processes), I ran into an unexpected state. It would be helpful if someone has any clues about this:
I am repeatedly bulk loading data into the cluster and then doing a range clear to empty the database, so that I can rerun the bulk load step (I am trying to fine-tune the bulk load process by running these experiments with different settings).
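For context, the clear step between runs is just a full-range clear from fdbcli (a minimal sketch; the range shown assumes the bulk load writes across the whole normal keyspace, and `clearrange` requires write mode to be enabled first):

```
fdb> writemode on
fdb> clearrange "" \xff
```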
I noticed that if I exclude an SS process, the disk space is reclaimed immediately; otherwise, it takes much longer for the disk space to be given back to the OS.
In order to speed up the bulk load, I had also switched to ssd/single replication mode (and had planned to switch to double replication after the bulk load).
So, at some point between one bulk load and the next, I was trying to reclaim disk space by excluding one SS at a time and then including it back, before moving on to the next one.
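Concretely, the per-process cycle looked roughly like this (the address is an example from my cluster; exclude waits until the data has been moved off the process before returning, which is what forces the space to be reclaimed):

```
fdb> exclude 10.45.0.25:4500
fdb> include 10.45.0.25:4500
```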
This cycle worked fine for the first two SSes, but when I did it for a third SS, the cluster entered a weird state where it was "Unable to read database configuration". All processes were still running fine, though.
I have attached the status details output at this stage below.
To get out of this state, I tried various things, like including the excluded process back, but nothing helped. Finally, I tried running the command to set up an entirely new cluster, configure new ssd single, but the CLI rejected it, saying that the database already exists. So, I tried another variation without new, configure ssd single, and that brought the cluster out of this situation and back to normal.
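For clarity, this was the sequence in fdbcli. My understanding is that the first form only succeeds against a database that has not been created yet, while the second changes the configuration of an existing one:

```
fdb> configure new ssd single
fdb> configure ssd single
```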
Given that there was no data I cared about in the cluster at this point, that was okay. But I do not understand what mistake I made here, and how I should have debugged this better. Is there any way to recreate the database configuration using the data files that are available on each SS in the cluster? (This would be a useful tool for recovering from situations where the system configuration has somehow been corrupted; the "static" configuration, such as engine, replication mode, and coordinators, could be provided by the user where needed.)
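For anyone who wants to poke at this: my understanding, and this is an assumption on my part that I have not verified against the source, is that this configuration is stored in the system keyspace under \xff/conf/, so on a healthy cluster it should be inspectable from fdbcli:

```
fdb> option on ACCESS_SYSTEM_KEYS
fdb> getrange \xff/conf/ \xff/conf0
```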
Could it have happened that the database configuration was stored on the excluded SS, and that it was not copied over to another SS before the fdbcli command excluded it (recall that the replication mode was ssd/single at this point)?
I tried to reproduce this situation, but could not get it to happen again. I will retry after the next bulk load to see how it goes.
```
fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Unable to read database configuration.

Configuration:
  Redundancy mode        - unknown
  Storage engine         - unknown
  Coordinators           - unknown

Cluster:
  FoundationDB processes - 10
  Machines               - 5
  Memory availability    - 29.8 GB per process on machine with least available

WARNING: Long delay (Ctrl-C to interrupt)

  Retransmissions rate   - 1 Hz
  Server time            - 02/02/19 08:50:37

Data:
  Replication health     - unknown
  Moving data            - unknown
  Sum of key-value sizes - unknown
  Disk space used        - unknown

Operating space:
  Unable to retrieve operating space status

Workload:
  Read rate              - unknown
  Write rate             - unknown
  Transactions started   - unknown
  Transactions committed - unknown
  Conflict rate          - unknown

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Process performance details:
  10.45.0.16:4500  ( 1% cpu; 4% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 29.8 GB RAM )
  10.45.0.16:4501  ( 1% cpu; 4% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 29.8 GB RAM )
  10.45.0.25:4500  ( 1% cpu; 6% machine; 0.000 Gbps; 1% disk IO; 4.6 GB / 30.6 GB RAM )
  10.45.0.25:4501  ( 2% cpu; 6% machine; 0.000 Gbps; 1% disk IO; 0.3 GB / 30.6 GB RAM )
  10.45.0.80:4500  ( 2% cpu; 4% machine; 0.000 Gbps; 4% disk IO; 4.1 GB / 30.8 GB RAM )
  10.45.0.80:4501  ( 1% cpu; 4% machine; 0.000 Gbps; 4% disk IO; 0.3 GB / 30.8 GB RAM )
  10.45.0.82:4500  ( 1% cpu; 4% machine; 0.000 Gbps; 0% disk IO; 4.3 GB / 30.7 GB RAM )
  10.45.0.82:4501  ( 2% cpu; 4% machine; 0.000 Gbps; 0% disk IO; 0.3 GB / 30.7 GB RAM )
  10.45.0.217:4500 ( 2% cpu; 3% machine; 0.000 Gbps; 6% disk IO; 4.0 GB / 30.2 GB RAM )
  10.45.0.217:4501 ( 1% cpu; 3% machine; 0.000 Gbps; 6% disk IO; 0.2 GB / 30.2 GB RAM )

Coordination servers:
  10.45.0.16:4501  (reachable)
  10.45.0.25:4501  (reachable)
  10.45.0.80:4501  (reachable)

Client time: 02/02/19 08:50:32
```