FoundationDB

FoundationDB @ Kubernetes having "issues"


(Chr1st0ph) #1

Hi,

I am developing a FoundationDB setup for Kubernetes ( https://github.com/Chr1st0ph/foundationdb-kubernetes ).
I run a StatefulSet with 2 pods, each running two processes on ports 4500 and 4501 (see https://github.com/Chr1st0ph/foundationdb-docker/blob/master/foundationdb.conf ).

I ran into the following issue:

I connect to one pod using fdbcli ( kubectl exec -it foundationdb-0 fdbcli )
and get this welcome message:

Using cluster file `/etc/foundationdb/fdb.cluster'.
The database is available, but has issues (type 'status' for more information).

fdb> writemode on
fdb> set test testvalue
Committed (1024780968)
fdb> get test

WARNING: Long delay (Ctrl-C to interrupt)

fdb> status

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
Redundancy mode - single
Storage engine - memory
Coordinators - 1

Cluster:
FoundationDB processes - 3
Machines - 2
Memory availability - 4.8 GB per process on machine with least available
Fault Tolerance - 0 machines
Server time - 05/09/18 10:14:09

Data:
Replication health - unknown
Moving data - unknown
Sum of key-value sizes - unknown
Disk space used - 1 MB

Operating space:
Log server - 15.2 GB free on most full server

Workload:
Read rate - 0 Hz
Write rate - 0 Hz
Transactions started - 4 Hz
Transactions committed - 0 Hz
Conflict rate - 0 Hz

Backup and DR:
Running backups - 0
Running DRs - 0

Client time: 05/09/18 10:14:09

fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
Redundancy mode - single
Storage engine - memory
Coordinators - 1

Cluster:
FoundationDB processes - 3
Machines - 2
Memory availability - 4.8 GB per process on machine with least available
Fault Tolerance - 0 machines
Server time - 05/09/18 10:01:55

Data:
Replication health - unknown
Moving data - unknown
Sum of key-value sizes - unknown
Disk space used - 1 MB

Operating space:
Log server - 15.3 GB free on most full server

Workload:
Read rate - 0 Hz
Write rate - 0 Hz
Transactions started - 2 Hz
Transactions committed - 0 Hz
Conflict rate - 0 Hz

Backup and DR:
Running backups - 0
Running DRs - 0

Process performance details:
172.17.0.4:4501 ( 1% cpu; 2% machine; 0.000 Gbps; 0% disk IO; 0.4 GB / 9.3 GB RAM )
172.17.0.5:4500 ( 1% cpu; 2% machine; 0.000 Gbps; 0% disk IO; 0.4 GB / 4.8 GB RAM )
172.17.0.5:4501 ( 2% cpu; 2% machine; 0.000 Gbps; 0% disk IO; 0.4 GB / 4.8 GB RAM )

Coordination servers:
172.17.0.4:4500 (reachable)

Client time: 05/09/18 10:01:54

kubectl exec -it foundationdb-0 -- cat /etc/foundationdb/fdb.cluster
Eu6Qm2zv:PV3XWcSL@172.17.0.4:4500

kubectl exec -it foundationdb-1 -- cat /etc/foundationdb/fdb.cluster
Eu6Qm2zv:PV3XWcSL@172.17.0.4:4500

I do not understand what causes the issue. The coordinator exists and is reachable.
I also do not understand why 172.17.0.4:4500 does not appear under “Process performance details:”
(the same symptom as in the thread “How are ‘contributing_workers’ computed?”). That is why I currently start two processes per pod.

Thanks in advance for your time and effort.

Christoph


(A.J. Beamon) #2

A process can be a coordinator for a cluster without otherwise being a member of that cluster, which is why it can be a reachable coordinator but not show up in the process list.
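To make that concrete, here is a minimal Python sketch that compares the coordinator addresses in a cluster file line with the addresses listed under “Process performance details:”. It is not part of FoundationDB's tooling; the function names are made up for illustration, and it assumes the usual `description:ID@ip:port,ip:port` layout of the cluster file. The input values are the ones posted above.

```python
def parse_cluster_file(line):
    """Split 'description:id@ip:port,ip:port' into (description, id, [addresses])."""
    prefix, _, addresses = line.strip().partition("@")
    description, _, cluster_id = prefix.partition(":")
    return description, cluster_id, [a for a in addresses.split(",") if a]


def coordinators_not_in_process_list(cluster_line, process_addresses):
    """Return coordinator addresses that do not show up as cluster processes."""
    _, _, coordinators = parse_cluster_file(cluster_line)
    listed = set(process_addresses)
    return [c for c in coordinators if c not in listed]


# Values copied from the first post in this thread.
cluster_line = "Eu6Qm2zv:PV3XWcSL@172.17.0.4:4500"
processes = ["172.17.0.4:4501", "172.17.0.5:4500", "172.17.0.5:4501"]

missing = coordinators_not_in_process_list(cluster_line, processes)
print(missing)  # ['172.17.0.4:4500']
```

A non-empty result means a process is serving as a coordinator without having joined the cluster as a worker, which matches the status output above.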

The simplest explanation for a process not being in the cluster is that it’s using a different cluster file. Your cluster file looks correct, but if it was changed externally while the process was running, the process would need to be restarted to pick up the change. If restarting your process doesn’t fix it, then this is probably not the issue.

Next, you could try the same steps as in the other thread and see if that resolves the issue. You can move the coordinator to a new process, and if it’s still broken you can try deleting the data files for that process (assuming you don’t care about them). You may also need to restart the process after you’ve done this. I don’t have a reproduction or logs for this scenario at the moment, so I don’t have a good explanation for this failure case.
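As a rough sketch of the first step, the snippet below builds the command that would move the coordinator, using fdbcli's `coordinators` command through `kubectl exec`. The helper names are hypothetical, the pod name is taken from the first post, and 172.17.0.5:4500 is only an example target picked from the process list above. It prints the command by default and runs nothing unless `dry_run=False` is passed. Deleting data files and restarting the process are deliberately left out.

```python
import subprocess


def move_coordinator_cmd(pod, new_coordinator):
    """Build the kubectl/fdbcli argv that changes the coordinator set."""
    return [
        "kubectl", "exec", pod, "--",
        "fdbcli", "--exec", f"coordinators {new_coordinator}",
    ]


def run(cmd, dry_run=True):
    """Print the command; only execute it when dry_run is False."""
    if dry_run:
        print(" ".join(cmd))
        return None
    return subprocess.run(cmd, check=True)


# Example target chosen from the process list in the first post (an assumption).
cmd = move_coordinator_cmd("foundationdb-0", "172.17.0.5:4500")
run(cmd)  # dry run: prints the command only
```

Keeping the dry run as the default makes it easy to review the exact command before it touches the cluster.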


(Chr1st0ph) #3

Thanks for your reply.

I replaced “127.0.0.1” with the external IP in the fdb.cluster file before I started the fdbmonitor process. This caused the issues described above.

Additionally deleting the data folder “/var/lib/foundationdb/data/4500” (also before starting fdbmonitor) solved the issues.
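The two steps together look roughly like the sketch below. This is only an illustration of what I described, not the actual script from my repository: the function name `prepare_pod` is made up, and it assumes it runs before fdbmonitor starts. Note that it deletes the data folder, so it is only suitable when the data can be thrown away.

```python
import shutil
from pathlib import Path


def prepare_pod(cluster_file, data_dir, pod_ip):
    """Point the cluster file at pod_ip and drop stale data for that process.

    Assumes fdbmonitor has not been started yet. Deleting data_dir discards
    any data stored there.
    """
    path = Path(cluster_file)
    updated = path.read_text().replace("127.0.0.1", pod_ip)
    path.write_text(updated)

    # Remove data written while the process was still bound to 127.0.0.1.
    shutil.rmtree(data_dir, ignore_errors=True)
    return updated


# Example call with the paths from this thread (not executed here):
# prepare_pod("/etc/foundationdb/fdb.cluster",
#             "/var/lib/foundationdb/data/4500",
#             "172.17.0.4")
```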

Running /usr/lib/foundationdb/make_public.py as documented at https://apple.github.io/foundationdb/building-cluster.html#make-foundationdb-externally-accessible also worked, but it does not fit my Kubernetes scenario because I need the content of fdb.cluster in other pods as well.


(A.J. Beamon) #4

See my reply to the other post for a possible explanation for this behavior.