Weird TLSPolicyFailure errors with self-signed certificates

We use a self-signed CA to secure our FDB traffic in staging (but not in production), and we are seeing these weird error messages:

DateTime      2025-08-15T23:04:39Z
ID            0000000000000000
Machine       10.241.1.88:4502
Reason        preverification failed
Roles         SS
Severity      20
Type          TLSPolicyFailure
VerifyError   self signed certificate in certificate chain
__InvalidSuppression__

But from what I can tell, everything seems to be working. How can this happen?

Did you set tls-ca-file with the correct certificates?
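
(For reference, the CA bundle is normally configured on every process through the standard FDB TLS options, e.g. in foundationdb.conf — the paths below are just placeholders:

[fdbserver]
tls_certificate_file = /var/fdb/tls/cert.pem
tls_key_file         = /var/fdb/tls/key.pem
tls_ca_file          = /var/fdb/tls/ca.pem
tls_verify_peers     = Check.Valid=1

The same settings can also be passed as fdbserver command-line options or through the FDB_TLS_* environment variables.)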

I think so, but I'm wondering if something is off somewhere.

Are you able to share how you created those certificates? It's possible that some flags are missing in the certificate (this is what we use for generating the certificates in Go: fdb-kubernetes-operator/e2e/fixtures/certificate_generator.go at main · FoundationDB/fdb-kubernetes-operator · GitHub).

I used Terraform to create the certificates:

resource "tls_private_key" "root_key" {
  count     = var.root_cluster ? 1 : 0
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "tls_private_key" "server_key" {
  count     = var.root_cluster ? 1 : 0
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "tls_private_key" "client_key" {
  count     = var.root_cluster ? 1 : 0
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "tls_self_signed_cert" "root_cert" {
  count             = var.root_cluster ? 1 : 0
  private_key_pem   = tls_private_key.root_key[0].private_key_pem
  is_ca_certificate = true

  subject {
    common_name  = "FDB ${var.phase} root CA"
    organization = "myorg"
  }

  validity_period_hours = 24 * 365 * 10
  allowed_uses          = ["cert_signing", "key_encipherment", "digital_signature"]
}

resource "tls_cert_request" "server_cert_req" {
  count           = var.root_cluster ? 1 : 0
  private_key_pem = tls_private_key.server_key[0].private_key_pem

  subject {
    common_name  = "FDB ${var.phase} server"
    organization = "myorg"
  }

  dns_names = local.dns_names
}

resource "tls_cert_request" "client_cert_req" {
  count           = var.root_cluster ? 1 : 0
  private_key_pem = tls_private_key.client_key[0].private_key_pem

  subject {
    common_name  = "FDB ${var.phase} client"
    organization = "myorg"
  }
}
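
(The server and client CSRs are then signed by the root CA with tls_locally_signed_cert resources; the sketch below shows roughly what that looks like — the validity and allowed_uses values here are illustrative rather than the exact ones from our config.)

resource "tls_locally_signed_cert" "server_cert" {
  count              = var.root_cluster ? 1 : 0
  cert_request_pem   = tls_cert_request.server_cert_req[0].cert_request_pem
  ca_private_key_pem = tls_private_key.root_key[0].private_key_pem
  ca_cert_pem        = tls_self_signed_cert.root_cert[0].cert_pem

  # Illustrative key usages: FDB peers verify each other mutually,
  # so the leaf certs are typically usable for both server and client auth.
  validity_period_hours = 24 * 365
  allowed_uses          = ["key_encipherment", "digital_signature", "server_auth", "client_auth"]
}

resource "tls_locally_signed_cert" "client_cert" {
  count              = var.root_cluster ? 1 : 0
  cert_request_pem   = tls_cert_request.client_cert_req[0].cert_request_pem
  ca_private_key_pem = tls_private_key.root_key[0].private_key_pem
  ca_cert_pem        = tls_self_signed_cert.root_cert[0].cert_pem

  validity_period_hours = 24 * 365
  allowed_uses          = ["key_encipherment", "digital_signature", "server_auth", "client_auth"]
}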

So it turns out that the reason for the issue is fairly simple:
In the same Kubernetes cluster I have two FDB clusters, each with its own CA. Sometimes a pod in cluster fdb-1 tries to talk to a pod in cluster fdb-2, because that pod's IP used to belong to a pod in the first cluster.

This happens because we use DNS rather than stable cluster IPs (due to constraints from our cloud provider), so when a pod restarts its IP will most likely change, and processes in the other cluster will keep trying to connect to the old IP.
Since our nodes get updated regularly for security patches, pods are restarted often, and we run into this situation quite a lot.


That makes sense, thanks for sharing your finding. The stale-connection problem is causing quite a few issues; I know some people are working on it, but it seems a bit more complex.