Weird TLSPolicyFailure errors with self-signed certificates

We use a self-signed CA to secure our FDB traffic in staging (but not in production), and we are seeing these weird error messages:

DateTime      2025-08-15T23:04:39Z
ID            0000000000000000
Machine       10.241.1.88:4502
Reason        preverification failed
Roles         SS
Severity      20
Type          TLSPolicyFailure
VerifyError   self signed certificate in certificate chain
__InvalidSuppression__

But from what I can tell, everything seems to be working. How can this happen?

Did you set tls-ca-file with the correct certificates?
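
(For reference, the CA bundle is normally configured on every process through the standard FDB TLS options, e.g. in foundationdb.conf — the paths below are just placeholders:

[fdbserver]
tls_certificate_file = /var/fdb/tls/cert.pem
tls_key_file         = /var/fdb/tls/key.pem
tls_ca_file          = /var/fdb/tls/ca.pem
tls_verify_peers     = Check.Valid=1

The same settings can also be passed as fdbserver command-line options or through the FDB_TLS_* environment variables.)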

I think so, but I'm wondering if something is off somewhere.

Are you able to share how you created those certificates? It's possible that some flags are missing in the certificate (this is what we use for generating the certificates in Go: fdb-kubernetes-operator/e2e/fixtures/certificate_generator.go at main · FoundationDB/fdb-kubernetes-operator · GitHub).

I used Terraform to create the certificates:

resource "tls_private_key" "root_key" {
  count     = var.root_cluster ? 1 : 0
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "tls_private_key" "server_key" {
  count     = var.root_cluster ? 1 : 0
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "tls_private_key" "client_key" {
  count     = var.root_cluster ? 1 : 0
  algorithm = "RSA"
  rsa_bits  = 2048
}

resource "tls_self_signed_cert" "root_cert" {
  count             = var.root_cluster ? 1 : 0
  private_key_pem   = tls_private_key.root_key[0].private_key_pem
  is_ca_certificate = true

  subject {
    common_name  = "FDB ${var.phase} root CA"
    organization = "myorg"
  }

  validity_period_hours = 24 * 365 * 10
  allowed_uses          = ["cert_signing", "key_encipherment", "digital_signature"]
}

resource "tls_cert_request" "server_cert_req" {
  count           = var.root_cluster ? 1 : 0
  private_key_pem = tls_private_key.server_key[0].private_key_pem

  subject {
    common_name  = "FDB ${var.phase} server"
    organization = "myorg"
  }

  dns_names = local.dns_names
}

resource "tls_cert_request" "client_cert_req" {
  count           = var.root_cluster ? 1 : 0
  private_key_pem = tls_private_key.client_key[0].private_key_pem

  subject {
    common_name  = "FDB ${var.phase} client"
    organization = "myorg"
  }
}
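
(The server and client CSRs are then signed by the root CA with tls_locally_signed_cert resources; the sketch below shows roughly what that looks like — the validity and allowed_uses values here are illustrative rather than the exact ones from our config.)

resource "tls_locally_signed_cert" "server_cert" {
  count              = var.root_cluster ? 1 : 0
  cert_request_pem   = tls_cert_request.server_cert_req[0].cert_request_pem
  ca_private_key_pem = tls_private_key.root_key[0].private_key_pem
  ca_cert_pem        = tls_self_signed_cert.root_cert[0].cert_pem

  # Illustrative key usages: FDB peers verify each other mutually,
  # so the leaf certs are typically usable for both server and client auth.
  validity_period_hours = 24 * 365
  allowed_uses          = ["key_encipherment", "digital_signature", "server_auth", "client_auth"]
}

resource "tls_locally_signed_cert" "client_cert" {
  count              = var.root_cluster ? 1 : 0
  cert_request_pem   = tls_cert_request.client_cert_req[0].cert_request_pem
  ca_private_key_pem = tls_private_key.root_key[0].private_key_pem
  ca_cert_pem        = tls_self_signed_cert.root_cert[0].cert_pem

  validity_period_hours = 24 * 365
  allowed_uses          = ["key_encipherment", "digital_signature", "server_auth", "client_auth"]
}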

So it turns out that the reason for the issue is fairly simple:
In the same Kubernetes cluster I have two FDB clusters, each with its own CA. Sometimes a pod in cluster fdb-1 tries to talk to a pod in cluster fdb-2, because that pod's IP used to belong to a pod in the first cluster.

This happens because we use DNS rather than stable cluster IPs (due to constraints from our cloud provider), so when a pod restarts its IP will most likely change, and processes in the other cluster will keep trying to connect to the old IP.
Since our nodes get updated regularly for security patches, pods are restarted often, and we run into this situation quite a lot.


That makes sense, thanks for sharing your finding. The stale-connection problem is causing quite a few issues; I know some people are working on it, but it seems a bit more complex.