vrusinov
(Vladimir Rusinov)
January 31, 2023, 1:55pm
1
Hey, I was playing with the FDB operator and found a minor but confusing issue. I can’t report it on GitHub, so I’m reporting it here:
I was following the getting started guide and got stuck on the FDB operator crashing with the following confusing error:
{"level":"info","ts":1675163849.8685071,"logger":"fdbclient","msg":"Fetch values from FDB","namespace":"fdb","cluster":"test-cluster","key":"\ufffd\ufffd/status/json"}
SIGILL: illegal instruction
PC=0x7fdabe0a74ff m=4 sigcode=2
signal arrived during cgo execution
instruction bytes: 0xc5 0xf9 0xef 0xc0 0x48 0x8d 0x15 0xb6 0xb 0x8 0x1 0x48 0x8d 0x35 0x8f 0xc0
The instruction bytes disassemble to the following, but it’s probably irrelevant:
0: c5 f9 ef c0 vpxor xmm0,xmm0,xmm0
4: 48 8d 15 b6 0b 08 01 lea rdx,[rip+0x1080bb6] # 0x1080bc1
b: 48 rex.W
c: 8d .byte 0x8d
d: 35 .byte 0x35
e: 8f c0 pop rax
This was on a modern-ish low-power Intel CPU (I’m fairly sure it has SSE).
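One way to sanity-check the faulting bytes from the crash log: in 64-bit x86 code, a leading 0xC5 or 0xC4 byte is a VEX prefix, which is the encoding used by AVX-family instructions such as the vpxor above. The sketch below only parses the "instruction bytes:" line from the Go runtime dump and looks at the first byte. The function names are made up for illustration, and a VEX prefix only suggests (it does not prove) that a missing CPU feature caused the SIGILL.

```python
def parse_instruction_bytes(line: str) -> list[int]:
    """Parse a Go runtime 'instruction bytes: 0xc5 0xf9 ...' line into integers."""
    _, _, tail = line.partition("instruction bytes:")
    # int(tok, 16) accepts the '0x' prefix used in the crash dump.
    return [int(tok, 16) for tok in tail.split()]


def starts_with_vex_prefix(code: list[int]) -> bool:
    """True if the first byte is a 2-byte (0xC5) or 3-byte (0xC4) VEX prefix.

    Assumption: the bytes come from 64-bit code, where 0xC4/0xC5 are always
    VEX prefixes rather than the legacy LES/LDS opcodes.
    """
    return bool(code) and code[0] in (0xC4, 0xC5)


crash_line = (
    "instruction bytes: 0xc5 0xf9 0xef 0xc0 0x48 0x8d 0x15 0xb6 "
    "0xb 0x8 0x1 0x48 0x8d 0x35 0x8f 0xc0"
)
code = parse_instruction_bytes(crash_line)
print("VEX-encoded first instruction:", starts_with_vex_prefix(code))
```

If this reports a VEX prefix, checking the host for AVX support (not just SSE) is a reasonable next step.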
It turns out that, like many things, cgo doesn’t handle running out of memory well: raising the memory limit even just to 280Mi made the problem go away. I don’t know how to approach fixing the root cause, but I’d suggest increasing the memory limit in the sample config (deployment.yaml on the main branch of FoundationDB/fdb-kubernetes-operator on GitHub) so other poor saps don’t spend time debugging this.
vrusinov
(Vladimir Rusinov)
January 31, 2023, 1:58pm
2
Repro:
deployment.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fdb-kubernetes-operator-controller-manager
  namespace: fdb
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  creationTimestamp: null
  name: fdb-kubernetes-operator-manager-role
  namespace: fdb
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - events
  - persistentvolumeclaims
  - pods
  - secrets
  - services
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apps
  resources:
  - deployments
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apps.foundationdb.org
  resources:
  - foundationdbbackups
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apps.foundationdb.org
  resources:
  - foundationdbbackups/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - apps.foundationdb.org
  resources:
  - foundationdbclusters
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apps.foundationdb.org
  resources:
  - foundationdbclusters/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - apps.foundationdb.org
  resources:
  - foundationdbrestores
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - apps.foundationdb.org
  resources:
  - foundationdbrestores/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: null
  name: fdb-kubernetes-operator-manager-rolebinding
  namespace: fdb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: fdb-kubernetes-operator-manager-role
subjects:
- kind: ServiceAccount
  name: fdb-kubernetes-operator-controller-manager
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: fdb-kubernetes-operator-controller-manager
    control-plane: controller-manager
  name: fdb-kubernetes-operator-controller-manager
  namespace: fdb
  annotations:
    kube-score/ignore: "container-security-context,pod-probes,pod-networkpolicy,container-security-context-readonlyrootfilesystem,container-security-context-user-group-id,container-resources"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fdb-kubernetes-operator-controller-manager
  template:
    metadata:
      labels:
        app: fdb-kubernetes-operator-controller-manager
        control-plane: controller-manager
    spec:
      containers:
      - command:
        - /manager
        env:
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        image: foundationdb/fdb-kubernetes-operator:v1.12.0
        name: manager
        ports:
        - containerPort: 8080
          name: metrics
        resources:
          requests:
            cpu: 500m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 256Mi
        securityContext:
          allowPrivilegeEscalation: false
          privileged: false
          readOnlyRootFilesystem: true
        volumeMounts:
        - mountPath: /tmp
          name: tmp
        - mountPath: /var/log/fdb
          name: logs
        - mountPath: /usr/bin/fdb
          name: fdb-binaries
      initContainers:
      - args:
        - --copy-library
        - "6.2"
        - --copy-binary
        - fdbcli
        - --copy-binary
        - fdbbackup
        - --copy-binary
        - fdbrestore
        - --output-dir
        - /var/output-files/6.2.30
        - --init-mode
        image: foundationdb/foundationdb-kubernetes-sidecar:6.2.30-1
        name: foundationdb-kubernetes-init-6-2
        volumeMounts:
        - mountPath: /var/output-files
          name: fdb-binaries
      - args:
        - --copy-library
        - "6.3"
        - --copy-binary
        - fdbcli
        - --copy-binary
        - fdbbackup
        - --copy-binary
        - fdbrestore
        - --output-dir
        - /var/output-files/6.3.24
        - --init-mode
        image: foundationdb/foundationdb-kubernetes-sidecar:6.3.24-1
        name: foundationdb-kubernetes-init-6-3
        volumeMounts:
        - mountPath: /var/output-files
          name: fdb-binaries
      - args:
        - --copy-library
        - "7.1"
        - --copy-binary
        - fdbcli
        - --copy-binary
        - fdbbackup
        - --copy-binary
        - fdbrestore
        - --output-dir
        - /var/output-files/7.1.15
        - --init-mode
        image: foundationdb/foundationdb-kubernetes-sidecar:7.1.15-1
        name: foundationdb-kubernetes-init-7-1
        volumeMounts:
        - mountPath: /var/output-files
          name: fdb-binaries
      securityContext:
        fsGroup: 4059
        runAsGroup: 4059
        runAsUser: 4059
      serviceAccountName: fdb-kubernetes-operator-controller-manager
      terminationGracePeriodSeconds: 10
      volumes:
      - emptyDir: {}
        name: tmp
      - emptyDir: {}
        name: logs
      - emptyDir: {}
        name: fdb-binaries
cluster.yaml (taken from https://raw.githubusercontent.com/foundationdb/fdb-kubernetes-operator/main/config/samples/cluster.yaml with some minor modifications)
apiVersion: apps.foundationdb.org/v1beta2
kind: FoundationDBCluster
metadata:
  name: test-cluster
  namespace: fdb
spec:
  automationOptions:
    replacements:
      enabled: true
    podUpdateStrategy: "Replace"
  faultDomain:
    key: foundationdb.org/none
  labels:
    filterOnOwnerReference: false
    matchLabels:
      foundationdb.org/fdb-cluster-name: test-cluster
    processClassLabels:
    - foundationdb.org/fdb-process-class
    processGroupIDLabels:
    - foundationdb.org/fdb-process-group-id
  minimumUptimeSecondsForBounce: 60
  # processCounts:
  #   cluster_controller: 1
  #   storage: 3
  #   log: 3
  #   #stateless: 4
  #   stateless: -1
  # https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/docs/manual/scaling.md#growing-a-cluster
  databaseConfiguration:
    storage: 3
    logs: 1 # default is 3
    proxies: 1 # default is 3
    resolvers: 1 # default is 1
    redundancy_mode: single
  processes:
    general:
      customParameters:
      - knob_disable_posix_kernel_aio=1
      podTemplate:
        spec:
          containers:
          - name: foundationdb
            resources:
              requests:
                cpu: 100m
                memory: 128Mi
              limits:
                cpu: 100m
                memory: 170Mi
            securityContext:
              runAsUser: 0
          - name: foundationdb-kubernetes-sidecar
            resources:
              limits:
                cpu: 100m
                memory: 128Mi
              requests:
                cpu: 100m
                memory: 128Mi
            securityContext:
              runAsUser: 0
          initContainers:
          - name: foundationdb-kubernetes-init
            resources:
              limits:
                cpu: 100m
                memory: 128Mi
              requests:
                cpu: 100m
                memory: 128Mi
            securityContext:
              runAsUser: 0
      volumeClaimTemplate:
        spec:
          # RBD would have worked better
          storageClassName: rook-cephfs
          resources:
            requests:
              storage: 16G
  routing:
    headlessService: true
  sidecarContainer:
    enableLivenessProbe: true
    enableReadinessProbe: false
  useExplicitListenAddress: true
  version: 6.3.22
(Note: I also had to raise the memory limit for the foundationdb container.)
johscheuer
(Johannes Scheuermann)
January 31, 2023, 4:25pm
3
Could you check whether your CPU supports AVX instructions? If not, please change the image foundationdb/foundationdb-kubernetes-sidecar:7.1.15-1 to foundationdb/foundationdb-kubernetes-sidecar:7.1.14-1 (the even-numbered versions are built without AVX support); that should solve your problem. We should probably document that somewhere more visible.
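For anyone who wants to script the AVX check on a node, here is a minimal sketch. It assumes an x86 Linux host, where /proc/cpuinfo lists CPU features on a "flags" line; the helper names are illustrative and not part of any FoundationDB tooling.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect the feature flags from /proc/cpuinfo-style text (x86 'flags' lines)."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_avx(cpuinfo_text: str) -> bool:
    """True if the 'avx' flag is present; 'avx2' alone would not match."""
    return "avx" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        print("AVX supported:", supports_avx(path.read_text()))
    else:
        print("/proc/cpuinfo not found; this check only works on Linux")
```

Run it on the node itself (or in a pod scheduled there), since pods in the same cluster can land on hosts with different CPU features.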
vrusinov
(Vladimir Rusinov)
January 31, 2023, 5:11pm
4
Ah, indeed. I did check SSE but not AVX.
The host it was originally running on does not support AVX. After I bumped the memory limit, the pod moved to a different host which does support AVX (and I didn’t notice).