Backup agent leaking sockets in CLOSE_WAIT

Hey all, we’ve noticed that if we keep our backup agent running for a long time, it slowly leaks sockets in CLOSE_WAIT until the process runs out of available file handles.
The agent uploads our backups to an S3 bucket.

Example:

$ pstree -p | grep backup
           |-containerd(5834)-+-containerd-shim(12582)-+-storage(12615)-+-fdbmonitor(12650)-+-backup_agent(12657)-+-{backup_agent}(+
           |                  |                        |                |                   |                     |-{backup_agent}(+
           |                  |                        |                |                   |                     |-{backup_agent}(+
           |                  |                        |                |                   |                     |-{backup_agent}(+
           |                  |                        |                |                   |                     |-{backup_agent}(+
           |                  |                        |                |                   |                     `-{backup_agent}(+
$ ls -l /proc/12657/fd/
...
lrwx------. 1 root root 64 Sep  4 00:08 64 -> socket:[191177707]
lrwx------. 1 root root 64 Sep  4 08:17 65 -> socket:[190092237]
lrwx------. 1 root root 64 Sep  4 08:17 66 -> socket:[190761957]
lr-x------. 1 root root 64 Sep  2 19:23 7 -> pipe:[185282974]
lrwx------. 1 root root 64 Sep  4 08:17 70 -> socket:[190464326]
lrwx------. 1 root root 64 Sep  4 12:22 71 -> socket:[190464327]
lrwx------. 1 root root 64 Sep  4 12:22 72 -> socket:[190464328]
lrwx------. 1 root root 64 Sep  4 12:22 73 -> socket:[190967685]
lrwx------. 1 root root 64 Sep  4 20:32 74 -> socket:[191421019]
lrwx------. 1 root root 64 Sep  5 00:37 75 -> socket:[190977564]
lrwx------. 1 root root 64 Sep  5 00:37 76 -> socket:[191425533]
lrwx------. 1 root root 64 Sep  5 00:37 77 -> socket:[191417785]
l-wx------. 1 root root 64 Sep  2 19:23 8 -> pipe:[185282974]
lrwx------. 1 root root 64 Sep  5 04:43 81 -> socket:[191845011]
lrwx------. 1 root root 64 Sep  5 04:43 82 -> socket:[191853202]
lrwx------. 1 root root 64 Sep  2 19:23 9 -> anon_inode:[eventfd]
$ lsof -i -P | grep 12657
...
backup_ag 12657    root   70u  IPv4 190464326      0t0  TCP <our-ip>:40498->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657    root   71u  IPv4 190464327      0t0  TCP <our-ip>:40500->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657    root   72u  IPv4 190464328      0t0  TCP <our-ip>:40502->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657    root   73u  IPv4 190967685      0t0  TCP <our-ip>:34920->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657    root   74u  IPv4 191421019      0t0  TCP <our-ip>:41056->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657    root   75u  IPv4 190977564      0t0  TCP <our-ip>:59910->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657    root   76u  IPv4 191425533      0t0  TCP <our-ip>:41058->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657    root   77u  IPv4 191417785      0t0  TCP <our-ip>:58702->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657    root   81u  IPv4 191845011      0t0  TCP <our-ip>:47550->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657    root   82u  IPv4 191853202      0t0  TCP <our-ip>:40254->aws.amazonaws.com:443 (CLOSE_WAIT)
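
To see how quickly these accumulate, the count can be sampled straight from /proc instead of eyeballing lsof. Below is a rough Python sketch, assuming Linux and enough privileges to read /proc/<pid>/fd and /proc/<pid>/net/tcp. The function names (socket_inodes, close_wait_inodes, count_close_wait, watch) are made up for this post and are not part of FoundationDB.

```python
import os
import re
import time

CLOSE_WAIT = "08"  # hex TCP state code for CLOSE_WAIT in /proc/net/tcp


def socket_inodes(fd_targets):
    """Extract inode numbers from readlink targets such as 'socket:[190464326]'."""
    inodes = set()
    for target in fd_targets:
        match = re.fullmatch(r"socket:\[(\d+)\]", target)
        if match:
            inodes.add(match.group(1))
    return inodes


def close_wait_inodes(proc_net_tcp_text):
    """Return inodes of sockets in CLOSE_WAIT, given the text of /proc/net/tcp."""
    inodes = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Column 3 is the state ("st"), column 9 is the socket inode.
        if len(fields) >= 10 and fields[3] == CLOSE_WAIT:
            inodes.add(fields[9])
    return inodes


def count_close_wait(pid):
    """Count sockets held open by `pid` that are in CLOSE_WAIT."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            continue  # fd was closed between listdir and readlink
    owned = socket_inodes(targets)
    stuck = set()
    for table in ("tcp", "tcp6"):
        try:
            # Read the table through the pid so its network namespace is used.
            with open(f"/proc/{pid}/net/{table}") as f:
                stuck |= close_wait_inodes(f.read())
        except OSError:
            continue  # table not present (e.g. IPv6 disabled)
    return len(owned & stuck)


def watch(pid, interval_s=60, samples=10):
    """Print a timestamped CLOSE_WAIT count every `interval_s` seconds."""
    for _ in range(samples):
        print(time.strftime("%Y-%m-%d %H:%M:%S"), count_close_wait(pid), flush=True)
        time.sleep(interval_s)
```

Calling watch(12657) would print one line per minute for the process above. Comparing the times at which the count steps up with the error times in the log might show whether each failed request leaves a socket behind.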

We’re also seeing a few BlobStoreEndpointRequestFailedRetryable errors in our log, for example:

<Event Severity="20" Time="1630846418.525511" Type="BlobStoreEndpointRequestFailedRetryable" ID="0000000000000000" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="1" ConnectionEstablished="1" RemoteEndpoint="" Verb="PUT" Resource="" ThisTry="1" RetryDelay="2" Machine="" LogGroup="default" />
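
To line these errors up against the socket counts, the events can be tallied from the trace file. This is a minimal sketch, assuming the log is XML with one <Event ... /> element per line, as in the sample line from our log. tally_events is a hypothetical helper written for this post, not an FDB tool.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tally_events(lines, event_type="BlobStoreEndpointRequestFailedRetryable"):
    """Count trace events of one Type, keyed by (Error, Verb, ConnectionEstablished)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event "):
            continue  # skip the XML header, the root element, and blank lines
        try:
            event = ET.fromstring(line)
        except ET.ParseError:
            continue  # skip truncated or partially written lines
        if event.get("Type") != event_type:
            continue
        key = (event.get("Error"), event.get("Verb"), event.get("ConnectionEstablished"))
        counts[key] += 1
    return counts
```

Passing an open trace file to tally_events gives a Counter. If most entries have ConnectionEstablished="1", that would at least be consistent with the failures happening on connections that were already open, but that is only a guess until the numbers are in.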

Anybody have any thoughts on what the issue could be, or how we could gather more data?
