Hey all, we’ve noticed that when our backup agent runs for a long time, it slowly leaks sockets in ‘CLOSE_WAIT’ until the process runs out of available file handles.
We upload our backup to an S3 storage bucket.
Example:
$ pstree -p | grep backup
|-containerd(5834)-+-containerd-shim(12582)-+-storage(12615)-+-fdbmonitor(12650)-+-backup_agent(12657)-+-{backup_agent}(+
| | | | | |-{backup_agent}(+
| | | | | |-{backup_agent}(+
| | | | | |-{backup_agent}(+
| | | | | |-{backup_agent}(+
| | | | | `-{backup_agent}(+
$ ls -l /proc/12657/fd/
...
lrwx------. 1 root root 64 Sep 4 00:08 64 -> socket:[191177707]
lrwx------. 1 root root 64 Sep 4 08:17 65 -> socket:[190092237]
lrwx------. 1 root root 64 Sep 4 08:17 66 -> socket:[190761957]
lr-x------. 1 root root 64 Sep 2 19:23 7 -> pipe:[185282974]
lrwx------. 1 root root 64 Sep 4 08:17 70 -> socket:[190464326]
lrwx------. 1 root root 64 Sep 4 12:22 71 -> socket:[190464327]
lrwx------. 1 root root 64 Sep 4 12:22 72 -> socket:[190464328]
lrwx------. 1 root root 64 Sep 4 12:22 73 -> socket:[190967685]
lrwx------. 1 root root 64 Sep 4 20:32 74 -> socket:[191421019]
lrwx------. 1 root root 64 Sep 5 00:37 75 -> socket:[190977564]
lrwx------. 1 root root 64 Sep 5 00:37 76 -> socket:[191425533]
lrwx------. 1 root root 64 Sep 5 00:37 77 -> socket:[191417785]
l-wx------. 1 root root 64 Sep 2 19:23 8 -> pipe:[185282974]
lrwx------. 1 root root 64 Sep 5 04:43 81 -> socket:[191845011]
lrwx------. 1 root root 64 Sep 5 04:43 82 -> socket:[191853202]
lrwx------. 1 root root 64 Sep 2 19:23 9 -> anon_inode:[eventfd]
$ lsof -i -P | grep 12657
...
backup_ag 12657 root 70u IPv4 190464326 0t0 TCP <our-ip>:40498->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657 root 71u IPv4 190464327 0t0 TCP <our-ip>:40500->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657 root 72u IPv4 190464328 0t0 TCP <our-ip>:40502->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657 root 73u IPv4 190967685 0t0 TCP <our-ip>:34920->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657 root 74u IPv4 191421019 0t0 TCP <our-ip>:41056->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657 root 75u IPv4 190977564 0t0 TCP <our-ip>:59910->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657 root 76u IPv4 191425533 0t0 TCP <our-ip>:41058->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657 root 77u IPv4 191417785 0t0 TCP <our-ip>:58702->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657 root 81u IPv4 191845011 0t0 TCP <our-ip>:47550->aws.amazonaws.com:443 (CLOSE_WAIT)
backup_ag 12657 root 82u IPv4 191853202 0t0 TCP <our-ip>:40254->aws.amazonaws.com:443 (CLOSE_WAIT)
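For context, CLOSE_WAIT means the remote side has already closed the connection and the local process has not yet closed its own end of the socket. To watch how fast these pile up, something like the following could be run periodically. It is only a rough sketch: the function names are made up for illustration, it assumes a Linux /proc filesystem, and it needs enough privileges to read /proc/<pid>/fd (root in our case).

```python
import os
import re

CLOSE_WAIT = "08"  # hex state code for CLOSE_WAIT in /proc/net/tcp


def close_wait_inodes(proc_net_tcp_text):
    """Return the socket inodes in CLOSE_WAIT from /proc/net/tcp-style text."""
    inodes = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        # fields[3] is the state, fields[9] is the socket inode
        if len(fields) >= 10 and fields[3] == CLOSE_WAIT:
            inodes.add(fields[9])
    return inodes


def socket_inodes_of(pid):
    """Return the socket inodes that the given pid holds open."""
    inodes = set()
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed while we were scanning
        match = re.fullmatch(r"socket:\[(\d+)\]", target)
        if match:
            inodes.add(match.group(1))
    return inodes


def count_close_wait(pid):
    """Count the sockets owned by pid that are currently in CLOSE_WAIT."""
    held = socket_inodes_of(pid)
    waiting = set()
    # Read the tables through /proc/<pid>/net so we see the process's own
    # network namespace (the agent runs inside a container).
    for table in ("tcp", "tcp6"):
        try:
            with open(f"/proc/{pid}/net/{table}") as f:
                waiting |= close_wait_inodes(f.read())
        except FileNotFoundError:
            pass
    return len(held & waiting)
```

Logging `count_close_wait(12657)` with a timestamp every minute or so would show whether the count grows steadily or jumps at particular times.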
We’re getting a few BlobStoreEndpointRequestFailedRetryable errors in our log.
<Event Severity="20" Time="1630846418.525511" Type="BlobStoreEndpointRequestFailedRetryable" ID="0000000000000000" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="1" ConnectionEstablished="1" RemoteEndpoint="" Verb="PUT" Resource="" ThisTry="1" RetryDelay="2" Machine="" LogGroup="default" />
Anybody have any thoughts on what the issue could be, or how we could gather more data?
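On the data-gathering side, one thing we could do ourselves is tally these events to see whether the failures cluster on one verb or error. The snippet below is a sketch only: it assumes the trace log is XML with one `<Event ... />` element per line, as in the excerpt above, and `tally_events` is a made-up helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tally_events(lines, event_type="BlobStoreEndpointRequestFailedRetryable"):
    """Count events of the given Type, grouped by (Error, Verb, ConnectionEstablished)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("<Event "):
            continue  # skip the XML header, wrapper tags, and blank lines
        try:
            attrs = ET.fromstring(line).attrib
        except ET.ParseError:
            continue  # skip truncated or malformed lines
        if attrs.get("Type") == event_type:
            key = (
                attrs.get("Error"),
                attrs.get("Verb"),
                attrs.get("ConnectionEstablished"),
            )
            counts[key] += 1
    return counts
```

Feeding it the lines of a trace file (for example `tally_events(open(path))`, where `path` is wherever the backup agent writes its trace log) gives a count per combination, which could then be lined up against the times the CLOSE_WAIT sockets appear.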