I have this very peculiar problem that I believe might be a bug in the Connect framework.
Background: My app is an Atlassian Connect Spring Boot (2.2.3) Jira app running on AWS Fargate in a Docker container built on an Alpine Linux base image.
As soon as the container boots up, it starts accumulating open sockets (each one counted as a file descriptor). The file descriptor limit in my Linux configuration is currently 4096. It takes about 8-10 days for my service to reach that limit, and then the container starts logging the error below because it can’t create new FDs. Fargate health checks start to fail. Then Fargate kills the container and spawns a new one. The cycle starts all over again.
2022-02-18T00:40:24.536+03:00 2022-02-17 21:40:24.535 ERROR 1 --- [o-8080-Acceptor] org.apache.tomcat.util.net.Acceptor : Socket accept failed
2022-02-18T00:40:24.536+03:00 java.io.IOException: No file descriptors available
2022-02-18T00:40:24.536+03:00 at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) ~[na:1.8.0_275]
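For anyone who wants to watch the descriptor count climb toward the limit from inside the JVM, here is a minimal sketch. It assumes a HotSpot/OpenJDK runtime on Linux, where the platform MXBean implements `com.sun.management.UnixOperatingSystemMXBean`; the class name `FdProbe` and the method `fdCounts` are hypothetical names chosen for illustration, not part of the Connect framework.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

final class FdProbe {
    /** Returns {open, max} descriptor counts, or {-1, -1} if the JVM does not expose them. */
    static long[] fdCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        // Not a Unix-like JVM: the counts are not available through this bean.
        return new long[] { -1L, -1L };
    }

    public static void main(String[] args) {
        long[] counts = fdCounts();
        System.out.println("open=" + counts[0] + " max=" + counts[1]);
    }
}
```

Logging these two numbers on a schedule would show how fast the count grows and how close it is to the limit before the Acceptor starts failing.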
Below are my findings:
- An investigation with “lsof” command shows that it is not files that are accumulating but network sockets. Something opens network sockets and doesn’t close them.
- Deeper investigation with “ss -epr” shows that the open sockets are ONLY to the three hosts below:
*** ec2-18-246-31-137.us-west-2.compute.amazonaws.com:https
*** ec2-18-246-31-138.us-west-2.compute.amazonaws.com:https
*** ec2-18-246-31-139.us-west-2.compute.amazonaws.com:https
- The sockets are open but the received and sent packet counts do not change over time. Received 32, Sent 0.
- Navigating to these hosts in a browser shows that they are 3 decommissioned Jira Cloud instances. I don’t know who they used to belong to, because all of them return the usual error: “Your Atlassian Cloud site is currently unavailable.”
- HTTP logs show no requests coming from these IPs.
- I can’t reproduce the problem in our test environment. (Believe me, we tried hard.)
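The lsof and ss findings above could also be cross-checked from inside the container by grouping the kernel's socket table by remote endpoint and TCP state. The sketch below is only an illustration under stated assumptions: it reads the IPv4 table at `/proc/net/tcp`, assumes a little-endian host (which determines the byte order of the hex addresses), and uses the hypothetical names `ProcNetTcp`, `decodeEndpoint`, `stateName` and `countByRemoteAndState`. A JVM using dual-stack sockets may list its connections in `/proc/net/tcp6` instead, which this sketch does not decode.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ProcNetTcp {
    /** Decodes an "ADDR:PORT" hex field from /proc/net/tcp (IPv4, little-endian host assumed). */
    static String decodeEndpoint(String hex) {
        String[] parts = hex.split(":");
        long addr = Long.parseLong(parts[0], 16);
        int port = Integer.parseInt(parts[1], 16);
        // The lowest byte of the hex value is the first octet of the address.
        return (addr & 0xFF) + "." + ((addr >> 8) & 0xFF) + "."
                + ((addr >> 16) & 0xFF) + "." + ((addr >> 24) & 0xFF) + ":" + port;
    }

    /** Maps the kernel's hex state code to a name for the states of interest here. */
    static String stateName(String hex) {
        switch (hex) {
            case "01": return "ESTABLISHED";
            case "06": return "TIME_WAIT";
            case "08": return "CLOSE_WAIT";
            case "0A": return "LISTEN";
            default:   return "STATE_" + hex;
        }
    }

    /** Counts sockets per "remote endpoint + state", skipping the header line. */
    static Map<String, Integer> countByRemoteAndState(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.trim().split("\\s+");
            if (f.length < 4 || !f[0].endsWith(":")) {
                continue; // header or malformed line
            }
            counts.merge(decodeEndpoint(f[2]) + " " + stateName(f[3]), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Path table = Paths.get("/proc/net/tcp");
        if (!Files.exists(table)) {
            System.out.println("/proc/net/tcp not found (not Linux?)");
            return;
        }
        countByRemoteAndState(Files.readAllLines(table))
                .forEach((key, n) -> System.out.println(n + "\t" + key));
    }
}
```

For instance, `decodeEndpoint("891FF612:01BB")` yields `18.246.31.137:443`. A large and growing count against one remote endpoint in a single state would narrow down which side stopped closing its end of the connection.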
Since (or should I say if) these are decommissioned instances, they can’t be sending me requests. Right? These connections must have been initiated from our side. But why? And why only these three instances? When I spawn a new container, it is a new container with a clean OS and clean file system. Only the DB is old so it must be data-dependent. Where does it get the idea of connecting to these instances? What is it sending?
I tend to think that this is a Connect framework bug. These were probably old customer instances. For some reason, my service is (probably) sending these instances some requests periodically and leaving open sockets.
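To illustrate one mechanism by which an outbound connection can keep holding a descriptor, here is a small self-contained sketch using only `java.net`. It is not taken from the Connect framework and I have not confirmed that this is what happens in my service; the class name `HalfClosedSocketDemo` and its helper method are hypothetical. A loopback peer accepts the connection and hangs up immediately, while the client side never calls `close()`.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class HalfClosedSocketDemo {
    /** Returns a client socket whose peer has already closed its end of the connection. */
    static Socket connectToPeerThatHangsUp() throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Socket client = new Socket(loopback, server.getLocalPort());
            try (Socket accepted = server.accept()) {
                // Leaving this block closes the accepted socket, so the peer hangs up.
            }
            return client; // deliberately left open by the caller
        }
    }

    public static void main(String[] args) throws IOException {
        Socket client = connectToPeerThatHangsUp();
        int firstRead = client.getInputStream().read(); // -1: the peer has closed
        System.out.println("read() = " + firstRead + ", isClosed() = " + client.isClosed());
        // Until close() is called, the descriptor stays allocated (CLOSE_WAIT on Linux).
        client.close();
    }
}
```

The client sees end-of-stream, yet `isClosed()` is still false until `close()` is called. If some HTTP client code path skips that call for certain hosts, the descriptors would pile up the way I am seeing.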
But why? I have no way to debug this further. Any help or direction will be greatly appreciated.