Boundary, Traefik and haproxy
While extensively testing our Boundary setup, we noticed that SSH sessions were terminated after about 60 seconds of inactivity. That is far from ideal, so we had to find the cause and fix it.
Since Traefik proxies all requests to the applications deployed in our Nomad cluster, we assumed it could be the source of the error. We upgraded to the latest Traefik version and checked the docs to see whether timeouts can be configured. We came across the respondingTimeouts configuration and set each of its timeouts to 0 (which disables the timeout) for both of our Boundary entrypoints:
[entryPoints]
  [entryPoints.boundary-controller]
    address = ":9201"
    [entryPoints.boundary-controller.transport]
      [entryPoints.boundary-controller.transport.respondingTimeouts]
        readTimeout = 0
        writeTimeout = 0
        idleTimeout = 0
  [entryPoints.boundary-worker]
    address = ":9202"
    [entryPoints.boundary-worker.transport]
      [entryPoints.boundary-worker.transport.respondingTimeouts]
        readTimeout = 0
        writeTimeout = 0
        idleTimeout = 0
After restarting Traefik, the same error occurred again. After about 60 seconds, the connection dropped, and the local SSH client was terminated. We tried different settings for the timeout configuration but never managed to solve the issue.
If Traefik wasn't causing the issue, where was the problem coming from? Because we use TraefikEE in our setup, we also run a haproxy instance in front of TraefikEE to load balance requests across two TraefikEE proxy instances. Realizing this, I dug deeper into the haproxy configuration and found that haproxy also has timeout settings for the client and server connections.
After raising the client and server timeouts (Client timeout, Server timeout) in haproxy, the disconnect problem disappeared.
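For illustration, here is a minimal sketch of what such a change can look like in a plain haproxy configuration file. It assumes the two settings map to haproxy's `timeout client` and `timeout server` directives and that they live in a `defaults` section. The one-hour value and the `mode tcp` line are assumptions for this example, not the values from our setup.

```haproxy
defaults
    # Assumption: Boundary traffic is passed through as plain TCP.
    mode tcp

    # Time allowed to establish the connection to a TraefikEE instance.
    timeout connect 5s

    # Maximum inactivity on the client side before haproxy closes the
    # connection. Illustrative value; pick one that fits your sessions.
    timeout client 1h

    # Maximum inactivity on the server side (towards TraefikEE).
    timeout server 1h
```

Keeping both values identical avoids one side of the proxied connection timing out before the other. Whichever timeout is shorter effectively decides how long an idle SSH session survives.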