Debugging a TraefikEE routing problem
Recently, we ran into a weird problem. After restarting 2 nodes in our Nomad cluster, we could no longer access GitLab via SSH. Web access worked fine, and cloning via https:// worked too, but SSH, which most of our developers use by default, did not.
At first, I thought the issue was on the GitLab side. I checked the docs and found a few tips, but none of them helped. That left me a bit clueless until I remembered a co-worker telling me that this could be a proxy issue. I checked our firewall appliance first and found no problem. Then I checked the haproxy instance we have running and found no problem there either. Last, I checked TraefikEE, and there it was: TraefikEE had not properly cleaned up the old nodes after the restart and still had a bunch of unavailable data plane nodes registered.
To get an overview of which nodes TraefikEE is aware of, you can use the teectl tool like this:
./teectl/teectl get nodes --socket=/var/run/traefikeev2/teectl.sock
To delete a node, pass the ID of the node that should be removed to the delete command:
./teectl/teectl delete node --id=gj4t352y0fowxraan1scmfddf --socket=/var/run/traefikeev2/teectl.sock
I had to run the command 5 or 6 times to get rid of all the unavailable data plane nodes. After a few seconds, the TraefikEE Web UI no longer showed any errors. I then tried pulling from and pushing to GitLab via SSH, and everything was working again.
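Since the delete command has to be repeated once per stale node, a small wrapper can save some typing. The following is only a sketch, not something from the TraefikEE docs: the function name delete_nodes and the TEECTL and SOCKET variables are my own, and the defaults simply reuse the paths from the commands above. The node IDs still have to be copied by hand from the get nodes output.

```shell
#!/bin/sh
# Sketch: remove several stale TraefikEE nodes in one go.
# TEECTL and SOCKET are hypothetical variables; their defaults match the
# paths used in the commands above and can be overridden via the environment.
TEECTL="${TEECTL:-./teectl/teectl}"
SOCKET="${SOCKET:-/var/run/traefikeev2/teectl.sock}"

# delete_nodes ID [ID...]: runs "teectl delete node" once per given node ID.
delete_nodes() {
    for id in "$@"; do
        echo "Deleting node $id"
        # Report a failed deletion but keep going with the remaining IDs.
        "$TEECTL" delete node --id="$id" --socket="$SOCKET" \
            || echo "Failed to delete node $id" >&2
    done
}

# Node IDs are passed as script arguments; with no arguments nothing happens.
delete_nodes "$@"
```

Saved as, say, delete-nodes.sh, it would be called with the IDs of the unavailable nodes as arguments, for example: sh delete-nodes.sh gj4t352y0fowxraan1scmfddf followed by any further IDs. Running the get nodes command again afterwards shows whether anything is left.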