sni-router: host-net HAProxy to preserve real client IPs #522
@bam80 friendly ping — could you re-run this on your Fedora + Docker 29 setup before it leaves draft? The shape changed slightly from the version you tested:
Concretely, what I'd like to confirm:
If anything breaks, I'll iterate. Thanks for the patience on the round-trip.
```diff
-# host's net.ipv6.bindv6only sysctl. `v6only` on the v6 bind prevents it
-# from also accepting v4-mapped connections, which would otherwise
-# conflict with the explicit v4 bind on the same port.
-bind 0.0.0.0:80
-bind [::]:80 v6only
+# host's net.ipv6.bindv6only sysctl.
+bind :80,[::]:80
```
`*:`, `0.0.0.0:` and `:` are equivalent per the doc.
I don't have `v6only` in my patch variant (which is pretty much the same) and still haven't noticed any conflicts (with `net.ipv6.bindv6only = 0`). Not sure whether it's allowed in the one-line notation.
You're right that `*:`, `0.0.0.0:` and `:` are equivalent. Re: `v6only`, I checked the actual behavior: with `SO_REUSEADDR` (HAProxy's default) and `bindv6only=0`, the v6 bind succeeds alongside the v4 bind, and the kernel routes v4 packets to the more-specific `AF_INET` socket. So both forms produce identical runtime behavior on Linux. My earlier comment overstated the `v6only`/sysctl interaction: it's not load-bearing, it's self-documentation.
That makes the choice purely stylistic:
- Two binds + `v6only`: spells out why two binds coexist for someone reading the cfg, without having to reason about `SO_REUSEADDR` semantics.
- One-liner: shorter; the comment doesn't have to explain `v6only` because it's not there.
I have a mild preference for the explicit form in a `contrib/` example, but you're the one actually running sni-router and closer to the audience copying this config. If you'd rather have the one-liner, I'll switch; either way I'm fine.
On `v6only` in comma syntax: the HAProxy docs say bind options apply to all sockets on the line, so `bind :80,[::]:80 v6only` would set `IPV6_V6ONLY` on the v4 socket too. That's a no-op there, but cosmetically odd. If we go with the one-liner, I'd drop `v6only` entirely, as your suggestion does.
Either way, the gate I'd still like to clear before un-drafting is an actual `compose up -d` with this layout: v4 and v6 clients landing in the mtg and Caddy logs with their real addresses. The bind nit is a quick swap afterwards; that end-to-end run is the part I can't reproduce on my side.
Why I would prefer the one-liner: it makes adding new ports easier and looks better, e.g.:

```
bind :80,[::]:80
bind :8080,[::]:8080
```

I'm personally using a multi-port configuration and keep all the ports on one line, but someone else might prefer to just add a new line per port. I don't have a hard preference, though.
I'll test it tomorrow, thanks.
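A small sketch of how the two layouts scale as ports are added. `bind_lines` is a hypothetical Python helper, not part of this PR. It assumes, as the `:80,[::]:80` form already does, that HAProxy's `bind` accepts a comma-separated address list, with `:P` meaning every IPv4 address and `[::]:P` every IPv6 address on port P.

```python
def bind_lines(ports, one_line=False):
    """Render HAProxy `bind` directives listening on IPv4 and IPv6 for each port.

    Hypothetical helper: ':P' is every IPv4 address on port P, '[::]:P' every
    IPv6 address on port P.
    """
    addrs = [f":{p},[::]:{p}" for p in ports]
    if one_line:
        # All listeners on one line; bind options would apply to all of them.
        return ["bind " + ",".join(addrs)]
    # One line per port: adding a port means appending one line.
    return ["bind " + a for a in addrs]


if __name__ == "__main__":
    print("\n".join(bind_lines([80, 8080])))
    # bind :80,[::]:80
    # bind :8080,[::]:8080
    print("\n".join(bind_lines([80, 8080], one_line=True)))
    # bind :80,[::]:80,:8080,[::]:8080
```

Which style to use is a matter of taste. Per-port lines keep diffs small when adding a port, while a single line keeps every listener in one place, at the cost that any bind options then apply to all of them.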
Done in 2a63578: switched both the `:80` and `:443` blocks to `bind :PORT,[::]:PORT`, dropped `v6only`, and trimmed the comment to one sentence (there's nothing about `v6only` left to explain). Multi-port-scaling point taken; future ports can just add another comma-separated line.
> I'll test it tomorrow, thanks.
This test gave me a lot of grief (#525 (comment)), but it seems to work, thanks.
By the way, I still don't understand what the problem was with testing it yourself.
I couldn't test in the normal mode anyway (port 80 isn't reachable from outside), so I had to test with `DOMAIN=localhost`, but that should be enough: both IP versions show up correctly.
And the missing #514 added fuel to the fire; I banged my head against the wall over that one too.
I reread my excuse about "can't reproduce it on my side", and you're right, it doesn't hold up. The real reason: you were the original tester with an already-verified environment, and in my head that turned into "it's cheaper to ask again than to spin up a clean machine". But this is exactly the case where "cheaper" means "offloading it onto someone else". `DOMAIN=localhost` on any dev VPS is what I should have done myself before asking for a third pass. Noted.
Yes, and it wasn't an "adjacent" problem but a hard dependency: `nineseconds/mtg:2` has no `proxy-protocol-listener`, so without #514/#480 the stack objectively doesn't work end-to-end, hence your manual patch during the test. I should have either explicitly chained #514 in the description or included the image bump here. #514 is now in master; after a rebase, the next tester gets a working stack without manual fiddling.
```diff
-bind *:443
+bind 0.0.0.0:443
+bind [::]:443 v6only
```
```
# Explicit v4 + v6 binds so IPv6 clients are accepted regardless of the
# host's net.ipv6.bindv6only sysctl. `v6only` on the v6 bind prevents it
```
Note: we could also just use `bind [::]:80 v4v6` without explicit v4 and v6 binds, but then IPv4 addresses would show up in the logs as `::ffff:1.2.3.4`.
Right, that `::ffff:1.2.3.4` noise is exactly why I went with explicit dual binds rather than `v4v6`. Sticking with `bind :PORT,[::]:PORT` so v4 stays v4 in PROXY v2 and in downstream logs.
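For comparison, here is a minimal Python sketch of the clean-up a log consumer would need if the single `bind [::]:80 v4v6` form were used. It unwraps IPv4-mapped IPv6 addresses (`::ffff:a.b.c.d`) back to plain IPv4 with the standard-library `ipaddress` module. `display_ip` is a hypothetical name. With the explicit dual binds this step is unnecessary, because v4 clients are accepted on an `AF_INET` socket in the first place.

```python
import ipaddress


def display_ip(raw: str) -> str:
    """Return an address the way a log reader wants to see it.

    An IPv4 client accepted on a dual-stack (v4v6) IPv6 socket appears as an
    IPv4-mapped IPv6 address such as '::ffff:1.2.3.4'; unwrap it to '1.2.3.4'.
    """
    addr = ipaddress.ip_address(raw)
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        return str(addr.ipv4_mapped)
    return str(addr)


if __name__ == "__main__":
    for raw in ("::ffff:1.2.3.4", "1.2.3.4", "2001:db8::1"):
        print(raw, "->", display_ip(raw))
    # ::ffff:1.2.3.4 -> 1.2.3.4
    # 1.2.3.4 -> 1.2.3.4
    # 2001:db8::1 -> 2001:db8::1
```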
Switch to one-line `bind :80,[::]:80` and `bind :443,[::]:443` per review feedback in #522. The v6only flag was self-documentation, not load-bearing: with SO_REUSEADDR (HAProxy's default) and bindv6only=0 the kernel routes v4 packets to the more-specific AF_INET socket regardless. Comment trimmed to match — the v6only paragraph is gone because v6only itself is gone. The shorter form also scales more cleanly when adding ports later, e.g. `bind :8080,[::]:8080` on a new line.
Bridge ingress (Docker's docker-proxy userland forwarder, Podman's slirp4netns/pasta) rewrites the source IP of inbound connections on a published port to the bridge gateway address. HAProxy then stamps that gateway address into the PROXY v2 header it forwards to mtg and Caddy, so neither backend ever sees a real client IP.

Move HAProxy into the host netns (network_mode: host) so it binds :443/:80 directly with no NAT in the path. mtg and Caddy stay on the compose bridge and are published on 127.0.0.1 only; HAProxy reaches them via host loopback and PROXY v2 carries the real client IP (v4 or v6) end-to-end.

Also accept IPv6 clients explicitly on the HAProxy frontends: `bind *:443` is IPv4-only and missed v6 clients on hosts where the previous example happened to "work" only because of dual-stack quirks.

Add 127.0.0.0/8 to Caddy's PROXY allow-list to cover the new loopback hop from HAProxy.

README gains a short subsection explaining the host-mode choice and its trade-off (HAProxy occupies host :443/:80).

Diagnosed and tested by @bam80 on Fedora + Docker 29. Fixes #498.
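To make "HAProxy stamps that address into the PROXY v2 header" concrete, here is a minimal Python sketch of the binary v2 header for a TCP connection. It follows the layout in HAProxy's PROXY protocol specification: a 12-byte signature, a version/command byte, a family/transport byte, a 2-byte length, then the source and destination addresses and ports. It is illustrative only (no TLVs, no LOCAL command), and `build_v2`/`parse_v2_source` are hypothetical names. Whatever peer address HAProxy accepted the connection from is what lands in the source field. That is why a bridge-gateway address there is exactly what mtg and Caddy end up logging.

```python
import ipaddress
import struct

# Fixed 12-byte signature that opens every PROXY protocol v2 header.
SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"

# Family/transport byte -> (address width in bytes, address class).
FAMILIES = {
    0x11: (4, ipaddress.IPv4Address),   # TCP over IPv4
    0x21: (16, ipaddress.IPv6Address),  # TCP over IPv6
}


def build_v2(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> bytes:
    """Encode a v2 header (command PROXY) describing one TCP connection."""
    src, dst = ipaddress.ip_address(src_ip), ipaddress.ip_address(dst_ip)
    if src.version != dst.version:
        raise ValueError("source and destination must share an address family")
    fam = 0x11 if src.version == 4 else 0x21
    # Address block: src addr, dst addr, src port, dst port, network byte order.
    block = src.packed + dst.packed + struct.pack("!HH", src_port, dst_port)
    # 0x21 = protocol version 2 (high nibble) + command PROXY (low nibble).
    return SIGNATURE + bytes([0x21, fam]) + struct.pack("!H", len(block)) + block


def parse_v2_source(header: bytes):
    """Decode the (source address, source port) a backend would log."""
    if header[:12] != SIGNATURE or header[12] != 0x21:
        raise ValueError("not a PROXY v2 header with command PROXY")
    if header[13] not in FAMILIES:
        raise ValueError("this sketch only handles TCP over IPv4/IPv6")
    width, addr_cls = FAMILIES[header[13]]
    (length,) = struct.unpack("!H", header[14:16])
    block = header[16:16 + length]
    src = addr_cls(block[:width])
    (src_port,) = struct.unpack("!H", block[2 * width:2 * width + 2])
    return str(src), src_port


if __name__ == "__main__":
    # Host-netns HAProxy: the real client address goes into the header.
    print(parse_v2_source(build_v2("203.0.113.7", 51514, "198.51.100.1", 443)))
    # -> ('203.0.113.7', 51514)
    # Bridge ingress: HAProxy only ever saw the gateway, so that is what it sends.
    print(parse_v2_source(build_v2("172.18.0.1", 51514, "172.18.0.2", 443)))
    # -> ('172.18.0.1', 51514)
```

The addresses in the usage lines are documentation and example values, not taken from the PR's test runs.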
…rrow Caddy allow)

- Caddy allow: 127.0.0.0/8 → 127.0.0.1/32 (only loopback peer is HAProxy).
- haproxy.cfg: rewrite v6only comment to describe what it actually does (suppresses v4-mapped accept, preventing conflict with the v4 bind), not the symptom.
- docker-compose.yml: trim the 8-line haproxy comment to 3 lines and defer the rationale to README. Add one-line note explaining why web uses host port 8080 (HAProxy owns :80).
- README: condense the "Why network_mode: host" subsection. Spell out trade-offs as a list: own-the-host-ports, Linux-only (Docker Desktop doesn't make this layout reachable), userns-remap incompatibility. Note that mtg-config.toml stays as-is because mtg/web remain on the compose bridge.
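The narrowed allow-list boils down to a membership check on the connecting peer. Here is a minimal Python sketch using the standard `ipaddress` module. It is not Caddy's configuration or implementation: `TRUSTED_PROXIES` and `trusted_peer` are hypothetical names, and the three RFC1918 ranges are assumed from the PR's mention of "the RFC1918 ranges". It shows the effect of going from `127.0.0.0/8` to `127.0.0.1/32`: `127.0.0.2` is no longer trusted.

```python
import ipaddress

# Illustrative mirror of the allow-list described in this PR: peers trusted to
# prepend a PROXY header. 127.0.0.1/32 is HAProxy's loopback hop; the RFC1918
# ranges (assumed here) cover mtg -> Caddy fronting over the compose bridge.
TRUSTED_PROXIES = [
    ipaddress.ip_network(cidr)
    for cidr in ("127.0.0.1/32", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def trusted_peer(peer_ip: str) -> bool:
    """True if a connection from peer_ip may carry a PROXY header."""
    addr = ipaddress.ip_address(peer_ip)
    # Membership is simply False when address and network families differ.
    return any(addr in net for net in TRUSTED_PROXIES)


if __name__ == "__main__":
    for peer in ("127.0.0.1", "127.0.0.2", "172.18.0.3", "203.0.113.7"):
        print(peer, trusted_peer(peer))
    # 127.0.0.1 True
    # 127.0.0.2 False
    # 172.18.0.3 True
    # 203.0.113.7 False
```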
Force-pushed from 2a63578 to a7febc2.
Follow-up to discussion in #498: bam80 reported that real client IPs never made it through `contrib/sni-router` despite all four PROXY-protocol pieces being wired up correctly.

Root cause
When the HAProxy container is on a bridge network and `:443`/`:80` are published via `ports:`, the source IP of every inbound connection is rewritten to the bridge gateway before HAProxy sees it:

- Docker (default): `docker-proxy` accepts on the host and re-opens the connection from the bridge gateway.
- Docker with `userland-proxy: false`: kernel DNAT should preserve the source, but on Docker 29 / Fedora the MASQUERADE rewrite (moby/moby#48854) intermittently drops or rewrites traffic.
- Podman: `slirp4netns`/`pasta` userspace forwarder, no equivalent flag.

In every case HAProxy stamps the gateway address (e.g. `172.x.x.1`) into the PROXY v2 header, so mtg and Caddy faithfully log the wrong IP. The fix has to lift HAProxy out of the rewrite path: no amount of backend-side configuration can recover what HAProxy never received.

Change
- HAProxy: `network_mode: host`. Binds `:443`/`:80` in the host netns directly. No NAT, no userspace forwarder, no source rewrite. Real client IPs (v4 and v6) propagate end-to-end via PROXY v2.
- mtg and Caddy: stay on the compose bridge, published on `127.0.0.1` only (`127.0.0.1:3128:3128`, `127.0.0.1:8080:80`, `127.0.0.1:8443:8443`). HAProxy reaches them via host loopback.
- HAProxy frontends accept IPv6 explicitly (`bind :443,[::]:443` / `bind :80,[::]:80`). `bind *:443` is IPv4-only; the old example accepted IPv6 only on hosts where dual-stack quirks happened to cover for it.
- Caddy's PROXY allow-list gains `127.0.0.1/32` to cover the new loopback hop from HAProxy. The RFC1918 ranges stay for the fronting path (mtg → Caddy on the compose bridge).
- `mtg-config.toml` is intentionally unchanged: mtg and Caddy are still on the compose bridge, so fronting can keep `host = "web"` and resolve over compose-network DNS.

Alternatives considered
- `{"userland-proxy": false}` in `/etc/docker/daemon.json`. Smaller change (host config, no compose edits) but: (a) requires host-level daemon config that the contrib example can't ship, (b) flaky on Docker 29 (MASQUERADE rewrite), (c) doesn't help Podman rootless at all.
- `network_mode: host` for mtg and Caddy as well. Cleaner network model, but requires changing `mtg-config.toml` (fronting `host` and listen address) and loses compose-network isolation. Bigger change for no functional gain over the current layout.

Trade-offs / platform notes
- HAProxy owns host `:443` and `:80`. Don't run anything else on those ports on the same host.
- Linux-only: on Docker Desktop, `network_mode: host` binds inside the Linux VM, so external clients can't reach the proxy. Out of scope for this contrib example, which is server-deployment-oriented anyway.
- With `userns-remap`, the in-container "root" loses the privilege to bind ports below 1024. README documents the workaround.

Test status
End-to-end validated by @bam80 on Fedora + Docker 29: both the original revision (real client IPv4/IPv6 visible in mtg and Caddy logs) and the current revision (`DOMAIN=localhost` run after the review fixups, see thread).

Branch rebased on master after the 2026-05-20 batch; picks up #514 (image bump to `:master`, a hard dependency so the stack actually exposes `proxy-protocol-listener` end-to-end) and #525 (`mtg-config.toml` rendered from the tracked `.example`). A fresh checkout now brings up a working stack without manual patching.

Closes #498.