Hugh’s Tech Blog

Kubernetes Networking Deep Dive, Part 4

Hugh Tipping — Fri, 27 Feb 2026 18:30:18 GMT

This is the fourth and final post in my series tracing a packet’s journey through a Kubernetes cluster. Parts 1-3 covered foundations, pod-to-pod traffic (east-west), and north-south traffic through load balancers. This post gets to that scary and often dreaded topic: encryption. I’ll keep it simple and do encryption without a service mesh.

Encryption in Kubernetes is not a one-size-fits-all setup. It requires decisions about things like where to terminate TLS, what traffic should be encrypted, and how to manage all those certificates. Now that we have a better understanding of the path of a packet, we can make these decisions.

Encryption Boundaries

Traffic flowing into, out of, and through a Kubernetes cluster crosses several boundaries:

Client to Load Balancer: external network traffic
Load Balancer to Node: may be external or internal network traffic
Node to Ingress Controller Pod: this is the cluster network
Ingress Controller to a Backend Pod: also cluster network
Pod to Pod: also cluster network

Each boundary that you cross is a possible TLS termination point. The questions to ask are which hops on this journey need encryption, and where TLS should terminate.

TLS Termination

There are three general patterns for handling TLS in a Kubernetes cluster.

Pattern 1: TLS Terminates at the Load Balancer itself

The load balancer has the TLS certificate and can terminate the encryption. The traffic within the cluster then travels unencrypted.

Client ══════════► Load Balancer ──────────► Node ──────────► Pod
         HTTPS          │           HTTP            HTTP
                   TLS terminates
                   here

This is the simplest configuration pattern. The load balancer handles all that pesky certificate stuff, and applications within the cluster receive plain ol’ HTTP.

Configuration example for an AWS ALB (Application Load Balancer) Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789:certificate/abc-123
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
    alb.ingress.kubernetes.io/ssl-redirect: '443'
spec:
  ingressClassName: alb
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 80

Pros:

Simple certificate management: in this case AWS handles certificate renewal
No certificate configuration inside the cluster
The load balancer can inspect traffic for things like rate limiting or logging
Lower CPU usage on application pods, since they don’t spend compute resources on encryption and decryption

Cons:

Traffic flows unencrypted within the cluster network
Requires that you really trust the security of your internal network infrastructure
A poor fit when compliance or regulatory requirements mandate full end-to-end encryption

Use cases:

Internal applications where the cluster network is trusted
When the L7 load balancer features (WAF, header inspection) are required
Environments that are not subject to stricter regulatory requirements

Pattern 2: TLS Passthrough to Ingress Controller

In this case, the load balancer does not terminate the encryption; it forwards the encrypted traffic as is. TLS terminates at the Ingress Controller running inside the cluster.

Client ══════════► Load Balancer ══════════► Node ══════════► Ingress Pod ──────► Backend Pod
         HTTPS          │           HTTPS          HTTPS            │         HTTP
                   L4 passthrough                              TLS terminates
                   (no decryption)                             here

The load balancer operates at Layer 4 rather than Layer 7, forwarding the TCP connections without doing any payload inspection.
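
To make "forwarding TCP connections without payload inspection" concrete, here is a minimal Go sketch of an L4 relay. This is not how any real cloud load balancer is implemented. The function names (relay, serve, startEcho, startPassthrough, roundTrip) are mine, and a local echo server stands in for the ingress controller so the example can run by itself.

```go
package main

import (
    "fmt"
    "io"
    "net"
)

// relay copies bytes in both directions between two connections.
// It never parses the payload, so TLS records pass through untouched.
func relay(client, backend net.Conn) {
    done := make(chan struct{}, 2)
    pipe := func(dst, src net.Conn) {
        io.Copy(dst, src)
        done <- struct{}{}
    }
    go pipe(backend, client)
    go pipe(client, backend)
    <-done // one direction finished; tear down both sides
    client.Close()
    backend.Close()
}

// serve accepts connections on ln and hands each one to handle.
func serve(ln net.Listener, handle func(net.Conn)) {
    for {
        conn, err := ln.Accept()
        if err != nil {
            return
        }
        go handle(conn)
    }
}

// startEcho starts a stand-in backend that echoes whatever it receives.
func startEcho() (string, error) {
    ln, err := net.Listen("tcp", "127.0.0.1:0")
    if err != nil {
        return "", err
    }
    go serve(ln, func(c net.Conn) {
        defer c.Close()
        io.Copy(c, c)
    })
    return ln.Addr().String(), nil
}

// startPassthrough starts the relay in front of backendAddr.
func startPassthrough(backendAddr string) (string, error) {
    ln, err := net.Listen("tcp", "127.0.0.1:0")
    if err != nil {
        return "", err
    }
    go serve(ln, func(c net.Conn) {
        b, err := net.Dial("tcp", backendAddr)
        if err != nil {
            c.Close()
            return
        }
        relay(c, b)
    })
    return ln.Addr().String(), nil
}

// roundTrip sends msg through the relay and returns the echoed reply.
func roundTrip(msg string) (string, error) {
    backendAddr, err := startEcho()
    if err != nil {
        return "", err
    }
    lbAddr, err := startPassthrough(backendAddr)
    if err != nil {
        return "", err
    }
    conn, err := net.Dial("tcp", lbAddr)
    if err != nil {
        return "", err
    }
    defer conn.Close()
    if _, err := conn.Write([]byte(msg)); err != nil {
        return "", err
    }
    reply := make([]byte, len(msg))
    if _, err := io.ReadFull(conn, reply); err != nil {
        return "", err
    }
    return string(reply), nil
}

func main() {
    reply, err := roundTrip("opaque bytes, could be a TLS handshake")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println("echoed through relay:", reply)
}
```

The relay only ever calls io.Copy, so it has no way to read headers, paths, or certificates. That is exactly why an L4 load balancer can't offer L7 features.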

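Configuration example for an NGINX Ingress that terminates TLS. The L4 passthrough itself is configured on the load balancer in front of the ingress controller, not on the Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app.example.com
    secretName: app-tls-secret
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 80

Notice the tls section: the ingress controller terminates TLS using the certificate in app-tls-secret and forwards plain HTTP to the backend on port 80. Don’t add the nginx.ingress.kubernetes.io/ssl-passthrough: “true” annotation for this pattern. It tells the controller to pass the TLS connection straight through to the backend pod, which then has to terminate TLS itself. The TLS certificate is stored, in this case, in a Kubernetes Secret: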

# Create TLS secret from certificate files (bad idea for production; let a service such as cert-manager manage it)
kubectl create secret tls app-tls-secret \
  --cert=tls.crt \
  --key=tls.key

# View the secret
kubectl get secret app-tls-secret -o yaml # Blech: this prints the base64-encoded private key

Pros:

TLS terminates within the cluster boundary
The load balancer does not need any certificate access
Traffic remains encrypted between the LB and the ingress controller

Cons:

The load balancer is unable to inspect traffic, which rules out any L7 features the LB may have
You need some form of certificate management within the cluster: a service that keeps the certificate updated from some source
Traffic from the ingress controller to the backend is still unencrypted (by default)

Use cases:

When TLS must terminate within the cluster
When the load balancer should not have access to certificates (but for a managed service, there should already be sufficient security built-in)
When L4 load balancing is sufficient (it can be but it limits your options)

Pattern 3: End-to-End Encryption with Re-encryption at Each Hop (Oh… fun)

In this case, each hop uses its own TLS connection. Traffic is decrypted and re-encrypted at each hop.

Client ══════════► Load Balancer ══════════► Ingress Pod ══════════► Backend Pod
         HTTPS          │           HTTPS          │          HTTPS
                   TLS session 1              TLS session 2      TLS session 3
                   terminates                 terminates         terminates
                   re-encrypts                re-encrypts        here

This requires a certificate for every hop, which can add up to a lot of certificates depending on how many hops you have.

Configuration example (NGINX Ingress with backend HTTPS):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    nginx.ingress.kubernetes.io/proxy-ssl-verify: "on"
    nginx.ingress.kubernetes.io/proxy-ssl-secret: "default/backend-ca-secret"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app.example.com
    secretName: ingress-tls-secret
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 8443  # Backend listens on HTTPS

The backend pod must serve TLS (note the mounted certs on volumes):

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:latest
    ports:
    - containerPort: 8443
    volumeMounts:
    - name: tls-certs
      mountPath: /etc/tls
      readOnly: true
  volumes:
  - name: tls-certs
    secret:
      secretName: backend-tls-secret

Pros:

Traffic is encrypted at every segment
It supports a defense-in-depth strategy
It can satisfy stricter compliance requirements

Cons:

More complicated certificate management: you have lots of certs to manage and expiration of just one of them can break the chain
Increased latency for the extra time to do TLS handshakes
Higher CPU usage because of encryption/decryption at each hop
Just a bigger operational pain

Use cases:

Regulatory compliance requiring end-to-end encryption (government, financial institutions, health care)
Zero-trust network architectures

CNI-Level Encryption

You can instead encrypt traffic at the network layer using your CNI plugin. Several CNI plugins can encrypt all pod traffic transparently. Here are some examples. I haven’t run these in practice myself, but I present them for your education, and I encourage you to experiment.

WireGuard Encryption

WireGuard is a VPN protocol built into the Linux kernel (5.6+). It encrypts traffic at Layer 3, so the applications don’t have to care about it.

How it works:

Every cluster node generates a WireGuard keypair
The CNI configures WireGuard tunnels between nodes
All pod-to-pod traffic between nodes is encrypted
Applications see plain TCP/UDP; encryption happens lower in the network stack, below anything they touch.

Both Calico and Cilium support this.
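
Here is a small Go sketch of the first step, the per-node keypair. WireGuard keys are Curve25519 keys shown as base64, and Go’s standard crypto/ecdh package can generate that key type. This is only an illustration of why nodes need to exchange nothing but public keys. It is not how Calico or Cilium generate or distribute keys, and the function names (newNodeKey, encode, sharedSecretsMatch) are made up for the example.

```go
package main

import (
    "bytes"
    "crypto/ecdh"
    "crypto/rand"
    "encoding/base64"
    "fmt"
)

// newNodeKey generates a Curve25519 key pair, the key type WireGuard uses.
func newNodeKey() (*ecdh.PrivateKey, error) {
    return ecdh.X25519().GenerateKey(rand.Reader)
}

// encode renders key bytes as base64, the form WireGuard tools display.
func encode(key []byte) string {
    return base64.StdEncoding.EncodeToString(key)
}

// sharedSecretsMatch creates keys for two nodes and checks that each side
// derives the same secret from its own private key and the peer's public key.
func sharedSecretsMatch() (bool, error) {
    nodeA, err := newNodeKey()
    if err != nil {
        return false, err
    }
    nodeB, err := newNodeKey()
    if err != nil {
        return false, err
    }
    secretAtA, err := nodeA.ECDH(nodeB.PublicKey())
    if err != nil {
        return false, err
    }
    secretAtB, err := nodeB.ECDH(nodeA.PublicKey())
    if err != nil {
        return false, err
    }
    return bytes.Equal(secretAtA, secretAtB), nil
}

func main() {
    key, err := newNodeKey()
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println("public key to share with peers:", encode(key.PublicKey().Bytes()))

    ok, err := sharedSecretsMatch()
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println("both nodes derived the same secret:", ok)
}
```

The private key never leaves the node. WireGuard’s real handshake does more than a single Diffie-Hellman exchange, but it builds on this primitive.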

IPsec Encryption

IPsec is an older protocol for encryption. It also operates at Layer 3.

IPsec requires more configuration than WireGuard, including key exchange (IKE) setup. WireGuard is generally preferred for new deployments due to simpler configuration and better performance.

Cilium supports IPsec.

CNI Encryption

┌────────────────────────────────────────────────────────────────────────────────┐
│                         CNI-LEVEL ENCRYPTION                                   │
├────────────────────────────────────────────────────────────────────────────────┤
│                                                                                │
│   Pod A (10.244.0.5)                      Pod B (10.244.1.3)                   │
│   ┌──────────────────┐                    ┌──────────────────┐                 │
│   │  Application     │                    │  Application     │                 │
│   │  sends HTTP      │                    │  receives HTTP   │                 │
│   │  (plaintext)     │                    │  (plaintext)     │                 │
│   └────────┬─────────┘                    └────────▲─────────┘                 │
│            │                                       │                           │
│            ▼                                       │                           │
│   ┌──────────────────┐                    ┌───────┴──────────┐                 │
│   │  Kernel TCP/IP   │                    │  Kernel TCP/IP   │                 │
│   └────────┬─────────┘                    └────────▲─────────┘                 │
│            │                                       │                           │
│            ▼                                       │                           │
│   ┌──────────────────┐                    ┌───────┴──────────┐                 │
│   │  WireGuard       │                    │  WireGuard       │                 │
│   │  encrypts        │ ════════════════►  │  decrypts        │                 │
│   │  (Layer 3)       │    encrypted       │  (Layer 3)       │                 │
│   └──────────────────┘                    └──────────────────┘                 │
│                                                                                │
│   - Application code unchanged                                                 │
│   - All pod traffic encrypted automatically                                    │
│   - Encryption/decryption in kernel (fast)                                     │
│   - No certificate management per application                                  │
│                                                                                │
└────────────────────────────────────────────────────────────────────────────────┘

Pros of CNI encryption:

Transparent to applications
Encrypts all pod traffic, not just HTTP/HTTPS
No per-application certificate management (the CNI handles the keys)
Faster since it happens in the kernel
Simpler to enable in the whole cluster

Cons of CNI encryption:

No mutual authentication at the application level: the traffic is simply encrypted
No application-level identity: just IP addresses
It encrypts cross-node traffic by default, but not traffic between pods on the same node (you must configure that separately)

Use cases:

When you want to encrypt all cluster traffic without a lot of fuss
When you need encryption in transit but not application-level mTLS
Defense in depth along with application TLS

Verifying CNI Encryption

# Capture traffic between nodes (should be encrypted)
# On Node 1, capture traffic to Node 2
sudo tcpdump -i eth0 -nn host 192.168.1.11 and udp port 51820
# Output (WireGuard):
# 14:30:01.123 IP 192.168.1.10.51820 > 192.168.1.11.51820: UDP, length 128

# The payload is encrypted - you won't see pod IPs or application data

# Capture on the WireGuard interface (sees decrypted traffic); wireguard.cali is Calico's interface name
sudo tcpdump -i wireguard.cali -nn host 10.244.1.3
# Output:
# 14:30:01.123 IP 10.244.0.5.45678 > 10.244.1.3.8080: Flags [P.], seq 1:100

# Compare: without encryption, eth0 would show pod IPs directly

Application-Level TLS (Without Service Mesh)

For applications where you must have mutual TLS (mTLS) or certificate-based identity without a service mesh, you can do the TLS directly in the application or in a sidecar container in the pod.

Application-based TLS

The application handles TLS itself and loads certificates from mounted Secrets (again, Secrets that are ideally managed by a service that keeps them updated).

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:latest
    ports:
    - containerPort: 8443
    env:
    - name: TLS_CERT_FILE
      value: /etc/tls/tls.crt
    - name: TLS_KEY_FILE
      value: /etc/tls/tls.key
    - name: TLS_CA_FILE
      value: /etc/tls/ca.crt
    volumeMounts:
    - name: tls-certs
      mountPath: /etc/tls
      readOnly: true
  volumes:
  - name: tls-certs
    secret:
      secretName: my-app-tls

Application code (Go example):

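package main

import (
    "crypto/tls"
    "crypto/x509"
    "log"
    "net/http"
    "os"
)

func main() {
    // Load the server certificate and key from the mounted Secret
    cert, err := tls.LoadX509KeyPair("/etc/tls/tls.crt", "/etc/tls/tls.key")
    if err != nil {
        log.Fatal(err)
    }
    // Load the CA that signs client certificates
    caCert, err := os.ReadFile("/etc/tls/ca.crt")
    if err != nil {
        log.Fatal(err)
    }
    caCertPool := x509.NewCertPool()
    caCertPool.AppendCertsFromPEM(caCert)

    // Configure TLS with mutual authentication
    tlsConfig := &tls.Config{
        Certificates: []tls.Certificate{cert},
        ClientCAs:    caCertPool,
        ClientAuth:   tls.RequireAndVerifyClientCert,
    }

    // Empty file arguments: the certificate comes from TLSConfig
    server := &http.Server{Addr: ":8443", TLSConfig: tlsConfig}
    log.Fatal(server.ListenAndServeTLS("", ""))
}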

This does, however, place the burden of dealing with TLS on the application developers.
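
The server above demands a client certificate, so every caller needs the matching client-side setup. Here is a sketch of what that might look like, assuming the calling pod mounts its own certificate, key, and CA at the same /etc/tls paths. The function name newMTLSClient, the 5-second timeout, and the target URL are assumptions for illustration.

```go
package main

import (
    "crypto/tls"
    "crypto/x509"
    "errors"
    "fmt"
    "net/http"
    "os"
    "time"
)

// newMTLSClient builds an HTTP client that presents its own certificate
// and only trusts servers signed by the CA in caFile.
func newMTLSClient(certFile, keyFile, caFile string) (*http.Client, error) {
    cert, err := tls.LoadX509KeyPair(certFile, keyFile)
    if err != nil {
        return nil, fmt.Errorf("load client key pair: %w", err)
    }
    caPEM, err := os.ReadFile(caFile)
    if err != nil {
        return nil, fmt.Errorf("read CA file: %w", err)
    }
    pool := x509.NewCertPool()
    if !pool.AppendCertsFromPEM(caPEM) {
        return nil, errors.New("no certificates found in CA file")
    }
    return &http.Client{
        Timeout: 5 * time.Second,
        Transport: &http.Transport{
            TLSClientConfig: &tls.Config{
                Certificates: []tls.Certificate{cert}, // sent to the server
                RootCAs:      pool,                    // used to verify the server
                MinVersion:   tls.VersionTLS12,
            },
        },
    }, nil
}

func main() {
    client, err := newMTLSClient("/etc/tls/tls.crt", "/etc/tls/tls.key", "/etc/tls/ca.crt")
    if err != nil {
        fmt.Println("cannot build client:", err)
        return
    }
    // Hypothetical in-cluster URL; the host name must match a DNS name
    // in the server's certificate.
    resp, err := client.Get("https://my-app.default.svc.cluster.local:8443/")
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}
```

Note the symmetry: the server sets ClientCAs to verify callers, and the client sets RootCAs to verify the server. Both sides present a certificate, which is what makes it mutual.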

Certificate Management with cert-manager

cert-manager is a Kubernetes add-on that automates certificate issuance and renewal within the cluster.

Installing cert-manager, example:

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

Create a Certificate Authority (self-signed for internal use):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: internal-ca
  secretName: internal-ca-secret
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca-issuer
spec:
  ca:
    secretName: internal-ca-secret

Issue certificates for applications:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-app-cert
  namespace: default
spec:
  secretName: my-app-tls
  duration: 2160h  # 90 days
  renewBefore: 360h  # 15 days before expiry
  subject:
    organizations:
    - my-company
  commonName: my-app.default.svc.cluster.local
  dnsNames:
  - my-app
  - my-app.default
  - my-app.default.svc
  - my-app.default.svc.cluster.local
  issuerRef:
    name: internal-ca-issuer
    kind: ClusterIssuer

# Verify certificate was issued
kubectl get certificate my-app-cert
# Output:
# NAME          READY   SECRET       AGE
# my-app-cert   True    my-app-tls   5m

# View certificate details
kubectl get secret my-app-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout
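
The duration and renewBefore values are just hour counts, so it’s worth checking the arithmetic. This little Go sketch works out when renewal is due, assuming renewal starts renewBefore ahead of the certificate’s expiry. The function name renewAfterDays is mine.

```go
package main

import (
    "fmt"
    "time"
)

// renewAfterDays reports how many days after issuance renewal is due,
// given the Certificate's duration and renewBefore values.
func renewAfterDays(duration, renewBefore string) (float64, error) {
    d, err := time.ParseDuration(duration)
    if err != nil {
        return 0, err
    }
    rb, err := time.ParseDuration(renewBefore)
    if err != nil {
        return 0, err
    }
    if rb >= d {
        return 0, fmt.Errorf("renewBefore %s must be shorter than duration %s", rb, d)
    }
    return (d - rb).Hours() / 24, nil
}

func main() {
    days, err := renewAfterDays("2160h", "360h")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    // 2160h is 90 days and 360h is 15 days, so this prints 75.
    fmt.Println("renewal is due", days, "days after issuance")
}
```

So with these settings a certificate is replaced on day 75 of its 90-day life, leaving a 15-day window to notice and fix a failed renewal.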

For external certs, Let’s Encrypt is a time-tested way to handle it:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account
    solvers:
    - http01:
        ingress:
          class: nginx

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-example-com
  namespace: default
spec:
  secretName: app-example-com-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
  - app.example.com

Network Policies

Network Policies are a different level of security and are not encryption, but they are complementary. They determine which pods can communicate with which other pods at Layer 3/4. By default all pods can talk to all other pods and that is often undesirable.

How Kubernetes Network Policies Work

Network Policies are resources that define ingress and egress rules for pods. The CNI plugin will enforce these rules, for example using iptables rules.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
  - to:  # Allow DNS
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53

What this policy does:

It applies to pods with the label app: backend
It allows ingress only from pods with the label app: frontend, on port 8080
It allows egress only to pods with the label app: database, on port 5432
And it allows egress to kube-dns for DNS resolution, which is kinda important!
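
If it helps to think of the policy as code, here is a toy Go model of the same allow rules. It is only a mental model: real enforcement is done by the CNI plugin, and this sketch ignores namespaces, IP blocks, and everything else a real NetworkPolicy can express. All type and function names here are invented for the example.

```go
package main

import "fmt"

// labels is a set of pod labels or a label selector.
type labels map[string]string

// rule is one allow rule: a peer selector plus a protocol and port.
type rule struct {
    peer     labels
    protocol string
    port     int
}

// matches reports whether the pod carries every label in the selector.
func matches(selector, pod labels) bool {
    for k, v := range selector {
        if pod[k] != v {
            return false
        }
    }
    return true
}

// allowed reports whether any rule admits the peer on protocol/port.
// Once a policy selects a pod, traffic that no rule allows is dropped.
func allowed(rules []rule, peer labels, protocol string, port int) bool {
    for _, r := range rules {
        if r.protocol == protocol && r.port == port && matches(r.peer, peer) {
            return true
        }
    }
    return false
}

// The ingress and egress rules of backend-policy.
var backendIngress = []rule{
    {peer: labels{"app": "frontend"}, protocol: "TCP", port: 8080},
}

var backendEgress = []rule{
    {peer: labels{"app": "database"}, protocol: "TCP", port: 5432},
    {peer: labels{"k8s-app": "kube-dns"}, protocol: "UDP", port: 53},
}

func main() {
    frontend := labels{"app": "frontend"}
    batch := labels{"app": "batch"}
    database := labels{"app": "database"}

    fmt.Println("frontend -> backend:8080:", allowed(backendIngress, frontend, "TCP", 8080))
    fmt.Println("batch -> backend:8080:", allowed(backendIngress, batch, "TCP", 8080))
    fmt.Println("backend -> database:5432:", allowed(backendEgress, database, "TCP", 5432))
    fmt.Println("backend -> database:3306:", allowed(backendEgress, database, "TCP", 3306))
}
```

The first and third checks come back true and the other two false: the rules are an allow-list, and anything not listed is denied.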

Default Deny Policy

As mentioned earlier, by default pods will accept traffic from any source. You can set a default deny policy and then permit only the traffic you explicitly allow:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: default
spec:
  podSelector: {}  # Applies to all pods in namespace
  policyTypes:
  - Ingress
  - Egress

Network Policy Limits

NetworkPolicy operates at L3/L4 (IP addresses and ports)
There is no application-layer (L7) filtering (HTTP paths, headers)
Pod identity is IP-based, not certificate-based
It does not encrypt traffic
It requires CNI support (not all CNIs implement NetworkPolicy; Flannel, notably, does not)

Verifying Network Policy Enforcement

# Check if CNI supports NetworkPolicy
kubectl get pods -n kube-system -l k8s-app=calico-node
# or
kubectl get pods -n kube-system -l k8s-app=cilium

# Test connectivity from a pod without the app: frontend label (should be blocked by policy)
kubectl exec -it other-pod -- curl -m 5 http://backend:8080
# Output: curl: (28) Connection timed out

# Test allowed connectivity from a pod labeled app: frontend
kubectl exec -it frontend-pod -- curl -m 5 http://backend:8080
# Output: HTTP 200 OK

# View iptables rules created by NetworkPolicy (Calico)
sudo iptables -L cali-pi-xxxx -n -v

Combining Encryption Layers

To follow a true defense-in-depth approach, you should combine multiple layers:

TLS at the edge: HTTPS from clients to load balancer
CNI Encryption: WireGuard for all pod-to-pod traffic (it just does it)
Network Policies: Restricting which pods can communicate with each other
Application TLS (not always needed): mTLS for sensitive services and in higher regulatory environments

┌────────────────────────────────────────────────────────────────────────────────┐
│                    DEFENSE IN DEPTH: COMBINED APPROACH                         │
├────────────────────────────────────────────────────────────────────────────────┤
│                                                                                │
│   Internet                          Cluster                                    │
│                                                                                │
│   ┌────────┐     ┌──────────┐     ┌────────────────────────────────────────┐  │
│   │ Client │     │    LB    │     │                                        │  │
│   └───┬────┘     └────┬─────┘     │  ┌──────────┐      ┌──────────┐       │  │
│       │               │           │  │ Ingress  │      │ Backend  │       │  │
│       │   HTTPS       │  HTTPS    │  │   Pod    │      │   Pod    │       │  │
│       │◄═════════════►│◄═════════►│◄═│          │◄═════│          │       │  │
│       │   TLS 1.3     │  TLS 1.3  │  └──────────┘      └──────────┘       │  │
│       │               │           │       │                 │              │  │
│       │               │           │       │   WireGuard     │              │  │
│       │               │           │       │◄═══════════════►│              │  │
│       │               │           │       │   encrypted     │              │  │
│       │               │           │                                        │  │
│       │               │           │  NetworkPolicy: only frontend          │  │
│       │               │           │  can reach backend on 8080             │  │
│       │               │           │                                        │  │
│       │               │           └────────────────────────────────────────┘  │
│                                                                                │
│   Layer:  Edge TLS      Ingress TLS    CNI Encryption    NetworkPolicy       │
│           (HTTPS)       (HTTPS)        (WireGuard)       (L3/L4 ACL)         │
│                                                                                │
└────────────────────────────────────────────────────────────────────────────────┘

Choosing Your Strategy

Decision Flow

Do you need to encrypt traffic inside the cluster?
├── No → TLS termination at load balancer
└── Yes
    ├── Is compliance satisfied by network-layer encryption?
    │   ├── Yes → CNI encryption (WireGuard/IPsec), Simple
    │   └── No (need application-level identity)
    │       ├── Can you use a service mesh?
    │       │   ├── Yes → Istio/Linkerd mTLS (takes care of a lot of the mess for you)
    │       │   └── No → Application TLS with cert-manager (blech)
    │       └── Need E2E encryption?
    │           └── Yes → Re-encryption at each hop
    └── Do you need to restrict which pods can communicate?
        └── Yes → Add Network Policies
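
The same flow can be written as a small Go function, which makes it easier to see that the branches are additive (you can end up with more than one recommendation). This is a literal transcription of the tree above and nothing more. The struct, field, and function names are made up.

```go
package main

import "fmt"

// needs captures the answers to the questions in the decision flow.
type needs struct {
    encryptInCluster   bool // encrypt traffic inside the cluster?
    networkLayerEnough bool // is network-layer encryption enough for compliance?
    canUseServiceMesh  bool // is a service mesh an option?
    endToEnd           bool // is end-to-end encryption required?
    restrictPodTraffic bool // restrict which pods can communicate?
}

// recommend walks the decision flow and returns the suggested building blocks.
func recommend(n needs) []string {
    if !n.encryptInCluster {
        return []string{"TLS termination at load balancer"}
    }
    var out []string
    if n.networkLayerEnough {
        out = append(out, "CNI encryption (WireGuard/IPsec)")
    } else {
        if n.canUseServiceMesh {
            out = append(out, "Service mesh mTLS (Istio/Linkerd)")
        } else {
            out = append(out, "Application TLS with cert-manager")
        }
        if n.endToEnd {
            out = append(out, "Re-encryption at each hop")
        }
    }
    if n.restrictPodTraffic {
        out = append(out, "Network Policies")
    }
    return out
}

func main() {
    plan := recommend(needs{
        encryptInCluster:   true,
        networkLayerEnough: true,
        restrictPodTraffic: true,
    })
    for _, step := range plan {
        fmt.Println("-", step)
    }
}
```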

Troubleshooting Encryption

TLS Certificate Issues

# Test TLS connection to a service
openssl s_client -connect app.example.com:443 -servername app.example.com

# View certificate details
echo | openssl s_client -connect app.example.com:443 2>/dev/null | openssl x509 -text -noout

# Check certificate expiry
kubectl get secret my-app-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -enddate -noout
# Output: notAfter=Mar 15 12:00:00 2024 GMT

# View cert-manager certificate status
kubectl describe certificate my-app-cert
# Look for Ready condition and any error messages

# View cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager -f

WireGuard Problems

# Check WireGuard status on node
sudo wg show
# Verify peers are connected and traffic is flowing

# Check for WireGuard errors in CNI logs
# Calico:
kubectl logs -n calico-system -l k8s-app=calico-node | grep -i wireguard

# Cilium:
kubectl logs -n kube-system -l k8s-app=cilium | grep -i wireguard

# Verify kernel module is loaded
lsmod | grep wireguard
# Output: wireguard  81920  0

# Check if traffic is actually encrypted (should see UDP 51820)
sudo tcpdump -i eth0 -nn udp port 51820

Network Policy Problems

# List all network policies
kubectl get networkpolicies -A

# Describe a specific policy
kubectl describe networkpolicy backend-policy

# Test connectivity from a debug pod
kubectl run debug --rm -it --image=busybox -- wget -qO- --timeout=5 http://backend:8080

# Check CNI logs for policy enforcement
# Calico:
kubectl logs -n calico-system -l k8s-app=calico-node | grep -i policy

# Cilium:
cilium policy get
cilium monitor --type policy-verdict

Summary

Encryption in Kubernetes involves decisions at several levels:

Edge encryption (from client to cluster):

TLS terminates at the load balancer: simplest, but traffic unencrypted inside cluster
TLS passthrough: TLS terminates at Ingress Controller within cluster
Re-encryption: TLS at every hop, the most secure but more to manage

Cluster encryption (from pod to pod):

CNI encryption (WireGuard/IPsec): encrypts all cross-node traffic transparently
Application TLS: application controls certificates, enables mTLS
Service mesh (not covered here): automates mTLS with sidecars (lots of batteries included and more options for filtering traffic)

Network access control:

Network Policies: L3/L4 rules restricting which pods can communicate
This is NOT encryption, but a complementary security layer to go along with encryption.

The right choice depends on your compliance requirements, your tolerance for operational complexity, and the threat model behind your security policies. In a lot of cases, TLS at the edge combined with CNI encryption is more than enough and more manageable. Environments that require application-level identity and mTLS need application TLS with cert-manager or a service mesh.

References

Kubernetes TLS and Ingress

Ingress TLS: https://kubernetes.io/docs/concepts/services-networking/ingress/#tls
Securing a Cluster: https://kubernetes.io/docs/tasks/administer-cluster/securing-a-cluster/

cert-manager

cert-manager Documentation: https://cert-manager.io/docs/
Installation: https://cert-manager.io/docs/installation/
ACME Issuer: https://cert-manager.io/docs/configuration/acme/
CA Issuer: https://cert-manager.io/docs/configuration/ca/

Let’s Encrypt

Let’s Encrypt: https://letsencrypt.org/
ACME Protocol: https://datatracker.ietf.org/doc/html/rfc8555

CNI Encryption

Calico WireGuard: https://docs.tigera.io/calico/latest/network-policy/encrypt-cluster-pod-traffic
Cilium Encryption: https://docs.cilium.io/en/stable/security/network/encryption/
WireGuard: https://www.wireguard.com/

Network Policies

Network Policies: https://kubernetes.io/docs/concepts/services-networking/network-policies/
Network Policy Recipes: https://github.com/ahmetb/kubernetes-network-policy-recipes

TLS Best Practices

Mozilla SSL Configuration Generator: https://ssl-config.mozilla.org/
TLS 1.3 RFC: https://datatracker.ietf.org/doc/html/rfc8446

NGINX Ingress Controller

TLS/HTTPS: https://kubernetes.github.io/ingress-nginx/user-guide/tls/
Backend HTTPS: https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#backend-protocol
SSL Passthrough: https://kubernetes.github.io/ingress-nginx/user-guide/tls/#ssl-passthrough

Security Frameworks

NIST Cryptographic Standards: https://csrc.nist.gov/projects/cryptographic-standards-and-guidelines
CIS Kubernetes Benchmark: https://www.cisecurity.org/benchmark/kubernetes

Kubernetes Networking Deep Dive, Part 3

Hugh Tipping — Fri, 20 Feb 2026 18:00:17 GMT

This is the third post in my four-part series tracking packets as they flow through a Kubernetes cluster. In Part 2, I went over pod-to-pod (east-west) traffic. Now let’s talk about traffic from an external user, through a LoadBalancer into the cluster, to a pod, and back again. Yep, all that.

Every packet destined for a Kubernetes Service has to pass through iptables rules that select a backend pod and modify packet headers (assuming kube-proxy runs in iptables mode). Understanding this path is important for debugging connectivity problems, latency, and service configuration.

The Service Abstraction

Let’s dig a bit into the Service resource type in Kubernetes.

Pods are “ephemeral,” meaning they are temporary. Every time a pod gets created or restarted, it gets a new IP address. Trying to connect to a pod’s IP address directly is brittle since the IP can change at a moment’s notice. Instead, use a Service to provide a more stable endpoint that will route your traffic to the pods it exposes.

Service Types

ClusterIP (default): This will allocate a virtual IP from the Service CIDR for the service itself. You can only get to this IP from inside the cluster. It appears only in iptables or IPVS rules, not on any kind of network interface.

NodePort: This type of service opens a port (default range 30000-32767) directly on every node in the cluster. External traffic can reach the service via <NodeIP>:<NodePort>.

LoadBalancer: This provisions a load balancer outside of the cluster within whatever platform you’re using (a cloud provider, or MetalLB for bare-metal servers). The load balancer obtains an external IP address and forwards traffic to the NodePort.

# View services and their types
kubectl get svc -o wide
# Output:
# NAME         TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
# kubernetes   ClusterIP      10.96.0.1     <none>          443/TCP        30d
# my-app       LoadBalancer   10.96.0.15    203.0.113.50    80:30080/TCP   5d
# internal-api ClusterIP      10.96.0.42    <none>          8080/TCP       5d

For the my-app service, the 80:30080/TCP means: external port 80 coming into the LB maps to the NodePort listening on 30080.

Scenario

For example, let’s say we trace traffic to a LoadBalancer service with three backend pods:

External client: 198.51.100.5 (public internet)
Load balancer external IP: 203.0.113.50 (provided by the LoadBalancer provisioner)
NodePort: 30080
Service ClusterIP: 10.96.0.15
Backend pods:
- Pod 1: 10.244.0.5 on Node 1 (192.168.1.10)
- Pod 2: 10.244.1.3 on Node 2 (192.168.1.11)
- Pod 3: 10.244.2.2 on Node 3 (192.168.1.12)

The Ingress, Step by Step

Step 1: A Client Initiates Request (Layer 7/4/3)

The client’s browser connects to http://203.0.113.50 (the load balancer). The client’s TCP/IP stack creates a packet:

Source IP: 198.51.100.5 (the client)
Destination IP: 203.0.113.50 (the load balancer)
Source port: 54321 (ephemeral)
Destination port: 80 (what the load balancer listens on)

Step 2: Load Balancer Receives Traffic (Layer 4)

The external load balancer receives the packet on its external IP. Then, the load balancer:

Accepts the TCP connection (three-way handshake)
Selects a healthy backend node from its pool (nodes with NodePort 30080)
Forwards the traffic to that node

The load balancer performs health checks against the NodePort to determine if the node is ready to accept traffic.

# Example health check with netcat (what the LB does internally)
# TCP connect to each node on port 30080 to ensure it's responding.
nc -zv 192.168.1.10 30080
nc -zv 192.168.1.11 30080
nc -zv 192.168.1.12 30080
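
The same TCP-connect check is easy to sketch in Go. This is a guess at the general shape of a load balancer’s health checker, not any provider’s actual implementation. The names tcpHealthy, healthyNodes, and probeTimeout are assumptions, as is the 500 ms timeout.

```go
package main

import (
    "fmt"
    "net"
    "time"
)

// probeTimeout is an arbitrary timeout chosen for this sketch.
const probeTimeout = 500 * time.Millisecond

// tcpHealthy reports whether a TCP connection to addr succeeds within timeout.
func tcpHealthy(addr string, timeout time.Duration) bool {
    conn, err := net.DialTimeout("tcp", addr, timeout)
    if err != nil {
        return false
    }
    conn.Close()
    return true
}

// healthyNodes returns the nodes that accept connections on nodePort.
func healthyNodes(nodes []string, nodePort string, timeout time.Duration) []string {
    var healthy []string
    for _, node := range nodes {
        if tcpHealthy(net.JoinHostPort(node, nodePort), timeout) {
            healthy = append(healthy, node)
        }
    }
    return healthy
}

func main() {
    // Node IPs and NodePort from the scenario above.
    nodes := []string{"192.168.1.10", "192.168.1.11", "192.168.1.12"}
    fmt.Println("healthy nodes:", healthyNodes(nodes, "30080", probeTimeout))
}
```

Only nodes that come back healthy stay in the pool the load balancer picks from.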

Depending on how the load balancer is configured:

SNAT mode: LB changes the source IP to its own IP (this lets you restrict incoming traffic to only the LB. If the original source IP is important, an L7 load balancer can place it in an X-Forwarded-For header for the backend application to read)
DSR/Transparent mode: LB preserves client source IP
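
On the application side, recovering the client address from X-Forwarded-For might look like this Go sketch. The header is a comma-separated list with the original client first. The function name clientIP is mine, and the example assumes the header was set by a load balancer you trust, since any client can send this header itself.

```go
package main

import (
    "fmt"
    "net"
    "strings"
)

// clientIP returns the original client address for a request.
// xff is the X-Forwarded-For header value ("" if absent) and remoteAddr
// is the peer address of the TCP connection in host:port form.
func clientIP(xff, remoteAddr string) string {
    // The first entry in the list is the original client.
    first := strings.TrimSpace(strings.Split(xff, ",")[0])
    if ip := net.ParseIP(first); ip != nil {
        return ip.String()
    }
    // No usable header: fall back to the connection's peer address.
    host, _, err := net.SplitHostPort(remoteAddr)
    if err != nil {
        return remoteAddr
    }
    return host
}

func main() {
    // SNAT mode: the connection comes from the LB, the header carries the client.
    fmt.Println(clientIP("198.51.100.5", "203.0.113.50:41000"))
    // Source IP preserved: no header needed.
    fmt.Println(clientIP("", "198.51.100.5:54321"))
}
```

Both calls print 198.51.100.5, the client from the scenario.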

Step 3: Packet Arrives at Node (Layer 3)

The load balancer forwards the packet to Node 1 (192.168.1.10):

Source IP: 198.51.100.5 (client, preserved)
Destination IP: 192.168.1.10 (node)
Destination port: 30080 (NodePort)

The packet goes to the node’s physical interface (eth0).

Step 4: iptables PREROUTING Chain (Layer 3)

The packet first passes through the PREROUTING chain in the iptables nat table. This is where Kubernetes service routing starts.

sudo iptables -t nat -L PREROUTING -n --line-numbers
# Output:
# Chain PREROUTING (policy ACCEPT)
# num  target     prot opt source               destination
# 1    KUBE-SERVICES  all  --  0.0.0.0/0        0.0.0.0/0

From the above, all traffic is sent to the KUBE-SERVICES chain.

Step 5: KUBE-SERVICES Chain (Layer 3)

The KUBE-SERVICES chain contains rules for all the Services in the cluster. It matches each packet against destination IP:port combinations.

sudo iptables -t nat -L KUBE-SERVICES -n | head -20
# Output:
# Chain KUBE-SERVICES (2 references)
# target                     prot opt source       destination
# KUBE-SVC-XXXX1             tcp  --  0.0.0.0/0    10.96.0.15    /* default/my-app cluster IP */ tcp dpt:80
# KUBE-NODEPORTS             all  --  0.0.0.0/0    0.0.0.0/0     ADDRTYPE match dst-type LOCAL

For NodePort traffic, the destination is a local node IP, not the ClusterIP. The rule ADDRTYPE match dst-type LOCAL catches this and then goes to the chain KUBE-NODEPORTS. (Dizzy yet?)

Step 6: KUBE-NODEPORTS Chain (Layer 3)

This chain matches the actual NodePort numbers:

sudo iptables -t nat -L KUBE-NODEPORTS -n
# Output:
# Chain KUBE-NODEPORTS (1 references)
# target                     prot opt source       destination
# KUBE-EXT-XXXX1             tcp  --  0.0.0.0/0    0.0.0.0/0    /* default/my-app */ tcp dpt:30080

Traffic to port 30080 then moves to the KUBE-EXT-XXXX1 chain for that particular Service (external traffic handling for this Service).

Step 7: KUBE-EXT Chain and KUBE-SVC Chain (Layer 3)

The KUBE-EXT chain handles external traffic policy and then jumps to the service chain:

sudo iptables -t nat -L KUBE-EXT-XXXX1 -n
# Output (externalTrafficPolicy: Cluster):
# Chain KUBE-EXT-XXXX1 (1 references)
# target                     prot opt source       destination
# KUBE-MARK-MASQ             all  --  0.0.0.0/0    0.0.0.0/0
# KUBE-SVC-XXXX1             all  --  0.0.0.0/0    0.0.0.0/0

KUBE-MARK-MASQ marks the packet for source NAT (SNAT) later. This is necessary because the packet may be forwarded to a pod on a different node.

The KUBE-SVC chain does load balancing across endpoints within the cluster (the different available pods):

sudo iptables -t nat -L KUBE-SVC-XXXX1 -n
# Output:
# Chain KUBE-SVC-XXXX1 (2 references)
# target                     prot opt source       destination
# KUBE-SEP-AAAA1             all  --  0.0.0.0/0    0.0.0.0/0    statistic mode random probability 0.33333333349
# KUBE-SEP-BBBB2             all  --  0.0.0.0/0    0.0.0.0/0    statistic mode random probability 0.50000000000
# KUBE-SEP-CCCC3             all  --  0.0.0.0/0    0.0.0.0/0

Now, probability rules implement random load balancing for picking which pod:

First rule: 33.3% chance (1/3)
Second rule: 50% of remaining (1/2 of 2/3 = 1/3)
Third rule: 100% of remaining (1/3)

Each endpoint gets equal probability.
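
If the "50% of remaining" arithmetic feels like sleight of hand, here is a small Python sketch that works it out for any number of endpoints. The function names are made up for this example. It only models the arithmetic of the statistic rules, not kube-proxy itself.

```python
from fractions import Fraction


def rule_probabilities(n: int) -> list[Fraction]:
    """Probability written on each of the n rules: 1/n, 1/(n-1), ..., 1/1."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_shares(n: int) -> list[Fraction]:
    """Overall chance that each rule is the one that matches."""
    shares = []
    remaining = Fraction(1)  # probability that no earlier rule matched
    for p in rule_probabilities(n):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


print(rule_probabilities(3))  # 1/3, 1/2, 1
print(effective_shares(3))    # 1/3, 1/3, 1/3
```

The last rule needs no probability at all because it only sees the packets that every earlier rule passed over.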

Step 8: KUBE-SEP Chain - DNAT (Layer 3)

Assume the random selection chooses KUBE-SEP-BBBB2 (Pod 2 on Node 2):

sudo iptables -t nat -L KUBE-SEP-BBBB2 -n
# Output:
# Chain KUBE-SEP-BBBB2 (1 references)
# target                     prot opt source       destination
# KUBE-MARK-MASQ             all  --  10.244.1.3   0.0.0.0/0
# DNAT                       tcp  --  0.0.0.0/0    0.0.0.0/0    tcp to:10.244.1.3:8080

The DNAT rule rewrites the destination:

Before: dst 192.168.1.10:30080 (the NodePort)
After: dst 10.244.1.3:8080 (the Pod’s actual IP address! We’re nearly there!)

The packet now has:

Source IP: 198.51.100.5 (client)
Destination IP: 10.244.1.3 (Pod 2)
Destination port: 8080
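
To make the rewrite concrete, here is a toy Python model of what DNAT does to the packet's addressing. The Packet class and dnat function are illustrative names of my own; the real work happens inside the kernel's netfilter code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def dnat(packet: Packet, to_ip: str, to_port: int) -> Packet:
    """Rewrite only the destination; the source is left untouched."""
    return replace(packet, dst_ip=to_ip, dst_port=to_port)


# Values from the walkthrough: client -> NodePort, then rewritten to Pod 2.
before = Packet("198.51.100.5", 54321, "192.168.1.10", 30080)
after = dnat(before, "10.244.1.3", 8080)
print(before)
print(after)
```

Note that the source is still the client at this point. That only changes later, in POSTROUTING.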

Step 9: Routing Decision (Layer 3)

After PREROUTING, the kernel does some routing magic. The destination 10.244.1.3 is on Node 2, not local to this node. The packet must be forwarded.

ip route get 10.244.1.3
# Output (VXLAN example):
# 10.244.1.3 via 10.244.1.0 dev flannel.1 src 10.244.0.0

The packet will head out the flannel.1 interface to get to Node 2.

Step 10: iptables FORWARD Chain (Layer 3)

The packet passes through the FORWARD chain in the filter table:

sudo iptables -L FORWARD -n | head -10
# Output:
# Chain FORWARD (policy ACCEPT)
# target     prot opt source               destination
# KUBE-FORWARD  all  --  0.0.0.0/0        0.0.0.0/0
# KUBE-SERVICES  all  --  0.0.0.0/0       0.0.0.0/0   ctstate NEW

Step 11: iptables POSTROUTING Chain - SNAT (Layer 3)

Before the packet leaves the node, it passes through POSTROUTING in the nat table (Don’t worry if it’s not all familiar to you):

sudo iptables -t nat -L POSTROUTING -n
# Output:
# Chain POSTROUTING (policy ACCEPT)
# target                     prot opt source       destination
# KUBE-POSTROUTING           all  --  0.0.0.0/0    0.0.0.0/0

sudo iptables -t nat -L KUBE-POSTROUTING -n
# Output:
# Chain KUBE-POSTROUTING (1 references)
# target     prot opt source               destination
# RETURN     all  --  0.0.0.0/0            0.0.0.0/0    mark match ! 0x4000/0x4000
# MARK       all  --  0.0.0.0/0            0.0.0.0/0    MARK xor 0x4000
# MASQUERADE all  --  0.0.0.0/0            0.0.0.0/0

The packet was marked by KUBE-MARK-MASQ earlier. MASQUERADE performs SNAT, changing the source IP to the node’s IP:

Before: src 198.51.100.5
After: src 192.168.1.10

The packet now has:

Source IP: 192.168.1.10 (Node 1)
Destination IP: 10.244.1.3 (Pod 2)
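
The three KUBE-POSTROUTING rules read a bit cryptically, so here is a rough Python sketch of their logic. It is a simplification under my own assumptions: the function name is invented, and MASQUERADE is modelled as simply "use the node's IP as the source".

```python
MASQ_MARK = 0x4000  # the bit set earlier by KUBE-MARK-MASQ


def kube_postrouting(src_ip: str, mark: int, node_ip: str) -> tuple[str, int]:
    """Return (source IP, mark) after the three rules in the listing above."""
    # Rule 1: RETURN if the masquerade bit is not set.
    if (mark & MASQ_MARK) != MASQ_MARK:
        return src_ip, mark
    # Rule 2: clear the bit with xor so the packet is not processed twice.
    mark ^= MASQ_MARK
    # Rule 3: MASQUERADE, modelled here as swapping in the node's IP.
    return node_ip, mark


print(kube_postrouting("198.51.100.5", 0x4000, "192.168.1.10"))
print(kube_postrouting("198.51.100.5", 0x0000, "192.168.1.10"))
```

Only the marked packet gets its source rewritten; the unmarked one passes through unchanged.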

Step 12: Cross-Node Forwarding (Layer 2/3)

The packet is forwarded to Node 2 using the CNI’s cross-node mechanism (VXLAN, BGP, etc.) as described in Part 2.

Step 13: Packet Arrives at Pod (Layer 3/4/7)

On Node 2, the packet is decapsulated (if overlay) and routed to Pod 2. The pod receives:

Source IP: 192.168.1.10 (Node 1, because of SNAT)
Destination IP: 10.244.1.3 (Pod 2)
Destination port: 8080

The application sees the request as coming from Node 1, not the original client. The client IP has been lost due to SNAT.

The Return Path

Step 1: Application Responds (Layer 7/4/3)

Once the app is done doing what it needs to do with the packet, Pod 2’s application sends a response:

Source IP: 10.244.1.3
Destination IP: 192.168.1.10 (Node 1, from SNAT)
Source port: 8080
Destination port: 54321 (client’s original port, preserved in conntrack)

Step 2: Packet Routes to Node 1 (Layer 3)

The destination 192.168.1.10 is Node 1. The packet is forwarded via the CNI.

Step 3: Connection Tracking Reverses NAT (Layer 3)

When the packet arrives at Node 1, the kernel’s connection tracking (conntrack) recognizes it as a reply to an already established connection:

sudo conntrack -L | grep 10.244.1.3
# Output:
# tcp  6 117 TIME_WAIT src=198.51.100.5 dst=192.168.1.10 sport=54321 dport=30080 
#      src=10.244.1.3 dst=192.168.1.10 sport=8080 dport=54321 [ASSURED] mark=0 use=1

The conntrack entry shows the original connection (client to NodePort) and the reply direction (pod to node). The Linux kernel automatically reverses the NAT (kinda neat, eh?):

Un-DNAT: Source 10.244.1.3:8080 becomes 192.168.1.10:30080 (the NodePort, which is what the client originally connected to)
Un-SNAT: Destination 192.168.1.10 becomes 198.51.100.5 (the original client)

The packet is then sent back to the client:

Source IP: 192.168.1.10 (Node 1)
Destination IP: 198.51.100.5 (client)
Source port: 30080
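
Conceptually, conntrack is a lookup table keyed by the reply it expects to see. Here is a toy Python model of that idea, using the addresses from this walkthrough. The Tuple4 type and reverse_nat function are invented for illustration and are not how the kernel structures its data.

```python
from typing import NamedTuple


class Tuple4(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


# One tracked connection: the reply the kernel expects, mapped to the
# original packet the client sent.
original = Tuple4("198.51.100.5", 54321, "192.168.1.10", 30080)
expected_reply = Tuple4("10.244.1.3", 8080, "192.168.1.10", 54321)
conntrack = {expected_reply: original}


def reverse_nat(reply: Tuple4) -> Tuple4:
    """Rewrite a reply so it is the mirror image of the original packet."""
    orig = conntrack[reply]  # a KeyError would mean "not a tracked connection"
    return Tuple4(orig.dst_ip, orig.dst_port, orig.src_ip, orig.src_port)


print(reverse_nat(expected_reply))
```

The result has the NodePort as its source and the client as its destination, which is exactly what the client's TCP stack is waiting for.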

Step 4: Load Balancer and Client (Layer 3/4)

The packet returns through the load balancer to the client. The load balancer maintains its own connection state and may perform additional translations to present the external IP (203.0.113.50) of the Load Balancer itself as the source (this hides your internal infrastructure).

The client receives the response from 203.0.113.50:80.

externalTrafficPolicy

As if that wasn’t enough (and it was a lot), here is some additional information about the behavior of packet routing.

The default behavior (externalTrafficPolicy: Cluster) is to use SNAT, which loses the client IP. But there are other ways.

externalTrafficPolicy: Cluster (Default)

Traffic can land on any node
If the selected pod is on a different node, traffic is forwarded
SNAT is applied to ensure return traffic comes back through the same node
Client IP is lost
Load is evenly distributed

externalTrafficPolicy: Local

Traffic only goes to pods on the node that received it
If there are no local pods, the node will actually fail the health checks and the load balancer will stop sending traffic to that node
No SNAT is needed because the traffic stays local to the node.
The Client IP is preserved
Load may not be evenly distributed: the load balancer typically spreads traffic across nodes, not pods, so a pod that has a node to itself gets more traffic than pods that share a node.

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # Preserve client IP
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080

With externalTrafficPolicy: Local, the iptables rules will also change:

sudo iptables -t nat -L KUBE-EXT-XXXX1 -n
# Output (externalTrafficPolicy: Local):
# Chain KUBE-EXT-XXXX1 (1 references)
# target                     prot opt source       destination
# KUBE-SVC-XXXX1             all  --  0.0.0.0/0    0.0.0.0/0

Notice: no KUBE-MARK-MASQ, so no MASQUERADE (SNAT) will be applied.

The KUBE-SVC chain only contains endpoints local to the node:

# On Node 1, which has Pod 1 (10.244.0.5)
sudo iptables -t nat -L KUBE-SVC-XXXX1 -n
# Output:
# Chain KUBE-SVC-XXXX1 (1 references)
# target                     prot opt source       destination
# KUBE-SEP-AAAA1             all  --  0.0.0.0/0    0.0.0.0/0

In this example, only one endpoint (the local pod) is listed. Nodes without local pods have this:

# On Node 3, which has no pods for this service
sudo iptables -t nat -L KUBE-SVC-XXXX1 -n
# Output:
# Chain KUBE-SVC-XXXX1 (1 references)
# target                     prot opt source       destination
# KUBE-MARK-DROP             all  --  0.0.0.0/0    0.0.0.0/0

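The KUBE-MARK-DROP rule causes the packet to be dropped. This will cause the health check to fail since the packet is essentially thrown away. The load balancer will see this and will stop sending traffic to this particular node. (For LoadBalancer Services, Kubernetes also allocates a healthCheckNodePort on which kube-proxy reports whether the node has local endpoints.)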
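
Here is a small Python sketch of what the Local policy means for endpoint selection and per-pod load. The pod placement is hypothetical (two pods on node-1, one on node-2, none on node-3), and the even split across healthy nodes is an assumption about the load balancer, not a guarantee.

```python
from fractions import Fraction

# Hypothetical placement: pod IP -> node name.
PODS = {"10.244.0.5": "node-1", "10.244.0.6": "node-1", "10.244.1.3": "node-2"}
NODES = ["node-1", "node-2", "node-3"]


def local_endpoints(node: str) -> list[str]:
    """Pods a node may send traffic to under externalTrafficPolicy: Local."""
    return [ip for ip, n in PODS.items() if n == node]


def per_pod_share() -> dict[str, Fraction]:
    """Traffic share per pod if the LB splits evenly across healthy nodes."""
    healthy = [n for n in NODES if local_endpoints(n)]
    shares = {}
    for node in healthy:
        pods = local_endpoints(node)
        for ip in pods:
            shares[ip] = Fraction(1, len(healthy)) / len(pods)
    return shares


print(local_endpoints("node-3"))  # empty: this node should fail health checks
print(per_pod_share())
```

With this placement, the lone pod on node-2 ends up with half of the traffic while the two pods on node-1 get a quarter each.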

Choosing Between Policies

Use Cluster when:

Client IP is not needed (or you are putting the IP in an X-Forwarded-For header at the LB level)
Even load distribution is wanted
All nodes should be receiving traffic regardless of where the traffic is destined

Use Local when:

Client IP must be preserved (logging, geolocation, rate limiting)
Application needs to see a real client IP address for whatever reason
It’s ok to have uneven load balancing

IPVS Mode Differences

From what I’ve read, when kube-proxy runs in IPVS mode, the overall flow is similar to that long path above, but the Service lookup and load balancing are handled by IPVS rather than by long iptables chains.

IPVS Virtual Servers

Instead of iptables chains, IPVS creates virtual servers that you can check on with the ipvsadm command.

sudo ipvsadm -Ln
# Output:
# IP Virtual Server version 1.2.1 (size=4096)
# Prot LocalAddress:Port Scheduler Flags
#   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
# TCP  10.96.0.15:80 rr
#   -> 10.244.0.5:8080              Masq    1      2          0
#   -> 10.244.1.3:8080              Masq    1      1          0
#   -> 10.244.2.2:8080              Masq    1      3          0
# TCP  192.168.1.10:30080 rr
#   -> 10.244.0.5:8080              Masq    1      2          0
#   -> 10.244.1.3:8080              Masq    1      1          0
#   -> 10.244.2.2:8080              Masq    1      3          0

IPVS handles both ClusterIP (10.96.0.15:80) and NodePort (192.168.1.10:30080) Service types as virtual servers.

IPVS Still Uses iptables

IPVS mode still uses iptables under the hood for a few cases:

Masquerading (SNAT)
Packet marking
NodePort handling on all node IPs

For some more torture, here you go:

sudo iptables -t nat -L KUBE-POSTROUTING -n
# Output (IPVS mode):
# Chain KUBE-POSTROUTING (1 references)
# target     prot opt source               destination
# MASQUERADE all  --  0.0.0.0/0            0.0.0.0/0    match-set KUBE-LOOP-BACK dst,dst,src

IPVS Load Balancing Algorithms

IPVS supports multiple load balancing algorithms (e.g. rr = round robin).

# View the scheduler in use for a virtual server (shown after the address)
sudo ipvsadm -Ln | grep "TCP  10.96"
# Output: TCP  10.96.0.15:80 rr
# Other schedulers IPVS offers include lc, dh, sh, sed, nq

# Let's see how it's configured in the kube-proxy config
kubectl get configmap kube-proxy -n kube-system -o yaml | grep scheduler
# Output: scheduler: "rr"
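
To get a feel for how two of these schedulers differ, here is a simplified Python sketch of rr (round robin) and lc (least connection). It is only an illustration under my own assumptions: the real IPVS lc scheduler also takes inactive connections into account, and the function names here are invented.

```python
from itertools import cycle

BACKENDS = ["10.244.0.5:8080", "10.244.1.3:8080", "10.244.2.2:8080"]


def round_robin(backends):
    """rr: hand out backends in order, wrapping around at the end."""
    return cycle(backends)


def least_connection(active: dict[str, int]) -> str:
    """lc (simplified): pick the backend with the fewest active connections."""
    return min(active, key=active.get)


rr = round_robin(BACKENDS)
print([next(rr) for _ in range(4)])

# ActiveConn values taken from the ipvsadm listing earlier in this section.
active_conns = {"10.244.0.5:8080": 2, "10.244.1.3:8080": 1, "10.244.2.2:8080": 3}
print(least_connection(active_conns))
```

Round robin ignores how busy a backend is, while least connection would send the next request to 10.244.1.3:8080 because it has the fewest active connections.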

Ingress Controllers

Let’s move a little further out from the level of iptables and ipvs and examine the Ingress Controller. This resource adds another hop to the flow of traffic. Traffic flows:

Client → Load Balancer → NodePort → Ingress Controller Pod → Backend Pod

How Ingress Works

An Ingress Controller (nginx, envoy, traefik, etc.) runs as pods in the cluster
Those pods are exposed via a LoadBalancer or NodePort service (so that it routes traffic to the Ingress)
Ingresses let you route based upon things like host or path to the backend services (so that a specific host name or URL will route to a different running app).
The controller receives the traffic and proxies this traffic to backends based upon Ingress rules

kubectl get ingress
# Output:
# NAME      CLASS   HOSTS           ADDRESS         PORTS   AGE
# my-app    nginx   app.example.com 203.0.113.50    80      5d

Ingress Traffic Path

A client browser resolves, e.g., app.example.com to 203.0.113.50 (the Ingress load balancer IP)
The traffic arrives at the load balancer
The load balancer forwards to the NodePort of the Ingress Controller Service
iptables routes this traffic to an Ingress Controller pod (as we have already discussed)
The Ingress Controller examines the Host header and path
The controller then opens a new connection to the backend service (ClusterIP)
iptables routes to the backend pod as we’ve discussed before.
The response returns back through the controller to the client

The Ingress Controller terminates the original connection and creates a new one, providing L7 routing capabilities.
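
The "examines the Host header and path" step can be sketched in a few lines of Python. The rules below are hypothetical (the my-api backend does not appear anywhere in this series), and real controllers such as nginx have their own, more involved matching logic. This just shows the idea of host matching plus longest path prefix.

```python
from typing import Optional

# Hypothetical rules: (host, path prefix, backend service).
RULES = [
    ("app.example.com", "/", "my-app:80"),
    ("app.example.com", "/api", "my-api:80"),
]


def prefix_matches(prefix: str, path: str) -> bool:
    """Match whole path segments, so /api matches /api/users but not /apiary."""
    if prefix == "/":
        return True
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def pick_backend(host: str, path: str) -> Optional[str]:
    """Return the backend for the longest matching prefix on this host."""
    matches = [
        (prefix, backend)
        for rule_host, prefix, backend in RULES
        if rule_host == host and prefix_matches(prefix, path)
    ]
    if not matches:
        return None
    return max(matches, key=lambda m: len(m[0]))[1]


print(pick_backend("app.example.com", "/api/users"))
print(pick_backend("app.example.com", "/index.html"))
```

Because the controller has to read the HTTP request to do this, it cannot simply pass packets along. That is why it ends up holding two separate connections.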

# View Ingress Controller pods and their node placement
kubectl get pods -n ingress-nginx -o wide
# Output:
# NAME                                        READY   STATUS    IP           NODE
# ingress-nginx-controller-5c8d66c76d-abc12   1/1     Running   10.244.0.8   node-1
# ingress-nginx-controller-5c8d66c76d-def34   1/1     Running   10.244.1.9   node-2

Observability for North-South Traffic

Viewing iptables counters

# Watch packet counts through service chains
sudo iptables -t nat -L KUBE-SVC-XXXX1 -n -v
# Output:
# Chain KUBE-SVC-XXXX1 (2 references)
#  pkts bytes target     prot opt in     out     source               destination
#   847  50K KUBE-SEP-AAAA1  all  --  *      *   0.0.0.0/0            0.0.0.0/0    statistic mode random probability 0.333
#   823  49K KUBE-SEP-BBBB2  all  --  *      *   0.0.0.0/0            0.0.0.0/0    statistic mode random probability 0.500
#   851  51K KUBE-SEP-CCCC3  all  --  *      *   0.0.0.0/0            0.0.0.0/0

View the conntrack entries

# Watch connection tracking for a specific service
sudo conntrack -E -p tcp --dport 30080
# Output (live events):
# [NEW] tcp      6 120 SYN_SENT src=198.51.100.5 dst=192.168.1.10 sport=54321 dport=30080
# [UPDATE] tcp   6 60 SYN_RECV src=198.51.100.5 dst=192.168.1.10 sport=54321 dport=30080
# [UPDATE] tcp   6 432000 ESTABLISHED src=198.51.100.5 dst=192.168.1.10 sport=54321 dport=30080

Capture traffic at each hop

tcpdump is your friend here:

# At the node's physical interface (incoming)
sudo tcpdump -i eth0 -nn port 30080

# At the bridge (after DNAT, before forwarding)
sudo tcpdump -i cni0 -nn port 8080

# At the VXLAN interface (cross-node traffic)
sudo tcpdump -i flannel.1 -nn port 8080

# Inside the pod
kubectl exec -it my-pod -- tcpdump -i eth0 -nn port 8080

Trace the iptables processing

# Enable iptables tracing (verbose, use sparingly)
sudo iptables -t raw -A PREROUTING -p tcp --dport 30080 -j TRACE
sudo iptables -t raw -A OUTPUT -p tcp --sport 8080 -j TRACE

# View trace in kernel log
sudo dmesg -w | grep TRACE

# Clean up when done
sudo iptables -t raw -D PREROUTING -p tcp --dport 30080 -j TRACE
sudo iptables -t raw -D OUTPUT -p tcp --sport 8080 -j TRACE

Troubleshooting the North-South Traffic

Service not reachable from outside

Verify that LoadBalancer does, in fact, have an external IP address:

kubectl get svc my-app
# Check that EXTERNAL-IP is not <pending>

Verify that the NodePort is open:

# From a node
ss -tlnp | grep 30080
# Output should show kube-proxy listening

Check that the endpoints exist (you probably won’t have to do this much if ever):

kubectl get endpoints my-app
# Output:
# NAME     ENDPOINTS                                         AGE
# my-app   10.244.0.5:8080,10.244.1.3:8080,10.244.2.2:8080   5d

Verify iptables rules:

sudo iptables -t nat -L KUBE-SERVICES -n | grep my-app

Client IP is not visible to application

Check externalTrafficPolicy:

kubectl get svc my-app -o jsonpath='{.spec.externalTrafficPolicy}'
# Output: Cluster (means SNAT is applied)

Change to Local if client IP needed:

kubectl patch svc my-app -p '{"spec":{"externalTrafficPolicy":"Local"}}'

Verify pods are running on nodes receiving traffic:

kubectl get pods -o wide -l app=my-app

Connection timeouts

Check if SNAT is happening when it actually shouldn’t:

sudo conntrack -L -d <IP> | head
# Is the source IP the client's or the node's?

Verify that the CNI is forwarding cross-node traffic:

# On source node
sudo tcpdump -i flannel.1 -nn host <IP>

Check that the pod is healthy:

kubectl describe pod <pod-name> | grep -A5 Conditions

Summary

North-south traffic through a LoadBalancer service follows this path:

Client connects to external load balancer IP address
Load balancer forwards to the NodePort on a healthy node
iptables PREROUTING/KUBE-SERVICES chains intercept the packet
KUBE-SVC chain randomly selects a backend pod (this is the load balancing decision)
KUBE-SEP chain performs DNAT to the pod IP
With externalTrafficPolicy: Cluster, the packet is marked and SNAT is applied so that replies return through the same node
Packet is forwarded to the pod via CNI
Return traffic uses conntrack to reverse the NAT translations

Two choices for the configuration:

externalTrafficPolicy: Cluster: Even load distribution, loses client IP
externalTrafficPolicy: Local: Preserves client IP, may have uneven distribution

Part 4 will cover encryption in flight: where TLS terminates, CNI-level encryption options, and how to achieve end-to-end encryption without a service mesh.

References

Official Kubernetes Documentation

Service: https://kubernetes.io/docs/concepts/services-networking/service/
Service Types: https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types
External Traffic Policy: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip
Ingress: https://kubernetes.io/docs/concepts/services-networking/ingress/
Ingress Controllers: https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/

kube-proxy Documentation

kube-proxy Modes: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/
IPVS Proxy Mode: https://kubernetes.io/docs/concepts/services-networking/service/#proxy-mode-ipvs
Virtual IPs and Service Proxies: https://kubernetes.io/docs/reference/networking/virtual-ips/

Linux Networking

iptables: https://netfilter.org/documentation/
iptables-extensions (statistic module): https://man7.org/linux/man-pages/man8/iptables-extensions.8.html
conntrack: https://conntrack-tools.netfilter.org/
conntrack man page: https://man7.org/linux/man-pages/man8/conntrack.8.html
IPVS: http://www.linuxvirtualserver.org/software/ipvs.html
ipvsadm: https://man7.org/linux/man-pages/man8/ipvsadm.8.html

Cloud Load Balancers

AWS ELB: https://docs.aws.amazon.com/elasticloadbalancing/
GCP Load Balancing: https://cloud.google.com/load-balancing/docs
Azure Load Balancer: https://docs.microsoft.com/en-us/azure/load-balancer/

MetalLB (Bare Metal Load Balancer)

MetalLB: https://metallb.universe.tf/

Ingress Controllers

NGINX Ingress Controller: https://kubernetes.github.io/ingress-nginx/
Traefik: https://doc.traefik.io/traefik/providers/kubernetes-ingress/
Envoy/Contour: https://projectcontour.io/
HAProxy Ingress: https://haproxy-ingress.github.io/

Kubernetes Networking Deep Dive: Part 2

Hugh Tipping — Sun, 15 Feb 2026 18:18:27 GMT

This is the second post in a four-part series tracing packets through a Kubernetes cluster. In Part 1, we covered the foundational concepts: network namespaces, veth pairs, CNI, and kube-proxy. Now we trace actual packets between pods.

Pod-to-pod communication is often called “east-west traffic” and is the most common network traffic in a cluster. A pod connects to a service, a microservice calls another, in-cluster databases receive queries from application pods, etc. This traffic stays within the cluster and does not go through a load balancer.

Let’s look at two scenarios: pods on the same node (communication through a bridge) and pods on different nodes (requiring overlay encapsulation or direct routing).

Scenario Setup

For upcoming examples, let’s use the following setup:

Pod A: IP 10.244.0.5, running on Node 1 (192.168.1.10)
Pod B: IP 10.244.0.6, also running on Node 1
Pod C: IP 10.244.1.5, running on Node 2 (192.168.1.11)

Pod A runs a client application. Pod B and Pod C each run a server listening on port 8080. Let’s trace what happens when Pod A makes an HTTP request to Pod B (on the same node) and then to Pod C (on a different node).

Pod-to-Pod Communication on the Same Node

When two pods are running on the same node, the network traffic never leaves the host. The packet travels via the Linux bridge (that acts as a virtual switch on each node) connecting all the local pod veth endpoints.

Packet Flow Steps

Step 1: Application makes connection (OSI Layer 7)

The application in Pod A calls the connect() system call to establish a TCP connection to 10.244.0.6:8080. (A standard socket operation)

Step 2: Kernel starts building the TCP/IP packet (OSI Layers 4 and 3)

The kernel’s TCP/IP stack creates the packet:

Source IP: 10.244.0.5
Destination IP: 10.244.0.6
Source port: ephemeral, chosen by the OS (e.g., 45678)
Destination port: 8080
Protocol: TCP

Step 3: Routing is decided in Pod A’s networking namespace (OSI Layer 3)

The kernel looks at Pod A’s routing table to determine the outgoing interface:

# Inside Pod A
ip route

# Output:
# default via 10.244.0.1 dev eth0
# 10.244.0.0/24 dev eth0 proto kernel scope link src 10.244.0.5

The destination 10.244.0.6 matches the 10.244.0.0/24 route, which says “send directly via eth0.” A gateway is not needed because the destination is on the same subnet.
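
The kernel's choice here is a longest-prefix match. Here is a minimal Python sketch of that lookup against Pod A's two routes, using the standard ipaddress module. The ROUTES structure and lookup function are my own illustration, not how the kernel stores routes.

```python
import ipaddress

# Pod A's routing table from the output above: (prefix, gateway, device).
ROUTES = [
    ("0.0.0.0/0", "10.244.0.1", "eth0"),  # default route
    ("10.244.0.0/24", None, "eth0"),      # on-link, no gateway needed
]


def lookup(dst: str):
    """Longest-prefix match: the most specific matching route wins."""
    addr = ipaddress.ip_address(dst)
    candidates = [
        (ipaddress.ip_network(prefix), gateway, dev)
        for prefix, gateway, dev in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    net, gateway, dev = max(candidates, key=lambda r: r[0].prefixlen)
    return str(net), gateway, dev


print(lookup("10.244.0.6"))  # same subnet: on-link route, no gateway
print(lookup("10.244.1.5"))  # other subnet: falls through to the default route
```

Both destinations match the default route, but 10.244.0.6 also matches the more specific /24, so that route wins.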

Step 4: ARP resolution (Layer 2)

Before sending the packet, the kernel will need the MAC address of 10.244.0.6. It checks the ARP** cache (a.k.a. the neighbor table) and, if there is no entry, sends an ARP request:

# Inside Pod A's namespace
ip neigh show

# Output:
# 10.244.0.6 dev eth0 lladdr 62:a1:b2:c3:d4:e6 REACHABLE
# 10.244.0.1 dev eth0 lladdr 8a:1b:2c:3d:4e:5f REACHABLE

The kernel builds an Ethernet frame with:

Source MAC: Pod A’s eth0 MAC
Destination MAC: Pod B’s eth0 MAC (62:a1:b2:c3:d4:e6)

**ARP = Address Resolution Protocol: For discovery of OSI Layer 2 (MAC) address.

Step 5: Packet leaves Pod A via veth pair (OSI Layer 2)

The frame exits through Pod A’s eth0, which is one end of a veth pair. The packet emerges from the other end (veth-pod-a) in the node’s namespace.

Step 6: Bridge forwards the frame (OSI Layer 2)

The host-side veth is attached to a bridge (commonly named cni0, cbr0, or docker0 depending on CNI). The bridge operates like a Layer 2 switch:

It receives the frame on port veth-pod-a
It looks up the destination MAC in its forwarding table
It finds that MAC 62:a1:b2:c3:d4:e6 is reachable on port veth-pod-b
It forwards the frame out that port

You can view the bridge’s MAC table:

# On the host (FDB = Forwarding Database)
bridge fdb show br cni0 | grep -i "62:a1"

# Output:
# 62:a1:b2:c3:d4:e6 dev veth-pod-b master cni0
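
If you have never looked at how a switch builds that table, here is a toy Python model of a learning bridge. The Bridge class is purely illustrative, and Pod A's MAC address (62:a1:b2:c3:d4:e5) and the extra port veth-pod-x are made up for the example.

```python
class Bridge:
    """A toy Layer 2 switch: learn source MACs, forward by destination MAC."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.fdb = {}  # MAC address -> port name

    def receive(self, in_port, src_mac, dst_mac):
        """Return the list of ports the frame is sent out of."""
        self.fdb[src_mac] = in_port  # learn where the sender lives
        out_port = self.fdb.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the ingress one.
            return [p for p in self.ports if p != in_port]
        return [out_port]


POD_A_MAC = "62:a1:b2:c3:d4:e5"  # hypothetical
POD_B_MAC = "62:a1:b2:c3:d4:e6"  # from the fdb output above

cni0 = Bridge(["veth-pod-a", "veth-pod-b", "veth-pod-x"])
print(cni0.receive("veth-pod-a", POD_A_MAC, POD_B_MAC))  # B unknown: flooded
print(cni0.receive("veth-pod-b", POD_B_MAC, POD_A_MAC))  # A already learned
print(cni0.receive("veth-pod-a", POD_A_MAC, POD_B_MAC))  # B now learned
```

After the first exchange, the bridge knows both MACs and stops flooding.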

Step 7: Packet enters Pod B via veth pair (OSI Layer 2)

The frame enters veth-pod-b in the host namespace and emerges from eth0 in Pod B’s namespace.

Step 8: Kernel delivers to application (OSI Layers 3, 4, 7 → Back up the OSI stack)

Pod B’s kernel:

Receives the packet
Strips the Ethernet header (no longer needed) and sees that the payload is an IP packet
Verifies that the destination IP matches its own (10.244.0.6)
Strips the IP header (no longer needed here) and sees that it is TCP destined for port 8080
Delivers the payload to the application listening on that port

The return traffic (Pod B’s response) follows the same path in reverse.

Observing Same-Node Traffic

You can check out this traffic at different points:

# Capture on the bridge (sees ALL local pod traffic)
sudo tcpdump -i cni0 -nn host 10.244.0.5 and host 10.244.0.6

# Output:
# 14:23:01.234567 IP 10.244.0.5.45678 > 10.244.0.6.8080: Flags [S], seq 123456789
# 14:23:01.234789 IP 10.244.0.6.8080 > 10.244.0.5.45678: Flags [S.], seq 987654321, ack 123456790

# Capture on a specific veth (sees only that pod's traffic, great for troubleshooting)
sudo tcpdump -i veth-pod-a -nn port 8080

iptables’ Involvement

For direct pod-to-pod communication (not through a Service), iptables is not heavily involved. The packet passes through the FORWARD chain in the filter table, but unless you have NetworkPolicies configured, the default is to ACCEPT.

Another fun iptables command for you:

sudo iptables -L FORWARD -n -v | head -5
# Output:
# Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
#  pkts bytes target     prot opt in     out     source               destination
#  1.2M  890M KUBE-FORWARD  all  --  *      *       0.0.0.0/0            0.0.0.0/0
#  1.2M  890M CNI-FORWARD   all  --  *      *       0.0.0.0/0            0.0.0.0/0

The CNI-FORWARD chain generally contains rules for NetworkPolicy enforcement if your CNI supports it (note that not all do!)

Cross-Node Pod-to-Pod Traffic

When pods are on different nodes, the packet has to travel on the physical network connecting the two nodes. This is where the capabilities of the CNI plugin matter. Let’s take a high-level look at both overlay (VXLAN) and routed (BGP) approaches.

The Central Problem

Node 1 has a packet destined for 10.244.1.5 (Pod C on Node 2). The physical network between nodes has no information about pod IP addresses. It only knows how to route traffic between node IP addresses (192.168.1.10 and 192.168.1.11).

We have two solutions:

Overlay: Encapsulate the pod-to-pod packet inside a node-to-node packet
Routed: Configure the physical network to route the pod CIDRs themselves

Cross-Node Communication: Overlay (VXLAN)

VXLAN (Virtual Extensible LAN) creates a Layer 2 overlay on top of a Layer 3 network. Eek, what’s that?

Well, pod traffic is encapsulated in UDP packets sent between nodes.

Step-by-Step Packet Flow

Step 1: Application initiates connection (OSI Layer 7)

Pod A’s application connects to 10.244.1.5:8080 (Pod C). Standard socket connection like above.

Step 2: Kernel builds the TCP/IP packet (OSI Layers 4 and 3)

Source IP: 10.244.0.5
Destination IP: 10.244.1.5
Source ephemeral port: 45678
Destination port: 8080

Step 3: Routing decision done in Pod A’s namespace (OSI Layer 3)

# Inside Pod A
ip route

# Output:
# default via 10.244.0.1 dev eth0
# 10.244.0.0/24 dev eth0 proto kernel scope link src 10.244.0.5

The destination 10.244.1.5 does not match 10.244.0.0/24, so the packet will go to the default gateway (10.244.0.1).

Step 4: Packet reaches host namespace via veth

The packet exits eth0 in Pod A, and emerges from veth-pod-a in the host namespace.

Step 5: Host’s routing table lookup (OSI Layer 3)

The host’s kernel takes a look at its routing table:

# On Node 1
ip route

# Output (Flannel VXLAN example):
# default via 192.168.1.1 dev eth0
# 10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
# 10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
# 10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink

The destination 10.244.1.5 matches the 10.244.1.0/24 route via flannel.1 (if that’s the CNI you’re using; note that Flannel does not implement NetworkPolicy). The flannel.1 interface is a VXLAN Tunnel Endpoint (VTEP).

Step 6: VXLAN encapsulation (OSI Layer 2 over Layer 3)

The flannel.1 interface:

Looks up which node owns 10.244.1.0/24 (Node 2 in our case: 192.168.1.11)
It then encapsulates the original Ethernet frame within a VXLAN header
It wraps that in a UDP packet (destination port 4789)
And finally it’s wrapped in an IP packet (192.168.1.10 to 192.168.1.11)

You can see the VXLAN FDB (forwarding database):

# On Node 1
bridge fdb show dev flannel.1

# Output:
# 5a:2b:3c:4d:5e:6f dst 192.168.1.11 self permanent
# 7a:8b:9c:0d:1e:2f dst 192.168.1.12 self permanent

This maps the VTEP MAC addresses to node IPs.

Step 7: Outer packet sent to Node 2 (Layers 3 and 2)

The encapsulated packet is then routed to 192.168.1.11 over the network like any other node-to-node packet:

Source IP: 192.168.1.10 (Node 1)
Destination IP: 192.168.1.11 (Node 2)
Protocol: UDP
Destination port: 4789 (VXLAN)
Payload: VXLAN header + original Ethernet frame
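
The VXLAN header itself is only 8 bytes: a flags byte, 24 reserved bits, a 24-bit VNI, and 8 more reserved bits. Here is a Python sketch that builds and parses it with the struct module. The function names are my own, and VNI 1 is used purely for illustration. It covers only the UDP payload, not the outer IP and UDP headers that the kernel adds.

```python
import struct

VXLAN_PORT = 4789
VXLAN_FLAG_VNI_VALID = 0x08  # flag bit meaning "the VNI field is valid"


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte header: flags, 24 reserved bits, VNI, 8 reserved bits."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Return the UDP payload: VXLAN header followed by the inner frame."""
    return vxlan_header(vni) + inner_frame


def decapsulate(payload: bytes) -> tuple[int, bytes]:
    """Split a UDP payload back into (VNI, inner Ethernet frame)."""
    flags_word, vni_word = struct.unpack("!II", payload[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return vni_word >> 8, payload[8:]


payload = encapsulate(b"inner-ethernet-frame", vni=1)
print(payload[:8].hex())     # 0800000000000100
print(decapsulate(payload))  # (1, b'inner-ethernet-frame')
```

The receiving node does the reverse: it checks the flag, reads the VNI, and hands the remaining bytes to the matching VXLAN interface as an ordinary Ethernet frame.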

Step 8: Node 2 receives and removes the encapsulation (OSI Layer 3)

Node 2’s kernel:

Receives UDP packet on port 4789
Recognizes it as VXLAN traffic for the flannel.1 interface
Nixes the Ethernet, IP, UDP, and VXLAN headers (tear that envelope open!)
Extracts the original Ethernet frame

Step 9: Host routing to the destination local pod (Layer 3)

The extracted packet has destination 10.244.1.5. Node 2’s routing table:

# On Node 2
ip route

# Output:
# 10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1

A local route so the packet goes to the cni0 bridge.

Step 10: Bridge forwards to Pod C (Layer 2)

The bridge looks up the MAC address for 10.244.1.5 and forwards the frame to veth-pod-c.

Step 11: Pod C receives packet (Layers 3, 4, 7)

The packet enters Pod C’s namespace through eth0. The kernel delivers it to the application on port 8080.

VXLAN Packet Structure

At step 7, the packet on the wire looks something like this. (Yeah a bit in the weeds. I had help drawing this.)

┌────────────────────────────────────────────────────────────────────────────┐
│                    VXLAN ENCAPSULATED PACKET                               │
├───────────────┬────────────────┬──────────┬────────┬──────────┬───────────┤
│ Outer         │ Outer IP       │ UDP      │ VXLAN  │ Inner    │ Inner IP  │
│ Ethernet      │ Header         │ Header   │ Header │ Ethernet │ Packet    │
├───────────────┼────────────────┼──────────┼────────┼──────────┼───────────┤
│ dst: router   │ src: 192.168.  │ src:     │ VNI:   │ dst: Pod │ src:      │
│ MAC           │ 1.10           │ random   │ 1      │ C MAC    │ 10.244.   │
│               │ dst: 192.168.  │ dst:     │        │ src: Pod │ 0.5       │
│ src: Node1    │ 1.11           │ 4789     │        │ A MAC    │ dst:      │
│ MAC           │                │          │        │          │ 10.244.   │
│               │                │          │        │          │ 1.5       │
├───────────────┼────────────────┼──────────┼────────┼──────────┼───────────┤
│    14 bytes   │    20 bytes    │  8 bytes │8 bytes │ 14 bytes │ 20+ bytes │
└───────────────┴────────────────┴──────────┴────────┴──────────┴───────────┘
                │                                    │
                │◄──────── 50 bytes overhead ───────►│

Total overhead: 50 bytes (outer IP + UDP + VXLAN + inner Ethernet). The outer Ethernet header is not counted because it is not part of the MTU.
This is why pod MTU is typically 1450 when node MTU is 1500.
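
The arithmetic is simple enough to check in a few lines of Python. The sizes below assume an IPv4 outer header with no options; the outer Ethernet header is left out because the MTU only counts what sits inside the Ethernet frame. The names are mine.

```python
HEADER_BYTES = {
    "outer IP": 20,
    "UDP": 8,
    "VXLAN": 8,
    "inner Ethernet": 14,
}


def pod_mtu(node_mtu: int) -> int:
    """MTU left for the inner IP packet after VXLAN encapsulation."""
    return node_mtu - sum(HEADER_BYTES.values())


print(sum(HEADER_BYTES.values()))  # 50
print(pod_mtu(1500))               # 1450
```

The same function shows why jumbo frames help: a node MTU of 9000 leaves 8950 for pods.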

Viewing VXLAN Traffic

# On Node 1, capture VXLAN-encapsulated traffic
sudo tcpdump -i eth0 -nn udp port 4789

# Output:
# 14:30:01.123 IP 192.168.1.10.52341 > 192.168.1.11.4789: VXLAN, flags [I] (0x08), vni 1
# IP 10.244.0.5.45678 > 10.244.1.5.8080: Flags [S], seq 123456789

# Capture on the VXLAN interface (sees non-encapsulated traffic)
sudo tcpdump -i flannel.1 -nn host 10.244.1.5

# Output:
# 14:30:01.123 IP 10.244.0.5.45678 > 10.244.1.5.8080: Flags [S], seq 123456789

Cross-Node Communication: Routed (BGP)

In routed (BGP***) mode, there is no encapsulation. Pod IPs are advertised via BGP (or even static routes) so the physical network knows how to route them.

Prerequisites

Routed mode requires one of:

BGP peering
- between nodes and network routers
- between nodes themselves
Static routes configured on network infrastructure
Cloud VPC route table entries

***BGP = Border Gateway Protocol: standardized gateway protocol to exchange routing and reachability information among autonomous systems

Step-by-Step Packet Flow

Steps 1-4: Same as overlay

Pod A builds a packet for 10.244.1.5, and it exits via the veth pair to the host namespace.

Step 5: Host routing table lookup (Layer 3)

Notice that the host’s routing table in BGP mode looks a bit different:

# On Node 1 (Calico BGP mode)
ip route

# Output:
# default via 192.168.1.1 dev eth0
# 10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
# 10.244.1.0/24 via 192.168.1.11 dev eth0 proto bird
# 10.244.2.0/24 via 192.168.1.12 dev eth0 proto bird

The route for 10.244.1.0/24 points directly to Node 2’s IP (192.168.1.11) via the physical interface (eth0 - on the host, not to be confused with a pod’s eth0). The proto bird indicates these routes were installed by the BIRD BGP daemon used by Calico CNI (for this example).
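
If you want to see why the kernel picks that route, the lookup can be sketched in a few lines of Python. This is a toy longest-prefix match over the sample routing table above, not the kernel's actual data structure, and the ROUTES list and lookup function are names I made up for illustration.

```python
import ipaddress

# Routes copied from the sample `ip route` output above: (prefix, next hop, device)
ROUTES = [
    ("0.0.0.0/0", "192.168.1.1", "eth0"),       # default route
    ("10.244.0.0/24", None, "cni0"),            # local pods, directly connected
    ("10.244.1.0/24", "192.168.1.11", "eth0"),  # learned via BGP (proto bird)
    ("10.244.2.0/24", "192.168.1.12", "eth0"),  # learned via BGP (proto bird)
]


def lookup(dst: str):
    """Return the (prefix, next_hop, device) of the most specific matching route."""
    addr = ipaddress.ip_address(dst)
    matches = [r for r in ROUTES if addr in ipaddress.ip_network(r[0])]
    # Longest prefix wins, which is how overlapping routes are resolved
    return max(matches, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)


print(lookup("10.244.1.5"))  # ('10.244.1.0/24', '192.168.1.11', 'eth0')
print(lookup("10.244.0.7"))  # ('10.244.0.0/24', None, 'cni0')
print(lookup("8.8.8.8"))     # ('0.0.0.0/0', '192.168.1.1', 'eth0')
```

On a real node, ip route get 10.244.1.5 gives you the same answer straight from the kernel.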

Step 6: Packet sent to Node 2 (OSI Layer 3)

The packet is sent directly with:

Source IP: 10.244.0.5 (Pod A, unchanged)
Destination IP: 10.244.1.5 (Pod C, unchanged)

At Layer 2:

Source MAC: Node 1’s eth0
Destination MAC: Next hop (a router, or Node 2 itself if both nodes are on the same Layer 2 segment, as they are in this example)

Remember that there is no encapsulation. The pod IPs are visible on the physical network.

Step 7: Physical network routing

The physical network needs to know how to route 10.244.1.0/24 to Node 2. This happens via:

BGP: Nodes advertise their pod CIDRs. Routers learn the routes.
Static routes: Network admin configures routes on routers (I’m not fond of static routing. It makes maintenance tougher.)
Cloud VPC: Cloud provider handles the routing.

Step 8: Node 2 receives packet (OSI Layer 3)

Node 2 receives a packet with destination 10.244.1.5. Its routing table:

# On Node 2
ip route

# Output:
# 10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1

The destination is local, so the packet is routed to the cni0 bridge.

Steps 9-10: Same as with overlay

Bridge forwards to Pod C. Application receives the packet.

Observing BGP Routes

Again, in the weeds.

# View BGP peer status (Calico with BIRD)
sudo calicoctl node status

# Output:
# IPv4 BGP status
# +--------------+-------------------+-------+----------+-------------+
# | PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
# +--------------+-------------------+-------+----------+-------------+
# | 192.168.1.11 | node-to-node mesh | up    | 10:23:45 | Established |
# | 192.168.1.12 | node-to-node mesh | up    | 10:23:47 | Established |
# +--------------+-------------------+-------+----------+-------------+

# View routes learned via BGP
ip route show proto bird

# Output:
# 10.244.1.0/24 via 192.168.1.11 dev eth0
# 10.244.2.0/24 via 192.168.1.12 dev eth0

# Capture unencapsulated pod traffic on physical interface
sudo tcpdump -i eth0 -nn host 10.244.1.5

# Output:
# 14:35:01.234 IP 10.244.0.5.45678 > 10.244.1.5.8080: Flags [S], seq 123456789

IPinIP: A Lighter Overlay

Some CNIs (notably Calico) support IPinIP as a lighter-weight alternative to VXLAN. IPinIP encapsulates the original IP packet directly in another IP packet, without the UDP and VXLAN headers.

┌────────────────────────────────────────────────────────────────┐
│                    IPINIP PACKET                               │
├───────────────┬────────────────┬───────────────────────────────┤
│ Outer         │ Outer IP       │ Inner IP Packet               │
│ Ethernet      │ Header         │ (original pod-to-pod packet)  │
├───────────────┼────────────────┼───────────────────────────────┤
│ dst: router   │ src: 192.168.  │ src: 10.244.0.5               │
│ MAC           │ 1.10           │ dst: 10.244.1.5               │
│               │ dst: 192.168.  │ + TCP header + payload        │
│               │ 1.11           │                               │
│               │ proto: 4 (IPIP)│                               │
├───────────────┼────────────────┼───────────────────────────────┤
│    14 bytes   │    20 bytes    │         20+ bytes             │
└───────────────┴────────────────┴───────────────────────────────┘

Overhead: 20 bytes (just the outer IP header)
Pod MTU can be 1480 instead of 1450.

Calico can use IPinIP for cross-subnet traffic and direct routing for same-subnet traffic (“CrossSubnet” mode).

Comparing Approaches

Here are some nerdy numbers I looked up to help you compare overhead for different approaches.
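
The per-packet figures already quoted in this post (50 bytes for VXLAN, 20 for IPinIP, none for routed mode) can be lined up with a quick Python sketch. It assumes a 1500-byte node MTU, and the dictionary and function names are mine, purely for illustration.

```python
NODE_MTU = 1500  # typical Ethernet MTU, as assumed throughout this post

# Extra header bytes that count against the node MTU for each approach
OVERHEAD_BYTES = {
    "VXLAN": 20 + 8 + 8 + 14,  # outer IP + UDP + VXLAN + inner Ethernet
    "IPinIP": 20,              # outer IP header only
    "Routed (BGP)": 0,         # no encapsulation
}


def pod_mtu(mode: str, node_mtu: int = NODE_MTU) -> int:
    """Largest pod MTU that still fits in one node-sized packet."""
    return node_mtu - OVERHEAD_BYTES[mode]


for mode, overhead in OVERHEAD_BYTES.items():
    share = overhead / NODE_MTU * 100
    print(f"{mode:<13} overhead={overhead:>2}B  pod MTU={pod_mtu(mode)}  ({share:.1f}% of MTU)")
```

This only covers header bytes. The CPU cost of encapsulating and decapsulating depends on the kernel, NIC offloads, and traffic pattern, so I would measure that rather than compute it.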

Connection Tracking

Regardless of the approach, the Linux connection tracking system (conntrack) maintains state for TCP and UDP flows. This is important for return traffic (and stateful firewalling).

# View connection tracking entries
sudo conntrack -L | grep 10.244.0.5

# Output:
# tcp      6 117 TIME_WAIT src=10.244.0.5 dst=10.244.1.5 sport=45678 dport=8080 
#          src=10.244.1.5 dst=10.244.0.5 sport=8080 dport=45678 [ASSURED] use=1

The conntrack entry shows both directions of the connection:

src=10.244.0.5 dst=10.244.1.5 sport=45678 dport=8080
src=10.244.1.5 dst=10.244.0.5 sport=8080 dport=45678

The second tuple is the first with source and destination swapped, which lets the kernel match return packets to the original flow.
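
Here is a rough Python sketch of that matching idea. It is a toy model, assuming no NAT and ignoring the protocol field and timeouts that real conntrack also tracks. FlowTuple, reply_of, and track are hypothetical names, not kernel APIs.

```python
from typing import NamedTuple


class FlowTuple(NamedTuple):
    src: str
    dst: str
    sport: int
    dport: int


def reply_of(t: FlowTuple) -> FlowTuple:
    """Expected reply tuple when no NAT is applied: swap both ends."""
    return FlowTuple(src=t.dst, dst=t.src, sport=t.dport, dport=t.sport)


# Toy connection table keyed by both directions, pointing at the original tuple
table: dict[FlowTuple, FlowTuple] = {}


def track(original: FlowTuple) -> None:
    """Register a new flow under its original and reply tuples."""
    table[original] = original
    table[reply_of(original)] = original


original = FlowTuple("10.244.0.5", "10.244.1.5", 45678, 8080)
track(original)

# A packet coming back from Pod C matches the reply direction of the same flow
returning = FlowTuple("10.244.1.5", "10.244.0.5", 8080, 45678)
print(table[returning] == original)  # True
```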

Troubleshooting Pod-to-Pod Communication

Here is a non-exhaustive list of commands to help you troubleshoot communication problems between pods.

Check routing

# From inside a pod, check route to destination
ip route get 10.244.1.5

# Output:
# 10.244.1.5 via 10.244.0.1 dev eth0 src 10.244.0.5

# On the host, check route to remote pod CIDR
ip route get 10.244.1.5

# Output (VXLAN):
# 10.244.1.5 via 10.244.1.0 dev flannel.1 src 10.244.0.0
# Output (BGP):
# 10.244.1.5 via 192.168.1.11 dev eth0 src 192.168.1.10

Check bridge connectivity

# Verify bridge exists and has interfaces
bridge link show

# Output:
# 8: veth12345@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 master cni0 state forwarding
# 9: veth67890@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 master cni0 state forwarding

# Check bridge MAC table
bridge fdb show br cni0 | head

Check VXLAN state (overlay networks)

# Verify VXLAN interface exists
ip -d link show flannel.1

# Output:
# 4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN
#     link/ether 5a:2b:3c:4d:5e:6f brd ff:ff:ff:ff:ff:ff promiscuity 0
#     vxlan id 1 local 192.168.1.10 dev eth0 srcport 0 0 dstport 4789 nolearning

# Check VXLAN FDB entries
bridge fdb show dev flannel.1

Capture traffic at each hop

# Inside source pod
tcpdump -i eth0 -nn host 10.244.1.5

# On source node bridge
sudo tcpdump -i cni0 -nn host 10.244.0.5

# On source node physical interface (shows encapsulated or raw traffic)
sudo tcpdump -i eth0 -nn host 192.168.1.11  # or host 10.244.1.5 for BGP

# On source node VXLAN interface (shows inner traffic)
sudo tcpdump -i flannel.1 -nn host 10.244.1.5

Summary

Pod-to-pod communication takes different paths depending on pod locality:

Same-node: Packets traverse veth pairs and a Linux bridge. This is Layer 2 switching within the host. iptables is not involved unless NetworkPolicies are in place.

Cross-node with overlay (VXLAN/IPinIP): Packets are encapsulated with an outer header containing node IPs. The physical network only routes on the node IPs. The inner pod IPs are tucked inside the payload (encapsulated, but not encrypted).

Cross-node with BGP routing: Packets are sent directly with pod IPs intact. The physical network must have routes for pod CIDRs. No encapsulation overhead.

Part 3 will trace north-south traffic: a packet from an external user through a LoadBalancer service, into the cluster, and back. This is where iptables and kube-proxy become central to the packet’s journey.

Kubernetes Networking Deep Dive: Part 1

Hugh Tipping — Sat, 07 Feb 2026 05:50:11 GMT

As promised in the Introduction, here is the first post in the series following the life of a packet through a Kubernetes cluster. Before I start, let’s establish the foundational concepts: how Kubernetes allocates IP addresses to resources, how pods get their network interfaces, and what components in a cluster are responsible for routing traffic.

The Kubernetes Networking Model

Kubernetes has three fundamental requirements in any networking implementation:

Pods can communicate with all other pods on any node without NAT
Nodes can communicate with all pods without NAT
The IP that a pod sees for itself is the same IP that other pods see for it

These requirements are defined in the Kubernetes documentation and any CNI plugin you install in your cluster must satisfy them. This also means every pod gets a routable IP address.

This is different from, say, default Docker networking, where containers on different hosts cannot communicate without port mapping or overlay configuration (was always a bit annoying which is why I like to work in Kubernetes). Kubernetes abstracts this away: from an app perspective, all pods are directly reachable by IP, though services are preferred for connectivity.

Reference: Kubernetes Networking Model documentation at: https://kubernetes.io/docs/concepts/services-networking/

IP Address Allocation

A Kubernetes cluster uses two separate IP ranges defined when a cluster is first created.

Pod CIDR

The “Pod CIDR” is the IP range from which all pod IP addresses are allocated.

For example: the CIDR block 10.244.0.0/16 provides 65,536 addresses. The cluster divides this range among all the nodes in a cluster. Each node gets a chunk, say via a /24 subnet mask (providing 256 addresses for the node), and from here Kubernetes assigns IPs to pods that get scheduled on that node.

For example in a 3-node cluster:

Cluster pod CIDR: 10.244.0.0/16
Node 1 range: 10.244.0.0/24 (pods get 10.244.0.2, 10.244.0.3, etc.)
Node 2 range: 10.244.1.0/24 (pods get 10.244.1.2, 10.244.1.3, etc.)
Node 3 range: 10.244.2.0/24 (pods get 10.244.2.2, 10.244.2.3, etc.)
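
If you want to double-check that arithmetic, Python's standard ipaddress module can do the carving for you. This is only a sketch of the math. It assumes one /24 per node as in the example above, and it does not reflect how the controller manager or any particular CNI actually hands out blocks.

```python
import ipaddress
from itertools import islice

cluster_cidr = ipaddress.ip_network("10.244.0.0/16")
print(cluster_cidr.num_addresses)  # 65536

# Carve the cluster range into /24 blocks, one per node
node_blocks = cluster_cidr.subnets(new_prefix=24)
for n, block in enumerate(islice(node_blocks, 3), start=1):
    hosts = list(block.hosts())
    print(f"Node {n}: {block} ({block.num_addresses} addresses, "
          f"usable {hosts[0]} to {hosts[-1]})")
```

In the examples in this series the first usable address on each node (such as 10.244.0.1) sits on the cni0 bridge, which is why pod IPs start at .2.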

The node’s kubelet, along with the CNI plugin, handles the IP assignment when a pod starts up.

NOTE:

Remember from Kubernetes basics that Kubernetes itself does not provide any networking functionality per se. It relies on the Container Network Interface (CNI), which lets you install whatever networking plugin you wish, and the kubelet doesn’t have to care which. I’ll talk more about that later.

Service CIDR

The Service CIDR is a different range used for ClusterIP services. For example, let’s use 10.96.0.0/12. Unlike pod IPs, service IPs are actually virtual. They DO NOT get assigned to any pod or physical network interface. (The one exception: in IPVS mode, kube-proxy binds them to a dummy interface named kube-ipvs0 on each node.) They exist only as iptables (or IPVS) rules for redirecting network traffic to pod endpoints. Neither is exactly fun to manage, which is part of the beauty of Kubernetes: it takes care of all that for you.

When a Service gets created, Kubernetes grabs an IP from this range and updates the routing rules on EVERY node to handle traffic destined for that specific Service IP.

Showing the Cluster CIDR Configuration

You can check out the CIDR ranges for your cluster with kubectl commands like the below:

# View the pod CIDR, for example
kubectl cluster-info dump | grep -m 1 cluster-cidr
# The output would contain something like: cluster-cidr=10.244.0.0/16

# View the service CIDR, for example
kubectl cluster-info dump | grep -m 1 service-cluster-ip-range
# The output would contain something like: service-cluster-ip-range=10.96.0.0/12

# View the CIDR allocated to a specific node
kubectl get node node-1 -o jsonpath='{.spec.podCIDR}'
# Output would be something like: 10.244.0.0/24

Reference: Cluster Networking documentation at: https://kubernetes.io/docs/concepts/cluster-administration/networking/

Network Namespaces and the Pod Sandbox

As I was researching this topic, I dug into some interesting details about how Kubernetes actually grabs IP addresses from the Pod CIDR and handles pod creation and isolation.

Every pod has its own Linux network namespace. A network namespace (not to be confused with a Kubernetes Namespace) provides isolated networking components: its own interfaces, routing tables, iptables rules, etc. Processes running in one network namespace cannot see or interact with network resources in another namespace unless the two are explicitly connected. This helps give pods their own isolated environments for their containers.

The Pause Container

Whenever Kubernetes creates a new pod, the container runtime (e.g. containerd, cri-o) first creates what is called a “pause” container, also known as a “sandbox” container. This pause container doesn’t actually do anything. It just waits forever, but holds the network namespace for the upcoming workload containers for the pod.

Then these workload containers (e.g. nginx, your app, a logging sidecar) join this existing networking namespace instead of creating their own. This shared namespace is how containers within the same pod share the same IP address and can communicate with each other over localhost. This eases the sidecar paradigm. Kinda neat.

You can check out pause containers on a node (though you’ll seldom have to do so, it’s still interesting to see how things work):

# On a node that has containerd running, you use the ctr command
sudo ctr -n k8s.io containers list | grep pause
# Output example:
# a1b2c3d4e5f6    registry.k8s.io/pause:3.9    io.containerd.runc.v2

# Or with crictl if you are using cri-o
sudo crictl ps -a | grep pause
# Output example:
# 7f8e9d0c1b2a   3 hours ago   Running   pause   0   abc123def456   nginx-pod

Here’s more info on container runtimes.

Examining Network Namespaces

With root access on a node, you can inspect a pod’s network namespace. Note that in many cases you may not have direct access to a node, especially in a production environment, so you may need to do it from a pod with the right security context instead.

# List network namespaces (requires root on the node or a pod must have CAP_SYS_ADMIN)
sudo lsns -t net

# Output example:
#         NS TYPE NPROCS   PID USER    NETNSID NSFS                           COMMAND
# 4026531840 net     145     1 root unassigned                               /sbin/init
# 4026532509 net       2  1842 65535          0 /run/netns/cni-a1b2c3d4-e5f6   /pause
# 4026532592 net       3  2156 65535          1 /run/netns/cni-f7g8h9i0-j1k2   /pause

# You can also “enter” a pod’s network namespace and inspect it
POD_PID=$(sudo crictl inspect <container-id> | jq .info.pid)
sudo nsenter -t $POD_PID -n ip addr

# Output example:
# 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
#     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
#     inet 127.0.0.1/8 scope host lo
#        valid_lft forever preferred_lft forever
# 3: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP
#     link/ether 62:a1:b2:c3:d4:e5 brd ff:ff:ff:ff:ff:ff link-netnsid 0
#     inet 10.244.0.5/24 brd 10.244.0.255 scope global eth0
#        valid_lft forever preferred_lft forever

That funny-looking eth0@if12 name indicates that this is one end of a veth pair (next section), with the other end being interface index 12 on the node itself.

Reference: Linux network namespaces documentation in man 7 network_namespaces and Kubernetes Pod documentation at: https://kubernetes.io/docs/concepts/workloads/pods/

Virtual Ethernet Pairs

To understand a veth (virtual ethernet) pair, think of a networking cable (virtual, of course) connecting two different network namespaces. This functionality is part of Linux. Packets sent to one side of the pair come out of the other side. Standard Kubernetes networking uses veth pairs to join pod namespaces to the host namespace, and this is how packets travel into and out of a pod. Without a veth pair (or something similar, such as macvlan), the pod’s network namespace would be isolated, unable to communicate with anything else.

When a pod is created:

The CNI plugin creates the veth pair for the pod.
One end of the pair is placed in the pod namespace (usually called something like eth0)
The other end stays in the host namespace (named vethXXXXXX or something like that)
The host end is attached to a bridge or configured with routes. More on that further below.

Taking a look at the veth Pairs

# On the node, list veth interfaces
ip link show type veth

# Output something like:
# 12: veth9f8e7d6c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP
#     link/ether 8a:1b:2c:3d:4e:5f brd ff:ff:ff:ff:ff:ff link-netns cni-a1b2c3d4-e5f6

# The “master cni0” means these are attached to a bridge named cni0
# The “link-netns” shows which network namespace the other end is located in

# View the bridge and its attached interfaces
bridge link show

# Output:
# 12: veth9f8e7d6c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 master cni0 state forwarding

Reference: veth documentation in man 4 veth

Container Network Interface (CNI)

CNI is a specification and set of libraries for configuring network interfaces in Linux containers. It is a CNCF project that Kubernetes adopted rather than a Kubernetes-only spec. As mentioned above, Kubernetes does not implement networking directly. Instead, it defers the functionality to CNI plugins that do the actual work; this “loose coupling” provides the flexibility to use different types of networking, including cloud networking.

What does the CNI actually do?

When a kubelet needs to set up the networking for a new pod:

The kubelet will call whichever container runtime is installed (containerd, cri-o)
The runtime creates the pod pause container with a new network namespace
The runtime invokes the CNI plugin specified in the node’s CNI configuration.
The CNI plugin then configures these in the networking namespace: creation of interfaces, assignment of IPs, setting up routes within the pod’s routing table.
The CNI plugin then returns the configuration to the runtime, which reports it back to the kubelet

CNI plugins perform three main operations (the spec also defines VERSION, which reports the spec versions a plugin supports):

ADD (configure networking for a new container)
DEL (clean up networking when container stops)
CHECK (verify configuration is correct)
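
To give a feel for the ADD, DEL, and CHECK operations, here is a toy dispatcher in Python. A real plugin is an executable that the runtime invokes with CNI_COMMAND and friends as environment variables and the network config as JSON on stdin. In this sketch both are plain dicts so it runs anywhere. The handle_cni function, the hard-coded IP, and the "toy" plugin type are all made up for illustration, and the result is a simplified version of the shape the spec describes.

```python
import json


def handle_cni(command: str, net_config: dict, env: dict) -> dict:
    """Toy dispatcher for CNI operations; it touches no real network state."""
    if command == "ADD":
        # A real plugin would create the veth pair and assign an IP here.
        # The address below is hard-coded purely for illustration.
        return {
            "cniVersion": net_config["cniVersion"],
            "interfaces": [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}],
            "ips": [{"address": "10.244.0.5/24", "gateway": "10.244.0.1", "interface": 0}],
        }
    if command in ("DEL", "CHECK"):
        # DEL tears down what ADD created; CHECK verifies it still looks right.
        # In this toy version both simply succeed with an empty result.
        return {}
    raise ValueError(f"unsupported CNI_COMMAND: {command}")


# Stand-ins for what the runtime would pass in (values are illustrative)
env = {
    "CNI_CONTAINERID": "a1b2c3d4e5f6",
    "CNI_NETNS": "/run/netns/cni-a1b2c3d4-e5f6",
    "CNI_IFNAME": "eth0",
}
config = {"cniVersion": "1.0.0", "name": "cbr0", "type": "toy"}

print(json.dumps(handle_cni("ADD", config, env), indent=2))
```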

Are you still with me? Good... let’s keep going! 😅

CNI Configuration

CNI configuration lives in /etc/cni/net.d/ on each node. For example, below is a standard config for the flannel CNI plugin. I won’t go into what the configuration options are.

cat /etc/cni/net.d/10-flannel.conflist

# Output:
# {
#   "name": "cbr0",
#   "cniVersion": "0.3.1",
#   "plugins": [
#     {
#       "type": "flannel",
#       "delegate": {
#         "hairpinMode": true,
#         "isDefaultGateway": true
#       }
#     },
#     {
#       "type": "portmap",
#       "capabilities": {
#         "portMappings": true
#       }
#     }
#   ]
# }

Overlay vs Routed Networking

Here’s where it gets even more interesting...

CNI plugins fall into two large categories depending on how they handle cross-node traffic:

Overlay Networks

Encapsulate pod traffic in an outer packet with node IPs
Function on any kind of network infrastructure
Add overhead: extra headers reduce the effective MTU*, plus encap/decap adds CPU overhead
Examples: Flannel (VXLAN mode), Calico (VXLAN or IPinIP mode), Cilium (VXLAN mode)

*MTU = Maximum Transmission Unit, the largest packet size a network link can carry, usually 1500 bytes for Ethernet. With the overhead of encapsulating with more headers, you can transmit less data.

Routed Networks (BGP, host routing)

Pod IPs are routed directly on the physical network connecting nodes
Require network infrastructure configuration: BGP peering or static routes
No encapsulation overhead, full MTU available
Pod IPs visible in network flow logs and to firewalls
Examples: Calico (BGP mode), Cilium (native routing)

Oof that’s a lot! I’ll go into both in detail in Part 2 when tracing cross-node pod-to-pod communication.

Reference: CNI specification at https://www.cni.dev/docs/spec/ and Kubernetes Network Plugins documentation at: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/

kube-proxy and Service Routing

The kube-proxy is a Kubernetes component that runs on every node in a cluster and maintains all the network rules defined to route traffic to Services. Despite how it’s named, it does not proxy any traffic itself. Instead, it configures the kernel’s packet filtering and NAT facilities.

What kube-proxy actually does

kube-proxy watches the Kubernetes API for Service and EndpointSlice objects (you generally won’t have to deal with the latter directly). Whenever these change, it updates the node’s packet processing rules (e.g. iptables) to:

Intercept traffic whose destination is a Service’s ClusterIP
Select a backend pod (load balancing)
Redirect the traffic to that pod’s IP (DNAT: rewriting the destination from the Service IP/port to the pod IP/port)
Handle return traffic back to the source
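
The four steps above can be sketched in Python. This is a toy model of the idea only: kube-proxy itself programs kernel rules and never touches packets, and the SERVICES map, nat_table, and function names here are hypothetical. The random pick is meant to be similar in spirit to the random endpoint selection in iptables mode.

```python
import random

# Hypothetical Service and endpoints, reusing addresses from this series
SERVICES = {("10.96.0.10", 53): [("10.244.0.2", 53), ("10.244.1.3", 53)]}

nat_table = {}  # remembers each translation so the reply can be reversed


def dnat(src, dst, rng=random):
    """Rewrite a Service destination to one backend pod; pass other traffic through."""
    backends = SERVICES.get(dst)
    if backends is None:
        return src, dst              # not a Service IP: leave the packet alone
    backend = rng.choice(backends)   # pick one endpoint at random
    nat_table[(backend, src)] = dst  # key on how the reply packet will look
    return src, backend


def un_dnat(src, dst):
    """Restore the Service address as the source of a reply packet."""
    return nat_table.get((src, dst), src), dst


client = ("10.244.0.5", 45678)
_, chosen = dnat(client, ("10.96.0.10", 53))
print("forwarded to", chosen)
print("reply appears from", un_dnat(chosen, client)[0])  # ('10.96.0.10', 53)
```

The client only ever sees the Service IP, even though a pod did the work. In the real kernel, connection tracking is what remembers the translation.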

What kube-proxy Does Not Do

kube-proxy is not involved in pod-to-pod communication that does not go through a Service. When Pod A communicates directly with Pod B’s IP address, the traffic is handled by the CNI networking layer. kube-proxy’s rules are only followed when the destination is a Service IP or NodePort.

kube-proxy Modes: iptables vs IPVS

kube-proxy can operate in two main modes (a third, nftables, is newer and I won’t be covering it). The mode affects how Service routing rules are implemented.

iptables Mode

In iptables mode, kube-proxy creates iptables firewall rules for each Service and endpoint.

Advantages:

iptables has been around for a while so it’s mature and well-understood (as well as one can understand that dark art!)
No additional kernel modules are required
Works on any Linux distribution (well, I’d say most)

Disadvantages:

Rule count grows with the number of Services and endpoints
Rule updates require rewriting entire chains, adding latency for a large set of rules
Sequential rule evaluation can add latency (if you have a large list and the matching rule is towards the end)

Performance characteristics:

Works well up to approximately 1,000 Services after which things start to slow down because of the sequential processing of rules.
Rule-update latency increases beyond 5,000 Services because updates are not incremental (i.e. the whole rule set needs to be rewritten even if there is only one change)
Memory usage for rules can become quite a bit as you scale up

Viewing iptables rules (if you like to torture yourself):

# List Service-related NAT rules
sudo iptables -t nat -L KUBE-SERVICES -n | head -20

# Output example:
# Chain KUBE-SERVICES (2 references)
# target     prot opt source               destination
# KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  0.0.0.0/0    10.96.0.1     /* default/kubernetes:https cluster IP */ tcp dpt:443
# KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  0.0.0.0/0    10.96.0.10    /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
# KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  0.0.0.0/0    10.96.0.10    /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53

# Count lines in the NAT table listing (a rough proxy for the number of rules)
sudo iptables -t nat -L -n | wc -l
# Output example: 847

IPVS Mode

In IPVS mode, kube-proxy uses the kernel’s IPVS (IP Virtual Server) subsystem. IPVS is built for load balancing and uses hash tables, so lookup time stays roughly constant regardless of the number of Services.
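
The difference between walking a list of rules and doing a hash lookup is easy to see in a small Python sketch. It is an analogy only, with 10,000 made-up Service VIPs. It models neither the real iptables chain layout nor the IPVS kernel tables, and all names in it are invented for illustration.

```python
import ipaddress

base = int(ipaddress.ip_address("10.96.0.0"))
# 10,000 made-up Service VIPs, each mapped to a made-up backend label
services = [(str(ipaddress.ip_address(base + i)), 80) for i in range(1, 10_001)]

rule_list = [(vip, f"backend-{i}") for i, vip in enumerate(services)]  # chain-like
rule_hash = dict(rule_list)                                            # hash-table-like


def sequential_lookup(vip):
    """Walk the rules in order and count comparisons until one matches."""
    for checked, (candidate, backend) in enumerate(rule_list, start=1):
        if candidate == vip:
            return backend, checked
    return None, len(rule_list)


last = services[-1]
print(sequential_lookup(last))  # ('backend-9999', 10000)
print(rule_hash[last])          # backend-9999, found with a single hash probe
```

The worst case for the list grows with the number of Services, while the dictionary lookup stays about the same.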

Advantages:

Rule matching via hash tables (fast!)
Supports multiple load balancing algorithms
Lower latency rule updates
Better performance at scale (10,000+ Services or so)

Disadvantages:

Requires IPVS kernel modules
More complex debugging (ipvsadm command)
Still uses iptables for some functions (masquerading (SNAT), NodePort handling)

Viewing IPVS rules (here’s some more fun!):

# Check if IPVS mode is active
sudo ipvsadm -Ln | head -10

# Output:
# IP Virtual Server version 1.2.1 (size=4096)
# Prot LocalAddress:Port Scheduler Flags
#   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
# TCP  10.96.0.1:443 rr
#   -> 192.168.1.10:6443            Masq    1      3          0
# TCP  10.96.0.10:53 rr
#   -> 10.244.0.2:53                Masq    1      0          0
#   -> 10.244.1.3:53                Masq    1      0          0
# UDP  10.96.0.10:53 rr
#   -> 10.244.0.2:53                Masq    1      0          12
#   -> 10.244.1.3:53                Masq    1      0          8

# View IPVS connection tracking
sudo ipvsadm -Lnc | head -10

# Output:
# IPVS connection entries
# pro expire state       source             virtual            destination
# TCP 14:56  ESTABLISHED 10.244.0.5:48892   10.96.0.1:443      192.168.1.10:6443

Choosing Between Modes (The right mode for the right job)

Use iptables mode when things are smaller:

Running smaller clusters (under 1,000 Services)
You want to keep it simple
You don’t/can’t have IPVS kernel modules installed

Use IPVS mode when things scale up:

Running larger clusters (1,000+ Services)
You need specific load balancing algorithms
Service creation/update latency reduction is important

To configure the mode, set the mode field in kube-proxy’s configuration (or pass the --proxy-mode flag):

# Check current mode
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode

# Output: mode: "ipvs" or "iptables" or empty (defaults to iptables)

Reference: kube-proxy documentation at: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/ and IPVS-based proxying at https://kubernetes.io/docs/concepts/services-networking/service/#proxy-mode-ipvs

OSI Model Context

As we trace packets through the cluster in later posts, we will reference OSI layers to clarify where processing occurs:

Layer 7 (Application): HTTP, gRPC, DNS queries. This is where your application code operates.
Layer 4 (Transport): TCP/UDP. Ports, connections, and load balancing decisions (including IPVS) happen here.
Layer 3 (Network): IP addresses and routing. CNI plugins, iptables, and NAT operate mostly at this layer.
Layer 2 (Data Link): MAC addresses and switching. Bridges like cni0 and VXLAN encapsulation operate here.

Kubernetes networking operates mostly at Layers 3 and 4, with Layer 2 involvement for local bridging and overlay encapsulation.

Summary

This post covered the foundational concepts required to understand Kubernetes networking:

The flat network model guarantees pod-to-pod communication without NAT
Pod CIDR provides addresses for pods; Service CIDR provides virtual IPs for Services
Each pod runs in an isolated network namespace
Veth pairs connect pod namespaces to the host network
CNI plugins handle the actual network interface configuration
kube-proxy handles Service routing rules using iptables or IPVS
Overlay and routed networking are the two main approaches for cross-node traffic

Part 2 will trace packets through pod-to-pod communication, covering both same-node traffic (through the bridge) and cross-node traffic (using VXLAN overlay and BGP routing).

Kubernetes Networking Deep Dive: Intro

Hugh Tipping — Sat, 31 Jan 2026 11:36:44 GMT

Photo by Scott Rodgerson on Unsplash

I was having a fairly random tech conversation with a friend of mine and, out of curiosity, he asked me some questions about Kubernetes networking, knowing I had spent time in that universe. As I explained the high-level basics to him, he dug deeper (as I am happy he always does) with detailed questions about the movement of a packet into, out of, and through a cluster. I soon came to realize I had some gaps in my knowledge.

I went back home and started looking up answers. I found my efforts snowballing into a lot of information. Then it hit me: why not start off my first Substack posts with a series on Kubernetes networking, so that others could benefit from this work?

I decided to do this as a “Life of a Packet” through a Kubernetes cluster. In this series of posts, I will trace out the path from an external user through a load balancer, into the cluster, to a container in a pod, and back out again. I will then cover pod-to-pod communication within the cluster and touch upon encryption.

My goal is to illustrate what happens not only within Kubernetes constructs, but also at the network level, below the abstractions of Kubernetes. I’ll take a look at iptables rules, network namespaces, veth pairs, and CNI plugins. I will map a packet’s journey including OSI layers: Layer 2 switching, Layer 3 routing, and Layer 4 load balancing.

I purposefully kept this cloud-provider agnostic, and I do not cover service meshes since they warrant posts of their own. This will be standard, simple Kubernetes networking infrastructure, providing foundational knowledge to anyone growing their Kubernetes skills.

Here’s a breakdown of the series:

Part 1: Foundations

This is where I will go over core concepts:

The Kubernetes networking model
Pod and service CIDR allocation
Linux network namespaces
VETH pairs
The Container Network Interface (CNI).

I will also cover kube-proxy’s role and compare iptables vs IPVS at a high level.

Part 2: Pod-to-Pod Communication (East-West Traffic)

I’ll then trace a packet from one pod to another, otherwise known as “East-West” traffic.

This will go over two scenarios:

same-node communication (through a bridge)
cross-node communication

For cross-node traffic, I will explain overlay networking (VXLAN encapsulation) and routed networking (BGP). Don’t get too scared. I’ll have diagrams to show the packet structure at each hop, including what gets encapsulated and where. Of course, if it’s too in-the-weeds you can always skip this bit. Weeds can be fun, though.

Part 3: External Traffic to Pods (North-South Traffic)

Once I’m finished going over traffic within a cluster, I’ll pull back a bit and discuss traffic into and out of a cluster, called “North-South” traffic. This will begin its journey at an external user, go through a Load Balancer service, then land on a pod.

I will dig into iptables: PREROUTING, the KUBE-SERVICES chain, service endpoint selection, DNAT, and SNAT. (Wow. Say that 10 times fast.)

I’ll also show the return path and how connection tracking enables it. And I’ll touch upon externalTrafficPolicy and its tradeoffs for client IP preservation.

Part 4: Encryption In-Flight

Finally, I get into encryption. Oh, yeah, this is always fun: “TLS and Certificates and Keys... oh my”.

I will cover:

TLS termination at the load balancer
TLS passthrough to ingress controllers
backend TLS to pods
CNI-level encryption for pod-to-pod traffic.

I will also touch upon the tradeoffs among the different termination points and briefly cover network policies as a complementary security layer.

What You Will Need

If you haven’t been scared off already, note that for these posts I will assume some familiarity with basic networking concepts (TCP/IP) and core Kubernetes concepts (pods, services, nodes, the kubelet). It couldn’t hurt to review the networking OSI model but deep expertise is NOT required. I will include some command examples you can try out for yourself.

Next post: Kubernetes Networking Deep Dive, Part 1: Foundations