MetalLB with Calico BGP: Deployment, Architecture, and Validation

Deploy MetalLB as controller-only and let Calico’s BGP advertise a LoadBalancer pool. The guide covers Helm setup, IP pool config with avoidBuggyIPs, route flow analysis, source-IP-preserving Local policy, and failover tests for production-grade bare-metal LB.


Motivation and Background

In bare-metal Kubernetes clusters, the LoadBalancer service type requires custom solutions since no cloud provider is managing external load balancers. MetalLB is a popular add-on that provides LoadBalancer functionality on-premises. Traditionally, MetalLB can operate in Layer 2 mode (using ARP) or BGP mode (advertising service IP routes via BGP). However, when Calico is used as the CNI with BGP routing, running MetalLB’s own BGP speaker can conflict with Calico’s BGP sessions – BGP typically allows only one session per node-to-router pair. In other words, if Calico already peers with your top-of-rack (ToR) router, a second BGP session from MetalLB on the same node would be rejected.

Issues with Calico :: MetalLB, bare metal load-balancer for Kubernetes
The easy way: As of Calico 3.18 (from early 2021), Calico supports limited integration with MetalLB. Calico can be configured to announce the LoadBalancer IPs via BGP. Simply run MetalLB and apply an IPAddressPool without any BGPAdvertisement CR. When using MetalLB in this way, you can even remove the speaker pods to save cluster resources, as the controller is the component in charge of assigning the IPs to the services.

Newer versions of Calico (v3.18+) offer a solution: Calico itself can advertise service IPs, including MetalLB-assigned LoadBalancer IPs, over BGP. By leveraging this capability, we can run MetalLB in a controller-only mode (no MetalLB speaker daemons) and let Calico handle all BGP route advertisements. This design avoids dual BGP sessions and uses fewer components, while still providing external load-balanced access to services. According to Calico’s documentation, advertising service IPs via Calico’s BGP obviates the need for a separate hardware or cloud load balancer and supports advanced features like ECMP load balancing and preservation of client source IP.

Advertise Kubernetes service IP addresses | Calico Documentation
Configure Calico to advertise Kubernetes service cluster IPs and external IPs outside the cluster using BGP.

In summary, the motivation for this architecture is to integrate MetalLB with Calico’s BGP for efficient, conflict-free load balancer service exposure. We gain direct routing of external traffic to cluster nodes via BGP, avoid source NAT (with proper configuration), and ensure high availability and failover by using routing protocols rather than relying on a single load balancer appliance.

Architecture Overview and Component Roles

MetalLB (Controller-Only Mode): In this deployment, only MetalLB’s controller component runs. The controller is responsible for allocating IP addresses from a predefined pool to Kubernetes services of type LoadBalancer. We disable MetalLB’s speaker pods (and the optional FRR integration) entirely, since Calico will handle route advertisement. The MetalLB controller watches for new LoadBalancer services and assigns an IP from the pool, updating the service’s .status.loadBalancer.ingress field. It does not itself announce anything via BGP in this mode; it simply manages IP assignments. MetalLB’s configuration will include an IPAddressPool CR that defines the usable IP range, but notably no BGPAdvertisement is configured (unlike a standard MetalLB BGP setup).

Calico (BGP Routing): Calico is assumed to be the CNI plugin in this cluster, providing pod networking and BGP capabilities. Each Kubernetes node runs the Calico agent (often with BIRD or FRR under the hood) that establishes BGP sessions with the ToR switches or routers in the environment. Calico already advertises pod IP routes (and possibly cluster IP routes) to the ToR routers so that internal cluster networks are known externally. With MetalLB integration, we also configure Calico to advertise the service LoadBalancer IP range to those routers. In effect, the Calico BGP daemon on each node will announce routes for service IPs, making those IPs reachable from outside the cluster. Calico implements Kubernetes’ externalTrafficPolicy rules in its route announcements: for services with externalTrafficPolicy set to Local, only nodes that actually host a service’s pod endpoints will advertise a route (typically a /32 route) for that service IP. This means traffic from outside will be directed only to nodes that can serve it, preserving the client’s source IP and avoiding an extra hop across nodes. (In contrast, if externalTrafficPolicy were “Cluster”, every node might advertise the service network, and traffic could be forwarded internally with potential source NAT.)

Upstream Routers/Switches: The ToR switches or routers are configured as BGP peers for Calico. They learn routes for pod subnets and service IPs from the Calico nodes. In this design, the routers will learn the 10.208.1.0/24 range (our LoadBalancer pool) via BGP from the cluster. If using externalTrafficPolicy: Local, the routers may learn more specific host routes (/32s) for each allocated service IP, potentially with multiple next-hops (ECMP) if a service is served by pods on multiple nodes. The routers then route client traffic destined for a service IP to one of the advertising nodes.

Kube-Proxy and Service Handling: Kubernetes’ kube-proxy is still responsible for implementing service IPs on the nodes. For a LoadBalancer service, kube-proxy allocates a nodePort (and a healthCheckNodePort if needed) and ensures traffic arriving at the node for the service’s external IP gets forwarded to the service’s pods. With externalTrafficPolicy: Local, kube-proxy will only forward external traffic to pods that are local to the same node and will drop packets if no local endpoints exist. This mode also preserves the original client IP address, since no SNAT is done for local traffic. Essentially, each node will handle LoadBalancer traffic only if it has a healthy pod for that service, otherwise it will not accept or forward that traffic. (In cloud environments or MetalLB’s own BGP, a similar effect is achieved by withdrawing advertisement or failing health checks on nodes without endpoints. With Calico, advertisement of /32 routes naturally ceases on nodes without pods for Local mode services.)

Using Source IP
Applications running in a Kubernetes cluster find and communicate with each other, and the outside world, through the Service abstraction. This document explains what happens to the source IP of packets sent to different types of Services, and how you can toggle this behavior according to your needs. Before you begin Terminology This document makes use of the following terms: NAT Network address translation Source NAT Replacing the source IP on a packet; in this page, that usually means replacing with the IP address of a node.

Flow of Traffic: The overall flow is: a client issues traffic to the service’s external IP; the upstream router, having learned the route via BGP, forwards the packet to one of the Kubernetes nodes advertising that IP; that node’s kube-proxy intercepts the traffic and directs it to a local pod; the pod responds, and the return traffic leaves the node to the router, which sends it back to the client. No additional hop or overlay is needed once BGP routes are in place, and the client’s IP is preserved into the pod. This process is illustrated in the diagram below.

Traffic ingress flow with MetalLB (controller-only) and Calico BGP. (1) The client sends a packet to the service’s LoadBalancer IP (e.g. 10.208.1.X). (2) The ToR router, which learned the route via BGP, forwards it to a node that hosts a service endpoint. (3) The node’s kube-proxy delivers the traffic to the local pod. Dashed arrows show the return path: the pod’s response goes back through the same node and router to the client, with the source IP preserved.

Deployment Steps and Configuration

Deploying this setup involves configuring MetalLB via Helm (with speakers disabled), defining the IP address pool CRD, and adjusting Calico’s BGP settings. The following steps outline the process:

Install MetalLB in Controller-Only Mode

Use the MetalLB Helm chart (or raw manifests) to deploy only the controller component. When installing via Helm, set the values to disable the speaker and FRR. For example, you can override values as follows (assuming the MetalLB namespace is metallb-system):

controller:
  enabled: true          # ensure the controller is deployed
speaker:
  enabled: false         # disable the speaker DaemonSet
  # If using FRR mode in MetalLB (for BGP), explicitly disable it as well:
  frr:
    enabled: false

Install MetalLB with these values (the exact Helm command depends on the chart source; e.g., for Bitnami or the official chart). This will create the MetalLB controller Deployment in the cluster, but no speaker pods will be scheduled on the nodes. We still get the MetalLB CRDs installed for configuring IP addresses. MetalLB’s controller will now act solely as an IP allocator. As per MetalLB documentation, running MetalLB with only a pool (and no BGP speaker) is sufficient when Calico is announcing the routes.
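As a concrete sketch (using the official MetalLB chart repository; the release name and values file name are placeholders), the install could look like:

```shell
# Add the official MetalLB chart repo and install controller-only,
# using the value overrides shown above (saved as values.yaml):
helm repo add metallb https://metallb.github.io/metallb
helm repo update
helm install metallb metallb/metallb \
  --namespace metallb-system --create-namespace \
  -f values.yaml

# Confirm that only the controller is running (no speaker pods):
kubectl -n metallb-system get pods
```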

Configure an IPAddressPool for the LoadBalancer IPs

We need to define the pool of IP addresses that MetalLB can allocate for LoadBalancer services. In our scenario, the IP pool is 10.208.1.0/24, with the .0 and .255 addresses excluded (.0 is the network address and .255 the broadcast address for that subnet, and some network equipment will not accept them as host addresses). We can create a MetalLB IPAddressPool custom resource as below:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lb-pool-10-208-1
  namespace: metallb-system
spec:
  addresses:
    - 10.208.1.0/24
  avoidBuggyIPs: true       # Avoid allocating .0 and .255

The avoidBuggyIPs: true flag ensures MetalLB will skip any address ending in .0 or .255 when assigning IPs. Alternatively, we could specify the address range as 10.208.1.1-10.208.1.254 to exclude them. No BGPAdvertisement resource is provided – leaving it out indicates that MetalLB itself will not advertise these routes (Calico will do it). If you were using MetalLB’s own BGP speaker, you would normally create a BGPAdvertisement CR to define how to announce the pool; here it’s deliberately omitted.

Apply this manifest to the cluster. Once created, MetalLB’s controller will be aware of the pool and ready to allocate addresses from it. (Note: ensure the MetalLB CRDs are installed by the Helm chart and that the controller is running before applying CRs.)
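If you prefer the explicit range form over avoidBuggyIPs, an equivalent pool definition looks like this:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lb-pool-10-208-1
  namespace: metallb-system
spec:
  addresses:
    - 10.208.1.1-10.208.1.254   # same effect as avoidBuggyIPs on this /24
```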

Advanced AddressPool configuration :: MetalLB, bare metal load-balancer for Kubernetes
Controlling automatic address allocation In some environments, you’ll have some large address pools of “cheap” private IPs (e.g. RFC1918), and some smaller pools of “expensive” IPs (e.g. leased public IPv4 addresses). By default, MetalLB will allocate IPs from any configured address pool with free addresses. This might end up using “expensive” addresses for services that don’t require it. To prevent this behaviour you can disable automatic allocation for a pool by setting the autoAssign flag to false:

Enable BGP Route Advertisement in Calico for the Service Pool

Since Calico is peering with the upstream routers, we must configure Calico to advertise the LoadBalancer service IP range. Calico uses a resource called BGPConfiguration (specifically the default BGPConfiguration) to control route advertisement. We add our pool CIDR (10.208.1.0/24) to the list of service networks to advertise. For example, using calicoctl:

calicoctl patch BGPConfiguration default --patch '{
  "spec": {
    "serviceLoadBalancerIPs": [ { "cidr": "10.208.1.0/24" } ]
  }
}'

This patch adds the /24 block to Calico’s advertisement settings. After this, each Calico node will advertise the 10.208.1.0/24 network to its BGP peers (the ToR switches). Important: Calico by default will advertise the entire block as one route. However, when services are configured with externalTrafficPolicy: Local, Calico advertises more specific routes (/32s for each service IP) from only the nodes that have active endpoints. In practice, this means upstream routers may receive multiple equal-cost routes for the /24 (one from each node), plus individual /32 routes for service IPs that Calico breaks out in Local mode. According to Calico’s documentation, for Local services the nodes with pods advertise specific /32 routes and preserve the client IP, whereas for Cluster services all nodes advertise the service CIDR with potential ECMP distribution. This behavior ensures that with Local traffic policy, nodes without a pod will not attract traffic for that service (since they won’t advertise the /32 route for it).
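For those who prefer declarative manifests over an imperative patch, the same setting can be expressed as a full BGPConfiguration resource and applied with calicoctl apply -f (or kubectl, if the Calico API server is installed). If a default BGPConfiguration already exists, merge this field into it rather than replacing other settings:

```yaml
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  serviceLoadBalancerIPs:
    - cidr: 10.208.1.0/24
```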

Note: The exact advertisement granularity can depend on Calico version. If necessary, one can explicitly list smaller prefixes or even individual IPs in serviceLoadBalancerIPs to ensure specific routes. In our case, a single /24 is acceptable since Calico is aware of externalTrafficPolicy and will handle route advertisement per service appropriately. If you prefer to advertise only in-use addresses, you could list each service’s /32 once allocated (but this requires updating Calico config dynamically, which is not common). Most deployments use a stable pool prefix for simplicity.

At this stage, MetalLB and Calico are configured: MetalLB will allocate addresses from 10.208.1.1–10.208.1.254 to services, and Calico will advertise 10.208.1.0/24 (or the specific service IPs) to the upstream network. We assume Calico’s BGP peering with the routers is already set up (this typically involves Calico BGPPeer or setting the nodes to peer with the ToR IPs; those details are outside our scope). We also assume the upstream routers are configured to accept and propagate the routes learned from the cluster, and to use multipath (ECMP) if multiple nodes advertise the same prefix – ECMP is recommended for balancing traffic in such scenarios.
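Although peering setup is out of scope here, for completeness a minimal global Calico BGPPeer for the ToR might look like the sketch below; the peer IP and AS number are illustrative and must match your fabric:

```yaml
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: tor-router
spec:
  peerIP: 10.208.0.1     # ToR router address (illustrative)
  asNumber: 64512        # ToR's AS number (illustrative)
```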

Service Exposure using LoadBalancer and externalTrafficPolicy: Local

With the infrastructure in place, exposing a service externally is straightforward. You can create a Service of type LoadBalancer as usual. For example:

apiVersion: v1
kind: Service
metadata:
  name: web-server
  namespace: demo
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: web-server
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP

Here, externalTrafficPolicy: Local is set to preserve the client source IP and avoid a second hop within the cluster. When this service is created, the MetalLB controller will promptly allocate an IP from the pool (say, 10.208.1.5) and assign it to the service. You can verify this by describing the service (kubectl get svc web-server -n demo -o wide), which will show an EXTERNAL-IP of 10.208.1.5 once allocated. Kubernetes will also allocate a nodePort for this service (and a healthCheckNodePort, since externalTrafficPolicy is Local). Kube-proxy on each node will program rules such that any traffic hitting the node on the service IP (10.208.1.5) and port 80 is forwarded to a local pod on port 8080, if one exists. If a node has no pods for this service, kube-proxy will not forward the traffic (it effectively drops it, due to the Local setting). Furthermore, Kubernetes uses a health check mechanism to ensure external load balancers (like cloud LBs) only send traffic to healthy nodes; in our on-prem BGP case, this correlates with Calico only advertising routes from nodes with endpoints, achieving a similar effect.
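A quick sketch of checking the allocation (the CLUSTER-IP, nodePort, and AGE values in the comments are illustrative):

```shell
kubectl -n demo get svc web-server
# NAME         TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
# web-server   LoadBalancer   10.96.12.34   10.208.1.5    80:31672/TCP   15s

# The Local policy also allocates a health-check node port:
kubectl -n demo get svc web-server -o jsonpath='{.spec.healthCheckNodePort}{"\n"}'
```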

At this point, the service is announced to the outside world: Calico sees that there is a service IP in the range 10.208.1.0/24 and, since we configured BGP advertisement, it will inject the route into BGP. For a Local service, Calico will advertise a route to 10.208.1.5 via each node that’s running a web-server pod. Typically, this is implemented by advertising a /32 route from those specific nodes. Upstream, the routers now have routes to 10.208.1.5 pointing to the IPs of the relevant Kubernetes nodes. If multiple nodes have web-server pods, the router may have multiple next-hops for 10.208.1.5 and can distribute traffic between them (ECMP). If only one node has the pods, only that node announces the route, and all traffic will go there.

By using externalTrafficPolicy: Local, we ensure that the node does not do source NAT on incoming connections. The pod will see the real client IP as the source address. This is important for logging, security rules, and any logic that depends on the origin IP. It also means return traffic can go directly back out to the client without needing to be rerouted through a different node (avoiding the “second hop” scenario). In contrast, if we had used the default (Cluster) policy, any node could accept the traffic and then potentially forward it to a pod on another node, changing the source IP to the node’s IP in the process (SNAT), which we want to avoid here.

Traffic Flow Verification

Once a LoadBalancer service is deployed and assigned an IP, it’s crucial to verify that everything works as expected. We will walk through several validation steps:

BGP Route Advertisement

First, confirm that the service IP (or its containing prefix) is being advertised via BGP by the Calico agents. On each node, you can check the BGP session status using Calico’s tools. For example, run calicoctl node status – this shows each BGP peer and its session state (Established or otherwise); note that it does not list individual routes. If the service is using externalTrafficPolicy: Local, check that each node with a service pod is advertising either the /24 or a /32 for the service IP. The most direct approach is to inspect the routing table on the ToR router: you should find an entry for 10.208.1.0/24 (or 10.208.1.5/32, etc.) pointing towards the Kubernetes node(s) via BGP. For instance, a command like show ip route 10.208.1.0/24 on the router might reveal multiple next-hops (each node’s IP) if multiple nodes are advertising it. This confirms that Calico is advertising the LoadBalancer IP pool as configured. (If this route is missing, see the Troubleshooting section.)
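In practice, these checks can be sketched as follows. The calico-node pod name is illustrative, and the namespace may be kube-system or calico-system depending on how Calico was installed:

```shell
# BGP session state, run on a cluster node with calicoctl installed:
sudo calicoctl node status

# Routes from BIRD's point of view, from inside a calico-node pod:
kubectl -n calico-system exec calico-node-x7k2p -- \
  birdcl -s /var/run/calico/bird.ctl show route
```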

External Connectivity Test

Next, test that an external client can reach the service. Using a machine outside the cluster (but within the network that the ToR serves), try to connect to the service IP. For example, if it’s a web service, run curl http://10.208.1.5/ (replace with your service’s IP and port as appropriate). The request should be routed to one of the Kubernetes nodes and then to the pod, and you should get a response. Success here indicates that the BGP advertisement and routing are correctly directing traffic into the cluster. If the connection fails, one thing to check is any firewall rules: ensure that the upstream network and the node itself allow traffic on the service’s port. Remember that, under the hood, the traffic is hitting the node on the NodePort (which is usually a high port). Kubernetes takes care of opening the NodePort on 0.0.0.0, but if the host has firewall (iptables rules or cloud security groups), they must permit the traffic. In most bare-metal setups with Calico, Calico’s own network policy could also block traffic if not allowed – ensure no network policy is inadvertently denying the external traffic.

Source IP Preservation

To verify that the client’s source IP is preserved (one of the key benefits of using Local mode), you can inspect the traffic reaching the pod. One simple method is to check the logs of the application pod if it logs client connections. For example, if the pod runs an Nginx or Apache, the access logs should show the real IP of the client (e.g., 10.0.0.25) rather than an internal node IP. If you don’t have such logging, you can deploy a test server that outputs the client address (for instance, a small Python/Go web server that prints the X-Forwarded-For or the remote address). You could also exec into the pod and use tools like tcpdump or ss to observe the source addresses of incoming connections.
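The "small test server" suggested above can be a few lines of Python. The following is a minimal sketch (the port and pod deployment details are up to you) that replies with the address the TCP connection actually came from:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class ClientIPHandler(BaseHTTPRequestHandler):
    """Replies with the peer address of the incoming TCP connection."""

    def do_GET(self):
        # With externalTrafficPolicy: Local, this should be the real
        # client IP, not a node's SNAT address.
        body = f"client address: {self.client_address[0]}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the pod logs quiet

# To run inside the pod, e.g.:
#   HTTPServer(("0.0.0.0", 8080), ClientIPHandler).serve_forever()
```

Deploy this behind the LoadBalancer service and curl the external IP from outside the cluster; the response should echo the external client’s address, not a node IP.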

Expected result: the source IP should match the external client’s IP. If you see the node’s IP instead, then likely the service is falling back to Cluster mode or traffic is being rerouted – recheck that externalTrafficPolicy is set to Local on the service, and that the pod indeed runs on the node that received the traffic. The Kubernetes docs note that externalTrafficPolicy=Local will preserve client IP and drop packets on nodes without endpoints, so each successful connection’s source IP preservation indicates traffic hit a node with a pod (as designed).

Node Failover Scenario

One of the advantages of using BGP for service routing is robust failover. We should test what happens if the node serving the LoadBalancer traffic goes down or the pod is rescheduled to a different node. To simulate this, perform a controlled failover test:

  • Pod relocation: Cordon the node that is currently receiving traffic (kubectl cordon <node>), then delete the pod so it gets rescheduled on another node. Once the pod starts on a new node, Calico should detect that the new node now has an endpoint for the service and the old node does not. Calico on the old node will withdraw the BGP route for the service IP, and Calico on the new node will announce it. This transition is usually very fast (within seconds). Verify by checking routes on the router – the next-hop for 10.208.1.5 should update to the new node’s IP, or an additional route for the /32 from the new node appears while the old one disappears. Now try accessing the service IP again from the client. It should still work, now being served by the pod on the new node. In ideal cases, this failover happens with minimal disruption (a few dropped packets during convergence).
  • Node down: For a more drastic test, simulate a node failure (if possible in a staging environment). This could be shutting down the node or stopping the Calico BGP service on that node. The upstream router’s BGP session with that node will go down, and it will remove the route to 10.208.1.5 via that node. If Kubernetes reschedules the pod to another node, that node’s Calico will announce the route. If there was already a second pod on another node, traffic would fail over to that node’s route automatically. This kind of high availability is a strong feature of BGP-based load balancing – it avoids having a single choke point. After node failover, confirm that the service remains reachable and that the client IP is still preserved.
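The pod-relocation test above can be scripted roughly as follows (the node name and pod label are illustrative):

```shell
kubectl cordon node-a                          # stop scheduling onto the current node
kubectl -n demo delete pod -l app=web-server   # force rescheduling elsewhere
kubectl -n demo get pods -o wide --watch       # wait until Running on a new node
# ...re-run the client-side curl and re-check the router's route table...
kubectl uncordon node-a                        # restore the node when done
```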

Throughout these tests, also keep an eye on the MetalLB controller’s logs (to see IP assignment events or errors) and Calico’s logs (for any BGP session issues or route advertisement logs). Both MetalLB and Calico provide metrics and diagnostics (MetalLB has metrics for allocated IPs; Calico’s calico-node status can show route counts, etc.) that can aid in verifying that everything is functioning.

Troubleshooting Common Issues

Even with the correct setup, a few issues can arise. Here are common problems and how to address them:

  • MetalLB assigned an IP ending in .0 or .255: If you notice a service got the .0 or .255 address (e.g., 10.208.1.0 or 10.208.1.255 in our pool), it could lead to connectivity problems. Many networks consider the .0 and .255 of a /24 as network and broadcast addresses respectively, and some routers or clients will refuse to route them due to “smurf” attack protections. To avoid this, always set avoidBuggyIPs: true in the IPAddressPool (or explicitly exclude those addresses). If one of those addresses was already allocated, you should remove that IP from the service. MetalLB won’t automatically reassign a new IP if the pool changes for an existing service; you might need to delete and recreate the service (or cycle the MetalLB controller) to force a new allocation. In summary, prevention is best: configure the pool to skip .0/.255 from the start.
  • No route to the LoadBalancer IP (service not reachable): If external clients cannot reach the service at all, and pinging the service IP yields no response, it’s likely that the BGP advertisement isn’t working. Check the Calico BGP configuration to ensure the serviceLoadBalancerIPs field includes the correct CIDR. A common oversight is forgetting to configure this, meaning Calico is not announcing the service range. If using Calico’s operator-based installation, note that this setting still lives in the default BGPConfiguration resource; there is no equivalent field in the operator’s Installation resource. Also verify that the BGP peering between the nodes and the router is established (using calicoctl node status or looking at the router’s BGP status). If the BGP session is down, no routes will be advertised at all. Another possibility is that the MetalLB controller did not allocate an IP (service still in pending state for external IP). In that case, ensure the MetalLB controller is running and the IPAddressPool is correctly created (check kubectl get ipaddresspool -A). Any errors in MetalLB’s controller logs (e.g., if the pool is misconfigured or no IP is available) should be resolved. Once the BGP route is in place, the service IP should start responding. You can test route propagation by attempting a traceroute or checking the router’s routing table for the service IP.
  • Connectivity works but is intermittent or one node is blackholing traffic: This can happen if a node with no endpoints is somehow still advertising the service or receiving traffic. In theory, Calico’s per-node advertisement for Local services should prevent this. However, if all nodes advertise a broad /24 and the upstream router does ECMP, it might sometimes send traffic to a node without an endpoint. Those packets would be dropped by kube-proxy (by design), but from the client’s perspective it appears as intermittent packet loss. To mitigate this, ensure that Calico is actually doing per-service /32 advertisements. You might need to upgrade Calico if using an older version where this was not functioning correctly (Calico 3.18+ should support it, but a bug or misconfiguration could cause advertisement of only the aggregate). If necessary, as a workaround, you could run the service’s pods on every node (for example, via a DaemonSet) so that every advertising node has a local endpoint. Another angle is to check that the kube-proxy health check node port is functioning – some external load balancers use it to determine node readiness. Although we rely on BGP here, if a cloud provider’s health-check mechanism is used in tandem, ensure nodes without endpoints fail the health check so traffic isn’t sent to them. In pure BGP, this should be automatic via route withdrawal.
  • BGP session or route flapping causing connection drops: If you observe that connections drop periodically or fail over too often, it might be due to BGP stability. Check the BGP timers (hold time, keepalive) in Calico’s configuration – the default hold timer might be 180s which is quite long for failover. However, Calico might be using a shorter graceful restart timer when withdrawing service routes. Tune BGP timers if needed to achieve faster failover, but beware of too aggressive settings causing flaps. Also verify that BGP multipath is enabled on the upstream router. Without multipath, the router will only use one node as the next-hop for the /24, and if that node goes down, it has to converge to another, causing a brief outage. With multipath, it can load balance and instantly switch paths if one fails. On Cisco/FRR, this usually means enabling maximum-paths for BGP.
  • Firewall and security issues: In some cases, everything on the Kubernetes side is correct, but network policies or host firewalls block the traffic. Calico’s default is to allow all traffic if no network policy is in place (policy is “deny” only if configured so). Ensure no NetworkPolicy is denying the external ingress. On the host, if running something like firewalld or iptables rules outside of kube-proxy, make sure traffic to the nodePort or BGP ports isn’t blocked. Calico BGP uses TCP port 179; if that was firewalled, the BGP session wouldn’t come up. The service traffic will come to nodePort (30000-32767 range by default), which should be allowed. If using externalTrafficPolicy: Local, Kubernetes by default opens the healthCheckNodePort on each node; if you have an upstream load balancer performing health checks (not in pure BGP scenario), ensure those can reach that port. Although our scenario doesn’t use an external load balancer, healthCheckNodePort is still allocated (visible in kubectl get svc -o yaml output) and kube-proxy will respond with 200 OK on that port on nodes with endpoints. This is primarily for cloud LBs; for us, it’s not actively used unless some custom monitoring takes advantage of it.
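On an FRR-based upstream router, enabling ECMP as mentioned in the flapping item above looks roughly like the following (AS numbers and the neighbor address are illustrative):

```
router bgp 64512
 neighbor 10.208.0.11 remote-as 64513
 address-family ipv4 unicast
  maximum-paths 8
 exit-address-family
```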

By addressing these issues, one can achieve a stable and robust LoadBalancer setup. The combination of MetalLB (for IP management) and Calico (for BGP route distribution) provides an elegant, cloud-native way to expose services in bare-metal clusters, with the networking intelligence handled at the routing layer. This setup is suitable for advanced Kubernetes deployments where administrators have control over the network infrastructure and need to maximize performance and resilience for service exposure. With proper validation and monitoring, it can run in production to deliver cloud-like LoadBalancer services on premises.
