November 10, 2020

Cilium 1.9: Maglev, Deny Policies, VM Support, OpenShift, Hubble mTLS, Bandwidth Manager, eBPF Node-Local Redirect, Datapath Optimizations, and more

We are excited to announce the Cilium 1.9 release. A total of 2816 commits have been contributed by a community of 251 developers, many of whom made their first contributions this cycle. Cilium 1.9 brings with it several brand new features:

  • Maglev Load Balancing: Maglev provides load balancing with consistent hashing for high-availability scenarios by dynamically adapting to environments where nodes come and go. Connections are consistently balanced to backends even if the packets arrive at different load balancing nodes. (More details)
  • Deny Network Policies: Users can now define network policies that explicitly reject traffic from specific sources or to specific destinations. These policies may be configured via CiliumNetworkPolicy or CiliumClusterwideNetworkPolicy, allowing multi-tenant environments to have baseline restrictions on connectivity and also delegate allow policies to application teams. (More details)
  • VM/Metal Support: Cilium can now be deployed on any VM or baremetal node to connect that node to the Cilium cluster and represent external workloads running on such nodes as if the workload was running as a Pod inside the Kubernetes cluster. These nodes also gain access to Kubernetes services, and the visibility and policy enforcement scope of Cilium is extended to cover the VMs and baremetal nodes. (More details)
  • Bandwidth Manager: A new bandwidth manager automatically optimizes TCP Congestion Control and other network settings for improved latency and throughput. Fair queueing is automatically enabled and rate limiting can be configured including support for the kubernetes.io/egress-bandwidth annotation. (More details)
  • OpenShift Support: A new guide describes the installation of Cilium on OpenShift. Red Hat Universal Base Image (UBI) based builds and an operator to simplify installation will become available in the coming weeks. (More details)
  • Hubble mTLS Automation: Hubble is now capable of automatically generating and distributing TLS certificates between Hubble agents to secure the communication between all Hubble components. (More details)
  • eBPF-Based Node-Local DNS and KIAM: The new eBPF-based Local Redirect Policies make it possible to build node-local DNS resolvers, KIAM, and other node-local services entirely with eBPF instead of relying on iptables. (More details)
  • Datapath Optimizations (iptables bypass): We have added two new eBPF helpers to the Linux kernel that allow implementing the direct routing datapath entirely in eBPF, bypassing netfilter/iptables altogether. This leads to significant improvements in throughput and latency, in particular for single stream TCP sessions. Cilium automatically enables these optimizations on kernel 5.10 or later. (More details)
  • Lots of other improvements: This release brings lots of additional improvements such as High Availability for Cilium Operator to improve fault tolerance, Hubble UI improvements, Transparent Proxy Support in eBPF, and a new Performance Testing Framework.

What is Cilium?

Cilium is open source software for transparently providing and securing the network and API connectivity between application services deployed using Linux container management platforms such as Kubernetes.

At the foundation of Cilium is a new Linux kernel technology called eBPF, which enables the dynamic insertion of powerful security, visibility, and networking control logic within Linux itself. eBPF is utilized to provide functionality such as multi-cluster routing, load balancing to replace kube-proxy, transparent encryption as well as network and service security. Besides providing traditional network-level security, the flexibility of eBPF enables security with the context of application protocols and DNS requests/responses. Cilium is tightly integrated with Envoy and provides an extension framework based on Go. Because eBPF runs inside the Linux kernel, all Cilium functionality can be applied without any changes to the application code or container configuration.

See the section Introduction to Cilium for a more detailed general introduction to Cilium.

eBPF Summit 2020 - Recordings are now available

We hosted the first ever eBPF Summit in October this year. The recordings including all keynotes and lightning talks are now available on the eBPF Summit website.

Maglev Load Balancing

Contributed by Daniel Borkmann and Martynas Pumputis

In Cilium 1.8 we extended Cilium's eBPF-based north-south load balancer to run at the XDP layer and accelerate handling for Kubernetes service types NodePort, LoadBalancer, and services with externalIPs. This significantly reduces CPU overhead and supports handling millions of packets per second efficiently while being co-located with regular user workloads. The new release complements the north-south load balancer by adding support for Maglev's consistent hashing algorithm for the backend selection.

By default, Kubernetes service load balancing implemented by kube-proxy or Cilium selects service backends randomly and ensures that the traffic remains sticky to the chosen backend. The issue with this scheme is that when a load balancing node fails, the upstream load balancer selects a different load balancing node, which has no prior context on which backend is currently serving the connection. This eventually leads to unexpected disruptions on connection-oriented protocols like TCP, as client connections are reset by the newly selected backends.

Maglev consistent hashing minimizes such disruptions by ensuring that each load balancing node has a consistent view and ordering for the backend lookup table such that selecting the backend through the packet's 5-tuple hash will always result in forwarding the traffic to the very same backend without having to synchronize state with the other nodes. This not only improves resiliency in case of failures but also achieves better load balancing properties since newly added nodes will make the same consistent backend selection throughout the cluster.

Aside from that, the Maglev consistent hashing algorithm ensures even balancing among backends as well as minimal disruptions in the case where backends are added or removed. Specifically, a flow is highly likely to choose the same backend after adding or removing a backend for a service as it did before the operation. Upon backend removal, the backend lookup tables are reprogrammed with minimal changes for unrelated backends, that is, typical configurations provide the upper bound of at most 1% tolerable difference in the reassignments.
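The table-population idea behind Maglev can be sketched in a few lines of Python. This is a simplified illustration, not Cilium's eBPF implementation: the names build_table and select_backend, the SHA-256-based hash, and the small prime table size of 251 are all assumptions made for readability.

```python
import hashlib


def _h(name: str, seed: str) -> int:
    # Stable 64-bit hash. SHA-256 is a stand-in chosen for this sketch only.
    digest = hashlib.sha256(f"{seed}:{name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_table(backends, m=251):
    """Populate a Maglev-style lookup table of prime size m."""
    if not backends:
        return []
    # Sorting gives every node the same fill order regardless of input order.
    backends = sorted(backends)
    offsets = [_h(b, "offset") % m for b in backends]
    skips = [_h(b, "skip") % (m - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table = [None] * m
    filled = 0
    while True:
        # Backends take turns, each claiming the next free slot
        # in its own permutation of the table.
        for i, b in enumerate(backends):
            slot = (offsets[i] + next_idx[i] * skips[i]) % m
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % m
            table[slot] = b
            next_idx[i] += 1
            filled += 1
            if filled == m:
                return table


def select_backend(table, five_tuple):
    """Pick a backend purely from the flow's 5-tuple hash."""
    key = "|".join(map(str, five_tuple))
    return table[_h(key, "flow") % len(table)]
```

Because the table depends only on the set of backends, two load balancing nodes that build it independently end up with identical tables and therefore pick the same backend for the same 5-tuple, without exchanging any state.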

In order to support Maglev in Cilium's eBPF load balancer, the datapath backend selection algorithm is now pluggable between maglev and random. Maglev achieves its better resiliency at the cost of higher memory consumption for its potentially large lookup tables. Our implementation reduces this memory overhead by "compressing" the lookup tables into an eBPF array-in-hash map lookup instead of using a single flat array for all services. This allows us to scale memory on demand and supports dynamic provisioning of services, as is the case with Kubernetes. To take memory reduction even further, we also extended the eBPF map-in-map infrastructure in the Linux kernel to enable dynamic inner array map sizes. This paves the way for the next Cilium release to annotate Kubernetes service specifications with hints on whether a given service has a small or large number of backends.

Maglev-based backend selection for Cilium's load balancer can be enabled by setting the Helm option loadBalancer.algorithm=maglev. Further information and tunables on Maglev can be found in its getting started guide.

Deny-based Network Policy

Contributed by André Martins

By default, following the behavior defined by Kubernetes, Cilium allows all traffic to and from a Pod until a Kubernetes or Cilium network policy selects that particular Pod. This is powerful because it allows defining which Pods and other entities are allowed to communicate with the selected Pod. However, it does not cover cases where it is useful to explicitly define which network peers must not be able to communicate.

The 1.9 release brings deny-based network policies to Cilium. These policies will take precedence over any other rules as they are intended to explicitly deny traffic. As an example, if a cluster is under attack by an external entity, operators can immediately deploy the following policy to stop the attack:

apiVersion: "cilium.io/v2"
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: "external-lockdown"
spec:
  endpointSelector: {}
  ingressDeny:
  - fromEntities:
    - "world"
  ingress:
  - fromEntities:
    - "all"

More information about deny policies can be found in the relevant documentation.
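The precedence rule described above, where a deny rule wins over any allow rule, can be pictured with a small Python sketch. It is a toy model only: the Policy class, the verdict function, and the representation of a traffic source as a set of entity names are assumptions for illustration and do not mirror Cilium's policy engine.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    # Entities named under ingress / ingressDeny (hypothetical simplified model).
    ingress_allow: set = field(default_factory=set)
    ingress_deny: set = field(default_factory=set)


def verdict(policy: Policy, source_entities: set) -> str:
    """Evaluate ingress for a source described by the entities it belongs to."""
    # Deny rules are checked first and take precedence over any allow rule.
    if policy.ingress_deny & source_entities:
        return "deny"
    if policy.ingress_allow & source_entities:
        return "allow"
    # Once a policy selects an endpoint, unmatched traffic is not allowed.
    return "deny"


# Mirrors the "external-lockdown" example: allow "all", but deny "world".
lockdown = Policy(ingress_allow={"all"}, ingress_deny={"world"})
```

In this model a source outside the cluster belongs to both "world" and "all" and is denied, while an in-cluster source that only matches "all" is still allowed.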

VM Support / External Workloads

Contributed by Jarno Rajahalme

A major focus of Cilium 1.9 has been to enable Cilium to run in other environments where users are already running workloads. VM support, or more generally speaking, the ability to join any node to the Cilium cluster, allows integrating non-Kubernetes workloads with Cilium.

In many practical cases it is infeasible to move all service components to a Kubernetes cluster at once. So far, Cilium has classified ingress from any non-Kubernetes workload either as WORLD, making it impossible to distinguish your own external nodes from any random Internet host, or via a fromCIDR rule that distinguishes traffic from specific IP addresses or ranges. New support for external workloads enables non-Kubernetes nodes to join the cluster, allowing you to apply label-based policies to the traffic between the cluster Pods and such external nodes. External nodes joining the cluster also get access to the cluster's services and can resolve cluster names using the cluster's kube-dns. This is a beta feature that will be further enhanced in coming releases. For more information, see the new Getting Started Guide, Setting up Support for External Workloads.

OpenShift Support

Contributed by Ilya Dmitrichenko

Support for OpenShift has been requested by many community members for a long time now. While it was already possible to run on OpenShift, the lack of a good guide made it challenging to get started. This release introduces a Getting Started Guide for OpenShift OKD, the primary aim being to show administrators how to install Cilium on new OKD clusters. The guide relies on the openshift-install tool and has been tested on most of the major cloud providers. This documentation has also been made available in the Cilium 1.8 documentation. At the time of the 1.9.0 release, work is ongoing to enable further integration with OpenShift Container Platform by automating installation steps with an operator that will be submitted for Red Hat certification.

Making Observability Easier with Hubble

Contributed by Alexandre Perrin, Robin Hahling, Sebastian Wicki and others

Hubble, the network and security observability component built into Cilium, has gained many quality of life improvements with this release, making it even easier to use and deploy.

In the previous release of Cilium, we announced initial support for cluster-wide visibility via Hubble Relay. During the Cilium 1.9 development cycle, we have made continuous improvements to Hubble Relay to ensure it works reliably. Many of these changes were backported and have thus already been available in recent minor releases of Cilium 1.8.

In Cilium 1.9, connections from Hubble Relay to nodes are secured using mutual TLS (mTLS) by default. Hubble Relay has also gained support for optional TLS/mTLS when serving clients such as the Hubble CLI. With these changes in place, we are happy to announce that cluster-wide visibility with Hubble Relay has reached General Availability (GA) with Cilium 1.9. We are excited for more users to try it out!

We have made many small improvements to ensure deploying and operating Hubble remains easy. For example, TLS certificates required for mTLS are automatically generated and distributed by default. Of course, sophisticated users can still employ their own certificate infrastructure. This is notably required in a Cluster Mesh scenario to get cross-cluster visibility via Hubble Relay.

As another notable improvement, support for hot reloading of TLS certificates was added. This means that new TLS certificates are automatically picked up by Cilium agent and Hubble Relay Pods without requiring a restart or breaking any existing connection. This allows for painless and frequent TLS certificate renewals.
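The general idea of hot reloading can be illustrated with a short Python sketch that re-reads a certificate file whenever it changes on disk. This is not how Cilium or Hubble Relay implement the feature. The ReloadingFile class and its change detection via modification time and size are assumptions chosen to keep the example small.

```python
import os


class ReloadingFile:
    """Caches a file's bytes and re-reads them when the file changes on disk."""

    def __init__(self, path):
        self.path = path
        self._stamp = None  # (mtime_ns, size) of the version currently cached
        self._data = b""

    def get(self) -> bytes:
        st = os.stat(self.path)
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp != self._stamp:
            # The file was rotated: load the new content for future handshakes.
            with open(self.path, "rb") as f:
                self._data = f.read()
            self._stamp = stamp
        return self._data
```

A server built this way would call get() when setting up each new connection, so connections that are already established keep running on the old certificate while new ones pick up the renewed one.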

Hubble UI

Contributed by Dmitry Kharitonov and Renat Tuktarov

Hubble UI, the graphical user interface for Hubble, has been fully reworked for this release. Its backend has been entirely rewritten in Go for overall better performance, which means it now also scales better with the workload and cluster size. The frontend and service dependency map interface has been redesigned to provide better user experience, especially in busy namespaces with lots of active services.

New Hubble UI Service Dependency Map

The new Hubble UI makes it easier to find dropped flows within a namespace by highlighting them as special edges in the service map.

New Hubble UI Dropped Flows View

Hubble CLI

Contributed by Glib Smaga, Michi Mutsuzaki, Robin Hahling, Sebastian Wicki and others

Cilium 1.9 also coincides with a new release of the Hubble CLI. Hubble CLI 0.7 supports all new additions to the Hubble and Hubble Relay API available in Cilium 1.9.

One of the most visible changes is that Hubble Relay now reports the number of nodes to which it is connected. This information is available when using the status subcommand:

$ hubble status
Healthcheck (via localhost:4245): Ok
Current/Max Flows: 81920/81920 (100.00%)
Flows/s: 9.80
Connected Nodes: 20/20

In case of connection failures, a list of unavailable nodes is shown. This provides a good starting point for troubleshooting any connectivity issues between Hubble Relay and the nodes:

$ hubble status
Healthcheck (via localhost:4245): Ok
Current/Max Flows: 24576/24576 (100.00%)
Flows/s: 9.80
Connected Nodes: 6/20
Unavailable Nodes: 14
  - kind-worker10
  - kind-worker11
  - kind-worker12
  - kind-worker14
  - kind-worker15
  - kind-worker16
  - kind-worker18
  - kind-worker7
  - kind-worker8
  - kind-worker9
  - and 4 more...

Hubble CLI now also supports new flags for HTTP method and path filters:

$ hubble observe --namespace galaxy --http-method POST --http-path /v1/request-landing
TIMESTAMP             SOURCE                                DESTINATION                           TYPE            VERDICT     SUMMARY
Oct 29 15:03:01.839   galaxy/tiefighter:59238               galaxy/deathstar-87c949bbd-gmw98:80   http-request    FORWARDED   HTTP/1.1 POST http://deathstar.galaxy.svc.cluster.local/v1/request-landing
Oct 29 15:03:01.840   galaxy/deathstar-87c949bbd-gmw98:80   galaxy/tiefighter:59238               http-response   FORWARDED   HTTP/1.1 200 1ms (POST http://deathstar.galaxy.svc.cluster.local/v1/request-landing)

Another quality of life improvement in the Hubble CLI is the new hubble config sub-command to inspect and edit its config file, allowing users to persist settings, such as the newly added support for TLS-protected API endpoints.

While most users will want to use the Hubble CLI to submit cluster-wide queries against Hubble Relay, another notable change in Cilium 1.9 is that the Cilium Pods now also expose the Hubble API locally by default, making it easier to troubleshoot node-local issues.

This new version of the Hubble CLI contains many more features (jsonpb support, support for fish and powershell completion, addition of flows per second output in the status subcommand, Linux arm and arm64 binaries, and more). See its changelog for the full list of changes.

Node-Local Redirect Policies

Contributed by Aditi Ghag and Weilong Cui

The 1.9 release introduces Cilium Local Redirect Policies, which enable application Pod traffic destined to an IP address and port/protocol tuple or to a Kubernetes service to be redirected locally to a backend Pod on the same node. The policies can be gated by the Kubernetes Role-Based Access Control (RBAC) framework.

There are two types of Local Redirect Policies. When traffic for a Kubernetes service needs to be redirected, use the ServiceMatcher type. When traffic matching an IP address and port/protocol tuple that does not belong to any Kubernetes service needs to be redirected, use the AddressMatcher type.

apiVersion: "cilium.io/v2"
kind: CiliumLocalRedirectPolicy
metadata:
  name: "nodelocaldns"
  namespace: kube-system
spec:
  redirectFrontend:
    serviceMatcher:
      serviceName: kube-dns
      namespace: kube-system
  redirectBackend:
    localEndpointSelector:
      matchLabels:
        k8s-app: node-local-dns
    toPorts:
      - port: "53"
        name: dns
        protocol: UDP
      - port: "53"
        name: dns-tcp
        protocol: TCP

Local Redirect Policies are in beta. Follow the local-redirect-policy guide to get more details, including how to configure them for use cases like setting up NodeLocal DNSCache and KIAM proxy.

High Availability for cilium-operator

Contributed by Deepesh Pathak

The 1.9 release of Cilium brings High Availability for cilium-operator deployments. The Cilium operator is now an integral part of a Cilium installation in Kubernetes environments, and the agent requires the operator to be running before it can start functioning properly. This dependency calls for a more robust deployment of cilium-operator in the cluster. The Cilium operator uses the Kubernetes leader election library in conjunction with lease locks to provide an HA cluster of cilium-operator instances. The capability is supported on Kubernetes versions 1.14 and above and is Cilium's default behavior for this release.

The number of replicas for the HA deployment can be configured using the Helm option operator.replicas:

$ helm install cilium cilium/cilium --version 1.9.0 \
    --namespace kube-system \
    --set operator.replicas=3

$ kubectl get deployment cilium-operator -n kube-system
NAME              READY   UP-TO-DATE   AVAILABLE   AGE
cilium-operator   3/3     3            3           46s
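The lease-based leader election mentioned above can be pictured with the following Python sketch. It is a simplified model, not the Kubernetes leader election library: the Lease fields, the 15-second duration, and the function name try_acquire_or_renew are assumptions for illustration, and the atomic compare-and-update that the API server provides is omitted.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    holder: str = ""          # identity of the current leader, empty if none
    renew_time: float = 0.0   # time of the last successful renewal, in seconds
    duration: float = 15.0    # how long a renewal stays valid (assumed value)


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Return True if candidate is the leader after this attempt."""
    expired = now >= lease.renew_time + lease.duration
    if lease.holder in ("", candidate) or expired:
        # Either the lease is free, already ours, or the old leader stopped
        # renewing it. In all three cases the candidate takes (or keeps) it.
        lease.holder = candidate
        lease.renew_time = now
        return True
    return False
```

Each replica runs this check periodically. Only the replica holding the lease performs the operator's work, and a standby takes over once the leader stops renewing.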

Bandwidth Manager

Contributed by Daniel Borkmann

The new Cilium Bandwidth Manager is responsible for managing traffic more efficiently with the goal of improving overall application latency and throughput. It works in two areas: the upper protocol layer (TCP) and the queueing discipline layer.

When the Bandwidth Manager is enabled, it switches the TCP congestion control algorithm by default to BBR which achieves higher bandwidths and lower latencies in particular for Internet-facing traffic. It configures the kernel's networking stack to more server-oriented sysctl settings that have proven beneficial in production environments. And it reconfigures the traffic control queueing discipline (Qdisc) layer to use multi-queue Qdiscs with Fair Queue (FQ) on all external-facing network devices used by Cilium. With switching to FQ, the Bandwidth Manager also implements support for Earliest Departure Time (EDT) rate-limiting with the help of eBPF and now natively supports the kubernetes.io/egress-bandwidth Pod annotation.

This also removes the need for chaining the bandwidth CNI plugin, which has scalability limitations given its use of TBF (Token Bucket Filter). With the EDT-based model, global locking in the Qdisc layer, in particular with multi-queue NICs, can be avoided. Cilium's eBPF datapath classifies network traffic into a per-Pod aggregate, which then enforces the user-defined kubernetes.io/egress-bandwidth rate by setting earliest departure timestamps on the network packets on egress, shortly before passing the packets to the FQ leaf Qdiscs. The FQ Qdiscs maintain per-flow state and schedule packet departure so that no packet is sent out earlier than its timestamp dictates. Through the flexibility of eBPF, the classification into the Pod aggregate works not only for direct routing, but also in case of tunnelling or use of L7 proxies.
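The timestamp arithmetic behind EDT-based rate limiting can be sketched in Python as follows. This is an illustrative model, not Cilium's eBPF code: the names parse_bandwidth and PodAggregate, and the decimal interpretation of the k/M/G suffixes, are assumptions, and safeguards such as bounding how far timestamps may run into the future are left out.

```python
def parse_bandwidth(value: str) -> int:
    """Convert a value such as '50M' into bits per second (decimal suffixes assumed)."""
    suffixes = {"k": 10**3, "M": 10**6, "G": 10**9}
    if value and value[-1] in suffixes:
        return int(value[:-1]) * suffixes[value[-1]]
    return int(value)


class PodAggregate:
    """Tracks the earliest time the next packet of one Pod may leave."""

    def __init__(self, rate_bps: int):
        self.rate_bps = rate_bps
        self.next_departure_ns = 0

    def stamp(self, now_ns: int, packet_bytes: int) -> int:
        # A packet may leave now, unless earlier packets already used up
        # the budget, in which case it is scheduled behind them.
        departure = max(now_ns, self.next_departure_ns)
        # Time this packet occupies on the wire at the configured rate.
        delay_ns = packet_bytes * 8 * 1_000_000_000 // self.rate_bps
        self.next_departure_ns = departure + delay_ns
        return departure
```

At 50 Mbit/s, a 1500-byte packet occupies 240 microseconds, so back-to-back packets from one Pod receive departure times spaced 240 microseconds apart, and a scheduler that honors those timestamps paces the traffic without a shared token bucket.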


Evaluating the application latency under rate-limiting with HTB (Hierarchical Token Bucket) compared to eBPF with FQ shows that the CPU utilization can be reduced significantly while improving transmission latency. Latency reductions of about 20x have been observed for the 95th percentile as well as reductions of about 10x for the 99th percentile when eBPF and FQ are used in combination.

Below is an example deployment for an application Pod whose egress bandwidth is limited to 50 Mbit/s thanks to the kubernetes.io/egress-bandwidth annotation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: netperf
spec:
  selector:
    matchLabels:
      run: netperf
  replicas: 1
  template:
    metadata:
      labels:
        run: netperf
      annotations:
        kubernetes.io/egress-bandwidth: "50M"
    spec:
      containers:
      - name: netperf
        image: cilium/netperf
        ports:
        - containerPort: 12865

Cilium's Bandwidth Manager can be enabled by setting the Helm option bandwidthManager=true. Further information on the Bandwidth Manager can be found in its getting started guide.

Virtual Ethernet Device Optimization with eBPF

Contributed by Daniel Borkmann

During the course of the 1.9 development cycle, we have performed a number of eBPF datapath performance optimizations. One that stands out in particular is the improvement of raw performance for network-namespaced Pods connected to the host namespace through a veth device pair, as is the default operation mode in Cilium. When Cilium is used in direct routing mode, traffic that is ingressing to or egressing from Pods is passed up the network stack inside the host namespace in order to let the routing layer perform the forwarding. Historically, this was a necessity mainly for letting the netfilter subsystem masquerade egressing Pod traffic. This masquerading also required the connection tracker to see traffic from both directions in order to avoid drops from invalid connections (asymmetric visibility of traffic in the connection tracker would apply here, too).

Thanks to advances in Cilium's datapath in prior releases, we are able to perform masquerading natively in eBPF code. Given that, in the course of 1.9 development, we have extended the eBPF networking functionality for the v5.10 kernel (part1 and part2) in order for Cilium to handle forwarding right in the tc eBPF layer instead of having to push packets up the host stack. This results in significant gains in single stream throughput as well as significant reductions in latency for request/response-type workloads. The two helpers bpf_redirect_peer() and bpf_redirect_neigh(), which we have added to the Linux kernel and to the Cilium 1.9 code base, enable the new packet handling in the host namespace.

The bpf_redirect_peer() helper enables switching network namespaces from the ingress of the NIC to the ingress of the Pod without a software interrupt rescheduling point when traversing the network namespace. The physical NIC can thus push packets up the stack into the application's socket residing in a different Pod namespace in one go. This also leads to quicker application wake-up for picking up the received data. Similarly, rescheduling points are reduced from 2 to 1 for local Pod-to-Pod communication, resulting in better latency there as well.

The bpf_redirect_neigh() helper handles a Pod's egress traffic by injecting the traffic into the Linux kernel's neighboring subsystem, which finds the next hop and resolves the layer 2 addresses for the network packet. Performing the forwarding only in the tc eBPF layer and not pushing the packet further up the networking stack also provides proper back pressure for the TCP stack and feedback for TCP's TSQ (TCP Small Queues) mechanism to reduce potential excessive queueing of TCP packets. That is, feedback is given to the TCP stack that the packet has left the node instead of inaccurately providing it too early when it would be pushed up to the host stack for routing. This is now possible because the packet's socket association can be kept intact when it is passed down into the NIC driver.



As can be seen from initial results, the single stream TCP throughput for a Pod to remote Pod session under direct routing doubles for the nodes under test when using Cilium 1.9's eBPF extensions on a v5.10 kernel, as opposed to having both directions handled by the host stack's forwarding. Similarly, the TCP request/response transaction performance improved by almost 3x for the Pods under test when avoiding the host stack.

By default, Cilium automatically probes the underlying kernel and, if the eBPF kernel extensions are available for the configuration, transparently enables them for new deployments. This requires the use of Cilium's eBPF kube-proxy replacement as well as eBPF-based masquerading, given that netfilter in the host namespace is then bypassed. Users can opt out of this behavior through the Helm bpf.hostRouting option.

Transparent L7 Proxying in eBPF

Contributed by Joe Stringer

A recurring theme that we hear from Cilium users is around the pain that iptables brings to cluster operators, whether due to the limitations around scalability, the operational complexity for debugging, or when the existing functionality provides a 90% solution but the last 10% is too difficult to achieve. Cilium's datapath is built on eBPF specifically to mitigate these concerns by building a network forwarding layer designed for the needs of the network now, not the needs of the 1990s.

Building upon the eBPF TPROXY work in the upstream kernel earlier this year, the Cilium 1.9 release enables L7 policy natively in eBPF on kernel 5.7 or later. The implementation is now simpler because it reduces the dependence on iptables and the packet mark. Earlier this year at the Linux Plumbers Conference, we discussed how network applications need to coordinate packet mark usage to ensure that the kernel implements the network forwarding behavior that users desire. The work on transparently proxying L7 traffic natively in eBPF provides a path to better integrate Cilium with other CNIs by reducing such packet mark usage. On the journey to a world free of iptables, this is yet another positive step forward.

Performance Testing Framework

Contributed by Kornilios Kourtis and Paul Chaignon

Performance evaluation, and specifically evaluating network performance, is a nuanced process. Small changes in the environment or setup (e.g., hardware, kernel version and build options, system configuration, workload) can result in substantial differences in measured performance. This makes it challenging to evaluate Cilium's performance and causes confusion when seemingly similar setups behave differently. Hence, in addition to the many performance improvements Cilium 1.9 brings, we also developed utilities that enable both developers and users to easily evaluate Cilium network performance in a reproducible way.

Specifically, we developed a benchmark for Kubernetes networking called kubenetbench, as well as a set of tools that cover the whole process of performance evaluation: provisioning machines, installing and configuring software, and executing benchmarks. These tools not only fully automate the evaluation, but also encompass best practices for producing reliable results. Using these tools, we performed an extensive evaluation of Cilium network performance that is now part of our documentation and allows users to easily reproduce our results. Looking forward, we plan to extend our tools with more benchmarks and add this infrastructure to our CI process.

Getting Started

New to Cilium? Follow one of the Getting Started Guides.

Upgrade Instructions

Follow the upgrade guide to upgrade your Cilium deployment. Cilium 1.9 deprecates the use of Helm 2 for managing deployments, as it is reaching end of life on November 13, 2020. Furthermore, this release introduces new option names for most Helm settings. See the guide for more details. Feel free to ping us on Slack if you have any questions.

Release

  • Release Notes: 1.9.0
  • Container images: quay.io/cilium/cilium:v1.9.0, docker.io/cilium/cilium:v1.9.0
