April 2, 2018

Cilium 1.0.0-rc9 - Feature Freeze for 1.0!

We are excited to announce Cilium 1.0.0-rc9 with many, many bugfixes and the delivery of the final feature we were waiting on for 1.0: egress policy enforcement support. With rc9 we are therefore announcing a full feature freeze. From now on we will only merge critical bugfixes, and we will release 1.0 as soon as all release blockers are resolved. More on this below. We are thrilled to have come this far and appreciate the efforts of the wide range of contributors who have helped get us here.

Upgrade Instructions

No special upgrade instructions are required for this release. Please follow our simple upgrade guide for generic instructions on how to upgrade.

Highlights

As usual, the full release notes are attached at the end of this post and can also be found on the 1.0.0-rc9 release page. Most of the work in this release went into bugfixes and testing. Here are some of the highlights:

Egress Policy Enforcement capability

Cilium's standard policy enforcement mechanism is identity-based. In this model, the identity of the sending endpoint is carried with every packet, and the receiving side enforces whether that identity is allowed to communicate with the destination endpoint. Cilium falls back to IP/CIDR-based enforcement only when absolutely required, i.e. when it is not in control of the sender.

With this release, we complete egress policy enforcement by adding label- and entity-based enforcement on top of the existing IP/CIDR-based egress enforcement.

A few simple egress examples

The following example is tailored for Kubernetes and shows how to enable default deny at egress for all role=frontend pods and then explicitly whitelist the connection to role=backend on port TCP/80:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "Allow egress TCP/80 from frontend to backend"
metadata:
  name: "egress-rule"
spec:
  endpointSelector:
    matchLabels:
      role: frontend
  egress:
  - toEndpoints:
    - matchLabels:
        role: backend
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
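How the selectors and the default deny interact can be sketched in a few lines of Go. This is a simplified model under stated assumptions, not Cilium's policy engine: `matchLabels` is treated as "every listed label must match exactly", an endpoint that no egress rule selects is left unrestricted, and an empty `toPorts` list means any port. Type and function names such as `AllowEgress` are hypothetical.

```go
package main

import "fmt"

// Labels are the key/value labels attached to a pod, e.g. role=frontend.
type Labels map[string]string

// Selector models matchLabels: every listed label must be present on the
// endpoint with exactly the same value.
type Selector map[string]string

func (s Selector) Matches(l Labels) bool {
	for k, v := range s {
		if got, ok := l[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// PortProtocol mirrors one toPorts.ports entry.
type PortProtocol struct {
	Port     string
	Protocol string
}

// EgressRule is a simplified model of the egress-rule policy above.
type EgressRule struct {
	EndpointSelector Selector
	ToEndpoints      Selector
	ToPorts          []PortProtocol // empty means "any port" in this sketch
}

// AllowEgress applies default deny: an endpoint that no egress rule selects
// is unrestricted, while a selected endpoint may only send what a rule allows.
func AllowEgress(rules []EgressRule, src, dst Labels, port, proto string) bool {
	selected := false
	for _, r := range rules {
		if !r.EndpointSelector.Matches(src) {
			continue
		}
		selected = true
		if !r.ToEndpoints.Matches(dst) {
			continue
		}
		if len(r.ToPorts) == 0 {
			return true
		}
		for _, pp := range r.ToPorts {
			if pp.Port == port && pp.Protocol == proto {
				return true
			}
		}
	}
	return !selected
}

func main() {
	rules := []EgressRule{{
		EndpointSelector: Selector{"role": "frontend"},
		ToEndpoints:      Selector{"role": "backend"},
		ToPorts:          []PortProtocol{{Port: "80", Protocol: "TCP"}},
	}}
	frontend := Labels{"role": "frontend", "app": "web"}
	backend := Labels{"role": "backend"}
	db := Labels{"role": "db"}

	fmt.Println(AllowEgress(rules, frontend, backend, "80", "TCP"))  // true: whitelisted
	fmt.Println(AllowEgress(rules, frontend, backend, "443", "TCP")) // false: default deny
	fmt.Println(AllowEgress(rules, frontend, db, "80", "TCP"))       // false: default deny
	fmt.Println(AllowEgress(rules, backend, db, "5432", "TCP"))      // true: backend is not selected
}
```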

Egress enforcement also applies to L7-aware policies. The following example whitelists HTTP POST requests to /metric on TCP port 8080 from pods labeled app=myService to the local host they run on:

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
description: "Allow HTTP POST /metric from myService to local host"
metadata:
  name: "rule1"
spec:
  endpointSelector:
    matchLabels:
      app: myService
  egress:
  - toEntities:
    - host
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: "POST"
          path: "/metric$"
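The L7 rule adds an HTTP check on top of the entity and port match. The Go sketch below models that layering; `L7EgressRule`, `HTTPRule` and `NewHTTPRule` are hypothetical names, and treating the `path` value as a regular expression that must match the whole request path is an assumption of this sketch, not a statement about how Cilium's proxy evaluates it.

```go
package main

import (
	"fmt"
	"regexp"
)

// HTTPRule mirrors one rules.http entry: an exact method plus a path regex.
type HTTPRule struct {
	Method string
	Path   *regexp.Regexp
}

// NewHTTPRule compiles the path expression. Assumption for this sketch: the
// expression must match the whole request path, so "/metric$" only matches "/metric".
func NewHTTPRule(method, path string) HTTPRule {
	return HTTPRule{Method: method, Path: regexp.MustCompile("^(?:" + path + ")$")}
}

// L7EgressRule models rule1: an L3 part (entity), an L4 part (port) and an
// L7 part (HTTP rules).
type L7EgressRule struct {
	ToEntity string
	Port     uint16
	HTTP     []HTTPRule
}

// Allows reports whether a request passes. If the L3/L4 part does not match,
// the request is not covered by this rule at all; if only the HTTP part
// fails, the proxy rejects the request with an HTTP 403.
func (r L7EgressRule) Allows(entity string, port uint16, method, path string) bool {
	if entity != r.ToEntity || port != r.Port {
		return false
	}
	for _, h := range r.HTTP {
		if h.Method == method && h.Path.MatchString(path) {
			return true
		}
	}
	return false
}

func main() {
	rule := L7EgressRule{
		ToEntity: "host",
		Port:     8080,
		HTTP:     []HTTPRule{NewHTTPRule("POST", "/metric$")},
	}
	fmt.Println(rule.Allows("host", 8080, "POST", "/metric"))  // true
	fmt.Println(rule.Allows("host", 8080, "GET", "/metric"))   // false
	fmt.Println(rule.Allows("host", 8080, "POST", "/metrics")) // false
}
```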

Configurable 403 HTTP access denied messages

The ability to specify the text returned in 403 HTTP responses is obviously a critical enterprise-grade feature, as explained in a separate blog post.

No further explanation required.

Scale Improvements

We have run a series of scale and stress tests, which led us to adjust several default limits and to make improvements that affect scalability:

  • Several upper limits for the BPF maps that hold connection state have been increased. We will likely make these limits configurable and derive better defaults from the available system memory, as a reasonable estimate of the expected network load.

  • A new expedited garbage collector mode identifies connections that have never been established (no complete SYN-ACK handshake observed) and removes these incomplete connections from the state tables much more aggressively. This strikes a good balance: long-lived TCP connections can stay in the state tables for days without seeing any traffic, while entries created by connection-attempt floods, or by services such as Cassandra that retry very aggressively, are cleaned up quickly.

  • We have started enabling TCP keepalive for all proxied connections to get a better picture of the health of long-lived connections with minimal traffic, such as TCP connections used for health checking.

Known issues before 1.0

We have a couple of issues that we are tracking and fixing before releasing 1.0. If you run into any issues, please check the list of 1.0 blocker bugs first.

Release Notes

Major Changes

  • envoy: Make 403 message configurable. (3430, @jrajahalme)
  • Add support for label-dependent L4 egress policy (3372, @ianvernon)

Bugfix Changes

  • Fix entity dependent L4 enforcement (3451, @tgraf)
  • cli: Fix cilium bpf policy get (3446, @tgraf)
  • Fix CIDR ingress lookup (3406, @joestringer)
  • xds: Handle NACKs of initial versions of resources (3405, @rlenglet)
  • datapath: fix egress to world entity traffic, add e2e test (3386, @ianvernon)
  • bug: Fix panic in health server logs if /healthz didn't respond before checking status (3378, @nebril)
  • pkg/policy: remove fromEntities and toEntities from rule type (3375, @ianvernon)
  • Fix IPv4 CIDR lookup on older kernels (3366, @joestringer)
  • Fix egress CIDR policy enforcement (3348, @tgraf)
  • envoy: Fix concurrency issues in Cilium xDS server (3341, @rlenglet)
  • Fix bug where policies associated with stale identities remain in BPF policy maps, which could lead to "Argument list too long" errors while regenerating endpoints (3321, @joestringer)
  • Update CI and docs : kafka zookeeper connection timeout to 20 sec (3308, @manalibhutiyani)
  • Reject CiliumNetworkPolicy rules which do not have EndpointSelector field (3275, @ianvernon)
  • Envoy: delete proxymap on connection close (3271, @jrajahalme)
  • Fix nested cmdref links in documentation (3265, @joestringer)
  • completion: Fix race condition that can cause panic (3256, @rlenglet)
  • Additional NetworkPolicy tests and egress wildcard fix (3246, @tgraf)
  • Add timeout for getting etcd session (3228, @nebril)
  • conntrack: Cleanup egress entries and distinguish redirects per endpoint (3221, @rlenglet)
  • Silence warnings during endpoint restore (3216, @tgraf)
  • Fix MTU connectivity issue with external services (3205, @joestringer)
  • endpoint: Don't fail with fatal on l4 policy application (3199, @tgraf)
  • Add new Kafka Role to the docs (3186, @manalibhutiyani)
  • Fix log records for Kafka responses (3127, @tgraf)

Other Changes

  • Refactor /endpoint/{id}/config for API 1.0 stability (3448, @tgraf)
  • envoy: Add host identity (nphds) gRPC client (3407, @jrajahalme)
  • Increase capacity of BPF maps (3391, @tgraf)
  • daemon: Merge Envoy logs with cilium logs by default. (3364, @jrajahalme)
  • docs: Fix the Kafka policy to use the new role in the GSG (3350, @manalibhutiyani)
  • CI / GSG : make Kafka service headless (3320, @manalibhutiyani)
  • Use alpine as base image for Docs container (3301, @iamShantanu101)
  • Update kafka zookeeper session timeout to 20 sec in CI tests and docs (3298, @manalibhutiyani)
  • Support access log from sidecar and per-endpoint redirect stats (3278, @rlenglet)
  • Improve sanity checking in endpoint PATCH API (3274, @joestringer)
  • Update Kafka GSG policy and docs to use the new "roles" (3269, @manalibhutiyani)
  • maps: allow for migration when map properties change (3267, @borkmann)
  • bpf: Retire CT entries quickly for unreplied connections (3238, @joestringer)
  • CMD: Add json output on endpoint config (3234, @eloycoto)
  • Plumb the contents of the ip-identity cache to a BPF map for lookup in the datapath. (3037, @ianvernon)

Release binaries

As usual, let us know on Slack if you have any questions.