Kubernetes v1.33: User Namespaces enabled by default!
In Kubernetes v1.33 support for user namespaces is enabled by default. This means that, when the stack requirements are met, pods can opt-in to use user namespaces. To use the feature there is no need to enable any Kubernetes feature flag anymore!
In this blog post we answer some common questions about user namespaces. But, before we dive into that, let's recap what user namespaces are and why they are important.
What is a user namespace?
Note: Linux user namespaces are a different concept from Kubernetes namespaces. The former is a Linux kernel feature; the latter is a Kubernetes feature.
Linux provides different namespaces to isolate processes from each other. For example, a typical Kubernetes pod runs within a network namespace to isolate the network identity and a PID namespace to isolate the processes.
One Linux namespace that was left behind is the user namespace. It isolates the UIDs and GIDs of the containers from the ones on the host. The identifiers in a container can be mapped to identifiers on the host in a way where host and container(s) never end up in overlapping UID/GIDs. Furthermore, the identifiers can be mapped to unprivileged, non-overlapping UIDs and GIDs on the host. This brings three key benefits:
Prevention of lateral movement: As the UIDs and GIDs for different containers are mapped to different UIDs and GIDs on the host, containers have a harder time attacking each other, even if they escape the container boundaries. For example, suppose container A runs with different UIDs and GIDs on the host than container B. In that case, the operations it can do on container B's files and processes are limited: only read/write what a file allows to others, as it will never have permission owner or group permission (the UIDs/GIDs on the host are guaranteed to be different for different containers).
Increased host isolation: As the UIDs and GIDs are mapped to unprivileged users on the host, if a container escapes the container boundaries, even if it runs as root inside the container, it has no privileges on the host. This greatly protects what host files it can read/write, which process it can send signals to, etc. Furthermore, capabilities granted are only valid inside the user namespace and not on the host, limiting the impact a container escape can have.
Enablement of new use cases: User namespaces allow containers to gain certain capabilities inside their own user namespace without affecting the host. This unlocks new possibilities, such as running applications that require privileged operations without granting full root access on the host. This is particularly useful for running nested containers.
User namespace IDs allocation
If a pod running as the root user without a user namespace manages to breakout, it has root privileges on the node. If some capabilities were granted to the container, the capabilities are valid on the host too. None of this is true when using user namespaces (modulo bugs, of course 🙂).
Demos
Rodrigo created demos to understand how some CVEs are mitigated when user namespaces are used. We showed them here before (see here and here), but take a look if you haven't:
Mitigation of CVE 2024-21626 with user namespaces:
Mitigation of CVE 2022-0492 with user namespaces:
Everything you wanted to know about user namespaces in Kubernetes
Here we try to answer some of the questions we have been asked about user namespaces support in Kubernetes.
1. What are the requirements to use it?
The requirements are documented here. But we will elaborate a bit more, in the following questions.
Note this is a Linux-only feature.
2. How do I configure a pod to opt-in?
A complete step-by-step guide is available here. But the short
version is you need to set the hostUsers: false
field in the pod spec. For
example like this:
apiVersion: v1
kind: Pod
metadata:
name: userns
spec:
hostUsers: false
containers:
- name: shell
command: ["sleep", "infinity"]
image: debian
Yes, it is that simple. Applications will run just fine, without any other changes needed (unless your application needs the privileges).
User namespaces allows you to run as root inside the container, but not have privileges in the host. However, if your application needs the privileges on the host, for example an app that needs to load a kernel module, then you can't use user namespaces.
3. What are idmap mounts and why the file-systems used need to support it?
Idmap mounts are a Linux kernel feature that uses a mapping of UIDs/GIDs when accessing a mount. When combined with user namespaces, it greatly simplifies the support for volumes, as you can forget about the host UIDs/GIDs the user namespace is using.
In particular, thanks to idmap mounts we can:
- Run each pod with different UIDs/GIDs on the host. This is key for the lateral movement prevention we mentioned earlier.
- Share volumes with pods that don't use user namespaces.
- Enable/disable user namespaces without needing to chown the pod's volumes.
Support for idmap mounts in the kernel is per file-system and different kernel releases added support for idmap mounts on different file-systems.
To find which kernel version added support for each file-system, you can check
out the mount_setattr
man page, or the online version of it
here.
Most popular file-systems are supported, the notable absence that isn't supported yet is NFS.
4. Can you clarify exactly which file-systems need to support idmap mounts?
The file-systems that need to support idmap mounts are all the file-systems used
by a pod in the pod.spec.volumes
field.
This means: for PV/PVC volumes, the file-system used in the PV needs to support idmap mounts; for hostPath volumes, the file-system used in the hostPath needs to support idmap mounts.
What does this mean for secrets/configmaps/projected/downwardAPI volumes? For
these volumes, the kubelet creates a tmpfs
file-system. So, you will need a
6.3 kernel to use these volumes (note that if you use them as env variables it
is fine).
And what about emptyDir volumes? Those volumes are created by the kubelet by
default in /var/lib/kubelet/pods/
. You can also use a custom directory for
this. But what needs to support idmap mounts is the file-system used in that
directory.
The kubelet creates some more files for the container, like /etc/hostname
,
/etc/resolv.conf
, /dev/termination-log
, /etc/hosts
, etc. These files are
also created in /var/lib/kubelet/pods/
by default, so it's important for the
file-system used in that directory to support idmap mounts.
Also, some container runtimes may put some of these ephemeral volumes inside a
tmpfs
file-system, in which case you will need support for idmap mounts in
tmpfs
.
5. Can I use a kernel older than 6.3?
Yes, but you will need to make sure you are not using a tmpfs
file-system. If
you avoid that, you can easily use 5.19 (if all the other file-systems you use
support idmap mounts in that kernel).
It can be tricky to avoid using tmpfs
, though, as we just described above.
Besides having to avoid those volume types, you will also have to avoid mounting the
service account token. Every pod has it mounted by default, and it uses a
projected volume that, as we mentioned, uses a tmpfs
file-system.
You could even go lower than 5.19, all the way to 5.12. However, your container rootfs probably uses an overlayfs file-system, and support for overlayfs was added in 5.19. We wouldn't recommend to use a kernel older than 5.19, as not being able to use idmap mounts for the rootfs is a big limitation. If you absolutely need to, you can check this blog post Rodrigo wrote some years ago, about tricks to use user namespaces when you can't support idmap mounts on the rootfs.
6. If my stack supports user namespaces, do I need to configure anything else?
No, if your stack supports it and you are using Kubernetes v1.33, there is nothing you need to configure. You should be able to follow the task: Use a user namespace with a pod.
However, in case you have specific requirements, you may configure various options. You can find more information here. You can also enable a feature gate to relax the PSS rules.
7. The demos are nice, but are there more CVEs that this mitigates?
Yes, quite a lot, actually! Besides the ones in the demo, the KEP has more CVEs you can check. That list is not exhaustive, there are many more.
8. Can you sum up why user namespaces is important?
Think about running a process as root, maybe even an untrusted process. Do you think that is secure? What if we limit it by adding seccomp and apparmor, mask some files in /proc (so it can't crash the node, etc.) and some more tweaks?
Wouldn't it be better if we don't give it privileges in the first place, instead of trying to play whack-a-mole with all the possible ways root can escape?
This is what user namespaces does, plus some other goodies:
Run as an unprivileged user on the host without making changes to your application. Greg and Vinayak did a great talk on the pains you can face when trying to run unprivileged without user namespaces. The pains part starts in this minute.
All pods run with different UIDs/GIDs, we significantly improve the lateral movement. This is guaranteed with user namespaces (the kubelet chooses it for you). In the same talk, Greg and Vinayak show that to achieve the same without user namespaces, they went through a quite complex custom solution. This part starts in this minute.
The capabilities granted are only granted inside the user namespace. That means that if a pod breaks out of the container, they are not valid on the host. We can't provide that without user namespaces.
It enables new use-cases in a secure way. You can run docker in docker, unprivileged container builds, Kubernetes inside Kubernetes, etc all in a secure way. Most of the previous solutions to do this required privileged containers or putting the node at a high risk of compromise.
9. Is there container runtime documentation for user namespaces?
Yes, we have containerd
documentation.
This explains different limitations of containerd 1.7 and how to use
user namespaces in containerd without Kubernetes pods (using ctr
). Note that
if you use containerd, you need containerd 2.0 or higher to use user namespaces
with Kubernetes.
CRI-O doesn't have special documentation for user namespaces, it works out of the box.
10. What about the other container runtimes?
No other container runtime that we are aware of supports user namespaces with Kubernetes. That sadly includes cri-dockerd too.
11. I'd like to learn more about it, what would you recommend?
Rodrigo did an introduction to user namespaces at KubeCon 2022:
- Run As “Root”, Not Root: User Namespaces In K8s- Marga Manterola, Isovalent & Rodrigo Campos Catelin
Also, this aforementioned presentation at KubeCon 2023 can be useful as a motivation for user namespaces:
Bear in mind the presentation are some years old, some things have changed since then. Use the Kubernetes documentation as the source of truth.
If you would like to learn more about the low-level details of user namespaces,
you can check man 7 user_namespaces
and man 1 unshare
. You can easily create
namespaces and experiment with how they behave. Be aware that the unshare
tool
has a lot of flexibility, and with that options to create incomplete setups.
If you would like to know more about idmap mounts, you can check its Linux kernel documentation.
Conclusions
Running pods as root is not ideal and running them as non-root is also hard with containers, as it can require a lot of changes to the applications. User namespaces are a unique feature to let you have the best of both worlds: run as non-root, without any changes to your application.
This post covered: what are user namespaces, why they are important, some real world examples of CVEs mitigated by user-namespaces, and some common questions. Hopefully, this post helped you to eliminate the last doubts you had and you will now try user-namespaces (if you didn't already!).
How do I get involved?
You can reach SIG Node by several means:
You can also contact us directly:
- GitHub: @rata @giuseppe @saschagrunert
- Slack: @rata @giuseppe @sascha