06-04-2026

What A Linux Container Actually Is

There is no such thing as a container.

That sentence annoys a lot of people the first time they hear it, especially people who have spent years writing Dockerfiles, troubleshooting Kubernetes pods, or arguing about whether Podman or containerd is the better runtime. But it is the most honest way to introduce the subject. The Linux kernel does not have a "container" concept. There is no container syscall. There is no kernel subsystem called containers. What there are, instead, are a handful of relatively independent kernel features that, when composed carefully, produce something that feels exactly like a container, and can be treated as one by higher-level tools.

A container is a process (or a group of processes) that the kernel has been asked to lie to in very specific ways. It is told that it has its own root filesystem, its own list of running processes, its own network interfaces, its own user and group IDs, its own hostname, its own mount tree, and its own view of what CPU and memory resources exist. Every one of these lies is implemented by a different kernel feature. Compose them all, pick a root filesystem, drop some capabilities, and you have what Docker calls a container and what Kubernetes schedules, one or more at a time, as a pod.

This article takes that assembly apart. We will walk through namespaces, cgroups, capabilities, pivot_root, seccomp, and the smaller pieces like the clone3 syscall and OverlayFS, and we will build up a working mental model of what every container runtime is doing when it launches a process. By the end you should be able to write a working container from a shell script in about forty lines, and know exactly why each line is there.

The Problem Containers Solve

Before the mechanism, the motivation. For decades, if you wanted to run a server application in isolation from everything else on the host, you had two choices. You could use chroot, which gave you an isolated filesystem and almost nothing else: the chrooted process could still see every other process on the system, share the same network stack, and use the same user database. Or you could use a virtual machine: a full operating system inside a hypervisor, with hardware-emulated everything. Chroot was cheap and leaky. Virtual machines were strong and expensive.

The gap between those two options was very wide. Running a thousand lightweight services, each on its own isolated Linux, used to mean running a thousand VMs. Each one carried a full kernel, a full boot process, a full init, its own RAM and CPU overhead. Companies like Google ran into this wall early: they had enormous fleets of servers and wanted fine-grained, low-overhead isolation between workloads that all ran on the same Linux host.

The isolation primitives arrived in the kernel slowly, and from several contributors. Mount namespaces came first, in 2002 with kernel 2.4.19. UTS and IPC namespaces followed in 2006 with kernel 2.6.19. PID namespaces landed in kernel 2.6.24 in early 2008, the same release that merged cgroups (contributed by Google engineers) and the first network namespace code. User namespaces, the hardest one, became usable in 2013 with kernel 3.8. By the time Docker wrapped everything up in a user-friendly CLI in 2013, the kernel had already quietly grown all of the machinery needed to pretend a process lived alone on a Linux box. Docker's great contribution was not inventing containers. It was packaging an existing set of kernel features into a workflow anyone could learn in an afternoon.

Namespaces: The Lies You Tell A Process

A namespace is the kernel's mechanism for giving one process a different view of some global resource than other processes have. If your process is in its own PID namespace, the only processes it can see via /proc are those that also share that PID namespace. If your process is in its own network namespace, the network interfaces it can see are those that belong to that network namespace. The rest of the system is still there, but your process is not allowed to know about it.

There are eight namespaces on current Linux kernels. Each one isolates a specific resource.

Mount (mnt): isolates the list of mounted filesystems. A mount in one namespace is invisible in another.
PID: isolates the process ID number space. The first process in a new PID namespace gets PID 1.
Network (net): isolates network interfaces, routing tables, iptables rules, socket tables, and sysctl networking parameters.
IPC: isolates System V IPC objects (message queues, semaphores, shared memory) and POSIX message queues.
UTS: isolates the hostname and domain name.
User: isolates user and group IDs, allowing a process to be root inside its namespace without being root outside it.
Cgroup: isolates the view of the cgroup hierarchy visible through /proc.
Time: isolates monotonic and boot clock offsets so containers can be migrated between hosts without the monotonic clock jumping.

Every Linux process belongs to exactly one namespace of each type. By default, every process is in the host's namespaces. New namespaces are created with unshare or clone, and once created they can be joined by other processes with setns.

You can inspect which namespaces a process is in by looking at /proc/$PID/ns:

ls -l /proc/self/ns
# lrwxrwxrwx 1 user user 0 Apr  6 10:00 cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 user user 0 Apr  6 10:00 ipc    -> 'ipc:[4026531839]'
# lrwxrwxrwx 1 user user 0 Apr  6 10:00 mnt    -> 'mnt:[4026531840]'
# lrwxrwxrwx 1 user user 0 Apr  6 10:00 net    -> 'net:[4026531992]'
# lrwxrwxrwx 1 user user 0 Apr  6 10:00 pid    -> 'pid:[4026531836]'
# lrwxrwxrwx 1 user user 0 Apr  6 10:00 time   -> 'time:[4026531834]'
# lrwxrwxrwx 1 user user 0 Apr  6 10:00 user   -> 'user:[4026531837]'
# lrwxrwxrwx 1 user user 0 Apr  6 10:00 uts    -> 'uts:[4026531838]'

The number in brackets is the namespace's unique ID. Every process on your host that shows the same ID for, say, mnt, is sharing the same mount namespace. Every process on your host that shows a different ID for mnt is living in a different mount world.
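
Comparing those IDs is easy to script. The helper below is a minimal sketch (same_ns is a name made up for this article, not a standard tool) that reads the symlinks with readlink and reports whether two processes share a namespace of a given type. It assumes you are allowed to read both /proc entries, which generally means the processes are yours or you are root.

```shell
# same_ns PID_A PID_B TYPE: succeed if both processes are in the same
# namespace of the given type (mnt, pid, net, ...). Returns 2 if either
# link cannot be read.
same_ns() {
    a=$(readlink "/proc/$1/ns/$3") || return 2
    b=$(readlink "/proc/$2/ns/$3") || return 2
    [ "$a" = "$b" ]
}

# The current shell and its parent normally share a mount namespace.
if same_ns "$$" "$PPID" mnt; then
    echo "shell and parent share a mount namespace"
else
    echo "different mount namespaces (or parent not readable)"
fi
```

Pointing it at the PID of a containerised process and at your own shell should report a difference for every namespace type the container runtime unshared.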

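Two processes in the same PID namespace can see each other in ps. A process cannot see anything in a sibling or ancestor PID namespace, although a process in an ancestor namespace (the host, for example) can still see its descendants' processes, under different PIDs. This is not a permission check dressed up in a UI: the kernel does not even expose the hidden processes to /proc readers. From inside, they do not exist.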

Creating Namespaces From A Shell

You do not need Docker to explore this. unshare is a command-line tool in util-linux that runs a program in freshly created namespaces. A minimal demonstration:

# Run bash in a new PID and mount namespace
sudo unshare --pid --mount --fork --mount-proc bash
 
# Inside that shell, look at processes:
ps -ef
# UID        PID  PPID  C STIME TTY          TIME CMD
# root         1     0  0 10:10 pts/0    00:00:00 bash
# root         5     1  0 10:11 pts/0    00:00:00 ps -ef

Inside the new namespace, bash thinks it is PID 1. Its actual PID on the host is different; you can see that from another terminal by grepping for bash in ps on the host. But within its PID namespace, bash is the first process, and it sees only itself and its children. When PID 1 dies in a PID namespace, the kernel sends SIGKILL to every other process in that namespace, and no new processes can be created in it. That is the namespace-level counterpart of losing init: on the host itself, the death of PID 1 panics the kernel.

--fork is necessary because the unshare syscall does not move the calling process into the new PID namespace: only children created afterwards live in it. With --fork, the unshare tool forks after creating the namespace, and that child runs bash as PID 1. --mount-proc tells unshare to mount a fresh /proc inside the new mount namespace, so that our ps -ef shows only the new namespace's processes. Without it we would see the host's processes via the host's /proc.

Let us do the same for networking:

sudo unshare --net bash
 
# Inside the new net namespace:
ip addr
# 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN
#     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 
ping 8.8.8.8
# ping: connect: Network is unreachable

The new network namespace has exactly one network interface: a loopback, and even that one is down. No ethernet interfaces, no IP addresses, no routes. Leaving this namespace and coming back to the host restores your normal network. Containers that want network access wire up a virtual ethernet pair (veth) with one end in the container's net namespace and the other end on a bridge on the host. That is what the docker0 bridge does.
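
The veth wiring can be sketched with iproute2. The function below is an illustration under stated assumptions, not what Docker literally runs: the namespace name, interface names, and the 10.200.0.0/24 addresses are invented, it uses a named namespace created with ip netns rather than unshare, and it connects the pair directly to the host instead of attaching the host end to a bridge. It is a dry run by default, printing each command; set RUN=sudo to execute them.

```shell
# veth_plan NETNS: wire a veth pair between the host and a named network
# namespace. Dry run by default: each command is printed, not executed.
# Set RUN=sudo to run it for real. All names and addresses are made up.
veth_plan() {
    ns=$1
    run=${RUN:-echo}
    $run ip netns add "$ns"
    $run ip link add veth-host type veth peer name veth-ctr
    $run ip link set veth-ctr netns "$ns"
    $run ip addr add 10.200.0.1/24 dev veth-host
    $run ip link set veth-host up
    $run ip netns exec "$ns" ip addr add 10.200.0.2/24 dev veth-ctr
    $run ip netns exec "$ns" ip link set veth-ctr up
    $run ip netns exec "$ns" ip link set lo up
}

veth_plan demo
```

After a real run, ip netns exec demo ping 10.200.0.1 should reach the host end of the pair. Reaching the outside world would additionally need forwarding and NAT on the host, which is the part a bridge-based setup automates.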

The User Namespace: Root Without Rootness

The user namespace is the most conceptually important of the eight, and the most recently matured. Its job is to let a process be UID 0 inside its namespace while being a different (often unprivileged) UID on the host. This unlocks rootless containers: Podman can run a container where the "root" user inside the container is actually just your regular unprivileged user on the host, with no CAP_SYS_ADMIN, no ability to do anything the host considers dangerous.

The mechanism is a UID mapping. When you create a user namespace, you specify a table that says "UID 0 inside maps to UID 1000 outside", "UID 1 inside maps to UID 100001 outside", and so on. The mapping is set by writing to /proc/<pid>/uid_map (and gid_map for groups), typically from the parent process, and it can be written only once; reading the same file shows the mapping in effect. The kernel then translates UIDs on every permission check: a file owned by UID 100001 on disk appears to be owned by UID 1 inside the namespace.

The reason this works safely is that nothing outside the namespace trusts the container's UID 0. If the container's root tries to open a file owned by host-root, the kernel translates the inside UID 0 to its host UID (say 1000), compares that to the file's owner (host UID 0), sees no match, and applies the ordinary permission bits as it would for any unprivileged user. The container's root is privileged only over files whose owners fall within the mapping range. Try an operation that requires CAP_SYS_ADMIN in the initial user namespace, and the kernel checks whether you hold the capability in the initial namespace (you do not, because you are unprivileged on the host), and denies it.
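
The translation itself is simple range arithmetic, and it can be sketched outside the kernel. The helper below (map_uid is a hypothetical name) takes an inside UID and a mapping in the uid_map file format, where each line is "first inside UID, first outside UID, count", and prints the host UID. The two-line map is the example from the text above, not a mapping any particular tool is guaranteed to produce.

```shell
# map_uid INSIDE_UID MAP: print the host UID that INSIDE_UID maps to,
# or fail (exit 1, no output) if the UID is not covered by the map.
# MAP holds lines in the uid_map format:
#   <first inside UID> <first outside UID> <count>
map_uid() {
    printf '%s\n' "$2" | awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { exit found ? 0 : 1 }'
}

map='0 1000 1
1 100001 65535'
map_uid 0 "$map"     # 1000
map_uid 1 "$map"     # 100001
```

To run it against a live namespace, pass the real table instead: map_uid 0 "$(cat /proc/self/uid_map)". A UID outside every range has no host identity at all, which is why such files show up as owned by the overflow user (usually nobody) inside the namespace.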

User namespaces let unprivileged users do things that used to require root, in a controlled way. You can create a new user namespace as a regular user, and inside that namespace you can then create a new mount namespace, a new PID namespace, a new network namespace, because inside your user namespace you hold CAP_SYS_ADMIN (relative to that user namespace). You cannot touch host resources, but you can set up a sandbox.

Podman, Bubblewrap (the Flatpak sandboxing tool), Firejail, and unprivileged LXC all use this pattern. On a typical Linux laptop in Madrid or Helsinki, every Flatpak app you launch is running inside a user namespace plus a mount namespace plus a seccomp filter, and it has zero host privileges even though it thinks it owns /.

You can try user namespaces from the shell:

unshare --user --map-root-user bash
 
# Inside:
id
# uid=0(root) gid=0(root) groups=0(root)
whoami
# root
touch /etc/hosts  # permission denied: we are still unprivileged outside

The --map-root-user flag sets up a 1-entry mapping where host UID (you) becomes UID 0 inside. Inside, id reports root. Outside, nothing has changed: you cannot modify files owned by host-root. All the user namespace did was tell the kernel "for permission checks inside this namespace, treat our inside UID 0 as the new root".

A Word On Unprivileged User Namespaces

User namespaces unlocked rootless containers, but their history is not entirely comfortable. Because an unprivileged user gets CAP_SYS_ADMIN inside a user namespace, that user can reach syscalls that had historically only been accessible to root. Most of those syscalls had never been battle-tested against hostile unprivileged callers before, and a steady trickle of kernel bugs in code paths like the filesystem mounting logic, the keyring subsystem, and the BPF verifier turned out to be reachable and exploitable through user namespaces.

The upshot is that several distributions restrict unprivileged user namespace creation. Debian's kernels historically had a sysctl kernel.unprivileged_userns_clone that defaulted to 0, requiring admins to opt in. Ubuntu 23.10 and later default to requiring AppArmor confinement on unprivileged user namespaces: they are allowed, but with LSM filters active to block the riskier code paths. Red Hat allows them but relies on SELinux. These trade-offs are why Podman on a locked-down RHEL host and Podman on a laptop behave differently in subtle ways: the kernel exposes a different surface area based on policy.

The lesson is not "user namespaces are unsafe". They are the foundation of every modern sandbox on Linux, including the one your Flatpak apps in Lisbon or GNOME Software in Vienna run inside. The lesson is that security features are a moving target, and container isolation is the sum of many pieces, not one.

Cgroups: Bounding What A Process Can Do

Namespaces give you isolation. They do not give you limits. A process in its own PID namespace can still allocate every byte of RAM and consume every CPU cycle on the host, because it is just a regular process from the scheduler's point of view. For limits you need cgroups.

Control groups (cgroups) are a kernel subsystem that lets you assign processes to named groups and then apply resource controllers to those groups. The two versions of cgroups are called v1 and v2. cgroup v2 is the modern, unified version: one hierarchy, one mount, all controllers plug into the same tree. cgroup v1 is the legacy version, with separate hierarchies per controller, and is gradually being retired. Modern distributions (Fedora, Debian 11+, Ubuntu 22.04+) default to v2. We will focus on v2.

cgroup v2 is mounted at /sys/fs/cgroup as a filesystem where directories are groups. You create a new group by creating a directory. You add a process to a group by writing its PID to cgroup.procs. You set a limit by writing a value to a controller file.

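sudo mkdir /sys/fs/cgroup/demo
# (if memory.max or cpu.max does not appear, enable the controllers first:
#  echo "+cpu +memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control)
echo 100M | sudo tee /sys/fs/cgroup/demo/memory.max
echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max
# (cpu.max takes "quota period" in microseconds: 50ms out of every 100ms = half a CPU)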
 
echo $$ | sudo tee /sys/fs/cgroup/demo/cgroup.procs
# now this shell and everything it spawns is capped

Every process the shell spawns inherits the cgroup. Any process that tries to allocate more than 100 MiB of anonymous memory hits the memcg limit, triggers reclaim, and if reclaim cannot satisfy the allocation, gets OOM-killed inside its cgroup. Any process that tries to use more than half a CPU gets throttled by the CFS bandwidth controller.
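
To confirm which group a process landed in, read /proc/<pid>/cgroup. On a cgroup v2 system the relevant entry is the line that starts with "0::", followed by the path relative to /sys/fs/cgroup. The small filter below is a sketch (cgroup2_path is an invented name) that extracts that path from whatever it is given on stdin.

```shell
# cgroup2_path: read the contents of /proc/<pid>/cgroup on stdin and
# print the cgroup v2 path, i.e. the text after the leading "0::".
cgroup2_path() {
    sed -n 's/^0:://p'
}

# Show the current shell's cgroup, if /proc exposes it.
if [ -r /proc/self/cgroup ]; then
    cgroup2_path < /proc/self/cgroup
fi
```

Run from the shell that was written into cgroup.procs above, this should print /demo, and the limits in force are then the files under /sys/fs/cgroup/demo.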

The controllers available in cgroup v2 are:

cpu: weight-based scheduling and CPU time quotas.
memory: RAM caps, swap caps, memory pressure tracking.
io: block IO weights and bandwidth caps.
pids: maximum number of processes in the group (protects against fork bombs).
cpuset: pin processes to specific CPUs and NUMA nodes.
hugetlb: caps on how many huge pages the group can allocate.
rdma: limits on RDMA resources, for RDMA-heavy workloads.
misc: limits on miscellaneous scalar resources that do not warrant a controller of their own (for example AMD SEV address space IDs).

Docker, Podman, and containerd all create a cgroup per container and configure these controllers from the container spec. Kubernetes goes a level further: each pod gets a cgroup, each container in the pod gets a cgroup nested inside the pod's, and the kubelet and the container runtime configure them from the pod's resource requests and limits. Limits become memory.max and cpu.max, while CPU requests become a relative share in cpu.weight. When you ask Kubernetes for a limit of "500m CPU and 256Mi memory", what arrives on the node is a cgroup whose controller files hold those values in kernel units: a quota of 50000 microseconds per 100000-microsecond period, and 268435456 bytes.
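
The unit conversion is worth seeing once. The two functions below are a sketch of the arithmetic as we understand it, with a 100000-microsecond period assumed as the default; the function names are ours, and a real kubelet applies extra rules (minimum quotas, rounding) that are left out here.

```shell
# cpu_max MILLICORES [PERIOD_US]: print the "quota period" line that a
# CPU limit in millicores corresponds to in cpu.max.
cpu_max() {
    period=${2:-100000}
    echo "$(( $1 * period / 1000 )) $period"
}

# mem_max MEBIBYTES: print a memory limit in bytes, as memory.max stores it.
mem_max() {
    echo "$(( $1 * 1024 * 1024 ))"
}

cpu_max 500   # 50000 100000
mem_max 256   # 268435456
```

Reading cpu.max and memory.max inside a running pod's cgroup directory is a quick way to check what limit the node is really enforcing.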

Capabilities: The Pieces Of Root

Traditional UNIX has two levels of privilege: root (UID 0), which can do anything, and everyone else, who cannot. That is too coarse for containers. A container that runs as root inside a namespace should not be allowed to reboot the host, reload kernel modules, or change the system clock, even though those are things "root" can normally do. Linux's answer is capabilities.

A capability is a single, atomic privilege that can be granted or denied independently of the others. The kernel defines around 40 of them. A few examples:

CAP_NET_BIND_SERVICE: bind to ports below 1024.
CAP_NET_RAW: open raw and packet sockets.
CAP_SYS_ADMIN: the famous "kitchen sink" capability. Covers mount, unmount, setting hostname, keyring operations, and dozens of other things.
CAP_SYS_MODULE: insert and remove kernel modules.
CAP_SYS_TIME: set the system clock.
CAP_SYS_CHROOT: call chroot.
CAP_SYS_PTRACE: attach to any process with ptrace.
CAP_DAC_READ_SEARCH: bypass file read permission checks.
CAP_DAC_OVERRIDE: bypass file read and write permission checks.

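A process has five capability sets. The three classic ones are permitted, effective, and inheritable: the permitted set is the ceiling, the effective set is what is actually checked at the moment of a syscall, and the inheritable set is what can be preserved across exec. Two more were added later: the bounding set, which caps what any exec can ever grant, and the ambient set, which lets an unprivileged program keep capabilities across exec. A process can drop capabilities from its sets but cannot add new ones beyond its permitted set.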

Container runtimes drop capabilities aggressively. When Docker launches a container by default, it keeps a short allowlist of capabilities (roughly: CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL, SETGID, SETUID, SETPCAP, NET_BIND_SERVICE, NET_RAW, SYS_CHROOT, MKNOD, AUDIT_WRITE, SETFCAP) and drops everything else. An attacker who compromises a container and escalates to root inside finds that root cannot mount filesystems, cannot modify sysctls, cannot insert modules, cannot set the clock. They have root, yes, but only in the narrow sense of "UID 0 with a tiny capability set, inside a restricted mount namespace" (and, where user namespace remapping is enabled, not even host UID 0). The host is not trivially at their mercy.

You can drop capabilities from a shell with capsh:

sudo capsh --drop=cap_sys_admin,cap_net_admin -- -c 'mount --bind /tmp /mnt'
# mount fails with a permission error

Even as root, the mount fails because capsh removed CAP_SYS_ADMIN from the bounding set before starting the shell, so the shell never receives the capability and cannot reacquire it.
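
To see which capabilities a process actually holds, look at the CapEff line in /proc/<pid>/status. It is a hexadecimal bitmask in which bit N stands for capability number N from linux/capability.h (CAP_CHOWN is 0, CAP_NET_BIND_SERVICE is 10, CAP_SYS_ADMIN is 21). The helper below is a small sketch with an invented name; the sample mask is what the fourteen capabilities in Docker's default list add up to.

```shell
# has_cap HEXMASK BIT: succeed if capability number BIT is set in HEXMASK
# (a value as shown on the CapEff line, without a 0x prefix).
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

mask=00000000a80425fb    # Docker's default list, as a bitmask
has_cap "$mask" 10 && echo "CAP_NET_BIND_SERVICE present"
has_cap "$mask" 21 || echo "CAP_SYS_ADMIN absent"
```

For a live process, the mask can be pulled out with awk '/^CapEff:/ {print $2}' /proc/self/status and passed in as the first argument.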

Seccomp: Filtering Syscalls

Capabilities gate access to classes of kernel features. Seccomp goes one level deeper and lets you filter individual syscalls. A seccomp filter is a BPF program attached to a process that runs on every syscall entry. The filter can decide to allow the syscall, reject it with an errno, kill the process, or trap to userspace for further handling.

Docker ships a default seccomp profile that allows around 300 syscalls and blocks everything else. It blocks things like keyctl, kexec_load, mount, clock_adjtime, open_by_handle_at, clone with certain flags, and dozens of kernel-debug syscalls. The profile is deliberately conservative: the goal is to shrink the kernel's attack surface. If an attacker compromises a container and tries to exploit a kernel bug, the seccomp filter might turn a working exploit into EPERM by denying the syscall it depends on.

The default profile is built into the Docker daemon; its source lives in the moby/moby source tree, and some distributions also ship a copy on disk (for example at /etc/docker/seccomp.json). Writing your own seccomp filters is possible from C (via libseccomp) or directly with BPF. Bubblewrap and Flatpak construct tailored filters per application, which is one reason their sandboxes are considered strong.

A minimal seccomp filter that blocks unshare for a process:

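#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/syscall.h>
#include <sys/prctl.h>
#include <unistd.h>
 
int main(void) {
    /* NOTE: a production filter should also check seccomp_data.arch. */
    struct sock_filter filter[] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* unshare: fall through to the ERRNO return; anything else: skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_unshare, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    /* no_new_privs lets an unprivileged process install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
        perror("prctl");
        return 1;
    }
    execl("/bin/bash", "bash", (char *)NULL);
    perror("execl");
    return 1;
}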

Once this program execs bash, any attempt inside that bash to call unshare returns EPERM. The shell can still run freely. Only that specific syscall is blocked.
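
You can confirm from inside that shell that a filter is attached: the Seccomp line in /proc/<pid>/status reports 0 when seccomp is disabled, 1 for strict mode, and 2 for filter mode. The function below is a minimal sketch (the name is ours) that turns that number into a word.

```shell
# seccomp_mode: read the contents of /proc/<pid>/status on stdin and
# print the seccomp mode as a word (0 = disabled, 1 = strict, 2 = filter).
seccomp_mode() {
    awk '/^Seccomp:/ {
        if ($2 == 0) print "disabled"
        else if ($2 == 1) print "strict"
        else if ($2 == 2) print "filter"
    }'
}

# Report the mode of the current shell, if /proc exposes it.
if [ -r /proc/self/status ]; then
    seccomp_mode < /proc/self/status
fi
```

In the bash started by the C program above this should print filter. The same check inside a Docker container shows whether the default profile is in force.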

Pivot_root: Becoming Your Own Root

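Chroot changes the process's root directory but leaves the old root mounted and reachable. A chrooted process with CAP_SYS_CHROOT can escape by keeping its working directory (or an open file descriptor) outside the new root, chrooting again into a subdirectory, and walking back up with "..". This is the classic "chroot is not a security boundary" lesson. pivot_root solves this. It makes a new mount the root of the whole mount namespace and moves the old root to a directory underneath it, where it can be unmounted entirely, so there is no way back.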

A skeleton of building a container root:

# 1. Create new mount and PID namespaces (run the remaining steps inside this shell)
sudo unshare --mount --pid --fork --mount-proc bash
# 2. Set up a new root directory with essentials
mkdir -p /tmp/newroot/{proc,old_root,bin,lib,lib64,etc}
mount --bind /tmp/newroot /tmp/newroot   # pivot_root needs the target to be a mount
# 3. Populate it: copy busybox (assumed statically linked; otherwise copy its libraries too)
cp /bin/busybox /tmp/newroot/bin/
for cmd in sh ls cat mount umount rmdir; do
    ln -sf /bin/busybox /tmp/newroot/bin/$cmd
done
# 4. Pivot
cd /tmp/newroot
pivot_root . old_root
# 5. Mount /proc for the new namespace (explicit paths: the host's binaries are gone now)
/bin/mount -t proc proc /proc
# 6. Detach the old root
/bin/umount -l /old_root
/bin/rmdir /old_root
# 7. Start the shell
exec /bin/sh

After these seven steps you have a shell that sees nothing from the host except what you copied in. The only mounts are the ones inside the new mount namespace. The process has its own PID namespace, its own mount namespace, and its own root. It still shares the host's user, network, and IPC namespaces, and still has all of your user's capabilities, and still has unrestricted syscall access. To make it a "real" container, you would add --net, a new user namespace with UID mapping, dropped capabilities, and a seccomp filter.

OverlayFS: Layered Images Without The Copies

Container images are layered. A typical Debian-based image has a base layer with the root filesystem, then layers for apt packages, then layers for your application code. When the container runs, all of those layers have to look like a single, coherent filesystem root. And they have to be read-only at the layer level (so the image can be shared between many containers) while appearing writable from inside (so the container can create files).

The mechanism is OverlayFS, a union filesystem built into the Linux kernel. OverlayFS takes a list of lower layers (read-only) and a single upper layer (read-write) and presents a merged view. Reads come from whichever layer has the file. Writes go to the upper layer, copying files up from the lower layers on first modification. Deletions create "whiteout" files in the upper layer that hide the corresponding lower-layer entries.

A minimal overlay mount looks like this:

mkdir -p /tmp/overlay/{lower1,lower2,upper,work,merged}
echo "in lower1" > /tmp/overlay/lower1/a.txt
echo "in lower2" > /tmp/overlay/lower2/b.txt
 
sudo mount -t overlay overlay \
    -o lowerdir=/tmp/overlay/lower2:/tmp/overlay/lower1,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
    /tmp/overlay/merged
 
ls /tmp/overlay/merged   # a.txt  b.txt
echo "hi" >> /tmp/overlay/merged/a.txt
cat /tmp/overlay/upper/a.txt  # contains "in lower1\nhi"
cat /tmp/overlay/lower1/a.txt # unchanged
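
The lookup rule behind that merged view can be approximated without mounting anything. The function below is a deliberately simplified sketch (overlay_lookup is an invented name): it checks the upper layer first, then each lower layer in lowerdir order, and reports the first directory that has the file. Real OverlayFS also honours whiteouts and opaque directories, which this ignores.

```shell
# overlay_lookup NAME UPPER LOWER...: print the layer directory that would
# supply NAME in the merged view. Layers are searched in the order given:
# upper first, then the lower layers from topmost to bottommost.
overlay_lookup() {
    name=$1; shift
    for layer in "$@"; do
        if [ -e "$layer/$name" ] || [ -L "$layer/$name" ]; then
            echo "$layer"
            return 0
        fi
    done
    return 1
}

d=$(mktemp -d)
mkdir "$d/upper" "$d/lower2" "$d/lower1"
echo "in lower1" > "$d/lower1/a.txt"
overlay_lookup a.txt "$d/upper" "$d/lower2" "$d/lower1"   # prints the lower1 path
echo "modified" > "$d/upper/a.txt"                        # what a copy-up leaves behind
overlay_lookup a.txt "$d/upper" "$d/lower2" "$d/lower1"   # now prints the upper path
rm -r "$d"
```

The second call mirrors what happened to a.txt in the mount example: once a modified copy exists in the upper layer, the lower copy is still on disk but is never consulted again.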

Every Docker image is stored as a stack of layers in /var/lib/docker/overlay2, and when you run a container, Docker constructs an overlay mount with the image layers as lowerdir and a per-container upperdir for writes. A hundred containers running the same nginx image share the same lower layers: a hundred different upper directories, one immutable set of read-only layers underneath. The total disk cost is one copy of the image plus a small per-container delta.

Putting It Together: clone3 And The Full Recipe

Every container runtime's source code eventually calls one central syscall: clone or its more modern variant clone3. Clone is a kind of fork that lets the caller specify exactly which resources the child should share with the parent and which it should have fresh copies of. You can ask for a new PID namespace, a new mount namespace, a new network namespace, a new user namespace, and so on, in a single syscall.

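#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
 
/* Stack for the child; clone wants a pointer to its top because stacks grow down. */
static char stack[1024 * 1024];
 
static int child(void *arg) {
    (void)arg;
    printf("child: pid=%d\n", getpid());   /* PID 1 in the new PID namespace */
    if (sethostname("container", 9) != 0)
        perror("sethostname");
    execl("/bin/bash", "bash", (char *)NULL);
    perror("execl");
    return 1;
}
 
int main(void) {
    int flags = CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS
              | CLONE_NEWNET | CLONE_NEWIPC | SIGCHLD;
    pid_t pid = clone(child, stack + sizeof(stack), flags, NULL);
    if (pid == -1) {
        perror("clone");   /* typically EPERM when not run as root */
        return 1;
    }
    waitpid(pid, NULL, 0);
    return 0;
}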

Compile, run as root, and you get a bash running in its own UTS, PID, mount, net, and IPC namespaces. hostname returns "container". ps (with a remounted /proc) shows only the shell. ip addr shows only loopback. All from one clone call.

A fully-featured container runtime then layers on user namespace setup (writing /proc/<pid>/uid_map and gid_map), cgroup assignment (writing the PID to cgroup.procs), capability dropping (capset and prctl), seccomp filter installation, root filesystem pivot, and finally execve. runc, Docker's low-level runtime, is a well-engineered implementation of exactly these steps plus error handling, OCI spec parsing, and OCI hooks.

The OCI Runtime Spec And runc

Once enough companies were using containers, it became clear that everyone was writing the same "talk to the kernel" code, badly, in slightly different ways. The Open Container Initiative (OCI) was announced in June 2015 under the Linux Foundation, with backing from Docker, CoreOS, Red Hat, Google, and others, and it published two specifications that every modern container runtime now follows: the OCI Image Specification and the OCI Runtime Specification.

The Runtime Spec describes what a low-level runtime has to accept as input: a filesystem bundle (a directory with a rootfs and a config.json file) and a lifecycle API (create, start, kill, delete). The config.json file is the single source of truth for everything the runtime needs to know: which namespaces to create, what capabilities to keep, which seccomp profile to install, which cgroup limits to apply, what environment variables to set, which paths to mask. It looks roughly like this, trimmed:

{
  "ociVersion": "1.2.0",
  "process": {
    "terminal": true,
    "user": { "uid": 0, "gid": 0 },
    "args": ["sh"],
    "env": ["PATH=/usr/bin:/bin"],
    "capabilities": {
      "bounding": ["CAP_CHOWN", "CAP_NET_BIND_SERVICE"],
      "effective": ["CAP_CHOWN", "CAP_NET_BIND_SERVICE"],
      "permitted": ["CAP_CHOWN", "CAP_NET_BIND_SERVICE"]
    },
    "noNewPrivileges": true
  },
  "root": { "path": "rootfs", "readonly": false },
  "mounts": [
    { "destination": "/proc", "type": "proc", "source": "proc" }
  ],
  "linux": {
    "namespaces": [
      {"type": "pid"}, {"type": "network"}, {"type": "ipc"},
      {"type": "uts"}, {"type": "mount"}, {"type": "user"}
    ],
    "uidMappings": [{"containerID": 0, "hostID": 100000, "size": 65536}],
    "gidMappings": [{"containerID": 0, "hostID": 100000, "size": 65536}],
    "resources": {
      "memory": {"limit": 134217728},
      "cpu": {"quota": 50000, "period": 100000}
    },
    "seccomp": { "defaultAction": "SCMP_ACT_ERRNO", "syscalls": [ ... ] }
  }
}

runc reads that file and turns it into the exact sequence of clone, unshare, setns, mount, pivot_root, prctl, writes to cgroup files, seccomp filter installation, and finally execve that we have been describing. crun is a rewrite of runc in C with lower memory overhead, and youki is a rewrite in Rust. All three are interchangeable: anywhere Docker, containerd, or Kubernetes use a "runtime", they call an OCI-compliant binary and hand it a config.json and a rootfs.

OCI Images: Tarballs And Manifests

The Image Spec is even simpler. An OCI image is a tarball of layers plus a JSON manifest that says "here are the layers in order, here are their sha256 digests, here is the config". A layer is itself a tarball: the files that were added or modified in that layer, plus whiteout files for anything that was deleted.

When you docker pull debian:12, Docker Hub hands you a manifest that lists, say, three layer digests. Docker fetches each one via a content-addressed blob store, unpacks it into /var/lib/docker/overlay2/<layer-id>/diff, and records the layer in its metadata. Running the image means stacking those directories as OverlayFS lower layers and mounting an upper layer on top, then pointing a runtime at the merged view. That is the whole image format. Everything else (tags, signatures, annotations, attestations) is metadata layered on the same content-addressed blob store.

The content-addressed part matters: every blob is named by the sha256 of its bytes. The manifest refers to each layer by the digest of the (usually compressed) tarball, and the image config records the digest of the uncompressed tar as the layer's diff ID. Two images that share a base layer share the same blob, because identical bytes produce the same hash. This is why docker system df -v reports a SHARED SIZE column alongside each image's total size. Ten Python applications built on the same python:3.12-slim image share the Python layer bit-for-bit, on disk and on the wire.
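
Content addressing needs nothing more exotic than a hash tool. The sketch below uses sha256sum from coreutils (assumed to be installed) to print a digest in the "sha256:<hex>" form that manifests use; blob_digest is our own name for the helper, and the two small files stand in for layer tarballs.

```shell
# blob_digest FILE: print the content address of FILE's bytes in the
# "sha256:<hex>" form used to name blobs.
blob_digest() {
    echo "sha256:$(sha256sum "$1" | cut -d' ' -f1)"
}

tmp=$(mktemp -d)
printf 'same bytes\n' > "$tmp/a"
printf 'same bytes\n' > "$tmp/b"
# Identical content yields an identical name, so a store keeps one copy.
if [ "$(blob_digest "$tmp/a")" = "$(blob_digest "$tmp/b")" ]; then
    echo "one blob, stored once"
fi
rm -r "$tmp"
```

A registry or local store that keys blobs by this string gets deduplication and integrity checking together: a download whose bytes hash to anything else is simply not the blob that was asked for.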

LSMs: SELinux, AppArmor, And The Second Line

Capabilities and seccomp restrict what a process can do, but they do not describe what the process is for. Linux Security Modules (LSMs) add mandatory access control, label-based or path-based, on top of the usual UNIX permissions. SELinux (originally from the NSA, shipped in Red Hat Enterprise Linux, Fedora, and CentOS Stream) and AppArmor (from Novell, shipped in Ubuntu, Debian, SUSE, and derivatives) are the two most widely deployed LSMs.

Container runtimes tie into LSMs in two main ways. Docker by default assigns each container a unique SELinux label (something like system_u:object_r:container_file_t:s0:c123,c456) on the rootfs and all mounted volumes, and launches the container with a matching process label. If two containers try to read each other's files through the host filesystem, SELinux's Multi-Category Security enforces that they cannot, because their labels do not overlap. AppArmor does the same thing with paths: Docker ships a docker-default profile that restricts containers' access to specific host paths, and Kubernetes lets pods declare their own AppArmor profile.

LSMs are what saved containers from a long tail of kernel and mount-namespace vulnerabilities. Many "container escape" CVEs of the 2015-2020 era worked in principle but were blocked in practice by the default SELinux or AppArmor profile. If you are running a RHEL-family host, your containers are SELinux-confined whether you noticed or not. If you are running Ubuntu, your Docker containers have an AppArmor profile by default. You can list the loaded profiles with sudo cat /sys/kernel/security/apparmor/profiles.

Rootless Containers

For most of Docker's history, running containers required root. The Docker daemon was root, and the docker CLI talked to it over a unix socket that was effectively a root privilege escalation to anyone in the docker group. That was fine for cattle-farm servers but bad for laptops, CI runners, and shared machines where you did not want every developer to be implicitly root.

Rootless containers change that. The user namespace lets an unprivileged user pretend to be root inside a container without being root on the host. Combined with uidmap shadow files (/etc/subuid, /etc/subgid), where each host user is allocated a range of 65,536 subordinate UIDs and GIDs, rootless containers can have their own "root" that maps to a specific, isolated range of high UIDs outside. Podman runs rootless by default. Buildah, Podman's image builder, does too. Docker added a rootless mode in 2020 (dockerd-rootless-setuptool.sh) that runs the daemon itself inside a user namespace.

Rootless is not without compromises. You cannot publish a container port on a privileged host port (below 1024) without either a host-side helper or lowering the threshold with sysctl net.ipv4.ip_unprivileged_port_start=80. Network setup is harder because unprivileged users cannot create veth pairs or manipulate the host's bridge; Podman uses slirp4netns (a userspace TCP/IP stack that relays packets through a tap device) or more recently pasta for rootless networking. Performance is slightly worse because of the extra copy. But for developer laptops in a shared office in Copenhagen, "docker without root" is a huge security improvement and the overhead is usually invisible.

systemd-nspawn And Other Alternatives

systemd-nspawn is the often-forgotten container runtime that ships with systemd. It was designed originally as "chroot on steroids" for running a second Linux userspace inside your main one during development, and it uses all the same kernel primitives as Docker: namespaces, cgroups, pivot_root, seccomp. It differs in philosophy: it integrates tightly with systemd, so an nspawn container is just a systemd unit, and it leans on systemd-machined to track running machines.

sudo systemd-nspawn -D /var/lib/machines/debian --boot

That one line boots a full Debian inside a container from a directory. It is remarkable for how much happens automatically: systemd-nspawn sets up namespaces, mounts /proc, wires up a private network if asked, attaches to journald for logging, and starts an init inside. It does not have an image registry, a layered filesystem, or an orchestrator, which is exactly the point: it is for the case where you already have a rootfs and want to run it like a VM without virtualisation.

LXC and LXD sit in a similar space: "system containers" optimised for running a long-lived userspace with multiple processes, as opposed to the "application containers" Docker popularised, which run one process per container. Canonical's LXD has shipped on Ubuntu for years and is used by hosting providers that want the economics of containers but the administrative model of lightweight VMs.

Then there are the kernel-isolated runtimes we mentioned earlier. Kata Containers is an OCI-compatible runtime that, under the hood, boots a lightweight virtual machine (QEMU by default, with Cloud Hypervisor and Firecracker as alternatives) for each container or pod. gVisor's runsc is also OCI-compatible and looks like a container runtime to Docker and Kubernetes, but inside it the container's syscalls go to a Go program called the Sentry that reimplements Linux semantics in userspace. Both trade performance and kernel compatibility for a much stronger isolation story, and both are used in production at serious scale (Google Cloud Run and App Engine are mostly gVisor; Confidential Containers in Azure are mostly Kata).

CRIU And Checkpoint-Restore

One of the more exotic things you can do with container primitives is freeze a running process and resurrect it later or elsewhere. CRIU (Checkpoint/Restore In Userspace) does exactly that. It walks a process's state through /proc/$PID/maps, /proc/$PID/fd, and a series of specialised ptrace operations, writes all of the memory pages, open file descriptors, pending signals, and credentials into a set of image files, and later reads them back to reconstruct the process.
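To get a feel for how much state that is, the same /proc files can be read by hand. The sketch below is only an illustration of the inputs CRIU inspects, not of CRIU itself; proc_snapshot_summary is a hypothetical helper that counts a process's memory mappings and open file descriptors.

```shell
# proc_snapshot_summary [PID]
# Count the memory mappings and open file descriptors of a process,
# two of the things a checkpoint has to capture. PID defaults to this shell.
proc_snapshot_summary() (
    pid="${1:-$$}"
    maps="$(wc -l < "/proc/$pid/maps")"
    fds="$(ls "/proc/$pid/fd" | wc -l)"
    printf 'memory mappings: %s\nopen file descriptors: %s\n' "$maps" "$fds"
)

proc_snapshot_summary "$$"
```

Even a bare shell has dozens of mappings; a real service adds sockets, pipes, timers, and signal state on top, which is why checkpointing needed so much kernel support.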

For containers, CRIU makes live migration possible. A stateful service (say, an in-memory cache in a Prague data centre) can be checkpointed on one node, the checkpoint files shipped across the network, and restored on another node with the process resuming from exactly where it left off, sockets and all. The kernel's support for this is a long tail of features added over a decade: namespaces must be restorable, network sockets must be recreatable with their queue state, memory mappings must be reopen-able at the same virtual addresses, and even the TCP send/receive buffers must be recoverable.

Kubernetes added experimental CRIU support in 2022 (version 1.25), behind the alpha ContainerCheckpoint feature gate, aimed mostly at checkpointing a running container for forensic analysis. The harder case (live-migrating an arbitrary pod with sockets and all) is still an active area of work. When it lands, it will make one of the biggest remaining differences between containers and VMs go away.

Networking: veth, Bridges, And iptables

A fresh network namespace is useless on its own: it contains nothing but a loopback interface, and even that starts out down. To give a container network access, runtimes create a virtual ethernet pair (veth): two network interfaces linked directly, so that whatever you transmit into one comes out of the other. One end is moved into the container's network namespace. The other end stays on the host and is plugged into a bridge (typically docker0 or cni0).

The recipe for adding networking to the container we built above:

# In the host, after the container process is running with PID $CPID:
ip link add veth0 type veth peer name veth1
ip link set veth1 netns "$CPID"

# In the host: bring up the host side and attach it to a bridge
ip link set veth0 up
ip link set veth0 master docker0   # legacy bridge-utils equivalent: brctl addif docker0 veth0

# Inside the container's net namespace: rename its side, bring it up, assign an IP
nsenter -t "$CPID" -n ip link set veth1 name eth0
nsenter -t "$CPID" -n ip link set eth0 up
nsenter -t "$CPID" -n ip addr add 172.17.0.42/16 dev eth0
nsenter -t "$CPID" -n ip route add default via 172.17.0.1

Now the container can ping the host, the host can ping the container, and the host's iptables NAT rules forward the container's outbound traffic to the internet through the host's physical interface. That is Docker's default bridge network in ten lines of shell. Kubernetes CNI plugins do broadly the same thing but with fancier IP address management and overlay networks using VXLAN or Geneve to carry traffic between nodes.
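On a busy host the practical question is usually the reverse: which host-side veth belongs to which container? Each interface exposes its own index and its link peer's index in sysfs, and for a veth end those two numbers differ. The sketch below is a heuristic under that assumption (VLANs and other stacked devices also report a differing iflink, and indexes in different namespaces can coincide); veth_peers is a hypothetical helper, and its optional argument exists only so it can be pointed at a different sysfs tree.

```shell
# veth_peers [SYSFS_NET_DIR]
# For every interface whose iflink differs from its ifindex, print
# "name ifindex peer-ifindex". Defaults to /sys/class/net.
veth_peers() (
    dir="${1:-/sys/class/net}"
    for ifdir in "$dir"/*; do
        [ -r "$ifdir/ifindex" ] && [ -r "$ifdir/iflink" ] || continue
        idx="$(cat "$ifdir/ifindex")"
        link="$(cat "$ifdir/iflink")"
        if [ "$idx" != "$link" ]; then
            printf '%s %s %s\n' "${ifdir##*/}" "$idx" "$link"
        fi
    done
)

veth_peers
```

Reading /sys/class/net/eth0/iflink inside the container (for example through nsenter) gives the index of the host-side end, which can then be matched against the second column printed on the host.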

Debugging From The Host: nsenter And Friends

When a container misbehaves, you often want to poke at it from the host, not from inside it. The primary tool for this is nsenter, which takes a target process and a set of namespace types and runs a command inside those namespaces. It is the debugging counterpart to unshare.

# Drop a shell into the network namespace of container PID 3421
sudo nsenter -t 3421 -n bash
# Inside, ip, ss, tcpdump, iptables-save all see the container's network
ss -tlnp
 
# Enter only the mount namespace, keep host's PID and network views
sudo nsenter -t 3421 -m ls /etc/nginx
 
# Enter everything and run a shell exactly as if you were inside
sudo nsenter -t 3421 -a bash

This is how you run tcpdump on a container that does not have it installed, how you poke at configuration files on a production container without docker exec, and how you run strace from the host on a process inside a confined sandbox. It is indispensable when something is wrong and the container's own toolchain is insufficient.

The companion tool is lsns, which lists all namespaces on the system with their type, process count, and owning process. ps -eo pid,pidns,netns,mntns,cmd shows each process's namespace IDs inline. And the /proc/$PID/ns/ directory holds one symlink per namespace the process is in, each pointing at an identifier such as net:[4026531840]; two processes share a namespace exactly when their links match. Put together, you can answer any "who is sharing what" question in a few seconds.
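The "who is sharing what" question can also be answered directly from the symlinks in /proc/$PID/ns. A minimal sketch: shared_namespaces is a hypothetical helper, and the PROC_ROOT variable is an assumption added only so the function can be pointed at a different proc tree. Links that cannot be read (other users' processes, when run without root) are reported as unknown.

```shell
# shared_namespaces PID_A PID_B
# For each namespace type, report whether the two processes share it.
shared_namespaces() (
    proc="${PROC_ROOT:-/proc}"
    for ns in cgroup ipc mnt net pid user uts; do
        a="$(readlink "$proc/$1/ns/$ns" 2>/dev/null || true)"
        b="$(readlink "$proc/$2/ns/$ns" 2>/dev/null || true)"
        if [ -z "$a" ] || [ -z "$b" ]; then
            state="unknown"
        elif [ "$a" = "$b" ]; then
            state="shared"
        else
            state="separate"
        fi
        printf '%-7s %s\n' "$ns" "$state"
    done
)

# Compare this shell with PID 1.
shared_namespaces "$$" 1
```

Run as root against two container PIDs from the same pod, the net and ipc lines should read "shared" while mnt reads "separate".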

For deep kernel-level investigation, bpftrace and BCC are the right tools. You can attach a kprobe to kernel_clone (named _do_fork or do_fork on older kernels) or to the clone3 syscall entry and watch containers being born, or instrument the seccomp filter path to see which syscalls are being rejected, or measure the page fault rate inside a specific cgroup. The kernel's tracepoint and eBPF facilities turn the kind of container introspection that used to require kernel patches into a one-liner.

The Pod Model

Kubernetes introduced a unit called the "pod" that groups multiple containers together. A pod is not a kernel concept; it is a Kubernetes abstraction. But it is built directly out of the primitives we have been discussing. A pod is one or more containers that share a set of namespaces: always the network namespace, always the IPC namespace, usually the UTS namespace, and optionally the PID namespace.

The implication is that containers in the same pod see each other over localhost. They share 127.0.0.1. They share a hostname. They do not share filesystems by default (each container has its own mount namespace and its own image) but they can share volumes that are mounted into both. This is why the "sidecar" pattern works: a logging sidecar container can read a log file that the main container writes to a shared volume, or a service-mesh proxy container can transparently intercept traffic from the main container because they share a network namespace.

Under the hood, Kubernetes creates a tiny "pause" container first whose only job is to own the shared namespaces. Every real container in the pod is then created with its own mount namespace but joined to the pause container's network and IPC namespaces via setns. When the pause container is killed, the shared namespaces go away and the pod is gone. The pause container is the lightest thing imaginable: on disk it is a few hundred kilobytes, and at runtime it calls pause() and waits for signals. That is the entire program.
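The behaviour is simple enough to approximate in shell. The sketch below is an illustration only, not the real pause binary (which is a small C program): pause_like is a hypothetical function that does nothing until it receives SIGTERM or SIGINT and then exits cleanly, which is all a namespace holder needs to do.

```shell
# pause_like: block until SIGTERM or SIGINT arrives, then exit with status 0.
pause_like() {
    trap 'exit 0' TERM INT
    while :; do
        # Sleep in the background and wait on it, so that an incoming
        # signal interrupts the wait and the trap runs promptly.
        sleep 1 &
        wait "$!"
    done
}

pause_like &
holder=$!
sleep 1                  # give the background shell time to install its trap
kill -TERM "$holder"     # "kill the pause container"
wait "$holder"
echo "holder exited with status $?"
```

In a real pod the interesting part is not the loop but what the process holds open: as long as it lives, the namespaces it owns stay alive for its siblings to join.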

Knowing that a pod is "a pause container plus some sibling containers that joined its namespaces" makes a lot of Kubernetes behaviour obvious. Why do you have to configure host networking at the pod level rather than the container level? Because the network namespace is shared. Why does kubectl exec into one container of a multi-container pod still show the pod's network interfaces? Same reason. Why are two sidecars in the same pod subject to the same IP, the same port range, the same routing table? The network namespace is shared.

What Containers Are Not

Containers share a kernel. This is the single most important thing to understand about what they are, because everything that makes them cheap and everything that makes them different from VMs comes from this fact. There is exactly one Linux kernel on a Linux host, no matter how many containers you run on it. Every container's syscalls go into the same kernel. Every container's memory is managed by the same memory subsystem. Every container's networking is implemented by the same netfilter and routing stack.

This has several consequences that regularly trip people up.

You cannot run a different kernel in a container. The container's version of "the kernel" is whatever the host is running. You can absolutely run an Alpine-based image (using musl userspace) on an Ubuntu host (using glibc), because userspace is independent. But the kernel is the host's.

You cannot run a Windows container on a Linux host or vice versa (except through virtualisation). Docker Desktop's "Linux containers" mode on Windows and macOS actually boots a Linux VM in the background and runs Linux containers inside that. "Windows containers" on Windows, by contrast, use Microsoft's own kernel-level container primitives, which are a completely separate design.

A kernel bug in the container's eyes is a kernel bug on the host. If a container exploits a Linux kernel privilege escalation, the attacker has host root, not container root. The kernel is the shared trust boundary, and every serious container security discussion comes back to "what does it take to escape the namespaces". Historically the escapes have mostly come from three places: kernel bugs in obscure subsystems like keyrings and user namespaces, misconfigurations that leave host paths bind-mounted into containers, and privileged containers that drop all the isolations on purpose.

For workloads that really need kernel isolation (multi-tenant cloud environments, running untrusted code), people reach for "sandboxed container runtimes" like gVisor (Google's userspace kernel reimplementation that intercepts syscalls) or Kata Containers (which runs each container inside a lightweight VM). Both trade some performance and features for much stronger isolation, and both exist specifically because the shared-kernel model has limits.

The Mental Model Worth Keeping

A Linux container is a process with:

Its own views of processes, mounts, networks, users, and IPC, via namespaces.
Its own resource limits, via cgroups.
A shrunken set of root privileges, via capabilities.
A filter on what syscalls it can call, via seccomp.
A dedicated root filesystem, via pivot_root and OverlayFS.

Everything beyond that is tooling and user experience. The kernel primitives are stable and have been there since around 2013. The things that keep changing are the image formats, the runtimes, the orchestrators, the image registries, the CI pipelines that build images, the tooling around secret injection, and the decision of whether to run containers directly on hardware or inside VMs inside other VMs inside a cloud provider's fleet. The core kernel mechanism has barely shifted.

When something is mysterious (a process that cannot see another process, a file that is inside the image but somehow writable, a network request that vanishes into iptables, a container that cannot bind to port 80 even as root), the productive move is always to ask which of the five primitives is responsible. It is almost always one of them, and the /proc and /sys filesystems, together with tools like nsenter and capsh, can tell you what the actual state is.
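For the "cannot bind to port 80 even as root" case, for instance, the answer is in /proc/$PID/status. The following is a rough sketch with two hypothetical helpers: primitive_summary pulls the capability and seccomp fields out of a status file, and has_cap tests a single bit of the capability mask (bit 10 is CAP_NET_BIND_SERVICE, bit 21 is CAP_SYS_ADMIN). The mask used in the example is the one commonly seen on a default Docker container and is shown purely as sample input.

```shell
# primitive_summary [STATUS_FILE]
# Summarise capability and seccomp state from a /proc/PID/status file.
primitive_summary() (
    status="${1:-/proc/self/status}"
    awk '
        $1 == "CapEff:"     { print "effective capabilities: 0x" $2 }
        $1 == "NoNewPrivs:" { print "no_new_privs: " $2 }
        $1 == "Seccomp:"    { print "seccomp mode: " ($2 == 0 ? "disabled" : ($2 == 1 ? "strict" : "filter")) }
    ' "$status"
)

# has_cap HEX_MASK BIT: succeed if capability number BIT is set in HEX_MASK.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

primitive_summary /proc/self/status

mask=00000000a80425fb   # sample CapEff value, not read from a live container
if has_cap "$mask" 10; then echo "CAP_NET_BIND_SERVICE present"; fi
if ! has_cap "$mask" 21; then echo "CAP_SYS_ADMIN absent"; fi
```

If the bit for CAP_NET_BIND_SERVICE is missing from a container's real CapEff, "root" inside it cannot bind low ports, and the mystery is a capabilities question rather than a networking one.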

Once that model clicks, every container is just "a normal Linux process, with a specific configuration of these five primitives, pointed at a root filesystem". Docker is a convenient way to produce that configuration. Kubernetes is a convenient way to produce that configuration on hundreds of hosts at once. runc and crun and youki are different implementations of the actual "tell the kernel to make this configuration real" step. They all produce the same underlying thing: a Linux process that the kernel has been asked to lie to in very specific, very boring ways.

There is no such thing as a container. There is only a Linux process with a carefully chosen set of lies.