Docker Internals: A Deep Dive into Containers

Docker leverages Linux kernel features to isolate processes, creating secure and efficient environments called containers. In this article, we'll explore what happens under the hood: Docker Engine and its runtime, namespaces, cgroups, capabilities, pivot_root, the OverlayFS-based filesystem, image layers and the build process, networking modes, and security best practices. Let's dive in!

Docker Engine: The Heart of Docker

When you start a container, the Docker client sends the request to the Docker daemon (dockerd) over its API. The daemon pulls the image if it is not already available locally, then has an isolated process created using several kernel features. Docker Engine manages:

Networking: Connects containers to networks.
Images: Stores images in the Docker filesystem.
Runtime: Creates isolated processes using namespaces, cgroups, and capabilities.

The Docker daemon does not create containers itself. It delegates container lifecycle management to containerd, which in turn invokes runc, a low-level OCI runtime, to set up the namespaces, cgroups, and capabilities and start the containerized process.
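As a rough way to see these pieces on a Linux host, the sketch below looks up each binary of the chain on the PATH. The binary names are the ones a typical Docker Engine package installs, which is an assumption about your installation, and find_component is a hypothetical helper, not a Docker command.

```shell
# find_component NAME: print where NAME is installed, or "not found".
find_component() {
    command -v "$1" || echo "not found"
}

# Client -> daemon -> containerd -> shim -> runc
for bin in docker dockerd containerd containerd-shim-runc-v2 runc; do
    printf '%-26s %s\n' "$bin" "$(find_component "$bin")"
done
```

On a host without Docker installed, every line simply reports "not found".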

Namespaces: Process Isolation

Namespaces isolate processes, ensuring that users, hostnames, networks, and PIDs are only visible within their respective namespaces. This is the foundation of containerization. There are eight types of namespaces:

net: Network interfaces.
mnt: Mount points.
uts: Hostname.
pid: Process IDs.
user: User isolation.
time: Offsets of the monotonic and boot-time clocks (not the wall-clock time).
ipc: Inter-process communication.
cgroup: The view of the cgroup hierarchy.

Every process belongs to exactly one namespace of each type. In that sense, the host system itself can be seen as a container: its processes simply belong to the initial (default) namespaces.

Exploring Namespaces

Example: Viewing Namespace IDs

sudo lsns -p 1

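By default, every other process inherits its parent's namespaces. A process only ends up in a different namespace when it is created in or moved to one explicitly, for example with unshare. To verify the inheritance, check the current shell's namespaces: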

lsns -p $$

You can also start a new shell in a new namespace using the unshare command:

sudo unshare --uts bash
lsns -p $$

The /proc filesystem provides another way to explore namespaces. For example, list the namespaces of the init process:

sudo ls -l /proc/1/ns
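The same /proc entries can be compared in a script. The following is a minimal sketch that checks whether a child process shares its parent shell's namespaces; ns_id and same_ns are hypothetical helper names, and the sleep process exists only to have a child to inspect.

```shell
# ns_id PID TYPE: print the namespace identifier, e.g. "uts:[4026531838]".
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns PID_A PID_B TYPE: succeed if both processes are in the same namespace.
same_ns() {
    [ "$(ns_id "$1" "$3")" = "$(ns_id "$2" "$3")" ]
}

# Start an ordinary child process and compare it with this shell.
sleep 30 &
child=$!
for type in mnt uts ipc pid net; do
    if same_ns "$$" "$child" "$type"; then
        echo "$type: shared ($(ns_id "$$" "$type"))"
    else
        echo "$type: different"
    fi
done
kill "$child"
wait "$child" 2>/dev/null || true
```

A child started this way should report every type as shared. Running the same comparison against a shell started with unshare --uts would be expected to show a different uts entry.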

cgroups: Resource Management

Control groups (cgroups) limit and account for resources such as CPU, memory, block I/O, and network traffic. They keep containers from exceeding their configured resource limits.

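On hosts using cgroup v1, the /sys/fs/cgroup directory contains one subdirectory per subsystem (controller), each controlling a different resource. On cgroup v2 hosts there is a single unified hierarchy instead, with controller settings exposed as files such as memory.max and cpu.max. The v1 subsystems are: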

blkio: Limits block device I/O.
cpu: Controls CPU usage.
cpuacct: Reports CPU usage.
cpuset: Assigns individual CPUs and memory nodes on multicore systems.
devices: Controls access to devices.
freezer: Suspends or resumes processes.
memory: Manages memory usage.
net_cls: Tags network packets.
net_prio: Sets network traffic priorities.
ns: The namespace subsystem; found only on old kernels and removed from current ones.
perf_event: Allows perf monitoring of the processes in a cgroup.

Example: Limiting Memory in a Docker Container

docker run --name alpine -it --rm --memory="512mb" alpine sh
docker stats alpine   # run this from a second terminal on the host

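You can view the memory limit from inside the container. On a cgroup v1 host:

cat /sys/fs/cgroup/memory/memory.limit_in_bytes

On a cgroup v2 host, the equivalent file is memory.max:

cat /sys/fs/cgroup/memory.max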

Cgroups let Docker cap each container's resource usage, so that one container cannot exhaust the resources of the host or of its neighbors.
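Because the location of the limit differs between cgroup v1 and v2, a small script can pick the right file. This is a sketch meant to be run inside the container; cgroup_version and memory_limit are hypothetical helpers, and the optional ROOT argument exists only so the logic can be tried against a test directory.

```shell
# cgroup_version [ROOT]: print "v2", "v1", or "unknown" for the hierarchy at ROOT.
cgroup_version() {
    local root="${1:-/sys/fs/cgroup}"
    if [ -f "$root/cgroup.controllers" ]; then
        echo v2        # the unified hierarchy exposes cgroup.controllers at its root
    elif [ -d "$root/memory" ]; then
        echo v1        # the legacy hierarchy has one directory per controller
    else
        echo unknown
    fi
}

# memory_limit [ROOT]: print the memory limit in bytes ("max" on v2 means unlimited).
memory_limit() {
    local root="${1:-/sys/fs/cgroup}"
    case "$(cgroup_version "$root")" in
        v2) cat "$root/memory.max" ;;
        v1) cat "$root/memory/memory.limit_in_bytes" ;;
        *)  echo "no cgroup hierarchy found at $root" >&2; return 1 ;;
    esac
}

echo "cgroup version: $(cgroup_version)"
memory_limit || true
```

Inside the container started with --memory="512mb", the expected value is 536870912 (512 x 1024 x 1024 bytes). On the host itself, the root cgroup of a v2 hierarchy has no memory.max file, so the last command may only print an error there.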

Capabilities: Restricting Permissions

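Docker Engine uses capabilities to limit what processes in a container are allowed to do. Capabilities split the privileges of root into separate units that can be granted or dropped individually. The containerd daemon itself runs with all capabilities, but containers are started with a restricted default set to enhance security.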

Exploring Capabilities

Example: Checking Capabilities

docker run -d --name nginx nginx
# Ask Docker for the host PID of the container's main process
pid=$(docker inspect --format '{{.State.Pid}}' nginx)
getpcaps "$pid"

Common capabilities include:

cap_sys_chroot: Required to call chroot and change a process's root directory.
cap_mknod: Needed to create special files, such as device nodes in /dev, with mknod.
cap_setuid and cap_setgid: Needed to change a process's user and group IDs.

Docker limits capabilities to reduce the risk of privilege escalation attacks.
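If getpcaps is not installed, the same information can be read from the CapEff line of /proc/<pid>/status, which holds the effective capabilities as a hexadecimal bitmask. The sketch below decodes a few bits; has_cap and cap_eff are hypothetical helpers, and the bit numbers are the ones defined in linux/capability.h.

```shell
# has_cap MASK BIT: succeed if capability number BIT is set in the hex MASK.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# cap_eff PID: print the effective capability mask of a process.
cap_eff() {
    awk '/^CapEff:/ { print $2 }' "/proc/$1/status"
}

mask=$(cap_eff "$$") || mask=0
mask=${mask:-0}
echo "CapEff of this shell: $mask"

# name:bit pairs, with bit numbers taken from linux/capability.h
for entry in cap_setgid:6 cap_setuid:7 cap_sys_chroot:18 cap_sys_admin:21 cap_mknod:27; do
    name=${entry%%:*}
    bit=${entry##*:}
    if has_cap "$mask" "$bit"; then
        echo "$name: present"
    else
        echo "$name: absent"
    fi
done
```

To inspect a container instead of the current shell, pass the PID of its main process to cap_eff. With Docker's default capability set the mask is typically 00000000a80425fb, which includes cap_sys_chroot and cap_mknod but not cap_sys_admin.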

pivot_root: Changing the Root Filesystem

The Docker runtime (runc) uses the pivot_root system call to switch the container's root filesystem to the image filesystem. The pivot_root command exposes the same operation from a shell.

Example: Using pivot_root

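# Run as root inside a new mount namespace, for example: sudo unshare --mount bash
mount --bind "$fs_folder" "$fs_folder"   # make the new root a mount point
cd "$fs_folder"
mkdir -p oldroot
pivot_root . oldroot                     # the old root is now visible at /oldroot
cd /
umount -l /oldroot                       # lazily detach the old root
rmdir /oldroot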

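After this sequence, the root directory is the filesystem in $fs_folder. The old root filesystem is lazily unmounted and its empty mount directory removed, leaving the process with only its isolated root. Run it only inside a separate mount namespace, otherwise it would replace the root of the host's own namespace.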
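The initial bind mount in the example is there because pivot_root requires the new root to be a mount point. A quick way to check that precondition is to look the directory up in /proc/self/mountinfo. This is a sketch: is_mount_point is a hypothetical helper, the temporary directory only stands in for an unpacked image filesystem, and paths containing spaces are not handled.

```shell
# is_mount_point DIR: succeed if DIR is a mount point in the current mount namespace.
is_mount_point() {
    local dir
    dir=$(cd "$1" && pwd -P) || return 1     # resolve to an absolute physical path
    # Field 5 of each /proc/self/mountinfo line is the mount point.
    awk -v d="$dir" '$5 == d { found = 1 } END { exit !found }' /proc/self/mountinfo
}

fs_folder=$(mktemp -d)    # stand-in for an unpacked image filesystem
for d in / "$fs_folder"; do
    if is_mount_point "$d"; then
        echo "$d: mount point, usable as the new root"
    else
        echo "$d: not a mount point, bind-mount it onto itself first"
    fi
done
rmdir "$fs_folder"
```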

Docker Filesystem: OverlayFS

Docker's long-standing default storage driver on Linux, overlay2, is built on OverlayFS, a union filesystem that stacks directories into layers:

LowerDir: Read-only layers.
UpperDir: Read-write layer.
MergedDir: The combined view.
WorkDir: Scratch directory used internally by OverlayFS to prepare changes before they appear in the upper layer.

When you inspect a Docker image, you can see these layers:

docker image inspect nginx | jq '.[0].GraphDriver.Data'

OverlayFS enables efficient storage management by sharing image layers across multiple containers. For example, containers using the same image will share the read-only layers, reducing disk usage.

You can mount an OverlayFS manually:

mkdir -p /mnt/testing
mount -t overlay -o lowerdir=/path/to/layers,upperdir=/path/to/upper,workdir=/path/to/work overlay /mnt/testing

Inspecting Image and Container Layers

First, pull the Nginx image and inspect its layers:

docker pull nginx:latest
docker image inspect nginx | jq '.[0].GraphDriver.Data'

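Next, create a container and compare its layers. If the nginx container from the capabilities example still exists, remove it first, since container names must be unique:

docker rm -f nginx
docker run --name nginx -d nginx:latest
docker container inspect nginx | jq '.[0].GraphDriver.Data'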
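In the inspect output, LowerDir is a single colon-separated string, in the same format as the lowerdir mount option, with the topmost layer listed first. The sketch below splits such a string into one layer per line. The list_layers helper and the sample paths are hypothetical; a real value would come from the inspect command.

```shell
# list_layers LOWERDIR: print one layer path per line, topmost layer first.
list_layers() {
    printf '%s\n' "$1" | tr ':' '\n'
}

# Hypothetical sample value. A real one can be read with:
#   docker container inspect nginx | jq -r '.[0].GraphDriver.Data.LowerDir'
lower="/var/lib/docker/overlay2/aaa-init/diff:/var/lib/docker/overlay2/bbb/diff:/var/lib/docker/overlay2/ccc/diff"

list_layers "$lower" | nl
echo "read-only layers: $(list_layers "$lower" | wc -l | tr -d ' ')"
```

Comparing the image's list with the container's list should show the same image layers in both, with the container adding its own entries on top.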

Docker Image Layers and Build Process

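Docker images are built from Dockerfiles. Instructions that change the filesystem (RUN, COPY, and ADD) each create a new image layer, while the others, such as CMD or ENV, only add metadata to the image configuration. Build steps are cached, so unchanged steps are reused on later builds.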

Example Dockerfile:

FROM ubuntu:latest
RUN apt-get update && apt-get install -y nginx
COPY . /var/www/html
CMD ["nginx", "-g", "daemon off;"]

Building the Image

docker build -t my-nginx-image .

Inspecting the Built Image

docker image inspect my-nginx-image

Each layer represents a change made by a Dockerfile instruction.
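To estimate how many filesystem layers a Dockerfile adds on top of its base image, you can count the RUN, COPY, and ADD instructions, since other instructions only record metadata. This is a simplified sketch: count_fs_layers is a hypothetical helper that ignores multi-stage builds and assumes each instruction starts on its own line.

```shell
# count_fs_layers FILE: count the instructions that add a filesystem layer.
count_fs_layers() {
    grep -ciE '^[[:space:]]*(RUN|COPY|ADD)[[:space:]]' "$1"
}

# Recreate the example Dockerfile in a temporary file.
dockerfile=$(mktemp)
cat > "$dockerfile" <<'EOF'
FROM ubuntu:latest
RUN apt-get update && apt-get install -y nginx
COPY . /var/www/html
CMD ["nginx", "-g", "daemon off;"]
EOF

echo "layers added on top of the base image: $(count_fs_layers "$dockerfile")"
rm -f "$dockerfile"
```

For the example Dockerfile this reports 2 (the RUN and COPY steps). On a built image, docker history my-nginx-image lists the layers together with the instruction that created each one.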

Docker Networking Modes

Docker provides several networking modes:

Bridge: The default mode. Containers connect to a virtual bridge.
Host: Containers use the host's network stack.
None: No networking.
Overlay: Used for multi-host networks, often in Swarm clusters.

Example: Creating a Custom Network

docker network create my-custom-network
docker run -d --name web1 --network my-custom-network nginx

Security Best Practices

To improve the security of your Docker environment, consider the following best practices:

Run Containers as Non-Root Users: Avoid running containers as root to reduce the risk of privilege escalation.
Use Official Images: Prefer verified and official images from trusted sources.
Limit Container Capabilities: Remove unnecessary capabilities to minimize potential attack surfaces.
Enable Resource Limits: Use cgroups to limit memory and CPU usage.
Regularly Update Images: Keep your Docker images up to date to apply security patches.
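Several of these practices map directly onto docker run flags. The sketch below assembles one possible combination and prints the resulting command instead of running it. The hardened_args helper is hypothetical, and the user ID, memory, and CPU values are placeholders to adapt to your workload.

```shell
# hardened_args: print docker run flags that apply the practices above.
hardened_args() {
    printf '%s\n' \
        --user=1000:1000 \
        --cap-drop=ALL \
        --security-opt=no-new-privileges \
        --read-only \
        --memory=256m \
        --cpus=0.5
}

# Print the full command for review; remove the echo to actually run it.
echo "docker run --rm $(hardened_args | tr '\n' ' ')alpine id"
```

Starting from --cap-drop=ALL and adding back only what an application needs with --cap-add is usually easier to reason about than removing capabilities one by one.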

Understanding Docker internals provides a deeper appreciation of how containers achieve isolation, resource management, and security. These insights can help you build more efficient and secure containerized applications.

What do you think about Docker's internals? Let me know your thoughts in the comments below!

#Docker #Containers #DevOps

Anton Lindström

DevOps Architect at Viedoc
