Basics of Docker Security - Part 2
Introduction
In part one of this blog post series, we looked at how a Docker container runs on an operating system and at some of the common threats that can be exploited, to cover the basics of container security. In this second part I will continue by explaining some of the best practices you should adhere to in order to keep your container environment secure. This obviously will not cover the entire attack surface, but it will give you a very good understanding of the basics and beyond: how to secure your containers, how containers are already secured out of the box, and what not to do to mess that up. This is going to be an intense one, so buckle up. We will be covering the following topics.
Control Groups
Control groups (cgroups) are a Linux kernel feature that limits how much of a resource, such as RAM or CPU, a process or group of processes can use. Docker uses cgroups to apply the same limits to containers.
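As a quick illustration of cgroups in action, Docker exposes them through flags such as --memory and --cpus. The sketch below is a minimal example; the container name and the limit values are arbitrary.
# Launch a container capped at 256 MB of RAM and half a CPU core
docker run -itd --rm --name limited --memory 256m --cpus 0.5 alpine sleep 1d
# docker stats reads the limits back from the cgroup (see the MEM USAGE / LIMIT column)
docker stats --no-stream limited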
To inspect the cgroups, navigate to the /sys/fs/cgroup/pids/docker/ folder. The cgroup files of a Docker container are stored inside a folder named after the container's hash. The following steps can be used to locate them.
First, we launch a container from the alpine image and note its hash value.
docker@docker:~$ docker run -itd --rm alpine # Running docker without any pids restriction
605eee91ed718c322d63e0c6f2f81255085a81cfbb01e5e691f93cac4c817059
Next, we navigate to /sys/fs/cgroup/pids/docker/ followed by the container's hash, and look at the files available under it.
###
Navigate to the cgroup folder to check PID limit which is set to max
###
docker@docker:/sys/fs/cgroup/pids/docker/605eee91ed718c322d63e0c6f2f81255085a81cfbb01e5e691f93cac4c817059$ ls
cgroup.clone_children cgroup.procs notify_on_release pids.current pids.events pids.max tasks
docker@docker:/sys/fs/cgroup/pids/docker/605eee91ed718c322d63e0c6f2f81255085a81cfbb01e5e691f93cac4c817059$ cat pids.max
max
The file we are interested in is pids.max. The value it holds is “max”, which means there is no limit on the number of processes our running container can create.
Now, let us launch a container while limiting its number of processes. For that we use the --pids-limit flag and set the limit to 5.
###
Launch a docker container by limiting PID
and check the PID limit
###
docker@docker:~$ docker run -itd --pids-limit 5 alpine # PID limit set to 5
7a8136c5dae9e98b80c1a1b1eb40b2f7bea126f23c8a22e96ac6a58c41bb1a2
docker@docker:/sys/fs/cgroup/pids/docker/7a8136c5dae9e98b80c1a1b1eb40b2f7bea126f23c8a22e96ac6a58c41bb1a28$ cat pids.max
5
Navigate back to the /sys/fs/cgroup/pids/docker/<container hash> directory and take a look at the pids.max file. We can see its value has changed to 5.
Now, if we try to run more than 5 parallel processes, the cgroup will not allow more than 5 to be created. Such controls put a hard limit on how much of a resource a container can use and also help prevent attacks such as a fork bomb. The same is showcased below using docker stats.
NOTE: While trying to replicate the above scenario I ran into a problem: when the --pids-limit flag is set and the container is started in detached mode, I was not able to get a shell on the container using the docker exec command. Therefore, to demonstrate this, the -it flags were used without the -d switch, so that after launching the container I drop straight into its shell. I don't really know why, but this worked!
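If you want to see the limit kick in yourself, the rough idea looks like this (a sketch following the note above, so the container is started without -d):
# Start an interactive container with a PID limit of 5
docker run -it --rm --pids-limit 5 alpine sh
# Inside the container, try to spawn more background processes than the limit allows;
# once the limit is reached, the shell should report a fork failure instead of creating them
for i in 1 2 3 4 5 6 7 8; do sleep 60 & done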
Privileges and Linux Capabilities
When the --privileged flag is used, the container is given all Linux capabilities. If an attacker gains access to such a container, they can take advantage of these capabilities.
cap_sys_admin, cap_sys_ptrace and cap_sys_module are some of the more dangerous ones, to name a few. What Linux capabilities are and how they fit into managing a Linux environment is a very broad topic that will not be covered in this blog post, but we will go over the basics in relation to Docker and how they are used. For example, if an attacker gains a shell on a container that has cap_sys_module enabled, they can load a kernel module directly into the host's kernel from within the container. Let's understand what all of this really means.
We are going to run two containers, the first without the privileged flag and the second with it. Then we will install libcap in each to see which capabilities the containers are given. The following commands can be used on Alpine.
# Run Docker Without Privileged flag
docker run -it alpine sh
# Install libcap
apk add -U libcap
# See the capabilities
capsh --print
# Run Docker With Privileged flag
docker run -it --privileged alpine sh
# Install libcap
apk add -U libcap
# See the capabilities
capsh --print
Container running without privileged flag.
Container running with privileged flag.
So we could see the following difference in the capabilities.
container 1:
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
container 2:
Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read+eip
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
Now, what are these capabilities, and what does having more of them mean? The Linux capabilities man page defines them as follows.
For the purpose of performing permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero). Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to full permission checking based on the process's credentials (usually: effective UID, effective GID, and supplementary group list). Starting with Linux 2.2, Linux divides the privileges traditionally associated with superuser into distinct units, known as capabilities, which can be independently enabled and disabled. Capabilities are a per-thread attribute.
So we can see that Docker by default restricts the container to a small set of capabilities, and the --privileged flag bypasses that default set and assigns every capability to the container, which is not at all recommended.
Docker also gives us ways to pick and choose the individual capabilities we want to add or remove. This is how it can be done.
We can remove capabilities from a container using the --cap-drop flag, and add them back using --cap-add.
docker run -it --cap-drop CHOWN alpine sh
We can also drop all the capabilities from the container and then add specific capabilities to the container.
docker run -it --cap-drop ALL --cap-add chown alpine sh
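A quick way to convince yourself that the flags work is to run one command that needs a kept capability and one that needs a dropped capability. This is only a sketch; exact error messages will vary.
# CHOWN was added back, so changing ownership still works as root
docker run --rm --cap-drop ALL --cap-add chown alpine chown nobody /tmp
# NET_RAW was dropped, so a raw-socket ping should fail with a permission error
docker run --rm --cap-drop ALL --cap-add chown alpine ping -c 1 127.0.0.1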
Namespaces
Now that we understand what cgroups and Linux capabilities can do, namespaces are the next thing to understand. Namespaces partition what a process can see of the system, providing controlled isolation between a container, the host and other running containers. Docker Engine uses the following namespaces on Linux: PID, NET, IPC, MNT, UTS and UID (user).
Let us understand these one by one. But before that, we should know that the PID, NET, IPC, MNT and UTS namespaces are applied automatically to give containers their isolation, whereas the UID namespace is something we have to configure ourselves in order to enhance security. Let us look at each namespace practically to get a better understanding.
PID Namespace
PID (Process Identifier) namespaces are a Linux feature that provides isolation between running processes. This is done by giving a group of processes its own set of PIDs inside its own namespace, which allows that group to run in an isolated environment, unaware of processes running in other namespaces.
By default, every Docker container runs in its own PID namespace. To see this, we can simply run two containers and observe how they behave.
As we can see from the image above, we ran two containers, “thefirst” and “thesecond”, each executing a sleep command that runs for one day. When we inspect the PID of this process from inside each container, it is 1 in both cases. This clearly shows that both containers run in isolated environments, each with its own PID numbering and no access to the other's PIDs. But when we look at the host machine, we can see the actual PIDs that have been assigned to these processes.
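For anyone replicating that screenshot, the commands behind it look roughly like this (a sketch; the host PIDs you see will differ):
# Two containers, each running a one-day sleep
docker run -itd --name thefirst alpine sh -c 'sleep 1d'
docker run -itd --name thesecond alpine sh -c 'sleep 1d'
# Inside each container the sleep shows up with PID 1
docker exec thefirst ps
docker exec thesecond ps
# On the host, the very same processes carry their real, distinct PIDs
ps -ef | grep 'sleep 1d'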
For a better understanding, we can also check which PID namespace a process belongs to. For that, list the /proc/self/ns/pid symlink.
ls -l /proc/self/ns/pid
To see all the PID namespaces on the host, we can run the following command.
sudo find /proc -maxdepth 3 -type l -name pid -exec readlink {} \; 2>/dev/null | sort -u
In the following image we can see all the PID namespaces present on the host.
Docker does provide a way to run different containers inside the same PID namespace, so that the processes spawned in one container are aware of the processes in the other. The same is shown in the screenshot below.
Here, “thefirst” container is launched with the process “sleep 1d”. Then “thesecond” container is launched with the --pid flag pointing it at “thefirst”; “thesecond” too runs a sleep 1d process.
docker run -itd --name thesecond --pid=container:thefirst alpine sh -c 'sleep 1d'
Upon inspecting the processes of the second container, we can see both sleep processes being executed with different PIDs. Thus we can conclude that both containers are aware of each other's processes. We can also check /proc/self/ns/pid and see that only one PID namespace is shared between them.
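A minimal way to verify this, assuming the two containers from the previous step are still running:
# Both sleep processes are visible from inside thesecond container
docker exec thesecond ps
# Both containers point at the same pid namespace inode on the host
sudo ls -l /proc/$(docker inspect -f '{{.State.Pid}}' thefirst)/ns/pid
sudo ls -l /proc/$(docker inspect -f '{{.State.Pid}}' thesecond)/ns/pid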
NET Namespace
Whenever a Docker container is created, it comes with its own isolated network interface. A virtual network and IP address are granted to the container so that the entire network stack of each container runs in isolation from the others.
All of this happens by default, and there is not much that needs to be done here as a security enhancement. To visualize it better, we can first create a container.
docker run -itd --name thefirst alpine sleep infinity
The container is created and kept running by the sleep command, which runs indefinitely. Next we can inspect the running container using its hash value to find the container's process ID on the host.
sudo docker inspect -f '{{.State.Pid}}' 7303cebc0f7a3e7cb492d8521a70f8ad3309b109c824ee60f4f8385533fb7749
The process ID in this case comes out to be 4473. We can view the net namespace created by Linux for the container using the following command.
sudo lsns -t net
After running this command you can see the namespace created for this container in the resulting list. If the list is too long, you can grep for the process ID, which was 4473 in this case. All the namespaces of any Docker container are created under the /proc/<PID>/ns folder. All the above steps are shown in the screenshot below.
As you can see in the image above, when listing the ns folder under /proc for this process we can see all the namespaces created for this container. This helps to understand how the Linux operating system maintains and tracks different namespaces for different processes. Keep this in mind for the other namespaces we are going to discuss.
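As a rough sketch of what the screenshot shows (4473 being the PID from the example above; yours will differ):
# Every namespace Linux tracks for the container's init process
sudo ls -l /proc/4473/ns
# Or filter the host-wide namespace listing down to that PID
sudo lsns | grep 4473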
IPC Namespace
The IPC (Inter-Process Communication) namespace is responsible for separating shared memory segments. Shared memory is a Linux feature which allows two processes to communicate by means of a shared region of memory. This is a very efficient and fast way of communicating compared to methods like pipes or sockets: multiple processes access the same memory location, which results in fast, efficient communication and ultimately better performance.
IPC and shared memory are in themselves a huge topic which we are not going to dive into right now. But to understand how Docker handles them by default, let us take a look at some examples.
By default, every time a container is run, a separate IPC namespace is created for it. This means all running containers are isolated and do not share any memory segments with each other. From a security standpoint this is a thumbs up and there is not much to tinker with here.
But what if your developers come with a requirement where two or more containers need to share memory over IPC? In such cases we should be equipped with enough knowledge to make the decision that poses the minimum security risk. The only reason for sharing an IPC namespace is performance. Docker provides the following options to implement IPC sharing:
--ipc=host
--ipc=shareable
--ipc=container:<id>
And now we know! Sharing IPC with the host should be the last option. If IPC sharing is required, it should always be done between containers and not with the host. The following example shows how to run one container with a shareable IPC namespace and connect a second container to it.
# Run the first container and make its IPC namespace shareable
docker run -itd --name thefirst --ipc=shareable alpine
# Run the second container and connect it to the shareable IPC namespace of thefirst container
docker run -itd --name thesecond --ipc=container:thefirst alpine
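To see a shared segment from both sides, something along these lines should work (a sketch; it assumes util-linux is installed in the containers to provide the ipcmk and ipcs tools):
# Install util-linux in both containers to get ipcmk / ipcs
docker exec thefirst apk add -U util-linux
docker exec thesecond apk add -U util-linux
# Create a 1 KB shared memory segment from the first container
docker exec thefirst ipcmk -M 1024
# The second container, sharing the IPC namespace, sees the same segment
docker exec thesecond ipcs -m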
MNT Namespace
Whenever a new container is created, a new mount namespace is automatically created for it. We already saw this above by looking into the ns folder at /proc/<PID>/ns.
To look at a container's mount namespace information, we can use the following steps:
docker run -itd --name mycontainer alpine
docker inspect -f '{{.State.Pid}}' mycontainer
findmnt -N 2651
As shown in the example above, after running the container we get its PID, and with that PID we can use the findmnt command to view the mount-namespace information. The same is shown in the screenshot below. One can also use cat /proc/<PID>/mountinfo to view the same information that findmnt gave us.
We can also use the nsenter command to enter the mount namespace with the following command.
sudo nsenter --target <PID> --mount ls /
The above command enters the mount namespace of the given PID and runs ls against the container's mounted root directory.
UTS Namespace
The UTS (Unix Timesharing System) namespace, despite what the name might suggest, is actually responsible for hostname and NIS domain name isolation. As we have already seen, whenever a container is run a new set of namespaces is created, and UTS is one of them.
Whenever a new container is run, it is given a random hostname. This happens because of the UTS namespace. To see it, shell into a container and run the hostname command; every time a new container is created, it comes up with a new hostname.
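A minimal sketch to see this in action:
# Each run gets a fresh UTS namespace, so each prints a different random hostname
docker run --rm alpine hostname
docker run --rm alpine hostname
# The hostname can also be set explicitly, still isolated from the host's own
docker run --rm --hostname myapp alpine hostname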
UID Namespace
Consider the following scenario. You are a non-root user on a Linux machine that has Docker installed, and you are a member of the docker group because you need to run and manage containers. There is a way you could easily escalate your privileges to the root user of that machine. This is what we look at below, along with how to mitigate it.
One of the most interesting and most frequently misconfigured namespaces is the UID namespace. An important thing to understand is that the root user inside a container is similar (though not equal) to the root user on the host. If a container has a directory mounted from the host, a user inside the container can manipulate the files inside that shared folder.
As one can see in the scenario above, a file “otherside.txt” was created by the root user. As a non-root user you should not be able to read or write this file. But since Docker is installed on the host, you can spin up a container, mount the host's root directory into it, and you then have the privileges to roam around that directory using the container's shell. A container can be started with the host's root directory mounted using the following command.
docker run -it -v /:/shared alpine
Once done, you can navigate to the /shared folder and read and write the host's root-owned files. The same scenario is shown in the GIF above.
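For reference, the in-container steps look roughly like this (the path to otherside.txt is assumed here; adjust it to wherever the file was created in your demo):
# Start a container with the host's root filesystem mounted at /shared
docker run -it -v /:/shared alpine sh
# Inside the container we are root, so root-owned host files are readable and writable
cat /shared/root/otherside.txt        # assumed location of the demo file
echo tampered >> /shared/root/otherside.txt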
The way to overcome this is by using Docker user namespaces, which protect against this privilege-escalation path. A container running as root can be re-mapped to a less privileged user of the host. The re-mapped host user is given a range of subordinate UIDs (presented as UIDs 0 to 65536 inside the container) while having no privileges on the host machine.
The remapping is handled by two files, /etc/subuid and /etc/subgid, where subuid handles the user IDs and subgid handles the group IDs.
dockremap:165536:65536
The format of the entries in these files is shown above. The dockremap user is created by default when remapping is enabled; if it is not, you can edit the respective files and add this user with a defined range. To configure the remapping and start all Docker containers with a re-mapped user, we must first stop the Docker daemon.
sudo systemctl stop docker
or
sudo service docker stop
After stopping the Docker service, there are two ways to start the daemon with a remapped user. The first is the --userns-remap=default switch.
sudo dockerd --userns-remap=default
The second is to edit the /etc/docker/daemon.json file. Add the remap setting here to enable remapping by default.
{
"userns-remap": "default"
}
Save the file and restart Docker with sudo service docker restart. Now the dockremap user should be mapped by default.
After remapping, the container's root user is no longer able to perform actions on the host that require real root privileges.
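A rough way to verify the remapping (a sketch; the exact UID you see depends on the range in /etc/subuid):
# The container's root is now an unprivileged UID (165536 here) on the host
docker run -itd --rm --name remapped alpine sleep 1d
ps -eo user,pid,cmd | grep 'sleep 1d'
# Repeating the earlier volume-mount trick now hits normal host permissions,
# so reading a root-only file should fail with a permission error
docker run --rm -v /:/shared alpine cat /shared/root/otherside.txt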
AppArmor Profiles
Managing everything we learned above about cgroups, namespaces and Linux capabilities individually can be a daunting task. That is where security profiles come into the picture. In this blog post I only plan to introduce these concepts; once you know them, it is up to you how deep you want to go. Maybe in a future blog post we can take a deeper dive into them.
AppArmor, or Application Armor, is a Linux security module which allows us to restrict a program's capabilities with AppArmor profiles, and it can be used to protect Docker containers from security threats. To use it with Docker, we associate an AppArmor security profile with each container: when we start a container we pass it a custom profile, and Docker expects to find that AppArmor policy loaded and enforced.
To check whether AppArmor is available, run docker info; it should be listed under Security Options. Next, we can write an AppArmor profile.
#include <tunables/global>
profile apparmor-profile flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  file,
  network,
  capability,
  deny /tmp/** w,
  deny /etc/passwd rwklx,
}
The line deny /tmp/** w, denies any write to the /tmp directory; the ** ensures that writes inside subdirectories of /tmp are blocked as well. The line deny /etc/passwd rwklx, prevents any action on the passwd file. Load and enforce the AppArmor policy, then run the container with the profile.
sudo apparmor_parser -r -W apparmor-profile
docker run -it --security-opt apparmor=apparmor-profile alpine sh
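To confirm the profile is actually enforced, both of the following should fail with a permission error (a quick sketch):
# Writing under /tmp is denied by the profile
docker run --rm --security-opt apparmor=apparmor-profile alpine sh -c 'touch /tmp/blocked'
# Any access to /etc/passwd is denied as well
docker run --rm --security-opt apparmor=apparmor-profile alpine sh -c 'cat /etc/passwd'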
Seccomp Profile
Seccomp is another Linux feature which can be used to filter the system calls issued by a program; it acts like a firewall for system calls. We can write seccomp profiles to control which system calls can be made from within a container, and such a profile needs to be loaded on each container we want to restrict. Here is an example profile.
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "name": "chmod",
      "action": "SCMP_ACT_ERRNO",
      "args": []
    }
  ]
}
The above profile will block any chmod operation inside the container. Now we can run a container using this profile (saved here as seccomp-profile.json).
docker run -it --security-opt seccomp=seccomp-profile.json alpine sh
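A quick check that the profile is doing its job (a sketch; on setups where the chmod binary uses a different syscall variant such as fchmodat, this minimal profile would not catch it):
# Blocked by the profile: the chmod syscall should return "Operation not permitted"
docker run --rm --security-opt seccomp=seccomp-profile.json alpine chmod 400 /etc/hostname
# The same command succeeds under Docker's default seccomp profile
docker run --rm alpine chmod 400 /etc/hostname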
By default Docker applies a seccomp profile which disables a number of system calls in every newly created container. Because of this profile we cannot use commands such as insmod without the --privileged flag. If we use the --privileged flag together with a seccomp profile, it will override the policy.
docker run -it --privileged --security-opt seccomp=seccomp-profile.json alpine sh
With this I will end the blog post. If you have made it this far, congratulations and thank you. I do hope you learned something new today.