Basics of Docker Security - Part 2

Introduction

In part one of this blog post series, we touched on how a Docker container runs on an operating system and looked at some of the common threats that can be exploited, in order to understand the basics of container security. In this second part I will continue by explaining some of the best practices one should follow to keep a container environment secure. This will obviously not cover the entire attack surface, but it will give you a very good understanding of the basics and beyond: how to secure your containers, how containers are already secured out of the box, and what not to do to mess that up. This is going to be an intense one, so buckle up. We will be covering the following topics.

  • Control Groups

  • Privileges and Linux Capabilities

  • Namespaces

  • AppArmor Profiles

  • Seccomp Profiles


Control Groups

Control groups (cgroups) are a Linux kernel feature that lets you limit how much of a resource, such as RAM, CPU or the number of processes, a process can use. Docker applies cgroups to its containers.

To check the cgroups, one can navigate to the /sys/fs/cgroup/pids/docker/ folder. The cgroup files of a Docker container are stored inside the folder named after the container's hash. The following steps can be used to locate the corresponding files.

First we launch a container from the alpine image and note its hash value.

docker@docker:~$ docker run -itd --rm alpine # Running docker without any pids restriction
605eee91ed718c322d63e0c6f2f81255085a81cfbb01e5e691f93cac4c817059        
PID Limit

Next we navigate to the /sys/fs/cgroup/pids/docker/ folder, followed by the hash of the container, and look at all the files available under it.

###
Navigate to the cgroup folder to check PID limit which is set to max
###

docker@docker:/sys/fs/cgroup/pids/docker/605eee91ed718c322d63e0c6f2f81255085a81cfbb01e5e691f93cac4c817059$ ls
cgroup.clone_children  cgroup.procs  notify_on_release  pids.current  pids.events  pids.max  tasks
docker@docker:/sys/fs/cgroup/pids/docker/605eee91ed718c322d63e0c6f2f81255085a81cfbb01e5e691f93cac4c817059$ cat pids.max
max        

The file we are interested in is pids.max. The value this file holds is “max”, which means there is no limit on the number of processes our running container can create.

PIDs Limit

Now, let us launch a container while limiting its number of processes. For this we use the --pids-limit flag and set the limit to 5.

###
Launch a docker container by limiting PID
and check the PID limit
###

docker@docker:~$ docker run -itd --pids-limit 5 alpine # PID limit set to 5
7a8136c5dae9e98b80c1a1b1eb40b2f7bea126f23c8a22e96ac6a58c41bb1a2

docker@docker:/sys/fs/cgroup/pids/docker/7a8136c5dae9e98b80c1a1b1eb40b2f7bea126f23c8a22e96ac6a58c41bb1a28$ cat pids.max
5        

Navigate back to the /sys/fs/cgroup/pids/docker/ directory, this time under the new container's hash, and take a look at the pids.max file. We can see its value has changed to 5.

Check PID Limit

Now if we try to run more than 5 parallel processes, the cgroup will refuse to create more than 5 of them. Such controls put a hard limit on the resources a container can consume and help prevent attacks such as a fork bomb. The same is showcased below using docker stats.

Docker Stats
NOTE: While trying to replicate the above scenario I ran into a problem: when the --pids-limit flag is set and the container is started in detached mode, I was not able to get a shell on the container using the docker exec command. Therefore, to demonstrate this, the -it flags without the d switch were used, so that after launching the container I drop straight into the container's shell. Don't really know why, but this worked!
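If you want to try the limit yourself, a minimal sketch (assuming the alpine image; the exact error message may vary) looks like this:

# Launch an interactive container with a hard cap of 5 PIDs
docker run -it --rm --pids-limit 5 alpine sh

# Inside the container, try to start more background processes than the cap;
# once the cgroup limit is reached, further forks fail with a "can't fork" style error
for i in 1 2 3 4 5 6 7 8; do sleep 60 & done
ps    # at most 5 processes exist inside the container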

Privileges and Linux Capabilities

When the --privileged flag is used with a container, it grants all Linux capabilities to the container. If an attacker gains access to such a container, they can take advantage of these capabilities.

cap_sys_admin, cap_sys_ptrace and cap_sys_module are some of the more dangerous capabilities, to name a few. What Linux capabilities are and how they play a role in managing a Linux environment is a very broad topic that will not be covered in this blog post, but we will understand the basics in relation to Docker and how they are used. For example, when an attacker gains a shell on a container that has cap_sys_module enabled, it is possible to load a kernel module directly into the host's kernel from within the container. Let's understand what all of this really means.

We are going to run two containers, the first without the privileged flag and the second with it. Then we will install libcap in each to see which capabilities are granted to the container. The following commands can be used on Alpine-based containers.

# Run Docker Without Privileged flag
docker run -it alpine sh
# Install libcap
apk add -U libcap
# See the capabilities
capsh --print

# Run Docker With Privileged flag
docker run -it --privileged alpine sh
# Install libcap
apk add -U libcap
# See the capabilities
capsh --print        

Container running without privileged flag.

Container Capabilities

Container running with privileged flag.

Capabilities Difference

So we can see the following differences in the capabilities.

container 1:

Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap

container 2:

Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read+eip
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read        

Now, what are these capabilities and what does having more of them mean? The Linux capabilities man page defines capabilities as follows.

https://man7.org/linux/man-pages/man7/capabilities.7.html

For the purpose of performing permission checks, traditional UNIX implementations distinguish two categories of processes: privileged processes (whose effective user ID is 0, referred to as superuser or root), and unprivileged processes (whose effective UID is nonzero). Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to full permission checking based on the process's credentials (usually: effective UID, effective GID, and supplementary group list). Starting with Linux 2.2, Linux divides the privileges traditionally associated with superuser into distinct units, known as capabilities, which can be independently enabled and disabled. Capabilities are a per-thread attribute.

So we can see that Docker by default drops many capabilities, and the privileged flag bypasses this default capability set and assigns every capability to the container, which is not at all recommended.

Docker also provides ways to pick and choose the individual capabilities we want to add or remove. This is how it can be done.

We can remove a capability from a docker container by using the --cap-drop flag.

docker run -it --cap-drop CHOWN alpine sh        

We can also drop all capabilities from the container and then add back only the specific capabilities it needs.

docker run -it --cap-drop ALL --cap-add chown alpine sh        
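To confirm the effect, a quick sketch (using the standard CHOWN capability, as in the command above) is to drop everything, add CHOWN back, and inspect the result from inside the container:

# Drop all capabilities, add back only CHOWN
docker run -it --rm --cap-drop ALL --cap-add CHOWN alpine sh

# Inside the container:
apk add -U libcap
capsh --print                               # the Current set should now list only cap_chown
touch /tmp/file && chown nobody /tmp/file   # still allowed, because CHOWN was added back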

Namespaces

Now that we understand what cgroups and Linux capabilities can do, namespaces are what we need to understand next. A namespace partitions kernel resources so that a group of processes sees its own isolated view of the system. The purpose of namespaces is to provide controlled isolation between a container, the host, and the other running containers. The following namespaces are used by Docker Engine on Linux:

  • PID namespace for process isolation.

  • NET namespace for managing network interfaces.

  • IPC namespace for managing access to IPC resources.

  • MNT namespace for managing filesystem mount points.

  • UTS namespace for isolating kernel and version identifiers.

  • User ID namespace for privilege isolation

Let us understand these one by one. But before that, we should know that the PID, NET, IPC, MNT and UTS namespaces are applied automatically to give containers their isolation, whereas the user ID namespace is something we have to configure ourselves in order to enhance security. Let us look at each namespace practically to get a better understanding.

PID Namespace

The PID (Process Identifier) namespace is a Linux feature that isolates running processes from one another. This is done by giving a group of processes its own namespace with its own set of PIDs. The group then runs in an isolated environment, unaware of processes running in other namespaces.

By default, every Docker container runs in its own PID namespace. To see this, we can simply run two containers and observe how they behave.

PID Namespace

As we can see in the image above, we ran two containers named “thefirst” and “thesecond” and started a sleep command in each that will run for one day. When we inspect the PID of this process inside each container, it is “1” in both cases. This clearly shows that both containers run in isolated environments, each with its own PID numbering and no access to the other's PIDs. But when we look at the host machine, we can see the actual PIDs that have been assigned to these processes.
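A rough sketch of the commands behind that screenshot (the container names are just the ones used in this example):

docker run -itd --name thefirst alpine sleep 1d
docker run -itd --name thesecond alpine sleep 1d
docker exec thefirst ps      # sleep shows up as PID 1
docker exec thesecond ps     # sleep shows up as PID 1 here as well
ps -ef | grep 'sleep 1d'     # on the host, the same processes have ordinary, distinct PIDs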

For a better understanding, we can also take a look at the PID namespace a process belongs to. For this, read the /proc/self/ns/pid symlink, which holds the PID namespace identifier.

ls -l /proc/self/ns/pid        

To see all the PID namespaces on the host we can run the following command.

sudo find /proc -maxdepth 3 -type l -name pid -exec readlink {} \; 2>/dev/null | sort -u        

In the following image we can see all the PID namespaces present on the host.

PID List

Docker does provide a feature to run different containers inside the same PID namespace, so that the processes spawned in one container are visible to the other container. The same is shown in the screenshot below.

PID Sharing

Here, the “thefirst” container is launched with the process “sleep 1d”. Then the “thesecond” container is launched with the --pid flag pointing it to “thefirst”. The “thesecond” container also runs a “sleep 1d” process.

docker run -itd --name thesecond --pid=container:thefirst alpine sh -c 'sleep 1d'        

Upon inspecting the processes of the second container, we can see both sleep processes running with different PIDs. Thus we can conclude that both containers can see each other's processes. We can also read the /proc/self/ns/pid symlink in each container and confirm that only one PID namespace is in use.
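A short sketch of how this can be verified, assuming the two containers from the command above are still running:

docker exec thesecond ps                           # both sleep processes are visible, with different PIDs
docker exec thefirst readlink /proc/self/ns/pid    # prints the same pid:[...] identifier ...
docker exec thesecond readlink /proc/self/ns/pid   # ... as this one, confirming a shared PID namespace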

NET Namespace

Whenever a Docker container is created, it gets its own isolated network interface. A virtual, isolated network and an IP address are assigned to the container so that the entire network stack of each container runs in isolation from the others.

All of this is done by default and there is not much that needs to be done here as part of security hardening. To visualize it better, we can first create a container.

docker run -itd --name thefirst alpine sleep infinity        

The container is created and kept running by the sleep command, which runs indefinitely. Next we can inspect the running container by its hash value to view the container's process ID on the host.

sudo docker inspect -f '{{.State.Pid}}' 7303cebc0f7a3e7cb492d8521a70f8ad3309b109c824ee60f4f8385533fb7749        

The process ID in this case comes out to be 4473. We can view the net namespace Linux created for the container by using the following command.

sudo lsns -t net        

When you run this command you can see the namespace created for this container in the resulting list. If the list is too long, you can grep for the process ID, which was 4473 in this case. The namespace links for any Docker container's process are listed under the /proc/<PID>/ns folder. All of the above steps are shown in the screenshot below.

Docker Process

As you can see in the image above, listing the process's ns folder shows all the namespaces created for this container's process. This helps us understand how the Linux operating system maintains and tracks different namespaces for different processes. Keep this in mind for the other namespaces we are going to discuss.
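Put together, and reusing the container started above (substitute the PID you get on your own machine for 4473), the sequence looks roughly like this:

# Get the container's process ID on the host
PID=$(sudo docker inspect -f '{{.State.Pid}}' thefirst)
# The network namespace created for the container
sudo lsns -t net | grep "$PID"
# Symlinks for every namespace of the container's process
sudo ls -l /proc/"$PID"/ns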

IPC Namespace

IPC or Inter-Process Communication namespaces are responsible for separating shared memory segments. Shared memory is a Linux feature that allows two processes to communicate through a common memory region. This is a very efficient and fast way of communicating compared to methods like pipes or sockets: multiple processes access the same memory location, which results in fast communication and ultimately better performance.

IPC and shared memory in itself is a huge topic which we are not going to dive into right now. But for the sake of understanding it through docker and how it is enabled by default let us take a look at some examples.

By default, every time a container is run, a separate IPC namespace is created for it. This means all running containers are isolated from one another and do not share any memory segments. From a security standpoint this is a thumbs up and there is not much to be tinkered with here.

But what if your developers come with a requirement where two or more containers need to share memory using IPC? In such cases we should be equipped with enough knowledge to make the decision that poses the minimum security risk. The main reason for sharing an IPC namespace is performance. Docker provides the following options to configure IPC:

--ipc=host
--ipc=shareable
--ipc=container:<id>

  • The first option, --ipc=host, shares the host's IPC namespace with the container, meaning the container will have access to the host's IPC objects. This is not recommended at all, since it breaks container isolation and gives the container access to the host's IPC space.

  • --ipc=shareable makes the current container's IPC namespace shareable, meaning other running containers can join this container's IPC namespace if they wish to.

  • --ipc=container:<id> joins the IPC namespace of the mentioned container. For this to work, the container whose IPC namespace you want to join must be running with the --ipc=shareable flag.

And now we know! Sharing IPC with the host should be the last resort. If IPC sharing is required, it should always be done between containers and never with the host. The following example shows how to run one container with a shareable IPC namespace and connect a second container to it.

IPC Name Space
# Run a container and make its IPC namespace shareable
docker run -itd --name thefirst --ipc=shareable alpine

# Run a second container and join the shareable IPC namespace of thefirst
docker run -itd --name thesecond --ipc=container:thefirst alpine
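To see the sharing in action, one rough check is to create a shared memory segment in the first container and list it from the second. The ipcmk and ipcs tools come from util-linux, which is not part of the base alpine image (the exact package name may vary between Alpine versions):

# Install util-linux in both containers to get ipcmk/ipcs
docker exec thefirst apk add -U util-linux
docker exec thesecond apk add -U util-linux
# Create a shared memory segment in the first container...
docker exec thefirst ipcmk -M 4096
# ...and list shared memory segments from the second; the new segment should appear
docker exec thesecond ipcs -m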

MNT Namespace

Whenever a new container is created, a new mount namespace is automatically created for it. We already saw this above when looking into the ns folder at /proc/<pid>/ns.

To look at a container's mount namespace information we can follow these steps:

docker run -itd --name mycontainer alpine
docker inspect -f '{{.State.Pid}}' mycontainer
findmnt -N 2651        

As shown in the example above, after running the container we can get its PID. With that PID we can use the findmnt command to view the mount namespace information. The same is shown in the screenshot below. One can also use the cat /proc/<procID>/mountinfo command to view the same information that we got using findmnt.

Mount Namespace Tree Details

We can also use the nsenter command to enter inside the mount namespace by using the following command.

sudo nsenter --target <PID> --mount ls /        

The above command, for example, will enter the mentioned PID's mount namespace and run the ls command on the container's root directory.
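Another quick check, assuming the PID obtained earlier, is to compare mount namespace identifiers between the container's process and your own shell:

sudo readlink /proc/<PID>/ns/mnt   # mount namespace of the container's process
readlink /proc/self/ns/mnt         # your shell's mount namespace: a different mnt:[...] value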

UTS Namespace

The Unix Timesharing System (UTS) namespace, despite what the name might suggest, is actually responsible for isolating the hostname and NIS domain name. As we have already seen, whenever a container is run a new set of namespaces is created, and UTS is one of them.

List Namespace

Whenever a new container is run it is given a random hostname. This is possible because of the UTS namespace. To see it, shell into a container and run the hostname command. Every time a new container is created, it comes up with a new hostname.

Hostname
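A small sketch of this behaviour; the --hostname flag at the end is Docker's standard way of overriding the random name:

docker run --rm alpine hostname                     # a random hostname derived from the container ID
docker run --rm alpine hostname                     # a different one on every run
docker run --rm --hostname myapp alpine hostname    # prints "myapp"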

UID Namespace

Take the following scenario into consideration. You are a non-root user on a Linux machine that has Docker installed. You are also part of the docker group, because you need to run and manage containers. In this situation there is a way you could easily escalate your privileges to the root user. This is what we look at below, along with how to mitigate it.

One of the most interesting and most misconfigured namespaces for containers is the UID namespace. An important thing to understand is that the root user inside a container is similar (though not equal) to the root user on the host. If a container has a directory mounted from the host, a user in the container will be able to manipulate the files inside that shared folder.

Privilege Escalation

As one can see in the scenario above, a file “otherside.txt” was created by the root user. As a non-root user you should not be able to read or write this file. But since Docker is installed on the host, you can spin up a container, mount the host's root directory into it, and then roam around the host's filesystem using the container's shell. A container can be started with the host's root directory mounted using the following command.

docker run -it -v /:/shared alpine        
Once done, you can navigate to the /shared folder and gain access to read and write the host's root-owned files. The same scenario is shown in the GIF above.
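As a hedged sketch of that escalation (reading the host's shadow file stands in for any root-only file such as otherside.txt):

# On the host, as an unprivileged user in the docker group
docker run -it --rm -v /:/shared alpine sh
# Inside the container (root by default), the host's files are exposed under /shared
cat /shared/etc/shadow                      # root-only on the host, readable here
touch /shared/tmp/written-from-container    # writes land directly on the host filesystem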

The way to overcome this is by using Docker user namespaces. This method can protect against such privilege escalation attacks: containers running as root are re-mapped to a less privileged user on the host. The re-mapped host user is assigned a range of subordinate IDs, which serve as UIDs 0 to 65536 inside containers, without having any privileges on the host machine.

The remapping is handled by two files, /etc/subuid and /etc/subgid, where subuid handles user IDs and subgid handles group IDs.

dockremap:165536:65536

The format of the entries defined in these files is as shown above. The dockremap user is created by default when remapping is enabled. If it is not created, one can edit the respective files and add the user with a defined range. To configure the remapping and start all Docker containers with a re-mapped user, we must first stop the Docker daemon.

sudo systemctl stop docker 
or 
sudo service docker stop        

After stopping the Docker service there are two ways in which the Docker daemon can be started with a remapped user. The first is to use the --userns-remap=default switch.

sudo dockerd --userns-remap=default        
UID Namespace Remap

The second is to edit the /etc/docker/daemon.json file. The remap user can be defined here to enable the remapping by default.

{
  "userns-remap": "default"
}
        

Save the file and restart Docker with sudo service docker restart. Now the dockremap user should be mapped by default.

After remapping the user, the container's root no longer maps to the host's root, so from within a container we are unable to perform any actions that require real root privileges on the host.

Docker Remap
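One hedged way to confirm the remapping took effect is to check which host user a container's process actually runs as, and to retry the earlier escalation:

docker run -d --rm --name remaptest alpine sleep 1d
ps -eo user,pid,cmd | grep 'sleep 1d'      # on the host the process runs as a high subordinate UID, not root
docker run -it --rm -v /:/shared alpine cat /shared/etc/shadow   # now denied: container root is unprivileged on the host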

AppArmor Profiles

Managing everything we learned above, cgroups, namespaces and Linux capabilities, individually can be a real daunting task. That is where these profiles come into the picture. In this blog post I only plan to introduce the concepts; once you know them, I leave it up to you how deep you want to go and learn about them. Maybe in a future blog post we can deep dive and learn more.

AppArmor (Application Armor) is a Linux security module that allows us to restrict a program's capabilities with AppArmor profiles. It can be used to protect Docker containers from security threats. To use it with Docker, we associate an AppArmor security profile with each container. When we start a container with a custom AppArmor profile, Docker expects to find that policy already loaded and enforced.

To check whether AppArmor is available, run docker info; it should be listed under Security Options. Next we can write an AppArmor profile.

#include <tunables/global>
profile apparmor-profile flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  file,
  network,
  capability,
  deny /tmp/** w,
  deny /etc/passwd rwklx,
}        

The deny /tmp/** w rule denies any write to the tmp directory; the ** makes the rule cover files inside subdirectories of tmp as well. The deny /etc/passwd rwklx rule prevents any access to the passwd file. Load the AppArmor policy and then run the container with the profile.

sudo apparmor_parser -r -W apparmor-profile
docker run -it --security-opt apparmor=apparmor-profile alpine sh        
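A quick hedged check from inside the confined container:

# Inside the container started above
touch /tmp/blocked    # fails: writes under /tmp are denied by the profile
cat /etc/passwd       # fails: all access to /etc/passwd is denied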

Seccomp Profile

Seccomp is another Linux feature that can be used to filter the system calls issued by a program; it acts like a firewall for system calls. We can write seccomp profiles to filter which system calls can be made from within a container, and we load such a profile on each container.

{
	"defaultAction": "SCMP_ACT_ALLOW",
	"architectures": [
		"SCMP_ARCH_X86_64",
		"SCMP_ARCH_X86",
		"SCMP_ARCH_X32"
	],
	"syscalls": [
		{
			"name": "chmod",
			"action": "SCMP_ACT_ERRNO",
			"args": []
		}

	]
}        

The above profile will block any chmod system call inside the container. Now we can run a docker container using this profile.

docker run -it --security-opt seccomp=seccomp-profile.json alpine sh        
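A hedged check inside the container; chmod is the only syscall the example profile blocks, so other operations still work:

# Inside the container started with the profile above
touch /tmp/testfile
chmod 700 /tmp/testfile    # fails with an "Operation not permitted" error because of SCMP_ACT_ERRNO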

By default Docker applies a seccomp profile that disables many system calls for a newly created container. Because of this default profile we cannot use commands such as insmod without the --privileged flag. Note that using the --privileged flag together with a seccomp profile will override the policy.

docker run -it --privileged --security-opt seccomp=seccomp-profile.json alpine sh        

With this I will end this blog post. If you have made it this far, congratulations and thank you. I do hope you learned something new today.

References

https://medium.com/nerd-for-tech/how-to-run-containers-in-the-same-pid-namespace-cd67983516be

https://docs.docker.com/engine/security/userns-remap/

https://www.baeldung.com/linux/docker-network-namespace-invisible

https://dev.to/pemcconnell/docker-networking-network-namespaces-docker-and-dns-19f1

https://docs.docker.com/engine/security/apparmor/

https://dev.to/0xog_pg/using-shared-memory-in-linux-1p62






