Finish & publish container blogpost - monochromatic - monochromatic blog: http://blog.z3bra.org
git clone git://z3bra.org/monochromatic
---
commit ed6b12aac58c951e9924bdd1325c2fc9ea68a0b3
parent 076c73eb2cf52b5b1fdac70165a64c1566c4b053
Author: z3bra <willyatmailoodotorg>
Date: Thu, 24 Mar 2016 22:08:53 +0000
Finish & publish container blogpost
Diffstat:
M 2016/03/hand-crafted-containers.txt | 248 +++++++++++++++++++++++++++++--
M config.mk | 2 +-
M css/monochrome.css | 22 ++++------------------
M index.txt | 1 +
4 files changed, 242 insertions(+), 31 deletions(-)
---
diff --git a/2016/03/hand-crafted-containers.txt b/2016/03/hand-crafted-containers.txt
@@ -1,7 +1,16 @@
-# [Hand-made containers](#)
+# [Hand-crafted containers](#)
## — 18 March, 2016
-### 0. intro
+### tl;dr
+
+ # CTNAME=blah
+ # mkdir -p /ns/$CTNAME/bin /ns/$CTNAME/lib
+ # ldd /bin/echo | grep '/' | cut -d'>' -f2 | awk '{print $1}' | xargs -I% cp % /ns/$CTNAME/lib/
+ # cp /bin/echo /ns/$CTNAME/bin/
+ # ip netns add $CTNAME
+ # ip netns exec $CTNAME unshare -fpium --mount-proc env -i container=handcraft chroot /ns/$CTNAME /bin/echo 'Hello, world!'
+
+### 0. Intro
Containers are the latest trend, for a good reason: they leave room for new
ideas in terms of security, flexibility, performance and much more.
@@ -23,7 +32,7 @@ an application (a complex one). In this regard, there is only a single type of
containers.
We can now focus on what's really important: how do they work?
-### 1. namespaces
+### 1. Namespaces
That's a keyword, so let's ask our internet god what it means:
@@ -37,7 +46,7 @@ to a process.
When a namespace is created for a process, all its children will be created
within this namespace, and inherit the "limitations" of the parent.
-#### mount
+#### Mount
The process will be able to mount and unmount filesystems without affecting
the rest of the system. For example, if you unmount a partition within the
namespace, all the processes within it will see it as unmounted, while it
@@ -52,7 +61,7 @@ This namespace concerns shared memory, System V message queues and semaphores.
Processes in the namespace will be unable to communicate with the host's
processes this way.
-#### network
+#### Network
Processes will have their own network stack. This includes the routing table,
firewall rules, sockets, and so on.
@@ -60,16 +69,231 @@ firewall rules, sockets, and so on.
Processes' IDs will get a different mapping than they have on the host. They
will get renumbered, starting from 1.
-#### user
+#### User
The namespaces will have their own set of user and group IDs.
-### 2. making containers
+### 2. Making containers
Now that we know what containers are and how they work, it's time to make
-some!
+one!
+For the purpose of this article, we will try to build the simplest container
+capable of printing "Hello, world!".
+
+Here is the program:
+
+ $ cat <<'EOF' > hello.c
+ #include <unistd.h>
+ int
+ main(int argc, char **argv)
+ {
+ write(1, "Hello, world!\n", 14);
+ return 0;
+ }
+ EOF
+ $ cc hello.c -o hello
+
+#### 2.0 `chroot(1)`
+This one is an old tool that will run a command or spawn an interactive
+shell after changing the root directory.
+It is used to isolate a process, or a group of processes, from the host's
+filesystem tree. This has long been used for security purposes
+(see [chroot jail](https://en.wikipedia.org/wiki/Chroot)), but escaping from
+a chroot is rather easy for someone with root (UID 0) access.
+This is why `chroot` alone cannot be considered secure, but coupled with a user
+namespace and privilege dropping, one can turn a chroot into a real jail.
+
+Back to the topic. Let's copy our `hello` binary into the chroot, and try to
+run it:
+
+ $ mkdir rootfs
+ $ cp ./hello ./rootfs/hello
+ # chroot ./rootfs ./hello
+ chroot: failed to run command "./hello": No such file or directory
+
+This is the worst error message you can get. Of course `./hello` exists!
+We just copied it. But what does this error mean then? Let's take a closer
+look at this binary:
+
+ $ file ./hello
+ ./hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-x86-64.so.2, for GNU/Linux 3.12.0, not stripped
+
+The output may differ slightly depending on your system, but the important
+part here is the following:
+
+> dynamically linked, interpreter /lib/ld-linux-x86-64.so.2
+
+Dynamically linked binaries cannot run on their own. Long story short,
+`/lib/ld-linux-x86-64.so.2` is a program that is implicitly called to run all
+the dynamic binaries on a Linux system. It's called the
+[dynamic linker](https://en.wikipedia.org/wiki/Dynamic_linker). Inside the
+chroot that file does not exist, which is what the error above was really about.
+So in order to have a binary run in the chroot, you need to copy over the linker
+AND all the libraries your binary links to. To get a list of these libraries,
+use the `ldd` command:
+
+ $ ldd hello
+ linux-vdso.so.1 (0x00007ffd3e7dc000)
+ libc.so.6 => /lib/libc.so.6 (0x00007fdc1a482000)
+ /lib/ld-linux-x86-64.so.2 (0x00007fdc1a82a000)
+
+You can ignore the [`vdso`](http://man7.org/linux/man-pages/man7/vdso.7.html)
+line: the kernel maps it into every process, so there is no file to copy.
+Our `hello` binary depends on two files: `/lib/ld-linux-x86-64.so.2`, the linker,
+and `/lib/libc.so.6`, the C library (containing wrappers for system calls like `write(2)`).
+
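Extracting those paths by hand gets tedious for bigger binaries. Here is a minimal sketch of how it could be scripted, assuming the `ldd` output format shown above. The names `ldd_paths` and `copy_libs` are made up for this example, and unlike a flat copy into `lib/`, the libraries keep their original directories (handy when the linker lives in `/lib64`). Usage would look like `copy_libs ./hello ./rootfs`.

```shell
# Print the absolute paths found in `ldd` output read from stdin.
# Lines without a path (the vDSO, "not found" entries) are skipped.
ldd_paths() {
    awk '$2 == "=>" && $3 ~ /^\// { print $3; next }
         $1 ~ /^\//               { print $1 }'
}

# Copy the libraries a binary needs into a root directory,
# recreating their original paths under it.
# copy_libs is a hypothetical helper name.
copy_libs() {
    bin=$1 root=$2
    ldd "$bin" | ldd_paths | while read -r lib; do
        mkdir -p "$root$(dirname "$lib")"
        cp "$lib" "$root$lib"
    done
}
```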
+In order to run our `hello` program, we'll have to copy them into place. After
+that, our program should run totally fine:
+
+ $ mkdir -p rootfs/lib
+ $ cp /lib/ld-linux-x86-64.so.2 /lib/libc.so.6 ./rootfs/lib
+ # chroot ./rootfs ./hello
+ Hello, world!
+
+TADAAAA!! That was easy, right?
+Another option is to simply compile our program *statically*. It means that all the
+needed objects from libraries will be linked into the program, removing the need
+for a linker and libc in the chroot:
+
+ $ mkdir rootfs
+ $ cc hello.c -o hello -static -s
+ $ cp hello ./rootfs
+ # chroot ./rootfs ./hello
+ Hello, world!
+
+Let's take a look at the size of this "container". For scale, the
+"[Smallest possible docker container](https://docs.docker.com/articles/baseimages/#creating-a-simple-base-image-using-scratch)"
+weighs 3.6MiB...
+
+ $ du -sh rootfs
+ 720K rootfs
+
+That's most likely the lightest container you've seen, right?
+
+#### 2.1 `env(1)`
+To isolate our process from the host, we'll have to clear every variable from
+its environment, to make sure the container won't know anything about its
+host. We can do this with the `env` command:
+
+ $ export FOO="bar"
+ $ env -i /bin/sh
+ $ env # we are now in a subshell
+ PWD=/home/z3bra
+
+You can see that the subprocess doesn't have the `$FOO` variable in its
+environment, even though it has been exported earlier.
+You can set variables in the new environment by passing them AFTER `env -i`.
+This is useful to set the `$container` variable, which has been "standardized" as
+a way to tell processes they are running inside a container.
+
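To see both effects at once, here is a small sketch that exports `$FOO`, then starts a shell in an empty environment with only `$container` set. The function name `clean_env_demo` is made up for this example.

```shell
# Run a command in an empty environment, with only $container set.
# clean_env_demo is a hypothetical name used for this sketch.
clean_env_demo() {
    FOO="bar"
    export FOO
    # The child shell reports what it can see of both variables.
    env -i container=handcraft /bin/sh -c \
        'echo "FOO=${FOO:-unset} container=$container"'
}

clean_env_demo    # prints: FOO=unset container=handcraft
```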
+We now have a way to isolate our `hello` process from the host's environment.
+
+ # env -i container="handcraft" chroot ./rootfs ./hello
+
+#### 2.2 `unshare(1)`
+This tool is the one that will actually isolate containers. It has been created
+especially for this purpose, and will let you run a process unshared from
+different namespaces: mount, user, network, PID, IPC and UTS.
+In the command below, the flags follow that same order, and each one separates
+your `command` from the given namespace. See `unshare(1)` for more information:
+
+ unshare -m -U -n -p -i -u <command>
+
+We can actually leave out the `-n` flag, as some tools provide a better
+approach to network isolation (see `ip-netns(8)`, described later in this post).
+
+Another point worth mentioning is that if you want to isolate the process in
+its own PID namespace, you should consider using the options `--fork --mount-proc`,
+so that the process will see a "virtualized" `/proc` that will represent the
+namespace, and not the host. For example:
+
+ # unshare -p --fork --mount-proc ps -aux
+ USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
+ root 1 0.0 0.0 13012 2276 pts/2 R+ 23:57 0:00 ps -aux
+
+We just found a way to isolate our program a bit more:
+
+ # unshare -fpiumU --mount-proc env -i container="handcraft" chroot ./rootfs ./hello
+
+For the curious, you can check the `nsenter(1)` program, which will help you
+run a process within another process's namespaces.
+
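One way to check that a process really landed in its own namespaces is to look at the symbolic links the kernel exposes under `/proc/<pid>/ns`: two processes share a namespace when the links read the same. This is only a sketch, and `ns_id` and `same_ns` are names made up for this example. Running `same_ns pid 1 $$` as root on the host should succeed, while the same test against a PID started through `unshare -p` should fail.

```shell
# Print the identifier of one namespace (mnt, pid, net, ipc, uts, user)
# of a process. Defaults to the calling process.
# ns_id is a hypothetical helper name.
ns_id() {
    readlink "/proc/${2:-self}/ns/$1"
}

# Succeed if two PIDs share the namespace of the given type.
# Fails as well when one of the links cannot be read.
same_ns() {
    a=$(ns_id "$1" "$2") && b=$(ns_id "$1" "$3") && [ "$a" = "$b" ]
}
```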
+#### 2.3 `ip-netns(8)`
+
+The `ip(8)` command includes a `netns` subcommand to manage network namespaces.
+It is useful to give network access to a process while keeping it away from the
+host's network stack.
+
+You need to be familiar with the concepts of
+[bridges](https://en.wikipedia.org/wiki/Bridging_\(networking\)) and
+[virtual network interface](https://en.wikipedia.org/wiki/Virtual_network_interface)
+(veth) pairs here.
+A virtual Ethernet device pair acts like both ends of a tube: a packet
+written on one end comes out of the other. This simple concept will
+help us get Internet access *inside* the container, with the host's
+network as the uplink.
+
+The process is easy: we will create a `veth` pair, move one end inside the
+container, and bridge the other side with a physical interface.
+Let's assume your physical interface is named `eth0`. We will create a bridge
+`br0`, add `eth0` on this bridge, and request an IP for this interface:
+
+ # brctl addbr br0
+ # brctl addif br0 eth0
+ # dhcpcd br0
+
+Then, we create a network namespace, a veth pair and move one end of this
+pair inside the namespace (we will name it "handcraft"):
+
+ # ip netns add handcraft
+ # ip link add veth1 type veth peer name eth1
+ # ip link set eth1 netns handcraft
+
+Now that our namespace has an interface able to communicate with the outside
+world, we can bridge it together with `eth0` and request an IP:
+
+ # brctl addif br0 veth1
+ # ip link set veth1 up
+ # ip netns exec handcraft dhcpcd eth1
+
+We now have a namespace with its own network stack, separate from the host's,
+that can reach the outside world over Ethernet!
+You can run any command inside this namespace, and it will use the network
+stack we just created. For example:
+
+ # ip netns exec handcraft curl -s z3bra.org/slj
+
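The bridge, namespace and veth steps of this section can be collected into a single script. The sketch below assumes the interface names used in this post (`eth0`, `br0`, `veth1`, `eth1`) and uses `ip link` in place of `brctl`, which is a substitution on my side. The `run` helper and the `DRY_RUN` variable are made up for this example: when `DRY_RUN` is set, the commands are only printed, so the plan can be reviewed before running it for real as root.

```shell
# Print the command when DRY_RUN is set, run it otherwise (needs root).
# run and DRY_RUN are hypothetical helpers, not part of any tool.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# setup_netns <namespace> [physical interface] [bridge]
setup_netns() {
    ns=$1 phys=${2:-eth0} br=${3:-br0}
    run ip link add name "$br" type bridge      # brctl addbr
    run ip link set "$phys" master "$br"        # brctl addif
    run dhcpcd "$br"
    run ip netns add "$ns"
    run ip link add veth1 type veth peer name eth1
    run ip link set eth1 netns "$ns"            # one end goes inside
    run ip link set veth1 master "$br"          # the other on the bridge
    run ip link set veth1 up
    run ip netns exec "$ns" dhcpcd eth1
}

# Review the plan without touching the system:
#   DRY_RUN=1 setup_netns handcraft
```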
+We can now run our `hello` program with its own network stack (even though
+it doesn't make any sense!):
+
+ # ip netns exec handcraft unshare -fpiuUm --mount-proc env -i container="handcraft" chroot ./rootfs ./hello
+
+Don't be ashamed of such a long-ass command, because that is what `lxc`,
+`docker`, and other container applications do behind your back!
+
+### 3. Bonus: cgroups
+
+Control groups are a feature of the kernel used to limit the resources
+used by a process, or a group of processes. Cgroups can limit CPU
+shares, RAM, network usage, disk I/O, ...
+
+I will not cover their usage here, as this article is already long, but
+they are totally worth mentioning as an improvement over our containers.
+
+### 4. Congratz
+
+Containers are a truly awesome concept. They make great use of new
+technologies, and all the tools presented above allow the standard users
+to exploit them in many different ways.
+Applications like LXC and Docker recreate a full operating system,
+even though they are used to run a single process (web server, database, ...).
+
+By knowing how this works under the hood, we will be able to use the
+container technology to isolate the application in a smarter way than
+shipping it along with a full operating system.
+
+For further reading, check out these links:
-2.0 chroot
-2.1 unshare / nsenter
-2.2 ip-netns
+* [http://doger.io](http://doger.io)
+* [http://git.r-36.net/ns-tools](http://git.r-36.net/ns-tools)
+* [https://github.com/arachsys/containers](https://github.com/arachsys/containers)
+* [https://github.com/p8952/bocker](https://github.com/p8952/bocker)
-3. cgroups
+Now get out there, and make some containers!
diff --git a/config.mk b/config.mk
@@ -1,4 +1,4 @@
-MD = ./markdown
+MD = markdown
NAME = monochromatic
PREFIX = /var/www/blog.z3bra.org
diff --git a/css/monochrome.css b/css/monochrome.css
@@ -85,27 +85,13 @@ header h1 a:hover {
/* }}} */
/* Coding style (<code>) {{{ */
-code, pre {
- color: inherit;
+pre {
+ color: #eee;
font-family: monospace;
font-size: 90%;
- padding: 2px;
- background-color: #eee;
- border: 1px solid #bbb;
+ background-color: #333;
+ border: 1px solid #eee;
border-radius: 4px;
-}
-
-/*
- * code:before, code:after {
- * content: "`";
- * }
- */
-
-pre code:before, pre code:after {
- content: none;
-}
-
-pre {
padding: 10px;
overflow-x: auto;
overflow-y: hidden;
diff --git a/index.txt b/index.txt
@@ -1,3 +1,4 @@
+* 0x001b - [Hand-crafted containers](/2016/03/hand-crafted-containers.html)
* 0x001a - [Make your own distro](/2016/01/make-your-own-distro.html)
* 0x0019 - [Install Alpine at online.net](/2015/08/install-alpine-at-onlinenet.html)
* 0x0018 - [cross-compiling with PCC and musl](/2015/08/cross-compiling-with-pcc-and-musl.html)