# Hand-crafted containers

18 March, 2016

## tl;dr

        # CTNAME=blah
        # mkdir -p /ns/$CTNAME/bin /ns/$CTNAME/lib
        # ldd /bin/echo | grep '/' | cut -d'>' -f2 | awk '{print $1}' | xargs -I% cp % /ns/$CTNAME/lib/
        # cp /bin/echo /ns/$CTNAME/bin/
        # ip netns add $CTNAME
        # ip netns exec $CTNAME unshare -fpium --mount-proc env -i container=handcraft chroot /ns/$CTNAME /bin/echo 'Hello, world!'

## 0. Intro

Containers are the latest trend, for a good reason: they leave room for new
ideas in terms of security, flexibility, performance and much more.

But what is a container? It is a group of processes isolated together from the
host operating system. This isolation can happen in different places
(namespaces), be it the network, the filesystem, the process tree, or all of
them (there are more, in fact; more on this later).

We can differentiate three types of containers:

+ operating system containers
+ application containers
+ I LIED!

If we think about it, an operating system is a process, `/sbin/init`, that
spawns other subprocesses. This way, an operating system is nothing more than
an application (a complex one). In this regard, there is only a single type of
container.  
We can now focus on what's really important: how do they work?

## 1. Namespaces

That's a keyword, so let's ask our internet god what it means:

> In computing, a namespace is a set of symbols that are used to organize
> objects of various kinds, so that these objects may be referred to by name.
>
> -- sincerely, [wikipedia](https://en.wikipedia.org/wiki/Namespace)

In other words, a namespace is a way to refer to one or more isolations applied
to a process.  
When a namespace is created for a process, all its children will be created
within this namespace, and inherit the "limitations" of the parent.

### Mount
The process will be able to mount and unmount filesystems without affecting
the rest of the system. For example, if you unmount a partition within the
namespace, all the processes within it will see it as unmounted, while it
will remain mounted for all other processes on the host.

### UTS (Unix Time-Sharing)
This will give the ability to change the host and domain name in the namespace
without changing it on the host.

### IPC (Inter-Process Communication)
This namespace concerns shared memory, System V message queues and semaphores.
Processes in the namespace will be unable to communicate with the host's
processes this way.

### Network
Processes will have their own network stack. This includes the routing table,
firewall rules, sockets, and so on.

### PID (Process IDentification)
Process IDs will get a different mapping than they have on the host. They
will get renumbered, starting from 1.

### User
The namespace will have its own set of user and group IDs.

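On Linux, the namespaces a process belongs to are exposed as symbolic links
under `/proc/<pid>/ns`. As a minimal sketch (the `list_ns` helper name is made
up for this example), the following prints them for the current shell. Two
processes that share a namespace show the same identifier:

```shell
# list_ns is a hypothetical helper, not a standard tool.
# Print each namespace of process $1 (default: the current shell).
list_ns() {
    pid=${1:-$$}
    for ns in /proc/"$pid"/ns/*; do
        # Skip the unexpanded glob when /proc/<pid>/ns is not available.
        [ -L "$ns" ] || continue
        printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns")"
    done
}

list_ns
```

Each line pairs a namespace type (`mnt`, `uts`, `ipc`, `net`, `pid`, `user`, ...)
with the identifier of the namespace the process currently lives in.
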
## 2. Making containers

Now that we know what containers are and how they work, it's time to make
one!
For the purpose of this article, we will try to build the simplest container
capable of printing "Hello, world!".

Here is the program:

        $ cat <<'EOF' > hello.c
        #include <unistd.h>
        int
        main(int argc, char **argv)
        {
                write(1, "Hello, world!\n", 14);
                return 0;
        }
        EOF
        $ cc hello.c -o hello

### 2.0 `chroot(1)`
This one is an old tool that will run a command or spawn an interactive
shell after changing the root directory.
It is used to isolate a process, or group of processes, from the host's
filesystem tree. This has long been used for security purposes
(see [chroot jail](https://en.wikipedia.org/wiki/Chroot)), but escaping from
a chroot is rather easy for someone with root (UID 0) access.
This is why `chroot` alone cannot be considered secure, but coupled with a user
namespace and privilege dropping, one can turn a chroot into a real jail.

Back to the topic. Let's copy our `hello` binary into the chroot, and try to
run it:

        $ mkdir rootfs
        $ cp ./hello ./rootfs/hello
        # chroot ./rootfs ./hello
        chroot: failed to run command "./hello": No such file or directory

This is the worst error message you can get. Of course `./hello` exists!
We just copied it. But what does this error mean then? Let's take a closer
look at this binary:

        $ file ./hello
        ./hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-x86-64.so.2, for GNU/Linux 3.12.0, not stripped

The output may differ slightly depending on your system, but the important
part here is the following:

> dynamically linked, interpreter /lib/ld-linux-x86-64.so.2

Dynamically linked binaries cannot be run on their own. Long story short,
`/lib/ld-linux-x86-64.so.2` is a program that is implicitly called to run all
the dynamic binaries on a Linux system; it's called the
[dynamic linker](https://en.wikipedia.org/wiki/Dynamic_linker). The "No such
file or directory" error is about this interpreter, which is missing from the
chroot, and not about `./hello` itself. So in order to have a
binary run in the chroot, you need to copy over the linker AND all the libraries
your binary links to. To get a list of these libraries, use the `ldd` command:

        $ ldd hello
        linux-vdso.so.1 (0x00007ffd3e7dc000)
        libc.so.6 => /lib/libc.so.6 (0x00007fdc1a482000)
        /lib/ld-linux-x86-64.so.2 (0x00007fdc1a82a000)

You can ignore the [`vdso`](http://man7.org/linux/man-pages/man7/vdso.7.html)
line: it is a virtual library that the kernel maps into every process, not a
file on disk.
Our `hello` binary depends on two files: `/lib/ld-linux-x86-64.so.2`, the linker,
and `/lib/libc.so.6`, the C library (containing wrappers for system calls like
`write(2)`).

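For binaries with more dependencies, the `ldd` output can be parsed instead of
read by eye. The sketch below assumes the output format shown above;
`ldd_paths` and `copy_deps` are hypothetical helper names. Libraries are copied
to the same path under the new root, so that the interpreter path recorded in
the binary still resolves after the `chroot`:

```shell
# ldd_paths: read ldd-style output on stdin, print one absolute path per line.
# Lines without a path (such as linux-vdso) are skipped.
ldd_paths() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) { print $i; break } }'
}

# copy_deps: copy the libraries needed by binary $1 into root directory $2,
# keeping the same directory layout as on the host.
copy_deps() {
    ldd "$1" | ldd_paths | while read -r lib; do
        mkdir -p "$2$(dirname "$lib")"
        cp "$lib" "$2$lib"
    done
}

# Usage (not run here): copy_deps ./hello ./rootfs
```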
In order to run our `hello` program, we'll have to copy them into place. After
that, our program should run totally fine:

        $ mkdir -p rootfs/lib
        $ cp /lib/ld-linux-x86-64.so.2 /lib/libc.so.6 ./rootfs/lib
        # chroot ./rootfs ./hello
        Hello, world!

TADAAAA!! That was easy, right?
Another option is to simply compile our program *statically*. It means that all the
needed objects from libraries will be compiled into the program, removing the need
for a linker and libc in the chroot:

        $ mkdir -p rootfs
        $ cc hello.c -o hello -static -s
        $ cp hello ./rootfs
        # chroot ./rootfs ./hello
        Hello, world!

Let's take a look at the size of this "container". For scale, the
"[Smallest possible docker container](https://docs.docker.com/articles/baseimages/#creating-a-simple-base-image-using-scratch)"
weighs 3.6MiB...

        $ du -sh rootfs
        720K    rootfs

That's most likely the lightest container you've seen, right?

### 2.1 `env(1)`
To isolate our process from the host, we'll have to clear every variable from
the environment, to make sure the container won't know anything about its
host. We can do this with the `env` command:

        $ export FOO="bar"
        $ env -i /bin/sh
        $ env # we are now in a subshell
        PWD=/home/z3bra

You can see that the subprocess doesn't have the `$FOO` variable in its
environment, even though it has been exported earlier.
You can set variables in the new environment by passing them AFTER `env -i`.
This is useful to set the `$container` variable, which has been "standardized"
as a way to tell processes they are running inside a container.

We now have a way to isolate our `hello` process from the host's environment.

        # env -i container="handcraft" chroot ./rootfs ./hello

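To convince ourselves that nothing leaks through, here is a small check that
can be run without the chroot. It starts `/bin/sh` under `env -i` and prints
what the child sees; the `out` variable name is arbitrary:

```shell
export FOO="bar"

# Start a shell with an empty environment, except for $container.
out=$(env -i container="handcraft" /bin/sh -c \
    'echo "FOO=${FOO:-unset} container=$container"')

echo "$out"   # prints: FOO=unset container=handcraft
```
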
### 2.2 `unshare(1)`
This tool is the one that will actually isolate containers. It has been created
especially for this purpose, and will let you run a process unshared from
different namespaces: mount, user, network, PID, IPC and UTS.  
In the same order, each flag will separate your `command` from the given
namespace. See `unshare(1)` for more information:

        unshare -m -U -n -p -i -u <command>

We can actually leave out the `-n` flag, as some tools provide a better
approach to network isolation (see `ip-netns(1)`, described later in this post).

Another point worth mentioning is that if you want to isolate the process from
the PID namespace, you should consider using the options `--fork --mount-proc`,
so that the process will see a "virtualized" `/proc` that will represent the
namespace, and not the host. For example:

        # unshare -p --fork --mount-proc ps -aux
        USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
        root         1  0.0  0.0  13012  2276 pts/2    R+   23:57   0:00 ps -aux

We just found a way to isolate our program a bit more. Note the `-r` flag
(`--map-root-user`): without a root mapping in the new user namespace, the
process loses its capabilities on `exec` and `chroot` fails:

        # unshare -fpiumUr --mount-proc env -i container="handcraft" chroot ./rootfs ./hello

For the curious, you can check the `nsenter(1)` program, which will help you
run a process within another process's namespaces.

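One way to check that `unshare` really moved a process is to compare namespace
identifiers, which the kernel exposes under `/proc/<pid>/ns`. The sketch below
uses `-U -r` so that it can be tried without root, assuming the kernel allows
unprivileged user namespaces; when it does not, it simply says so:

```shell
# Identifier of the UTS namespace of the current shell.
host_uts=$(readlink /proc/$$/ns/uts)

# Same question, asked from inside new user and UTS namespaces.
if ct_uts=$(unshare -U -r -u readlink /proc/self/ns/uts 2>/dev/null); then
    echo "host:      $host_uts"
    echo "container: $ct_uts"
else
    echo "unshare failed (unprivileged user namespaces may be disabled)"
fi
```

The two identifiers differ when a new UTS namespace was created.
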
### 2.3 `ip-netns(1)`

The `ip(1)` command includes a `netns` subcommand to manage network namespaces.
It is useful to give network access to a process while keeping it away from the
host's network stack.

You need to be familiar with the concepts of
[bridges](https://en.wikipedia.org/wiki/Bridging_\(networking\)) and
[virtual network interfaces](https://en.wikipedia.org/wiki/Virtual_network_interface)
(veth) pairs here.  
A virtual ethernet device pair acts like the two ends of a tube: a packet
written to one end comes out of the other. This simple concept will
help us get internet access *inside* the container, through the network
stack of the host.

The process is easy: we will create a `veth` pair, move one end inside the
container, and bridge the other side with a physical interface.  
Let's assume your physical interface is named `eth0`. We will create a bridge
`br0`, add `eth0` to this bridge, and request an IP for the bridge:

        # brctl addbr br0
        # brctl addif br0 eth0
        # dhcpcd br0

Then, we create a network namespace, a veth pair and move one end of this
pair inside the namespace (we will name it "handcraft"):

        # ip netns add handcraft
        # ip link add veth1 type veth peer name eth1
        # ip link set eth1 netns handcraft

Now that our namespace has an interface able to communicate with the outside
world, we can bridge the other end together with `eth0` and request an IP from
inside the namespace:

        # brctl addif br0 veth1
        # ip link set veth1 up
        # ip netns exec handcraft dhcpcd eth1

We now have a network namespace isolated from the host's, that can reach the
outside world over ethernet!
You can run any command inside this namespace, and it will use the network
stack we just created. For example:

        # ip netns exec handcraft curl -s z3bra.org/slj

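The network steps above can be collected into one script. This is only a
sketch: the `run` wrapper and the `DRY_RUN` variable are conventions made up
for this example, and the interface names are the ones assumed earlier. With
`DRY_RUN=1` (the default here) the commands are printed instead of executed,
which allows reviewing them before running the script as root with `DRY_RUN=0`:

```shell
DRY_RUN=${DRY_RUN:-1}   # default to printing; set DRY_RUN=0 to execute as root

# run: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# setup_net: $1 is the namespace name, $2 the physical interface.
setup_net() {
    run brctl addbr br0
    run brctl addif br0 "$2"
    run dhcpcd br0
    run ip netns add "$1"
    run ip link add veth1 type veth peer name eth1
    run ip link set eth1 netns "$1"
    run brctl addif br0 veth1
    run ip link set veth1 up
    run ip netns exec "$1" dhcpcd eth1
}

setup_net handcraft eth0
```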
We can now run our `hello` program with its own network stack (even though
it doesn't make any sense!):

        # ip netns exec handcraft unshare -fpiuUmr --mount-proc env -i container="handcraft" chroot ./rootfs ./hello

Don't feel ashamed of such a long-ass command, because that is what `lxc`,
`docker`, and other container applications do behind your back!

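To avoid retyping it, the command can be wrapped in a function. The sketch
below uses the flags from the tl;dr; `ct_cmdline` is a made-up name, and it
only prints the command line so that it can be inspected first:

```shell
# ct_cmdline: print the command starting a container.
# $1: network namespace, $2: root directory, rest: command run inside.
ct_cmdline() {
    ns=$1
    rootfs=$2
    shift 2
    echo "ip netns exec $ns unshare -fpium --mount-proc" \
        "env -i container=handcraft chroot $rootfs $*"
}

ct_cmdline handcraft ./rootfs ./hello
```

As root, the printed line can be pasted into a shell once it looks right.
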
## 3. Bonus: cgroups

Control groups are a feature of the kernel used to limit the resources
used by a process, or a group of processes. Cgroups can limit CPU
shares, RAM, network usage, disk I/O, ...

I will not cover their usage here, as this article is already long, but
they are totally worth mentioning as an improvement over our containers.

## 4. Congratz

... for reading this far.

Containers are a truly awesome concept. They make great use of new
technologies, and all the tools presented above allow standard users
to exploit them in many different ways.  
Applications like LXC and docker both recreate a full operating system,
even though they are used to run a single process (web server, database, ...).

By knowing how this works under the hood, we will be able to use the
container technology to isolate the application in a smarter way than
shipping it along with a full operating system.

For further reading, check out these links:

* [http://doger.io](http://doger.io)
* [http://git.r-36.net/ns-tools](http://git.r-36.net/ns-tools)
* [https://github.com/arachsys/containers](https://github.com/arachsys/containers)
* [https://github.com/p8952/bocker](https://github.com/p8952/bocker)

Now get out there, and make some containers!