z3bra.org/1/scm/monochromatic/commit/ed6b12aac58c951e9924bdd1325c2fc9ea68a0b3.gph

       Finish & publish container blogpost - monochromatic - monochromatic blog: http://blog.z3bra.org
       git clone git://z3bra.org/monochromatic
       ---
       commit ed6b12aac58c951e9924bdd1325c2fc9ea68a0b3
       parent 076c73eb2cf52b5b1fdac70165a64c1566c4b053
       Author: z3bra <willyatmailoodotorg>
       Date:   Thu, 24 Mar 2016 22:08:53 +0000
       
       Finish & publish container blogpost
       
       Diffstat:
         M 2016/03/hand-crafted-containers.txt |     248 +++++++++++++++++++++++++++++--
         M config.mk                           |       2 +-
         M css/monochrome.css                  |      22 ++++------------------
         M index.txt                           |       1 +
       
       4 files changed, 242 insertions(+), 31 deletions(-)
       ---
       diff --git a/2016/03/hand-crafted-containers.txt b/2016/03/hand-crafted-containers.txt
       @@ -1,7 +1,16 @@
       -# [Hand-made containers](#)
       +# [Hand-crafted containers](#)
        ## &mdash; 18 March, 2016
        
       -### 0. intro
       +### tl;dr
       +
       +        # CTNAME=blah
       +        # mkdir -p /ns/$CTNAME/bin /ns/$CTNAME/lib
       +        # ldd /bin/echo | grep '/' | cut -d'>' -f2 | awk '{print $1}' | xargs -I% cp % /ns/$CTNAME/lib/
       +        # cp /bin/echo /ns/$CTNAME/bin/
       +        # ip netns add $CTNAME
       +        # ip netns exec $CTNAME unshare -fpium --mount-proc env -i container=handcraft chroot /ns/$CTNAME /bin/echo 'Hello, world!'
       +
       +### 0. Intro
        
        Containers are the latest trend, for a good reason: they leave room for new
        ideas in terms of security, flexibility, performance and much more.
       @@ -23,7 +32,7 @@ an application (a complex one). In this regard, there is only a single type of
        containers.  
        We can now focus on what's really important, how do they work?
        
       -### 1. namespaces
       +### 1. Namespaces
        
        That's a keyword, so let's ask our internet god what it means:
        
       @@ -37,7 +46,7 @@ to a process.
        When a namespace is created for a process, all its children will be created
        within this namespace, and inherit the "limitations" of the parent.
        
       -#### mount
       +#### Mount
        The process will be able to mount and unmount filesystems without affecting
        the rest of the system. For example, if you unmount a partition within the
        namespace, all the processes within it will see it as unmounted, while it
       @@ -52,7 +61,7 @@ This namespace concern shared memory, System V message queues and sempaphores.
        Processes in the namespace will be unable to communicate with the host's
        processes this way.
        
       -#### network
       +#### Network
        Processes will have their own network stack. This includes the routing table,
        firewall rules, sockets, and so on.
        
       @@ -60,16 +69,231 @@ firewall rules, sockets, and so on.
        Processes' IDs will get a different mapping than they have on the host. They
        will get renumbered, starting from 1.
        
       -#### user
       +#### User
        The namespaces will have their own set of user and group IDs.
        
       -### 2. making containers
       +### 2. Making containers
        
        Now that we know what containers are and how they work, it's time to make
       -some!
       +one!
       +For the purpose of this article, we will try to build the simplest container
       +capable of printing "Hello, world!".
       +
       +Here is the program:
       +
       +        $ cat <<EOF > hello.c
       +        #include <unistd.h>
       +        int
       +        main(int argc, char **argv)
       +        {
       +                write(1, "Hello, world!\n", 14);
       +                return 0;
       +        }
       +        EOF
       +        $ cc hello.c -o hello
       +
       +#### 2.0 `chroot(1)`
       +This one is an old tool that will run a command or spawn an interactive
       +shell after changing the root directory.
       +It is used to isolate a process, or group of processes from the host's
       +filesystem tree. This has long been used for security purposes
       +(see [chroot jail](https://en.wikipedia.org/wiki/Chroot)), but escaping from
       +chroot is rather easy for someone with root (UID 0) access.
       +This is why `chroot` alone cannot be considered secure, but coupled with user
       +namespaces and privilege dropping, one can turn a chroot into a real jail.
       +
       +Back to the topic. Let's copy our `hello` binary into the chroot, and try to
       +run it:
       +
       +        $ mkdir rootfs
       +        $ cp ./hello ./rootfs/hello
       +        # chroot ./rootfs ./hello
       +        chroot: failed to run command "./hello": No such file or directory
       +
       +This is the worst error message you can get. Of course `./hello` exists!
       +We just copied it. But what does this error mean then? Let's take a closer
       +look at this binary:
       +
       +        $ file ./hello
       +        ./hello: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-x86-64.so.2, for GNU/Linux 3.12.0, not stripped
       +
       +The output may differ slightly depending on your system, but the important
       +part here is the following: 
       +
       +> dynamically linked, interpreter /lib/ld-linux-x86-64.so.2
       +
       +Dynamically linked binaries cannot be run on their own. Long story short,
       +`/lib/ld-linux-x86-64.so.2` is a program that is implicitly called to run all
       +the dynamic binaries on a Linux system. It's called the
       +[dynamic linker](https://en.wikipedia.org/wiki/Dynamic_linker), and the
       +"No such file or directory" error above is about this missing interpreter, not
       +about `./hello` itself. So in order to have a
       +binary run in the chroot, you need to copy over the linker AND all the libraries
       +your binary links to. To get a list of these libraries, use the `ldd` command:
       +
       +        $ ldd hello
       +        linux-vdso.so.1 (0x00007ffd3e7dc000)
       +        libc.so.6 => /lib/libc.so.6 (0x00007fdc1a482000)
       +        /lib/ld-linux-x86-64.so.2 (0x00007fdc1a82a000)
       +
       +You can ignore the [`vdso`](http://man7.org/linux/man-pages/man7/vdso.7.html)
       +line: it is a virtual library that the kernel maps into every process, not a file on disk.
       +Our `hello` binary depends on two files: `/lib/ld-linux-x86-64.so.2`, the linker,
       +and `/lib/libc.so.6`, the C library (containing wrappers for system calls like `write(2)`).
       +
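The library paths can also be pulled out of the `ldd` output mechanically. The helper below is a minimal sketch, not part of the original post: the name `list_libs` is hypothetical, and it assumes the `ldd` output format shown above (an absolute path appears either as the first field or right after `=>`).

```shell
# list_libs: read `ldd` output on stdin and print one absolute
# library path per line. Lines without an absolute path (such as
# the vdso line) are skipped.
# NOTE: `list_libs` is a hypothetical helper name for this sketch.
list_libs() {
    awk '{
        # print the first field that looks like an absolute path
        for (i = 1; i <= NF; i++)
            if ($i ~ /^\//) { print $i; break }
    }'
}
```

It could then be used as `ldd ./hello | list_libs | xargs -I% cp % ./rootfs/lib/`, assuming every library is expected under `/lib` inside the chroot, as is the case in this example.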
       +In order to run our `hello` program, we'll have to copy them into the chroot. After
       +that, our program should run totally fine:
       +
       +        $ mkdir -p rootfs/lib
       +        $ cp /lib/ld-linux-x86-64.so.2 /lib/libc.so.6 ./rootfs/lib
       +        # chroot ./rootfs ./hello
       +        Hello, world!
       +
       +TADAAAA!! That was easy, right?
       +Another option is to simply compile our program *statically*. It means that all the
       +needed objects from libraries will be compiled into the program, removing the need
       +for a linker and libc in the chroot:
       +
       +        $ mkdir -p rootfs
       +        $ cc hello.c -o hello -static -s
       +        $ cp hello ./rootfs
       +        # chroot ./rootfs ./hello
       +        Hello, world!
       +
       +Let's take a look at the size of this "container". For scale, the
       +"[Smallest possible docker container](https://docs.docker.com/articles/baseimages/#creating-a-simple-base-image-using-scratch)"
       +weighs 3.6MiB...
       +
       +        $ du -sh rootfs
       +        720K    rootfs
       +
       +That's most likely the lightest container you've seen, right?
       +
       +#### 2.1 env
       +To isolate our process from the host, we'll have to clear every variable
       +from its environment, to make sure the container won't know anything about its
       +host. We can do this with the `env` command:
       +
       +        $ export FOO="bar"
       +        $ env -i /bin/sh
       +        $ env # we are now in a subshell
       +        PWD=/home/z3bra
       +
       +You can see that the subprocess doesn't have the `$FOO` variable in its
       +environment, even though it has been exported earlier.
       +You can set the environment by passing variables AFTER the `env -i` command.
       +This is useful to set the `$container` variable, which has been "standardized" as
       +a way to tell processes they are running inside a container.
       +
       +We now have a way to isolate our `hello` process from the host's environment. 
       +
       +        # env -i container="handcraft" chroot ./rootfs ./hello
       +
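The effect of `env -i` can be checked without root or a chroot. This is a small sketch under the assumption that `/bin/sh` exists on the host; `run_clean` is a hypothetical wrapper name used only here.

```shell
# run_clean: run a command with an empty environment, except for
# the `container` marker variable.
# NOTE: `run_clean` is a hypothetical helper name for this sketch.
run_clean() {
    env -i container="handcraft" "$@"
}

# FOO is exported on the "host" side...
FOO="bar"; export FOO

# ...but the child only sees what was passed after `env -i`.
run_clean /bin/sh -c 'echo "container=$container FOO=${FOO:-unset}"'
# prints: container=handcraft FOO=unset
```

The same wrapper can sit in front of `chroot ./rootfs ./hello`, which is what the command above does by hand.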
       +#### 2.2 `unshare(1)`
       +This tool is the one that will actually isolate containers. It has been created
       +especially for this purpose, and will let you run a process unshared from
       +different namespaces: mount, user, network, PID, IPC and UTS.  
       +The flags below are listed in that same order, each one detaching your `command`
       +from the matching namespace. See `unshare(1)` for more information:
       +
       +        unshare -m -U -n -p -i -u <command>
       +
       +We can actually leave out the `-n` flag, as some tools provide a better
       +approach to network isolation (see `ip-netns(1)`, described later in this post).
       +
       +Another point worth mentioning is that if you want to isolate the process from
       +the PID namespace, you should consider using the options `--fork --mount-proc`,
       +so that the process will see a "virtualized" `/proc` that will represent the
       +namespace, and not the host. For example:
       +
       +        # unshare -p --fork --mount-proc ps -faux
       +        USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
       +        root         1  0.0  0.0  13012  2276 pts/2    R+   23:57   0:00 ps -faux
       +
       +We just found a way to isolate our program a bit more:
       +
       +        # unshare -fpiumU --mount-proc env -i container="handcraft" chroot ./rootfs ./hello
       +
       +For the curious, you can check the `nsenter(1)` program, which will help you
       +run a process within the namespaces of another process.
       +
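One way to confirm that a process really ended up in its own namespace is to compare the symlinks under `/proc/<pid>/ns/`, which identify the namespaces a process belongs to. The following is a minimal sketch assuming a Linux host with `/proc` mounted; `ns_id` and `same_ns` are hypothetical helper names.

```shell
# ns_id PID TYPE: print the namespace identifier of a process,
# in the form "pid:[<inode>]". TYPE is one of the entries found in
# /proc/PID/ns (mnt, user, net, pid, ipc, uts).
# NOTE: both function names are hypothetical, for illustration only.
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns PID1 PID2 TYPE: succeed if both processes share the
# namespace of the given type, return 2 if either lookup fails.
same_ns() {
    a=$(ns_id "$1" "$3") || return 2
    b=$(ns_id "$2" "$3") || return 2
    [ "$a" = "$b" ]
}
```

For instance, running `same_ns $$ 1 pid` as root from a shell started with `unshare -p --fork --mount-proc` should succeed, since that shell is PID 1 of its own namespace, while comparing against a host process from the outside should fail.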
       +#### 2.3 `ip-netns(1)`
       +
       +The `ip(1)` command includes a `netns` subcommand to manage network namespaces.
       +It is useful to give network access to a process while keeping it away from the
       +host's network stack.
       +
       +You need to be familiar with the concept of
       +[bridges](https://en.wikipedia.org/wiki/Bridging_\(networking\)), and 
       +[virtual network interfaces](https://en.wikipedia.org/wiki/Virtual_network_interface)
       +(veth) pairs here.  
       +A virtual Ethernet device pair acts like both ends of a tube: a packet
       +written on one end comes out on the other. This simple concept will
       +help us get internet access *inside* the container, through the host's
       +physical interface.
       +
       +The process is easy: we will create a `veth` pair, move one end inside the
       +container, and bridge the other side with a physical interface.  
       +Let's assume your physical interface is named `eth0`. We will create a bridge
       +`br0`, add `eth0` on this bridge, and request an IP for this interface:
       +
       +        # brctl addbr br0
       +        # brctl addif br0 eth0
       +        # dhcpcd br0
       +
       +Then, we create a network namespace, a veth pair and move one end of this
       +pair inside the namespace (we will name it "handcraft"):
       +
       +        # ip netns add handcraft
       +        # ip link add veth1 type veth peer name eth1
       +        # ip link set eth1 netns handcraft
       +
       +Now that our namespace has an interface able to communicate with the outside
       +world, we can bridge it together with `eth0` and request an IP:
       +
       +        # brctl addif br0 veth1
       +        # ip link set veth1 up
       +        # ip netns exec handcraft dhcpcd eth1
       +
       +We now have a network namespace isolated from the host's network stack, that
       +can still reach the outside world over ethernet!
       +You can run any command inside this namespace, and they will use the network
       +stack we just created. For example:
       +
       +        # ip netns exec handcraft curl -s z3bra.org/slj
       +
       +We can now run our `hello` program with its own network stack (even though
       +it doesn't make any sense!):
       +
       +        # ip netns exec handcraft unshare -fpiuUm --mount-proc env -i container="handcraft" chroot ./rootfs ./hello
       +
       +Don't feel ashamed by such a long-ass command, because that is what `lxc`,
       +`docker`, and other container applications do behind your back!
       +
       +### 3. Bonus: cgroups
       +
       +Control groups are a feature of the kernel used to limit the resources
       +used by a process, or a group of processes. Cgroups can limit CPU 
       +shares, RAM, network usage, disk I/O, ...
       +
       +I will not cover their usage here, as this article is already long, but
       +they are totally worth mentioning as an improvement over our containers.
       +
       +### 4. Congratz
       +
       +Containers are a truly awesome concept. They make great use of new
       +technologies, and all the tools presented above allow the standard users
       +to exploit them in many different ways.  
       +Applications like LXC and docker both recreate a full operating system,
       +even though they are used to run a single process (web server, database, ...).
       +
       +By knowing how this works under the hood, we will be able to use the
       +container technology to isolate the application in a smarter way than
       +shipping it along with a full operating system.
       +
       +For further reading, check out these links:
        
       -2.0 chroot
       -2.1 unshare / nsenter
       -2.2 ip-netns
       +* [http://doger.io](http://doger.io)
       +* [http://git.r-36.net/ns-tools](http://git.r-36.net/ns-tools)
       +* [https://github.com/arachsys/containers](https://github.com/arachsys/containers)
       +* [https://github.com/p8952/bocker](https://github.com/p8952/bocker)
        
       -3. cgroups
       +Now get out there, and make some containers!
       diff --git a/config.mk b/config.mk
       @@ -1,4 +1,4 @@
       -MD      = ./markdown
       +MD      = markdown
        
        NAME    = monochromatic
        PREFIX  = /var/www/blog.z3bra.org
       diff --git a/css/monochrome.css b/css/monochrome.css
       @@ -85,27 +85,13 @@ header h1 a:hover {
        /* }}} */
        
        /* Coding style (<code>) {{{ */
       -code, pre {
       -    color: inherit;
       +pre {
       +    color: #eee;
            font-family: monospace;
            font-size: 90%;
       -    padding: 2px;
       -    background-color: #eee;
       -    border: 1px solid #bbb;
       +    background-color: #333;
       +    border: 1px solid #eee;
            border-radius: 4px;
       -}
       -
       -/*
       - * code:before, code:after {
       - *     content: "`";
       - * }
       - */
       -
       -pre code:before, pre code:after {
       -    content: none;
       -}
       -
       -pre {
            padding: 10px;
            overflow-x: auto;
            overflow-y: hidden;
       diff --git a/index.txt b/index.txt
       @@ -1,3 +1,4 @@
       +* 0x001b - [Hand-crafted containers](/2016/03/hand-crafted-containers.html)
        * 0x001a - [Make your own distro](/2016/01/make-your-own-distro.html)
        * 0x0019 - [Install Alpine at online.net](/2015/08/install-alpine-at-onlinenet.html)
        * 0x0018 - [cross-compiling with PCC and musl](/2015/08/cross-compiling-with-pcc-and-musl.html)