In the end, code is the best place for documentation, and despite the fact that reading it is time-consuming and hard to understand, sometimes it can make you happier.
This post is about the unrelated things I have found (so far) in the docker code.
The meaning of life
Given the name of this blog, this one did not go unnoticed to my eyes.
I enjoyed this wink, and the whole file, really. It is worth reading in full: every name is annotated, and this quick reference could spark some interest in one of them.
begin:
	name := fmt.Sprintf("%s_%s", left[rand.Intn(len(left))], right[rand.Intn(len(right))])

	if name == "boring_wozniak" /* Steve Wozniak is not boring */ {
		goto begin
	}
I have to say I agree.
At least they are not as hard as the ones in the Linux code.
Namespaces provide a process, or a group of processes, with the illusion of being the only process or group of processes in the system.
They are a way to detach processes from a specific kernel layer by assigning them to a new one; in other words, they are indirection layers for global resources.
Let's picture this as an extension of the classic chroot() syscall. When a new root is set by calling chroot(), the kernel isolates the new filesystem branch from the existing one, effectively creating a new namespace for the process.
Namespaces now provide the basis for a complete lightweight virtualization system, in the form of containers.
Currently, Linux supports the following namespaces:
NameSpace     from Kernel    Description
UTS           2.6.19         domain and hostname
IPC           2.6.19         queues, semaphores, and shared memory
PID           2.6.24         process IDs
NS (mount)    2.4.19         filesystems and mount points
NET           2.6.24         IP, routes, network devices…
USER          3.8            UID, GID, …
I will leave the explanation of each individual namespace as simple as this; in other words, for another day.
Namespaces API
In the Linux kernel there is no distinction between the process and thread implementations: threads are just lightweight processes. Threads are also created by calling clone(), but with different arguments (mainly CLONE_VM). From the kernel's point of view, a process/thread is a task.
Namespaces can be nested; the limit for nesting is 32.
Namespaces can be created or modified through the clone(), unshare() and setns() system calls. None of them are POSIX system calls, so they are only available on Linux.
The clone() system call is more specific than fork() or vfork(). They are alike in that all of them are implemented by calling the do_fork() function, just with different flags and arguments.
The copy_process() function invoked from do_fork() calls copy_namespaces(). If no CLONE_NEW* namespace flags are present, it simply reuses the parent's namespaces (the fork() and vfork() behavior).
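To make this tangible, here is a minimal user-space sketch (mine, not docker's) that uses clone() to start a child in its own UTS namespace. Creating a namespace requires root (CAP_SYS_ADMIN), so run it with sudo:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

static int child(void *arg) {
    /* We are in a new UTS namespace: this hostname change
       is invisible to the parent namespace. */
    sethostname("inside", strlen("inside"));
    char h[64];
    gethostname(h, sizeof(h));
    printf("child hostname:  %s\n", h);
    return 0;
}

int main(void) {
    char *stack = malloc(STACK_SIZE);
    if (!stack) { perror("malloc"); return 1; }
    /* CLONE_NEWUTS requests the new namespace; SIGCHLD makes the
       child get reaped like a regular fork()ed process. */
    pid_t pid = clone(child, stack + STACK_SIZE, CLONE_NEWUTS | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); return 1; }
    waitpid(pid, NULL, 0);
    char h[64];
    gethostname(h, sizeof(h));
    printf("parent hostname: %s\n", h);   /* unchanged */
    free(stack);
    return 0;
}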
The unshare() system call works the other way around. Its CLONE_NEW* flags have the same effect as they have in clone(); all the other flags unshare() accepts have the reverse effect, undoing the sharing that was set up at clone() time (as seen in the create_new_namespaces() example above).
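As a sketch of the same idea from user space (assuming Linux and root privileges): detach from the shared mount namespace and spawn a shell, the in-process equivalent of the unshare -m command used later in this post:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Stop sharing the mount namespace with the parent: mounts
       performed from now on stay private to this process tree. */
    if (unshare(CLONE_NEWNS) == -1) {
        perror("unshare");
        return 1;
    }
    execl("/bin/bash", "bash", (char *)NULL);
    perror("execl");
    return 1;
}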
When a task ends, every namespace it belonged to that has no other process attached is cleaned up: mounts are unmounted, network interfaces are destroyed, and so on.
setns(), in turn, joins an existing namespace: its fd argument specifies the namespace to join, and is translated to an ns_common struct by calling get_proc_ns(file_inode(file)). We will cover these fds in the next point.
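Here is a minimal sketch of the user-space side (again mine, not kernel code): open one of the /proc/$pid/ns files described next and hand the fd to setns(), which is roughly what nsenter does:

#define _GNU_SOURCE
#include <sched.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s /proc/<pid>/ns/net\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);  /* the fd handed to setns() */
    if (fd == -1) { perror("open"); return 1; }
    /* 0 = let the kernel verify the namespace type; passing
       CLONE_NEWNET instead would require a network namespace. */
    if (setns(fd, 0) == -1) { perror("setns"); return 1; }
    close(fd);
    execl("/bin/bash", "bash", (char *)NULL);  /* shell inside the joined ns */
    perror("execl");
    return 1;
}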
Per-process namespaces can be found under /proc/$pid/ns.
$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 ubuntu ubuntu 0 Jan  5 21:12 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 ubuntu ubuntu 0 Jan  5 21:12 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 ubuntu ubuntu 0 Jan  5 21:12 net -> net:[4026531962]
lrwxrwxrwx 1 ubuntu ubuntu 0 Jan  5 21:12 pid -> pid:[4026531836]
lrwxrwxrwx 1 ubuntu ubuntu 0 Jan  5 21:12 user -> user:[4026531837]
lrwxrwxrwx 1 ubuntu ubuntu 0 Jan  5 21:12 uts -> uts:[4026531838]
Each process namespace has an inode number that corresponds to a namespace struct. If two tasks share the same number, they belong to the same namespace. Note that the inode number shown in the link target is not the same one that stat -c %i prints for the ns file itself: these files are symlinks.
For namespaces, as for sockets or pipes, the inode number is shown in the form type:[inode].
# for pid in 643 23681 32178 ; do readlink /proc/$pid/ns/mnt ; done
mnt:[4026531840]
mnt:[4026532430]
mnt:[4026532430]
# for pid in 643 23681 32178 ; do md5sum /proc/$pid/mounts ; done
8dddf7d919672a56849bb487840b94e0  /proc/643/mounts
70159c37e8c8f16c61ceaa047ad9528a  /proc/23681/mounts
70159c37e8c8f16c61ceaa047ad9528a  /proc/32178/mounts
# unshare -m /bin/bash
# readlink /proc/$$/ns/mnt
mnt:[4026532493]
# for pid in 643 23681 32178; do stat -c %i /proc/$pid/ns/mnt ; done
1308868
1308731
1308686
The proc namespace files are implemented through the proc_ns_operations struct.
There are two commands available that correspond to these system calls (like most shell commands, they map closely onto a system call): unshare for unshare() and nsenter for setns().
Libcontainer is now the default docker execution environment. It is both a driver (named native) and a library.
In other words, it is a replacement (since version 0.9) for the former LXC execution environment (which can easily be brought back using the -e switch).
This library is developed by docker.io and written in Go and C/C++ in order to support a wider range of isolation technologies. It can also be used from Python through its Python bindings.
It is meant to be a cross-system abstraction layer, an attempt to standardize the way apps are packed up, delivered, and run in isolation.
Libcontainer, as a standalone project, makes it possible for other players to adopt it. Google, Parallels (OpenVZ), Red Hat and Ubuntu (LXC) are also contributing to the project.
This way, the container features available in the Linux kernel API are provided as a single library in a consistent way. Libcontainer addresses the problem of having a single kernel API with several competing implementations (LXC, libvirt, lmctfy, …).
Libcontainer enables containers to work with Linux namespaces, control groups, capabilities, AppArmor, security profiles, network interfaces and firewalling rules in a consistent and predictable way.
Currently, docker can support these kernel features out-of-the-box, since it no longer depends on LXC.
It introduces a new container specification, as we can see here.
It includes namespaces, standard filesystem setup, a default Linux capability set, and information about resource reservations. It also has information about any populated environment settings for the processes running inside a container.
Let's look at a little example of how it replaced LXC in the docker code.
nsinit and nsenter are the user-land tools that are now needed in place of the lxc-* user-land tools.
nsinit is part of the libcontainer package and is meant to load a JSON config file. It replaces lxc-start.
nsenter replaces lxc-attach; it is not part of the libcontainer package, but of the util-linux package.
Formerly, when we ran docker run …, it called lxc-start (by default) to run the container.
func (d *driver) Run(c *execdriver.Command, pipes *execdriver.Pipes, startCallback execdriver.StartCallback) (execdriver.ExitStatus, error) {
	// take the Command and populate the libcontainer.Config from it
	container, err := d.createContainer(c)
	…
Other systems, like FreeBSD, implement ‘containers’ using jails. It is an older concept, and some people note that it may not be as sophisticated as the Linux one.
Here is an example of an execution driver in development for FreeBSD jails. Well, it is an execution driver for FreeBSD, and therefore does not use libcontainer.
Unfortunately, the mentions of FreeBSD (and other flavours) in the docker & libcontainer code are only about the Go language itself.
A full list of OS-level virtualization implementations can be found on Wikipedia.
Hubot is your company’s robot. Install him in your company to dramatically improve and reduce employee efficiency.
In other words, hubot is a chat bot crafted by the GitHub team that can run custom-written scripts. This allows us to automate all kinds of tasks: merging branches, deploying releases, running monitoring queries, reporting (it listens on a port), etc. It has a lot of adapters (the places it reads from and writes to), even a shell one, so we can enjoy its company almost everywhere.
FROM dockerfile/nodejs
MAINTAINER Marvin
WORKDIR /root
RUN npm install -g yo generator-hubot
RUN useradd -ms /bin/bash marvin
ENV HOME /home/marvin
# variables needed by hubot scripts
ADD env-vars.sh /home/marvin/.profile
RUN chown marvin /home/marvin/.profile
USER marvin
WORKDIR /home/marvin
RUN echo n | yo hubot --defaults
RUN npm install hubot-slack hubot-scripts githubot --save
# enable plugins
RUN echo '["github-merge.coffee"]' > hubot-scripts.json
CMD ["/home/marvin/bin/hubot", "--name", "marvin"]
EXPOSE 8080
Build an image with docker build -t cname/hubot ., or just run the commands by hand to test it. Node.js setup instructions can be found in Joyent's Node.js repo.
Let me introduce you to Marvin, my hubot instance, named after the assistant robot from The Hitchhiker's Guide to the Galaxy.
_____________________________
/ \
//\ | Extracting input for |
////\ _____ | self-replication process |
//////\ /_____\ \ /
======= |[^_/\_]| /----------------------------
| | _|___@@__|__
+===+/ /// \_\
| |_\ /// HUBOT/\\
|___/\// / \\
\ / +---+
\____/ | |
| //| +===+
\// |xx|
marvin> marvin: the rules
1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
2. A robot must obey any orders given to it by human beings, except where such orders would conflict with the First Law.
3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
Janky is a GitHub project that implements CI on top of Jenkins and Hubot. The term ChatOps is used to refer to chat-based deployments and monitoring.
Github integration
One of the things I found Hubot interesting for is easing the merge and deploy process.
In this article I am just going to use merging as the example, but the concepts shown here could easily be applied to integrating it with Jenkins or any other tool.
We can find some git-related scripts under the scripts folder of the hubot-scripts GitHub project.
This time I have tested the github-merge.coffee plugin. It uses the githubot project for GitHub API access.
We need to set some variables for this to work. They are made available to Hubot as environment variables (typically HUBOT_GITHUB_TOKEN for githubot), so you can set them in the env-vars.sh script mentioned in the Dockerfile.
Enabling a plugin is as easy as adding it to hubot-scripts.json or external-scripts.json.
["github-merge.coffee"]
Once configured, we can check whether the plugin was loaded successfully and works properly.
12345678
marvin> marvin help
[...]
marvin merge project_name/<head> into <base> - merges the selected branches or SHA commits
[...]
marvin> marvin merge nevermind/PXTRM-test-switch-bug into develop
marvin> Merge PXTRM-test-switch-bug into develop
marvin> marvin merge nevermind/PXTRM-merge-conflict-on-purpose into develop
marvin> [Fri Jan 02 2015 17:03:46 GMT+0000 (UTC)] ERROR 409 Merge conflict
It seems to work, and since it uses the GitHub merge API, we are not messing up any branches.
Slack integration
I have chosen Slack as the adapter since it is the chat tool we use at the office; Hubot's default one is Campfire.
If you don't already have one, an organization needs to be created on Slack. You can do this during the sign-up process.
When this is done, just configure the integrations and look for Hubot there. You will get a token, and you can even set a custom avatar for your shiny bot.
Make this token available as an environment variable (HUBOT_SLACK_TOKEN), and… that's it.
Capabilities were created as an alternative to the classical two-level privilege system of root versus regular users. They split the root account's power into distinct privileges, so the Linux kernel can allow a process to perform certain typically-root tasks without granting it full root privileges.
A process or a file can be granted a given capability, and each capability is independent of the others.
For instance, a user process with just the CAP_NET_BIND_SERVICE capability can bind to ports below 1024; however, it cannot kill other processes or use chroot().
The full list of Linux kernel capabilities can be found in the man pages (capabilities(7)) as well as in the code.
Instead of checking the effective UID of the user, modern kernels check for capabilities: a privileged operation is allowed if the corresponding capability bit is set in the task's effective set. These credentials travel with the task; for example, every open file keeps a reference to the credentials of the task that opened it, capability sets included, in the f_cred field of struct file:
struct file {
	union {
		struct llist_node	fu_llist;
		struct rcu_head		fu_rcuhead;
	} f_u;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;

	/*
	 * Protects f_ep_links, f_flags.
	 * Must not be taken from IRQ context.
	 */
	spinlock_t		f_lock;
	atomic_long_t		f_count;
	unsigned int		f_flags;
	fmode_t			f_mode;
	struct mutex		f_pos_lock;
	loff_t			f_pos;
	struct fown_struct	f_owner;
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;

	u64			f_version;
#ifdef CONFIG_SECURITY
	void			*f_security;
#endif
	/* needed for tty driver, and maybe others */
	void			*private_data;

#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct list_head	f_ep_links;
	struct list_head	f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping;
} __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
This way we can, for example, ping a host on the Internet with just the CAP_NET_RAW capability.
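To see what a process actually holds, here is a minimal sketch using the libcap library (an assumption on my side: the libcap headers are installed and you link with -lcap; this is not docker code):

#include <stdio.h>
#include <sys/capability.h>   /* libcap; link with -lcap */

int main(void) {
    cap_t caps = cap_get_proc();            /* capability sets of this process */
    if (caps == NULL) { perror("cap_get_proc"); return 1; }

    char *text = cap_to_text(caps, NULL);   /* e.g. "= cap_net_raw+ep" */
    printf("current capabilities: %s\n", text);

    cap_flag_value_t v;
    cap_get_flag(caps, CAP_NET_RAW, CAP_EFFECTIVE, &v);
    printf("CAP_NET_RAW effective: %s\n", v == CAP_SET ? "yes" : "no");

    cap_free(text);
    cap_free(caps);
    return 0;
}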
The following is an example of setting a capability from the command-line interface.
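A typical invocation with the libcap tools would look like this (the path to the ping binary is just illustrative; it varies per distribution):

$ sudo setcap cap_net_raw+ep /bin/ping
$ getcap /bin/ping
/bin/ping = cap_net_raw+ep

With this, ping works for regular users even without the setuid bit set.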