---
layout: post
title: Fast CI builds
date: 2024-08-10
status: published
sitemap: true
category: operation
description: Fast CI builds
---
Historically, in the good old Jenkins days,
a CI build would occur in a workspace
that was kept across builds. Your previous artifacts
could thus be re-used if they did not change (for example, `make` would detect
that some files did not change since the last build and thus would not recompile them).
It was also assumed that all dependencies were directly installed on the machine.
Both of these properties allowed for very fast and efficient builds: only
what changed needed to be rebuilt.

This approach had many shortcomings: a stale cache would break builds (or wrongly make them work),
improper dependency tracking would make building on a new machine very hard, etc.
In the end, developers stop trusting a CI that remains broken, bugs start crippling the project without being noticed,
and finally the codebase becomes unmaintainable.

To avoid these problems, developers started to use a new generation of CI relying on VMs (like Travis CI)
or containers (like Drone). All builds start from a fresh environment, often a well-known distribution like Ubuntu.
Then, for each build, all the dependencies are installed and the build is done from scratch.
Such an approach greatly helped developers better track their dependencies and make sure that building their project from scratch remains possible.

However, build times skyrocketed. You can wait more than 10 minutes before running a command that actually checks your code.
And as recommended by many people[^1][^2][^3], the whole build cycle (lint, build, test) should remain below 10 minutes to be useful.

To speed up the CI, various optimizations have been explored. CI systems sometimes offer some sort of caching API, and when they do not, an object store like S3 can be used.
This cache is used either by directly copying the dependency folder[^4] (for example the `target/` folder for Rust or the `node_modules/` folder for Node.js), or through dedicated tools like `sccache`[^5]. In this scenario, fetching and updating the cache involves a non-negligible amount of filesystem and network I/O.

Another approach relies on providing your own build image, which will often be cached on workers. This image can contain your toolchain (for example Rust + Cargo + Clippy + etc.), but also your project dependencies (by copying your `Cargo.toml` or `package.json` file and pre-fetching/compiling them). This approach still involves some maintenance burden: the image must be rebuilt and published each time a dependency changes, it is project-specific, it can easily break, you still do not track your dependencies correctly, etc.
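To make the "copy the dependency folder" strategy concrete, here is a minimal sketch of what such a cache step boils down to. It is an illustration and not taken from any specific CI: `CACHE_DIR` is a hypothetical local stand-in for the CI caching API or an S3 bucket, and keying the archive on the hash of `Cargo.lock` is an assumption.

```bash
#!/usr/bin/env bash
# Sketch of "cache by copying the dependency folder".
# Assumptions: CACHE_DIR stands in for the CI cache API or an S3 bucket,
# and the cache key is the hash of Cargo.lock. Both are illustrative.
set -euo pipefail

CACHE_DIR="${CACHE_DIR:-/tmp/ci-cache}"

cache_key() {
  # One archive per lockfile content: any dependency change yields a new key.
  sha256sum Cargo.lock | cut -d' ' -f1
}

restore_cache() {
  local archive
  archive="$CACHE_DIR/$(cache_key).tar.gz"
  if [ -f "$archive" ]; then
    # The whole folder is copied back at the start of every job.
    tar -xzf "$archive"
  fi
}

save_cache() {
  mkdir -p "$CACHE_DIR"
  # The whole folder is archived again at the end of every job.
  tar -czf "$CACHE_DIR/$(cache_key).tar.gz" target
}
```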
**Can we cache without making our builds fragile?**
## Nix to the rescue
Following this short discussion, the question that surfaces is whether or not we can cache efficiently without making our builds fragile. Ideally, our project would be split into parts compiled in strict isolation, dependencies between parts would be strictly tracked, the cache would be kept locally, and a new job would only focus on rebuilding the changed components (avoiding steps like restoring the cache & co).

That's what Nix can do, at least in theory. A SaaS CI ecosystem is starting to develop around it, with solutions like [Garnix](https://garnix.io/) or [Hercules CI](https://hercules-ci.com/).
But personally, I am more interested in FOSS solutions, and there, existing solutions like [Hydra](https://github.com/NixOS/hydra) or [Typhon](https://typhon-ci.org/) seem more baroque. Worse, a CI system based on Docker is often already deployed in your organization (like [Woodpecker](https://woodpecker-ci.org/), [Gitlab Runner](https://docs.gitlab.com/runner/), [Forgejo Actions](https://forgejo.org/docs/v1.20/user/actions/), etc.), so you do not really have a choice here: you must use what's already there.
## The Docker way
In the following, I will describe a Docker deployment that should be generic enough to be adapted to any Docker-based CI system. It's inspired by my own experience[^6] and a blog post by Kevin Cox[^7].

First, we will spawn a single `nix-daemon` on the worker, outside of the CI system:
```bash
# Long-lived daemon: it owns the `nix` volume that holds the store.
docker run -i \
  -v nix:/nix \
  --privileged \
  nixpkgs/nix:nixos-22.05 \
  nix-daemon
```
Then we will mount this `nix` volume as read-only in our jobs. The job will be able to access the store to run the programs it needs. It can add new things to the store by scheduling builds in the daemon through a dedicated UNIX socket. This approach is called *Multi-user Nix: trusted building*[^8].
```bash
# One-off job: the store is mounted read-only, builds go through the daemon socket.
docker run -it --rm \
  -e "NIX_REMOTE=unix:///mnt/nix/var/nix/daemon-socket/socket?root=/mnt" \
  -e "NIX_CONFIG=extra-experimental-features = nix-command flakes" \
  -v nix:/mnt/nix:ro \
  -v "$(pwd)":/workspace \
  -w /workspace \
  nixpkgs/nix:nixos-24.05 \
  nix build .#
```
Note how the Nix daemon and the interactive Nix instance run different versions. This is possible because, in the interactive instance, we did not mount the daemon store on the default path (`/nix`) but on another one (`/mnt/nix`), and instructed Nix to use it through the `NIX_REMOTE` environment variable. This point is important as it lets you decouple the lifecycle of your worker daemons from that of your projects, which drastically eases maintenance.
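The `NIX_REMOTE` value packs two pieces of information: the path of the daemon socket and a `root` parameter. As far as I understand it, `root` tells the client that the store lives under `/mnt` rather than `/`. The sketch below only splits the value and checks that the socket is visible from inside a job. It is a debugging aid built on that reading of the URL, not something Nix requires.

```bash
#!/usr/bin/env bash
# Sketch: split NIX_REMOTE into its two parts and check the socket is visible.
set -euo pipefail

NIX_REMOTE="${NIX_REMOTE:-unix:///mnt/nix/var/nix/daemon-socket/socket?root=/mnt}"

# Strip the "unix://" scheme, then everything from "?" onward.
socket="${NIX_REMOTE#unix://}"
socket="${socket%%\?*}"
# Keep what follows "root=".
root="${NIX_REMOTE##*root=}"

echo "daemon socket: $socket"
echo "store root:    $root (store expected under $root/nix/store)"

if [ ! -S "$socket" ]; then
  echo "warning: $socket is not a UNIX socket, is the nix volume mounted?" >&2
fi
```

If the warning shows up in a job, the volume is most likely not mounted where `NIX_REMOTE` expects it.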
## A woodpecker integration
Basically, you want to run your nix-daemon next to your Woodpecker agent, for example in a docker-compose file. Then, you need to pass specific parameters to your Woodpecker agent so that our volume and environment variables are automatically injected into all your builds:
```yml
version: '3.4'
services:
  nix-daemon:
    image: nixpkgs/nix:nixos-22.05
    restart: always
    command: nix-daemon
    privileged: true
    volumes:
      - "nix:/nix"

  woodpecker-runner:
    image: woodpeckerci/woodpecker-agent:v2.4.1
    restart: always
    environment:
      # -- our NixOS / CI specific env
      - WOODPECKER_BACKEND_DOCKER_VOLUMES=woodpecker_nix:/mnt/nix:ro
      - WOODPECKER_ENVIRONMENT=NIX_REMOTE:unix:///mnt/nix/var/nix/daemon-socket/socket?root=/mnt,NIX_CONFIG:extra-experimental-features = nix-command flakes
      # -- change these for each agent
      - WOODPECKER_HOSTNAME=i_forgot_to_change_my_runner_name
      - WOODPECKER_AGENT_SECRET=xxxx
      # -- should not need change
      - WOODPECKER_SERVER=woodpecker.example:1111
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"

volumes:
  nix:
```
Note that the volume is named `woodpecker_nix` and not `nix` in the Woodpecker agent configuration (`WOODPECKER_BACKEND_DOCKER_VOLUMES` environment declaration). That's because our `docker-compose.yml` is in a `woodpecker` folder, and docker compose prefixes the volumes it creates with the name of the deployment, which defaults to the parent folder name. The prefix is not needed elsewhere in the file, as compose resolves these names dynamically. But the `WOODPECKER_BACKEND_DOCKER_VOLUMES` declaration is not interpreted by compose: it is used later by Woodpecker when it talks directly to the Docker API.
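If your compose file lives in another folder, the prefix changes accordingly. The helper below is a rough sketch of the naming rule described above. `compose_volume_name` and the `/srv/woodpecker` path are made up for the example, and compose's full normalization of project names is not reproduced (only lowercasing and the `COMPOSE_PROJECT_NAME` override).

```bash
#!/usr/bin/env bash
# Sketch: guess the name compose gives to a named volume.
# Assumption: the project name is the lowercased parent folder name, unless
# COMPOSE_PROJECT_NAME overrides it. Other normalization rules are ignored.
set -euo pipefail

compose_volume_name() {
  local compose_dir="$1" volume="$2" project
  project="${COMPOSE_PROJECT_NAME:-$(basename "$compose_dir")}"
  project="$(printf '%s' "$project" | tr '[:upper:]' '[:lower:]')"
  printf '%s_%s\n' "$project" "$volume"
}

# Prints "woodpecker_nix" when COMPOSE_PROJECT_NAME is unset.
compose_volume_name /srv/woodpecker nix

# On the worker, `docker volume ls` shows the name that was really created.
```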
Then, in your project's `.woodpecker.yml`, you can seamlessly use Nix and enjoy efficient and quick caching:
```yml
steps:
  - name: build
    image: nixpkgs/nix:nixos-24.05
    commands:
      - nix build .#
```
## Limitations
Anyone with access to your CI will have read access to your Nix store.
People will also be able to store data in your `/nix/store`.
Finally, if I remember correctly, there are some attacks that alter the content of a derivation (such that some content in `/nix/store` is not the product of the hashed derivation). In other words, it's mainly a single-tenant solution.

A great evolution would thus be a multi-tenant system, either by improving the nix-daemon isolation, or by running one nix-daemon per project or per user/organization. Today, none of these solutions is possible.

Another limitation is garbage collection: while the nix-daemon can do some garbage collection, none of its policies is suited to a CI. Mainly, if you activate it, it will ditch everything, as from its point of view nothing is connected to a root path. An LRU cache policy would be a great addition. At least, you can manually trigger a garbage collection once your disk is full...

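The manual trigger can be wrapped in a small script run periodically on the worker. This is only a sketch: `STORE_PATH`, `THRESHOLD` and the container name `nix-daemon` are placeholders to adapt to your deployment (compose usually prefixes container names too), and as explained above, the collection empties the whole cache.

```bash
#!/usr/bin/env bash
# Sketch: trigger a garbage collection when the disk is nearly full.
# Assumptions: STORE_PATH, THRESHOLD and the container name "nix-daemon"
# are placeholders to adapt to your worker.
set -euo pipefail

STORE_PATH="${STORE_PATH:-/var/lib/docker}"
THRESHOLD="${THRESHOLD:-90}"
GC_CMD="${GC_CMD:-docker exec nix-daemon nix-collect-garbage}"

disk_usage_percent() {
  # POSIX df output: the 5th column of the 2nd line is the usage, e.g. "42%".
  df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

maybe_collect_garbage() {
  local usage="$1"
  if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "disk at ${usage}%, collecting garbage"
    $GC_CMD
  else
    echo "disk at ${usage}%, nothing to do"
  fi
}

# Typical call, for example from a cron job on the worker:
#   maybe_collect_garbage "$(disk_usage_percent "$STORE_PATH")"
```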
---
[^1]: [How long should your CI take](https://graphite.dev/blog/how-long-should-ci-take). *Various industry resources suggest an ideal CI time of around 10 minutes for completing a full build, test, and analysis cycle. As Kent Beck, author of Extreme Programming, said, “A build that takes longer than ten minutes will be used much less often, missing the opportunity for feedback. A shorter build doesn't give you time to drink your coffee.”*
[^2]: [Measure and Improve Your CI Speed with Semaphore](https://semaphoreci.com/blog/2017/03/16/measure-and-improve-your-ci-speed.html). *We're convinced that having a build slower than 10 minutes is not proper continuous integration. When a build takes longer than 10 minutes, we waste too much precious time and energy waiting, or context switching back and forth. We merge rarely, making every deploy more risky. Refactoring is hard to do well.*
[^3]: [Continuous Integration Certification](https://martinfowler.com/bliki/ContinuousIntegrationCertification.html). *Finally he asks if, when the build fails, it's usually back to green within ten minutes. With that last question only a few hands remain. Those are the people who pass his certification test.*
[^4]: [Rust CI Cache](https://blog.arriven.wtf/posts/rust-ci-cache/). *We can cache the build artifacts by caching the target directory of our workspace.*
[^5]: [My ideal Rust workflow](https://fasterthanli.me/articles/my-ideal-rust-workflow). *The basic idea behind sccache, at least in the way I have it set up, it's that it's invoked instead of rustc, and takes all the inputs (including compilation flags, certain environment variables, source files, etc.) and generates a hash. Then it just uses that hash as a cache key, using in this case an S3 bucket in us-east-1 as storage.*
[^6]: I [tried writing a CI](https://git.deuxfleurs.fr/Deuxfleurs/albatros/src/commit/373c1f8d76b11a5638b2a4aa753417c67f0c2e13/hcl/nixcache/builder.hcl) on top of Nomad that would wrap a dockerized NixOS, and also deployed [a Woodpecker/Drone CI NixOS runner](https://git.deuxfleurs.fr/Deuxfleurs/nixcfg/src/commit/ca01149e165b3ad1c9549735caa658efda380cd3/cluster/prod/app/woodpecker-ci/integration/docker-compose.yml).
[^7]: [Nix Build Caching Inside Docker Containers](https://kevincox.ca/2022/01/02/nix-in-docker-caching). *I wanted to see if I could cache dependencies without uploading, downloading or copying them around for each job.*
[^8]: [Untrusted CI: Using Nix to get automatic trusted caching of untrusted builds](https://www.tweag.io/blog/2019-11-21-untrusted-ci/). *This means that untrusted contributors can upload a “build recipe” to a privileged Nix daemon which takes care of running the build as an unprivileged user in a sandboxed context, and of persisting the build output to the local Nix store afterward.*