Quentin 2024-08-10 17:44:14 +02:00
parent fab99bab6d
commit c0e8df4741
Signed by: quentin
GPG key ID: E9602264D639FF68
2 changed files with 160 additions and 5 deletions


@@ -33,9 +33,9 @@ which is based on the C library [libvips](https://www.libvips.org/), which has
 ## Technical challenges
-To any IT service, questions of two kinds arise: functional and operational.
+For any IT service, two kinds of questions arise: functional and operational.
 The functional scope is well defined, as shown by how uniformly these services behave.
-One can if needed, build on the [Image API 3.0](https://iiif.io/api/image/3.0/#4-image-requests) from the IIF if desired.
+One can if needed build on the [Image API 3.0](https://iiif.io/api/image/3.0/#4-image-requests) from the IIF if desired.
 The operational side, however, raises non-trivial challenges, specifically when taking a *computing within limits* approach.
 Indeed, converting an image is not a negligible operation in terms of CPU and RAM consumption.
@@ -67,10 +67,15 @@ When too large a workload is sent to the service, the latter
 Once the load has been absorbed, the service recovers and returns to its normal mode.
 This *failure mode* must necessarily be very efficient, otherwise it is useless.
-One can first consider a very direct way of operating for our *failure mode*: sending an HTTP error code, such as the standard `503 service unavailable` or the non-standard `529 service overloaded`. More ambitiously, one can send a placeholder image instead, without any cache directive of course, which would give visitors a clearer visual indication and potentially break the website less.
-This placeholder image would be pre-computed at service startup for every supported format (JPEG, HEIC, etc.) and kept in RAM. The question of size remains: sending a size different from the expected one can "break" the site's rendering. Conversely, generating an image of the right size on the fly requires computation, although if it is filled with a uniform color, that computation may be trivial depending on the format. Finally, the major problem is that images are embedded in many different ways across a website, sometimes combined with filters: how can we make sure our placeholder will be correctly received and understood?
+One can first consider a very direct way of operating for our *failure mode*: sending an HTTP error code, such as the standard `503 service unavailable` or the non-standard `529 service overloaded`.
-*For the development of a first iteration, the error-code solution seems by far the best.*
+More ambitiously, one can send a placeholder image instead, without any cache directive of course, which would give visitors a clearer visual indication and potentially break the website less. This placeholder image would be pre-computed at service startup for every supported format (JPEG, HEIC, etc.) and kept in RAM.
+The question of size remains: sending a size different from the expected one can "break" the site's rendering. Conversely, generating an image of the right size on the fly requires computation, although if it is filled with a uniform color, that computation may be trivial depending on the format.
+Finally, the major problem is that images are embedded in many different ways across a website, sometimes combined with filters: how can we make sure our placeholder will be correctly received and understood?
+*For the development of a first iteration, the error-code solution seems preferable.*
 ## Queues


@@ -0,0 +1,150 @@
---
layout: post
title: Fast CI builds
date: 2024-08-10
status: published
sitemap: true
category: operation
description: Fast CI builds
---
Historically, in the good old Jenkins days,
a CI build would occur in a workspace
that was kept across builds. Your previous artifacts
could thus be reused if they did not change (for example, `make` would detect
that some files had not changed since the last build and would not recompile them).
It was also assumed that all dependencies were installed directly on the machine.
Both of these properties allowed for very fast and efficient builds: only
what changed needed to be rebuilt.
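As a rough illustration of the timestamp check that `make` relies on, here is a minimal shell sketch. The `needs_rebuild` helper and the file names are hypothetical, and real `make` walks a whole dependency graph rather than comparing a single pair of files.

```bash
#!/usr/bin/env bash
# Sketch of a make-style freshness check.
# needs_rebuild is a hypothetical helper, not something make exposes.
needs_rebuild() {
  local target="$1"; shift
  # A missing target always has to be built.
  [ -e "$target" ] || return 0
  local src
  for src in "$@"; do
    # -nt is true when $src was modified more recently than $target.
    if [ "$src" -nt "$target" ]; then
      return 0
    fi
  done
  return 1
}

workdir="$(mktemp -d)"
touch -t 202001010000 "$workdir/main.c"   # source, last edited on January 1st
touch -t 202001020000 "$workdir/main.o"   # artifact, built on January 2nd

if needs_rebuild "$workdir/main.o" "$workdir/main.c"; then first="rebuild"; else first="up to date"; fi
echo "unchanged source: $first"           # prints "unchanged source: up to date"

touch -t 202001030000 "$workdir/main.c"   # the source is edited after the build
if needs_rebuild "$workdir/main.o" "$workdir/main.c"; then second="rebuild"; else second="up to date"; fi
echo "edited source: $second"             # prints "edited source: rebuild"
```

The check is only as good as the list of sources it is given, which is exactly where the shortcomings discussed next come from.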
This approach had many shortcomings: a stale cache would break builds (or wrongly make them pass),
improper dependency tracking would make building on a new machine very hard, etc.
In the end, developers stop trusting a CI that remains broken, bugs start crippling the project unnoticed,
and finally the codebase becomes unmaintainable.
To avoid these problems, developers started to use a new generation of CI systems relying on VMs (like Travis CI)
or containers (like Drone). All builds start from a fresh environment, often a well-known distribution like Ubuntu.
Then, for each build, all the dependencies are installed and the build is done from scratch.
This approach greatly helped developers track their dependencies and make sure that building their project from scratch remains possible.
However, build times skyrocketed. You can wait more than 10 minutes before running a command that actually checks your code.
And as recommended by many people[^1][^2][^3], the whole build cycle (lint, build, test) should remain below 10 minutes to be useful.
To speed up CI, various optimizations have been explored. CI systems sometimes offer some sort of caching API, and when they do not, an object store like S3 can be used.
This cache is used either by directly copying the dependency folder[^4] (for example the `target/` folder for Rust or `node_modules/` for Node.js), or through dedicated tools like `sccache`[^5]. In this scenario, fetching and updating the cache involves a non-negligible amount of filesystem and network I/O. Another approach relies on providing your own build image, which will often be cached on workers. This image can contain your toolchain (for example Rust + Cargo + Clippy + etc.), but also your project dependencies, pre-fetched and compiled from a copy of your `Cargo.toml` or `package.json` file. This approach still involves some maintenance burden: the image must be rebuilt and published each time a dependency changes, it's project-specific, it can easily break, you still do not track your dependencies correctly, etc.
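The first strategy above, copying the dependency folder, can be sketched as follows. This is an illustration under assumptions, not what any particular CI does: a local directory stands in for the object store, the helper names are hypothetical, and `sha256sum` is assumed to be available (GNU coreutils).

```bash
#!/usr/bin/env bash
# Sketch of "copy the dependency folder" caching.
# A local directory plays the role of the object store (for example an S3 bucket).
# cache_key, save_cache and restore_cache are hypothetical helpers.

cache_key() {
  # Key the cache on the lockfile content: same lockfile, same key.
  sha256sum "$1" | cut -d' ' -f1
}

save_cache() {      # save_cache <lockfile> <folder> <store_dir>
  tar -czf "$3/$(cache_key "$1").tar.gz" -C "$(dirname "$2")" "$(basename "$2")"
}

restore_cache() {   # restore_cache <lockfile> <dest_dir> <store_dir>; fails on a miss
  local archive
  archive="$3/$(cache_key "$1").tar.gz"
  [ -f "$archive" ] || return 1
  tar -xzf "$archive" -C "$2"
}

project="$(mktemp -d)"
store="$(mktemp -d)"
echo 'demo lockfile' > "$project/Cargo.lock"
mkdir "$project/target"
echo 'artifact' > "$project/target/lib.rlib"

save_cache "$project/Cargo.lock" "$project/target" "$store"   # end of a first job
rm -rf "$project/target"                                      # a new job starts from scratch

if restore_cache "$project/Cargo.lock" "$project" "$store"; then
  echo "cache hit"
else
  echo "cache miss"
fi
```

Every job pays for the archive round trip, even when nothing changed, which is the filesystem and network cost mentioned above.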
**Can we cache without making our builds fragile?**
## Nix to the rescue
Following this short discussion, the question that surfaces is whether or not we can cache efficiently without making our builds fragile. Ideally, our project would be split into parts compiled in strict isolation, dependencies between parts would be strictly tracked, the cache would be kept locally, and a new job would only rebuild the changed components (avoiding steps like restoring the cache & co).
That's what Nix can do, at least in theory. A SaaS CI ecosystem is starting to develop around it, with solutions like [Garnix](https://garnix.io/) or [Hercules CI](https://hercules-ci.com/).
But personally, I am more interested in FOSS solutions, and the existing ones like [Hydra](https://github.com/NixOS/hydra) or [Typhon](https://typhon-ci.org/) seem more baroque. Worse, a CI system based on Docker is often already deployed in your organization (like [Woodpecker](https://woodpecker-ci.org/), [GitLab Runner](https://docs.gitlab.com/runner/), [Forgejo Actions](https://forgejo.org/docs/v1.20/user/actions/), etc.), so you don't really have a choice here: you must use what's already there.
## The Docker way
In the following, I will describe a Docker deployment that should be generic enough to be adapted to any Docker-based CI system. It's inspired by my own experience[^6] and a blog post by Kevin Cox[^7].
First, we will spawn a unique `nix-daemon` on the worker, outside of the CI system:
```bash
docker run -i \
  -v nix:/nix \
  --privileged \
  nixpkgs/nix:nixos-22.05 \
  nix-daemon
```
Then we will mount this `nix` volume as read-only in our jobs. The job will be able to access the store to run the programs it needs. It can add new things to the store by scheduling builds in the daemon through a dedicated UNIX socket. This approach is called *Multi-user Nix: trusted building*[^8].
```bash
docker run -it --rm \
  -e "NIX_REMOTE=unix:///mnt/nix/var/nix/daemon-socket/socket?root=/mnt" \
  -e "NIX_CONFIG=extra-experimental-features = nix-command flakes" \
  -v nix:/mnt/nix:ro \
  -v "$(pwd)":/workspace \
  -w /workspace \
  nixpkgs/nix:nixos-24.05 \
  nix build .#
```
Note how the Nix daemon and the interactive Nix instance run different versions. This is possible because, in the interactive instance, we did not mount the daemon's store on the default path (`/nix`) but on another one (`/mnt/nix`), and pointed Nix at it through the `NIX_REMOTE` environment variable. This point is important: it lets you decouple the lifecycle of your worker daemons from that of your projects, which drastically eases maintenance.
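Since everything hinges on the `NIX_REMOTE` value, a job can check it before building. The sketch below is only an illustration: the two helpers are hypothetical and handle just the `unix://<socket>?root=<dir>` shape used in this post, not every store URL Nix accepts.

```bash
#!/usr/bin/env bash
# Sketch: split a NIX_REMOTE value of the form unix://<socket>?root=<dir>
# and check that the daemon socket is reachable from the job.
# nix_remote_socket and nix_remote_root are hypothetical helpers.

nix_remote_socket() {
  local url="${1#unix://}"       # drop the scheme
  printf '%s\n' "${url%%\?*}"    # drop the query string, if any
}

nix_remote_root() {
  case "$1" in
    *\?root=*) printf '%s\n' "${1##*\?root=}" ;;   # keep what follows "?root="
    *)         printf '\n' ;;                      # no root parameter given
  esac
}

remote='unix:///mnt/nix/var/nix/daemon-socket/socket?root=/mnt'
socket="$(nix_remote_socket "$remote")"
root="$(nix_remote_root "$remote")"
echo "socket: $socket"   # prints "socket: /mnt/nix/var/nix/daemon-socket/socket"
echo "root:   $root"     # prints "root:   /mnt"

# -S is true only if the path exists and is a UNIX socket.
if [ -S "$socket" ]; then
  echo "daemon socket found"
else
  echo "daemon socket missing: is the nix volume mounted?"
fi
```

Run outside of a job container, the last check simply reports the socket as missing.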
## A Woodpecker integration
Basically, you want to run your nix-daemon next to your Woodpecker agent, for example in a Docker Compose file. Then, you need to pass specific parameters to your Woodpecker agent so that our volume and environment variables are automatically injected into all your builds:
```yml
version: '3.4'
services:
  nix-daemon:
    image: nixpkgs/nix:nixos-22.05
    restart: always
    command: nix-daemon
    privileged: true
    volumes:
      - "nix:/nix"
  woodpecker-runner:
    image: woodpeckerci/woodpecker-agent:v2.4.1
    restart: always
    environment:
      # -- our NixOS / CI specific env
      - WOODPECKER_BACKEND_DOCKER_VOLUMES=woodpecker_nix:/mnt/nix:ro
      - WOODPECKER_ENVIRONMENT=NIX_REMOTE:unix:///mnt/nix/var/nix/daemon-socket/socket?root=/mnt,NIX_CONFIG:extra-experimental-features = nix-command flakes
      # -- change these for each agent
      - WOODPECKER_HOSTNAME=i_forgot_to_change_my_runner_name
      - WOODPECKER_AGENT_SECRET=xxxx
      - WOODPECKER_MAX_WORKFLOWS=4
      # -- should not need change
      - WOODPECKER_SERVER=woodpecker-grpc.deuxfleurs.fr:14453
      - WOODPECKER_HEALTHCHECK=false
      - WOODPECKER_GRPC_SECURE=true
      - WOODPECKER_LOG_LEVEL=info
      - WOODPECKER_DEBUG_PRETTY=true
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
volumes:
  nix:
```
Then, in your project's `.woodpecker.yml`, you can seamlessly use Nix and enjoy efficient and quick caching:
```yml
steps:
  - name: build
    image: nixpkgs/nix:nixos-24.05
    commands:
      - nix build .#
```
## Limitations
Anyone with access to your CI will have read access to your Nix store.
People will also be able to store data in your `/nix/store`.
Finally, if I remember correctly, there are some attacks that can alter the content of a derivation (such that some content in `/nix/store` is not the product of the hashed derivation). In other words, it's mainly a single-tenant solution.
So a great evolution would be a multi-tenant system, either by improving the nix-daemon isolation, or by running one nix-daemon per project or per user/organization. Today, neither of these solutions is possible.
Another limitation is garbage collection: the nix-daemon can do some garbage collection, but none of its policies is suitable for a CI. Mainly, if you activate it, it will ditch everything, since from its point of view nothing is connected to a root path. An LRU cache policy would be a great addition. At least, you can manually trigger a garbage collection once your disk is full...
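The manual trigger can be approximated with a small script run periodically on the worker. This is a minimal sketch under assumptions: the 90% threshold, the checked path (`/`) and the `nix-daemon` container name are all hypothetical, and the actual collection is left as a comment so that the script is safe to run anywhere.

```bash
#!/usr/bin/env bash
# Sketch: collect garbage only when the disk is nearly full.
# THRESHOLD, the checked path and the container name are assumptions.

disk_usage_percent() {
  # df -P prints a stable layout; column 5 of the second line is the capacity, e.g. "42%".
  df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

should_collect() {   # should_collect <usage_percent> <threshold_percent>
  [ "$1" -ge "$2" ]
}

THRESHOLD="${THRESHOLD:-90}"
usage="$(disk_usage_percent /)"

if should_collect "$usage" "$THRESHOLD"; then
  echo "disk at ${usage}%, collecting garbage"
  # On a real worker, something like the following would run in the daemon
  # container ("nix-daemon" is a hypothetical container name):
  #   docker exec nix-daemon nix-collect-garbage
else
  echo "disk at ${usage}%, nothing to do"
fi
```

This empties the whole cache at once rather than evicting the least recently used paths, so the builds that follow start cold.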
---
[^1]: [How long should your CI take](https://graphite.dev/blog/how-long-should-ci-take). *Various industry resources suggest an ideal CI time of around 10 minutes for completing a full build, test, and analysis cycle. As Kent Beck, author of Extreme Programming, said, “A build that takes longer than ten minutes will be used much less often, missing the opportunity for feedback. A shorter build doesn't give you time to drink your coffee.”*
[^2]: [Measure and Improve Your CI Speed with Semaphore](https://semaphoreci.com/blog/2017/03/16/measure-and-improve-your-ci-speed.html). *We're convinced that having a build slower than 10 minutes is not proper continuous integration. When a build takes longer than 10 minutes, we waste too much precious time and energy waiting, or context switching back and forth. We merge rarely, making every deploy more risky. Refactoring is hard to do well.*
[^3]: [Continuous Integration Certification](https://martinfowler.com/bliki/ContinuousIntegrationCertification.html). *Finally he asks if, when the build fails, it's usually back to green within ten minutes. With that last question only a few hands remain. Those are the people who pass his certification test.*
[^4]: [Rust CI Cache](https://blog.arriven.wtf/posts/rust-ci-cache/). *We can cache the build artifacts by caching the target directory of our workspace.*
[^5]: [My ideal Rust workflow](https://fasterthanli.me/articles/my-ideal-rust-workflow). *The basic idea behind sccache, at least in the way I have it set up, it's that it's invoked instead of rustc, and takes all the inputs (including compilation flags, certain environment variables, source files, etc.) and generates a hash. Then it just uses that hash as a cache key, using in this case an S3 bucket in us-east-1 as storage.*
[^6]: I [tried writing a CI](https://git.deuxfleurs.fr/Deuxfleurs/albatros/src/commit/373c1f8d76b11a5638b2a4aa753417c67f0c2e13/hcl/nixcache/builder.hcl) on top of Nomad that would wrap a dockerized NixOS, and also deployed [a Woodpecker/Drone CI NixOS runner](https://git.deuxfleurs.fr/Deuxfleurs/nixcfg/src/commit/ca01149e165b3ad1c9549735caa658efda380cd3/cluster/prod/app/woodpecker-ci/integration/docker-compose.yml).
[^7]: [Nix Build Caching Inside Docker Containers](https://kevincox.ca/2022/01/02/nix-in-docker-caching). *I wanted to see if I could cache dependencies without uploading, downloading or copying them around for each job.*
[^8]: [Untrusted CI: Using Nix to get automatic trusted caching of untrusted builds](https://www.tweag.io/blog/2019-11-21-untrusted-ci/). *This means that untrusted contributors can upload a “build recipe” to a privileged Nix daemon which takes care of running the build as an unprivileged user in a sandboxed context, and of persisting the build output to the local Nix store afterward.*