tries to evaluate our design assumptions to the real world implementation,
similarly to what we have done [on Garage](https://garagehq.deuxfleurs.fr/blog/2022-perf/).*
<!-- more -->
---
## Methodology
Brendan Gregg, a very respected figure in the world of system performances, says that, for many reasons,
[~100% of benchmarks are wrong](https://www.brendangregg.com/Slides/Velocity2015_LinuxPerfTools.pdf).
This benchmark will be wrong too in multiple ways:
1. It will not say anything about Aerogramme performances in real world deployments
2. It will not say anything about Aerogramme performances compared to other email servers
However, I pursue a very specific goal with this benchmark: validating if the assumptions we have done
during the design phase, in term of compute and memory complexity, holds for real.
I will observe only two metrics: the CPU time used by the program (everything except idle and iowait based on the [psutil](https://pypi.org/project/psutil/) code) - for the computing complexity - and the [Resident Set Size](https://en.wikipedia.org/wiki/Resident_set_size) (data held RAM) - for the memory complexity.
<!--My baseline will be the compute and space complexity of the code that I have in mind. For example,
I know we have a "3 layers" data model: an index stored in RAM, a summary of the emails stored in K2V, a database, and the full email stored in S3, an object store.
Commands that can be solved only with the index should use a very low amount of RAM compared to . In turn, commands that require the full email will require to fetch lots of data from S3.-->
## Testing environment
I ran all the tests on my personal computer, a Dell Inspiron 7775 with an AMD Ryzen 7 1700, 16GB of RAM, an encrypted SSD, on NixOS 23.11.
The setup is made of Aerogramme (compiled in release mode) connected to a local, single node, Garage server.
Observations and graphs are done all in once thanks to the [psrecord](https://github.com/astrofrog/psrecord) tool.
I did not try to make the following values reproducible as it is more an exploration than a definitive review.
## Mailbox dataset
I will use [a dataset of 100 emails](https://git.deuxfleurs.fr/Deuxfleurs/aerogramme/src/commit/0b20d726bbc75e0dfd2ba1900ca5ea697645a8f1/tests/emails/aero100.mbox.zstd) I have made specifically for the occasion.
It contains some emails with various attachments, some emails with lots of text, emails generated by many different clients (Thunderbird, Geary, Sogo, Alps, Outlook iOS, GMail iOS, Windows Mail, Postbox, Mailbird, etc.), etc.
The mbox file weighs 23MB uncompressed.
One question that arise is: how representative of a real mailbox is this dataset? While a definitive response is not possible, I compared the email sizes of this dataset to the 2 367 emails in my personal inbox.
Below I plotted the empirical distribution for both my dataset and my personal inbox (note that the x axis is not linear but logarithimic).
*[Get the 100 emails dataset](https://git.deuxfleurs.fr/Deuxfleurs/aerogramme/src/commit/0b20d726bbc75e0dfd2ba1900ca5ea697645a8f1/tests/emails/aero100.mbox.zstd) - [Get the CSV used to plot this graph](https://git.deuxfleurs.fr/Deuxfleurs/aerogramme/src/branch/perf/cpu-ram-bottleneck/tests/emails/mailbox_email_sizes.csv)*
We see that the curves are close together and follow the same pattern: most emails are between 1kB and 100kB, and then we have a long tail (up to 20MB in my inbox, up to 6MB in the dataset).
It's not that surprising: on many places on the Internet, the limit on emails is set to 25MB. Overall I am quite satisfied by this simple dataset, even if having one or two bigger emails could make it even more representative of my real inbox...
Mailboxes with only 100 emails are not that common (mine has 2k emails...), so to emulate bigger mailboxes, I simply inject the dataset multiple times (eg. 20 times for 2k emails).
## Command dataset
Having a representative mailbox is a thing, but we also need to know what are the typical commands that are sent by IMAP clients.
As I have setup a test instance of Aerogramme (see [my FOSDEM talk](https://fosdem.org/2024/schedule/event/fosdem-2024-2642--servers-aerogramme-a-multi-region-imap-server/)),
I was able to extract 4 619 IMAP commands sent by various clients. Many of them are identical, and in the end, only 248 are truly unique.
The following bar plot depicts the command distribution per command name; top is the raw count, bottom is the unique count.
*[Get the IMAP command log](https://git.deuxfleurs.fr/Deuxfleurs/aerogramme/src/branch/perf/cpu-ram-bottleneck/tests/emails/imap_commands_dataset.log) - [Get the CSV used to plot this graph](https://git.deuxfleurs.fr/Deuxfleurs/aerogramme/src/branch/perf/cpu-ram-bottleneck/tests/emails/imap_commands_summary.csv)*
First, we can handle separately some commands: LOGIN, CAPABILITY, ENABLE, SELECT, EXAMINE, CLOSE, UNSELECT, LOGOUT as they are part of a **connection workflow**.
We do not plan on studying them directly as they will be used in all other tests.
CHECK, NOOP, IDLE, and STATUS are different approaches to detect a change in the current mailbox (or in other mailboxes in the case of STATUS),
I assimilate these commands as a **notification** mechanism.
FETCH, SEARCH and LIST are **query** commands, the first two ones for emails, the last one for mailboxes.
FETCH is from far the most used command (1187 occurencies) with the most variations (128 unique combination of parameters).
SEARCH is also used a lot (658 occurencies, 14 unique).
APPEND, STORE, EXPUNGE, MOVE, COPY, LSUB, SUBSCRIBE, CREATE, DELETE are commands to **write** things: flags, emails or mailboxes.
They are not used a lot but some writes are hidden in other commands (CLOSE, FETCH), and when mails arrive, they are delivered through a different protocol (LMTP) that does not appear here.
In the following, we will assess that APPEND behaves more or less than a LMTP delivery.
<!--
Focus on `FETCH` (128 unique commands), `SEARCH` (14 unique commands)
At this level of maturity, the main goal for Aerogramme is predictable & stable resource usage server side.
Indeed, there is nothing more annoying than a user, honest or malicious, breaking the server while running
a resource intensive command.
**Querying bodies** - They may allocate the full mailbox in RAM. For some parameters (eg. FETCH BODY),
it might be possible to precompute data, but for some others (eg. SEARCH TEXT "something") it's not possible, so precomputing
is not a definitive solution. Also being slow is acceptable here: we just want to avoid being resource intensive. The solution I envision is to "stream"
the fetched emails and process them one by one. For some commands like FETCH 1:* BODY[], that are to the best of my knowledge,
never run by real clients, it will not be enough however. Indeed, Aerogramme will drop the fetched emails, but it will have an in-memory copy inside
the response object awaiting for delivery. So we might want to implement response streaming too.
**argon2** - Login resource usage is due to argon2, but compared to many other protocols, authentications in IMAP occure way more often.
For example, if a user configure their email client to poll their mailbox every 5 minutes, this client will authenticate 288 times in a single day!
argon2 resource usage is a tradeoff with the bruteforcing difficulty, reducing its resource usage is thus a possible performance mitigation
solution at the cost of reduced security. Other options might reside in the evolution of our authentication system: even if I don't
know what is the current state of implementation of OAUTH in existing IMAP clients, it could be considered as an option.
Indeed, the user enters their password once, during the configuration of their account, and then a token is generated and stored in the client.
The credentials could be stored in the token (think a JSON Web Token for example), avoiding the expensive KDF on each connection.
**IDLE** - IDLE RAM usage is the same as other commands, but IDLE keeps the RAM allocated for way longer.
To illustrate my point, let's suppose an IMAP session uses 5MB of RAM,
and two populations of 1 000 users. We suppose users monitor only one mailbox (which is not true in many cases).
The first population, aka *the poll population*, configures their client to poll the server every 5 minutes, polling takes 2 seconds (LOGIN + SELECT + LOGOUT).
The second population, aka *the push population*, configures their client to "receive push messages" - ie. using IMAP IDLE.
For the *push population*, the RAM usage will
be a stable 5GB (5MB * 1 000 users): all users will be always connected.
With a 1MB session, we could reduce the RAM usage to 1GB: any improvement on the session base RAM
will be critical to IDLE with the current design.
For the *poll population*, we can split the time in 150 ticks (5 minutes / 2 seconds = 150 ticks).
It seems the problem can be mapped to a [Balls into bins problem](https://en.wikipedia.org/wiki/Balls_into_bins_problem)
with random allocation (yes, I know, assuming random allocation might not hold in many situations).
Based on the Wikipedia formula, if I am not wrong, we can suppose that, with high probability, at most 13 clients will be connected
at once, which means 65MB of RAM usage (5MB * 13 clients/tick = 65MB/tick).
With these *back-of-the-envelope calculations*, we understand how crucial the IDLE RAM consumption is compared
to other commands, and how the base RAM consumption of a user will impact the service.
**Large email streaming** - Finally, email streaming could really improve RAM usage for large emails, especially on APPEND and LMTP delivery,
or even on FETCH when the body is required. But implementing such a feature would require an email parser that
can work on streams, which in turns [is not something trivial](https://github.com/rust-bakery/nom/issues/1160).
While it seems untimely to act now, these spots are great candidates for a closer monitoring for future performance evaluations.
Fixing these points, above the simple mitigations, will involve important design changes in Aerogramme, which means
in the end: writing lot of code! That's why I think Aerogramme can work with these "limitations" for now,
and we will take decisions about these points when it will be *really* required.
we expect better ressource usages, and I am convinced that for many people, today the answer is still "Yes, it does".
Based on this benchmark, I identified 3 low-hanging fruits to improve performances that do not require major design changes: 1) in FETCH+SEARCH queries, handling emails one after another instead of loading the full mailbox in memory, 2) streaming FETCH responses instead of aggregating them in memory, and 3) reducing the RAM usage of a base user by tweaking its Garage connectors configuration.