# WIP: Aerogramme resource usage

*Quentin, 2024-02-17*


Below I plotted the empirical distribution for both my dataset and my personal inbox:
![ECDF mailbox](ecdf_mbx.svg)
*[Get the 100 emails dataset](https://git.deuxfleurs.fr/Deuxfleurs/aerogramme/src/commit/0b20d726bbc75e0dfd2ba1900ca5ea697645a8f1/tests/emails/aero100.mbox.zstd) - [Get the CSV used to plot this graph](https://git.deuxfleurs.fr/Deuxfleurs/aerogramme/src/branch/perf/cpu-ram-bottleneck/tests/emails/mailbox_email_sizes.csv)*
We see that the curves are close together and follow the same pattern: most emails are between 1kB and 100kB, and then we have a long tail (up to 20MB in my inbox, up to 6MB in the dataset).
It's not that surprising: in many places on the Internet, the email size limit is set to 25MB. Overall I am quite satisfied with this simple dataset, even if having one or two bigger emails could make it even more representative of my real inbox...
The following bar plot depicts the command distribution per command name:
![Commands](command-run.svg)
*[Get the IMAP command log](https://git.deuxfleurs.fr/Deuxfleurs/aerogramme/src/branch/perf/cpu-ram-bottleneck/tests/emails/imap_commands_dataset.log) - [Get the CSV used to plot this graph](https://git.deuxfleurs.fr/Deuxfleurs/aerogramme/src/branch/perf/cpu-ram-bottleneck/tests/emails/imap_commands_summary.csv)*
First, we can handle some commands separately: LOGIN, CAPABILITY, ENABLE, SELECT, EXAMINE, CLOSE, UNSELECT, and LOGOUT, as they are part of a **connection workflow**.
We do not plan on studying them directly, as they will be used in all other tests.
In the following, I will keep these 3 categories: **writing**, **notification**, and **query**, to evaluate Aerogramme's resource usage
based on command patterns observed in real IMAP command logs and on the provided dataset.
---
## Write Commands
We start with the write commands, as they will let us fill the mailboxes for the following evaluations.
I inserted the full dataset (100 emails) into 16 accounts with APPEND (in other words, in the end, the server handles 1,600 emails).
*[Get the Python script](https://git.deuxfleurs.fr/Deuxfleurs/aerogramme/src/branch/main/tests/instrumentation/mbox-to-imap.py)*
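The script essentially boils down to a loop of IMAP `APPEND` commands. Below is a minimal sketch of the idea, not the actual test script: the host, port, and credentials are hypothetical placeholders.

```python
import imaplib
import mailbox

HOST, PORT = "localhost", 1143      # hypothetical test endpoint
DATASET = "aero100.mbox"            # the 100-email dataset, decompressed

# Fill 16 accounts sequentially, one APPEND per email.
for i in range(16):
    imap = imaplib.IMAP4(HOST, PORT)
    imap.login(f"user{i}", "hunter2")
    for message in mailbox.mbox(DATASET):
        # APPEND uploads the raw RFC 5322 message into the INBOX.
        imap.append("INBOX", None, None, bytes(message))
    imap.logout()
```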
### Filling a mailbox
![Append Custom Build](01-append-tokio-console-musl.png)
First, I observed this *scary* linear memory increase. It seems we are not releasing some memory,
and that's an issue! I quickly suspected tokio-console of being the culprit.
A quick search led me to an issue entitled [Continuous memory leak with console_subscriber #184](https://github.com/tokio-rs/console/issues/184)
that confirmed my intuition.
Instead of waiting for an hour or trying to tweak the retention time, I built Aerogramme without tokio console.
*So in this first approach, we observed the impact of tokio console instead of our own code! Still, we want
performance to be as predictable as possible.*
![Append Cargo Release](02-append-glibc.png)
This got us to a second pattern: a stable but high memory usage compared to the previous run.
It appears I built this binary with `cargo release`, which creates a binary that dynamically links to the GNU libc,
whereas the previous binary was built with our custom Nix toolchain, which statically links musl libc into the binary.
In the process, we changed the allocator: it seems the GNU libc allocator allocates bigger chunks at once.
*It would be wrong to conclude that the musl libc allocator is more efficient: allocating and deallocating
memory on the kernel side is costly, so it might be better for the allocator to keep some kernel-allocated memory
around for future allocations, which will then not require system calls. This is another example of why this benchmark is misleading: we observe
the memory claimed by the allocator, not the memory used by the program itself.*
For the next graph, I removed tokio-console and built Aerogramme with a static musl libc.
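(For reference, outside of our Nix toolchain, a comparable static build can be obtained with the stock musl target; this is an assumed equivalent, not the exact command used here.)

```
rustup target add x86_64-unknown-linux-musl
cargo build --release --target x86_64-unknown-linux-musl
```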
![Append Custom Build](03-append-musl.png)
The observed pattern matches what I was expecting much better.
We observe 16 spikes of memory allocation at around 50MB, each followed by a 25MB plateau.
In the end, we drop to ~18MB.
In this scenario, we can say that a user needs between 7MB and 32MB of RAM.
In the previous runs, we were doing the inserts sequentially. But in the real world, multiple users interact with the server
at the same time. In the next run, we run the same test, but in parallel.
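A sketch of the parallel variant, under the same hypothetical setup as above: the per-account APPEND loop simply moves into a thread pool, with one worker per account.

```python
import imaplib
import mailbox
from concurrent.futures import ThreadPoolExecutor

HOST, PORT = "localhost", 1143      # hypothetical test endpoint
DATASET = "aero100.mbox"

def fill_account(i: int) -> None:
    # Same sequential APPEND loop as before, for a single account.
    imap = imaplib.IMAP4(HOST, PORT)
    imap.login(f"user{i}", "hunter2")
    for message in mailbox.mbox(DATASET):
        imap.append("INBOX", None, None, bytes(message))
    imap.logout()

# All 16 clients now hit the server at the same time.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(fill_account, range(16)))
```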
![Append Parallel](04-append-parallel.png)
We see 2 spikes: a short one at the beginning, and a longer one at the end.
The first spike is probably due to the argon2 password verification, a key derivation function
that is deliberately built to be expensive in terms of RAM and CPU.
The second spike is due to the fact that the big emails (multiple MB) are at the end of the dataset,
and they are stored fully in RAM before being sent. However, our biggest email weighs 6MB,
and we are running 16 threads, so we should expect a memory usage of around 100MB,
not 400MB. This difference would be a good starting point for an investigation: we might
be copying the same email multiple times in RAM.
This first test suggests that Aerogramme is particularly sensitive to 1) login commands, due to argon2, and 2) large emails.
### Copy and Move
You might need to organize your folders, copying or moving your emails across your mailboxes.
COPY is a standard IMAP command, MOVE is an extension.
I will focus on a brutal test: copying 1k emails from the INBOX to Sent, then moving these 1k emails to Archive.
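In raw IMAP terms, the test boils down to two commands over the whole range (tags and sequence ranges are illustrative; MOVE is specified in RFC 6851):

```
a1 SELECT INBOX
a2 COPY 1:1000 Sent
a3 MOVE 1:1000 Archive
```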
Below is the graph depicting Aerogramme resource usage during this test.
![Copy and move](copy-move.png)
Memory usage remains stable and low (below 25MB), but the operations are CPU intensive (close to 100% for 40 seconds).
Both COPY and MOVE depict the same pattern: indeed, as emails are considered immutable, Aerogramme only handles pointers in both cases
and does not really copy their content.
Real-world clients would probably not send such brutal commands; they would do it progressively, either one by one or in small batches,
to keep the UI responsive.
While CPU optimizations could probably be imagined, I find this behavior satisfying, especially as memory remains stable and low.
### Setting flags
Setting flags (Seen, Deleted, Answered, NonJunk, etc.) is done through the STORE command.
Our run will be made in 3 parts: 1) putting one flag on one email, 2) putting 16 flags on one email, and 3) putting one flag on 1k emails.
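In raw IMAP, the three parts look roughly like this (tags and flag choices are illustrative; the second command would carry 16 flags in total, truncated here):

```
a1 STORE 1 +FLAGS (\Seen)
a2 STORE 1 +FLAGS (\Answered \Flagged NonJunk ...)
a3 STORE 1:1000 +FLAGS (\Seen)
```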
The result is depicted in the graph below.
![Store flags](store.png)
The first and last spikes are due respectively to the LOGIN/SELECT and CLOSE/LOGOUT commands.
In between, we have 3 CPU spikes, one for each STORE command; memory remains stable.
The last command is by far the most expensive, and indeed, it has to generate 1k events in our event log and rebuild many things in the index.
However, there is no reason for the 2nd command to be less expensive than the first one, other than the fact that it reuses some resources / cache entries
from the first request.

Interacting with the index is really efficient in terms of memory. Generating many changes
leads to high CPU usage (and possibly a lot of IO), but from our dataset we observe that most changes are made on one or two emails
and never on the whole mailbox.
Interacting with flags should not be an issue for Aerogramme in the near future.
## Notification Commands
Notification commands are expected to be run regularly in the background by clients.
They are particularly sensitive, as their load is correlated to your number of users,
independently of the number of emails they receive. I split them in 2 parts:
the intermittent ones that, like HTTP, close the connection after being run,
and the continuous ones, where the socket is kept open forever.
### The cost of a refresh
NOOP, CHECK, and STATUS are commands that trigger a refresh of the IMAP
view, and are part of the "intermittent" commands. In some ways, the SELECT and/or EXAMINE
commands could also be interpreted as notification commands: a client that is configured
to poll a mailbox every 15 minutes will not use NOOP; running EXAMINE will be enough.

In our case, all these commands are similar in the sense that they load or refresh the in-memory index
of the targeted mailbox. To illustrate my point, I ran SELECT, NOOP, CHECK, and STATUS on another mailbox in a row.
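In raw IMAP, the sequence looks like this (tags and the second mailbox name are illustrative):

```
a1 SELECT INBOX
a2 NOOP
a3 CHECK
a4 STATUS Archive (MESSAGES UNSEEN)
```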
![Refresh plot](refresh.png)
The first CPU spike is LOGIN/SELECT, the second is NOOP, the third is CHECK, and the last one is STATUS.
CPU spikes are short, and memory usage is stable.
Refresh commands should not be an issue for Aerogramme in the near future.
### Continuously connected clients
IDLE (and NOTIFY, which is currently not implemented in Aerogramme) are commands
that keep a socket open. These commands are sensitive: while with many protocols requests are one-shot
and your users spread them over time, with these commands
all your users are continuously connected.

In the graph below, we plot the resource usage of 16 users that log into the system,
select the inbox, and switch to IDLE; then, one by one, they receive an email and are notified.
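For reference, this is what the IDLE flow looks like on the wire, per RFC 2177 (the untagged EXISTS line is the server pushing the notification for a newly delivered email):

```
a1 SELECT INBOX
a2 IDLE
+ idling
...the connection stays open until something happens...
* 101 EXISTS
DONE
a2 OK IDLE terminated
```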
![Idle Parallel](05-idle-parallel.png)
