description: Matrix offers multiple solutions to store its media on S3; we review them and point out their drawbacks
category: operation
tags:
---
By default, Matrix Synapse stores its media on the local filesystem, which raises many issues.
It exposes your users to data loss and availability issues, but above all to scalability/sizing issues.
Especially in an era where users expect no resource limitation, and where software is not
designed to garbage collect or even track resource usage, it is really hard to plan ahead the resources you will need.
In practice, this leads to two common answers: resource overprovisioning and distributed filesystems.
The first often wastes resources, while the second is hard to manage and requires expensive hardware and network.
Thankfully, as we store blob data, we do not need the full power of a filesystem and a more lightweight API like S3 is enough.
In Matrix Synapse terminology, these solutions are referred to as *storage providers*.
In this article, we will see how we migrated from GlusterFS to Matrix's S3 storage provider with our [Garage](https://garagehq.deuxfleurs.fr/) backend.
## Internals
First, Matrix's developers distinguish between a *media provider* and a *storage provider*.
It appears that files are always stored in the *media provider*, even if a *storage provider* is registered, and there is no way
to change this behavior in the code. Unfortunately, the *media provider* can only use the filesystem.
For example when fetching a media, we can see [in the code](
https://github.com/matrix-org/synapse/blob/b996782df51eaa5dd30635a7c59c93994d3a735e/synapse/rest/media/v1/media_storage.py#L185-L198) that the filesystem is always probed first, and only then our remote backend.
We also see [in the code](
https://github.com/matrix-org/synapse/blob/b996782df51eaa5dd30635a7c59c93994d3a735e/synapse/rest/media/v1/media_storage.py#L202-L211) that the *media provider* can be seen as a local cache, and that some parts of the code may require a file to be present in the local cache.
As a conclusion, the best we can do is to keep the *media provider* as a local cache.
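For reference, here is a minimal sketch of such a registration in `homeserver.yaml`. The bucket, endpoint, and credentials below are placeholders; check the `synapse-s3-storage-provider` README for the exact option names supported by your version:

```yaml
media_storage_providers:
  - module: s3_storage_provider.S3StorageProviderBackend
    store_local: True        # push media uploaded by local users to S3
    store_remote: True       # also push media cached from remote servers
    store_synchronous: True  # wait for the upload before answering the client
    config:
      bucket: matrix-media                  # placeholder bucket name
      endpoint_url: https://s3.example.com  # placeholder S3 endpoint
      access_key_id: xxxxx
      secret_access_key: xxxxx
```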
Registering the module like that will only affect new media: `store_local: True` and `store_remote: True` mean that newly uploaded media will be pushed to our S3 target, and `store_synchronous: True` means we want to check that the upload succeeded before notifying the user. The rationale behind these store options is to let administrators handle the upload with a *pull approach* rather than with our *push approach*. In practice, with the *pull approach*, administrators regularly run a script (via cron, for example) that copies the files to the target. The extension developers provide such a script, named `s3_media_upload`. It expects the following:
- Postgres credentials and endpoint must be stored in a `database.yaml` file
- s3 credentials must be configured as per the [boto convention](https://boto3.amazonaws.com/v1/documentation/api/1.9.46/guide/configuration.html) and the endpoint can be specified on the command line
- the path to the local cache/media repository is also passed through the command line
This script needs to persist some state between runs and will therefore create a SQLite database named `cache.db` in your working directory. Do not delete it!
In practice, your database configuration may be created as follows:
```bash
cat > database.yaml <<EOF
user: xxxxx
password: xxxxx
database: xxxxxx
host: xxxxxxxx
port: 5432
EOF
```
And S3 can be configured through environment variables:
```bash
export AWS_ACCESS_KEY_ID=""
export AWS_SECRET_ACCESS_KEY=""
export AWS_DEFAULT_REGION="garage"
```
We are now ready; the remaining parameters will be passed on the command line.
## Use the tool
First we must build a list of media that we want to send to S3.
I assume the developers designed this tool with the idea that S3 is an archive target and that recent data should be kept locally.
That is why a duration is required: they only want to send old data to S3.
Here, we will select media that are at least one day (`1d`) old, but you can set one month (`1m`) to keep more media locally, or zero days (`0d`) if you want close to no local cache. For more details, check [the source code](https://github.com/matrix-org/synapse-s3-storage-provider/blob/main/scripts/s3_media_upload#L140-L185).
```bash
./s3_media_upload update-db 1d
```
Next, we filter out media that are no longer on the local filesystem, either because they were already uploaded to our S3 backend or because they are lost. [See the code](https://github.com/matrix-org/synapse-s3-storage-provider/blob/main/scripts/s3_media_upload#L188-L217).
*Please note that I deactivated the progress bar because it is buggy when run through `docker exec` inside a `screen` inside an SSH session.*
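Assuming the Synapse media store lives at `/var/lib/synapse/media` (a placeholder path) and that the script's flag to disable the progress bar is spelled `--no-progress`, this step would look roughly like:

```bash
# Subcommand name taken from the code linked above; adapt to your version.
./s3_media_upload --no-progress check-deleted /var/lib/synapse/media
```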
To use it, you must set the following environment variables:
- For AWS: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`, `ENDPOINT`, `BUCKET`
- For Postgres: `PG_USER`, `PG_PASS`, `PG_DB`, `PG_HOST`, `PG_PORT`
- For the filesystem: `MEDIA_PATH`. We also assume `s3_media_upload` is in your `PATH`.
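Putting it all together, the final upload step could then be run as follows. This is a sketch: the flag names (`--delete`, `--endpoint-url`) are taken from the script's help and may differ in your version; `--delete` removes the local copy once the upload has succeeded:

```bash
# Upload the media selected by update-db/check-deleted to the S3 bucket,
# then delete the local copies to reclaim space.
./s3_media_upload upload "$MEDIA_PATH" "$BUCKET" \
    --delete \
    --endpoint-url "$ENDPOINT"
```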
## matrix-media-repo
I presented the "native" way to handle media on Matrix Synapse, but there is also a community-maintained project named [`matrix-media-repo`](https://docs.t2bot.io/matrix-media-repo) with a slightly different goal: the author wanted a common media repository shared by multiple servers to reduce storage costs.
`matrix-media-repo` is not tied to a specific implementation: it shadows the Matrix endpoint used for media (`/_matrix/media`) and is thus compatible with any Matrix server, such as Dendrite or Conduit. Its main advantage over our solution is that it does not have this mandatory cache: it can upload to and serve directly from an S3 backend, which simplifies management.
Depending on your reverse proxy, if `matrix-media-repo` is down, users may be routed back to the original endpoint, which should no longer be used, leading to data loss and strange behaviors. [An option](https://github.com/matrix-org/synapse/blob/v1.42.0/synapse/config/server.py#L265-L269) in Synapse seems to allow deactivating its built-in media repo; it might save you some trouble if it works.
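If that option is what it looks like, deactivating Synapse's built-in media repo would be a one-liner in `homeserver.yaml` (the option name is taken from the linked source; verify it against your Synapse version):

```yaml
# Stop Synapse from serving /_matrix/media itself,
# leaving matrix-media-repo as the only media endpoint.
enable_media_repo: false
```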
## Conclusion
Using an S3 target with Matrix is not trivial. `matrix-media-repo` looks like a better solution, but in practice it also has its own drawbacks. For now, even if not optimal, our deployed solution works well, and that is what matters.