---
layout: post
slug: matrix-synapse-s3-storage
status: published
sitemap: true
title: Storing Matrix media on a S3 backend
description: Matrix has multiple solutions to store its media on S3; we review them and point out their drawbacks
category: operation
tags:
---

By default, Matrix Synapse stores its media on the local filesystem, which raises many issues. It exposes your users to data loss and availability issues, but above all to scalability and sizing issues. Especially as we live in an era where users expect no resource limitation, and where software is not designed to garbage collect or even track resource usage, it is really hard to plan ahead the resources you will need. In practice, this leads to two observations: resource overprovisioning and distributed filesystems. The first often leads to wasted resources, while the second is hard to manage and requires expensive hardware and network. Thankfully, as we store blob data, we do not need the full power of a filesystem, and a more lightweight API like S3 is enough. In Matrix Synapse terminology, these solutions are referred to as *storage providers*. In this article, we will see how we migrated from GlusterFS to Matrix's S3 storage provider backed by our [Garage](https://garagehq.deuxfleurs.fr/) object store.

## Internals

First, Matrix's developers make a distinction between a *media provider* and a *storage provider*. It appears that files are always stored in the *media provider*, even if a *storage provider* is registered, and there is no way to change this behavior in the code. Unfortunately, the *media provider* can only use the filesystem. For example, when fetching a media, we can see [in the code](https://github.com/matrix-org/synapse/blob/b996782df51eaa5dd30635a7c59c93994d3a735e/synapse/rest/media/v1/media_storage.py#L185-L198) that the filesystem is always probed first, and only then our remote backend.
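The lookup order described above can be sketched as follows. This is a deliberately simplified illustration, not Synapse's actual Python code, and the paths are made up:

```shell
#!/bin/sh
# Simplified sketch of Synapse's media lookup order (illustrative paths).
local_cache="/var/lib/matrix-synapse/media"       # the *media provider*
media_path="local_content/aa/bb/ccddeeff0011"     # made-up media id

if [ -e "$local_cache/$media_path" ]; then
    # 1. The local filesystem (media provider) is always probed first.
    status="cache-hit"
    echo "serving $media_path from local cache"
else
    # 2. Only then are registered storage providers (e.g. S3) queried,
    #    and the fetched file is written back into the local cache.
    status="cache-miss"
    echo "fetching $media_path from the storage provider"
fi
```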
We also see [in the code](https://github.com/matrix-org/synapse/blob/b996782df51eaa5dd30635a7c59c93994d3a735e/synapse/rest/media/v1/media_storage.py#L202-L211) that the *media provider* can be referred to as the local cache, and that some parts of the code may require a file to be present in the local cache. As a conclusion, the best we can do is to keep the *media provider* as a local cache. The concept of cache is very artificial here, as there is no integrated tool for cache eviction: it is our responsibility to garbage collect the cache.

## Migration

We can easily configure the S3 synapse provider in our `homeserver.yaml`:

```yaml
media_storage_providers:
- module: s3_storage_provider.S3StorageProviderBackend
  store_local: True
  store_remote: True
  store_synchronous: True
  config:
    bucket: matrix
    region_name: garage
    endpoint_url: XXXXXXXXXXXXXX
    access_key_id: XXXXXXXXXXXXXX
    secret_access_key: XXXXXXXXXXX
```

Registering the module like this will only cover our new media: `store_local: True` and `store_remote: True` mean that newly uploaded media (local and remote, respectively) will be pushed to our S3 target, and `store_synchronous: True` means we want to check that the upload succeeded before notifying the user. The rationale for these store options is to let administrators handle the upload with a *pull approach* rather than with our *push approach*. In practice, for the *pull approach*, administrators have to run a script regularly (with a cron job, for example) to copy the files to the target. Such a script, named `s3_media_upload`, is provided by the extension developers. This script is also the only way to migrate old media (which cannot be *pushed*), so we will still have to use it.
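As noted in the previous section, evicting the local cache is our responsibility. In its crudest form, garbage collecting the cache could look like the naive sketch below, based on access time only; the `s3_media_upload` script does this properly by consulting Synapse's database first, so prefer it in production. The sketch assumes GNU coreutils and runs against a throwaway directory so it is safe to try as-is:

```shell
#!/bin/sh
# Naive cache-eviction sketch (GNU coreutils assumed), demonstrated
# against a temporary directory instead of the real media store.
demo=$(mktemp -d)
touch "$demo/recent.bin"                      # accessed just now
touch -a -d '40 days ago' "$demo/old.bin"     # backdate access time only
# Evict files whose access time is older than 30 days:
find "$demo" -type f -atime +30 -print -delete
```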
First, we need some setup to use this tool:

- postgres credentials + endpoint must be stored in a `database.yaml` file
- s3 credentials must be configured as per the [boto convention](https://boto3.amazonaws.com/v1/documentation/api/1.9.46/guide/configuration.html), and the endpoint can be specified on the command line
- the path to the local cache/media repository is also passed on the command line

This script needs to store some state between command executions, and thus will create a sqlite database in your working directory named `cache.db`. Do not delete it! In practice, your database configuration may be created as follows (adapt the values to your deployment):

```bash
cat > database.yaml <<EOF
user: synapse
password: XXXXXXXXXXX
database: synapse
host: localhost
port: 5432
EOF
```
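With this configuration in place, the pull approach boils down to running the script periodically. A crontab entry could look like the following sketch. The `update-db` and `upload` subcommands and the `--delete`/`--endpoint-url` flags come from the provider's README, but verify them against your installed version; the duration syntax, schedule, paths, and endpoint URL are placeholders to adapt, and a cron entry must fit on a single line:

```
# Every night at 04:00: mark media not accessed for a month (assumed
# duration syntax), upload them to the "matrix" bucket, and evict the
# uploaded files from the local cache (--delete).
0 4 * * * cd /etc/matrix-synapse/s3 && s3_media_upload update-db 1m && s3_media_upload upload /var/lib/matrix-synapse/media matrix --delete --endpoint-url https://s3.example.org
```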