garagehq.deuxfleurs.fr/content/blog/2022-introducing-garage.md
Alex Auvolat ce75a7795d
2022-07-08 13:55:16 +02:00


+++
title="Introducing Garage, our self-hosted distributed object storage solution"
date=2022-02-01
+++

Deuxfleurs is a non-profit based in France that aims to defend and promote individual freedom and rights on the Internet. In their quest to build a decentralized, resilient self-hosting infrastructure, they have found that existing software is often ill-suited to such a particular deployment scenario. In the context of data storage, Garage was built to provide a highly available data store that exploits redundancy across different geographical locations and does its best not to be too impacted by network latency.


Hello! We are Deuxfleurs, a non-profit based in France working to promote self-hosting and small-scale hosting.

What does that mean? Well, we figured that big tech monopolies such as Google, Facebook or Amazon today hold disproportionate power and are becoming quite dangerous to us, citizens of the Internet. They know everything we are doing, saying, and even thinking, and they are not making good use of that information. The interests of these companies are those of the capitalist elite: they are most interested in making huge profits by exploiting the Earth's precious resources, producing, advertising, and selling us massive amounts of stuff we don't need. They don't truly care about the needs of the people, nor do they care that planetary destruction is under way because of them.

Big tech monopolies are in a particularly strong position to influence our behaviors, consciously or not, because we rely on them for selecting the online content we read, watch, or listen to. Advertising is omnipresent, and because they know us so well, they can subvert us into thinking that a mindless consumer society is what we truly want, whereas we most likely would choose otherwise if we had the chance to think by ourselves.

We don't want that. That's not what the Internet is for. Freedom is freedom from influence: the ability to do things by oneself, for oneself, on one's own terms. Self-hosting is both the means by which we reclaim this freedom on the Internet (by not using the services of big tech monopolies, and thus removing ourselves from their influence) and the result of applying our critical thinking and our technical abilities to build the Internet that suits us.

Self-hosting means that we don't use cloud services. Instead, we store our personal data on computers that we own, which we run at home. We build local communities to share the services that we run with non-technical people. We communicate with other groups that do the same (or, sometimes, that don't) thanks to standard protocols such as HTTP, e-mail, or Matrix, that allow a global community to exist outside of big tech monopolies.

Self-hosting is a hard problem

As I said, self-hosting means running our own hardware at home, and providing 24/7 Internet services from there. We have many reasons for doing this. One is that this is the only way we can truly control who has access to our data. Another is that it helps us be aware of the physical substrate of which the Internet is made: making the Internet run has an environmental cost that we want to evaluate and keep under control. The physical hardware also gives us a sense of community, calling to mind all of the people who could currently be connected and making use of our services, and reminding us of the purpose for which we are doing this.

If you have a home, you know that bad things can happen there too. The power grid is not infallible, and neither is your Internet connection. Fires and floods happen. And the computers we are running can themselves crash at any moment, for any number of reasons. Self-hosted solutions today are often not equipped to face such challenges and might suffer from unavailability or data loss as a consequence.

If we want to grow our communities, and attract more people who might be sympathetic to our vision of the world, we need a baseline of quality for the services we provide. Users can tolerate some flaws or imperfections in the name of defending and promoting their ideals, but if the services are catastrophic, being unavailable at critical times or losing users' precious data, the compromise is much harder to make, and people will be tempted to go back to the comfortable lifestyle bestowed by big tech companies.

Fixing availability, that is, making services reliable even when hosted at unreliable locations or on unreliable hardware, is one of the main objectives of Deuxfleurs, and in particular of the Garage project, which we are building.

Distributed systems to the rescue

Distributed systems, or distributed computing, is a set of techniques that can be applied to make computer services more reliable by making them run on several computers at once. It so happens that a few of us have studied distributed systems, which helps a lot (some of us even have PhDs!).

The following concepts of distributed computing are particularly relevant to us:

  • Crash tolerance is when a service that runs on several computers at once can continue operating normally even when one (or a small number) of the computers stops working.

  • Geo-distribution is when the computers that make up a distributed system are not all located in the same facility. Ideally, they would even be spread over different cities, so that outages affecting one region do not prevent the rest of the system from working.
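
To see how these two concepts interact, here is a minimal Python sketch. It assumes a generic majority-quorum rule (the service keeps working as long as more than half of its nodes are running), which is an illustration and not a description of how any particular system works. The city names and the `survives_zone_outage` helper are hypothetical.

```python
from collections import Counter


def majority(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def survives_zone_outage(node_zones: list[str]) -> bool:
    """True if losing any single zone still leaves a majority of nodes running.

    `node_zones` lists, for each node, the (hypothetical) zone it is located in.
    """
    per_zone = Counter(node_zones)
    total = len(node_zones)
    return all(total - lost >= majority(total) for lost in per_zone.values())


# Three nodes in one facility: a single outage stops everything.
print(survives_zone_outage(["paris", "paris", "paris"]))  # False
# Three nodes in three cities: any one city can go offline.
print(survives_zone_outage(["paris", "lyon", "rennes"]))  # True
```

Under this assumption, adding more machines in the same facility improves crash tolerance but does nothing against a site-wide outage; only spreading them over several locations does.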

We set out to apply these concepts at Deuxfleurs to build our infrastructure, in order to provide services that are replicated over several machines in several geographical locations, so that we are able to provide good availability guarantees to our users. We try to use, as much as possible, software packages that already exist and are freely available, for example the Linux operating system and the HashiCorp suite (Nomad and Consul).

Unfortunately, in the domain of distributed data storage, the available options weren't entirely satisfactory in our case, which is why we launched the development of our own solution: Garage. We will talk more in other blog posts about why Garage is better suited to us than alternative options. In this post, I will simply try to give a high-level overview of what Garage is.

What is Garage, exactly?

Garage is a distributed storage solution that automatically replicates your data on several servers. Garage takes into account the geographical location of servers, and ensures that copies of your data are stored at different locations when possible for maximal redundancy, a unique feature in the landscape of distributed storage systems.
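
The idea of location-aware replication can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Garage's actual placement algorithm: the `place_replicas` function, the default of three copies, the hash-based ordering, and the server and zone names are all hypothetical.

```python
import hashlib


def place_replicas(key: str, servers: dict[str, str], replicas: int = 3) -> list[str]:
    """Pick `replicas` servers for `key`, preferring distinct zones.

    `servers` maps a server name to the zone it is located in.
    """
    # Deterministic per-key ordering, so every node computes the same placement.
    def rank(name: str) -> str:
        return hashlib.sha256(f"{key}/{name}".encode()).hexdigest()

    ordered = sorted(servers, key=rank)

    chosen: list[str] = []
    used_zones: set[str] = set()
    # First pass: take at most one server per zone.
    for name in ordered:
        if len(chosen) < replicas and servers[name] not in used_zones:
            chosen.append(name)
            used_zones.add(servers[name])
    # Second pass: if there are fewer zones than replicas, reuse zones.
    for name in ordered:
        if len(chosen) < replicas and name not in chosen:
            chosen.append(name)
    return chosen


servers = {"a": "paris", "b": "paris", "c": "lyon", "d": "rennes"}
print(place_replicas("backups/photo.jpg", servers))
```

With three zones available, the three copies always land in three different zones; with only two zones, the sketch falls back to placing two copies in the same zone rather than storing fewer copies.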

Garage implements the Amazon S3 protocol, a de facto standard that makes it compatible with a large variety of existing software. For instance, it can be used as a storage backend for many self-hosted web applications such as NextCloud, Matrix, Mastodon, Peertube, and many others, replacing the local file system of a server with a distributed storage layer. Garage can also be used to synchronize your files or store your backups with utilities such as Rclone or Restic. Last but not least, Garage can be used to host static websites, such as the one you are currently reading, which is served directly by the Garage cluster we host at Deuxfleurs.

Garage leverages the theory of distributed systems, and in particular Conflict-free Replicated Data Types (CRDTs for short), a set of mathematical tools that help us write distributed software that runs faster by avoiding some kinds of unnecessary chit-chat between servers. In a future blog post, we will show how this allows us to significantly outperform Minio, our closest competitor (another self-hostable implementation of the S3 protocol).
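
To give a flavor of what a CRDT looks like, here is a minimal Python sketch of one classic example, a last-writer-wins register. It is a textbook illustration and makes no claim about which data types Garage uses internally; the `LwwRegister` class and its fields are hypothetical names. The key property is that `merge` gives the same result whatever the order in which servers exchange their copies, so they can accept writes independently and reconcile later without coordinating first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LwwRegister:
    """A value tagged with the timestamp at which it was written."""

    timestamp: int
    value: str

    def merge(self, other: "LwwRegister") -> "LwwRegister":
        # Keep the most recent write. Ties on the timestamp are broken by
        # comparing values, so that all servers pick the same winner.
        return max(self, other, key=lambda r: (r.timestamp, r.value))


a = LwwRegister(timestamp=1, value="draft")
b = LwwRegister(timestamp=2, value="final")

# The order in which copies are merged does not matter (commutativity),
# and merging the same copy twice changes nothing (idempotence).
print(a.merge(b) == b.merge(a))  # True
print(a.merge(a) == a)           # True
```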

On the software engineering side, we are committed to making Garage a tool that is reliable, lightweight, and easy to administer. Garage is written in the Rust programming language, which helps us ensure the stability and safety of the software, and allows us to build software that is fast and uses little memory.

Conclusion

The current version of Garage is version 0.6, which is a beta release. This means that it hasn't yet been tested by many people, and we might have overlooked some edge cases in which it would not perform as expected.

However, we are already actively using Garage at Deuxfleurs for many purposes, and it is working exceptionally well for us. We are currently using it to store backups of personal files, to store the media files that we send and receive over the Matrix network, and to host a small but increasing number of static websites. Our current deployment hosts about 200 000 files spread across 50 buckets, for a total size of slightly above 500 GB. These numbers can seem small when compared to the datasets you could expect your typical cloud provider to be handling. However, these sizes are fairly typical of the small-scale self-hosted deployments we are targeting, and our Garage cluster is in no way nearing its capacity limit.

Today, we are proudly releasing Garage's new website, with updated documentation pages. Poke around to try to understand how the software works, and try installing your own instance! Your feedback is precious to us, and we would be glad to hear back from you on our issue tracker, by e-mail, or on our Matrix channel (#garage:deuxfleurs.fr).