Merge pull request 'Proofreading after-the-fact' (#10) from proofread-ipfs-article into master

Reviewed-on: Deuxfleurs/garagehq.deuxfleurs.fr#10
Alex 2022-07-08 13:58:20 +02:00
commit e95289c483
4 changed files with 80 additions and 81 deletions

@@ -4,16 +4,16 @@ date=2022-02-02
+++ +++
*FOSDEM is an international meeting about Free Software, organized from Brussels. *FOSDEM is an international meeting about Free Software, organized from Brussels.
On next Sunday, Febuary 6th, 2022, we will be there to present Garage.* On next Sunday, February 6th, 2022, we will be there to present Garage.*
<!-- more --> <!-- more -->
--- ---
In 2000, a belgian free software activist going by the name of Raphael Baudin In 2000, a Belgian free software activist going by the name of Raphael Baudin
set out to create a small event for free software developpers in Brussels. set out to create a small event for free software developers in Brussels.
This event quickly became the "Free and Open Source Developers' European Meeting", This event quickly became the "Free and Open Source Developers' European Meeting",
shorthand FOSDEM. 22 years later, FOSDEM is a major event for free software developpers shorthand FOSDEM. 22 years later, FOSDEM is a major event for free software developers
around the world. And for this year, we have the immense pleasure of announcing around the world. And for this year, we have the immense pleasure of announcing
that the Deuxfleurs association will be there to present Garage. that the Deuxfleurs association will be there to present Garage.
@@ -23,18 +23,18 @@ in the last few years. Nothing too unfamiliar to us, as the organization is usin
the same tools as we are: a combination of Jitsi and Matrix. the same tools as we are: a combination of Jitsi and Matrix.
We are of course extremely honored that our presentation was accepted. We are of course extremely honored that our presentation was accepted.
If technical details are your thing, we invite you to come share this event with us. If technical details are your thing, we invite you to come and share this event with us.
In all cases, the event will be recorded and available as a VOD (Video On Demand) In all cases, the event will be recorded and available as a VOD (Video On Demand)
afterwards. Concerning the details of the organization: afterward. Concerning the details of the organization:
**When?** On Sunday, Febuary 6th, 2022, from 10:30 AM to 11:00 AM CET. **When?** On Sunday, February 6th, 2022, from 10:30 AM to 11:00 AM CET.
**What for?** Introducing the Garage storage platform. **What for?** Introducing the Garage storage platform.
**By whom?** The presentation will be made by Alex, **By whom?** The presentation will be made by Alex,
other developpers will be present to answer questions. other developers will be present to answer questions.
**For who?** The presentation is targetted to a technical audience that is knowledgable in software developpement or systems administration. **For who?** The presentation is targeted to a technical audience that is knowledgeable in software development or systems administration.
**Price:** FOSDEM'22 is an entirely free event. **Price:** FOSDEM'22 is an entirely free event.
@@ -46,7 +46,7 @@ afterwards. Concerning the details of the organization:
And if you are not so much of a technical person, but you're dreaming of And if you are not so much of a technical person, but you're dreaming of
a more ethical and emancipatory digital world, a more ethical and emancipatory digital world,
keep in tune with news comming from the Deuxfleurs association keep in tune with news coming from the Deuxfleurs association
as we will likely have other events very soon! as we will likely have other events very soon!

@@ -6,7 +6,7 @@ date=2022-02-01
*Deuxfleurs is a non-profit based in France that aims to defend and promote *Deuxfleurs is a non-profit based in France that aims to defend and promote
individual freedom and rights on the Internet. In their quest to build a individual freedom and rights on the Internet. In their quest to build a
decentralized, resilient self-hosting infrastructure, they have found that decentralized, resilient self-hosting infrastructure, they have found that
currently existing software is often ill suited to such a particular deployment currently, existing software is often ill-suited to such a particular deployment
scenario. In the context of data storage, Garage was built to provide a highly scenario. In the context of data storage, Garage was built to provide a highly
available data store that exploits redundancy over different geographical available data store that exploits redundancy over different geographical
locations, and does its best to not be too impacted by network latencies.* locations, and does its best to not be too impacted by network latencies.*
@@ -23,8 +23,8 @@ Facebook or Amazon today hold disproportionate power and are becoming quite
dangerous to us, citizens of the Internet. They know everything we are doing, dangerous to us, citizens of the Internet. They know everything we are doing,
saying, and even thinking, and they are not making good use of that saying, and even thinking, and they are not making good use of that
information. The interests of these companies are those of the capitalist information. The interests of these companies are those of the capitalist
elite: they are mostly interested in making huge profits by exploiting the elite: they are most interested in making huge profits by exploiting the
Earth's precious resources, producing, advertising and selling us massive Earth's precious resources, producing, advertising, and selling us massive
amounts of stuff we don't need. They don't truly care about the needs of the amounts of stuff we don't need. They don't truly care about the needs of the
people, nor do they care that planetary destruction is under way because of people, nor do they care that planetary destruction is under way because of
them. them.
@@ -56,17 +56,17 @@ As I said, self-hosting means running our own hardware at home, and providing
24/7 Internet services from there. We have many reasons for doing this. One is 24/7 Internet services from there. We have many reasons for doing this. One is
because this is the only way we can truly control who has access to our data. because this is the only way we can truly control who has access to our data.
Another one is that it helps us be aware of the physical substrate of which the Another one is that it helps us be aware of the physical substrate of which the
Internet is made: making the Internet run has an environmental cost which we Internet is made: making the Internet run has an environmental cost that we
want to evaluate and keep under control. The physical hardware also gives us a want to evaluate and keep under control. The physical hardware also gives us a
sense of community, calling to mind all of the people that could currently be sense of community, calling to mind all of the people that could currently be
connected and making use of our services, and reminding us of the purpose for connected and making use of our services, and reminding us of the purpose for
which we are doing this. which we are doing this.
If you have a home, you know that bad things can happen there too. The power If you have a home, you know that bad things can happen there too. The power
grid is not infallible, neither is your Internet connection. Fires and floods grid is not infallible, and neither is your Internet connection. Fires and floods
happen. And the computers we are running can themselves crash at any moment, happen. And the computers we are running can themselves crash at any moment,
for any number of reasons. Self-hosted solutions today are often not equipped for any number of reasons. Self-hosted solutions today are often not equipped
to face such challenges, and might suffer from unavailability or data loss to face such challenges and might suffer from unavailability or data loss
as a consequence. as a consequence.
If we want to grow our communities, and attract more people that might be If we want to grow our communities, and attract more people that might be
@@ -78,7 +78,7 @@ data, the compromise is much harder to make and people will be tempted to go
back to a comfortable lifestyle bestowed by big tech companies. back to a comfortable lifestyle bestowed by big tech companies.
Fixing availability, making services reliable even when hosted at unreliable Fixing availability, making services reliable even when hosted at unreliable
locations or on unreliable hardware, is one of the main objectives of locations or on unreliable hardware is one of the main objectives of
Deuxfleurs, and in particular of the project Garage which we are building. Deuxfleurs, and in particular of the project Garage which we are building.
### Distributed systems to the rescue ### Distributed systems to the rescue
@@ -123,9 +123,9 @@ landscape of distributed storage systems.
Garage implements the Amazon S3 protocol, a de-facto standard that makes it Garage implements the Amazon S3 protocol, a de-facto standard that makes it
compatible with a large variety of existing software. For instance it can be compatible with a large variety of existing software. For instance it can be
used as a storage back-end for many self-hosted web applications such as used as a storage backend for many self-hosted web applications such as
NextCloud, Matrix, Mastodon, Peertube, and many others, replacing the local NextCloud, Matrix, Mastodon, Peertube, and many others, replacing the local
file system of a server by a distributed storage layer. Garage can also be file system of a server with a distributed storage layer. Garage can also be
used to synchronize your files or store your backups with utilities such as used to synchronize your files or store your backups with utilities such as
Rclone or Restic. Last but not least, Garage can be used to host static Rclone or Restic. Last but not least, Garage can be used to host static
websites, such as the one you are currently reading, which is served directly websites, such as the one you are currently reading, which is served directly
@@ -135,7 +135,7 @@ Garage leverages the theory of distributed systems, and in particular
*Conflict-free Replicated Data Types* (CRDTs in short), a set of mathematical *Conflict-free Replicated Data Types* (CRDTs in short), a set of mathematical
tools that help us write distributed software that runs faster, by avoiding tools that help us write distributed software that runs faster, by avoiding
some kinds of unnecessary chit-chat between servers. In a future blog post, some kinds of unnecessary chit-chat between servers. In a future blog post,
we will show how this allow us to significantly outperform Minio, our closest we will show how this allows us to significantly outperform Minio, our closest
competitor (another self-hostable implementation of the S3 protocol). competitor (another self-hostable implementation of the S3 protocol).
On the side of software engineering, we are committed to making Garage On the side of software engineering, we are committed to making Garage
@@ -155,7 +155,7 @@ it is working exceptionally well for us. We are currently using it to store
backups of personal files, to store the media files that we send and receive backups of personal files, to store the media files that we send and receive
over the Matrix network, as well as to host a small but increasing number of over the Matrix network, as well as to host a small but increasing number of
static websites. Our current deployment hosts about 200 000 files spread in 50 static websites. Our current deployment hosts about 200 000 files spread in 50
buckets, for a total size of slightly above 500 GB. These number can seem small buckets, for a total size of slightly above 500 GB. These numbers can seem small
when compared to the datasets you could expect your typical cloud provider to when compared to the datasets you could expect your typical cloud provider to
be handling, however these sizes are fairly typical of the small-scale be handling, however these sizes are fairly typical of the small-scale
self-hosted deployments we are targeting, and our Garage cluster is in no way self-hosted deployments we are targeting, and our Garage cluster is in no way

@@ -18,7 +18,7 @@ We discuss the different bottlenecks and limitations of the software stack in it
It is an intended design decision: trusting each other enables Garage to spread data over the machines instead of duplicating it. It is an intended design decision: trusting each other enables Garage to spread data over the machines instead of duplicating it.
Still, you might want to share and collaborate with the rest of the world, and it can be done in 2 ways with Garage: through the integrated HTTP server that can serve your bucket as a static website, Still, you might want to share and collaborate with the rest of the world, and it can be done in 2 ways with Garage: through the integrated HTTP server that can serve your bucket as a static website,
or by connecting it to an application that will act as a "proxy" between Garage and the rest of the world. or by connecting it to an application that will act as a "proxy" between Garage and the rest of the world.
We refer as proxy software that know how to speak federated protocols (eg. Activity Pub, Solid, RemoteStorage, etc.) or distributed/p2p protocols (eg. BitTorrent, IPFS, etc.).--> We refer as proxy software that knows how to speak federated protocols (eg. Activity Pub, Solid, RemoteStorage, etc.) or distributed/p2p protocols (eg. BitTorrent, IPFS, etc.).-->
## Some context ## Some context
@@ -26,20 +26,20 @@ People often struggle to see the difference between IPFS and Garage, so let's st
Personally, I see IPFS as the intersection between BitTorrent and a file system. BitTorrent remains to this day one of the most efficient ways to deliver Personally, I see IPFS as the intersection between BitTorrent and a file system. BitTorrent remains to this day one of the most efficient ways to deliver
a copy of a file or a folder to a very large number of destinations. It however lacks some form of interactivity: once a torrent file has been generated, you can't simply a copy of a file or a folder to a very large number of destinations. It however lacks some form of interactivity: once a torrent file has been generated, you can't simply
add or remove files from it. By presenting itself more like a file system, IPFS is able to handle this use case out-of-the-box. add or remove files from it. By presenting itself more like a file system, IPFS is able to handle this use case out of the box.
<!--IPFS is a content-addressable network built in a peer-to-peer fashion. <!--IPFS is a content-addressable network built in a peer-to-peer fashion.
With simple words, it means that you query the content you want with its identifier without having to know *where* it is hosted on the network, and especially on which machine. In simple words, it means that you query the content you want with its identifier without having to know *where* it is hosted on the network, and especially on which machine.
As a side effect, you can share content over the Internet without any configuration (no firewall, NAT, fixed IP, DNS, etc.).--> As a side effect, you can share content over the Internet without any configuration (no firewall, NAT, fixed IP, DNS, etc.).-->
<!--However, IPFS does not enforce any property on the durability and availablity of your data: the collaboration mentioned earlier is <!--However, IPFS does not enforce any property on the durability and availability of your data: the collaboration mentioned earlier is
done only on a spontaneous approach. So at first, if you want to be sure that your content remains alive, you must keep it on your node. done only on a spontaneous approach. So at first, if you want to be sure that your content remains alive, you must keep it on your node.
And if nobody makes a copy of your content, you will loose it as soon as your node goes offline and/or crashes. And if nobody makes a copy of your content, you will lose it as soon as your node goes offline and/or crashes.
Furthermore, if you need multiple nodes to store your content, IPFS is not able to automatically place content on your nodes, Furthermore, if you need multiple nodes to store your content, IPFS is not able to automatically place content on your nodes,
enforce a given replication amount, check the integrity of your content, and so on.--> enforce a given replication amount, check the integrity of your content, and so on.-->
However, you would probably not rely on BitTorrent to durably store the encrypted holiday pictures you shared with your friends, However, you would probably not rely on BitTorrent to durably store the encrypted holiday pictures you shared with your friends,
as content on the BitTorrent tends to vanish when no one in the network has a copy of it anymore. The same applies to IPFS. as content on BitTorrent tends to vanish when no one in the network has a copy of it anymore. The same applies to IPFS.
Even if at some time everyone has a copy of the pictures on their hard disk, people might delete these copies after a while without you knowing it. Even if at some time everyone has a copy of the pictures on their hard disk, people might delete these copies after a while without you knowing it.
You also can't easily collaborate on storing this common treasure. For example, there is no automatic way to say that Alice and Bob You also can't easily collaborate on storing this common treasure. For example, there is no automatic way to say that Alice and Bob
are in charge of storing the first half of the archive while Charlie and Eve are in charge of the second half. are in charge of storing the first half of the archive while Charlie and Eve are in charge of the second half.
@@ -50,18 +50,18 @@ are in charge of storing the first half of the archive while Charlie and Eve are
[Resilio](https://www.resilio.com/individuals/) and [Syncthing](https://syncthing.net/) both feature protocols inspired by BitTorrent to synchronize a tree of your file system between multiple computers. [Resilio](https://www.resilio.com/individuals/) and [Syncthing](https://syncthing.net/) both feature protocols inspired by BitTorrent to synchronize a tree of your file system between multiple computers.
Reviewing these solutions is out of the scope of this article, feel free to try them by yourself!* Reviewing these solutions is out of the scope of this article, feel free to try them by yourself!*
Garage, on the contrary, is designed to automatically spread your content over all your available nodes, in a manner that makes the best possible use of your storage space. Garage, on the other hand, is designed to automatically spread your content over all your available nodes, in a manner that makes the best possible use of your storage space.
At the same time, it ensures that your content is always replicated exactly 3 times across the cluster (or less if you change a configuration parameter), At the same time, it ensures that your content is always replicated exactly 3 times across the cluster (or less if you change a configuration parameter),
on different geographical zones when possible. on different geographical zones when possible.
<!--To access this content, you must have an API key, and have a correctly configured machine available over the network (including DNS/IP address/etc.). If the amount of traffic you receive is way larger than what your cluster can handle, your cluster will become simply unresponsive. Sharing content across people that do not trust each other, ie. who operate independant clusters, is not a feature of Garage: you have to rely on external software.--> <!--To access this content, you must have an API key, and have a correctly configured machine available over the network (including DNS/IP address/etc.). If the amount of traffic you receive is way larger than what your cluster can handle, your cluster will become simply unresponsive. Sharing content across people that do not trust each other, ie. who operate independent clusters, is not a feature of Garage: you have to rely on external software.-->
However, this means that when content is requested from a Garage cluster, there are only 3 nodes that are capable of returning it to the user. However, this means that when content is requested from a Garage cluster, there are only 3 nodes capable of returning it to the user.
As a consequence, when content becomes popular, these nodes might become a bottleneck. As a consequence, when content becomes popular, this subset of nodes might become a bottleneck.
Moreover, all resources created (keys, files, buckets) are tightly coupled to the Garage cluster on which they exist; Moreover, all resources (keys, files, buckets) are tightly coupled to the Garage cluster on which they exist;
servers from different clusters can't collaborate to serve together the same data (without additional software). servers from different clusters can't collaborate to serve together the same data (without additional software).
➡️ **Garage is designed to durably store content.** ➡️ **Garage is designed to durably store content.**
In this blog post, we will explore whether we can combine both properties by connecting an IPFS node to a Garage cluster. In this blog post, we will explore whether we can combine efficient delivery and strong durability by connecting an IPFS node to a Garage cluster.
## Try #1: Vanilla IPFS over Garage ## Try #1: Vanilla IPFS over Garage
@@ -73,7 +73,7 @@ The Peergos project has a fork because it seems that the plugin is known for hit
([#105](https://github.com/ipfs/go-ds-s3/issues/105), [#205](https://github.com/ipfs/go-ds-s3/pull/205)). ([#105](https://github.com/ipfs/go-ds-s3/issues/105), [#205](https://github.com/ipfs/go-ds-s3/pull/205)).
This is the one we will try in the following. This is the one we will try in the following.
The easiest solution to use this plugin in IPFS is to bundle it in the main IPFS daemon, and thus recompile IPFS from source. The easiest solution to use this plugin in IPFS is to bundle it in the main IPFS daemon, and recompile IPFS from sources.
Following the instructions on the README file allowed me to spawn an IPFS daemon configured with S3 as the block store. Following the instructions on the README file allowed me to spawn an IPFS daemon configured with S3 as the block store.
I had a small issue when adding the plugin to the `plugin/loader/preload_list` file: the given command lacks a newline. I had a small issue when adding the plugin to the `plugin/loader/preload_list` file: the given command lacks a newline.
@@ -89,26 +89,25 @@ A content identifier (CID) was assigned to this picture:
QmNt7NSzyGkJ5K9QzyceDXd18PbLKrMAE93XuSC2487EFn QmNt7NSzyGkJ5K9QzyceDXd18PbLKrMAE93XuSC2487EFn
``` ```
The photo it now accessible on the whole network. The photo is now accessible on the whole network.
For example you can inspect it [from the official gateway](https://explore.ipld.io/#/explore/QmNt7NSzyGkJ5K9QzyceDXd18PbLKrMAE93XuSC2487EFn): For example, you can inspect it [from the official gateway](https://explore.ipld.io/#/explore/QmNt7NSzyGkJ5K9QzyceDXd18PbLKrMAE93XuSC2487EFn):
![A screenshot of the IPFS explorer](./explorer.png) ![A screenshot of the IPFS explorer](./explorer.png)
At the same time, I was monitoring Garage (through [the OpenTelemetry stack we have implemented earlier this year](/blog/2022-v0-7-released/)). At the same time, I was monitoring Garage (through [the OpenTelemetry stack we implemented earlier this year](/blog/2022-v0-7-released/)).
Just after launching the daemon and before doing anything, we had this surprisingly active Grafana plot: Just after launching the daemon - and before doing anything - I was met by this surprisingly active Grafana plot:
![Grafana API request rate when IPFS is idle](./idle.png) ![Grafana API request rate when IPFS is idle](./idle.png)
<center><i>Legend: y axis = requests per 10 seconds, x axis = time</i></center><p></p> <center><i>Legend: y axis = requests per 10 seconds, x axis = time</i></center><p></p>
It means that on average, we have around 250 requests per second. Most of these requests are checks that an IPFS block does not exist locally. It shows that on average, we handle around 250 requests per second. Most of these requests are in fact the IPFS daemon checking if a block exists in Gargage.
These requests are triggered by the DHT service of IPFS: since my node is reachable over the Internet, it acts as a public DHT server and has to answer global These requests are triggered by IPFS's DHT service: since my node is reachable over the Internet, it acts as a public DHT server and has to answer global
block requests over the whole network. Each time it receives a request for a block, it sends a request to its storage back-end (in our case, to Garage) to see if it exists. block requests over the whole network. Each time it receives a request for a block, it sends a request to its storage back-end (in our case, to Garage) to see if a copy exists locally.
*We will try to tweak the IPFS configuration later - we know that we can deactivate the DHT server. For now, we will continue with the default parameters.* *We will try to tweak the IPFS configuration later - we know that we can deactivate the DHT server. For now, we will continue with the default parameters.*
When I start interacting with IPFS by sending a file or browsing the default proposed catalogs (i.e. the full XKCD archive), When I started interacting with the IPFS node by sending a file or browsing the default proposed catalogs (i.e. the full XKCD archive),
I hit limits with our monitoring stack which, in its default configuration, is not able to ingest the traces of I quickly hit limits with our monitoring stack which, in its default configuration, is not able to ingest the large amount of tracing data produced by the high number of S3 requests originating from the IPFS daemon.
so many requests being processed by Garage.
We have the following error in Garage's logs: We have the following error in Garage's logs:
``` ```
@@ -120,7 +119,7 @@ In my opinion, such a simple task of sharing a picture should not require so man
As a comparison, this whole webpage, with its pictures, triggers around 10 requests on Garage when loaded, not thousands. As a comparison, this whole webpage, with its pictures, triggers around 10 requests on Garage when loaded, not thousands.
I think we can conclude that this first try was a failure. I think we can conclude that this first try was a failure.
The S3 storage plugin for IPFS does too many request and would need some important work to be optimized. The S3 storage plugin for IPFS does too many requests and would need some important work to be optimized.
However, we are aware that the people behind Peergos are known to run their software based on IPFS in production with an S3 backend, However, we are aware that the people behind Peergos are known to run their software based on IPFS in production with an S3 backend,
so we should not give up too fast. so we should not give up too fast.
@@ -131,15 +130,15 @@ Internally, it is built on IPFS and is known to have a [deep integration with th
One important point of this integration is that your browser is able to bypass both the Peergos daemon and the IPFS daemon One important point of this integration is that your browser is able to bypass both the Peergos daemon and the IPFS daemon
to write and read IPFS blocks directly from the S3 API server. to write and read IPFS blocks directly from the S3 API server.
*I don't know exactly if Peergos is still considered as alpha quality, or if a beta version was released, *I don't know exactly if Peergos is still considered alpha quality, or if a beta version was released,
but keep in mind that it might be more experimental that you'd like!* but keep in mind that it might be more experimental than you'd like!*
<!--To give ourselves some courage in this adventure, let's start with a nice screenshot of their web UI: <!--To give ourselves some courage in this adventure, let's start with a nice screenshot of their web UI:
![Peergos Web UI](./peergos.jpg)--> ![Peergos Web UI](./peergos.jpg)-->
Starting Peergos on top of Garage required some small patches on both sides, but in the end, I was able to get it working. Starting Peergos on top of Garage required some small patches on both sides, but in the end, I was able to get it working.
I was able to upload my file, see it in the interface, create a link to share it, rename it, move it in a folder, and so on: I was able to upload my file, see it in the interface, create a link to share it, rename it, move it to a folder, and so on:
![A screenshot of the Peergos interface](./upload.png) ![A screenshot of the Peergos interface](./upload.png)
@@ -149,7 +148,7 @@ A quick look at Grafana showed again a very active Garage:
![Screenshot of a grafana plot showing requests per second over time](./grafa.png) ![Screenshot of a grafana plot showing requests per second over time](./grafa.png)
<center><i>Legend: y axis = requests per 10 seconds on log(10) scale, x axis = time</i></center><p></p> <center><i>Legend: y axis = requests per 10 seconds on log(10) scale, x axis = time</i></center><p></p>
Again, the workload is dominated by `HeadObject` requests. Again, the workload is dominated by S3 `HeadObject` requests.
After taking a look at `~/.peergos/.ipfs/config`, it seems that the IPFS configuration used by the Peergos project is quite standard, After taking a look at `~/.peergos/.ipfs/config`, it seems that the IPFS configuration used by the Peergos project is quite standard,
which means that, as before, we are acting as a DHT server and having to answer to thousands of block requests every second. which means that, as before, we are acting as a DHT server and having to answer to thousands of block requests every second.
@@ -158,8 +157,8 @@ This traffic is all generated by Peergos.
The `OPTIONS` HTTP verb is here because we use the direct access feature of Peergos, The `OPTIONS` HTTP verb is here because we use the direct access feature of Peergos,
meaning that our browser is talking directly to Garage and has to use CORS to validate requests for security. meaning that our browser is talking directly to Garage and has to use CORS to validate requests for security.
Internally, IPFS splits files in blocks of less than 256 kB. My picture is thus split in 2 blocks, requiring 2 requests over Garage to fetch it. Internally, IPFS splits files into blocks of less than 256 kB. My picture is thus split into 2 blocks, requiring 2 requests over Garage to fetch it.
But even knowing that IPFS splits files in small blocks, I can't explain why we have so many `GetObject` requests. But even knowing that IPFS splits files into small blocks, I can't explain why we have so many `GetObject` requests.
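As a rough illustration of the block arithmetic above, the sketch below counts the minimum number of blocks, and therefore of `GetObject` requests, needed to fetch a file. The 256 kB bound comes from the text and is assumed here to mean 256 KiB. The function name and the 300 KiB picture size are hypothetical, chosen only to reproduce the 2-block case. Real IPFS chunking also creates metadata nodes, which this ignores.

```python
BLOCK_SIZE = 256 * 1024  # block size bound from the text, assumed to be 256 KiB


def min_block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Return the smallest number of blocks able to hold file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    # Ceiling division with integers only; an empty file is counted as
    # one block (an assumption made for this sketch).
    return max(1, -(-file_size // block_size))


# A hypothetical 300 KiB picture needs at least 2 blocks, hence 2 requests.
picture_blocks = min_block_count(300 * 1024)
```

Under these assumptions, the lower bound on requests grows linearly with file size, whereas a plain S3 download is a single request whatever the size.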
## Try #3: Optimizing IPFS ## Try #3: Optimizing IPFS
@@ -168,7 +167,7 @@ Routing = dhtclient
![](./grafa2.png) ![](./grafa2.png)
--> -->
We have seen in our 2 previous tries that the main source of load was the federation, and in particular the DHT server. We have seen in our 2 previous tries that the main source of load was the federation and in particular the DHT server.
In this section, we'd like to artificially remove this problem from the equation by preventing our IPFS node from federating In this section, we'd like to artificially remove this problem from the equation by preventing our IPFS node from federating
and see what pressure is put by Peergos alone on our local cluster. and see what pressure is put by Peergos alone on our local cluster.
@@ -188,7 +187,7 @@ we might have a non-federated but quite efficient end-to-end encrypted "cloud st
with our clients directly hitting the S3 API! with our clients directly hitting the S3 API!
For setups where federation is a hard requirement, For setups where federation is a hard requirement,
the next step would be to gradually allow our node to connect to the IPFS network, the next step would be to gradually allow our node to connect to the IPFS network
while ensuring that the traffic to the Garage cluster remains low. while ensuring that the traffic to the Garage cluster remains low.
For example, configuring our IPFS node as a `dhtclient` instead of a `dhtserver` would exempt it from answering public DHT requests. For example, configuring our IPFS node as a `dhtclient` instead of a `dhtserver` would exempt it from answering public DHT requests.
Keeping an in-memory index (as a hash map and/or a Bloom filter) of the blocks stored on the current node Keeping an in-memory index (as a hash map and/or a Bloom filter) of the blocks stored on the current node
@ -198,20 +197,20 @@ server on the regular file system, and reserve a second process configured with
However, even with these optimizations, the best we can expect is the traffic we have on the previous plot. However, even with these optimizations, the best we can expect is the traffic we have on the previous plot.
From a theoretical perspective, it is still higher than the optimal number of requests. From a theoretical perspective, it is still higher than the optimal number of requests.
On S3, storing a file, downloading a file and listing available files are all actions that can be done in a single request. On S3, storing a file, downloading a file, and listing available files are all actions that can be done in a single request.
Even if all requests don't have the same cost on the cluster, processing a request has a non-negligible fixed cost. Even if all requests don't have the same cost on the cluster, processing a request has a non-negligible fixed cost.
## Are S3 and IPFS incompatible? ## Are S3 and IPFS incompatible?
Tweaking IPFS in order to try and make it work on an S3 backend is all well and good, Tweaking IPFS in order to try and make it work on an S3 backend is all well and good,
but in some sense, the assumptions made by IPFS are funamentally incompatible with using S3 as a block storage. but in some sense, the assumptions made by IPFS are fundamentally incompatible with using S3 as block storage.
First, data on IPFS is split into relatively small chunks: all IPFS blocks must be less than 1 MB, with most being 256 KB or less. First, data on IPFS is split into relatively small chunks: all IPFS blocks must be less than 1 MB, with most being 256 KB or less.
This means that large files or complex directory hierarchies will need thousands of blocks to be stored, This means that large files or complex directory hierarchies will need thousands of blocks to be stored,
each of which is mapped to a single object in the S3 storage back-end. each of which is mapped to a single object in the S3 storage back-end.
On the other side, S3 implementations such as Garage are made to handle very large objects efficiently, On the other side, S3 implementations such as Garage are made to handle very large objects efficiently,
and they also provide their own primitives for rapidly listing all the objects present in a bucket or a directory. and they also provide their own primitives for rapidly listing all the objects present in a bucket or a directory.
There is thus a huge loss in performance when data is stored in IPFS's block format, because this format does not There is thus a huge loss in performance when data is stored in IPFS's block format because this format does not
take advantage of the optimizations provided by S3 back-ends in their standard usage scenarios. Instead, it take advantage of the optimizations provided by S3 back-ends in their standard usage scenarios. Instead, it
requires storing and retrieving thousands of small S3 objects even for very simple operations such requires storing and retrieving thousands of small S3 objects even for very simple operations such
as retrieving a file or listing a directory, incurring a fixed overhead each time. as retrieving a file or listing a directory, incurring a fixed overhead each time.
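To make the fixed-overhead argument concrete, here is a minimal back-of-the-envelope sketch in Python. The 256 KiB block size is taken from the figures quoted above; the per-request fixed cost (`FIXED_COST_MS`) is a made-up placeholder rather than a measurement, and the count is only a lower bound since it ignores the extra blocks IPFS needs to describe the file's structure.

```python
import math

BLOCK_SIZE = 256 * 1024   # typical IPFS block size quoted above, in bytes
FIXED_COST_MS = 5.0       # hypothetical fixed cost of one S3 request (assumption)


def ipfs_data_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Lower bound on the number of data blocks (hence S3 objects) for one file."""
    return max(1, math.ceil(file_size / block_size))


def fixed_overhead_ms(requests: int, cost_ms: float = FIXED_COST_MS) -> float:
    """Total fixed overhead when every request pays `cost_ms` up front."""
    return requests * cost_ms


if __name__ == "__main__":
    # A small picture and a 100 MiB file, fetched block by block vs. in one request.
    for size in (300 * 1024, 100 * 1024 * 1024):
        blocks = ipfs_data_blocks(size)
        print(f"{size} bytes: {blocks} GetObject calls through IPFS "
              f"({fixed_overhead_ms(blocks):.0f} ms of fixed overhead) "
              f"vs 1 call on plain S3 ({fixed_overhead_ms(1):.0f} ms)")
```

Whatever the real per-request cost is, the overhead grows linearly with the number of blocks, while a native S3 download pays it once.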
@ -223,38 +222,38 @@ When a node is missing a file or a directory it wants to read, it has to do as m
as there are IPFS blocks in the object to be read. as there are IPFS blocks in the object to be read.
On the receiving end, this means that any fully-fledged IPFS node has to answer large numbers On the receiving end, this means that any fully-fledged IPFS node has to answer large numbers
of requests for blocks required by users everywhere on the network, which is what we observed in our experiment above. of requests for blocks required by users everywhere on the network, which is what we observed in our experiment above.
We were however surprised to observe that many requests comming from the IPFS network were for blocks We were however surprised to observe that many requests coming from the IPFS network were for blocks
which our node wasn't locally storing a copy of: this means that somewhere in the IPFS protocol, an overly optimistic which our node didn't have a copy of: this means that somewhere in the IPFS protocol, an overly optimistic
assumption is made on where data could be found in the network, and this ends up translating in many requests assumption is made on where data could be found in the network, and this ends up translating into many requests
between nodes that return negative results. between nodes that return negative results.
When IPFS blocks are stored on a local filesystem, answering these requests fast might be possible. When IPFS blocks are stored on a local filesystem, answering these requests fast might be possible.
However when using an S3 server as a storage back-end, this becomes prohibitively costly. However, when using an S3 server as a storage back-end, this becomes prohibitively costly.
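The in-memory index suggested earlier could, for instance, take the form of a Bloom filter placed in front of the S3 back-end, so that requests for blocks the node does not hold are rejected without any S3 call. The following is a minimal sketch of that idea, not code from IPFS or Garage: `has_block`, `backend_lookup`, and the CID strings are hypothetical names chosen for illustration, and the filter size and hash count are arbitrary.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, a tunable rate of false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive k bit positions by salting a single hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                key, digest_size=8, salt=i.to_bytes(2, "big")
            ).digest()
            yield int.from_bytes(digest, "big") % self.size_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def has_block(index: BloomFilter, cid: str, backend_lookup) -> bool:
    """Answer a block request, querying the storage back-end only when needed."""
    if not index.might_contain(cid.encode()):
        return False              # definite miss: no S3 request is issued
    return backend_lookup(cid)    # possible hit: confirm with the back-end


if __name__ == "__main__":
    stored = {"cid-a", "cid-b"}   # stand-in for the objects held in the S3 bucket
    index = BloomFilter()
    for cid in stored:
        index.add(cid.encode())
    print(has_block(index, "cid-a", stored.__contains__))
    print(has_block(index, "cid-unknown", stored.__contains__))
```

The trade-off is memory against accuracy: a false positive only costs one wasted back-end lookup, whereas a negative answer from the filter is always correct, which is the common case for the unsolicited requests described above.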
If one wanted to design a distributed storage system for IPFS data blocks, they would probably need to start at a lower level. If one wanted to design a distributed storage system for IPFS data blocks, they would probably need to start at a lower level.
Garage itself makes use of a block storage mechanism that allows small-sized blocks to be stored on a cluster and accessed Garage itself makes use of a block storage mechanism that allows small-sized blocks to be stored on a cluster and accessed
rapidly by nodes that need to access them. rapidly by nodes that need to access them.
However passing through the entire abstraction that provides an S3 API is wastefull and redundant, as this API is However, passing through the entire abstraction that provides an S3 API is wasteful and redundant, as this API is
designed to provide advanced functionnality such as mutating objects, associating metadata with objects, listing objects, etc. designed to provide advanced functionality such as mutating objects, associating metadata with objects, listing objects, etc.
Plugging the IPFS daemon directly into a lower-level distributed block storage like Plugging the IPFS daemon directly into a lower-level distributed block storage like
Garage's might yield way better results by bypassing all of this complexity. Garage's might yield way better results by bypassing all of this complexity.
## Conclusion ## Conclusion
Running IPFS over an S3 storage back-end does not quite work out of the box in term of performances. Running IPFS over an S3 storage backend does not quite work out of the box in terms of performance.
We have identified that the main problem is linked with the DHT service, Having identified that the main problem is linked to the DHT service,
and proposed some improvements (disabling the DHT server, keeping an in-memory index of the blocks, using the S3 back-end only for user data). we proposed some improvements (disabling the DHT server, keeping an in-memory index of the blocks, and using the S3 back-end only for user data).
From an IPFS design perspective, it seems however that the numerous small blocks handled by the protocol From an IPFS design perspective, it seems however that the numerous small blocks handled by the protocol
do not map trivially to efficient use of the S3 API, and thus could be a limiting factor to any optimization work. do not map trivially to efficient use of the S3 API, and thus could be a limiting factor to any optimization work.
As part of my testing journey, I also stumbled upon some posts about performance issues on IPFS (e.g. [#6283](https://github.com/ipfs/go-ipfs/issues/6283)) As part of my testing journey, I also stumbled upon some posts about performance issues on IPFS (e.g. [#6283](https://github.com/ipfs/go-ipfs/issues/6283))
that are not linked with the S3 connector. I might be negatively influenced by my failure to connect IPFS with S3, that are not linked with the S3 connector. I might be negatively influenced by my failure to connect IPFS with S3,
but at this point I'm tempted to think that IPFS is intrinsically resource-intensive. but at this point, I'm tempted to think that IPFS is intrinsically resource-intensive from a block activity perspective.
On our side at Deuxfleurs, we will continue our investigations towards more *minimalistic* software. On our side at Deuxfleurs, we will continue our investigations towards more *minimalistic* software.
This choice makes sense for us as we want to reduce the ecological impact of our services This choice makes sense for us as we want to reduce the ecological impact of our services
by deploying less servers, that use less energy, and that are renewed less frequently. by deploying fewer servers that use less energy and are renewed less frequently.
After discussing with Peergos maintainers, we identified that it is possible to run Peergos without IPFS. After discussing with Peergos maintainers, we identified that it is possible to run Peergos without IPFS.
With some optimizations on the block size, we envision great synergies between Garage and Peergos that could lead to With some optimizations on the block size, we envision great synergies between Garage and Peergos that could lead to


@ -22,7 +22,7 @@ Feel free to [reach out to us](mailto:garagehq@deuxfleurs.fr) if you are packagi
Speaking about the changes of this new version, it obviously includes many bug fixes. Speaking about the changes of this new version, it obviously includes many bug fixes.
We listed them in our [changelogs](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases), so take a look, we might have fixed some issues you were having! We listed them in our [changelogs](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases), so take a look, we might have fixed some issues you were having!
Besides bugfixes, there are two new major features in this release: better integration with Kubernetes, and support for observability via OpenTelemetry. Besides bug fixes, there are two new major features in this release: better integration with Kubernetes, and support for observability via OpenTelemetry.
## Kubernetes integration ## Kubernetes integration
@ -46,7 +46,7 @@ kubernetes_skip_crd = true
If you want to try Garage on K8S, we currently only provide some basic [example files](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/7e1ac51b580afa8e900206e7cc49791ed0a00d94/script/k8s). These files register a [ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/), a [ClusterRoleBinding](https://kubernetes.io/docs/reference/access-authn-authz/rbac/#rolebinding-and-clusterrolebinding), and a [StatefulSet](https://kubernetes.io/fr/docs/concepts/workloads/controllers/statefulset/) with [Persistent Volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/). If you want to try Garage on K8S, we currently only provide some basic [example files](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/7e1ac51b580afa8e900206e7cc49791ed0a00d94/script/k8s). These files register a [ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/), a [ClusterRoleBinding](https://kubernetes.io/docs/reference/access-authn-authz/rbac/#rolebinding-and-clusterrolebinding), and a [StatefulSet](https://kubernetes.io/fr/docs/concepts/workloads/controllers/statefulset/) with [Persistent Volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/).
Once these files deployed, you will be able to interact with Garage as follow: Once these files are deployed, you will be able to interact with Garage as follows:
```bash ```bash
kubectl exec -it garage-0 --container garage -- /garage status kubectl exec -it garage-0 --container garage -- /garage status
@ -59,14 +59,14 @@ kubectl exec -it garage-0 --container garage -- /garage status
You can then follow the [regular documentation](https://garagehq.deuxfleurs.fr/documentation/cookbook/real-world/#creating-a-cluster-layout) to complete the configuration of your cluster. You can then follow the [regular documentation](https://garagehq.deuxfleurs.fr/documentation/cookbook/real-world/#creating-a-cluster-layout) to complete the configuration of your cluster.
If you target a production deployment, you should avoid binding admin rights to your cluster to create Garage's CRD. You will also need to expose some [Services](https://kubernetes.io/docs/concepts/services-networking/service/) to make your cluster reachable. Keep also in mind that Garage is a stateful service, so you must be very careful of how you handle your data in Kubernetes in order not to lose it. In the near future, we plan to release a proper Helm chart and write "best practises" in our documentation. If you target a production deployment, you should avoid binding admin rights to your cluster to create Garage's CRD. You will also need to expose some [Services](https://kubernetes.io/docs/concepts/services-networking/service/) to make your cluster reachable. Keep also in mind that Garage is a stateful service, so you must be very careful of how you handle your data in Kubernetes in order not to lose it. In the near future, we plan to release a proper Helm chart and write "best practices" in our documentation.
If Kubernetes is not your thing, know that we are running Garage on a Nomad+Consul cluster, which is also well supported. If Kubernetes is not your thing, know that we are running Garage on a Nomad+Consul cluster, which is also well supported.
We have not documented it yet but you can take a look at [our Nomad service](https://git.deuxfleurs.fr/Deuxfleurs/infrastructure/src/commit/1e5e4af35c073d04698bb10dd4ad1330d6c62a0d/app/garage/deploy/garage.hcl). We have not documented it yet but you can take a look at [our Nomad service](https://git.deuxfleurs.fr/Deuxfleurs/infrastructure/src/commit/1e5e4af35c073d04698bb10dd4ad1330d6c62a0d/app/garage/deploy/garage.hcl).
## OpenTelemetry support ## OpenTelemetry support
[OpenTelemetry](https://opentelemetry.io/) standardizes how software generates and collects system telemetry information, namely metrics, logs and traces. [OpenTelemetry](https://opentelemetry.io/) standardizes how software generates and collects system telemetry information, namely metrics, logs, and traces.
By implementing this standard in Garage, we hope that it will help you to better monitor, manage and tune your cluster. By implementing this standard in Garage, we hope that it will help you to better monitor, manage and tune your cluster.
Note that to fully leverage this feature, you must be already familiar with monitoring stacks like [Prometheus](https://prometheus.io/)+[Grafana](https://grafana.com/) or [ElasticSearch](https://www.elastic.co/elasticsearch/)+[Kibana](https://www.elastic.co/kibana/). Note that to fully leverage this feature, you must be already familiar with monitoring stacks like [Prometheus](https://prometheus.io/)+[Grafana](https://grafana.com/) or [ElasticSearch](https://www.elastic.co/elasticsearch/)+[Kibana](https://www.elastic.co/kibana/).
@ -87,7 +87,7 @@ It includes a docker-compose file and a pre-configured Grafana dashboard.
You can use them if you want to reproduce the following examples. You can use them if you want to reproduce the following examples.
Grafana is particularly adapted to understand how your cluster is performing from a "bird's eye view". Grafana is particularly adapted to understand how your cluster is performing from a "bird's eye view".
For example, the following graph shows S3 API calls sent to your node per time-unit. For example, the following graph shows S3 API calls sent to your node per time unit.
You can use it to better understand how your users are interacting with your cluster. You can use it to better understand how your users are interacting with your cluster.
![A screenshot of a plot made by Grafana depicting the number of requests per time units grouped by endpoints](api_rate.png) ![A screenshot of a plot made by Grafana depicting the number of requests per time units grouped by endpoints](api_rate.png)
@ -95,21 +95,21 @@ You can use it to better understand how your users are interacting with your clu
Thanks to this graph, we know that starting at 14:55, an important upload has been started. Thanks to this graph, we know that starting at 14:55, an important upload has been started.
This upload is made of many small files, as we see many PutObject calls that are often used for small files. This upload is made of many small files, as we see many PutObject calls that are often used for small files.
It also has some large objects, as we observe some multipart upload requests. It also has some large objects, as we observe some multipart upload requests.
Conversely, at this time, no read are done as the corresponding read enpoints (ListBuckets, ListObjectsV2, etc.) receive 0 request per time unit. Conversely, at this time, no reads are done as the corresponding read endpoints (ListBuckets, ListObjectsV2, etc.) receive 0 requests per time unit.
Garage also collects metrics from lower level parts of the system. Garage also collects metrics from lower-level parts of the system.
You can use them to better understand how Garage is interacting with your OS and your hardware. You can use them to better understand how Garage is interacting with your OS and your hardware.
![A screenshot of a plot made by Grafana depicting the write speed (in MB/s) during time.](writes.png) ![A screenshot of a plot made by Grafana depicting the write speed (in MB/s) during test time.](writes.png)
This plot has been captured at the same moment than the previous one. This plot has been captured at the same moment as the previous one.
We do not see a correlation between the writes and the API requests for the full upload but only for its beginning. We do not see a correlation between the writes and the API requests for the full upload but only for its beginning.
More precisely, it maps well to multipart upload requests, and this is expected. More precisely, it maps well to multipart upload requests, and this is expected.
Large files (of the multipart uploads) will saturate the writes of your disk but the uploading of small files (via the PutObject endpoint) will be limited by other parts of the system. Large files (of the multipart uploads) will saturate the writes of your disk but the uploading of small files (via the PutObject endpoint) will be limited by other parts of the system.
This simple example covers only 2 metrics over the 20+ ones that we already defined, but it still allowed us to precisely describe our cluster usage and identify where bottlenecks could be. This simple example covers only 2 metrics over the 20+ ones that we already defined, but it still allowed us to precisely describe our cluster usage and identify where bottlenecks could be.
We are confident that cleverly using these metrics on a production cluster will give you many more valuable insights on your cluster. We are confident that cleverly using these metrics on a production cluster will give you many more valuable insights into your cluster.
While metrics are good for having a large, general overview of your system, they are however not adapted for digging and pinpointing a specific performance issue on a specific code path. While metrics are good for having a large, general overview of your system, they are however not adapted for digging and pinpointing a specific performance issue on a specific code path.
Thankfully, we also have a solution for this problem: tracing. Thankfully, we also have a solution for this problem: tracing.
@ -128,9 +128,9 @@ Consequently, this request probably corresponds to a very small file.
Below this first histogram, you can select the request you want to inspect, and then see its trace on the bottom part. Below this first histogram, you can select the request you want to inspect, and then see its trace on the bottom part.
The trace shown above can be broken down into 4 parts: fetching the API key to check authentication (`key get`), fetching the bucket identifier from its name (`bucket_alias get`), fetching the bucket configuration to check authorizations (`bucket_v2 get`), and finally inserting the object in the storage (`object insert`). The trace shown above can be broken down into 4 parts: fetching the API key to check authentication (`key get`), fetching the bucket identifier from its name (`bucket_alias get`), fetching the bucket configuration to check authorizations (`bucket_v2 get`), and finally inserting the object in the storage (`object insert`).
With this example, we demonstrated that we can inspect Garage internals to find slow requests, then see which codepath has been taken by a request, and finally to identify which part of the code took time. With this example, we demonstrated that we can inspect Garage internals to find slow requests, then see which codepath has been taken by a request, and finally identify which part of the code took time.
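For readers unfamiliar with tracing, the idea of breaking one request into timed spans can be illustrated with a few lines of stand-alone Python. This is only a sketch of the concept: it uses neither Garage's code nor the OpenTelemetry API, the span names are simply the four steps listed above, and `handle_put_object` and its `work` callback are hypothetical.

```python
import time
from contextlib import contextmanager

# The four steps of the trace described above, in order.
STEPS = ("key get", "bucket_alias get", "bucket_v2 get", "object insert")


@contextmanager
def span(name, sink):
    """Record how long the enclosed block took as a (name, seconds) pair."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink.append((name, time.perf_counter() - start))


def handle_put_object(sink, work=lambda step: None):
    """Hypothetical request handler: runs each step inside its own span."""
    for step in STEPS:
        with span(step, sink):
            work(step)


def slowest(sink):
    """Return the (name, seconds) pair of the span that took the longest."""
    return max(sink, key=lambda item: item[1])


if __name__ == "__main__":
    trace = []
    # Simulate a request where only the final insert is slow.
    handle_put_object(
        trace, work=lambda s: time.sleep(0.05) if s == "object insert" else None
    )
    for name, seconds in trace:
        print(f"{name:>18}: {seconds * 1000:.2f} ms")
    print("slowest step:", slowest(trace)[0])
```

A real tracing stack adds span nesting, request identifiers, and export to a collector, but the reading of the result is the same: look for the span that dominates the total duration.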
Keep in mind that this is our first iteration on telemetry for Garage, so things are a bit rough around the edges (step by step documentation is missing, our Grafana dashboard is a work in a progress, etc.). Keep in mind that this is our first iteration on telemetry for Garage, so things are a bit rough around the edges (step-by-step documentation is missing, our Grafana dashboard is a work in progress, etc.).
In all cases, your feedback is welcome on our Matrix channel. In all cases, your feedback is welcome on our Matrix channel.
@ -140,4 +140,4 @@ This is only the first iteration of the Kubernetes and OpenTelemetry integration
We plan to polish their integration in the coming months based on our experience and your feedback. We plan to polish their integration in the coming months based on our experience and your feedback.
You may also ask yourself what will be the other works we plan to conduct: stay tuned, we will soon release information on our roadmap! You may also ask yourself what will be the other works we plan to conduct: stay tuned, we will soon release information on our roadmap!
In the mean time, we hope you will enjoy using Garage v0.7. In the meantime, we hope you will enjoy using Garage v0.7.