Fixes until "myriads of objects"

This commit is contained in:
Alex 2022-09-28 15:45:32 +02:00
parent 6133fcd3ca
commit bacebcfbf1
Signed by: lx
GPG key ID: 0E496D15096376BE

View file

@ -97,7 +97,7 @@ and the moment where the first bytes of the returned object are received by the
Second, we will evaluate generic throughput, to understand how well
Garage can leverage the underlying machine's performances.
**Time To First Byte** - One specificity of Garage is that we implemented S3
**Time-to-First-Byte** - One specificity of Garage is that we implemented S3
web endpoints, with the idea to make it a platform of choice to publish
static websites. When publishing a website, TTFB can be directly observed
by the end user, as it will impact the perceived reactivity of the websites.
@ -136,46 +136,47 @@ whose results are shown on the following figure:
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
Garage v0.7, which does not support block streaming, features TTFB between 1.6s
and 2s, which corresponds to the theoretical time to transfer the full block.
On the other side of the plot, Garage v0.8 has a very low TTFB thanks to the
streaming feature (the lowest value is 43 ms). Minio sits between the two
Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
and 2s, which corresponds to the time to transfer the full block which we calculated before.
On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
streaming feature (the lowest value is 43ms). Minio sits between the two
Garage versions: we suppose that it does some form of batching, but smaller
than 1MB.
**Throughput** - As soon as we publicly released Garage, people started
benchmarking it, comparing its performances to writing directly on the
benchmarking it, comparing its performances with writing directly on the
filesystem, and observed that Garage was slower (eg.
[#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the
situation, we put costly processing like hashing on a dedicated thread and did
many compute optimization
situation, we did some optimizations, such as putting costly processing like hashing on a dedicated thread
and many others
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to
`v0.8 beta 1`. We also noted logic we wrote (to better control resource usage
and detect errors, like semaphores or timeouts) was artificially limiting
version 0.8 "Beta 1". We also noticed that some of the logic logic we wrote
to better control resource usage
and detect errors, like semaphores or timeouts, was artificially limiting
performances. In another iteration, we made this logic less restrictive at the
cost of higher resource consumption under load
([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in
`v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we
version 0.8 "Beta 2". Finally, we currently do multiple `fsync` calls each time we
write a block. We know that this is expensive and did a test build without any
`fsync` call ([see the
commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7))
that will not be merged, just to assess the impact of `fsync`. We refer to it
as `no-fsync` in the following plot.
*A note about fsync: for performance reasons, operating systems often do not
*A note about `fsync`: for performance reasons, operating systems often do not
write directly to the disk when a process creates or updates a file in your
filesystem, instead, the write is kept in memory, and flushed later in a batch
with other writes. If a power loss occurs before the OS has time to flush the
writes on the disk, data will be lost. To ensure that a write is effectively
written on disk, you must use the
[fsync(2)](https://man7.org/linux/man-pages/man2/fsync.2.html) system call: it
will block until your file or directory has been written from your volatile
memory to your persisting storage device. Additionally, the exact semantic of
fsync [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
and, even on battle-tested software like Postgres,
[they "did it wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
Note that on Garage, we are currently working on our "fsync" policy and thus, for
filesystem. Instead, the write is kept in memory, and flushed later in a batch
with other writes. If a power loss occurs before the OS has time to flush
data to disk, some writes will be lost. To ensure that a write is effectively
written on disk, the
[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
which blocks until the file or directory on which it is called has been written from volatile
memory to the persistent storage device. Additionally, the exact semantic of
`fsync` [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
and, even on battle-tested software like Postgres, it was
["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
Note that on Garage, we are currently working on our `fsync` policy and thus, for
now, you should expect limited data durability in case of power loss, as we are
aware of some inconsistency on this point (which we describe in the following
and plan to solve).*
@ -188,16 +189,16 @@ performance with a standardized and mixed workload.
![Plot showing IO perf of Garage configs and Minio](io.png)
Minio, our ground truth, features the best performances in this test.
Considering Garage, we observe that each improvement we made has a visible
impact on its performances. We also note that we have a progress margin in
Minio, our reference point, gives us the best performances in this test.
Looking at Garage, we observe that each improvement we made has a visible
impact on performances. We also note that we have a progress margin in
terms of performances compared to Minio: additional benchmarks, tests, and
monitoring could help better understand the remaining difference.
## A myriad of objects
Object storage systems do not handle a single object but a myriad of them:
Object storage systems do not handle a single object but huge numbers of them:
Amazon claims to handle trillions of objects on their platform, and Red Hat
communicates about Ceph being able to handle 10 billion objects. All these
objects must be tracked efficiently in the system to be fetched, listed,