New article: Bringing theoretical design and observed performances face to face #12

Merged
quentin merged 24 commits from perf into master 2022-09-29 11:16:04 +00:00
Showing only changes of commit 026402cae3

@ -79,7 +79,7 @@ LMDB looks very promising both in terms of performance and resource usage,
 it is a very good candidate for Garage's default metadata engine in the future,
 and we need to define a data policy for Garage that would help us arbitrate between performance and durability.
-*To fsync or not to fsync? Performance is nothing without reliability, so we need to better assess the impact of validating a write and then losing it. Because Garage is a distributed system, even if a node loses its write due to a power loss, it will fetch it back from the 2 other nodes storing it. But rare situations can occur where 1 node is down and the 2 others validate the write and then lose power: what is our policy in this case? For storage durability, we already assume that we never lose the storage of more than 2 nodes; should we also expect that we don't lose more than 2 nodes at the same time? What to think about people hosting all their nodes in the same place without a UPS?*
+*To fsync or not to fsync? Performance is nothing without reliability, so we need to better assess the impact of validating a write and then losing it. Because Garage is a distributed system, even if a node loses its write due to a power loss, it will fetch it back from the 2 other nodes storing it. But rare situations can occur where 1 node is down and the 2 others validate the write and then lose power: what is our policy in this case? For storage durability, we already assume that we never lose the storage of more than 2 nodes; should we also expect that we don't lose power on more than 2 nodes at the same time? What should we think about people hosting all their nodes in the same place without a UPS? Historically, it seems that Minio developers also accepted some compromises on this side ([#3536](https://github.com/minio/minio/issues/3536), [HN Discussion](https://news.ycombinator.com/item?id=28135533)).
+Now, they seem to use a combination of `O_DSYNC` and `fdatasync(3p)` (a variant of `fsync` that persists the file data, but only the metadata needed to read it back) together with `O_DIRECT` for direct I/O ([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274), [example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
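To make the two options named in this paragraph concrete, here is a minimal Python sketch of a write that is only acknowledged once the data has reached stable storage, either by opening the file with `O_DSYNC` or by calling `fdatasync` after writing. The helper `durable_write` is hypothetical and written for illustration only: it is neither Garage nor Minio code. `O_DIRECT` is deliberately left out, since it requires aligned buffers.

```python
import os


def durable_write(path: str, payload: bytes, use_dsync: bool = False) -> int:
    """Hypothetical helper: write payload to path, flush it to disk, return bytes written."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    # O_DSYNC makes each write() return only once the data is on stable storage.
    # It is missing on some platforms, in which case we fall back to an explicit flush.
    dsync = getattr(os, "O_DSYNC", 0) if use_dsync else 0
    # fdatasync skips metadata that is not needed to read the data back
    # (e.g. timestamps); fall back to fsync where it is unavailable.
    flush = getattr(os, "fdatasync", os.fsync)

    fd = os.open(path, flags | dsync, 0o644)
    try:
        view = memoryview(payload)
        written = 0
        while written < len(payload):  # os.write may perform a partial write
            written += os.write(fd, view[written:])
        if not dsync:
            flush(fd)
    finally:
        os.close(fd)
    # Note: for a newly created file, the parent directory entry would also
    # need an fsync to survive a power loss; this sketch does not do it.
    return written
```

Skipping the flush entirely is what makes a write fast but exposes it to the power-loss scenario described above.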
 **Storing millions of objects** - Object storage systems are designed not only for data durability and availability, but also for scalability.
 Following this observation, some people asked us how scalable Garage is. Although answering this question is out of the scope of this study, we wanted to
@ -154,14 +154,17 @@ For now, we are confident that a Garage cluster with 100+ nodes should definitel
 ## Conclusion and Future work
-Identified some sensitive points: fsync, metadata engine, raw i/o.
-At the same time, validated important performance improvements (ttfb, minio warp, metadata engine - including less resource usage) while keeping our versatility (network/nodes).
-- srpt
-- better analysis of the fsync / data reliability impact
-- analysis and comparison of Garage at scale
-- try to better understand ecosystem (riak cs, minio, ceph, swift) -> some knowledge to get
+During this work, we identified some sensitive points in Garage that we will continue working on: our data durability target and our interaction with the filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) are not yet homogeneous across our components, our new metadata engines (LMDB, SQLite) still need some testing and tuning, and we know that raw I/O (GetObject, PutObject) still has a small margin for improvement.
+At the same time, Garage has never been better: its next version (v0.8) will see drastic improvements in terms of performance and reliability.
+We are confident that it is already able to cover a wide range of deployment needs, up to hundreds of nodes, millions of objects, and so on.
+In the future, on the performance side, we would like to evaluate the impact of introducing an SRPT scheduler ([#361](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/361)),
+define a data durability policy and implement it, conduct a deeper and broader review of the state of the art (minio, ceph, swift, openio, riak cs, seaweedfs, etc.) to learn from them,
+and finally, benchmark Garage at scale, possibly with multiple terabytes of data, in long-lasting experiments.
+In the meantime, stay tuned: we have released [a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1), we are working
+on proving and explaining our layout algorithm ([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)), we are working on a Python SDK for Garage's administration API ([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will soon officially introduce K2V, a new API that we created and published as a technical preview, and explain why we built it ([see K2V in our doc](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)).
 ## Notes