
% \usepackage[frenchb]{babel}
\setbeamertemplate{itemize item}{\color{ListOrange}$\blacktriangleright$}
\setbeamercolor{normal text}{fg=verygrey}
\title{Garage, the low-tech storage platform for geo-distributed clusters}
\author{Alex Auvolat, Deuxfleurs}
\date{FOSDEM'24, 2024-02-03}
{\large\bfseries Alex Auvolat, Deuxfleurs Association}
Matrix channel: \texttt{\}
\frametitle{Who I am}
\adjincludegraphics[width=.4\linewidth, valign=t]{../assets/alex.jpg}
\textbf{Alex Auvolat}\\
PhD; co-founder of Deuxfleurs
\adjincludegraphics[width=.5\linewidth, valign=t]{../assets/logos/deuxfleurs.pdf}
A non-profit self-hosting collective,\\
member of the CHATONS network
\adjincludegraphics[width=.7\linewidth, valign=t]{../assets/logos/logo_chatons.png}
\frametitle{Our objective at Deuxfleurs}
\textbf{Promote self-hosting and small-scale hosting\\
as an alternative to large cloud providers}
Why is it hard?
{\footnotesize we want good uptime/availability with low supervision}
\frametitle{Building a resilient system with cheap stuff}
\item \textcolor<5->{gray}{Commodity hardware (e.g. old desktop PCs)\\
\visible<4->{{\footnotesize (can die at any time)}}}
\item<5-> \textcolor<7->{gray}{Regular Internet (e.g. FTTB, FTTH) and power grid connections\\
\visible<6->{{\footnotesize (can be unavailable randomly)}}}
\item<7-> \textbf{Geographical redundancy} (multi-site replication)
\frametitle{Object storage: a crucial component}
S3: a de-facto standard, many compatible applications
\visible<2->{MinIO is self-hostable but not suited for geo-distributed deployments}
\visible<3->{\textbf{Garage is a self-hosted drop-in replacement for the Amazon S3 object store}}
\frametitle{CRDTs / weak consistency instead of consensus}
Consensus can be implemented reasonably well in practice, so why avoid it?
\item<2-> \textbf{Software complexity}
\item<3-> \textbf{Performance issues:}
\item<4-> The leader is a \textbf{bottleneck} for all requests\\
\item<5-> \textbf{Sensitive to higher latency} between nodes
\item<6-> \textbf{Takes time to reconverge} when disrupted (e.g. node going down)
\visible<7->{\underline{Internally, Garage uses only CRDTs} (conflict-free replicated data types)}
\frametitle{The data model of object storage}
Object storage is basically a \textbf{key-value store}:
\textbf{Key: file path + name} & \textbf{Value: file data + metadata} \\
\texttt{index.html} &
\texttt{Content-Type: text/html; charset=utf-8} \newline
\texttt{Content-Length: 24929} \newline
\texttt{<binary blob>} \\
\texttt{img/logo.svg} &
\texttt{Content-Type: image/svg+xml} \newline
\texttt{Content-Length: 13429} \newline
\texttt{<binary blob>} \\
\texttt{download/index.html} &
\texttt{Content-Type: text/html; charset=utf-8} \newline
\texttt{Content-Length: 26563} \newline
\texttt{<binary blob>} \\
\item<2> Maps well to CRDT data types
\frametitle{Performance gains in practice}
% ======================================== TIMELINE
% ======================================== TIMELINE
% ======================================== TIMELINE
\section{Recent developments}
% ====================== v0.7.0 ===============================
\frametitle{April 2022 - Garage v0.7.0}
Focus on \underline{observability and ecosystem integration}
\item \textbf{Monitoring:} metrics and traces, using OpenTelemetry
\item Replication modes with 1 or 2 copies / weaker consistency
\item Kubernetes integration
\item Admin API (v0.7.2)
\item Experimental K2V API (v0.7.2)
\frametitle{Metrics (Prometheus + Grafana)}
\frametitle{Traces (Jaeger)}
% ====================== v0.8.0 ===============================
\frametitle{November 2022 - Garage v0.8.0}
Focus on \underline{performance}
\item \textbf{Alternative metadata DB engines} (LMDB, SQLite)
\item \textbf{Performance improvements:} block streaming, various optimizations...
\item Bucket quotas (max size, max \#objects)
\item Quality of life improvements, observability, etc.
\frametitle{About metadata DB engines}
\textbf{Issues with Sled:}
\item Huge files on disk
\item Unpredictable performance, especially on HDD
\item API limitations
\item Not actively maintained
\textbf{LMDB:} very stable, good performance, reasonably small files on disk
Sled will be removed in Garage v1.0
\frametitle{DB engine performance comparison}
NB: SQLite was slow because of its synchronous mode, which is now configurable
\frametitle{Block streaming}
\frametitle{TTFB benchmark}
\frametitle{Throughput benchmark}
% ====================== v0.9.0 ===============================
\frametitle{October 2023 - Garage v0.9.0}
Focus on \underline{streamlining \& usability}
\item Support multiple HDDs per node
\item S3 compatibility:
\item support basic lifecycle configurations
\item allow for multipart upload part retries
\item LMDB by default, deprecation of Sled
\item New layout computation algorithm
\frametitle{Layout computation}
\includegraphics[width=\linewidth, trim=0 0 0 -4cm]{../assets/screenshots/garage_status_0.9_prod_zonehl.png}
Garage stores replicas on different zones when possible
\frametitle{What a ``layout'' is}
\textbf{A layout is a precomputed index table:}
\textbf{Partition} & \textbf{Node 1} & \textbf{Node 2} & \textbf{Node 3} \\
Partition 0 & Io (jupiter) & Drosera (atuin) & Courgette (neptune) \\
Partition 1 & Datura (atuin) & Courgette (neptune) & Io (jupiter) \\
Partition 2 & Io (jupiter) & Celeri (neptune) & Drosera (atuin) \\
\hspace{1em}$\vdots$ & \hspace{1em}$\vdots$ & \hspace{1em}$\vdots$ & \hspace{1em}$\vdots$ \\
Partition 255 & Concombre (neptune) & Io (jupiter) & Drosera (atuin) \\
The index table is built centrally using an optimal algorithm,\\
then propagated to all nodes
Oulamara, M., \& Auvolat, A. (2023). \emph{An algorithm for geo-distributed and redundant storage in Garage}.\\ arXiv preprint arXiv:2302.13798.
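
A minimal Python sketch of how such an index table can be consulted. The rows are the ones shown in the table above, with each node's zone; the elided rows are left out. Picking the partition from the first byte of a SHA-256 hash is an assumption made for illustration, and all function and variable names are hypothetical, not Garage's actual code.

```python
import hashlib

# Rows copied from the table above as (node, zone) pairs.
# Partitions 3..254 are elided on the slide and are therefore absent here.
LAYOUT = {
    0: [("Io", "jupiter"), ("Drosera", "atuin"), ("Courgette", "neptune")],
    1: [("Datura", "atuin"), ("Courgette", "neptune"), ("Io", "jupiter")],
    2: [("Io", "jupiter"), ("Celeri", "neptune"), ("Drosera", "atuin")],
    255: [("Concombre", "neptune"), ("Io", "jupiter"), ("Drosera", "atuin")],
}


def partition_of(key):
    """Map a key to one of 256 partitions.

    Assumption for this sketch: the partition is the first byte of the
    SHA-256 hash of the key, which always lies in 0..255.
    """
    return hashlib.sha256(key.encode("utf-8")).digest()[0]


def nodes_for_partition(partition):
    """Look up the nodes that store a partition in the precomputed table."""
    return [node for node, _zone in LAYOUT[partition]]


def zones_are_distinct(partition):
    """Check that every replica of a partition sits in a different zone."""
    zones = [zone for _node, zone in LAYOUT[partition]]
    return len(set(zones)) == len(zones)
```

Because the table is precomputed and identical on all nodes, a lookup is a hash followed by one table access; no node needs to ask another where a key lives.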
% ====================== v0.10.0 ===============================
\frametitle{October 2023 - Garage v0.10.0 beta}
Focus on \underline{consistency}
\item Fix consistency issues when reshuffling data
\frametitle{Working with weak consistency}
Not using consensus limits us to the following:
\item<2-> \textbf{Conflict-free replicated data types} (CRDT)\\
{\footnotesize Non-transactional key-value stores such as S3 are equivalent to a simple CRDT:\\
a map of \textbf{last-writer-wins registers} (each key is its own CRDT)}
\item<3-> \textbf{Read-after-write consistency}\\
{\footnotesize Can be implemented using quorums on read and write operations}
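
A minimal Python sketch of a map of last-writer-wins registers, to make the CRDT idea concrete. Representing a register as a (timestamp, value) pair and breaking timestamp ties by comparing values are assumptions made for this illustration; the function names are hypothetical.

```python
def merge_register(a, b):
    """Join of two last-writer-wins registers.

    Each register is a (timestamp, value) pair. The pair with the larger
    timestamp wins; ties are broken by comparing the values, so the
    result does not depend on the order of the arguments.
    """
    return max(a, b)


def merge_map(m1, m2):
    """Join of two maps of LWW registers: each key is merged on its own."""
    merged = dict(m1)
    for key, register in m2.items():
        if key in merged:
            merged[key] = merge_register(merged[key], register)
        else:
            merged[key] = register
    return merged


# Two replicas that saw different writes.
replica_a = {"index.html": (1, "old page")}
replica_b = {"index.html": (2, "new page"), "img/logo.svg": (1, "logo")}

# Merging in either order gives the same state.
assert merge_map(replica_a, replica_b) == merge_map(replica_b, replica_a)
```

The merge is commutative, associative and idempotent, so replicas can exchange states in any order, any number of times, and still converge.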
\frametitle{CRDT read-after-write consistency using quorums}
\textbf{Property:} If node $A$ did an operation $write(x)$ and received an OK response,\\
\hspace{2cm} and node $B$ starts an operation $read()$ after $A$ received OK,\\
\hspace{2cm} then $B$ will read a value $x' \sqsupseteq x$.
\textbf{Algorithm $write(x)$:}
\item Broadcast $write(x)$ to all nodes
\item Wait for $k > n/2$ nodes to reply OK
\item Return OK
\textbf{Algorithm $read()$:}
\item Broadcast $read()$ to all nodes
\item Wait for $k > n/2$ nodes to reply\\
with values $x_1, \dots, x_k$
\item Return $x_1 \sqcup \dots \sqcup x_k$
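
\textbf{Why this works (sketch).} Assume each node merges every received $write(x)$ into its stored value with $\sqcup$, so that stored values only grow in the order $\sqsubseteq$. Let $W$ be the set of nodes that replied OK to $write(x)$ and $R$ the set of nodes that replied to $read()$, with $|W| = |R| = k$. Then
\[
|W| + |R| = 2k > n \quad\Longrightarrow\quad W \cap R \neq \emptyset .
\]
Take any node $i \in W \cap R$. It acknowledged the write before the read started, so its reply satisfies $x_i \sqsupseteq x$, and therefore
\[
x' = x_1 \sqcup \dots \sqcup x_k \;\sqsupseteq\; x_i \;\sqsupseteq\; x .
\]
For example, with $n = 3$ and $k = 2$ we get $2k = 4 > 3$: any two quorums share at least one node.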
\frametitle{A hard problem: layout changes}
\item We rely on quorums $k > n/2$ within each partition:\\
$$n=3,~~~~~~~k\ge 2$$
\item<2-> When rebalancing, the set of nodes responsible for a partition can change:\\
$$\{A, B, C\} \to \{A, D, E\}$$
\item<3-> During the rebalancing, $D$ and $E$ don't yet have the data,\\
~~~~~~~~~~~~~~~~~~~and $B$ and $C$ want to get rid of the data to free up space\\
$\to$ risk of inconsistency, \textbf{how to coordinate?}
\frametitle{Handling layout changes without losing consistency}
\item \textbf{Solution:}\\
\item keep track of data transfer to new nodes
\item use multiple write quorums\\
(new nodes + old nodes\\
while data transfer is in progress)
\item switch reads to new nodes\\
only once the copy is finished
\item \textbf{Implemented} in v0.10
\item \textbf{Validated} with Jepsen testing
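
A minimal Python sketch of the quorum rules listed above, using the example change from $\{A, B, C\}$ to $\{A, D, E\}$. It models only the decision logic (which acknowledgements are enough, which nodes serve reads); the function names and the boolean transfer flag are hypothetical and do not reflect Garage's actual implementation.

```python
def quorum(n):
    """Smallest k with k > n/2."""
    return n // 2 + 1


def write_is_ok(acks, old_nodes, new_nodes, transfer_done):
    """Decide whether a write can return OK.

    While the data transfer is in progress, the write needs a quorum in
    BOTH the old and the new node set. Once the transfer is finished,
    a quorum among the new nodes is enough.
    """
    required_sets = [new_nodes] if transfer_done else [old_nodes, new_nodes]
    return all(len(acks & nodes) >= quorum(len(nodes)) for nodes in required_sets)


def read_set(old_nodes, new_nodes, transfer_done):
    """Reads move to the new nodes only once the copy is finished."""
    return new_nodes if transfer_done else old_nodes


old = {"A", "B", "C"}
new = {"A", "D", "E"}

# During the transfer, a quorum of old nodes alone is not enough.
print(write_is_ok({"A", "B"}, old, new, transfer_done=False))
# Acknowledgements from A, B and D form a quorum in both sets.
print(write_is_ok({"A", "B", "D"}, old, new, transfer_done=False))
```

Requiring both quorums during the transfer means a read quorum taken from either node set still intersects every acknowledged write.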
{\footnotesize Garage v0.9.0}
{\footnotesize Garage v0.10 beta}
% ====================== v1.0 ===============================
\frametitle{Towards v1.0}
Focus on \underline{security \& stability}
\item \textbf{Security audit} in progress by Radically Open Security
\item Misc. S3 features (SSE-C, ...) and compatibility fixes
\item Improve UX
\item Fix bugs
% ======================================== OPERATING
% ======================================== OPERATING
% ======================================== OPERATING
\section{Operating big Garage clusters}
\frametitle{Operating Garage}
\frametitle{Garage's architecture}
\frametitle{Digging deeper}
\frametitle{Potential limitations and bottlenecks}
\item Global:
\item Max. $\sim$100 nodes per cluster (excluding gateways)
\item Metadata:
\item One big bucket = bottleneck, object list on 3 nodes only
\item Block manager:
\item Lots of small files on disk
\item Processing the resync queue can be slow
\frametitle{Deployment advice for very large clusters}
\item Metadata storage:
\item ZFS mirror (x2) on fast NVMe
\item Use LMDB storage engine
\item Data block storage:
\item Use Garage's native multi-HDD support
\item XFS on individual drives
\item Increase block size (1 MB $\to$ 10 MB, requires more RAM and good networking)
\item Tune \texttt{resync-tranquility} and \texttt{resync-worker-count} dynamically
\item Other:
\item Split data over several buckets
\item Use fewer than 100 storage nodes
\item Use gateway nodes
Current deployments: $< 10$ TB; we have little experience with larger ones
% ======================================== END
% ======================================== END
% ======================================== END
\frametitle{Where to find us}
\texttt{\} on Matrix
%% vim: set ts=4 sw=4 tw=0 noet spelllang=en :