garage/src/model/s3/object_table.rs

677 lines
18 KiB
Rust
Raw Normal View History

use serde::{Deserialize, Serialize};
2020-07-08 15:34:37 +00:00
use std::sync::Arc;
2020-04-09 15:32:28 +00:00
Abstract database behind generic interface and implement alternative drivers (#322) - [x] Design interface - [x] Implement Sled backend - [x] Re-implement the SledCountedTree hack ~~on Sled backend~~ on all backends (i.e. over the abstraction) - [x] Convert Garage code to use generic interface - [x] Proof-read converted Garage code - [ ] Test everything well - [x] Implement sqlite backend - [x] Implement LMDB backend - [ ] (Implement Persy backend?) - [ ] (Implement other backends? (like RocksDB, ...)) - [x] Implement backend choice in config file and garage server module - [x] Add CLI for converting between DB formats - Exploit the new interface to put more things in transactions - [x] `.updated()` trigger on Garage tables Fix #284 **Bugs** - [x] When exporting sqlite, trees iterate empty?? - [x] LMDB doesn't work **Known issues for various back-ends** - Sled: - Eats all my RAM and also all my disk space - `.len()` has to traverse the whole table - Is actually quite slow on some operations - And is actually pretty bad code... - Sqlite: - Requires a lock to be taken on all operations. The lock is also taken when iterating on a table with `.iter()`, and the lock isn't released until the iterator is dropped. This means that we must be VERY carefull to not do anything else inside a `.iter()` loop or else we will have a deadlock! Most such cases have been eliminated from the Garage codebase, but there might still be some that remain. If your Garage-over-Sqlite seems to hang/freeze, this is the reason. - (adapter uses a bunch of unsafe code) - Heed (LMDB): - Not suited for 32-bit machines as it has to map the whole DB in memory. - (adpater uses a tiny bit of unsafe code) **My recommendation:** avoid 32-bit machines and use LMDB as much as possible. **Converting databases** is actually quite easy. For example from Sled to LMDB: ```bash cd src/db cargo run --features cli --bin convert -- -i path/to/garage/meta/db -a sled -o path/to/garage/meta/db.lmdb -b lmdb ``` Then, just add this to your `config.toml`: ```toml db_engine = "lmdb" ``` Co-authored-by: Alex Auvolat <alex@adnab.me> Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/322 Co-authored-by: Alex <alex@adnab.me> Co-committed-by: Alex <alex@adnab.me>
2022-06-08 08:01:44 +00:00
use garage_db as db;
2020-04-24 10:10:01 +00:00
use garage_util::data::*;
2020-04-23 17:05:46 +00:00
use garage_table::crdt::*;
2021-03-26 18:41:46 +00:00
use garage_table::replication::TableShardedReplication;
2020-04-24 10:10:01 +00:00
use garage_table::*;
use crate::index_counter::*;
2023-04-27 15:57:54 +00:00
use crate::s3::mpu_table::*;
First implementation of K2V (#293) **Specification:** View spec at [this URL](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/k2v/doc/drafts/k2v-spec.md) - [x] Specify the structure of K2V triples - [x] Specify the DVVS format used for causality detection - [x] Specify the K2V index (just a counter of number of values per partition key) - [x] Specify single-item endpoints: ReadItem, InsertItem, DeleteItem - [x] Specify index endpoint: ReadIndex - [x] Specify multi-item endpoints: InsertBatch, ReadBatch, DeleteBatch - [x] Move to JSON objects instead of tuples - [x] Specify endpoints for polling for updates on single values (PollItem) **Implementation:** - [x] Table for K2V items, causal contexts - [x] Indexing mechanism and table for K2V index - [x] Make API handlers a bit more generic - [x] K2V API endpoint - [x] K2V API router - [x] ReadItem - [x] InsertItem - [x] DeleteItem - [x] PollItem - [x] ReadIndex - [x] InsertBatch - [x] ReadBatch - [x] DeleteBatch **Testing:** - [x] Just a simple Python script that does some requests to check visually that things are going right (does not contain parsing of results or assertions on returned values) - [x] Actual tests: - [x] Adapt testing framework - [x] Simple test with InsertItem + ReadItem - [x] Test with several Insert/Read/DeleteItem + ReadIndex - [x] Test all combinations of return formats for ReadItem - [x] Test with ReadBatch, InsertBatch, DeleteBatch - [x] Test with PollItem - [x] Test error codes - [ ] Fix most broken stuff - [x] test PollItem broken randomly - [x] when invalid causality tokens are given, errors should be 4xx not 5xx **Improvements:** - [x] Descending range queries - [x] Specify - [x] Implement - [x] Add test - [x] Batch updates to index counter - [x] Put K2V behind `k2v` feature flag Co-authored-by: Alex Auvolat <alex@adnab.me> Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/293 Co-authored-by: Alex <alex@adnab.me> Co-committed-by: Alex <alex@adnab.me>
2022-05-10 11:16:57 +00:00
use crate::s3::version_table::*;
2020-04-09 15:32:28 +00:00
pub const OBJECTS: &str = "objects";
pub const UNFINISHED_UPLOADS: &str = "unfinished_uploads";
pub const BYTES: &str = "bytes";
2023-01-03 13:44:47 +00:00
mod v05 {
use garage_util::data::{Hash, Uuid};
use serde::{Deserialize, Serialize};
use std::collections::BTreeMap;
/// An object
#[derive(PartialEq, Eq, Clone, Debug, Serialize, Deserialize)]
pub struct Object {
/// The bucket in which the object is stored, used as partition key
pub bucket: String,
/// The key at which the object is stored in its bucket, used as sorting key
pub key: String,
/// The list of currenty stored versions of the object
pub(super) versions: Vec<ObjectVersion>,
}
/// Informations about a version of an object
#[derive(PartialEq, Eq, Clone, Debug, Serialize, Deserialize)]
pub struct ObjectVersion {
/// Id of the version
pub uuid: Uuid,
/// Timestamp of when the object was created
pub timestamp: u64,
/// State of the version
pub state: ObjectVersionState,
}
/// State of an object version
#[derive(PartialEq, Eq, Clone, Debug, Serialize, Deserialize)]
pub enum ObjectVersionState {
/// The version is being received
Uploading(ObjectVersionHeaders),
/// The version is fully received
Complete(ObjectVersionData),
/// The version uploaded containded errors or the upload was explicitly aborted
Aborted,
}
/// Data stored in object version
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Debug, Serialize, Deserialize)]
pub enum ObjectVersionData {
/// The object was deleted, this Version is a tombstone to mark it as such
DeleteMarker,
/// The object is short, it's stored inlined
Inline(ObjectVersionMeta, #[serde(with = "serde_bytes")] Vec<u8>),
/// The object is not short, Hash of first block is stored here, next segments hashes are
/// stored in the version table
FirstBlock(ObjectVersionMeta, Hash),
}
/// Metadata about the object version
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Debug, Serialize, Deserialize)]
pub struct ObjectVersionMeta {
/// Headers to send to the client
pub headers: ObjectVersionHeaders,
/// Size of the object
pub size: u64,
/// etag of the object
pub etag: String,
}
/// Additional headers for an object
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Debug, Serialize, Deserialize)]
pub struct ObjectVersionHeaders {
/// Content type of the object
pub content_type: String,
/// Any other http headers to send
pub other: BTreeMap<String, String>,
}
impl garage_util::migrate::InitialFormat for Object {}
}
mod v08 {
use garage_util::data::Uuid;
use serde::{Deserialize, Serialize};
use super::v05;
2020-04-09 21:45:07 +00:00
2023-01-03 13:44:47 +00:00
pub use v05::{
ObjectVersion, ObjectVersionData, ObjectVersionHeaders, ObjectVersionMeta,
ObjectVersionState,
};
/// An object
#[derive(PartialEq, Eq, Clone, Debug, Serialize, Deserialize)]
pub struct Object {
/// The bucket in which the object is stored, used as partition key
pub bucket_id: Uuid,
/// The key at which the object is stored in its bucket, used as sorting key
pub key: String,
/// The list of currenty stored versions of the object
pub(super) versions: Vec<ObjectVersion>,
}
2020-04-09 15:32:28 +00:00
2023-01-03 13:44:47 +00:00
impl garage_util::migrate::Migrate for Object {
type Previous = v05::Object;
fn migrate(old: v05::Object) -> Object {
use garage_util::data::blake2sum;
Object {
bucket_id: blake2sum(old.bucket.as_bytes()),
key: old.key,
versions: old.versions,
}
}
}
}
2023-04-27 15:57:54 +00:00
mod v09 {
use garage_util::data::Uuid;
use serde::{Deserialize, Serialize};
use super::v08;
pub use v08::{ObjectVersionData, ObjectVersionHeaders, ObjectVersionMeta};
/// An object
#[derive(PartialEq, Eq, Clone, Debug, Serialize, Deserialize)]
pub struct Object {
/// The bucket in which the object is stored, used as partition key
pub bucket_id: Uuid,
/// The key at which the object is stored in its bucket, used as sorting key
pub key: String,
/// The list of currenty stored versions of the object
pub(super) versions: Vec<ObjectVersion>,
}
/// Informations about a version of an object
#[derive(PartialEq, Eq, Clone, Debug, Serialize, Deserialize)]
pub struct ObjectVersion {
/// Id of the version
pub uuid: Uuid,
/// Timestamp of when the object was created
pub timestamp: u64,
/// State of the version
pub state: ObjectVersionState,
}
/// State of an object version
#[derive(PartialEq, Eq, Clone, Debug, Serialize, Deserialize)]
pub enum ObjectVersionState {
/// The version is being received
Uploading {
/// Indicates whether this is a multipart upload
multipart: bool,
/// Headers to be included in the final object
headers: ObjectVersionHeaders,
},
/// The version is fully received
Complete(ObjectVersionData),
/// The version uploaded containded errors or the upload was explicitly aborted
Aborted,
}
impl garage_util::migrate::Migrate for Object {
const VERSION_MARKER: &'static [u8] = b"G09s3o";
type Previous = v08::Object;
fn migrate(old: v08::Object) -> Object {
let versions = old
.versions
.into_iter()
.map(|x| ObjectVersion {
uuid: x.uuid,
timestamp: x.timestamp,
state: match x.state {
v08::ObjectVersionState::Uploading(h) => ObjectVersionState::Uploading {
multipart: false,
headers: h,
},
v08::ObjectVersionState::Complete(d) => ObjectVersionState::Complete(d),
v08::ObjectVersionState::Aborted => ObjectVersionState::Aborted,
},
})
.collect();
Object {
bucket_id: old.bucket_id,
key: old.key,
versions,
}
}
}
}
2024-02-23 15:49:50 +00:00
mod v010 {
use garage_util::data::{Hash, Uuid};
use serde::{Deserialize, Serialize};
use super::v09;
pub use v09::ObjectVersionHeaders;
/// An object
#[derive(PartialEq, Eq, Clone, Debug, Serialize, Deserialize)]
pub struct Object {
/// The bucket in which the object is stored, used as partition key
pub bucket_id: Uuid,
/// The key at which the object is stored in its bucket, used as sorting key
pub key: String,
/// The list of currenty stored versions of the object
pub(super) versions: Vec<ObjectVersion>,
}
/// Informations about a version of an object
#[derive(PartialEq, Eq, Clone, Debug, Serialize, Deserialize)]
pub struct ObjectVersion {
/// Id of the version
pub uuid: Uuid,
/// Timestamp of when the object was created
pub timestamp: u64,
/// State of the version
pub state: ObjectVersionState,
}
/// State of an object version
#[derive(PartialEq, Eq, Clone, Debug, Serialize, Deserialize)]
pub enum ObjectVersionState {
/// The version is being received
Uploading {
/// Indicates whether this is a multipart upload
multipart: bool,
/// Encryption params + headers to be included in the final object
encryption: ObjectVersionEncryption,
},
/// The version is fully received
Complete(ObjectVersionData),
/// The version uploaded containded errors or the upload was explicitly aborted
Aborted,
}
/// Data stored in object version
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Debug, Serialize, Deserialize)]
pub enum ObjectVersionData {
/// The object was deleted, this Version is a tombstone to mark it as such
DeleteMarker,
/// The object is short, it's stored inlined.
/// It is never compressed. For encrypted objects, it is encrypted using
/// AES256-GCM, like the encrypted headers.
Inline(ObjectVersionMeta, #[serde(with = "serde_bytes")] Vec<u8>),
/// The object is not short, Hash of first block is stored here, next segments hashes are
/// stored in the version table
FirstBlock(ObjectVersionMeta, Hash),
}
/// Metadata about the object version
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Debug, Serialize, Deserialize)]
pub struct ObjectVersionMeta {
/// Size of the object. If object is encrypted/compressed,
/// this is always the size of the unencrypted/uncompressed data
pub size: u64,
/// etag of the object
pub etag: String,
/// Encryption params + headers (encrypted or plaintext)
pub encryption: ObjectVersionEncryption,
}
/// Encryption information + metadata
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Debug, Serialize, Deserialize)]
pub enum ObjectVersionEncryption {
SseC {
/// Encrypted serialized ObjectVersionHeaders struct.
/// This is never compressed, just encrypted using AES256-GCM.
#[serde(with = "serde_bytes")]
headers: Vec<u8>,
/// Whether data blocks are compressed in addition to being encrypted
/// (compression happens before encryption, whereas for non-encrypted
/// objects, compression is handled at the level of the block manager)
compressed: bool,
},
Plaintext {
/// Plain-text headers
headers: ObjectVersionHeaders,
},
}
impl garage_util::migrate::Migrate for Object {
const VERSION_MARKER: &'static [u8] = b"G010s3ob";
type Previous = v09::Object;
fn migrate(old: v09::Object) -> Object {
Object {
bucket_id: old.bucket_id,
key: old.key,
versions: old.versions.into_iter().map(migrate_version).collect(),
}
}
}
fn migrate_version(old: v09::ObjectVersion) -> ObjectVersion {
ObjectVersion {
uuid: old.uuid,
timestamp: old.timestamp,
state: match old.state {
v09::ObjectVersionState::Uploading { multipart, headers } => {
ObjectVersionState::Uploading {
multipart,
encryption: migrate_headers(headers),
}
}
v09::ObjectVersionState::Complete(d) => {
ObjectVersionState::Complete(migrate_data(d))
}
v09::ObjectVersionState::Aborted => ObjectVersionState::Aborted,
},
}
}
fn migrate_data(old: v09::ObjectVersionData) -> ObjectVersionData {
match old {
v09::ObjectVersionData::DeleteMarker => ObjectVersionData::DeleteMarker,
v09::ObjectVersionData::Inline(meta, data) => {
ObjectVersionData::Inline(migrate_meta(meta), data)
}
v09::ObjectVersionData::FirstBlock(meta, fb) => {
ObjectVersionData::FirstBlock(migrate_meta(meta), fb)
}
}
}
fn migrate_meta(old: v09::ObjectVersionMeta) -> ObjectVersionMeta {
ObjectVersionMeta {
size: old.size,
etag: old.etag,
encryption: migrate_headers(old.headers),
}
}
fn migrate_headers(old: v09::ObjectVersionHeaders) -> ObjectVersionEncryption {
ObjectVersionEncryption::Plaintext { headers: old }
}
// Since ObjectVersionHeaders can now be serialized independently, for the
// purpose of being encrypted, we need it to support migrations on its own
// as well.
impl garage_util::migrate::InitialFormat for ObjectVersionHeaders {
const VERSION_MARKER: &'static [u8] = b"G010s3oh";
}
}
pub use v010::*;
2023-01-03 13:44:47 +00:00
impl Object {
2021-04-08 13:13:02 +00:00
/// Initialize an Object struct from parts
2021-12-14 12:55:11 +00:00
pub fn new(bucket_id: Uuid, key: String, versions: Vec<ObjectVersion>) -> Self {
let mut ret = Self {
2021-12-14 12:55:11 +00:00
bucket_id,
key,
versions: vec![],
};
for v in versions {
ret.add_version(v)
.expect("Twice the same ObjectVersion in Object constructor");
}
ret
}
2021-03-26 20:53:28 +00:00
/// Adds a version if it wasn't already present
2021-04-23 19:57:32 +00:00
#[allow(clippy::result_unit_err)]
pub fn add_version(&mut self, new: ObjectVersion) -> Result<(), ()> {
match self
.versions
.binary_search_by(|v| v.cmp_key().cmp(&new.cmp_key()))
{
Err(i) => {
self.versions.insert(i, new);
Ok(())
}
Ok(_) => Err(()),
}
}
2021-03-26 20:53:28 +00:00
2021-04-06 03:25:28 +00:00
/// Get a list of currently stored versions of `Object`
pub fn versions(&self) -> &[ObjectVersion] {
&self.versions[..]
}
2020-04-09 15:32:28 +00:00
}
2021-05-02 21:13:08 +00:00
impl Crdt for ObjectVersionState {
fn merge(&mut self, other: &Self) {
2020-04-26 18:55:13 +00:00
use ObjectVersionState::*;
2020-07-08 15:34:37 +00:00
match other {
Aborted => {
*self = Aborted;
}
Complete(b) => match self {
Aborted => {}
Complete(a) => {
a.merge(b);
}
2023-04-27 15:57:54 +00:00
Uploading { .. } => {
2020-07-08 15:34:37 +00:00
*self = Complete(b.clone());
}
},
2023-04-27 15:57:54 +00:00
Uploading { .. } => {}
2020-07-08 15:34:37 +00:00
}
2020-04-26 18:55:13 +00:00
}
}
2021-05-02 21:13:08 +00:00
impl AutoCrdt for ObjectVersionData {
const WARN_IF_DIFFERENT: bool = true;
}
impl ObjectVersion {
2021-05-02 21:13:08 +00:00
fn cmp_key(&self) -> (u64, Uuid) {
2020-04-26 18:55:13 +00:00
(self.timestamp, self.uuid)
}
2021-03-26 20:53:28 +00:00
/// Is the object version currently being uploaded
2023-04-27 15:57:54 +00:00
///
/// matches only multipart uploads if check_multipart is Some(true)
/// matches only non-multipart uploads if check_multipart is Some(false)
/// matches both if check_multipart is None
pub fn is_uploading(&self, check_multipart: Option<bool>) -> bool {
match &self.state {
ObjectVersionState::Uploading { multipart, .. } => {
check_multipart.map(|x| x == *multipart).unwrap_or(true)
}
_ => false,
}
2020-07-08 15:34:37 +00:00
}
2021-03-26 20:53:28 +00:00
/// Is the object version completely received
2020-04-26 18:55:13 +00:00
pub fn is_complete(&self) -> bool {
2021-04-23 19:57:32 +00:00
matches!(self.state, ObjectVersionState::Complete(_))
2020-04-26 18:55:13 +00:00
}
2021-03-26 20:53:28 +00:00
2021-04-06 03:25:28 +00:00
/// Is the object version available (received and not a tombstone)
2020-04-26 18:55:13 +00:00
pub fn is_data(&self) -> bool {
2020-07-08 15:34:37 +00:00
match self.state {
ObjectVersionState::Complete(ObjectVersionData::DeleteMarker) => false,
ObjectVersionState::Complete(_) => true,
_ => false,
}
}
}
2021-12-14 12:55:11 +00:00
impl Entry<Uuid, String> for Object {
fn partition_key(&self) -> &Uuid {
&self.bucket_id
2020-04-09 15:32:28 +00:00
}
fn sort_key(&self) -> &String {
&self.key
}
2021-03-16 15:51:15 +00:00
fn is_tombstone(&self) -> bool {
self.versions.len() == 1
&& self.versions[0].state
== ObjectVersionState::Complete(ObjectVersionData::DeleteMarker)
2021-03-16 15:51:15 +00:00
}
}
2020-04-09 15:32:28 +00:00
2021-05-02 21:13:08 +00:00
impl Crdt for Object {
2020-04-09 21:45:07 +00:00
fn merge(&mut self, other: &Self) {
// Merge versions from other into here
2020-04-09 15:32:28 +00:00
for other_v in other.versions.iter() {
match self
.versions
.binary_search_by(|v| v.cmp_key().cmp(&other_v.cmp_key()))
{
2020-04-09 15:32:28 +00:00
Ok(i) => {
2020-07-08 15:34:37 +00:00
self.versions[i].state.merge(&other_v.state);
2020-04-09 15:32:28 +00:00
}
Err(i) => {
self.versions.insert(i, other_v.clone());
}
}
}
// Remove versions which are obsolete, i.e. those that come
// before the last version which .is_complete().
let last_complete = self
.versions
.iter()
.enumerate()
.rev()
2021-04-23 19:57:32 +00:00
.find(|(_, v)| v.is_complete())
2020-04-09 15:32:28 +00:00
.map(|(vi, _)| vi);
if let Some(last_vi) = last_complete {
self.versions = self.versions.drain(last_vi..).collect::<Vec<_>>();
}
}
}
2020-04-09 21:45:07 +00:00
pub struct ObjectTable {
pub version_table: Arc<Table<VersionTable, TableShardedReplication>>,
2023-04-27 15:57:54 +00:00
pub mpu_table: Arc<Table<MultipartUploadTable, TableShardedReplication>>,
pub object_counter_table: Arc<IndexCounter<Object>>,
2020-04-09 21:45:07 +00:00
}
Implement ListMultipartUploads (#171) Implement ListMultipartUploads, also refactor ListObjects and ListObjectsV2. It took me some times as I wanted to propose the following things: - Using an iterator instead of the loop+goto pattern. I find it easier to read and it should enable some optimizations. For example, when consuming keys of a common prefix, we do many [redundant checks](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/api/s3_list.rs#L125-L156) while the only thing to do is to [check if the following key is still part of the common prefix](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/feature/s3-multipart-compat/src/api/s3_list.rs#L476). - Try to name things (see ExtractionResult and RangeBegin enums) and to separate concerns (see ListQuery and Accumulator) - An IO closure to make unit tests possibles. - Unit tests, to track regressions and document how to interact with the code - Integration tests with `s3api`. In the future, I would like to move them in Rust with the aws rust SDK. Merging of the logic of ListMultipartUploads and ListObjects was not a goal but a consequence of the previous modifications. Some points that we might want to discuss: - ListObjectsV1, when using pagination and delimiters, has a weird behavior (it lists multiple times the same prefix) with `aws s3api` due to the fact that it can not use our optimization to skip the whole prefix. It is independant from my refactor and can be tested with the commented `s3api` tests in `test-smoke.sh`. It probably has the same weird behavior on the official AWS S3 implementation. - Considering ListMultipartUploads, I had to "abuse" upload id marker to support prefix skipping. I send an `upload-id-marker` with the hardcoded value `include` to emulate your "including" token. - Some ways to test ListMultipartUploads with existing software (my tests are limited to s3api for now). Co-authored-by: Quentin Dufour <quentin@deuxfleurs.fr> Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/171 Co-authored-by: Quentin <quentin@dufour.io> Co-committed-by: Quentin <quentin@dufour.io>
2022-01-12 18:04:55 +00:00
#[derive(Clone, Copy, Debug, Serialize, Deserialize)]
pub enum ObjectFilter {
2023-04-27 15:57:54 +00:00
/// Is the object version available (received and not a tombstone)
Implement ListMultipartUploads (#171) Implement ListMultipartUploads, also refactor ListObjects and ListObjectsV2. It took me some times as I wanted to propose the following things: - Using an iterator instead of the loop+goto pattern. I find it easier to read and it should enable some optimizations. For example, when consuming keys of a common prefix, we do many [redundant checks](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/api/s3_list.rs#L125-L156) while the only thing to do is to [check if the following key is still part of the common prefix](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/feature/s3-multipart-compat/src/api/s3_list.rs#L476). - Try to name things (see ExtractionResult and RangeBegin enums) and to separate concerns (see ListQuery and Accumulator) - An IO closure to make unit tests possibles. - Unit tests, to track regressions and document how to interact with the code - Integration tests with `s3api`. In the future, I would like to move them in Rust with the aws rust SDK. Merging of the logic of ListMultipartUploads and ListObjects was not a goal but a consequence of the previous modifications. Some points that we might want to discuss: - ListObjectsV1, when using pagination and delimiters, has a weird behavior (it lists multiple times the same prefix) with `aws s3api` due to the fact that it can not use our optimization to skip the whole prefix. It is independant from my refactor and can be tested with the commented `s3api` tests in `test-smoke.sh`. It probably has the same weird behavior on the official AWS S3 implementation. - Considering ListMultipartUploads, I had to "abuse" upload id marker to support prefix skipping. I send an `upload-id-marker` with the hardcoded value `include` to emulate your "including" token. - Some ways to test ListMultipartUploads with existing software (my tests are limited to s3api for now). Co-authored-by: Quentin Dufour <quentin@deuxfleurs.fr> Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/171 Co-authored-by: Quentin <quentin@dufour.io> Co-committed-by: Quentin <quentin@dufour.io>
2022-01-12 18:04:55 +00:00
IsData,
2023-04-27 15:57:54 +00:00
/// Is the object version currently being uploaded
///
/// matches only multipart uploads if check_multipart is Some(true)
/// matches only non-multipart uploads if check_multipart is Some(false)
/// matches both if check_multipart is None
IsUploading { check_multipart: Option<bool> },
Implement ListMultipartUploads (#171) Implement ListMultipartUploads, also refactor ListObjects and ListObjectsV2. It took me some times as I wanted to propose the following things: - Using an iterator instead of the loop+goto pattern. I find it easier to read and it should enable some optimizations. For example, when consuming keys of a common prefix, we do many [redundant checks](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/api/s3_list.rs#L125-L156) while the only thing to do is to [check if the following key is still part of the common prefix](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/feature/s3-multipart-compat/src/api/s3_list.rs#L476). - Try to name things (see ExtractionResult and RangeBegin enums) and to separate concerns (see ListQuery and Accumulator) - An IO closure to make unit tests possibles. - Unit tests, to track regressions and document how to interact with the code - Integration tests with `s3api`. In the future, I would like to move them in Rust with the aws rust SDK. Merging of the logic of ListMultipartUploads and ListObjects was not a goal but a consequence of the previous modifications. Some points that we might want to discuss: - ListObjectsV1, when using pagination and delimiters, has a weird behavior (it lists multiple times the same prefix) with `aws s3api` due to the fact that it can not use our optimization to skip the whole prefix. It is independant from my refactor and can be tested with the commented `s3api` tests in `test-smoke.sh`. It probably has the same weird behavior on the official AWS S3 implementation. - Considering ListMultipartUploads, I had to "abuse" upload id marker to support prefix skipping. I send an `upload-id-marker` with the hardcoded value `include` to emulate your "including" token. - Some ways to test ListMultipartUploads with existing software (my tests are limited to s3api for now). Co-authored-by: Quentin Dufour <quentin@deuxfleurs.fr> Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/171 Co-authored-by: Quentin <quentin@dufour.io> Co-committed-by: Quentin <quentin@dufour.io>
2022-01-12 18:04:55 +00:00
}
2020-04-12 20:24:53 +00:00
impl TableSchema for ObjectTable {
2021-12-14 11:34:01 +00:00
const TABLE_NAME: &'static str = "object";
2021-12-14 12:55:11 +00:00
type P = Uuid;
2020-04-09 15:32:28 +00:00
type S = String;
type E = Object;
Implement ListMultipartUploads (#171) Implement ListMultipartUploads, also refactor ListObjects and ListObjectsV2. It took me some times as I wanted to propose the following things: - Using an iterator instead of the loop+goto pattern. I find it easier to read and it should enable some optimizations. For example, when consuming keys of a common prefix, we do many [redundant checks](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/api/s3_list.rs#L125-L156) while the only thing to do is to [check if the following key is still part of the common prefix](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/feature/s3-multipart-compat/src/api/s3_list.rs#L476). - Try to name things (see ExtractionResult and RangeBegin enums) and to separate concerns (see ListQuery and Accumulator) - An IO closure to make unit tests possibles. - Unit tests, to track regressions and document how to interact with the code - Integration tests with `s3api`. In the future, I would like to move them in Rust with the aws rust SDK. Merging of the logic of ListMultipartUploads and ListObjects was not a goal but a consequence of the previous modifications. Some points that we might want to discuss: - ListObjectsV1, when using pagination and delimiters, has a weird behavior (it lists multiple times the same prefix) with `aws s3api` due to the fact that it can not use our optimization to skip the whole prefix. It is independant from my refactor and can be tested with the commented `s3api` tests in `test-smoke.sh`. It probably has the same weird behavior on the official AWS S3 implementation. - Considering ListMultipartUploads, I had to "abuse" upload id marker to support prefix skipping. I send an `upload-id-marker` with the hardcoded value `include` to emulate your "including" token. - Some ways to test ListMultipartUploads with existing software (my tests are limited to s3api for now). Co-authored-by: Quentin Dufour <quentin@deuxfleurs.fr> Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/171 Co-authored-by: Quentin <quentin@dufour.io> Co-committed-by: Quentin <quentin@dufour.io>
2022-01-12 18:04:55 +00:00
type Filter = ObjectFilter;
2020-04-09 15:32:28 +00:00
Abstract database behind generic interface and implement alternative drivers (#322) - [x] Design interface - [x] Implement Sled backend - [x] Re-implement the SledCountedTree hack ~~on Sled backend~~ on all backends (i.e. over the abstraction) - [x] Convert Garage code to use generic interface - [x] Proof-read converted Garage code - [ ] Test everything well - [x] Implement sqlite backend - [x] Implement LMDB backend - [ ] (Implement Persy backend?) - [ ] (Implement other backends? (like RocksDB, ...)) - [x] Implement backend choice in config file and garage server module - [x] Add CLI for converting between DB formats - Exploit the new interface to put more things in transactions - [x] `.updated()` trigger on Garage tables Fix #284 **Bugs** - [x] When exporting sqlite, trees iterate empty?? - [x] LMDB doesn't work **Known issues for various back-ends** - Sled: - Eats all my RAM and also all my disk space - `.len()` has to traverse the whole table - Is actually quite slow on some operations - And is actually pretty bad code... - Sqlite: - Requires a lock to be taken on all operations. The lock is also taken when iterating on a table with `.iter()`, and the lock isn't released until the iterator is dropped. This means that we must be VERY carefull to not do anything else inside a `.iter()` loop or else we will have a deadlock! Most such cases have been eliminated from the Garage codebase, but there might still be some that remain. If your Garage-over-Sqlite seems to hang/freeze, this is the reason. - (adapter uses a bunch of unsafe code) - Heed (LMDB): - Not suited for 32-bit machines as it has to map the whole DB in memory. - (adpater uses a tiny bit of unsafe code) **My recommendation:** avoid 32-bit machines and use LMDB as much as possible. **Converting databases** is actually quite easy. For example from Sled to LMDB: ```bash cd src/db cargo run --features cli --bin convert -- -i path/to/garage/meta/db -a sled -o path/to/garage/meta/db.lmdb -b lmdb ``` Then, just add this to your `config.toml`: ```toml db_engine = "lmdb" ``` Co-authored-by: Alex Auvolat <alex@adnab.me> Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/322 Co-authored-by: Alex <alex@adnab.me> Co-committed-by: Alex <alex@adnab.me>
2022-06-08 08:01:44 +00:00
fn updated(
&self,
tx: &mut db::Transaction,
Abstract database behind generic interface and implement alternative drivers (#322) - [x] Design interface - [x] Implement Sled backend - [x] Re-implement the SledCountedTree hack ~~on Sled backend~~ on all backends (i.e. over the abstraction) - [x] Convert Garage code to use generic interface - [x] Proof-read converted Garage code - [ ] Test everything well - [x] Implement sqlite backend - [x] Implement LMDB backend - [ ] (Implement Persy backend?) - [ ] (Implement other backends? (like RocksDB, ...)) - [x] Implement backend choice in config file and garage server module - [x] Add CLI for converting between DB formats - Exploit the new interface to put more things in transactions - [x] `.updated()` trigger on Garage tables Fix #284 **Bugs** - [x] When exporting sqlite, trees iterate empty?? - [x] LMDB doesn't work **Known issues for various back-ends** - Sled: - Eats all my RAM and also all my disk space - `.len()` has to traverse the whole table - Is actually quite slow on some operations - And is actually pretty bad code... - Sqlite: - Requires a lock to be taken on all operations. The lock is also taken when iterating on a table with `.iter()`, and the lock isn't released until the iterator is dropped. This means that we must be VERY carefull to not do anything else inside a `.iter()` loop or else we will have a deadlock! Most such cases have been eliminated from the Garage codebase, but there might still be some that remain. If your Garage-over-Sqlite seems to hang/freeze, this is the reason. - (adapter uses a bunch of unsafe code) - Heed (LMDB): - Not suited for 32-bit machines as it has to map the whole DB in memory. - (adpater uses a tiny bit of unsafe code) **My recommendation:** avoid 32-bit machines and use LMDB as much as possible. **Converting databases** is actually quite easy. For example from Sled to LMDB: ```bash cd src/db cargo run --features cli --bin convert -- -i path/to/garage/meta/db -a sled -o path/to/garage/meta/db.lmdb -b lmdb ``` Then, just add this to your `config.toml`: ```toml db_engine = "lmdb" ``` Co-authored-by: Alex Auvolat <alex@adnab.me> Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/322 Co-authored-by: Alex <alex@adnab.me> Co-committed-by: Alex <alex@adnab.me>
2022-06-08 08:01:44 +00:00
old: Option<&Self::E>,
new: Option<&Self::E>,
) -> db::TxOpResult<()> {
// 1. Count
let counter_res = self.object_counter_table.count(tx, old, new);
if let Err(e) = db::unabort(counter_res)? {
error!(
"Unable to update object counter: {}. Index values will be wrong!",
e
);
}
// 2. Enqueue propagation deletions to version table
if let (Some(old_v), Some(new_v)) = (old, new) {
for v in old_v.versions.iter() {
let new_v_id = new_v
.versions
.binary_search_by(|nv| nv.cmp_key().cmp(&v.cmp_key()));
// Propagate deletion of old versions to the Version table
let delete_version = match new_v_id {
Err(_) => true,
Ok(i) => {
new_v.versions[i].state == ObjectVersionState::Aborted
&& v.state != ObjectVersionState::Aborted
}
};
if delete_version {
2023-04-27 15:57:54 +00:00
let deleted_version = Version::new(
v.uuid,
VersionBacklink::Object {
bucket_id: old_v.bucket_id,
key: old_v.key.clone(),
},
true,
);
let res = self.version_table.queue_insert(tx, &deleted_version);
if let Err(e) = db::unabort(res)? {
error!(
"Unable to enqueue version deletion propagation: {}. A repair will be needed.",
e
);
2020-04-26 18:59:17 +00:00
}
}
// After abortion or completion of multipart uploads, delete MPU table entry
if matches!(
v.state,
ObjectVersionState::Uploading {
multipart: true,
..
}
) {
let delete_mpu = match new_v_id {
Err(_) => true,
Ok(i) => !matches!(
new_v.versions[i].state,
ObjectVersionState::Uploading { .. }
),
};
if delete_mpu {
2023-06-13 08:48:22 +00:00
let deleted_mpu = MultipartUpload::new(
v.uuid,
v.timestamp,
old_v.bucket_id,
old_v.key.clone(),
true,
);
let res = self.mpu_table.queue_insert(tx, &deleted_mpu);
if let Err(e) = db::unabort(res)? {
error!(
"Unable to enqueue multipart upload deletion propagation: {}. A repair will be needed.",
e
);
}
}
}
}
}
Abstract database behind generic interface and implement alternative drivers (#322) - [x] Design interface - [x] Implement Sled backend - [x] Re-implement the SledCountedTree hack ~~on Sled backend~~ on all backends (i.e. over the abstraction) - [x] Convert Garage code to use generic interface - [x] Proof-read converted Garage code - [ ] Test everything well - [x] Implement sqlite backend - [x] Implement LMDB backend - [ ] (Implement Persy backend?) - [ ] (Implement other backends? (like RocksDB, ...)) - [x] Implement backend choice in config file and garage server module - [x] Add CLI for converting between DB formats - Exploit the new interface to put more things in transactions - [x] `.updated()` trigger on Garage tables Fix #284 **Bugs** - [x] When exporting sqlite, trees iterate empty?? - [x] LMDB doesn't work **Known issues for various back-ends** - Sled: - Eats all my RAM and also all my disk space - `.len()` has to traverse the whole table - Is actually quite slow on some operations - And is actually pretty bad code... - Sqlite: - Requires a lock to be taken on all operations. The lock is also taken when iterating on a table with `.iter()`, and the lock isn't released until the iterator is dropped. This means that we must be VERY carefull to not do anything else inside a `.iter()` loop or else we will have a deadlock! Most such cases have been eliminated from the Garage codebase, but there might still be some that remain. If your Garage-over-Sqlite seems to hang/freeze, this is the reason. - (adapter uses a bunch of unsafe code) - Heed (LMDB): - Not suited for 32-bit machines as it has to map the whole DB in memory. - (adpater uses a tiny bit of unsafe code) **My recommendation:** avoid 32-bit machines and use LMDB as much as possible. **Converting databases** is actually quite easy. For example from Sled to LMDB: ```bash cd src/db cargo run --features cli --bin convert -- -i path/to/garage/meta/db -a sled -o path/to/garage/meta/db.lmdb -b lmdb ``` Then, just add this to your `config.toml`: ```toml db_engine = "lmdb" ``` Co-authored-by: Alex Auvolat <alex@adnab.me> Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/322 Co-authored-by: Alex <alex@adnab.me> Co-committed-by: Alex <alex@adnab.me>
2022-06-08 08:01:44 +00:00
Ok(())
2020-04-09 15:32:28 +00:00
}
fn matches_filter(entry: &Self::E, filter: &Self::Filter) -> bool {
Implement ListMultipartUploads (#171) Implement ListMultipartUploads, also refactor ListObjects and ListObjectsV2. It took me some times as I wanted to propose the following things: - Using an iterator instead of the loop+goto pattern. I find it easier to read and it should enable some optimizations. For example, when consuming keys of a common prefix, we do many [redundant checks](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/api/s3_list.rs#L125-L156) while the only thing to do is to [check if the following key is still part of the common prefix](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/feature/s3-multipart-compat/src/api/s3_list.rs#L476). - Try to name things (see ExtractionResult and RangeBegin enums) and to separate concerns (see ListQuery and Accumulator) - An IO closure to make unit tests possibles. - Unit tests, to track regressions and document how to interact with the code - Integration tests with `s3api`. In the future, I would like to move them in Rust with the aws rust SDK. Merging of the logic of ListMultipartUploads and ListObjects was not a goal but a consequence of the previous modifications. Some points that we might want to discuss: - ListObjectsV1, when using pagination and delimiters, has a weird behavior (it lists multiple times the same prefix) with `aws s3api` due to the fact that it can not use our optimization to skip the whole prefix. It is independant from my refactor and can be tested with the commented `s3api` tests in `test-smoke.sh`. It probably has the same weird behavior on the official AWS S3 implementation. - Considering ListMultipartUploads, I had to "abuse" upload id marker to support prefix skipping. I send an `upload-id-marker` with the hardcoded value `include` to emulate your "including" token. - Some ways to test ListMultipartUploads with existing software (my tests are limited to s3api for now). Co-authored-by: Quentin Dufour <quentin@deuxfleurs.fr> Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/171 Co-authored-by: Quentin <quentin@dufour.io> Co-committed-by: Quentin <quentin@dufour.io>
2022-01-12 18:04:55 +00:00
match filter {
ObjectFilter::IsData => entry.versions.iter().any(|v| v.is_data()),
2023-04-27 15:57:54 +00:00
ObjectFilter::IsUploading { check_multipart } => entry
.versions
.iter()
.any(|v| v.is_uploading(*check_multipart)),
Implement ListMultipartUploads (#171) Implement ListMultipartUploads, also refactor ListObjects and ListObjectsV2. It took me some times as I wanted to propose the following things: - Using an iterator instead of the loop+goto pattern. I find it easier to read and it should enable some optimizations. For example, when consuming keys of a common prefix, we do many [redundant checks](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/api/s3_list.rs#L125-L156) while the only thing to do is to [check if the following key is still part of the common prefix](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/feature/s3-multipart-compat/src/api/s3_list.rs#L476). - Try to name things (see ExtractionResult and RangeBegin enums) and to separate concerns (see ListQuery and Accumulator) - An IO closure to make unit tests possibles. - Unit tests, to track regressions and document how to interact with the code - Integration tests with `s3api`. In the future, I would like to move them in Rust with the aws rust SDK. Merging of the logic of ListMultipartUploads and ListObjects was not a goal but a consequence of the previous modifications. Some points that we might want to discuss: - ListObjectsV1, when using pagination and delimiters, has a weird behavior (it lists multiple times the same prefix) with `aws s3api` due to the fact that it can not use our optimization to skip the whole prefix. It is independant from my refactor and can be tested with the commented `s3api` tests in `test-smoke.sh`. It probably has the same weird behavior on the official AWS S3 implementation. - Considering ListMultipartUploads, I had to "abuse" upload id marker to support prefix skipping. I send an `upload-id-marker` with the hardcoded value `include` to emulate your "including" token. - Some ways to test ListMultipartUploads with existing software (my tests are limited to s3api for now). Co-authored-by: Quentin Dufour <quentin@deuxfleurs.fr> Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/171 Co-authored-by: Quentin <quentin@dufour.io> Co-committed-by: Quentin <quentin@dufour.io>
2022-01-12 18:04:55 +00:00
}
}
}
impl CountedItem for Object {
const COUNTER_TABLE_NAME: &'static str = "bucket_object_counter";
// Partition key = bucket id
type CP = Uuid;
// Sort key = nothing
type CS = EmptyKey;
fn counter_partition_key(&self) -> &Uuid {
&self.bucket_id
}
fn counter_sort_key(&self) -> &EmptyKey {
&EmptyKey
}
fn counts(&self) -> Vec<(&'static str, i64)> {
let versions = self.versions();
let n_objects = if versions.iter().any(|v| v.is_data()) {
1
} else {
0
};
2023-04-27 15:57:54 +00:00
let n_unfinished_uploads = versions.iter().filter(|v| v.is_uploading(None)).count();
let n_bytes = versions
.iter()
.map(|v| match &v.state {
ObjectVersionState::Complete(ObjectVersionData::Inline(meta, _))
| ObjectVersionState::Complete(ObjectVersionData::FirstBlock(meta, _)) => meta.size,
_ => 0,
})
.sum::<u64>();
vec![
(OBJECTS, n_objects),
(UNFINISHED_UPLOADS, n_unfinished_uploads as i64),
(BYTES, n_bytes as i64),
]
}
}