Use content defined chunking #43
Reference: Deuxfleurs/garage#43
Current chunking creates chunks of strictly equal length. For deduplication purposes, this is fine as long as content is modified without adding or removing bytes. If a single byte is added or removed somewhere, none of the chunks after that point get deduplicated.
Content Defined Chunking tries to overcome this issue by cutting based on content instead of just length. When some bytes are added or removed, usually only one or two chunks fail to deduplicate.
This pull request attempts to replace the current chunker with FastCDC.
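To illustrate the idea (this is not Garage's actual implementation, and it uses a toy rolling hash instead of FastCDC's precomputed gear table; all constants and names here are illustrative), a minimal content-defined chunker might look like:

```rust
// Toy content-defined chunker: cut when a rolling "gear" hash of the
// last ~64 bytes matches a mask, so cut points depend on content, not
// on absolute position. Constants are illustrative only.
const MIN_SIZE: usize = 1024;
const MAX_SIZE: usize = 16384;
const MASK: u64 = (1 << 12) - 1; // ~4 KiB average distance between cuts

// Placeholder for FastCDC's precomputed per-byte gear table.
fn gear(b: u8) -> u64 {
    (b as u64 ^ 0x9E37_79B9_7F4A_7C15).wrapping_mul(0xBF58_476D_1CE4_E5B9)
}

// Find where to cut the next chunk off the front of `data`.
fn cut_point(data: &[u8]) -> usize {
    let mut h: u64 = 0;
    let end = data.len().min(MAX_SIZE);
    for i in 0..end {
        // Shift-and-add: bytes more than 64 positions back shift out,
        // so `h` depends only on a small sliding window of content.
        h = (h << 1).wrapping_add(gear(data[i]));
        if i >= MIN_SIZE && (h & MASK) == 0 {
            return i + 1;
        }
    }
    end
}

fn chunk(data: &[u8]) -> Vec<&[u8]> {
    let mut chunks = Vec::new();
    let mut rest = data;
    while !rest.is_empty() {
        let cut = cut_point(rest);
        chunks.push(&rest[..cut]);
        rest = &rest[cut..];
    }
    chunks
}

fn main() {
    // Deterministic pseudo-random input (64 KiB).
    let mut state: u64 = 42;
    let mut data = Vec::with_capacity(1 << 16);
    for _ in 0..(1 << 16) {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        data.push((state >> 33) as u8);
    }

    // Insert a single byte near the start and re-chunk.
    let mut edited = data.clone();
    edited.insert(100, 0xFF);

    let before: Vec<Vec<u8>> = chunk(&data).iter().map(|c| c.to_vec()).collect();
    let after: Vec<Vec<u8>> = chunk(&edited).iter().map(|c| c.to_vec()).collect();
    let shared = after.iter().filter(|c| before.contains(c)).count();
    println!(
        "{} chunks before, {} after, {} unchanged",
        before.len(),
        after.len(),
        shared
    );
}
```

With fixed-size chunks, the same one-byte insert would shift and invalidate every chunk after the edit; here the boundaries resynchronize on content, so typically only the chunk containing the insert changes.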
To add some context to trinity's PR:
I have not yet reviewed the PR in depth but plan to do so soon.
We will also need LX's opinion before merging it :)
But in any case, thanks a lot for your contribution!
a6c143f706 to e359a3db79
e359a3db79 to ead91a837d
ead91a837d to 1acb8c8739
1acb8c8739 to 47d0aee9f8
@@ -302,0 +313,4 @@
            let block = self.buf.drain(..length).collect::<Vec<u8>>();
            Ok(Some(block))
        } else {
            Ok(None)
I think that if FastCDC is giving us None here then it's a bug and we should throw an error. We can probably just put unreachable!() here.

LGTM. We should gather stats to show how often FastCDC helps us deduplicate stuff. In the paper they use FastCDC with much smaller block sizes (around 10 KB). I don't see many scenarios where large files (several MBs) are partially rewritten and some of the content is shifted in position. For small files like text documents it made more sense to me. Still, I'm fine with this, as it can only be better than what we had before.
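As a sketch of the unreachable!() suggestion above (hypothetical function and parameter names, not the PR's actual code), the impossible case could panic instead of returning None:

```rust
// Hypothetical sketch of the reviewer's suggestion: treat a missing
// cut length as an internal bug instead of a valid `None` result.
fn next_block(buf: &mut Vec<u8>, length: usize) -> Vec<u8> {
    if length > 0 && length <= buf.len() {
        // Drain the chunk off the front of the buffer, as in the diff.
        buf.drain(..length).collect::<Vec<u8>>()
    } else {
        // FastCDC should always produce a valid cut for a non-empty
        // buffer, so reaching this branch would indicate a bug.
        unreachable!("invalid cut length {} for {}-byte buffer", length, buf.len())
    }
}
```

This turns a silently-absorbed impossible state into a loud failure during testing, which is usually preferable to propagating an Option the callers cannot meaningfully handle.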
I don't have numbers to quantify how much better it is (if it is). What I know, however, is that Borg (backup software) uses chunks of min 512 KiB, average 2 MiB and max 8 MiB (source), using Buzhash instead of FastCDC, so I'm guessing it's probably not totally useless, unless it is.