Use content defined chunking #43

Merged
lx merged 4 commits from trinity-1686a/garage:content-defined-chunking into main 2021-04-06 20:18:46 +00:00

Current chunking creates chunks of strictly equal length. For deduplication purposes, this is fine as long as content is modified without adding or removing bytes. If a single byte is added somewhere, chunks after that point won't get deduplicated.

Content Defined Chunking tries to overcome this issue by cutting based on content instead of just length. If some bytes are added or removed, usually only one or two chunks fail to deduplicate.

This pull request attempts to replace the current chunker with FastCDC.
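To illustrate the idea behind the PR, here is a minimal, self-contained sketch of content-defined chunking using a Gear-style rolling hash (the mechanism FastCDC is built on). This is a toy illustration only: the constants, the `gear` function and the chunk sizes are made up for the example, and the actual PR uses the `fastcdc` crate rather than this code.

```rust
// Toy content-defined chunker: cut when a rolling hash hits a boundary
// mask (Gear-style, as in FastCDC). Illustrative only; the PR itself
// relies on the `fastcdc` crate.

const MIN: usize = 4; // minimum chunk size (toy value)
const MAX: usize = 64; // maximum chunk size (toy value)
const MASK: u64 = 0x3f; // boundary mask: ~1 cut point per 64 positions

// Deterministic stand-in for FastCDC's random gear table.
fn gear(byte: u8) -> u64 {
    (byte as u64).wrapping_mul(0x9e37_79b9_7f4a_7c15)
}

fn chunks(data: &[u8]) -> Vec<&[u8]> {
    let mut out = Vec::new();
    let mut start = 0;
    let mut hash: u64 = 0;
    for (i, &b) in data.iter().enumerate() {
        // Shift-and-add rolling hash: old bytes fall out of the u64,
        // so the cut decision depends only on recent content.
        hash = (hash << 1).wrapping_add(gear(b));
        let len = i + 1 - start;
        if (len >= MIN && hash & MASK == 0) || len >= MAX {
            out.push(&data[start..=i]);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        out.push(&data[start..]);
    }
    out
}

fn main() {
    let original = b"the quick brown fox jumps over the lazy dog again and again".to_vec();
    // Insert one byte near the front: fixed-size chunking would shift
    // every later chunk; CDC re-synchronises at the next content cut.
    let mut shifted = original.clone();
    shifted.insert(4, b'X');

    let a = chunks(&original);
    let b = chunks(&shifted);
    let shared = b.iter().filter(|c| a.contains(c)).count();
    println!("{} of {} chunks unchanged after insertion", shared, b.len());
}
```

Because boundaries are chosen from the bytes themselves, a chunk sequence "re-synchronises" shortly after an insertion, which is exactly the property fixed-size chunking lacks.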
trinity-1686a added 2 commits 2021-03-17 00:37:20 +00:00
change crate used for cdc
Some checks failed
continuous-integration/drone/pr Build is failing
a32c0bac50
previous one seemed to output incorrect results
Owner

To add some context to trinity's PR:

- the PR uses the [fastcdc](https://crates.io/crates/fastcdc) crate in its latest version, 1.0.5
- FastCDC was published at USENIX ATC in 2016:
  - [Read the article (PDF)](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf)
  - [Read the slides (PDF)](https://www.usenix.org/sites/default/files/conference/protected-files/atc16_slides_xia.pdf)

I have not yet reviewed the PR in depth but plan to do so soon. We will also need LX's opinion before merging it :)

But in any case, thanks a lot for your contribution!
lx closed this pull request 2021-03-18 18:27:20 +00:00
lx reopened this pull request 2021-03-18 18:36:29 +00:00
lx changed target branch from dev-0.2 to master 2021-03-18 18:37:03 +00:00
trinity-1686a force-pushed content-defined-chunking from a6c143f706 to e359a3db79 2021-03-18 20:15:31 +00:00 Compare
trinity-1686a force-pushed content-defined-chunking from e359a3db79 to ead91a837d 2021-03-18 21:16:44 +00:00 Compare
lx closed this pull request 2021-03-19 13:16:17 +00:00
lx reopened this pull request 2021-03-19 13:18:26 +00:00
lx changed target branch from master to main 2021-03-19 13:18:31 +00:00
trinity-1686a force-pushed content-defined-chunking from ead91a837d to 1acb8c8739 2021-03-19 17:13:26 +00:00 Compare
trinity-1686a force-pushed content-defined-chunking from 1acb8c8739 to 47d0aee9f8 2021-04-06 00:50:40 +00:00 Compare
trinity-1686a added 1 commit 2021-04-06 00:54:22 +00:00
run fmt
All checks were successful
continuous-integration/drone/pr Build is passing
b3b0b20d72
lx reviewed 2021-04-06 13:56:54 +00:00
Dismissed
@ -302,0 +313,4 @@
let block = self.buf.drain(..length).collect::<Vec<u8>>();
Ok(Some(block))
} else {
Ok(None)
Owner

I think that if FastCDC is giving us `None` here then it's a bug and we should throw an error. We can probably just put `unreachable!()` here.
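The suggestion is to treat the `None` branch as a programming error rather than a normal outcome. A self-contained sketch of that pattern, with `next_chunk_boundary` and `read_chunk` as hypothetical stand-ins for the code in the diff above:

```rust
// Hypothetical stand-in for the FastCDC cut-point lookup: by
// construction it always returns a boundary for non-empty input.
fn next_chunk_boundary(buf: &[u8]) -> Option<usize> {
    if buf.is_empty() { None } else { Some(buf.len().min(16)) }
}

fn read_chunk(buf: &mut Vec<u8>) -> Result<Option<Vec<u8>>, String> {
    if buf.is_empty() {
        return Ok(None); // genuinely no more data
    }
    if let Some(length) = next_chunk_boundary(buf) {
        let block = buf.drain(..length).collect::<Vec<u8>>();
        Ok(Some(block))
    } else {
        // The chunker returned no cut point for non-empty input:
        // a bug, not a normal outcome, so fail fast as suggested.
        unreachable!("chunker returned no cut point for non-empty input")
    }
}

fn main() {
    let mut buf = vec![0u8; 40];
    while let Some(chunk) = read_chunk(&mut buf).unwrap() {
        println!("chunk of {} bytes", chunk.len());
    }
}
```

`unreachable!()` panics with a clear message if the invariant is ever violated, which surfaces the bug immediately instead of silently ending the stream early.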
Owner

LGTM. We should gather stats to show how often FastCDC helps us deduplicate stuff. In the paper they use FastCDC with much smaller block sizes (around 10 KB). I don't see many scenarios where large files (several MBs) are partially rewritten and some of the content is shifted in position. For small files like text documents it made more sense to me. Still, I'm fine with this as it can only be better than what we had before.
Author
Owner

I don't have numbers to quantify how much better it is (if it is). What I know, however, is that Borg (backup software) uses chunks of min 512 KiB, average 2 MiB and max 8 MiB ([source](https://borgbackup.readthedocs.io/en/stable/internals/data-structures.html#chunker-details)) (using Buzhash instead of FastCDC), so I'm guessing it's probably not totally useless, [unless it is](https://en.wikipedia.org/wiki/Cargo_cult_programming).
trinity-1686a added 1 commit 2021-04-06 14:53:56 +00:00
mark branch as unreachable
All checks were successful
continuous-integration/drone/pr Build is passing
6cbc8d6ec9
lx merged commit 7380f3855c into main 2021-04-06 20:18:45 +00:00
trinity-1686a deleted branch content-defined-chunking 2021-04-07 09:58:59 +00:00
Reference: Deuxfleurs/garage#43