Use content defined chunking #43
Reference: Deuxfleurs/garage#43
Current chunking creates chunks of strictly equal length. For deduplication purposes, this is fine as long as content is modified without adding or removing bytes. If a single byte is added or removed somewhere, none of the chunks after that point get deduplicated.
Content Defined Chunking tries to overcome this issue by cutting based on content instead of just length. When some bytes are added or removed, usually only one or two chunks fail to deduplicate.
This pull request attempts to replace the current chunker with FastCDC.
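To illustrate the idea (this is not Garage's actual implementation, and it uses a toy rolling hash instead of FastCDC's precomputed gear table; all constants and names here are illustrative), a minimal content-defined chunker might look like:

```rust
// Toy content-defined chunker: cut when a rolling "gear" hash of the
// last ~64 bytes matches a mask, so cut points depend on content, not
// on absolute position. Constants are illustrative only.
const MIN_SIZE: usize = 1024;
const MAX_SIZE: usize = 16384;
const MASK: u64 = (1 << 12) - 1; // ~4 KiB average distance between cuts

// Placeholder for FastCDC's precomputed per-byte gear table.
fn gear(b: u8) -> u64 {
    (b as u64 ^ 0x9E37_79B9_7F4A_7C15).wrapping_mul(0xBF58_476D_1CE4_E5B9)
}

// Find where to cut the next chunk off the front of `data`.
fn cut_point(data: &[u8]) -> usize {
    let mut h: u64 = 0;
    let end = data.len().min(MAX_SIZE);
    for i in 0..end {
        // Shift-and-add: bytes more than 64 positions back shift out,
        // so `h` depends only on a small sliding window of content.
        h = (h << 1).wrapping_add(gear(data[i]));
        if i >= MIN_SIZE && (h & MASK) == 0 {
            return i + 1;
        }
    }
    end
}

fn chunk(data: &[u8]) -> Vec<&[u8]> {
    let mut chunks = Vec::new();
    let mut rest = data;
    while !rest.is_empty() {
        let cut = cut_point(rest);
        chunks.push(&rest[..cut]);
        rest = &rest[cut..];
    }
    chunks
}

fn main() {
    // Deterministic pseudo-random input (64 KiB).
    let mut state: u64 = 42;
    let mut data = Vec::with_capacity(1 << 16);
    for _ in 0..(1 << 16) {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        data.push((state >> 33) as u8);
    }

    // Insert a single byte near the start and re-chunk.
    let mut edited = data.clone();
    edited.insert(100, 0xFF);

    let before: Vec<Vec<u8>> = chunk(&data).iter().map(|c| c.to_vec()).collect();
    let after: Vec<Vec<u8>> = chunk(&edited).iter().map(|c| c.to_vec()).collect();
    let shared = after.iter().filter(|c| before.contains(c)).count();
    println!(
        "{} chunks before, {} after, {} unchanged",
        before.len(),
        after.len(),
        shared
    );
}
```

With fixed-size chunks, the same one-byte insert would shift and invalidate every chunk after the edit; here the boundaries resynchronize on content, so typically only the chunk containing the insert changes.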
To add some context to trinity's PR:
I have not yet reviewed the PR in depth but plan to do so soon.
We will also need LX's opinion before merging it :)
But in any case, thanks a lot for your contribution!
a6c143f706 to e359a3db79
e359a3db79 to ead91a837d
ead91a837d to 1acb8c8739
1acb8c8739 to 47d0aee9f8
@@ -302,0 +313,4 @@
            let block = self.buf.drain(..length).collect::<Vec<u8>>();
            Ok(Some(block))
        } else {
            Ok(None)
I think that if FastCDC is giving us None here then it's a bug and we should throw an error. We can probably just put unreachable!() here.

LGTM. We should gather stats to show how often FastCDC helps us deduplicate stuff. In the paper they use FastCDC with much smaller block sizes (around 10 KB). I don't see many scenarios where large files (several MBs) are partially rewritten and some of the content is shifted in position. For small files like text documents it made more sense to me. Still, I'm fine with this, as it can only be better than what we had before.
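As a sketch of the unreachable!() suggestion above (hypothetical function and parameter names, not the PR's actual code), the impossible case could panic instead of returning None:

```rust
// Hypothetical sketch of the reviewer's suggestion: treat a missing
// cut length as an internal bug instead of a valid `None` result.
fn next_block(buf: &mut Vec<u8>, length: usize) -> Vec<u8> {
    if length > 0 && length <= buf.len() {
        // Drain the chunk off the front of the buffer, as in the diff.
        buf.drain(..length).collect::<Vec<u8>>()
    } else {
        // FastCDC should always produce a valid cut for a non-empty
        // buffer, so reaching this branch would indicate a bug.
        unreachable!("invalid cut length {} for {}-byte buffer", length, buf.len())
    }
}
```

This turns a silently-absorbed impossible state into a loud failure during testing, which is usually preferable to propagating an Option the callers cannot meaningfully handle.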
I don't have numbers to quantify how much better it is (if it is). What I know, however, is that Borg (backup software) uses chunks of min 512 KiB, average 2 MiB and max 8 MiB (source), using Buzhash instead of FastCDC, so I'm guessing it's probably not totally useless, unless it is.