Use content defined chunking #43
The current chunker creates chunks of strictly equal length. For deduplication purposes this is fine as long as content is modified in place, without adding or removing bytes. But if a single byte is added somewhere, none of the chunks after that point get deduplicated.
Content Defined Chunking tries to overcome this issue by cutting based on content instead of just length. When some bytes are added or removed, usually only one or two chunks fail to deduplicate.
This pull request attempts to replace the current chunker with FastCDC.
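To illustrate the idea (this is not the code in this PR), here is a minimal sketch of content-defined chunking in the style of FastCDC's gear hash; the table generation, the size constants and the `chunk_boundaries` helper are all made up for the example.

```rust
// Toy "gear" table: one pseudo-random 32-bit constant per byte value,
// generated with a simple xorshift mixer (FastCDC ships a fixed table).
fn gear_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    let mut x: u32 = 0x9E37_79B9;
    for entry in table.iter_mut() {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        *entry = x;
    }
    table
}

/// Return chunk end offsets. The cut decision depends only on recently
/// hashed bytes, so inserting or removing a byte early in the stream
/// shifts at most a chunk or two before boundaries re-synchronize.
fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
    const MIN_SIZE: usize = 256 * 1024; // never cut before this
    const MAX_SIZE: usize = 4 * 1024 * 1024; // always cut at this
    const MASK: u32 = (1 << 20) - 1; // fires roughly once per MiB

    let gear = gear_table();
    let mut cuts = Vec::new();
    let (mut start, mut hash) = (0usize, 0u32);
    for (i, &byte) in data.iter().enumerate() {
        // Gear-style rolling update: old bytes are shifted out of the hash.
        hash = (hash << 1).wrapping_add(gear[byte as usize]);
        let len = i + 1 - start;
        if (len >= MIN_SIZE && hash & MASK == 0) || len >= MAX_SIZE {
            cuts.push(i + 1);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        cuts.push(data.len()); // final, possibly short, chunk
    }
    cuts
}
```

With fixed-size chunking every boundary after an insertion moves, so every later chunk hashes differently; here a boundary is a function of local content, which is why only the chunk containing the edit (and sometimes its neighbor) changes.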
To add some context to trinity's PR:
I have not yet reviewed the PR in depth but plan to do so soon.
We will also need LX's opinion before merging it :)
But in any case, thanks a lot for your contribution!
@@ -302,0 +313,4 @@
        let block = self.buf.drain(..length).collect::<Vec<u8>>();
        Ok(Some(block))
    } else {
        Ok(None)
I think that if FastCDC is giving us `None` here then it's a bug and we should throw an error. We can probably just put `unreachable!()` here.

LGTM. We should gather stats to show how often FastCDC actually helps us deduplicate. In the paper they use FastCDC with much smaller block sizes (around 10 KB). I don't see many scenarios where large files (several MBs) are partially rewritten and some of the content is shifted in position; for small files like text documents it made more sense to me. Still, I'm fine with this as it can only be better than what we had before.
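To make the first remark above concrete, here is a rough sketch of how that branch could use `unreachable!()`; the `Chunker` struct, the `find_cut_point` helper and the method shape are guesses for illustration, not the actual code in this PR.

```rust
use std::io::Result;

struct Chunker {
    buf: Vec<u8>,
}

impl Chunker {
    /// Stand-in for the FastCDC call; on a non-empty buffer it should
    /// always return a cut point no larger than the buffer length.
    fn find_cut_point(&self) -> Option<usize> {
        if self.buf.is_empty() {
            None
        } else {
            Some(self.buf.len().min(1024 * 1024))
        }
    }

    fn next_chunk(&mut self) -> Result<Option<Vec<u8>>> {
        if self.buf.is_empty() {
            return Ok(None); // genuinely out of data: the normal end case
        }
        if let Some(length) = self.find_cut_point() {
            let block = self.buf.drain(..length).collect::<Vec<u8>>();
            Ok(Some(block))
        } else {
            // Per the review: with a non-empty buffer FastCDC must yield a
            // cut point, so reaching this branch would be a bug rather than
            // something to report as "no more data".
            unreachable!("FastCDC returned no cut point for a non-empty buffer")
        }
    }
}
```

The point is that `Ok(None)` stays reserved for the end-of-stream case the caller actually expects, while an impossible internal state fails loudly instead of being silently swallowed.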
I don't have numbers to quantify how much better it is (if it is). What I do know is that Borg (the backup software) uses chunks with a minimum size of 512 KiB, an average of 2 MiB and a maximum of 8 MiB (source) (using Buzhash instead of FastCDC), so I'm guessing it's probably not totally useless, unless it is.