WIP add content defined chunking #42
Reference: Deuxfleurs/garage#42
The current chunker creates chunks of strictly equal length. For deduplication purposes, this is fine as long as content is modified without adding or removing bytes. If a single byte is added somewhere, none of the chunks after that point get deduplicated.
Content Defined Chunking tries to overcome this issue by cutting based on content instead of just length. If some bytes are added or removed, usually only one or two chunks fail to deduplicate.
This pull request attempts to replace the current chunker with FastCDC.
The pull request is marked as WIP because it appears to create chunks considerably shorter than it should (with a min size of 512 KiB, an average of 1 MiB and a max of 2 MiB, chunks are less than 600 KiB long). I don't know if this is due to the dataset I use, this specific chunker, or a buggy implementation.
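To make the boundary-shift problem described above concrete, here is a minimal Python sketch. It is not FastCDC and not the code in this PR: `cdc_chunks` is a toy gear-style rolling hash chunker, and every name and parameter in it (`make_gear_table`, `shared_count`, the small chunk sizes) is an assumption chosen for illustration. It prepends one byte to a random buffer and counts how many chunks of the modified buffer already exist among the chunks of the original.

```python
import hashlib
import random


def fixed_chunks(data, size):
    """Cut data into chunks of exactly `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def make_gear_table(seed=0):
    """One pseudo-random 32-bit value per byte value (hypothetical table)."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data, min_size, avg_size, max_size, table):
    """Toy content-defined chunker; avg_size must be a power of two.

    A cut is made where the low bits of a rolling hash are all zero, which
    happens roughly once every avg_size bytes after the first min_size bytes
    of a chunk. A cut is forced when a chunk reaches max_size.
    """
    mask = avg_size - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the masked low bits over time.
        h = ((h << 1) + table[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_count(old_chunks, new_chunks):
    """Number of chunks in new_chunks whose content also appears in old_chunks."""
    old = {hashlib.sha256(c).digest() for c in old_chunks}
    return sum(1 for c in new_chunks if hashlib.sha256(c).digest() in old)


rng = random.Random(1)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
modified = b"X" + original  # a single byte added at the front
table = make_gear_table()

fixed_old, fixed_new = fixed_chunks(original, 512), fixed_chunks(modified, 512)
cdc_old = cdc_chunks(original, 64, 256, 1024, table)
cdc_new = cdc_chunks(modified, 64, 256, 1024, table)

print("fixed-size:", shared_count(fixed_old, fixed_new), "of", len(fixed_new), "chunks reused")
print("content-defined:", shared_count(cdc_old, cdc_new), "of", len(cdc_new), "chunks reused")
```

With fixed-size chunks, the added byte shifts every later boundary, so nothing is reused. With content-defined cuts, the boundaries follow the content, so the two chunk streams realign shortly after the insertion point.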
Hi trinity, thanks for the PR!
Unless I'm mistaken, it looks to me like you might be feeding data twice to the chunker: when a chunk is taken from `buf`, some remaining data stays in `buf`. At the next iteration, the whole `buf` will be pushed into the chunker again, including the leftover data from the previous iteration, which was already pushed. This might explain why blocks don't have sizes consistent with the parameters of the algorithm.
Side note: at the moment all development is going on in the `dev-0.2` branch. It shouldn't be too hard to rebase your patch on that branch. Also, `dev-0.2` contains many bug fixes and improvements, so it's a much better base to work on.
The code definitely looks odd, but it's how the crate expects to be used, based on its test suite.
I'll rebase on `dev-0.2`, but I don't think I can change the target of a PR, so I'll close this one and open another at the next commit.
Side note too: CI fails to clone branches from forked repositories.
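The double-feeding concern raised in the review can be illustrated with a small Python sketch. It assumes a hypothetical push-based chunker that keeps unconsumed bytes in its own internal state. `StatefulChunker`, `feed_correct` and `feed_with_leftover` are invented names, not the API of the crate used in this PR, and fixed-size cuts are used only to keep the example short. Whether re-presenting the leftover is a bug depends on that assumption: a chunker that keeps no state between calls would need the leftover handed to it again, which matches the usage described in the reply.

```python
class StatefulChunker:
    """Hypothetical chunker that buffers unconsumed bytes between pushes."""

    def __init__(self, size):
        self.size = size
        self.pending = b""

    def push(self, data):
        """Accept new bytes and return every chunk that is now complete."""
        self.pending += data
        out = []
        while len(self.pending) >= self.size:
            out.append(self.pending[:self.size])
            self.pending = self.pending[self.size:]
        return out

    def finish(self):
        """Flush whatever is left as a final, possibly short, chunk."""
        out = [self.pending] if self.pending else []
        self.pending = b""
        return out


def feed_correct(blocks, chunker):
    """Push each incoming block exactly once."""
    chunks = []
    for block in blocks:
        chunks += chunker.push(block)
    return chunks + chunker.finish()


def feed_with_leftover(blocks, chunker):
    """Keep the leftover in a caller-side buffer and push the whole buffer.

    Against a stateful chunker, the leftover is pushed a second time.
    """
    buf = b""
    chunks = []
    for block in blocks:
        buf += block
        produced = chunker.push(buf)  # includes bytes already pushed earlier
        chunks += produced
        consumed = sum(len(c) for c in produced)
        buf = buf[consumed:]  # the remainder stays in buf for the next round
    return chunks + chunker.finish()


blocks = [b"abcdef", b"ghijkl"]
print(b"".join(feed_correct(blocks, StatefulChunker(4))))        # b'abcdefghijkl'
print(b"".join(feed_with_leftover(blocks, StatefulChunker(4))))  # b'abcdefefghijkl'
```

In the second call the bytes `ef` appear twice in the output, so the chunk stream no longer matches the input. This sketch only shows the duplication; it does not by itself reproduce the shorter-than-expected chunk sizes reported above.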
7de671c48f to a32c0bac50
Pull request closed