WIP add content defined chunking #42

Closed
trinity-1686a wants to merge 42 commits from content-defined-chunking into master

The current chunking creates chunks of strictly equal length. For deduplication purposes, this is fine as long as content is modified in place, without adding or removing bytes. If a single byte is added somewhere, the chunks after it won't get deduplicated.

Content Defined Chunking tries to overcome this issue by cutting based on content instead of just length. If some bytes are added or removed, usually only one or two chunks fail to deduplicate.
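To make that failure mode concrete, here is a small self-contained illustration (hypothetical, not code from this PR; the 16-byte chunk size and the hashing helper are made up for the example). With fixed-size chunks, inserting a single byte shifts every boundary after the edit, so only chunks entirely before the insertion point keep their hashes:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash each fixed-size chunk, the way a dedup index would identify them.
fn chunk_hashes(data: &[u8], chunk_size: usize) -> Vec<u64> {
    data.chunks(chunk_size)
        .map(|chunk| {
            let mut h = DefaultHasher::new();
            chunk.hash(&mut h);
            h.finish()
        })
        .collect()
}

fn main() {
    let original: Vec<u8> = (0u8..64).collect();
    let mut edited = original.clone();
    edited.insert(20, 0xFF); // add a single byte inside the second chunk

    let before = chunk_hashes(&original, 16);
    let after = chunk_hashes(&edited, 16);
    // Only the first chunk (entirely before the insertion) still matches;
    // every chunk after the edit has shifted by one byte.
    let shared = before.iter().filter(|&h| after.contains(h)).count();
    println!("chunks deduplicated after a 1-byte insert: {shared} of {}", before.len());
}
```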

This pull request attempts to replace the current chunker with FastCDC.

The pull request is marked as WIP because it appears to create chunks considerably shorter than it should (with a min size of 512 KiB, an average of 1 MiB, and a max of 2 MiB, chunks come out at less than 600 KiB). I don't know if this is due to the dataset I used, this specific chunker, or a buggy implementation.

Owner

Hi trinity, thanks for the PR!

Unless I'm mistaken, it looks to me like you might be feeding data twice to the chunker: when a chunk is taken from `buf`, some remaining data stays in `buf`. At the next iteration, the whole `buf` is pushed into the chunker again, including the leftover data from the previous iteration, which was already pushed. This might explain why blocks don't have sizes consistent with the parameters of the algorithm.

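If that is what is happening, one possible shape for the read loop keeps an explicit marker between bytes already pushed and bytes that weren't. This is a hypothetical sketch, not code from this repository; `push` and `emit` are stand-ins, and the `Some(cut)`/`None` contract of `push` is assumed from the test-suite snippet quoted in the next comment:

```rust
use std::io::Read;

// Assumed contract, matching the test-suite snippet below: `Some(cut)` is
// a cut point into the slice just pushed (and pushing resumes from that
// cut), while `None` means the whole slice was consumed without finding a
// cut, so those bytes must NOT be pushed again.
fn chunk_all<R: Read>(
    mut input: R,
    mut push: impl FnMut(&[u8]) -> Option<usize>,
    mut emit: impl FnMut(&[u8]),
) -> std::io::Result<()> {
    let mut buf: Vec<u8> = Vec::new(); // bytes of the chunk in progress
    let mut fed = 0; // buf[..fed] has already been consumed by the chunker
    let mut read_buf = [0u8; 4096];
    loop {
        let n = input.read(&mut read_buf)?;
        if n == 0 {
            if !buf.is_empty() {
                emit(&buf); // trailing partial chunk at end of input
            }
            return Ok(());
        }
        buf.extend_from_slice(&read_buf[..n]);
        // Only push bytes the chunker has not consumed yet.
        while fed < buf.len() {
            match push(&buf[fed..]) {
                Some(cut) => {
                    let end = fed + cut;
                    emit(&buf[..end]); // a chunk ends at the cut point
                    buf.drain(..end);
                    fed = 0; // resume pushing from the cut, as in the test suite
                }
                None => fed = buf.len(), // all consumed, no cut: read more
            }
        }
    }
}
```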
Owner

Side note: at the moment all development is happening on the `dev-0.2` branch. It shouldn't be too hard to rebase your patch onto that branch. `dev-0.2` also contains many bug fixes and improvements, so it's a much better base to work on.

Author
Owner

> Unless I'm mistaken, it looks to me like you might be feeding data twice to the chunker: when a chunk is taken from `buf`, some remaining data stays in `buf`. At the next iteration, the whole `buf` is pushed into the chunker again, including the leftover data from the previous iteration, which was already pushed. This might explain why blocks don't have sizes consistent with the parameters of the algorithm.

The code definitely looks odd, but it's how the crate expects to be used, based on its test suite:
```rust
loop {
    // Push the remaining data; `push` returns the next cut point, if any.
    let p = fastcdc.push(&data[..]);
    if p == None || p.unwrap() == data.len() {
        // No cut found, or the cut is at the very end: we're done.
        break;
    } else {
        ct += 1;
        if ct > 5 {
            return;
        }
        // Resume pushing from the cut point.
        data = &data[p.unwrap()..];
    }
}
```
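If I read that contract right, pushing resumes from the cut point after each cut (the tail of the slice is deliberately pushed again), whereas a `None` return presumably means the whole slice was consumed, so re-pushing it on the next read would feed those bytes twice, consistent with the diagnosis above.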

I'll rebase on `dev-0.2`, but I don't think I can change the target of a PR, so I'll close this one and open another at the next commit.

Another side note: CI fails to clone branches from forked repositories.

trinity-1686a force-pushed content-defined-chunking from 7de671c48f to a32c0bac50 2021-03-17 00:34:15 +00:00
trinity-1686a closed this pull request 2021-03-17 00:34:26 +00:00
Some checks failed
continuous-integration/drone/pr Build is failing

Pull request closed
