DESIGN - dedup - deduplicating backup program
git clone git://bitreich.org/dedup/ git://enlrupgkhuxnvlhsf6lc3fziv5h2hhfrinws65d7roiv6bfj7d652fid.onion/dedup/

Design notes
============

There are three main abstractions in the design of dedup:

- The chunker interface
- The snapshot layer
- The block layer

The block layer
---------------

To the outside world, the block layer is an abstraction for handling
variable-length blocks. Every block is referenced by its hash.

Internally, the block layer is arranged as a stack of layers. From
top to bottom these are:

- The generic layer
- The compression layer
- The encryption layer
- The storage layer

The generic layer is the one that client code interfaces with; it is
the top-level entry point to the block layer. It calculates the hash
of the block and passes the block down to the compression layer.
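
The stacking described above can be sketched as follows. This is a
minimal illustration, assuming each layer does nothing except prepend
a one-byte tag. The struct, the function name and the tag values are
invented for this sketch and are not taken from dedup's source;
hashing, compression and encryption are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical one-byte descriptor tags; dedup's real descriptors differ. */
enum { TAG_COMPR = 'C', TAG_ENCR = 'E', TAG_STORE = 'S' };

struct layer {
	unsigned char tag;   /* descriptor this layer prepends */
	struct layer *below; /* next layer down, NULL at the bottom */
};

/*
 * Pass a block down the stack. Each layer prepends its descriptor and
 * hands the result to the layer below. Past the bottom layer the bytes
 * are copied to out. Returns the final length, or 0 if a buffer is too
 * small.
 */
size_t
layer_put(const struct layer *l, const unsigned char *buf, size_t n,
          unsigned char *out, size_t outsz)
{
	unsigned char tmp[256]; /* toy limit, enough for the sketch */

	assert(buf != NULL && out != NULL);
	if (l == NULL) {
		if (n > outsz)
			return 0;
		memcpy(out, buf, n);
		return n;
	}
	if (n + 1 > sizeof(tmp))
		return 0;
	tmp[0] = l->tag;
	memcpy(tmp + 1, buf, n);
	return layer_put(l->below, tmp, n + 1, out, outsz);
}
```

Passing the two bytes "hi" through a compression, an encryption and a
storage layer, in that order, yields the five bytes S E C h i: the
descriptor that is added last ends up outermost.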

The compression layer prepends a compression descriptor to the block
and then compresses the block using snappy or lz4. Compression can be
disabled, in which case a special descriptor is prepended and the
data is passed uncompressed to the encryption layer.
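
The disabled-compression case is the easiest one to illustrate. The
sketch below assumes a one-byte descriptor whose value names the
algorithm; the enum values and function names are hypothetical and the
real on-disk layout may differ. The snappy and lz4 paths are not
shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor values; the real format may differ. */
enum compr_type { COMPR_NONE = 0, COMPR_SNAPPY = 1, COMPR_LZ4 = 2 };

/*
 * Frame a block when compression is disabled: one descriptor byte
 * followed by the unmodified data. Returns the framed length, or 0 if
 * out is too small.
 */
size_t
compr_frame_none(const unsigned char *in, size_t n,
                 unsigned char *out, size_t outsz)
{
	assert(in != NULL && out != NULL);
	if (outsz < n + 1)
		return 0;
	out[0] = COMPR_NONE;
	memcpy(out + 1, in, n);
	return n + 1;
}

/*
 * Undo the framing. Only COMPR_NONE is handled; a real reader would
 * dispatch to the matching decompressor for the other descriptors.
 * Returns the payload length, or 0 on a missing or unknown descriptor.
 */
size_t
compr_unframe(const unsigned char *in, size_t n,
              unsigned char *out, size_t outsz)
{
	assert(in != NULL && out != NULL);
	if (n < 1 || in[0] != COMPR_NONE || outsz < n - 1)
		return 0;
	memcpy(out, in + 1, n - 1);
	return n - 1;
}
```

Because the descriptor is always present, a reader can tell how to
decode a block without consulting any global setting.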

The encryption layer prepends an encryption descriptor to the block
and then encrypts and authenticates the block using XChaCha20 and
Poly1305. Encryption can be disabled, in which case the layer acts as
a bypass and uses a special type of encryption descriptor. The block
is then passed to the storage layer.

The storage layer prepends a storage descriptor and appends the
descriptor and the data to a single backing file.
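
An append-only backing file of this kind can be sketched with stdio
alone. The descriptor used here, a four-byte little-endian payload
length, is an assumption made for the example; the text does not say
what dedup's storage descriptor contains. The function name is
hypothetical as well.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed descriptor: payload length as four little-endian bytes. */
#define SD_SIZE 4

/*
 * Append descriptor + data at the end of the backing file.
 * Returns the offset at which the record starts, or -1 on error.
 */
long
store_append(FILE *fp, const unsigned char *data, uint32_t n)
{
	unsigned char sd[SD_SIZE];
	long off;

	assert(fp != NULL && data != NULL);
	if (fseek(fp, 0L, SEEK_END) != 0)
		return -1;
	if ((off = ftell(fp)) < 0)
		return -1;
	sd[0] = n & 0xff;
	sd[1] = (n >> 8) & 0xff;
	sd[2] = (n >> 16) & 0xff;
	sd[3] = (n >> 24) & 0xff;
	if (fwrite(sd, 1, SD_SIZE, fp) != SD_SIZE)
		return -1;
	if (fwrite(data, 1, n, fp) != n)
		return -1;
	return off;
}
```

The returned offset is what an index would need to remember in order
to find the block again by its hash.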

The snapshot layer
------------------

The snapshot abstraction is currently very simple. A snapshot is a
file under $repo/archive/<name>. The contents of the file are the
hashes of the blocks that make up the data stored in the snapshot.
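
If the hashes are stored back to back at a fixed size, a snapshot
file can be written and read with plain seeks. The sketch below
assumes 32-byte hashes with no header or separators; both the size
and the layout are assumptions, and the function names are made up.

```c
#include <assert.h>
#include <stdio.h>

#define HASH_LEN 32 /* assumed hash size in bytes */

/* Record one more block in the snapshot by appending its hash. */
int
snap_add(FILE *fp, const unsigned char hash[HASH_LEN])
{
	assert(fp != NULL);
	if (fseek(fp, 0L, SEEK_END) != 0)
		return -1;
	return fwrite(hash, 1, HASH_LEN, fp) == HASH_LEN ? 0 : -1;
}

/*
 * Fetch the i-th block hash of the snapshot, in the order the blocks
 * were added. Returns 0 on success, -1 if i is out of range.
 */
int
snap_get(FILE *fp, long i, unsigned char hash[HASH_LEN])
{
	assert(fp != NULL);
	if (i < 0 || fseek(fp, i * HASH_LEN, SEEK_SET) != 0)
		return -1;
	return fread(hash, 1, HASH_LEN, fp) == HASH_LEN ? 0 : -1;
}
```

Restoring a snapshot then amounts to walking the hashes in order and
asking the block layer for each block.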

The chunker interface
---------------------

The chunker emits variable-length blocks. The minimum block size is
512KB, the maximum block size is 8MB and the average block size is
2MB. These parameters can be changed by editing config.h, but tuning
them properly can be tricky.
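
One common way to turn such limits into a cut decision is to test the
low bits of the rolling hash once the minimum size has been reached.
The sketch below uses the sizes quoted above; the macro names, the
function and the choice of mask are assumptions, since the text does
not say how dedup derives its cut condition from the configured
average.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes quoted in the text; the macro names are made up for this sketch. */
#define BLKSIZE_MIN ((size_t)512 * 1024)
#define BLKSIZE_AVG ((size_t)2 * 1024 * 1024)
#define BLKSIZE_MAX ((size_t)8 * 1024 * 1024)

/*
 * Decide whether to end the current block after len bytes, given the
 * rolling hash of the most recent input window. A cut is never made
 * below the minimum and is always made at the maximum. In between,
 * the block ends when the low 21 bits of the hash are all zero.
 */
int
should_cut(size_t len, uint32_t hash)
{
	const uint32_t mask = (uint32_t)(BLKSIZE_AVG - 1);

	assert(len > 0);
	if (len < BLKSIZE_MIN)
		return 0;
	if (len >= BLKSIZE_MAX)
		return 1;
	return (hash & mask) == 0;
}
```

With this mask a cut fires on roughly one position in 2^21 once the
minimum is passed, so the minimum and maximum bounds shift the actual
mean away from the nominal 2MB. This interaction is one reason the
parameters are awkward to tune.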

The buzhash[0] rolling hash algorithm is used to fingerprint the input
stream.
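
For reference, a generic buzhash over a fixed window looks roughly
like this. It is a sketch of the textbook algorithm, not dedup's
implementation: the 31-byte window, the struct layout and the table
contents (filled from a fixed xorshift32 stream) are all placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINSIZE 31 /* hypothetical window length in bytes */

struct buz {
	uint32_t tab[256];          /* one pseudo-random word per byte value */
	unsigned char win[WINSIZE]; /* the last WINSIZE input bytes */
	size_t pos;                 /* slot holding the oldest byte */
	uint32_t hash;
};

static uint32_t
rotl32(uint32_t x, unsigned n)
{
	n &= 31;
	return (x << n) | (x >> ((32 - n) & 31));
}

/* Fill the table with placeholder values and start on a zeroed window. */
void
buz_init(struct buz *b)
{
	uint32_t x = 2463534242u;
	size_t i;

	assert(b != NULL);
	for (i = 0; i < 256; i++) {
		x ^= x << 13;
		x ^= x >> 17;
		x ^= x << 5;
		b->tab[i] = x;
	}
	b->pos = 0;
	b->hash = 0;
	for (i = 0; i < WINSIZE; i++) {
		b->win[i] = 0;
		/* hash of WINSIZE zero bytes, so later updates stay consistent */
		b->hash = rotl32(b->hash, 1) ^ b->tab[0];
	}
}

/* Slide the window by one byte: drop the oldest byte, take in c. */
uint32_t
buz_update(struct buz *b, unsigned char c)
{
	unsigned char old = b->win[b->pos];

	b->win[b->pos] = c;
	b->pos = (b->pos + 1) % WINSIZE;
	b->hash = rotl32(b->hash, 1) ^ rotl32(b->tab[old], WINSIZE) ^ b->tab[c];
	return b->hash;
}
```

The property that matters for chunking is that the hash depends only
on the last WINSIZE bytes, so identical data produces identical cut
points regardless of what preceded it.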

When encryption is enabled, a random seed is generated and stored
encrypted in the repository state file. The seed is XOR-ed with the
buzhash initial state table to mitigate length fingerprinting
attacks.
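
The table derivation can be sketched in a few lines. The example
assumes a single 32-bit seed applied to every entry of a 256-entry
table; the text does not state the seed's width or how it is applied
to the table, so both are assumptions, as is the function name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Derive a per-repository table by XOR-ing every entry of the base
 * table with the secret seed. XOR with a constant is a bijection, so
 * distinct entries stay distinct, but the table in use is no longer
 * the publicly known one.
 */
void
buz_seed_table(uint32_t out[256], const uint32_t base[256], uint32_t seed)
{
	size_t i;

	assert(out != NULL && base != NULL);
	for (i = 0; i < 256; i++)
		out[i] = base[i] ^ seed;
}
```

Applying the same seed twice restores the base table, which is why
the seed has to be kept secret along with the rest of the repository
state.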

[0] http://www.serve.net/buz/Notes.1st.year/HTML/C6/rand.012.html