Block-based versus File-based Deduplication


Businesses have ever-increasing amounts of data to protect, which increases backup time, storage cost and network utilization. Upgrading storage and network equipment can help partially, but it comes with its own costs. If the backup sources contain a lot of duplicated data, data deduplication is an effective way to reduce backup size and time. This document discusses the considerations and options when evaluating data deduplication.

Different types of duplication

Data can be duplicated in different ways, and the type of duplication affects both how effective deduplication is and how much it costs at run time.

  • Identical files – Multiple computers running the same OS and application versions have gigabytes of identical files with matching dates.

  • Same file downloaded multiple times – If the same file is downloaded at different times, such as different users saving the same email attachment, the downloaded copies often have different dates, each set according to when that download occurred.

  • Common data blocks in different files – An example is the same email attachment independently stored in different users’ Windows Outlook Archive.pst files.

Block-based deduplication

Some backup and storage solutions provide block-based deduplication. They divide each file into smaller blocks and checksum each block individually. Blocks with the same checksum are stored only once. The advantage of this method is that it handles all the types of duplication described above. However, calculating and maintaining the checksums for a large number of blocks can be resource intensive in terms of memory and processor usage. For example, a common recommendation for the ZFS file system is 5 GB of physical memory or SSD for each TB of deduplicated storage. To cope with such overhead, block-based deduplication products adopt different tradeoffs:

  • Server-side deduplication – All data blocks are read from the backup source and sent to the server for deduplication, to avoid burdening the backup source’s memory and processor. However, this doesn’t reduce source-side disk IO operations or network traffic.

  • Post processing – All data blocks are written to server storage and deduplicated later during off-peak hours, to minimize the impact on the server’s normal workload. However, this roughly doubles server-side storage IO operations and temporarily consumes more server storage.

  • Hardware acceleration – Overhead can be shifted from the server to dedicated hardware, such as a higher-cost storage appliance, which adds hardware cost.
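The basic mechanism described in this section, dividing files into blocks and storing each unique checksum only once, can be illustrated with a minimal sketch. It is not any product's implementation. The 4 KiB block size, SHA-256 checksum, in-memory dictionary store and all function names are assumptions made for illustration. The index-size estimate at the end uses an assumed average block size and an assumed per-entry cost, chosen only to show how a figure on the order of 5 GB per TB can arise.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch; real products vary


def store_file(data: bytes, store: dict, block_size: int = BLOCK_SIZE) -> list:
    """Split data into fixed-size blocks, keep each unique block once,
    and return the ordered list of checksums needed to rebuild the file."""
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # a block already present is not stored again
        recipe.append(digest)
    return recipe


def restore_file(recipe: list, store: dict) -> bytes:
    """Rebuild a file by concatenating its blocks in recipe order."""
    return b"".join(store[digest] for digest in recipe)


def estimate_index_bytes(stored_bytes: int, avg_block_size: int, bytes_per_entry: int) -> int:
    """Rough size of the checksum index: one entry per block."""
    blocks = -(-stored_bytes // avg_block_size)  # ceiling division
    return blocks * bytes_per_entry


# Two hypothetical files with different headers that embed the same
# attachment at a block-aligned offset.
attachment = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))  # 4 distinct blocks
file_a = b"A" * BLOCK_SIZE + attachment
file_b = b"B" * BLOCK_SIZE + attachment

store = {}
recipe_a = store_file(file_a, store)
recipe_b = store_file(file_b, store)

logical_blocks = len(recipe_a) + len(recipe_b)  # blocks referenced by both files
unique_blocks = len(store)                      # blocks actually kept

# Illustrative assumptions only: 64 KiB average block, 320 bytes per index entry.
index_bytes = estimate_index_bytes(2**40, 64 * 1024, 320)
```

Fixed-size blocks only detect duplicates that fall on the same block boundaries. Some products use variable-size chunking to find shared data at arbitrary offsets, at the cost of more processing.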

File-based deduplication

Image-based backup products often produce larger backups and depend more heavily on block-based deduplication. Retrospect instead performs file-level backup and deduplication based on file name, date and size. If a file at the backup source matches one already in the backup set, Retrospect skips the file without reading it. This efficiently deduplicates identical files across backup sources, reducing disk IO operations at the backup sources, network traffic, and server storage IO operations and space consumption. However, it does not detect the other types of data duplication mentioned above. This limitation can be partially offset by using compression and block-level incremental backup to reduce backup size and time.
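The matching rule above can be sketched as a lookup on file metadata. This is a simplified illustration, not Retrospect's actual implementation. The `FileEntry` type, the `files_to_copy` function and the sample file names are hypothetical, and timestamps are kept as plain strings for brevity.

```python
from typing import NamedTuple


class FileEntry(NamedTuple):
    name: str
    modified: str  # modification date, kept as a string for simplicity
    size: int


def files_to_copy(source: list, backup_index: set) -> list:
    """Return the source files not yet in the backup set.
    Matching uses name, date and size only, so file content is never read."""
    to_copy = []
    for entry in source:
        if entry in backup_index:
            continue  # already backed up from some source: skip without reading
        to_copy.append(entry)
        backup_index.add(entry)
    return to_copy


backup_index = set()

# Hypothetical sources: both machines have the same OS file, and each user
# saved the same attachment at a different time.
machine_a = [
    FileEntry("notepad.exe", "2014-01-10T08:00:00", 193536),
    FileEntry("report.pdf", "2014-09-01T09:15:00", 482113),
]
machine_b = [
    FileEntry("notepad.exe", "2014-01-10T08:00:00", 193536),
    FileEntry("report.pdf", "2014-09-02T14:40:00", 482113),
]

copied_a = files_to_copy(machine_a, backup_index)
copied_b = files_to_copy(machine_b, backup_index)
```

In this sketch the identical OS file on the second machine is skipped, while the attachment saved at a different time is copied again because its date differs. This is the limitation noted above.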

Combining the best of both worlds

We are in the process of certifying Retrospect compatibility with block-based deduplication solutions, such as Windows Server 2012’s built-in Data Deduplication. The combination takes advantage of Retrospect’s resource-efficient file-based deduplication to reduce backup size, thereby reducing the workload of subsequent server-side block-based deduplication.

VM backup

Retrospect’s file-level deduplication applies across physical and virtual backup sources. For more information, see our article.


Last Update: September 9, 2014