Deduplication best practices

Jan 1, 2018

Currently, we offer a post-process deduplication feature. This requires a backup job's data to occupy space on your RAID while deduplication occurs. Deduplication reads the backup job data and copies unique files to a repository that is also on the RAID; duplicate files are simply referenced and not copied. During this time, both the original backup job data and the repository copies of unique files are on the RAID. In the worst case (a backup job that consists entirely of unique files), the repository copies require free space equal to the backup job size, in addition to the space the backup itself requires. Consequently, before you run a backup job that will be deduplicated, the free space required to back up and deduplicate may be as much as double the size of the data to be backed up. When deduplication completes, the original backup job data is deleted from the RAID.

Given these requirements, the appliance checks the following before deduplicating a backup job on the RAID:

There must be at least 1.5 GB free space on the RAID, and
There must be more free space on the RAID than the size of the backup job.

For example, if a 100 GB backup job completes, the appliance requires a minimum of 100 GB of free space to begin the deduplication process. In the worst case, backing up and deduplicating that job may require up to 200 GB of free space in total.
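
The two checks and the worst-case estimate above can be expressed as a short calculation. The following is only an illustrative sketch, not the appliance's actual code: the function names, the decimal definition of a gigabyte, and the treatment of the 1.5 GB floor as an inclusive limit are all assumptions.

```python
# Illustrative sketch only; names and unit convention are assumptions.
GB = 1000 ** 3  # decimal gigabyte (assumed; the appliance may use binary units)

MIN_FREE_BYTES = int(1.5 * GB)  # the fixed 1.5 GB floor from the first check


def can_start_dedup(free_bytes: int, job_bytes: int) -> bool:
    """Return True when both pre-deduplication checks pass."""
    has_floor = free_bytes >= MIN_FREE_BYTES   # at least 1.5 GB free
    fits_job = free_bytes > job_bytes          # more free space than the job size
    return has_floor and fits_job


def worst_case_free_bytes(data_bytes: int) -> int:
    """Free space needed before the backup starts if every file is unique."""
    # One copy for the backup job data plus one copy in the repository.
    return 2 * data_bytes


if __name__ == "__main__":
    print(can_start_dedup(free_bytes=120 * GB, job_bytes=100 * GB))
    print(worst_case_free_bytes(100 * GB) // GB, "GB")
```

A helper like this could be used to estimate, ahead of a scheduled full backup, whether the RAID has enough headroom for the job to deduplicate.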

If your appliance is nearing its full capacity, you can try these suggestions to allow successful deduplication:

Stagger the full jobs of large clients.

The time between each job should account for its backup, importing, and deduplication time, plus a buffer. For example, if you have 3 clients that each take 30 minutes to back up, import, and deduplicate, schedule the full backups at least 45-60 minutes apart. The time it takes to back up and deduplicate varies with the size and number of files.

If a single client’s full backup is too large and will not deduplicate, we recommend that you break the client into two or more separate clients, each with a non-overlapping file set holding a roughly equal amount of data.

For example, if you have a client named SRV with 1 TB of data on C: and 1 TB of data on D:, create a client named SRV-A with C: in its file set, and a client named SRV-B with D: in its file set. Stagger these clients’ schedules as above. Because each job is then 1 TB, backing up and deduplicating one client at a time needs up to 2 TB of free space in the worst case, instead of the 4 TB needed for the single 2 TB client.
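
The staggering suggestion can be sketched as a small schedule calculation. This is a hypothetical helper for planning start times by hand, not a feature of the appliance: the function name, the 15-minute default buffer, and the example start time are assumptions chosen to match the 30-minute example above.

```python
from datetime import datetime, timedelta


def staggered_starts(first_start, durations_min, buffer_min=15):
    """Return one start time per job.

    Each job starts after the previous job's estimated backup, import,
    and deduplication time (in minutes) plus a buffer.
    """
    starts = []
    current = first_start
    for duration in durations_min:
        starts.append(current)
        current += timedelta(minutes=duration + buffer_min)
    return starts


if __name__ == "__main__":
    # Three clients that each take about 30 minutes end to end.
    for start in staggered_starts(datetime(2018, 1, 1, 22, 0), [30, 30, 30]):
        print(start.strftime("%H:%M"))
```

With a 15-minute buffer, the three 30-minute jobs start 45 minutes apart, which is the low end of the 45-60 minute spacing recommended above. Increase `buffer_min` if job durations vary from run to run.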