Deduplication Seeding

US 20130179407A1
Filed: 01/11/2012
Published: 07/11/2013
Est. Priority Date: 01/11/2012
Status: Active Grant

Abstract

Apparatus, methods, and other embodiments associated with de-duplication seeding are described. One example method includes re-configuring a data de-duplication repository with a blocklet from a data de-duplication seed corpus. Reconfiguring the repository may include adding a blocklet from the seed corpus to the repository, activating a blocklet identified with the seed corpus in the repository, removing a blocklet from the repository, and de-activating a blocklet in the repository. The example method may also include re-configuring a data de-duplication index associated with the data de-duplication repository with information about the blocklet. Reconfiguring the repository and the index increases the likelihood that a blocklet ingested by a data de-duplication apparatus that relies on the repository and the index will be treated as a duplicate blocklet by the data de-duplication apparatus.
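The seeding idea in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `DedupStore` class, the SHA-256 fingerprint, and the dictionary-based repository and index are all assumptions chosen for illustration. The sketch shows a repository and index being loaded from a seed corpus before ingest, so that a matching blocklet arriving later is treated as a duplicate.

```python
import hashlib


def fingerprint(blocklet: bytes) -> str:
    """Hypothetical blocklet identifier: a SHA-256 hex digest."""
    return hashlib.sha256(blocklet).hexdigest()


class DedupStore:
    """Toy repository plus index; both are plain dicts (an assumption)."""

    def __init__(self):
        self.repository = {}  # fingerprint -> blocklet bytes
        self.index = {}       # fingerprint -> {"active": bool, "refs": int}

    def seed(self, seed_corpus):
        """Add or activate blocklets from a seed corpus; return how many were added."""
        added = 0
        for blocklet in seed_corpus:
            fp = fingerprint(blocklet)
            if fp not in self.repository:
                self.repository[fp] = blocklet
                self.index[fp] = {"active": True, "refs": 0}
                added += 1
            else:
                # Already stored: just make sure the index entry is active.
                self.index[fp]["active"] = True
        return added

    def ingest(self, blocklet):
        """Return True if the blocklet is treated as a duplicate."""
        fp = fingerprint(blocklet)
        entry = self.index.get(fp)
        if entry is not None and entry["active"]:
            entry["refs"] += 1
            return True
        # Unknown (or inactive) blocklet: store it as new unique data.
        self.repository[fp] = blocklet
        self.index[fp] = {"active": True, "refs": 1}
        return False
```

In this sketch, an unseeded store reports the first occurrence of a blocklet as unique, while a store seeded with the same blocklet reports it as a duplicate on first ingest.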

20 Claims

1. A non-transitory computer-readable medium storing computer-executable instructions that when executed by a computer cause the computer to perform a data de-duplication method, the method comprising:
- re-configuring a data de-duplication repository with a first blocklet taken from a source other than a data stream being ingested by a data de-duplication apparatus; and
  
  re-configuring a data de-duplication index associated with the data de-duplication repository with index information about the first blocklet, where reconfiguring the data de-duplication repository or the data de-duplication index increases the likelihood that a second blocklet will be treated as a duplicate blocklet when processed by the data de-duplication apparatus using the data de-duplication repository and the data de-duplication index to support duplicate blocklet determinations.
  - 2. The non-transitory computer-readable medium of claim 1, where the source other than the data stream being ingested is a seed corpus, and where re-configuring the repository comprises moving the first blocklet from the seed corpus into the repository.
  - 3. The non-transitory computer-readable medium of claim 1, where re-configuring the repository comprises activating the first blocklet in the repository.
  - 4. The non-transitory computer-readable medium of claim 1, where reconfiguring the index comprises moving information about the first blocklet into the index.
  - 5. The non-transitory computer-readable medium of claim 1, where reconfiguring the index comprises activating information about the first blocklet in the index.
  - 6. The non-transitory computer-readable medium of claim 2, comprising selecting the seed corpus from two or more available seed corpora.
  - 7. The non-transitory computer-readable medium of claim 6, where the seed corpus is selected as a function of one or more of, a relationship between data to be ingested by the data de-duplication apparatus and the seed corpus, a historical performance measurement associated with the seed corpus, an on-the-fly performance measurement associated with the seed corpus, a user action, a calendar date, a day of the week, a time of day, a user identity, and an occurrence of a pre-defined event.
  - 8. The non-transitory computer-readable medium of claim 2, comprising generating a new seed corpus.
  - 9. The non-transitory computer-readable medium of claim 8, where generating the new seed corpus comprises selecting a seed blocklet from an existing repository based, at least in part, on one or more of, a reference count associated with the seed blocklet, an attribute describing the generalness of the seed blocklet, a trial and error approach, and a random approach.
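Claims 8 and 9 describe generating a new seed corpus from an existing repository, with reference count as one possible selection criterion. A minimal sketch of that one criterion follows. The function name `generate_seed_corpus`, the `min_refs` threshold, and the `limit` parameter are hypothetical and do not come from the claims.

```python
def generate_seed_corpus(repository, ref_counts, min_refs=2, limit=None):
    """Pick seed blocklets from an existing repository by reference count.

    repository: dict mapping fingerprint -> blocklet bytes (assumed layout)
    ref_counts: dict mapping fingerprint -> number of references
    min_refs:   hypothetical threshold; blocklets below it are skipped
    limit:      optional cap on the size of the new corpus
    """
    candidates = [fp for fp in repository if ref_counts.get(fp, 0) >= min_refs]
    # Most-referenced first; ties broken by fingerprint for a stable order.
    candidates.sort(key=lambda fp: (-ref_counts[fp], fp))
    if limit is not None:
        candidates = candidates[:limit]
    return [repository[fp] for fp in candidates]
```

The other criteria named in claim 9 (generalness, trial and error, random selection) would replace the filter and sort key here; they are left out to keep the sketch short.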

10. A data de-duplication apparatus, comprising:
- a processor;
  
  a memory;
  
  a set of logics; and
  
  an interface to connect the processor, the memory, and the set of logics, the set of logics comprising:
  
  a first logic configured to manipulate a data de-duplication repository with a first blocklet associated with a seed corpus, where the data de-duplication apparatus uses the data de-duplication repository to make duplicate blocklet determinations; and
  
  a second logic configured to manipulate a data de-duplication index with information about the first blocklet, where the data de-duplication apparatus uses the data de-duplication index to make duplicate determinations, where manipulating the repository with the first blocklet and manipulating the index with the information about the first blocklet change the likelihood that a second blocklet processed by the data de-duplication apparatus will be treated as a duplicate blocklet.
  - 11. The apparatus of claim 10, where the first logic is configured to manipulate the repository by adding the first blocklet from the seed corpus to the repository, and where the second logic is configured to manipulate the index by adding the information about the first blocklet to the index.
  - 12. The apparatus of claim 10, where the first logic is configured to manipulate the repository by moving the first blocklet from a first storage device having a first access time to a second storage device having a second access time, where the second access time is faster than the first access time, and where the second logic is configured to manipulate the index with information about the location of the first blocklet in the second storage device.
  - 13. The apparatus of claim 10, comprising a third logic configured to select the seed corpus from two or more available seed corpora.
  - 14. The apparatus of claim 13, where the two or more available seed corpora include data associated with one or more of, a generic seed corpus, a company specific seed corpus, an application specific seed corpus, a user specific seed corpus, a topic specific seed corpus, a language specific seed corpus, a calendar day specific seed corpus, a day of the week specific seed corpus, a pre-defined event seed corpus, a type of backup seed corpus, and a random seed corpus.
  - 15. The apparatus of claim 13, where the third logic is configured to select the seed corpus based, at least in part, on one or more characteristics of data to be ingested by the data de-duplication apparatus.
  - 16. The apparatus of claim 13, comprising a fourth logic configured to produce a new seed corpus based, at least in part, on one or more of, selecting a blocklet for the seed corpus from an existing repository as a function of a reference count associated with a blocklet in the repository, selecting a blocklet for the seed corpus from an existing repository as a function of how likely the blocklet in the existing repository is to produce a generic match in a data stream to be ingested by the data de-duplication apparatus, and selecting a blocklet for the seed corpus from an existing repository using a random approach.
  - 17. The apparatus of claim 10, where the first logic is configured to manipulate the repository using either a seed corpus that was provided at the time of the initial configuration of the data de-duplication apparatus or a seed corpus that was provided after the time of the initial configuration of the data de-duplication apparatus.
  - 18. The apparatus of claim 10, where the first logic is configured to manipulate the repository using a seed corpus that is part of one or more of, a hierarchy of seed corpora, and a grouping of seed corpora.
  - 19. The apparatus of claim 10, where the first logic is configured to remove a selected blocklet from the repository and to replace the selected blocklet with the first blocklet.
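Claim 12 describes moving a blocklet from slower to faster storage and recording its new location in the index. The following sketch illustrates that step under simplifying assumptions: each storage device is modeled as a dict, and the index entry is a small dict with a hypothetical `tier` field. None of these names appear in the claims.

```python
def promote_blocklet(fp, slow_tier, fast_tier, index):
    """Move one blocklet to the faster tier and record its new location.

    fp:        blocklet fingerprint (any hashable key)
    slow_tier: dict standing in for the slower storage device
    fast_tier: dict standing in for the faster storage device
    index:     dict mapping fingerprint -> {"tier": ...} (assumed layout)
    Returns True if the blocklet was moved, False if it was already promoted.
    """
    if fp in fast_tier:
        return False
    # Raises KeyError if the blocklet is in neither tier.
    fast_tier[fp] = slow_tier.pop(fp)
    index[fp] = {"tier": "fast"}
    return True
```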

20. A system, comprising:
- means for identifying a property of a data stream being processed by a data de-duplication apparatus; and
  
  means for updating a data de-duplication repository of unique blocks in use by the data de-duplication apparatus with data from a data de-duplication seed corpus, where the data from the data de-duplication seed corpus is configured to increase a de-duplication rate for the data stream being processed.
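Claim 20 pairs identifying a property of the incoming data stream with updating the repository from a seed corpus, and claims 13 to 15 describe choosing among several available corpora. One way to picture this is sketched below. The property used here (fingerprint overlap between a sample of the stream and each candidate corpus) is an assumption for illustration, as are the function names and the corpus names in the usage lines.

```python
import hashlib


def fingerprints(blocklets):
    """Set of SHA-256 hex digests for an iterable of blocklets."""
    return {hashlib.sha256(b).hexdigest() for b in blocklets}


def select_seed_corpus(sample, corpora):
    """Return the name of the corpus sharing the most blocklets with the sample.

    sample:  blocklets taken from the start of the data stream
    corpora: dict mapping corpus name -> list of blocklets
    Returns None when no corpus overlaps the sample at all.
    """
    sample_fps = fingerprints(sample)
    best_name, best_hits = None, 0
    for name, corpus in corpora.items():
        hits = len(sample_fps & fingerprints(corpus))
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name


# Usage with made-up corpora: the sample overlaps the "vm" corpus most.
corpora = {
    "office": [b"docx-hdr", b"xlsx-hdr"],
    "vm": [b"boot", b"kernel"],
}
chosen = select_seed_corpus([b"kernel", b"boot", b"other"], corpora)
```

The chosen corpus would then be loaded into the repository and index before the rest of the stream is processed.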

Current Assignee: Quantum Corporation (Chi Ko Investment Co., Ltd.)
Original Assignee: Quantum Corporation (Chi Ko Investment Co., Ltd.)
Inventors: STOAKES, Timothy
Granted Patent: US 8,892,526 B2

Field of Search
US Class Current: 707/692
CPC Class Codes: G06F 16/1752 based on file chunks
