System and method for data driven de-duplication

US 8,495,028 B2
Filed: 09/08/2010
Issued: 07/23/2013
Est. Priority Date: 01/25/2010
Status: Active Grant

Abstract

Described are computer-based methods and apparatuses, including computer program products, for removing redundant data from a storage system. In one example, a data delineation process delineates data targeted for de-duplication into regions using a plurality of markers. The de-duplication system determines which of these regions should be subject to further de-duplication processing by comparing metadata representing the regions to metadata representing regions of a reference data set. The de-duplication system identifies an area of data that incorporates the regions that should be subject to further de-duplication processing and de-duplicates this area with reference to a corresponding area within the reference data set.
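
The abstract does not say how the markers are chosen, but claim 1 recites recording locations where a subset of a rolling hash value equals a predetermined value. A minimal sketch of that idea follows. The window size, polynomial base, modulus, bit mask, and target value are all illustrative assumptions, and `find_markers` is a hypothetical name that does not come from the patent.

```python
from typing import List

# All constants below are assumptions chosen for illustration only.
WINDOW = 16           # bytes covered by the rolling-hash window
BASE = 257            # polynomial base
MOD = (1 << 31) - 1   # modulus keeping the hash bounded
MASK = 0x3F           # the "subset" of the hash: its low 6 bits
MAGIC = 0x15          # the predetermined value the masked hash must equal


def find_markers(data: bytes) -> List[int]:
    """Return the offset just past every window whose masked hash equals MAGIC."""
    if len(data) < WINDOW:
        return []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    for b in data[:WINDOW]:           # hash of the first full window
        h = (h * BASE + b) % MOD
    markers = []
    if h & MASK == MAGIC:
        markers.append(WINDOW)
    for i in range(WINDOW, len(data)):
        h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + data[i]) % MOD          # bring in the newest byte
        if h & MASK == MAGIC:
            markers.append(i + 1)
    return markers
```

Because each decision depends only on the bytes inside the window, inserting bytes ahead of a passage shifts its markers without changing which content produces them. That property is what lets target and reference data be delineated independently and still line up.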

85 Citations

15 Claims

1. A computer implemented method of locating redundancy within data, the method comprising:
- recording target locations within target data where a summary that identifies a particular pattern within the target data equals a predetermined value;
  
  recording reference locations within reference data where a summary that identifies the particular pattern within the reference data equals the predetermined value;
  
  determining a reference set of summaries of the reference data, each member of the reference set of summaries including one or more summaries indicative of one or more patterns of reference data located at one or more recorded reference locations within the reference data;
  
  determining a target set of summaries of the target data, each member of the target set of summaries including one or more summaries indicative of one or more patterns of target data located at one or more recorded target locations within the target data;
  
  prioritizing at least one member from the reference set of summaries relative to at least one other member of the reference set of summaries by comparing at least one summary included in the at least one member from the reference set of summaries to at least one summary included in the at least one other member of the reference set of summaries;
  
  prioritizing at least one member from the target set of summaries relative to at least one other member of the target set of summaries by comparing at least one summary included in the at least one member from the target set of summaries to at least one summary included in the at least one other member of the target set of summaries;
  
  identifying a subset of the reference data that is likely to match a subset of the target data by comparing the at least one summary included in the at least one member from the reference set of summaries to the at least one summary included in the at least one member from the target set of summaries; and
  
  de-duplicating the subset of the target data with reference to the subset of the reference data, wherein recording the target locations includes recording target locations within the target data where a subset of a rolling hash value calculated for a region of the target data equals the predetermined value.
- 2. The method according to claim 1, wherein determining the reference set of summaries includes calculating a hash value over a portion of the reference data sharing a boundary with at least one recorded reference location.
- 3. The method according to claim 1, wherein identifying the subset of the reference data includes identifying an area of the target data associated with at least one member of the target set of summaries that matches at least one member of the reference set of summaries.
- 4. The method according to claim 3, wherein identifying the subset of the reference data includes identifying an area of the reference data associated with the at least one member of the reference set of summaries.
- 5. The method according to claim 1, further comprising adjusting the subset of the reference data after identifying a neighboring area of the reference data associated with at least one other member of the reference set of summaries that matches at least one member of the target set of summaries.
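
The summary, prioritization, and comparison steps of claim 1 can be pictured with a small sketch. It rests on several assumptions: each summary is a SHA-1 digest of the region between two neighboring recorded locations (one reading of claim 2), and "prioritizing" is read as keeping the numerically smallest digests so that both data sets tend to select the same candidates. The claims do not fix either choice, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # (start, end) byte offsets, end exclusive


def region_summaries(data: bytes, markers: List[int]) -> Dict[str, Region]:
    """Hash each region bounded by recorded locations (assumed: SHA-1)."""
    inner = [m for m in markers if 0 < m < len(data)]
    bounds = sorted(set([0] + inner + [len(data)]))
    summaries: Dict[str, Region] = {}
    for start, end in zip(bounds, bounds[1:]):
        digest = hashlib.sha1(data[start:end]).hexdigest()
        summaries.setdefault(digest, (start, end))  # keep first occurrence
    return summaries


def prioritize(summaries: Dict[str, Region], keep: int) -> List[str]:
    """Rank members by comparing their digests; keep the smallest (assumed rule)."""
    return sorted(summaries)[:keep]


def likely_matches(ref: Dict[str, Region], tgt: Dict[str, Region],
                   keep: int) -> List[Tuple[Region, Region]]:
    """Pair reference and target regions whose prioritized digests agree."""
    ref_top = set(prioritize(ref, keep))
    return [(ref[d], tgt[d]) for d in prioritize(tgt, keep) if d in ref_top]
```

The marker lists are plain inputs here, so any delineation scheme can feed this step. With a small `keep`, a shared region can be missed when one side has other, smaller digests, which is consistent with the claim language of identifying data that is only "likely to match".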

6. A system configured to locate redundancy within data, the system comprising:
- data storage storing reference data and target data; and
  
  a processor coupled to the data storage and configured to:
  
  record target locations within the target data where a summary that identifies a particular pattern within the target data equals a predetermined value;
  
  record reference locations within the reference data where a summary that identifies the particular pattern within the reference data equals the predetermined value;
  
  determine a reference set of summaries of the reference data, each member of the reference set of summaries including one or more summaries indicative of one or more patterns of reference data located at one or more recorded reference locations within the reference data;
  
  determine a target set of summaries of the target data, each member of the target set of summaries including one or more summaries indicative of one or more patterns of target data located at one or more recorded target locations within the target data;
  
  prioritize at least one member from the reference set of summaries relative to at least one other member of the reference set of summaries by comparing at least one summary included in the at least one member from the reference set of summaries to at least one summary included in the at least one other member of the reference set of summaries;
  
  prioritize at least one member from the target set of summaries relative to at least one other member of the target set of summaries by comparing at least one summary included in the at least one member from the target set of summaries to at least one summary included in the at least one other member of the target set of summaries;
  
  identify a subset of the reference data that is likely to match a subset of the target data by comparing the at least one summary included in the at least one member from the reference set of summaries to the at least one summary included in the at least one member from the target set of summaries; and
  
  de-duplicate the subset of the target data with reference to the subset of the reference data, wherein the processor is configured to record the target locations by recording target locations within the target data where a subset of a rolling hash value calculated for a region of the target data equals the predetermined value.
- 7. The system according to claim 6, wherein the processor is configured to determine the reference set of summaries by calculating a hash value over a portion of the reference data sharing a boundary with at least one recorded reference location.
- 8. The system according to claim 6, wherein the processor is configured to identify the subset of the reference data by, at least in part, identifying an area of the target data associated with at least one member of the target set of summaries that matches at least one member of the reference set of summaries.
- 9. The system according to claim 8, wherein the processor is configured to identify the subset of the reference data by, at least in part, identifying an area of the reference data associated with the at least one member of the reference set of summaries.
- 10. The system according to claim 6, wherein the processor is further configured to adjust the subset of the reference data after identifying a neighboring area of the reference data associated with at least one other member of the reference set of summaries that matches at least one member of the target set of summaries.
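
Claims 5 and 10 describe adjusting the identified subset once a neighboring area also matches. One way to picture this, offered only as an illustrative reading, is to merge matched region pairs that are contiguous in both the reference and the target into a single larger extent. The `merge_neighbors` helper and the tuple layout are assumptions, not taken from the patent.

```python
from typing import List, Tuple

Region = Tuple[int, int]       # (start, end) byte offsets, end exclusive
Match = Tuple[Region, Region]  # (reference region, target region)


def merge_neighbors(matches: List[Match]) -> List[Match]:
    """Grow a matched area when the next match abuts it on both sides."""
    merged: List[Match] = []
    for ref, tgt in sorted(matches, key=lambda m: m[1]):  # walk in target order
        if merged:
            prev_ref, prev_tgt = merged[-1]
            # Neighboring only if contiguous in reference AND target.
            if prev_ref[1] == ref[0] and prev_tgt[1] == tgt[0]:
                merged[-1] = ((prev_ref[0], ref[1]), (prev_tgt[0], tgt[1]))
                continue
        merged.append((ref, tgt))
    return merged
```

Merging before de-duplication means fewer, longer extents are handed to the final step, at the cost of requiring the regions to be adjacent in both data sets.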

11. A non-transitory computer readable medium storing computer readable instructions that, when executed by at least one processor, instruct the at least one processor to perform a method of locating redundancy within data, the method comprising:
- recording target locations within target data where a summary that identifies a particular pattern within the target data equals a predetermined value;
  
  recording reference locations within reference data where a summary that identifies the particular pattern within the reference data equals the predetermined value;
  
  determining a reference set of summaries of the reference data, each member of the reference set of summaries including one or more summaries indicative of one or more patterns of reference data located at one or more recorded reference locations within the reference data;
  
  determining a target set of summaries of the target data, each member of the target set of summaries including one or more summaries indicative of one or more patterns of target data located at one or more recorded target locations within the target data;
  
  prioritizing at least one member from the reference set of summaries relative to at least one other member of the reference set of summaries by comparing at least one summary included in the at least one member from the reference set of summaries to at least one summary included in the at least one other member of the reference set of summaries;
  
  prioritizing at least one member from the target set of summaries relative to at least one other member of the target set of summaries by comparing at least one summary included in the at least one member from the target set of summaries to at least one summary included in the at least one other member of the target set of summaries;
  
  identifying a subset of the reference data that is likely to match a subset of the target data by comparing the at least one summary included in the at least one member from the reference set of summaries to the at least one summary included in the at least one member from the target set of summaries; and
  
  de-duplicating the subset of the target data with reference to the subset of the reference data, wherein the instructions for recording the target locations instruct the processor to perform acts including recording target locations within the target data where a subset of a rolling hash value calculated for a region of the target data equals the predetermined value.
- 12. The computer readable medium according to claim 11, wherein the instructions for determining the reference set of summaries instruct the processor to perform acts including calculating a hash value over a portion of the reference data sharing a boundary with at least one recorded reference location.
- 13. The computer readable medium according to claim 11, wherein the instructions for identifying the subset of the reference data instruct the processor to perform acts including identifying an area of the target data associated with at least one member of the target set of summaries that matches at least one member of the reference set of summaries.
- 14. The computer readable medium according to claim 13, wherein the instructions for identifying the subset of the reference data instruct the processor to perform acts including identifying an area of the reference data associated with the at least one member of the reference set of summaries.
- 15. The computer readable medium according to claim 11, wherein the instructions further instruct the processor to perform acts including adjusting the subset of the reference data after identifying a neighboring area of the reference data associated with at least one other member of the reference set of summaries that matches at least one member of the target set of summaries.
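
Each independent claim ends by de-duplicating the target subset with reference to the reference subset, without saying how the result is stored. The sketch below assumes one simple encoding: verified matches become `("ref", start, length)` pointers into the reference data and everything else stays as `("raw", bytes)` literals. The encoding, the byte-for-byte verification, and the function names are all assumptions made for illustration.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) byte offsets, end exclusive


def dedupe(reference: bytes, target: bytes,
           matches: List[Tuple[Region, Region]]) -> List[tuple]:
    """Replace verified matching spans of target with pointers into reference."""
    ops: List[tuple] = []
    pos = 0  # next target byte not yet emitted
    for (rs, re_), (ts, te) in sorted(matches, key=lambda m: m[1]):
        # Skip overlapping candidates and summary false positives.
        if ts < pos or reference[rs:re_] != target[ts:te]:
            continue
        if ts > pos:
            ops.append(("raw", target[pos:ts]))
        ops.append(("ref", rs, re_ - rs))
        pos = te
    if pos < len(target):
        ops.append(("raw", target[pos:]))
    return ops


def restore(reference: bytes, ops: List[tuple]) -> bytes:
    """Rebuild the original target from the op list and the reference data."""
    out = bytearray()
    for op in ops:
        if op[0] == "ref":
            _, start, length = op
            out += reference[start:start + length]
        else:
            out += op[1]
    return bytes(out)
```

Comparing the actual bytes before emitting a pointer reflects that matching summaries only indicate a likely match, so a candidate that fails verification is left as literal data.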

Current Assignee: Hitachi Vantara, LLC (Hitachi, Ltd.)
Original Assignee: Sepaton Incorporated (Hitachi, Ltd.)
Inventors: Reiter, Timmie G.; McMaster, Carey Jay; Trimble, Ronald Ray; King, Stefan Merrill; Biernacki, David Michael; Kennedy, Jon Christopher
Primary Examiner(s): GOFMAN, ALEX N
Application Number: US12/877,719
Publication Number: US 20110184921A1
Time in Patent Office: 1,049 Days
Field of Search: 707/692, 707/687, 707/698
US Class Current: 707/687
CPC Class Codes: G06F 11/1453 using de-duplication of the...
