System and method for providing data driven de-duplication services

US 8,447,741 B2
Filed: 09/08/2010
Issued: 05/21/2013
Est. Priority Date: 01/25/2010
Status: Active Grant

Abstract

Described are computer-based methods and apparatuses, including computer program products, for removing redundant data from a storage system. In one example, a data delineation process delineates data targeted for de-duplication into regions using a plurality of markers. The de-duplication system determines which of these regions should be subject to further de-duplication processing by comparing metadata representing the regions to metadata representing regions of a reference data set. The de-duplication system identifies an area of data that incorporates the regions that should be subject to further de-duplication processing and de-duplicates this area with reference to a corresponding area within the reference data set.
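The delineation and metadata comparison described above can be illustrated with a minimal sketch. It is not the patented implementation: the marker rule (a CRC-32 of a short trailing window being divisible by a constant), the constants `WINDOW` and `DIVISOR`, the use of SHA-256 as region metadata, and all function names are assumptions chosen for illustration.

```python
import hashlib
import zlib

WINDOW = 4    # bytes examined at each position (illustrative value)
DIVISOR = 8   # roughly one marker every DIVISOR bytes (illustrative value)


def delineate(data: bytes) -> list[bytes]:
    """Split data into regions, ending a region wherever a marker occurs.

    Assumed marker rule: a position is a marker when the CRC-32 of the
    preceding WINDOW bytes is divisible by DIVISOR, so marker placement
    depends only on local content.
    """
    regions, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            regions.append(data[start:end])
            start = end
    if start < len(data):
        regions.append(data[start:])  # trailing bytes after the last marker
    return regions


def summarize(regions: list[bytes]) -> list[str]:
    """Return one hash per region to serve as that region's metadata."""
    return [hashlib.sha256(r).hexdigest() for r in regions]


def regions_worth_processing(target: bytes, reference: bytes) -> list[int]:
    """Indices of target regions whose metadata also appears in the reference."""
    ref_summaries = set(summarize(delineate(reference)))
    target_summaries = summarize(delineate(target))
    return [i for i, s in enumerate(target_summaries) if s in ref_summaries]
```

Because the assumed marker rule looks only at nearby bytes, data inserted early in a stream tends to shift region boundaries only locally, which is why content-defined markers are a common choice for this kind of delineation.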

91 Citations

17 Claims

1. A computer implemented method of identifying reference data likely to match target data, the method comprising:
- reading a reference set of summaries of data included in a reference data set, each member of the reference set of summaries including a plurality of summaries that indicate particular patterns of the reference data within the reference data set;
- comparing the reference set of summaries to a target set of summaries associated with at least one target area of a plurality of target areas, each member of the target set of summaries including a plurality of summaries that indicate particular patterns of the target data included in the at least one target area, the plurality of target areas being included in a target data set;
- associating the at least one target area with the reference data set when a threshold number of members of the target set of summaries associated with the at least one target area match members of the reference set of summaries, wherein the reference data set includes a plurality of reference areas, each reference area of the plurality of reference areas being associated with at least one member of the reference set of summaries; and
- selecting at least one reference area of the plurality of reference areas based on a number of members of the target set of summaries associated with the at least one target area that match members of the reference set of summaries associated with the at least one reference area.

2. The method according to claim 1, wherein reading the reference set of summaries includes reading a set of hash values.

3. The method according to claim 1, wherein selecting the at least one reference area of the plurality of reference areas includes selecting at least one reference area of the plurality of reference areas based on a number of members of the target set of summaries associated with the at least one target area that match members of the reference set of summaries associated with at least one neighboring reference area of the plurality of reference areas that neighbors the at least one reference area.

4. The method according to claim 1, further comprising adjusting the at least one reference area to include the at least one neighboring reference area when at least one member of the target set of summaries associated with the at least one target area matches at least one member of the reference set of summaries associated with the at least one neighboring reference area.

5. The method according to claim 3, further comprising adjusting the at least one target area to include at least one neighboring target area when at least one member of the reference set of summaries associated with the at least one reference area matches at least one member of the target set of summaries associated with the at least one neighboring target area.

6. The method according to claim 5, further comprising de-duplicating the at least one target area with reference to the at least one reference area.
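The associating and selecting steps of claim 1 can be sketched as follows. This is a simplified illustration, not the claimed method itself: each member of a set of summaries is modeled as a single hash string (in line with claim 2) rather than as a group of summaries, and the function name `select_reference_areas`, the area names, and the choice to rank every matching area are assumptions.

```python
from collections import Counter


def select_reference_areas(
    target_summaries: set[str],
    reference_areas: dict[str, set[str]],
    threshold: int,
) -> list[str]:
    """Return reference area names ranked by matching-summary count.

    Returns an empty list when fewer than `threshold` target summaries match
    anywhere in the reference set, meaning the target area is not associated
    with the reference data set.
    """
    # The reference set of summaries is the union over all reference areas.
    reference_set = set().union(*reference_areas.values())

    # Associate only when enough target summaries appear in the reference set.
    if len(target_summaries & reference_set) < threshold:
        return []

    # Count matches per reference area and keep the areas with any match.
    counts = Counter(
        {name: len(target_summaries & summaries)
         for name, summaries in reference_areas.items()}
    )
    return [name for name, n in counts.most_common() if n > 0]
```

A caller that wants a single reference area would take the first element of the returned list; returning the full ranking keeps the lower-scoring areas available as candidates.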

7. A system configured to identify reference data likely to match target data, the system comprising:
- data storage storing a target data set; and
- a processor coupled to the data storage and configured to:
  - read a reference set of summaries of data included in a reference data set, each member of the reference set of summaries including a plurality of summaries that indicate particular patterns of the reference data within the reference data set;
  - compare the reference set of summaries to a target set of summaries associated with at least one target area of a plurality of target areas, each member of the target set of summaries including a plurality of summaries that indicate particular patterns of the target data included in the at least one target area, the plurality of target areas being included in the target data set;
  - associate the at least one target area with the reference data set when a threshold number of members of the target set of summaries associated with the at least one target area match members of the reference set of summaries, wherein the reference data set includes a plurality of reference areas, each reference area of the plurality of reference areas being associated with at least one member of the reference set of summaries; and
  - select at least one reference area of the plurality of reference areas based on a number of members of the target set of summaries associated with the at least one target area that match members of the reference set of summaries associated with the at least one reference area.

8. The system according to claim 7, wherein the processor is configured to read the reference set of summaries by reading a set of hash values.

9. The system according to claim 7, wherein the processor is configured to select the at least one reference area of the plurality of reference areas by selecting at least one reference area of the plurality of reference areas based on a number of members of the target set of summaries associated with the at least one target area that match members of the reference set of summaries associated with at least one neighboring reference area of the plurality of reference areas that neighbors the at least one reference area.

10. The system according to claim 8, wherein the processor is further configured to adjust the at least one reference area to include the at least one neighboring reference area when at least one member of the target set of summaries associated with the at least one target area matches at least one member of the reference set of summaries associated with the at least one neighboring reference area.

11. The system according to claim 10, wherein the processor is further configured to adjust the at least one target area to include at least one neighboring target area when at least one member of the reference set of summaries associated with the at least one reference area matches at least one member of the target set of summaries associated with the at least one neighboring target area.

12. The system according to claim 11, wherein the processor is further configured to de-duplicate the at least one target area with reference to the at least one reference area.

13. A non-transitory computer readable medium storing computer readable instructions that, when executed by at least one processor, instruct the at least one processor to perform a method of identifying reference data likely to match target data, the method comprising:
- reading a reference set of summaries of data included in a reference data set, each member of the reference set of summaries including a plurality of summaries that indicate particular patterns of the reference data within the reference data set;
- comparing the reference set of summaries to a target set of summaries associated with at least one target area of a plurality of target areas, each member of the target set of summaries including a plurality of summaries that indicate particular patterns of the target data included in the at least one target area, the plurality of target areas being included in a target data set;
- associating the at least one target area with the reference data set when a threshold number of members of the target set of summaries associated with the at least one target area match members of the reference set of summaries, wherein the reference data set includes a plurality of reference areas, each reference area of the plurality of reference areas being associated with at least one member of the reference set of summaries; and
- selecting at least one reference area of the plurality of reference areas based on a number of members of the target set of summaries associated with the at least one target area that match members of the reference set of summaries associated with the at least one reference area.

14. The computer readable medium according to claim 13, wherein the instructions for selecting the at least one reference area of the plurality of reference areas instruct the processor to perform acts including selecting at least one reference area of the plurality of reference areas based on a number of members of the target set of summaries associated with the at least one target area that match members of the reference set of summaries associated with at least one neighboring reference area of the plurality of reference areas that neighbors the at least one reference area.

15. The computer readable medium according to claim 14, wherein the instructions further instruct the processor to perform acts including adjusting the at least one reference area to include the at least one neighboring reference area when at least one member of the target set of summaries associated with the at least one target area matches at least one member of the reference set of summaries associated with the at least one neighboring reference area.

16. The computer readable medium according to claim 15, wherein the instructions further instruct the processor to perform acts including adjusting the at least one target area to include at least one neighboring target area when at least one member of the reference set of summaries associated with the at least one reference area matches at least one member of the target set of summaries associated with the at least one neighboring target area.

17. The computer readable medium according to claim 16, wherein the instructions further instruct the processor to perform acts including de-duplicating the at least one target area with reference to the at least one reference area.
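The neighbor adjustments recited in the dependent claims can be sketched as a range that grows over adjacent areas. The sketch rests on several assumptions that the claims do not state: areas are stored in order as a list of summary sets, a selection is an inclusive `(first, last)` index range, growth repeats until a neighbor fails to match, and the names `expand_to_neighbors` and `summaries_in` are hypothetical.

```python
def expand_to_neighbors(
    areas: list[set[str]],
    selected: tuple[int, int],
    other_summaries: set[str],
) -> tuple[int, int]:
    """Grow an inclusive (first, last) range of areas over matching neighbors.

    A neighboring area is absorbed when at least one of its summaries also
    appears in `other_summaries`; growth stops at the first non-matching
    neighbor on each side.
    """
    first, last = selected
    while first > 0 and areas[first - 1] & other_summaries:
        first -= 1
    while last < len(areas) - 1 and areas[last + 1] & other_summaries:
        last += 1
    return first, last


def summaries_in(areas: list[set[str]], span: tuple[int, int]) -> set[str]:
    """Union of the summaries of every area inside an inclusive range."""
    first, last = span
    return set().union(*areas[first:last + 1])


# Usage: widen the reference selection first, then widen the target to match.
reference = [{"a"}, {"b"}, {"c", "d"}, {"e"}]
target = [{"x"}, {"b", "c"}, {"d"}, {"y"}]

ref_span = expand_to_neighbors(reference, (1, 1), summaries_in(target, (1, 1)))
tgt_span = expand_to_neighbors(target, (1, 1), summaries_in(reference, ref_span))
```

The same function serves both directions because the two adjustments are symmetric: each side absorbs a neighbor only when that neighbor shares a summary with the other side's current selection.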

Current Assignee: Hitachi Vantara, LLC (Hitachi, Ltd.)
Original Assignee: Sepaton Incorporated (Hitachi, Ltd.)
Inventors: Reiter, Timmie G.; McMaster, Carey Jay; Trimble, Ronald Ray; King, Stefan Merrill; Biernacki, David Michael; Kennedy, Jon Christopher
Primary Examiner(s): Corrielus, Jean M
Application Number: US12/877,735
Publication Number: US 20110184967A1
Time in Patent Office: 986 Days
Field of Search: 707/758, 707/692, 711/156, 719/230, 719/231, 715/760, 718/100
US Class Current: 707/692
CPC Class Codes: G06F 11/1453 using de-duplication of the...
