Systems and methods of locating redundant data using patterns of matching fingerprints

US 9,766,832 B2
Filed: 01/22/2014
Issued: 09/19/2017
Est. Priority Date: 03/15/2013
Status: Active Grant



Abstract

A system configured to compute match potential between first data and second data is provided. The system includes data storage storing the first data and the second data, and at least one processor coupled to the data storage. The at least one processor is configured to identify a first sequence of fingerprints characterizing a first plurality of sections of the first data, the first sequence being ordered according to an order of the first plurality of sections within the first data; identify a second sequence of fingerprints comprising fingerprints that match fingerprints within the first sequence, the second sequence of fingerprints characterizing a second plurality of sections of the second data, the second sequence being ordered according to an order of the second plurality of sections within the second data; quantify a similarity between the first sequence and the second sequence; and adjust the match potential based on the similarity.
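The first two steps of the abstract can be pictured with a minimal sketch. It assumes fixed-size sections and SHA-1 fingerprints, neither of which is specified in the text above; the names `fingerprint_sequence` and `matching_positions` are hypothetical and used for illustration only.

```python
import hashlib


def fingerprint_sequence(data: bytes, section_size: int = 4) -> list[str]:
    """Fingerprint each section of `data`, keeping the order of the sections.

    Fixed-size sections and SHA-1 are assumptions made for this sketch.
    """
    return [
        hashlib.sha1(data[i:i + section_size]).hexdigest()
        for i in range(0, len(data), section_size)
    ]


def matching_positions(first_seq: list[str], second_seq: list[str]) -> list[list[int]]:
    """For each section of the first data, list the ordinal positions in the
    second data whose fingerprints match that section's fingerprint."""
    index: dict[str, list[int]] = {}
    for pos, fp in enumerate(second_seq):
        index.setdefault(fp, []).append(pos)
    return [index.get(fp, []) for fp in first_seq]


first = fingerprint_sequence(b"AAAABBBBCCCCDDDD")
second = fingerprint_sequence(b"ZZZZAAAABBBBCCCC")
matches = matching_positions(first, second)
print(matches)  # [[1], [2], [3], []]
```

Here the first three sections of the first data match sections 1, 2 and 3 of the second data, and the last section has no match.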

108 Citations

13 Claims

1. A method of computing match potential between first data and second data, the method comprising:
- identifying, by one or more processors, a first sequence of fingerprints characterizing a first plurality of sections of the first data, the first sequence being ordered according to an order of the first plurality of sections within the first data;
  
  identifying a second sequence of fingerprints comprising fingerprints that match fingerprints within the first sequence, the second sequence of fingerprints characterizing a second plurality of sections of the second data, the second sequence being ordered according to an order of the second plurality of sections within the second data, wherein the first sequence includes a plurality of first subsequences, each first subsequence comprising one or more of the fingerprints of the first sequence, and the second sequence includes a plurality of second subsequences, each second subsequence comprising one or more of the fingerprints of the second sequence;
  
  quantifying a similarity between the first sequence and the second sequence by quantifying a similarity between at least one first subsequence of the plurality of first subsequences and at least one second subsequence of the plurality of second subsequences by computing at least one score for at least one first section of the first plurality of sections, the at least one first section being characterized by one or more fingerprints within the at least one first subsequence;
  
  adjusting the match potential between the first data and the second data at least partially based on the quantified similarity; and
  
  performing de-duplication of at least a portion of the first plurality of sections at least partially based on the match potential, wherein:
  
  the at least one first section is associated with a matching range that identifies at least one first ordinal position within the second data, the at least one first ordinal position being one or more ordinal positions of one or more of the second plurality of sections, the one or more of the second plurality of sections being characterized by one or more fingerprints that match one or more fingerprints characterizing the at least one first section; and
  
  computing the at least one score for the at least one first section includes:
  
  selecting a plurality of combinations of sections from a subset of the first data, the subset being characterized by fingerprints included in the at least one first subsequence, each of the plurality of combinations including the at least one first section and at least one second section of the subset, the at least one second section being associated with one or more matching sections within the second plurality of sections, the one or more matching sections being characterized by at least one fingerprint that matches at least one fingerprint characterizing the second section;
  
  computing one or more adjusted ordinal positions for the one or more matching sections based on an ordinal position of the first section, an ordinal position of the second section, and one or more ordinal positions of the one or more matching sections; and
  
  increasing the score for each adjusted ordinal position of the one or more adjusted ordinal positions disposed within the matching range of the at least one first section.
- - 2. The method according to claim 1, wherein performing the de-duplication of at least the portion of the first plurality of sections comprises:
    - performing at least one additional process to determine that at least the portion of the first plurality of sections and at least a portion of the second plurality of sections are indicated to include duplicate data; and
      
      storing, for at least the portion of the first plurality of sections, reference information that references at least the portion of the second plurality of sections.
  - 3. The method according to claim 1, wherein a magnitude of the at least one score is related to a number of matching fingerprints having a same order within the first sequence and the second sequence.
  - 4. The method according to claim 1, wherein quantifying the similarity includes quantifying a similarity between subsequences of the plurality of first subsequences that overlap other subsequences of the first plurality of subsequences and subsequences of the second subsequences that overlap other subsequences of the plurality of second subsequences.
  - 5. The method according to claim 1, further comprising terminating quantification of the similarity when the at least one first ordinal position is less than at least one second ordinal position identified in a matching range associated with a section of the first data having an ordinal position that is greater than the at least one first section.
  - 6. The method according to claim 1, wherein the at least one first ordinal position includes a plurality of ordinal positions, and increasing the score includes increasing the score for each adjusted ordinal position within a matching range that identifies the plurality of ordinal positions.
  - 7. The method according to claim 6, wherein the plurality of ordinal positions includes two ordinal positions, one of the two ordinal positions being a minimum ordinal position of the at least one first section, and another of the two positions being a maximum ordinal position of the at least one first section, and increasing the score includes increasing the score for each adjusted ordinal position within a matching range that identifies the minimum ordinal position and the maximum ordinal position.
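The scoring steps of claims 1, 6 and 7 can be read as follows; this is one possible interpretation, not the patented implementation. The sketch assumes that the adjusted ordinal position is the matching section's position shifted by the distance between the two sections in the first data, that the matching range runs from the minimum to the maximum matching position, and that match potential is simply the sum of the scores. The names `score_section`, `match_potential` and `window_size` are hypothetical.

```python
def score_section(p1: int, matches: list[list[int]], window: range) -> int:
    """Score the first-data section at ordinal position p1.

    matches[p] lists the ordinal positions in the second data whose
    fingerprints match section p of the first data.
    """
    own = matches[p1]
    if not own:
        return 0
    lo, hi = min(own), max(own)  # matching range (min/max, as in claim 7)
    score = 0
    for p2 in window:  # each combination (first section, second section)
        if p2 == p1:
            continue
        for q in matches[p2]:
            # Assumed adjustment: where section p1 would sit in the second
            # data if p1 and p2 kept the same relative order there.
            adjusted = q - (p2 - p1)
            if lo <= adjusted <= hi:
                score += 1
    return score


def match_potential(matches: list[list[int]], window_size: int = 4) -> int:
    """Sum section scores over sliding windows (an assumed aggregation)."""
    total = 0
    for p1 in range(len(matches)):
        start = max(0, p1 - window_size + 1)
        stop = min(len(matches), p1 + window_size)
        total += score_section(p1, matches, range(start, stop))
    return total


in_order = [[1], [2], [3], []]   # matches appear in the same order
shuffled = [[5], [0], [9], []]   # matches appear scattered
print(match_potential(in_order))  # 6
print(match_potential(shuffled))  # 0
```

Under these assumptions, runs of fingerprints that match in the same order reinforce one another, while isolated or out-of-order matches contribute nothing, which is consistent with the relationship described in claim 3.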

8. A system configured to compute match potential between first data and second data, the system comprising:
- data storage storing the first data and the second data; and
  
  at least one processor coupled to the data storage and configured to:
  
  identify a first sequence of fingerprints characterizing a first plurality of sections of the first data, the first sequence being ordered according to an order of the first plurality of sections within the first data;
  
  identify a second sequence of fingerprints comprising fingerprints that match fingerprints within the first sequence, the second sequence of fingerprints characterizing a second plurality of sections of the second data, the second sequence being ordered according to an order of the second plurality of sections within the second data, wherein the first sequence includes a plurality of first subsequences, each first subsequence comprising one or more of the fingerprints of the first sequence, and the second sequence includes a plurality of second subsequences, each second subsequence comprising one or more of the fingerprints of the second sequence;
  
  quantify a similarity between the first sequence and the second sequence by quantifying a similarity between at least one first subsequence of the plurality of first subsequences and at least one second subsequence of the plurality of second subsequences by computing at least one score for at least one first section of the first plurality of sections, the at least one first section being characterized by one or more fingerprints within the at least one first subsequence;
  
  adjust the match potential between the first data and the second data at least partially based on the quantified similarity; and
  
  perform de-duplication of at least a portion of the first plurality of sections at least partially based on the match potential, wherein:
  
  the at least one first section is associated with a matching range that identifies at least one first ordinal position within the second data, the at least one first ordinal position being one or more ordinal positions of one or more of the second plurality of sections, the one or more of the second plurality of sections being characterized by one or more fingerprints that match one or more fingerprints characterizing the at least one first section; and
  
  the at least one processor is configured to compute the at least one score by, at least in part:
  
  selecting a plurality of combinations of sections from a subset of the first data, the subset being characterized by fingerprints included in the at least one first subsequence, each of the plurality of combinations including the at least one first section and at least one second section of the subset, the at least one second section being associated with one or more matching sections within the second plurality of sections, the one or more matching sections being characterized by at least one fingerprint that matches at least one fingerprint characterizing the second section;
  
  computing one or more adjusted ordinal positions for the one or more matching sections based on an ordinal position of the first section, an ordinal position of the second section, and one or more ordinal positions of the one or more matching sections; and
  
  increasing the score for each adjusted ordinal position of the one or more adjusted ordinal positions disposed within the matching range of the at least one first section.
- - 9. The system according to claim 8, wherein the at least one processor is configured to perform the de-duplication of at least the portion of the first plurality of sections by:
    - performing at least one additional process to determine that at least the portion of the first plurality of sections and at least a portion of the second plurality of sections are indicated to include duplicate data; and
      
      storing, for at least the portion of the first plurality of sections, reference information that references at least the portion of the second plurality of sections.
  - 10. The system according to claim 8, wherein the at least one processor is configured to quantify the similarity by quantifying a similarity between subsequences of the first plurality of subsequences that overlap other subsequences of the first plurality of subsequences and subsequences of the second subsequences that overlap other subsequences of the second plurality of subsequences.
  - 11. The system according to claim 8, wherein the at least one processor is further configured to terminate quantification of the similarity when the at least one first ordinal position is less than at least one second ordinal position identified in a matching range associated with a section of the first data having an ordinal position that is greater than the at least one first section.
  - 12. The system according to claim 8, wherein the at least one first ordinal position includes a plurality of ordinal positions, and the at least one processor is configured to increase the score by, at least in part, increasing the score for each adjusted ordinal position within a matching range that identifies the plurality of ordinal positions.
  - 13. The system according to claim 12, wherein the plurality of ordinal positions includes two ordinal positions, one of the two ordinal positions being a minimum ordinal position of the at least one first section, and another of the two positions being a maximum ordinal position of the at least one first section.
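The de-duplication step of claims 2 and 9 can be sketched as below. The sketch assumes that the "additional process" is a byte-for-byte comparison and that the "reference information" is the ordinal position of the matching section in the second data; both are illustrative choices, and `deduplicate` and `candidates` are hypothetical names.

```python
def deduplicate(
    first_sections: list[bytes],
    second_sections: list[bytes],
    candidates: dict[int, int],
) -> list[tuple[str, object]]:
    """Store a reference for verified duplicates and raw data otherwise.

    candidates maps an ordinal position in the first data to the ordinal
    position in the second data that earlier scoring flagged as a likely match.
    """
    stored: list[tuple[str, object]] = []
    for pos, section in enumerate(first_sections):
        cand = candidates.get(pos)
        # Additional process (assumed): confirm the duplicate byte for byte.
        if cand is not None and second_sections[cand] == section:
            stored.append(("ref", cand))      # reference information
        else:
            stored.append(("data", section))  # keep the section itself
    return stored


first_sections = [b"AAAA", b"BBBB", b"DDDD"]
second_sections = [b"ZZZZ", b"AAAA", b"BBBB"]
result = deduplicate(first_sections, second_sections, {0: 1, 1: 2, 2: 0})
print(result)  # [('ref', 1), ('ref', 2), ('data', b'DDDD')]
```

The third candidate fails verification, so that section is stored as data rather than as a reference.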


Current Assignee
Hitachi Vantara, LLC (Hitachi, Ltd.)
Original Assignee
Hitachi Data Systems Corporation (Hitachi, Ltd.)
Inventors
Trimble, Ronald Ray, Kennedy, Jon Christopher, Reiter, Timmie G., Biernacki, David Michael, McMaster, Carey Jay, King, Stefan Merrill
Primary Examiner(s)
Choi, Carol
Assistant Examiner(s)
Choi, Yuk Ting

Application Number

US14/161,142
Publication Number

US 20140279956A1
Time in Patent Office

1,336 Days
CPC Class Codes

G06F 11/1453   using de-duplication of the...

G06F 11/1456   Hardware arrangements for b...

G06F 3/0608   Saving storage space on sto...

G06F 3/0619   in relation to data integri...

G06F 3/0641   De-duplication techniques

G06F 3/067   Distributed or networked st...

G06F 3/0671   In-line storage system
