System and method for multi-scale navigation of data

US 9,256,611 B2
Filed: 06/06/2013
Issued: 02/09/2016
Est. Priority Date: 06/06/2013
Status: Active Grant

Abstract

A system configured to generate a macro-fingerprint from at least one predefined set of summaries is provided. The system includes data storage storing a first predefined set of summaries associated with a first region of data, each member of the first predefined set of summaries characterizing data within the first region of data; and at least one processor coupled to the data storage and configured to: read the first predefined set of summaries; select at least one first member from the first predefined set of summaries based on a value of the at least one first member; and store the at least one first member within a first macro-fingerprint. The first region of data may have a first size indicative of a quantity of data included in the first region of data. The macro fingerprints are created from previously created smaller (micro) fingerprints without having to reread the data.
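
As a rough illustration of the abstract, the sketch below builds a macro-fingerprint from micro-fingerprints that already exist, without rereading the underlying data. The abstract says only that members are selected "based on a value"; picking the `k` numerically smallest distinct values is an assumption made for this example, and the names `build_macro_fingerprint`, `k`, and `region_micro` are hypothetical.

```python
from typing import Iterable, List


def build_macro_fingerprint(micro_fingerprints: Iterable[int], k: int = 4) -> List[int]:
    """Select members purely by their value and store them in a macro-fingerprint.

    Assumption: "selection based on value" is modelled as keeping the k
    smallest distinct values. The source does not name a specific rule.
    """
    return sorted(set(micro_fingerprints))[:k]


# Previously computed micro-fingerprints for one region (illustrative values).
region_micro = [0x9F3A, 0x0012, 0x77B0, 0x0450, 0xC001, 0x0012]

# The region's data is never touched here; only its summaries are read.
macro = build_macro_fingerprint(region_micro, k=3)
print([hex(v) for v in macro])  # ['0x12', '0x450', '0x77b0']
```

Because selection depends only on the values, two regions holding the same data tend to pick the same members, which is what makes the resulting macro-fingerprints comparable.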

20 Claims

1. A method of determining duplicate data for de-duplicating data in a computer system, the method comprising:
- reading a first predefined set of multiple summaries associated with a first region of data in a storage of the computer system, each member of the first predefined set of multiple summaries being a micro-fingerprint value characterizing a portion of data within the first region of data;
  
  selecting a first member from the first predefined set of multiple summaries based on a value of the micro-fingerprint value of the first member;
  
  generating, at least in part, a first macro-fingerprint associated with the first region of data by storing the first member within the first macro-fingerprint;
  
  reading a second predefined set of multiple summaries associated with a set of data, each member of the second predefined set of multiple summaries being a micro-fingerprint value characterizing a portion of data within the set of data;
  
  selecting a particular member from the second predefined set of multiple summaries based on a value of the micro-fingerprint value of the particular member;
  
  generating, at least in part, a second macro-fingerprint associated with the set of data by storing the second member within the second macro-fingerprint; and
  
  comparing the first macro-fingerprint associated with the first region with the second macro-fingerprint associated with the set of data to determine, at least in part, the duplicate data.
  - 2. The method according to claim 1, wherein selecting the first member includes selecting the first member based on a prioritization scheme.
  - 3. The method according to claim 2, wherein the first region of data has a first size indicative of a quantity of data included in the first region of data and the method further comprises:
    - reading a third predefined set of multiple summaries associated with a second region of data, each member of the third predefined set of multiple summaries characterizing a portion of data within the second region of data, the second region of data having a second size indicative of a quantity of data included in the second region of data, the second size being equal to the first size;
      
      selecting a second member from the second predefined set of multiple summaries based on a value of a micro-fingerprint value of the second member; and
      
      storing the second member within the first macro-fingerprint.
  - 4. The method according to claim 3, wherein the set of data has a third size that is indicative of a quantity of data included in the set of data, the third size being equal to the sum of the first size and the second size, the method further comprising:
    - executing, responsive to a threshold number of members of the first macro-fingerprint matching members of the second macro-fingerprint, a navigation process that compares the second predefined set of multiple summaries to a union of the first predefined set of multiple summaries and the third predefined set of multiple summaries.
  - 5. The method according to claim 4, wherein the first predefined set of multiple summaries has a first size and a first scope, the third predefined set of multiple summaries has a second size different from the first size and a second scope different from the first scope, and executing the navigation process includes generating a simulated set of multiple summaries based on at least one of:
    - the first predefined set of multiple summaries, or the third predefined set of multiple summaries.
  - 6. The method according to claim 4, further comprising selecting the second predefined set of multiple summaries from a third macro-fingerprint selected from other predefined sets of summaries.
  - 7. The method according to claim 1, wherein reading the first predefined set of multiple summaries includes reading a set of first micro-fingerprint values, wherein the first micro-fingerprint values are hash values characterizing respective portions of the data within the first region of data.
  - 8. The method according to claim 7, further comprising de-duplicating at least one target area within the first region of data with reference to at least one reference area within the set of data.
  - 9. The method according to claim 8, further comprising:
    - removing at least one summary of the first predefined set of multiple summaries in response to de-duplicating the at least one target area; and
      
      removing at least one summary from the first macro-fingerprint in response to de-duplicating the at least one target area.
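
The flow of claims 1, 3 and 4 can be sketched as follows: two equal-size regions each contribute members to a first macro-fingerprint, which is compared with a second macro-fingerprint for a set of data covering twice that size; a threshold number of matches triggers a finer comparison against the union of the two regions' summaries. This is a minimal sketch, not the patented implementation. The "smallest values first" selection, the threshold of 3, and all function and variable names are assumptions.

```python
from typing import List, Set


def select_by_value(micros: Set[int], k: int) -> List[int]:
    """Assumed prioritization scheme: keep the k smallest micro-fingerprint values."""
    return sorted(micros)[:k]


def macro_match_count(macro_a: List[int], macro_b: List[int]) -> int:
    """Number of members the two macro-fingerprints have in common."""
    return len(set(macro_a) & set(macro_b))


def navigate(first_set: Set[int], third_set: Set[int], second_set: Set[int]) -> Set[int]:
    """Compare the second set of summaries to the union of the first and third sets."""
    return (first_set | third_set) & second_set


# Micro-fingerprints of two equal-size regions and of a larger set of data.
first_set = {3, 18, 40, 77}
third_set = {5, 21, 64, 90}
second_set = {3, 5, 18, 21, 40, 99, 120, 130}

# The first macro-fingerprint stores members selected from both regions.
first_macro = select_by_value(first_set, 2) + select_by_value(third_set, 2)
second_macro = select_by_value(second_set, 4)

THRESHOLD = 3  # hypothetical threshold number of matching members
matches = macro_match_count(first_macro, second_macro)
shared = navigate(first_set, third_set, second_set) if matches >= THRESHOLD else set()
print(matches, sorted(shared))  # 4 [3, 5, 18, 21, 40]
```

The coarse comparison touches only a handful of values per region, so the more expensive summary-level navigation runs only where the macro-fingerprints already suggest likely duplicates.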

10. A system configured to determine duplicate data for de-duplicating data in a computer system, the system comprising:
- data storage storing a first predefined set of multiple summaries associated with a first region of data, each member of the first predefined set of multiple summaries being a micro-fingerprint value characterizing a portion of data within the first region of data; and
  
  at least one processor coupled to the data storage and programmed to;
  
  read the first predefined set of multiple summaries;
  
  select a first member from the first predefined set of multiple summaries based on a value of the micro-fingerprint value of the first member;
  
  generate, at least in part, a first macro-fingerprint associated with the first region of data by storing the first member within the first macro-fingerprint;
  
  read a second predefined set of multiple summaries associated with a set of data, each member of the second predefined set of multiple summaries being a micro-fingerprint value characterizing a portion of data within the set of data;
  
  select a particular member from the second predefined set of multiple summaries based on a value of the micro-fingerprint value of the particular member;
  
  generate, at least in part, a second macro-fingerprint associated with the set of data by storing the second member within the second macro-fingerprint; and
  
  compare the first macro-fingerprint associated with the first region with the second macro-fingerprint associated with the set of data to determine, at least in part, the duplicate data.
  - 11. The system according to claim 10, wherein the at least one processor is further programmed to select the first member based on a prioritization scheme.
  - 12. The system according to claim 11, wherein the first region of data has a first size indicative of a quantity of data included in the first region of data and the at least one processor is further programmed to:
    - read a third predefined set of multiple summaries associated with a second region of data, each member of the second predefined set of multiple summaries characterizing data within the second region of data, the second region of data having a second size indicative of a quantity of data included in the second region of data, the second size being equal to the first size;
      
      select a second member from the second predefined set of multiple summaries based on a value of a micro-fingerprint value of the second member; and
      
      store the second member within the first macro-fingerprint.
  - 13. The system according to claim 12, wherein the set of data has a third size that is indicative of a quantity of data included in the set of data, the third size being equal to the sum of the first size and the second size;
    - and the at least one processor is further programmed to execute, responsive to a threshold number of members of the first macro-fingerprint matching members of the second macro-fingerprint, a navigation process that compares the second predefined set of multiple summaries to a union of the first predefined set of multiple summaries and the third predefined set of multiple summaries.
  - 14. The system according to claim 13, wherein the first predefined set of multiple summaries has a first size and a first scope, the third predefined set of multiple summaries has a second size different from the first size and a second scope different from the first scope, and the at least one processor is programmed to execute the navigation process by, at least in part, generating a simulated set of summaries based on at least one of:
    - the first predefined set of multiple summaries, or the third predefined set of multiple summaries.
  - 15. The system according to claim 13, wherein the at least one processor is further programmed to select the second predefined set of multiple summaries from a third macro-fingerprint selected from other predefined sets of summaries.
  - 16. The system according to claim 10, wherein the at least one processor is further programmed to read the first predefined set of multiple summaries by reading a set of first micro-fingerprint values, wherein the first micro-fingerprint values are hash values characterizing respective portions of the data within the first region of data.
  - 17. The system according to claim 16, wherein the at least one processor is further programmed to de-duplicate at least one target area within the first region of data with reference to at least one reference area within the set of data.
  - 18. The system according to claim 17, wherein the at least one processor is further programmed to:
    - remove at least one summary of the first predefined set of multiple summaries in response to de-duplicating the at least one target area; and
      
      remove at least one summary from the first macro-fingerprint in response to de-duplicating the at least one target area.
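
Claims 16 through 18 (and their method counterparts 7 through 9) describe micro-fingerprints as hash values over portions of a region, and the removal of summaries once a target area has been de-duplicated. The sketch below illustrates that bookkeeping under several assumptions that the claims do not specify: fixed 4-byte portions, SHA-1 truncated to 64 bits as the hash, and a dictionary keyed by byte offset. All names are hypothetical.

```python
import hashlib
from typing import Dict, List

CHUNK = 4  # hypothetical portion size in bytes


def micro_fingerprints(data: bytes, chunk: int = CHUNK) -> Dict[int, int]:
    """Map each portion's offset to a hash-based micro-fingerprint value."""
    out: Dict[int, int] = {}
    for off in range(0, len(data), chunk):
        digest = hashlib.sha1(data[off:off + chunk]).digest()
        out[off] = int.from_bytes(digest[:8], "big")  # keep the first 64 bits
    return out


def drop_deduplicated(micros: Dict[int, int], macro: List[int], offsets: List[int]) -> None:
    """Remove summaries for de-duplicated portions from both levels."""
    for off in offsets:
        value = micros.pop(off, None)
        # Drop the value from the macro-fingerprint only if no remaining
        # portion of the region still carries it.
        if value is not None and value in macro and value not in micros.values():
            macro.remove(value)


region = b"AAAABBBBCCCCAAAA"
micros = micro_fingerprints(region)
macro = sorted(set(micros.values()))[:2]  # assumed selection: two smallest values

bbbb_value = micros[4]
drop_deduplicated(micros, macro, offsets=[4])  # portion at offset 4 was de-duplicated
print(sorted(micros))  # [0, 8, 12]
```

Keeping the macro-fingerprint in step with the micro-fingerprint set means later comparisons do not match against data that has already been replaced by a reference.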

19. A non-transitory computer readable medium storing computer readable instructions that, when executed by at least one processor, program the at least one processor to perform operations for determining duplicate data for de-duplicating data in a computer system, the operations comprising:
- reading a first predefined set of multiple summaries associated with a first region of data in a storage of the computer system, each member of the first predefined set of multiple summaries being a micro-fingerprint value characterizing a portion of data within the first region of data;
  
  selecting a first member from the first predefined set of multiple summaries based on a value of the micro-fingerprint value of the first member;
  
  generating, at least in part, a first macro-fingerprint associated with the first region of data by storing the first member within the first macro-fingerprint;
  
  reading a second predefined set of multiple summaries associated with a set of data, each member of the second predefined set of multiple summaries being a micro-fingerprint value characterizing a portion of data within the set of data;
  
  selecting a particular member from the second predefined set of multiple summaries based on a value of the micro-fingerprint value of the particular member;
  
  generating, at least in part, a second macro-fingerprint associated with the set of data by storing the second member within the second macro-fingerprint; and
  
  comparing the first macro-fingerprint associated with the first region with the second macro-fingerprint associated with the set of data to determine, at least in part, the duplicate data.
  - 20. The computer readable medium according to claim 19, wherein the instructions further program the at least one processor to select the first member based on a prioritization scheme.

Current Assignee: Hitachi Vantara, LLC (Hitachi, Ltd.)
Original Assignee: Sepaton Incorporated (Hitachi, Ltd.)
Inventors: Trimble, Ronald Ray; Kennedy, Jon Christopher
Primary Examiner(s): Uddin, Mohammed R
Application Number: US13/911,482
Publication Number: US 20140365450A1
Time in Patent Office: 978 Days
Field of Search: 707/664, 707/699, 707/692, 707/697, 707/698, 707/999.204
US Class Current: 1/1
CPC Class Codes: G06F 16/1748 De-duplication implemented ...
