System and method for creating a de-duplicated data set

US 8,738,668 B2
Filed: 12/16/2010
Issued: 05/27/2014
Est. Priority Date: 12/16/2009
Status: Active Grant

Abstract

The present invention is directed to a system and method for creating a non-redundant data set from a plurality of data sources. Generally, the system and method operate by creating unique hash keys corresponding to unique data files; compiling the hash keys along with seeking information for the corresponding data files; de-duplicating the hash keys; and retrieving/storing the data files corresponding to the de-duplicated hash keys. Thus, in accordance with the system and method of the present invention, a non-redundant data set can be created from a plurality of data sources. The system of the present invention can operate independently or in conjunction with any de-duplicating methods and systems. For example, a de-duplicating method and system can be used to read and obtain data from a variety of media, regardless of the application used to generate the backup media. The component parts of a file may be read from a medium, including content and metadata pertaining to a file. These pieces of content and metadata may then be stored and associated. To avoid duplication of data, pieces of content and metadata may be compared to previously stored content and metadata. Furthermore, using these same methods and systems the content and metadata of a file may be associated with a location where the file resided. A database which stores these components and allows linking between the various stored components may be particularly useful in implementing embodiments of these methods and systems.
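The flow the abstract describes (hash the files, record where each one lives, de-duplicate the hash keys, then retrieve one file per surviving key) can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the use of SHA-256, the use of a file path as the "seek information", and the function names `build_index`, `build_master_table`, and `produce` are all assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def build_index(container: Path) -> list[tuple[str, str]]:
    """Traverse one container and return (hash key, seek info) pairs.

    Assumption: the seek info is simply the file's path as a string.
    """
    index = []
    for path in sorted(container.rglob("*")):
        if path.is_file():
            key = hashlib.sha256(path.read_bytes()).hexdigest()
            index.append((key, str(path)))
    return index


def build_master_table(indices) -> dict[str, str]:
    """Merge all indices, keeping the first seek info seen per unique hash key."""
    master = {}
    for index in indices:
        for key, seek in index:
            master.setdefault(key, seek)
    return master


def produce(master: dict[str, str], destination: Path) -> list[Path]:
    """Copy one file per unique hash key into the destination directory."""
    destination.mkdir(parents=True, exist_ok=True)
    produced = []
    for key, seek in master.items():
        # Hypothetical naming scheme: hash key plus the original extension.
        target = destination / f"{key}{Path(seek).suffix}"
        shutil.copyfile(seek, target)
        produced.append(target)
    return produced
```

Because the master table is keyed by content hash, two files with identical bytes in different containers collapse to a single entry, and only one copy reaches the destination.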

6 Claims

1. A method utilizing one or more computer systems for creating a data set without duplication, from data taken from one or more database sources, comprising the steps of:
- in a first phase, using the one or more computer systems to traverse files contained in one or more custodian containers of the database sources and creating indices of the custodian containers, the indices comprising (i) hash keys representing the data files and (ii) seek information for locating and handling the data files;
  
  in a second phase, creating at the database sources, a master key table of unique hash keys and seek information from all the data indices created; and
  
  in a third phase, using the one or more computer systems to query the master key table of unique hash keys and using the seek information to produce the data files associated with the hash keys to a storage system,
  
  wherein there are at least two custodian containers, and a first phase on a second container is configured to perform substantially in parallel with a second phase on a first container upon completion of a first phase on the first container.
- Dependent Claims (3, 4)
  - 3. A method as defined in claim 1, wherein the method is performed globally on more than one custodian container.
  - 4. A method as defined in claim 1, wherein the method is performed on each custodian container and a data set is created for each custodian container.
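The overlap recited at the end of claim 1 (indexing the second container while the first container's index is merged into the master key table) can be illustrated with a small pipeline. This is a hedged sketch under stated assumptions: containers are modeled as in-memory dictionaries of filename to bytes, SHA-256 stands in for the unspecified hash, and the names `index_container`, `merge_into_master`, and `pipelined_dedup` are hypothetical. The thread pool is used only to show the ordering of the phases, not to claim any particular speed-up.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def index_container(name, files):
    """First phase: hash each file in one container and record where it lives."""
    return [
        (hashlib.sha256(data).hexdigest(), (name, filename))
        for filename, data in files.items()
    ]


def merge_into_master(master, index):
    """Second phase: add only hash keys the master key table has not seen."""
    for key, seek in index:
        master.setdefault(key, seek)


def pipelined_dedup(containers):
    """Overlap the first phase on container N+1 with the second phase on N."""
    master = {}
    # A single worker means the next container's indexing starts as soon as
    # the previous container's indexing completes.
    with ThreadPoolExecutor(max_workers=1) as indexer:
        pending = None
        for name, files in containers.items():
            next_job = indexer.submit(index_container, name, files)
            if pending is not None:
                # Merge the previous index here while the worker thread
                # indexes the current container.
                merge_into_master(master, pending.result())
            pending = next_job
        if pending is not None:
            merge_into_master(master, pending.result())
    return master
```

Merging happens in container order, so when the same content appears in several containers the master table keeps the seek information from the earliest one.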

2. A system for creating a data set without duplication, from data taken from one or more database sources associated with the system, comprising:
- (a) one or more computer systems configured to traverse one or more files contained in one or more custodian containers of the database sources and configured to create indices of the custodian containers, the indices comprising (i) hash keys representing the data files and (ii) seek information for locating and handling the data files;
  
  (b) a module at the database sources configured to create a master key table of unique hash keys and seek information from all the data indices created; and
  
  (c) a query capability associated with the one or more computer systems configured to query the master key table of unique hash keys and configured to use the seek information to produce the data files associated with the hash keys;
  
  a storage system for accepting the data files that are produced,
  
  wherein there are at least two custodian containers, and the system is configured to commence a first operation described in (a) on a second container substantially in parallel with a second operation described in (b) on a first container upon completion of a first operation on the first container.
- Dependent Claims (5, 6)
  - 5. A system as defined in claim 2, wherein the system performs operations globally on more than one custodian container.
  - 6. A system as defined in claim 2, wherein the system performs operations on each custodian container and creates a data set for each custodian container.
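The dependent claims distinguish two scopes: de-duplicating globally across custodian containers (claims 3 and 5) and de-duplicating within each container so that every custodian gets its own data set (claims 4 and 6). The difference comes down to what the uniqueness check is keyed on, as the following sketch suggests. The record layout `(custodian, filename, data)`, the SHA-256 hash, and both function names are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def hash_key(data: bytes) -> str:
    """Content hash used as the de-duplication key (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def dedup_global(records):
    """One data set: a file is kept once across all custodian containers."""
    seen, kept = set(), []
    for custodian, filename, data in records:
        key = hash_key(data)
        if key not in seen:
            seen.add(key)
            kept.append((custodian, filename))
    return kept


def dedup_per_custodian(records):
    """One data set per custodian: duplicates are removed only within a container."""
    seen = defaultdict(set)
    kept = defaultdict(list)
    for custodian, filename, data in records:
        key = hash_key(data)
        if key not in seen[custodian]:
            seen[custodian].add(key)
            kept[custodian].append(filename)
    return dict(kept)
```

With the same input, the global variant keeps a shared document once in total, while the per-custodian variant keeps one copy for each custodian who held it.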

Current Assignee: Kldiscovery Ontrack, LLC (KLDiscovery, Inc.)
Original Assignee: Renew Data Corporation (LDiscovery LLC)
Inventors: Pendlebury, Kenneth C.; Pratt, Christopher K.; Jones, Terence C.; Omberg, Erik J.; Marsh, John A.; Reese, Christopher D.
Primary Examiner(s): AL HASHEMI, SANA A
Application Number: US12/970,881
Publication Number: US 20110178996A1
Time in Patent Office: 1,258 Days
Field of Search: 707/608, 707/687, 707/705, 707/790, 707/813, 707/821
US Class Current: 707/821
CPC Class Codes:
G06F 11/1451   by selection of backup cont...
G06F 11/1453   using de-duplication of the...
G06F 11/1458   Management of the backup or...
G06F 16/137   Hash-based content-based in...
G06F 16/1748   De-duplication implemented ...
G06F 2201/80   Database-specific techniques
