Efficient and phased method of processing large collections of electronic data known as “best match first”™ for electronic discovery and other related applications
First Claim
1. A method comprising:
providing electronic data from a medium or a network;
enabling the electronic data to be accessible to a processing engine;
de-duplicating the electronic data;
extracting the electronic data;
loading the electronic data into a Lightweight File System, wherein the Lightweight File System performs rapid scanning of a plurality of tokens from the content and properties of the electronic data, and is aided by a query engine;
determining, prior to indexing the electronic data, where to prioritize the electronic data in a processing queue by evaluating how well the electronic data matches a user specified criterion relative to other electronic data that have been processed based on inputs received from users relating to their review of the already processed other electronic data;
indexing the electronic data;
subjecting the electronic data to topical categorization to determine which topics are clearly represented in the electronic data;
building discussions based upon the electronic data; and
building a model of all the processed electronic data to measure and assess the source of high value information or gaps in the data with respect to different variables such as time and actor identity.
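The prioritization step recited above — ranking data in the processing queue, before indexing, by how well it matches a user specified criterion — can be illustrated with a small sketch. This is not the patented implementation; the scoring function, the use of a heap, and all names (`QueuedDocument`, `build_queue`) are assumptions made purely for illustration:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class QueuedDocument:
    # heapq is a min-heap, so the score is stored negated
    # to pop the best-matching documents first.
    priority: float
    doc_id: str = field(compare=False)

def score(tokens: set[str], criterion: set[str]) -> float:
    """Fraction of the criterion's terms present among the document's tokens."""
    if not criterion:
        return 0.0
    return len(tokens & criterion) / len(criterion)

def build_queue(documents: dict[str, set[str]], criterion: set[str]) -> list[QueuedDocument]:
    """Place every document in a priority queue ordered best-match-first."""
    heap: list[QueuedDocument] = []
    for doc_id, tokens in documents.items():
        heapq.heappush(heap, QueuedDocument(-score(tokens, criterion), doc_id))
    return heap

# Toy collection: token sets as a Lightweight File System scan might produce them.
docs = {
    "memo-1": {"merger", "pricing", "q3"},
    "invite-2": {"lunch", "friday"},
    "report-3": {"merger", "pricing", "lunch"},
}
criterion = {"merger", "pricing", "q3"}
queue = build_queue(docs, criterion)
order = [heapq.heappop(queue).doc_id for _ in range(len(docs))]
```

Under this sketch, the document matching all three criterion terms is processed first and the document matching none is processed last.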
Abstract
A method of more efficient, phased, iterative processing of very large collections of electronic data for the purposes of electronic discovery and related applications is disclosed. The processing minimally includes text extraction and the creation of a keyword search index, but may include many additional stages of processing as well. The method further includes the definition of an initial set of characteristics that correspond to “interesting” data, followed by the iterative completion of processing of this data based on a combination of user feedback on the overall relevance of the documents being processed and the system's assessment of whether or not the data it has recently selected to promote in the processing completion queue has the desired quality and quantity of relevant data. The process continues until all identified data has either been fully processed, or discarded at some intermediate stage of processing as likely irrelevant. This effectively finishes the processing much earlier, as the later documents in the processing queue will be increasingly irrelevant.
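The iterative loop the abstract describes — fully process the best-matching batch, gather reviewer feedback, and stop once recent batches fall below the desired quantity of relevant material — might be sketched as follows. The function name, the batch size, and the relevance floor are illustrative assumptions, not parameters from the disclosure:

```python
def phased_process(scored_docs, reviewer_feedback, relevance_floor=0.5, batch_size=2):
    """Fully process documents best-match-first; once a reviewed batch falls
    below the desired fraction of relevant documents, discard the rest of the
    queue as likely irrelevant (the early-finish behavior in the abstract)."""
    queue = sorted(scored_docs, key=lambda item: item[1], reverse=True)
    fully_processed, discarded = [], []
    while queue:
        batch, queue = queue[:batch_size], queue[batch_size:]
        fully_processed.extend(doc_id for doc_id, _ in batch)
        relevant = sum(reviewer_feedback.get(doc_id, False) for doc_id, _ in batch)
        if relevant / len(batch) < relevance_floor:
            # Later documents match the criterion even less well, so stop early.
            discarded.extend(doc_id for doc_id, _ in queue)
            queue = []
    return fully_processed, discarded

# Toy run: documents "a"-"f" scored against the criterion; reviewers judged
# "a" and "b" relevant, "c" and "d" not.
scores = [("a", 0.9), ("b", 0.8), ("c", 0.4), ("d", 0.3), ("e", 0.2), ("f", 0.1)]
feedback = {"a": True, "b": True, "c": False, "d": False}
done, dropped = phased_process(scores, feedback)
```

Once the second batch fails review, the tail of the queue is discarded without full processing, which is the early-completion effect the abstract claims.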
8 Claims
1. (Recited above as the First Claim; claims 2 and 3 depend from claim 1.)
4. A method comprising:
inputting a user specified criterion which correlates to high relevance;
preparing a data set for processing;
de-duplicating the data set;
extracting the data set;
determining, prior to indexing the data set, where to prioritize the data set in a processing queue by evaluating how well the data set matches the user specified criterion relative to other data sets that have been processed, based on inputs received from users relating to their review of the already processed other data sets;
having a user perform a complete or sample assessment of initially processed data sets, which assessment outranks the initial user specified criterion in the event of conflict;
permitting a user to override prioritization of specific data sets or the schemes for determining prioritization;
indexing the data set; and
iteratively repeating the process, based on user feedback and the desired quality and quantity of relevant data, until all data sets have been exhausted.
(Claims 5, 6, 7, and 8 depend from claim 4.)
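Claim 4's override step — a user assessment that outranks the automatically computed criterion in the event of conflict — could look like the following sketch. The function and parameter names are assumptions for illustration only, not terms from the claim:

```python
def prioritize(data_set_ids, criterion_scores, user_overrides):
    """Order data sets best-match-first, letting an explicit user-assigned
    priority outrank the automatically computed criterion score."""
    def priority(ds_id):
        # A user's complete or sample assessment wins any conflict
        # with the initial user specified criterion.
        return user_overrides.get(ds_id, criterion_scores[ds_id])
    return sorted(data_set_ids, key=priority, reverse=True)

order = prioritize(
    ["set-x", "set-y", "set-z"],
    {"set-x": 0.2, "set-y": 0.9, "set-z": 0.5},
    {"set-x": 1.0},  # a reviewer promoted set-x despite its low criterion score
)
```

Here the reviewer's override moves the lowest-scoring data set to the front of the queue, while the remaining sets keep their criterion-driven order.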
Specification