Efficient and phased method of processing large collections of electronic data known as “best match first”™ for electronic discovery and other related applications
First Claim
1. A method comprising:
providing electronic data from a medium or a network;
enabling the electronic data to be accessible to a processing engine;
de-duplicating the electronic data;
extracting the electronic data;
loading the electronic data into a Lightweight File System, wherein the Lightweight File System performs rapid scanning of a plurality of tokens from the content and properties of the electronic data, and is aided by a query engine;
determining, prior to indexing the electronic data, where to prioritize the electronic data in a processing queue by evaluating how well the electronic data matches a user specified criterion relative to other electronic data that have been processed based on inputs received from users relating to their review of the already processed other electronic data;
indexing the electronic data;
subjecting the electronic data to topical categorization to determine which topics are clearly represented in the electronic data;
building discussions based upon the electronic data; and
building a model of all the processed electronic data to measure and assess the source of high value information or gaps in the data with respect to different variables such as time and actor identity.
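The prioritization step recited above — ranking data in the processing queue, before indexing, by how well it matches a user specified criterion — can be illustrated with a small sketch. This is not the patented implementation; the scoring function, the use of a heap, and all names (`QueuedDocument`, `build_queue`) are assumptions made purely for illustration:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class QueuedDocument:
    # heapq is a min-heap, so the score is stored negated
    # to pop the best-matching documents first.
    priority: float
    doc_id: str = field(compare=False)

def score(tokens: set[str], criterion: set[str]) -> float:
    """Fraction of the criterion's terms present among the document's tokens."""
    if not criterion:
        return 0.0
    return len(tokens & criterion) / len(criterion)

def build_queue(documents: dict[str, set[str]], criterion: set[str]) -> list[QueuedDocument]:
    """Place every document in a priority queue ordered best-match-first."""
    heap: list[QueuedDocument] = []
    for doc_id, tokens in documents.items():
        heapq.heappush(heap, QueuedDocument(-score(tokens, criterion), doc_id))
    return heap

# Toy collection: token sets as a Lightweight File System scan might produce them.
docs = {
    "memo-1": {"merger", "pricing", "q3"},
    "invite-2": {"lunch", "friday"},
    "report-3": {"merger", "pricing", "lunch"},
}
criterion = {"merger", "pricing", "q3"}
queue = build_queue(docs, criterion)
order = [heapq.heappop(queue).doc_id for _ in range(len(docs))]
```

Under this sketch, the document matching all three criterion terms is processed first and the document matching none is processed last.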
Abstract
A method of more efficient, phased, iterative processing of very large collections of electronic data for the purposes of electronic discovery and related applications is disclosed. The processing minimally includes text extraction and the creation of a keyword search index, but may include many additional stages of processing as well. The method further includes the definition of an initial set of characteristics that correspond to “interesting” data, followed by the iterative completion of processing of this data based on a combination of user feedback on the overall relevance of the documents being processed and the system's assessment of whether or not the data it has recently selected to promote in the processing completion queue has the desired quality and quantity of relevant data. The process continues until all identified data has either been fully processed, or discarded at some intermediate stage of processing as likely irrelevant. This effectively finishes the processing much earlier, as the later documents in the processing queue will be increasingly irrelevant.
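The iterative loop the abstract describes — fully process the best-matching batch, gather reviewer feedback, and stop once recent batches fall below the desired quantity of relevant material — might be sketched as follows. The function name, the batch size, and the relevance floor are illustrative assumptions, not parameters from the disclosure:

```python
def phased_process(scored_docs, reviewer_feedback, relevance_floor=0.5, batch_size=2):
    """Fully process documents best-match-first; once a reviewed batch falls
    below the desired fraction of relevant documents, discard the rest of the
    queue as likely irrelevant (the early-finish behavior in the abstract)."""
    queue = sorted(scored_docs, key=lambda item: item[1], reverse=True)
    fully_processed, discarded = [], []
    while queue:
        batch, queue = queue[:batch_size], queue[batch_size:]
        fully_processed.extend(doc_id for doc_id, _ in batch)
        relevant = sum(reviewer_feedback.get(doc_id, False) for doc_id, _ in batch)
        if relevant / len(batch) < relevance_floor:
            # Later documents match the criterion even less well, so stop early.
            discarded.extend(doc_id for doc_id, _ in queue)
            queue = []
    return fully_processed, discarded

# Toy run: documents "a"-"f" scored against the criterion; reviewers judged
# "a" and "b" relevant, "c" and "d" not.
scores = [("a", 0.9), ("b", 0.8), ("c", 0.4), ("d", 0.3), ("e", 0.2), ("f", 0.1)]
feedback = {"a": True, "b": True, "c": False, "d": False}
done, dropped = phased_process(scores, feedback)
```

Once the second batch fails review, the tail of the queue is discarded without full processing, which is the early-completion effect the abstract claims.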
8 Claims
1. (Recited above as the First Claim; claims 2 and 3 depend from claim 1.)
4. A method comprising:
inputting a user specified criterion which correlates to high relevance;
preparing a data set for processing;
de-duplicating the data set;
extracting the data set;
determining, prior to indexing the data set, where to prioritize the data set in a processing queue by evaluating how well the data set matches the user specified criterion relative to other data sets that have been processed, based on inputs received from users relating to their review of the already processed other data sets;
having a user perform a complete or sample assessment of initially processed data sets, which assessment outranks the initial user specified criterion in the event of conflict;
permitting a user to override prioritization of specific data sets or the schemes for determining prioritization;
indexing the data set; and
iteratively repeating the process, based on user feedback and the desired quality and quantity of relevant data, until all data sets have been exhausted.
(Claims 5, 6, 7, and 8 depend from claim 4.)
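Claim 4's override step — a user assessment that outranks the automatically computed criterion in the event of conflict — could look like the following sketch. The function and parameter names are assumptions for illustration only, not terms from the claim:

```python
def prioritize(data_set_ids, criterion_scores, user_overrides):
    """Order data sets best-match-first, letting an explicit user-assigned
    priority outrank the automatically computed criterion score."""
    def priority(ds_id):
        # A user's complete or sample assessment wins any conflict
        # with the initial user specified criterion.
        return user_overrides.get(ds_id, criterion_scores[ds_id])
    return sorted(data_set_ids, key=priority, reverse=True)

order = prioritize(
    ["set-x", "set-y", "set-z"],
    {"set-x": 0.2, "set-y": 0.9, "set-z": 0.5},
    {"set-x": 1.0},  # a reviewer promoted set-x despite its low criterion score
)
```

Here the reviewer's override moves the lowest-scoring data set to the front of the queue, while the remaining sets keep their criterion-driven order.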
Specification