System for enhancing expert-based computerized analysis of a set of digital documents and methods useful in conjunction therewith

US 8,533,194 B1
Filed: 06/15/2011
Issued: 09/10/2013
Est. Priority Date: 04/22/2009
Status: Active Grant

Abstract

An electronic document analysis method using a processor for analyzing N electronic documents, the method comprising providing a set of control electronic documents from among the electronic N documents; and using the set of control electronic documents and a processor to evaluate at least one aspect of a computerized text-classifier based electronic document categorization process performed on the N documents including computation of at least one statistic; wherein providing includes providing an initial set of control electronic documents; computing, using a processor, an estimated validation level of the at least one statistic assuming the initial set is used, and comparing the estimated validation level to a desired validation level, using a processor, and enlarging the initial set of control electronic documents if the estimated validation level falls below the desired validation level.

21 Claims

1. An electronic document analysis method using a processor for analyzing N electronic documents, the method comprising:
- i. providing an initial set of control electronic documents, randomly selected from the N documents, to initially use as a current set of control electronic documents;
  
  ii. for at least one current set of control electronic documents, performing a current iteration comprising the following steps a-c:
  
  a. running a computerized text-classifier on the current set of control electronic documents;
  
  b. using the current set of control electronic documents and the processor to evaluate at least one aspect of said running of said computerized text-classifier on said current set of control electronic documents; and
  
  c. using the processor for computing a current estimated validation level based on said at least one aspect and on at least one estimated validation level computed in an iteration, if any, previous to said current iteration, and
  
  iii. if the current estimated validation level is below a desired validation level, repeating said performing of a current iteration, using a set of training electronic documents larger than that used previously to generate a classifier to be run on a current set of control electronic documents, wherein
  
  said at least one aspect comprises an F-measure, and
  
  if the current estimated validation level is below a desired validation level, said performing of a current iteration is repeated using a set of control electronic documents larger than that used previously and wherein said using a set of control electronic documents larger than that used previously comprises:
  
  estimating a number of electronic documents to be added to the initial set of control documents if the desired validation level is to be achieved; and
  
  randomly selecting said number of electronic documents from among the N electronic documents.
- Dependent claims (6)
  - 6. A method according to claim 1 further comprising:
    - using at least one output of the electronic document analysis method to perform legal e-discovery.
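
The iteration recited in claim 1 can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the claim does not say how the estimated validation level is computed or how the number of documents to add is estimated, so both are passed in as hypothetical callables (`estimate_level`, `estimate_extra`). The names `f_measure`, `draw_more_controls`, `run_control_iterations`, `train_classifier` and `expert_label` are likewise invented for this example.

```python
import random
from typing import Callable, List, Sequence


def f_measure(truth: Sequence[bool], predicted: Sequence[bool], beta: float = 1.0) -> float:
    """F-measure over parallel relevance flags (True = relevant)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if p and not t)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    if tp == 0:
        return 0.0  # no true positives: treat the score as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def draw_more_controls(all_ids: Sequence[int], control_ids: List[int],
                       count: int, rng: random.Random) -> List[int]:
    """Randomly select up to `count` further documents not yet in the control set."""
    taken = set(control_ids)
    pool = [d for d in all_ids if d not in taken]
    return control_ids + rng.sample(pool, min(count, len(pool)))


def run_control_iterations(all_ids: Sequence[int],
                           expert_label: Callable[[int], bool],
                           train_classifier: Callable[[int], Callable[[int], bool]],
                           estimate_level: Callable[[float, List[float]], float],
                           estimate_extra: Callable[[float, float, int], int],
                           desired_level: float,
                           initial_size: int,
                           max_iterations: int = 10,
                           seed: int = 0):
    rng = random.Random(seed)
    control = rng.sample(list(all_ids), initial_size)   # step i: random initial control set
    levels: List[float] = []                            # levels from previous iterations
    for iteration in range(max_iterations):
        # Hypothetical hook: returns a classifier trained on a larger set each round.
        classify = train_classifier(iteration)
        truth = [expert_label(d) for d in control]
        predicted = [classify(d) for d in control]      # step a: run the classifier
        score = f_measure(truth, predicted)             # step b: evaluate one aspect
        level = estimate_level(score, levels)           # step c: uses earlier levels too
        levels.append(level)
        if level >= desired_level:
            break
        extra = estimate_extra(level, desired_level, len(control))
        control = draw_more_controls(all_ids, control, extra, rng)  # step iii: enlarge
    return control, levels
```

Passing the level estimator and the size estimator as parameters keeps the sketch neutral about the statistics involved; any concrete choice would be an assumption beyond the claim text.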

2. An electronic document analysis method using a processor for analyzing N electronic documents, the method comprising:
- performing at least a portion of a first computerized text-classifier based document categorization process on the N electronic documents, using a first computerized text-classifier, thereby to generate at least one output; and
  
  using said at least one output to perform at least a second computerized text-classifier based document categorization process on at least M additional electronic documents, wherein said using comprises
  
  applying the first computerized text-classifier based document categorization process to at least the M additional electronic documents, only if the processor has determined that the first computerized text-classifier, when applied to at least the M additional electronic documents, satisfies a predetermined categorization quality criterion, and
  
  otherwise, applying a document categorization process which is not based on the first computerized text-classifier based document categorization process, to at least the M additional electronic documents,
  
  wherein a single set, X, of control documents is used:
  
  both to make a first determination of whether or not the first computerized text-classifier, when applied to at least the M additional electronic documents, satisfies a predetermined categorization quality criterion, and also
  
  to make a second determination of whether or not said document categorization process which is applied to said M additional electronic documents and which is not based on the first computerized text-classifier based document, satisfies a predetermined categorization quality criterion,
  
  thereby to utilize a single categorization process applied to said single set X rather than conducting separate, first and second, categorization processes on separate, first and second, control sets to be used when making said first and second determinations respectively.
- Dependent claims (3, 4, 5, 7, 8, 9, 15, 17, 18, 20, 21)
  - 3. A method according to claim 2 wherein said performing includes completing the first computerized text-classifier based document categorization process on the N electronic documents and only subsequently performing said second computerized text-classifier based document categorization process on the N+M electronic documents.
  - 4. A method according to claim 3 wherein said using includes running a text classifier generated for the N documents, on at least a sample of the M electronic documents, determining if the text classifier fits the M documents, if so using said text classifier in said second computerized process and if not, creating a new text classifier for at least the M electronic documents and running a final text classifier on at least the M electronic documents.
  - 5. A method according to claim 2 wherein said first computerized process includes multiple computerized iterations, wherein a control document set is used to evaluate whether or not to perform additional computerized iterations, wherein said performing includes performing only some of said computerized iterations of the first text-classifier based document categorization process on the N documents and wherein said using comprises:
    - providing a categorization for each electronic document within a sample, of size X, from among the M documents; and
      
      performing said second process by continuing the first process except that said sample is added to said control electronic document set.
  - 7. A method according to claim 2 further comprising:
    - using at least one output of the second text-classifier based electronic document categorization process to perform legal e-discovery.
  - 8. A method according to claim 2 wherein said using includes using said at least one output to perform a single second computerized text-classifier based document categorization process using a single text classifier, on the N+M electronic documents including the N documents and M additional documents.
  - 9. A method according to claim 2 wherein said using includes using said at least one output to perform a second computerized text-classifier based document categorization process including running different text classifiers on the N documents and the M additional documents respectively.
  - 15. A method according to claim 2 further comprising:
    - using at least one output of the second text-classifier based document categorization process to perform legal e-discovery.
  - 17. A method according to claim 2, wherein said processor determines whether the first computerized text-classifier, when applied to at least the M additional electronic documents, satisfies a predetermined categorization quality criterion, based on control documents representing and being a subset of the M documents.
  - 18. A method according to claim 2, wherein if the processor determines that the first computerized text-classifier, when applied to at least the M additional electronic documents, fails to satisfy a predetermined categorization quality criterion, the first computerized text-classifier is re-trained using at least the M documents, thereby to obtain the second computerized text-classifier based document categorization process.
  - 20. A method according to claim 2, wherein said X documents are selected at random from among said M documents.
  - 21. A method according to claim 20 and wherein said using comprises:
    - receiving categorization process outputs for each of said X documents;
      
      running said first classifier, used to process said N documents, on said X documents, thereby to obtain results;
      
      using said categorization process outputs and said results to compute a quality measure of ranking of said X documents, thereby to obtain a first quality value;
      
      computing said quality measure on a set of control documents for said N original documents, thereby to obtain a second quality value; and
      
      if said first and second quality values are close, using said first classifier for said N and said M documents, rather than building a new classifier.
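
The check described in claims 20 and 21 can be sketched as follows. This is a hedged illustration only: the claims do not define the "quality measure of ranking" or what counts as "close", so the sketch assumes precision among the top-k scored documents as the measure and an absolute tolerance `closeness` as the test. All function and parameter names (`ranking_quality`, `reuse_first_classifier`, `score`, `expert_label`, `control_quality`) are hypothetical.

```python
import random
from typing import Callable, Sequence


def ranking_quality(scores: Sequence[float], relevant: Sequence[bool], k: int) -> float:
    """Precision among the k highest-scored documents (an assumed quality measure)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    if not top:
        return 0.0
    return sum(1 for i in top if relevant[i]) / len(top)


def reuse_first_classifier(new_docs: Sequence[int],
                           score: Callable[[int], float],
                           expert_label: Callable[[int], bool],
                           control_quality: float,
                           sample_size: int,
                           k: int,
                           closeness: float = 0.05,
                           seed: int = 0) -> bool:
    """Return True when the first classifier looks good enough to reuse on the new documents."""
    rng = random.Random(seed)
    sample = rng.sample(list(new_docs), sample_size)   # X documents drawn at random from M
    relevant = [expert_label(d) for d in sample]       # categorization outputs for X
    scores = [score(d) for d in sample]                # first classifier run on X
    first_quality = ranking_quality(scores, relevant, k)
    # control_quality plays the role of the second quality value, computed
    # beforehand on the control documents for the N original documents.
    return abs(first_quality - control_quality) <= closeness
```

If the function returns False, the claims call for building or retraining a classifier for the additional documents; that step is outside this sketch.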

10. An electronic document analysis system using a processor for analyzing N electronic documents, the system comprising:
- a computerized text-classifier operative for (a) running on at least one current set of control electronic documents, including, in a first iteration, an initial set of control electronic documents randomly selected from the N documents and in each subsequent iteration, if any, a set of control electronic documents larger than that used in each previous iteration, and
  
  a processor system operative for (b) using the current set of control electronic documents to evaluate at least one aspect of said running of said computerized text-classifier on said current set of control electronic documents; and
  
  for (c) computing a current estimated validation level based on said at least one aspect and on at least one estimated validation level computed in an iteration, if any, previous to said current iteration, wherein
  
  if the current estimated validation level is below a desired validation level, the computerized text-classifier and the processor system are operative for repeating (a), (b) and (c), using a set of training electronic documents larger than that used previously to generate a classifier to be run on a current set of control electronic documents,
  
  said at least one aspect comprises an F-measure, and
  
  if the current estimated validation level is below a desired validation level, performing of a current iteration is repeated using a set of control electronic documents larger than that used previously and wherein said using a set of control electronic documents larger than that used previously comprises:
  
  estimating a number of electronic documents to be added to the initial set of control documents if the desired validation level is to be achieved; and
  
  randomly selecting said number of electronic documents from among the N electronic documents.

11. A computer program product, comprising a non-transitory computer usable medium having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement an electronic document analysis method using a processor for analyzing N electronic documents, the method comprising:
- i. providing an initial set of control electronic documents randomly selected from the N documents to initially use as a current set of control electronic documents;
  
  ii. for at least one current set of control electronic documents, performing a current iteration comprising the following steps a-c:
  
  a. running a computerized text-classifier on the current set of control electronic documents;
  
  b. using the current set of control electronic documents and the processor to evaluate at least one aspect of said running of said computerized text-classifier on said current set of control electronic documents; and
  
  c. using the processor for computing a current estimated validation level based on said at least one aspect and on at least one estimated validation level computed in an iteration, if any, previous to said current iteration, and
  
  iii. if the current estimated validation level is below a desired validation level, repeating said performing of a current iteration, using a set of training electronic documents larger than that used previously to generate a classifier to be run on a current set of control electronic documents, wherein
  
  said at least one aspect comprises an F-measure, and
  
  if the current estimated validation level is below a desired validation level, performing of a current iteration is repeated using a set of control electronic documents larger than that used previously and wherein said using a set of control electronic documents larger than that used previously comprises:
  
  estimating a number of electronic documents to be added to the initial set of control documents if the desired validation level is to be achieved; and
  
  randomly selecting said number of electronic documents from among the N electronic documents.
- Dependent claims (14)
  - 14. A computer program product according to claim 11 and wherein the method further comprises using at least one output of the text-classifier based document categorization process to perform legal e-discovery.

12. An electronic document analysis system using a processor for analyzing N electronic documents, the system comprising:
- a repository initially storing N electronic documents and subsequently storing at least M additional electronic documents; and
  
  a computerized text-classifier operative for:
  
  performing at least a portion of a first computerized text-classifier based document categorization process on the N electronic documents, using a first computerized text-classifier, thereby to generate at least one output; and
  
  using said at least one output to perform at least a second computerized text-classifier based document categorization process on at least the M additional electronic documents, wherein said using comprises
  
  applying the first computerized text-classifier based document categorization process to at least the M additional electronic documents, only if the processor has determined that the first computerized text-classifier, when applied to at least the M additional electronic documents, satisfies a predetermined categorization quality criterion, and
  
  otherwise, applying a document categorization process which is not based on the first computerized text-classifier based document categorization process, to at least the M additional electronic documents,
  
  wherein a single set, X, of control documents is used:
  
  both to make a first determination of whether or not the first computerized text-classifier, when applied to at least the M additional electronic documents, satisfies a predetermined categorization quality criterion, and also
  
  to make a second determination of whether or not said document categorization process which is applied to said M additional electronic documents and which is not based on the first computerized text-classifier based document, satisfies a predetermined categorization quality criterion,
  
  thereby to utilize a single categorization process applied to said single set X rather than conducting separate, first and second, categorization processes on separate, first and second, control sets to be used when making said first and second determinations respectively.

13. A computer program product, comprising a non-transitory computer usable medium having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement an electronic document analysis method using a processor for analyzing N electronic documents, the method comprising:
- performing at least a portion of a first computerized text-classifier based document categorization process on the N electronic documents, using a first computerized text-classifier, thereby to generate at least one output; and
  
  using said at least one output to perform at least a second computerized text-classifier based document categorization process on at least M additional electronic documents, wherein
  
  said using comprises considering an F-measure, and
  
  said using comprises applying the first computerized text-classifier based document categorization process to at least the M additional electronic documents, only if the processor has determined that the first computerized text-classifier, when applied to at least the M additional electronic documents, is no more than a predetermined extent less than a second F-measure computed on X documents randomly selected from said M documents, relative to a first F-measure computed on control documents for the first computerized text-classifier based document categorization process.
- Dependent claims (16)
  - 16. A computer program product according to claim 13, wherein the method further comprises using at least one output of the second text-classifier based document categorization process to perform legal e-discovery.

19. An electronic document analysis method using a processor for analyzing N electronic documents, the method comprising:
- performing at least a portion of a first computerized text-classifier based document categorization process on the N electronic documents, using a first computerized text-classifier, thereby to generate at least one output; and
  
  using said at least one output to perform at least a second computerized text-classifier based document categorization process on at least M additional electronic documents, wherein said using comprises:
  
  considering an F-measure, and
  
  applying the first computerized text-classifier based document categorization process to at least the M additional electronic documents, only if the processor has determined that the first computerized text-classifier, when applied to at least the M additional electronic documents, is no more than a predetermined extent less than a second F-measure computed on X documents randomly selected from said M documents, relative to a first F-measure computed on control documents for the first computerized text-classifier based document categorization process.
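
One possible reading of the F-measure condition in claims 13 and 19 is a one-sided tolerance test: the first classifier is kept when the F-measure on the X sampled documents falls short of the F-measure on the original control documents by no more than a fixed amount. The short Python sketch below assumes that reading; the claim wording is ambiguous, and the names `f_measure_from_counts`, `first_classifier_still_fits` and `max_drop` are invented for illustration.

```python
def f_measure_from_counts(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1) from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def first_classifier_still_fits(first_f: float, second_f: float, max_drop: float = 0.05) -> bool:
    """True when the second F-measure is at most `max_drop` below the first.

    first_f:  F-measure on the control documents for the original N documents.
    second_f: F-measure on X documents sampled at random from the M new documents.
    max_drop: the assumed "predetermined extent"; 0.05 is an arbitrary placeholder.
    """
    return first_f - second_f <= max_drop
```

A second F-measure that is higher than the first always passes under this reading, since only a drop in quality argues against reusing the classifier.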

Current Assignee: Microsoft Israel Research And Development (2002) Ltd (Microsoft Corporation)
Original Assignee: Equivio Ltd. (Microsoft Corporation)
Inventors: Ravid, Yiftach; Sterenzy, Tal
Primary Examiner(s): Pulliam, Christyann
Assistant Examiner(s): Quader, Fazlul
Application Number: US13/161,087
Time in Patent Office: 818 Days
Field of Search: 707/737
US Class Current: 707/737
CPC Class Codes:
G06N 20/00 Machine learning
G06N 20/10 using kernel methods, e.g. ...