SYSTEM FOR ENHANCING EXPERT-BASED COMPUTERIZED ANALYSIS OF A SET OF DIGITAL DOCUMENTS AND METHODS USEFUL IN CONJUNCTION THEREWITH
First Claim
1. An electronic document analysis method receiving N electronic documents pertaining to a case encompassing a set of issues including at least one issue and establishing relevance of at least the N electronic documents to at least one individual issue in the set of issues, the method comprising, for at least one individual issue from among said set of issues:
i. receiving an output of a categorization process applied to documents in at least control subsets of said at least N electronic documents, said output including, for each document in said subsets, one of a relevant-to-said-individual issue indication and a non-relevant-to-said-individual issue indication;
ii. seeking an input as to whether or not to initiate a new iteration I. If not, terminate; if so continue to step iii;
iii. selecting m documents from among a subset of the N documents that are not in the control set and that were not used in previous rounds for training the classifier;
iv. receiving an output of a categorization process applied to the m documents;
v. adding the m documents to an existing training subset and building a text classifier simulating said categorization process using said output for all documents in said training subset of documents;
vi. evaluating said text classifier's quality using said output for documents in said control subset;
vii. selecting a cut-off point for binarizing said rankings of said documents in said control subset;
viii. using said cut-off point, computing and storing at least one quality criterion characterizing said binarizing of said rankings of said documents in said control subset, thereby to define a quality of performance indication of a current iteration I;
ix. displaying a comparison of the quality of performance indication of the current iteration I to quality of performance indications of previous iterations; and
x. returning to step ii.
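The iterative protocol of steps i-x above can be sketched as follows. The toy keyword-difference classifier, the batch size m, the fixed number of rounds, the 0.5 cut-off, and the F-measure quality criterion are illustrative assumptions only; the claim does not fix any particular learning algorithm, cut-off, or quality criterion.

```python
# Illustrative sketch of the claimed iterative review loop (steps i-x).
import math
from collections import Counter

def train_classifier(texts, labels):
    """Step v: build a text classifier simulating the expert categorization.
    Toy model: weight each term by (relevant count - non-relevant count)."""
    weights = Counter()
    for text, label in zip(texts, labels):
        for term in text.split():
            weights[term] += 1 if label == 1 else -1
    return lambda doc: 1 / (1 + math.exp(-sum(weights[t] for t in doc.split())))

def f1(truth, pred):
    """Step viii: one possible quality criterion (F-measure)."""
    tp = sum(t == p == 1 for t, p in zip(truth, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def iterate(pool, control_docs, control_labels, reviewer, m=4, rounds=2):
    """pool: documents outside the control set (step iii draws from it);
    reviewer: callable text -> 0/1 standing in for the expert (step iv)."""
    train_texts, train_labels, history = [], [], []
    for i in range(rounds):                           # step ii: new iteration?
        batch, pool = pool[:m], pool[m:]              # step iii: m unseen docs
        train_texts += batch
        train_labels += [reviewer(d) for d in batch]  # steps iv-v
        classify = train_classifier(train_texts, train_labels)
        scores = [classify(d) for d in control_docs]  # step vi: rank controls
        cutoff = 0.5                                  # step vii: cut-off point
        binarized = [1 if s >= cutoff else 0 for s in scores]
        history.append(f1(control_labels, binarized)) # step viii: store quality
        print(f"iteration {i + 1}: F1 = {history[-1]:.2f}")  # step ix: display
    return history                                    # step x: loop to step ii
```

Each pass through the loop grows the training subset while the control subset stays fixed, so the stored F-measures are comparable across iterations, which is what the step ix display relies on.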
2 Assignments
0 Petitions
Abstract
An electronic document analysis method receiving N electronic documents pertaining to a case encompassing a set of issues including at least one issue and establishing relevance of at least the N documents to at least one individual issue in the set of issues, the method comprising, for at least one individual issue from among the set of issues, receiving an output of a categorization process applied to each document in training and control subsets of the at least N documents, the output including, for each document in the subsets, one of a relevant-to-the-individual issue indication and a non-relevant-to-the-individual issue indication; building a text classifier simulating the categorization process using the output for all documents in the training subset of documents; and running the text classifier on the at least N documents thereby to obtain a ranking of the extent of relevance of each of the at least N documents to the individual issue. The method may also comprise evaluating the text classifier's quality using the output for all documents in the control subset.
39 Claims
38. An electronic document analysis system operative for receiving N electronic documents pertaining to a case encompassing a set of issues including at least one issue and establishing relevance of at least the N electronic documents to at least one individual issue in the set of issues, the system comprising:
a processor operative, for at least one individual issue from among said set of issues, for:
i. receiving an output of a categorization process applied to documents in at least control subsets of said at least N electronic documents, said output including, for each document in said subsets, one of a relevant-to-said-individual issue indication and a non-relevant-to-said-individual issue indication;
ii. seeking an input as to whether or not to initiate a new iteration I. If not, terminate; if so continue to step iii;
iii. selecting m documents from among a subset of the N documents that are not in the control set and that were not used in previous rounds for training the classifier;
iv. receiving an output of a categorization process applied to the m documents;
v. adding the m documents to an existing training subset and building a text classifier simulating said categorization process using said output for all documents in said training subset of documents;
vi. evaluating said text classifier's quality using said output for documents in said control subset;
vii. selecting a cut-off point for binarizing said rankings of said documents in said control subset;
viii. using said cut-off point, computing and storing at least one quality criterion characterizing said binarizing of said rankings of said documents in said control subset, thereby to define a quality of performance indication of a current iteration I;
ix. displaying a comparison of the quality of performance indication of the current iteration I to quality of performance indications of previous iterations; and
x. returning to step ii.
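Steps vii and viii of the claimed method select a cut-off point for binarizing the control-set rankings and compute a quality criterion from that binarization. A minimal sketch, assuming the quality criterion is F-measure and that the cut-off is chosen from among the observed scores so as to maximize it; the claim itself leaves both the criterion and the selection rule open.

```python
# Sketch of cut-off selection over a control-set ranking (steps vii-viii).
def precision_recall_f(truth, pred):
    """Compute precision, recall, and F-measure for binarized predictions
    against the expert's relevant (1) / non-relevant (0) indications."""
    tp = sum(t == p == 1 for t, p in zip(truth, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def best_cutoff(scores, truth):
    """Step vii: try each observed score as a cut-off, binarize the ranking
    at it (step viii), and return the (cutoff, f_measure) pair that wins."""
    best = (0.0, -1.0)
    for cutoff in sorted(set(scores)):
        pred = [1 if s >= cutoff else 0 for s in scores]
        _, _, f = precision_recall_f(truth, pred)
        if f > best[1]:
            best = (cutoff, f)
    return best
```

For example, given classifier scores [0.9, 0.8, 0.6, 0.4, 0.2] and expert indications [1, 1, 0, 1, 0] on the control subset, the sweep selects the cut-off 0.4, which trades one false positive for full recall.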
Specification