System for enhancing expert-based computerized analysis of a set of digital documents and methods useful in conjunction therewith
Abstract
An electronic document analysis method using a processor for analyzing N electronic documents, the method comprising: providing a set of control electronic documents from among the N electronic documents; and using the set of control electronic documents and a processor to evaluate at least one aspect of a computerized text-classifier based electronic document categorization process performed on the N documents, including computation of at least one statistic; wherein said providing includes: providing an initial set of control electronic documents; computing, using a processor, an estimated validation level of the at least one statistic assuming the initial set is used; comparing the estimated validation level to a desired validation level, using a processor; and enlarging the initial set of control electronic documents if the estimated validation level falls below the desired validation level.
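For illustration only, the enlarge-if-below-target idea in the abstract can be sketched in Python. The text above does not say how the estimated validation level is computed, so this sketch assumes a simple stand-in: one minus the half-width of a normal-approximation confidence interval, treating the F-measure like a proportion. All function names, the `z` value, and that formula are assumptions made for the example, not the patented method. The sketch also omits the use of validation levels from earlier iterations.

```python
import math
import random


def f_measure(predicted, actual, beta=1.0):
    """F-measure of binary predictions against expert labels (True = relevant)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def validation_level(stat, n, z=1.96):
    """ASSUMPTION: 1 minus the normal-approximation half-width for n control documents."""
    return 1.0 - z * math.sqrt(stat * (1.0 - stat) / n)


def documents_to_add(stat, n_current, desired_level, z=1.96):
    """Estimate how many more control documents the assumed formula needs."""
    if not 0.0 < desired_level < 1.0:
        raise ValueError("desired_level must lie strictly between 0 and 1")
    margin = 1.0 - desired_level
    n_required = math.ceil(z * z * stat * (1.0 - stat) / (margin * margin))
    return max(0, n_required - n_current)


def enlarge_control_set(control_ids, all_ids, how_many, rng):
    """Randomly add documents; ASSUMPTION: drawn only from those not already in the set."""
    pool = sorted(set(all_ids) - set(control_ids))
    return list(control_ids) + rng.sample(pool, min(how_many, len(pool)))
```

In use, a caller would compute `f_measure` on the current control set, compare `validation_level` with the desired level, and, if it falls short, pass the result of `documents_to_add` to `enlarge_control_set` before running the classifier again.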
21 Claims
1. An electronic document analysis method using a processor for analyzing N electronic documents, the method comprising:
i. providing an initial set of control electronic documents, randomly selected from the N documents, to initially use as a current set of control electronic documents;
ii. for at least one current set of control electronic documents, performing a current iteration comprising the following steps a-c;
   a. running a computerized text-classifier on the current set of control electronic documents;
   b. using the current set of control electronic documents and the processor to evaluate at least one aspect of said running of said computerized text-classifier on said current set of control electronic documents; and
   c. using the processor for computing a current estimated validation level based on said at least one aspect and on at least one estimated validation level computed in an iteration, if any, previous to said current iteration, and
iii. if the current estimated validation level is below a desired validation level, repeating said performing of a current iteration, using a set of training electronic documents larger than that used previously to generate a classifier to be run on a current set of control electronic documents,
wherein said at least one aspect comprises an F-measure, and
if the current estimated validation level is below a desired validation level, said performing of a current iteration is repeated using a set of control electronic documents larger than that used previously and wherein said using a set of control electronic documents larger than that used previously comprises;
   estimating a number of electronic documents to be added to the initial set of control documents if the desired validation level is to be achieved; and
   randomly selecting said number of electronic documents from among the N electronic documents.
Dependent claims: 6

2. An electronic document analysis method using a processor for analyzing N electronic documents, the method comprising:
performing at least a portion of a first computerized text-classifier based document categorization process on the N electronic documents, using a first computerized text-classifier, thereby to generate at least one output; and
using said at least one output to perform at least a second computerized text-classifier based document categorization process on at least M additional electronic documents,
wherein said using comprises applying the first computerized text-classifier based document categorization process to at least the M additional electronic documents, only if the processor has determined that the first computerized text-classifier, when applied to at least the M additional electronic documents, satisfies a predetermined categorization quality criterion, and otherwise, applying a document categorization process which is not based on the first computerized text-classifier based document categorization process, to at least the M additional electronic documents,
wherein a single set, X, of control documents is used; both to make a first determination of whether or not the first computerized text-classifier, when applied to at least the M additional electronic documents, satisfies a predetermined categorization quality criterion, and also to make a second determination of whether or not said document categorization process which is applied to said M additional electronic documents and which is not based on the first computerized text-classifier based document, satisfies a predetermined categorization quality criterion,
thereby to utilize a single categorization process applied to said single set X rather than conducting separate, first and second, categorization processes on separate, first and second, control sets to be used when making said first and second determinations respectively.
Dependent claims: 3, 4, 5, 7, 8, 9, 15, 17, 18, 20, 21

10. An electronic document analysis system using a processor for analyzing N electronic documents, the system comprising:
a computerized text-classifier operative for (a) running on at least one current set of control electronic documents, including, in a first iteration, an initial set of control electronic documents randomly selected from the N documents and in each subsequent iteration, if any, a set of control electronic documents larger than that used in each previous iteration, and
a processor system operative for (b) using the current set of control electronic documents to evaluate at least one aspect of said running of said computerized text-classifier on said current set of control electronic documents; and
for (c) computing a current estimated validation level based on said at least one aspect and on at least one estimated validation level computed in an iteration, if any, previous to said current iteration,
wherein if the current estimated validation level is below a desired validation level, the computerized text-classifier and the processor system are operative for repeating (a), (b) and (c), using a set of training electronic documents larger than that used previously to generate a classifier to be run on a current set of control electronic documents,
said at least one aspect comprises an F-measure, and
if the current estimated validation level is below a desired validation level, performing of a current iteration is repeated using a set of control electronic documents larger than that used previously and wherein said using a set of control electronic documents larger than that used previously comprises;
   estimating a number of electronic documents to be added to the initial set of control documents if the desired validation level is to be achieved; and
   randomly selecting said number of electronic documents from among the N electronic documents.

11. A computer program product, comprising a non-transitory computer usable medium having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement an electronic document analysis method using a processor for analyzing N electronic documents, the method comprising:
i. providing an initial set of control electronic documents randomly selected from the N documents to initially use as a current set of control electronic documents;
ii. for at least one current set of control electronic documents, performing a current iteration comprising the following steps a-c;
   a. running a computerized text-classifier on the current set of control electronic documents;
   b. using the current set of control electronic documents and the processor to evaluate at least one aspect of said running of said computerized text-classifier on said current set of control electronic documents; and
   c. using the processor for computing a current estimated validation level based on said at least one aspect and on at least one estimated validation level computed in an iteration, if any, previous to said current iteration, and
iii. if the current estimated validation level is below a desired validation level, repeating said performing of a current iteration, using a set of training electronic documents larger than that used previously to generate a classifier to be run on a current set of control electronic documents,
wherein said at least one aspect comprises an F-measure, and
if the current estimated validation level is below a desired validation level, performing of a current iteration is repeated using a set of control electronic documents larger than that used previously and wherein said using a set of control electronic documents larger than that used previously comprises;
   estimating a number of electronic documents to be added to the initial set of control documents if the desired validation level is to be achieved; and
   randomly selecting said number of electronic documents from among the N electronic documents.
Dependent claims: 14

12. An electronic document analysis system using a processor for analyzing N electronic documents, the system comprising:
a repository initially storing N electronic documents and subsequently storing at least M additional electronic documents; and
a computerized text-classifier operative for;
   performing at least a portion of a first computerized text-classifier based document categorization process on the N electronic documents, using a first computerized text-classifier, thereby to generate at least one output; and
   using said at least one output to perform at least a second computerized text-classifier based document categorization process on at least the M additional electronic documents,
wherein said using comprises applying the first computerized text-classifier based document categorization process to at least the M additional electronic documents, only if the processor has determined that the first computerized text-classifier, when applied to at least the M additional electronic documents, satisfies a predetermined categorization quality criterion, and otherwise, applying a document categorization process which is not based on the first computerized text-classifier based document categorization process, to at least the M additional electronic documents,
wherein a single set, X, of control documents is used; both to make a first determination of whether or not the first computerized text-classifier, when applied to at least the M additional electronic documents, satisfies a predetermined categorization quality criterion, and also to make a second determination of whether or not said document categorization process which is applied to said M additional electronic documents and which is not based on the first computerized text-classifier based document, satisfies a predetermined categorization quality criterion,
thereby to utilize a single categorization process applied to said single set X rather than conducting separate, first and second, categorization processes on separate, first and second, control sets to be used when making said first and second determinations respectively.

13. A computer program product, comprising a non-transitory computer usable medium having a computer readable program code embodied therein, said computer readable program code adapted to be executed to implement an electronic document analysis method using a processor for analyzing N electronic documents, the method comprising:
performing at least a portion of a first computerized text-classifier based document categorization process on the N electronic documents, using a first computerized text-classifier, thereby to generate at least one output; and
using said at least one output to perform at least a second computerized text-classifier based document categorization process on at least M additional electronic documents,
wherein said using comprises considering an F-measure, and
said using comprises applying the first computerized text-classifier based document categorization process to at least the M additional electronic documents, only if the processor has determined that the first computerized text-classifier, when applied to at least the M additional electronic documents, is no more than a predetermined extent less than a second F-measure computed on X documents randomly selected from said M documents, relative to a first F-measure computed on control documents for the first computerized text-classifier based document categorization process.
Dependent claims: 16

19. An electronic document analysis method using a processor for analyzing N electronic documents, the method comprising:
performing at least a portion of a first computerized text-classifier based document categorization process on the N electronic documents, using a first computerized text-classifier, thereby to generate at least one output; and
using said at least one output to perform at least a second computerized text-classifier based document categorization process on at least M additional electronic documents, wherein said using comprises;
   considering an F-measure, and
   applying the first computerized text-classifier based document categorization process to at least the M additional electronic documents, only if the processor has determined that the first computerized text-classifier, when applied to at least the M additional electronic documents, is no more than a predetermined extent less than a second F-measure computed on X documents randomly selected from said M documents, relative to a first F-measure computed on control documents for the first computerized text-classifier based document categorization process.

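The reuse condition recited in claims 13 and 19 can be illustrated with a short Python sketch. It is one possible reading, not the claimed implementation: the "predetermined extent" is modelled here as a maximum absolute drop in F-measure. The names `second_f_measure`, `may_reuse_first_classifier`, `classify`, `expert_label` and `max_drop` are hypothetical.

```python
import random


def f_measure(predicted, actual):
    """Balanced F-measure of binary predictions against expert labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def second_f_measure(classify, additional_docs, expert_label, x, rng):
    """F-measure of the first classifier on X documents drawn at random from the M additional ones."""
    sample = rng.sample(additional_docs, x)
    predicted = [classify(doc) for doc in sample]
    actual = [expert_label(doc) for doc in sample]
    return f_measure(predicted, actual)


def may_reuse_first_classifier(first_f, second_f, max_drop):
    """ASSUMPTION: reuse is allowed when the F-measure falls by at most max_drop."""
    return first_f - second_f <= max_drop


if __name__ == "__main__":
    # Toy data: the "classifier" and the "expert" are both simple keyword rules.
    docs = ["contract breach", "lunch menu", "contract renewal", "holiday party"]
    f2 = second_f_measure(lambda d: "contract" in d, docs,
                          lambda d: "contract" in d, 4, random.Random(0))
    print(may_reuse_first_classifier(0.80, f2, 0.05))
```

If the check fails, the claims call for a categorization process not based on the first classifier, for example training a new classifier on the additional documents.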
Specification