Language-independent method of generating index terms
Abstract
Index terms are drawn from text documents without the need for language-specific processes or training and are suitable as gists for the subject documents. Index terms are extracted on the basis of scores of constituent n-grams relative to n-gram counts in a corpus. A method of extracting joint index terms to represent a plurality of documents is also provided.
14 Claims
1. A method of extracting index terms from sample text relative to background text, comprising the steps of:
(a) filtering the background text to remove undesired symbols, thereby to produce filtered background text;
(b) counting the n-grams in said filtered background text to produce background n-gram counts;
(c) filtering the sample text to remove undesired symbols, thereby to produce filtered sample text;
(d) counting the n-grams in said filtered sample text to produce sample n-gram counts;
(e) comparing said sample n-gram counts to said background n-gram counts to produce n-gram scores;
(f) assigning to each symbol of said filtered sample text a symbol score derived from said n-gram scores, said symbol score being derived from the scores of the n-grams containing said symbol;
(g) determining a symbol score threshold; and
(h) extracting as index terms the words and phrases of said filtered sample text that contain symbols whose symbol scores exceed said symbol score threshold.
Dependent claims: 2, 3, 4, 5, 6.
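
Steps (a) through (h) of claim 1 can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the patent: the function names, the use of character trigrams (`n=3`), the letters-only filter, the smoothed log-ratio used as the n-gram score, averaging as the way to derive a symbol score, and the quantile rule for the threshold.

```python
import math
from collections import Counter


def filter_text(text):
    """Steps (a)/(c): keep letters, turn every other symbol into a space."""
    kept = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return " ".join(kept.split())


def count_ngrams(text, n):
    """Steps (b)/(d): count every overlapping character n-gram."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def score_ngrams(sample, background):
    """Step (e): smoothed log-ratio of relative frequencies (an assumption)."""
    s_total = sum(sample.values())
    b_total = sum(background.values())
    vocab = len(set(sample) | set(background))
    scores = {}
    for gram, c in sample.items():
        p_sample = (c + 1) / (s_total + vocab)
        p_background = (background.get(gram, 0) + 1) / (b_total + vocab)
        scores[gram] = math.log(p_sample / p_background)
    return scores


def symbol_scores(text, n, scores):
    """Step (f): average the scores of the n-grams covering each symbol."""
    totals = [0.0] * len(text)
    covers = [0] * len(text)
    for i in range(len(text) - n + 1):
        s = scores.get(text[i:i + n], 0.0)
        for j in range(i, i + n):
            totals[j] += s
            covers[j] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, covers)]


def extract_terms(text, per_symbol, threshold):
    """Step (h): keep words with a symbol above threshold; join neighbours."""
    terms, phrase, pos = [], [], 0
    for word in text.split(" "):
        span = per_symbol[pos:pos + len(word)]
        pos += len(word) + 1  # skip the word and the space after it
        if span and max(span) > threshold:
            phrase.append(word)
        elif phrase:
            terms.append(" ".join(phrase))
            phrase = []
    if phrase:
        terms.append(" ".join(phrase))
    return terms


def index_terms(sample_text, background_text, n=3, quantile=0.8):
    background = count_ngrams(filter_text(background_text), n)
    sample = filter_text(sample_text)
    scores = score_ngrams(count_ngrams(sample, n), background)
    per_symbol = symbol_scores(sample, n, scores)
    if not per_symbol:
        return []
    # Step (g): threshold at a quantile of the symbol scores (an assumption).
    threshold = sorted(per_symbol)[int(quantile * (len(per_symbol) - 1))]
    return extract_terms(sample, per_symbol, threshold)
```

For example, `index_terms("The cat sat on the zebra.", "the cat sat on the mat " * 50)` returns `["zebra"]`: only the trigrams around "zebra" are rare in the background, so only its symbols clear the threshold. Nothing in the sketch depends on a dictionary or tokenizer for a particular language, which is the property the abstract emphasizes.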

7. A method of extracting joint index terms from a plurality of sample texts relative to background text, comprising:
(a) filtering the background text to remove undesired symbols, thereby to produce filtered background text;
(b) counting the n-grams in said filtered background text to produce background n-gram counts;
(c) selecting an intersection score threshold;
(d) marking all n-grams as candidates;
(e) clearing master n-gram counts;
(f) for each sample text of the plurality of sample texts, performing the steps of:
    (i) filtering the sample text to remove undesired symbols, thereby to produce filtered sample text;
    (ii) counting the n-grams in said filtered sample text to produce sample n-gram counts;
    (iii) adding said sample n-gram counts to the master n-gram counts;
    (iv) for each sample n-gram that is still a candidate, comparing said sample n-gram count to the corresponding background n-gram count to produce an n-gram score; and
    (v) marking all n-grams whose scores are below said intersection score threshold as no longer being candidates;
(g) for each n-gram that is still a candidate, comparing its master n-gram count to the corresponding background n-gram count to produce a master n-gram score, and for each n-gram that is not still a candidate, assigning a master n-gram score of zero; and
(h) for each sample text of the plurality of sample texts, performing the steps of:
    (i) assigning to each symbol of said filtered sample text a symbol score derived from said n-gram scores, said symbol score being derived from the master n-gram scores of the n-grams containing said symbol;
    (ii) determining a symbol score threshold; and
    (iii) extracting as index terms the words and phrases of said filtered sample text that contain symbols whose symbol scores exceed said symbol score threshold.
Dependent claims: 8, 9, 10, 11, 12.
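
The candidate bookkeeping in steps (c) through (g) of claim 7 can be sketched as follows. The sketch takes n-gram counts that have already been produced for each filtered sample text. The names `joint_master_scores` and `ratio_score` are hypothetical, the score is a deliberately simple stand-in rather than the patent's statistic, and treating an n-gram that is absent from a sample as no longer a candidate is one reading of the claim, not something it states outright.

```python
from collections import Counter


def ratio_score(count, background_count):
    """Hypothetical stand-in score: count over smoothed background count."""
    return count / (background_count + 1)


def joint_master_scores(samples, background, intersection_threshold,
                        score=ratio_score):
    """Return a master score per n-gram for a list of per-sample Counters."""
    candidates = None                      # step (d): None means "all n-grams"
    master = Counter()                     # step (e): cleared master counts
    for counts in samples:                 # step (f), one pass per sample text
        master.update(counts)              # (iii) add to the master counts
        pool = set(counts) if candidates is None else candidates
        # (iv)-(v): drop n-grams that are absent or score below the threshold.
        candidates = {
            g for g in pool
            if g in counts
            and score(counts[g], background.get(g, 0)) >= intersection_threshold
        }
    candidates = candidates or set()
    # Step (g): master score for surviving candidates, zero for the rest.
    return {
        g: score(master[g], background.get(g, 0)) if g in candidates else 0.0
        for g in master
    }
```

Because an n-gram must clear the intersection threshold in every sample text to keep a non-zero master score, the symbol scores computed in step (h) highlight only material that the sample texts have in common.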

13. A method of extracting index terms from sample text relative to background text, comprising the steps of:
a) filtering the background text to remove undesired symbols, thereby to produce filtered background text;
b) counting the n-grams in said filtered background text to produce background n-gram counts;
c) filtering the sample text to remove undesired symbols, thereby to produce filtered sample text;
d) counting the n-grams in said filtered sample text to produce sample n-gram counts, wherein said sample n-gram counts and said background n-gram counts are produced by accumulating counts in hash tables;
e) comparing said sample n-gram counts to said background n-gram counts to produce n-gram scores, wherein said comparing step for each count includes computing the score
##EQU7##
where Ci is the sample n-gram count being compared, Bi is the corresponding background n-gram count, S is the sample size, and R is the background size, and where t = (Ci + Bi)/(S + R);
f) assigning to each symbol of said filtered sample text a symbol score derived from said n-gram scores, said symbol score being derived from the scores of the n-grams containing said symbol;
g) determining a symbol score threshold; and
h) extracting as index terms the words and phrases of said filtered sample text that contain symbols whose symbol scores exceed said symbol score threshold.
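
Steps d) and e) of claim 13 add two concrete details: counts are accumulated in hash tables, and the score uses the pooled proportion t. The sketch below illustrates both in Python, using a `dict` as the hash table. The score equation itself (EQU7) is not reproduced in this text, so `llr_score` substitutes a standard binomial log-likelihood-ratio statistic built on the same t; that substitution is an assumption and should not be read as the claimed formula. The sketch also assumes that S and R are the total n-gram counts of the sample and the background and that both are positive. All function names are hypothetical.

```python
import math


def accumulate_counts(text, n, table=None):
    """Accumulate character n-gram counts in a hash table (a Python dict)."""
    table = {} if table is None else table
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        table[gram] = table.get(gram, 0) + 1
    return table


def pooled_proportion(c_i, b_i, s, r):
    """t = (Ci + Bi) / (S + R), as defined in the claim."""
    return (c_i + b_i) / (s + r)


def _xlogy(x, y):
    """x * ln(y), taking 0 * ln(0) as 0."""
    return 0.0 if x == 0 else x * math.log(y)


def llr_score(c_i, b_i, s, r):
    """Binomial log-likelihood ratio (assumed stand-in for the claimed score)."""
    t = pooled_proportion(c_i, b_i, s, r)
    p_s, p_b = c_i / s, b_i / r
    # Log-likelihood with separate rates for sample and background.
    separate = (_xlogy(c_i, p_s) + _xlogy(s - c_i, 1 - p_s)
                + _xlogy(b_i, p_b) + _xlogy(r - b_i, 1 - p_b))
    # Log-likelihood with the single pooled rate t.
    pooled = _xlogy(c_i + b_i, t) + _xlogy(s + r - c_i - b_i, 1 - t)
    return 2.0 * (separate - pooled)
```

Passing an existing table back into `accumulate_counts` lets a large background text be counted in pieces. The stand-in statistic is close to zero when the n-gram is equally frequent in both texts and grows as the two rates diverge in either direction, so a caller interested only in over-represented n-grams may want to zero it when Ci/S is below Bi/R.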

14. A method of extracting joint index terms from a plurality of sample texts relative to background text, comprising the steps of:
a) filtering the background text to remove undesired symbols, thereby to produce filtered background text;
b) counting the n-grams in said filtered background text to produce background n-gram counts;
c) selecting an intersection score threshold;
d) marking all n-grams as candidates;
e) clearing master n-gram counts;
f) for each sample text of the plurality of sample texts, performing the steps of:
    i) filtering the sample text to remove undesired symbols, thereby to produce filtered sample text;
    ii) counting the n-grams in said filtered sample text to produce sample n-gram counts, wherein said sample n-gram counts and said background n-gram counts are produced by accumulating counts in hash tables;
    iii) adding said sample n-gram counts to the master n-gram counts;
    iv) for each sample n-gram that is still a candidate, comparing said sample n-gram count to the corresponding background n-gram count to produce an n-gram score, wherein said step of comparing said sample n-gram counts to said background n-gram counts to produce n-gram scores, for each count, includes computing the score
    ##EQU8##
    where Ci is the sample n-gram count being compared, Bi is the corresponding background n-gram count, S is the sample size, and R is the background size, and where t = (Ci + Bi)/(S + R); and
    v) marking all n-grams whose scores are below said intersection score threshold as no longer being candidates;
g) for each n-gram that is still a candidate, comparing its master n-gram count to the corresponding background n-gram count to produce a master n-gram score, and for each n-gram that is not still a candidate, assigning a master n-gram score of zero; and
h) for each sample text of the plurality of sample texts, performing the steps of:
    i) assigning to each symbol of said filtered sample text a symbol score derived from said n-gram scores, said symbol score being derived from the master n-gram scores of the n-grams containing said symbol;
    ii) determining a symbol score threshold; and
    iii) extracting as index terms the words and phrases of said filtered sample text that contain symbols whose symbol scores exceed said symbol score threshold.
Specification