Method and system for calculating phrase-document importance
Abstract
A method and system for generating a weight for phrases within each document in a collection of documents. Each document has terms such as words and numbers. Each phrase comprises component terms. Each term frequency represents the number of occurrences of a term in a document, and the phrase frequency represents the number of occurrences of a phrase in a document. To generate the weight, the weighting system first estimates a document frequency for the phrase by multiplying an estimated phrase probability of the phrase by the number of documents that contain each component term. The estimated phrase probability is an estimation of the probability that any phrase in documents that contain each component term is the phrase whose weight is to be estimated. The document frequency is the number of documents that contain the phrase. The weighting system then estimates a total phrase frequency for the phrase as the average phrase frequency for the phrase times the estimated document frequency for the phrase. The weighting system derives the average phrase frequency from the phrase probability of the phrase and the average number of terms per document. The weighting system then combines the estimated document frequency with the estimated total phrase frequency to generate the weight of the phrase.
66 Claims
1. A method in a computer system for generating a weight for a phrase within one of a plurality of documents, each document having terms, the phrase having component terms, the method comprising:
for each term, providing a term frequency that represents the number of occurrences of that term in the plurality of documents;
estimating a document frequency for the phrase based on an estimated phrase probability of the phrase, the document frequency being the number of the plurality of the documents that contain the phrase, the estimated phrase probability being an estimation of the probability that any phrase in documents that contain each component term is the phrase, the phrase probability being derived from term probabilities of the component terms, the term probability of a component term being a ratio of an average of the provided term frequencies for the component terms per document that contains that component term to an average number of terms per document;
estimating a total phrase frequency for the phrase based on an average phrase frequency for the phrase times the estimated document frequency for the phrase, the average phrase frequency being derived from the phrase probability of the phrase and the average number of terms per document; and
combining the estimated document frequency with the estimated total phrase frequency to generate the weight of the phrase.
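The term probability recited in claim 1 can be illustrated with a minimal sketch. It assumes each document is represented as a list of terms; the function name `term_probability` and the toy collection are hypothetical and are not taken from the claims.

```python
from collections import Counter


def term_probability(docs, term):
    """Average frequency of `term` per document that contains it,
    divided by the average number of terms per document."""
    # Frequency of the term in each document that contains it at least once.
    freqs = [Counter(doc)[term] for doc in docs if term in doc]
    if not freqs:
        return 0.0
    avg_term_freq = sum(freqs) / len(freqs)
    avg_terms_per_doc = sum(len(doc) for doc in docs) / len(docs)
    return avg_term_freq / avg_terms_per_doc


# Hypothetical toy collection; each document is a list of terms.
docs = [
    ["information", "retrieval", "system", "information"],
    ["retrieval", "of", "documents", "today"],
    ["weather", "report", "for", "today"],
]
p_information = term_probability(docs, "information")
p_retrieval = term_probability(docs, "retrieval")
```

In this toy collection "information" occurs twice in the single document that contains it, and documents average four terms, so the sketch yields a term probability of 0.5.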
7. [...] where W_tj is the weight of the phrase, where PF_tj is the phrase frequency within the one document, where PF_t is the estimated total phrase frequency, where Γ is a normalizing term frequency function, where N is the number of documents, where n_t is the estimated document frequency, and where α and β are the bases of the logarithms.
8. The method of claim 7 wherein the normalizing term frequency function Γ is a square root function.
9. The method of claim 7 wherein the normalizing term frequency function Γ is a logarithmic function.
10. The method of claim 7 wherein the bases α and β are selected so that each factor of the formula contributes equally on average to the weight.
11. The method of claim 1 wherein the combining is a logarithmic function of a phrase frequency for the document normalized by the estimated total phrase frequency divided by a logarithm of the number of the plurality of documents divided by the estimated document frequency for the phrase.
12. The method of claim 1 including estimating the number of documents that contain each component term by multiplying the number of the plurality of documents by the document probability of the phrase, the document probability of the phrase being a probability that a document contains each component term.
13. The method of claim 12 wherein the document probability of a phrase is a product of the document probabilities of each component term, the document probability of a component term being a probability that a document contains that component term.
14. The method of claim 13 wherein the document probability of a component term is the document frequency of that term divided by the number of the plurality of documents, the document frequency of a term being the number of the plurality of the documents that contain that term.
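Claims 12 through 14 chain together into a short calculation, sketched below. The function name and the example figures are hypothetical; the product over component terms follows claim 13, which in effect treats the terms as occurring independently across documents.

```python
from math import prod


def docs_containing_all_terms(term_doc_freqs, num_docs):
    """Estimate how many documents contain every component term.

    term_doc_freqs: document frequency of each component term
                    (number of documents that contain that term).
    num_docs:       number of documents in the collection.
    """
    # Claim 14: document probability of a term = document frequency / N.
    term_doc_probs = [df / num_docs for df in term_doc_freqs]
    # Claim 13: document probability of the phrase = product over terms.
    phrase_doc_prob = prod(term_doc_probs)
    # Claim 12: multiply by the number of documents.
    return num_docs * phrase_doc_prob


# Hypothetical figures: 1000 documents, component terms found in 100 and 50.
estimate = docs_containing_all_terms([100, 50], 1000)
```

With these figures the estimate is about 5 documents, since 0.1 times 0.05 times 1000 is 5.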
15. A method in a computer system for estimating a document frequency of a phrase, the document frequency indicating a number of documents of a plurality of documents that contains the phrase, each document having terms, each term having a term frequency for each document, the term frequency for a term indicating a number of occurrences of that term within the document, the phrase having component terms, the method comprising:
estimating a phrase probability for the phrase, the estimated phrase probability being an estimation of the probability that any phrase in documents that contain each component term of the phrase is the phrase, the estimated phrase probability being derived from the term frequencies of the component terms; and
multiplying the estimated phrase probability by a number of documents that contain each component term to estimate the document frequency.
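The two steps of claim 15 can be sketched as follows. The claim says only that the phrase probability is derived from the term frequencies of the component terms; multiplying per-term probabilities together is an assumption made here for illustration, and all names are hypothetical.

```python
from math import prod


def estimate_phrase_probability(term_probs):
    """Assumption: combine component-term probabilities by multiplication,
    i.e. treat the component terms as independent. The claim itself does
    not fix the derivation."""
    return prod(term_probs)


def estimate_document_frequency(term_probs, docs_with_all_terms):
    """Estimated phrase probability times the number of documents
    that contain each component term."""
    return estimate_phrase_probability(term_probs) * docs_with_all_terms


# Hypothetical inputs: two component terms and 40 documents containing both.
est_doc_freq = estimate_document_frequency([0.5, 0.25], 40)
```

Here the phrase probability is 0.125, so the estimated document frequency is 5 of the 40 candidate documents.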
21. A method in a computer system for estimating a total phrase frequency of a phrase, the total phrase frequency indicating a total number of occurrences of the phrase within a plurality of documents, each document having terms, each term having a term frequency for each document, the term frequency for a term indicating a number of occurrences of that term within the document, the phrase having component terms, the method comprising:
estimating a phrase probability for the phrase, the estimated phrase probability being an estimation of the probability that any phrase in documents that contain each component term of the phrase is the phrase, the estimated phrase probability being derived from the term frequencies of the component terms;
estimating an average phrase frequency for the phrase by multiplying the estimated phrase probability by an average number of terms per document; and
multiplying the estimated average phrase frequency by an estimated number of documents that contain the phrase to estimate the total phrase frequency.
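The arithmetic of claim 21 reduces to two multiplications, sketched below. The function and parameter names are hypothetical, and the phrase probability and document frequency are taken as already estimated.

```python
def estimate_total_phrase_frequency(phrase_prob, avg_terms_per_doc, est_doc_freq):
    """Estimate the total number of occurrences of a phrase in the collection.

    phrase_prob:       estimated phrase probability.
    avg_terms_per_doc: average number of terms per document.
    est_doc_freq:      estimated number of documents that contain the phrase.
    """
    # Average phrase frequency: expected occurrences per containing document.
    avg_phrase_freq = phrase_prob * avg_terms_per_doc
    # Total phrase frequency: scale by the estimated document frequency.
    return avg_phrase_freq * est_doc_freq


# Hypothetical inputs: probability 0.125, 16 terms per document, 5 documents.
total = estimate_total_phrase_frequency(0.125, 16, 5)
```

With these inputs the average phrase frequency is 2 occurrences per document and the estimated total is 10 occurrences.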
29. A method in a computer system for generating a weight for a phrase within one of a plurality of documents, each document having terms, the phrase having component terms, the method comprising:
estimating a number of the plurality of documents that contain the phrase based on term frequencies of the component terms, a term frequency of a term being a number of occurrences of that term in a document;
estimating a total number of times the phrase occurs in the plurality of documents based on the term frequencies of the component terms; and
combining the estimated number of documents that contain the phrase and the estimated total number of times that the phrase occurs in the plurality of documents to generate the weight for the phrase.
32. A method in a computer system for estimating a number of a plurality of documents that contain a phrase, each document having terms, the phrase having component terms, the method comprising:
providing an indication of a number of occurrences of each component term within each document;
providing an indication of a total number of occurrences of all terms within the plurality of documents;
calculating a probability that a document contains the phrase based on the number of occurrences of each component term within each document and the total number of occurrences of all terms within the plurality of documents; and
multiplying the calculated probability by the total number of the plurality of documents to estimate the number of documents that contain the phrase.
34. A method in a computer system for estimating a total number of occurrences of a phrase within a plurality of documents, each document having terms, the phrase having component terms, the method comprising:
providing an indication of a number of occurrences of each component term within each document;
providing an indication of a total number of occurrences of all terms within the plurality of documents;
estimating an average number of occurrences of the phrase in documents that contain the phrase based on the number of occurrences of each component term within each document and the total number of occurrences of all terms within the plurality of documents; and
multiplying the estimated average number of occurrences of the phrase by the number of the plurality of documents that contain the phrase to estimate the total number of occurrences of the phrase within the plurality of documents.
36. A computer system for calculating a document frequency of a phrase, each document having terms, each term having a term frequency for each document, the phrase having component terms, comprising:
a component that calculates a phrase probability for the phrase, the calculated phrase probability being an estimation of the probability that any phrase in documents that contain each component term of the phrase is the phrase, the calculated phrase probability being derived from the term frequencies of the component terms; and
a component that combines the calculated phrase probability with a number of documents that contain each component term to calculate the document frequency.
42. A computer system for calculating a total phrase frequency of a phrase, each document having terms, each term having a term frequency for each document, the phrase having component terms, comprising:
a component for calculating a phrase probability for the phrase, the calculated phrase probability being derived from the term frequencies of the component terms;
a component for calculating an average phrase frequency for the phrase by multiplying the calculated phrase probability by an average number of terms per document; and
a component for multiplying the calculated average phrase frequency by a calculated number of documents that contain the phrase to calculate the total phrase frequency.
50. A computer-readable medium containing instructions for causing a computer system to generate a weight for a phrase within one of a plurality of documents, each document having terms, the phrase having component terms, by:
generating a term frequency that represents the number of occurrences of that term in the plurality of documents;
estimating a document frequency for the phrase based on an estimated phrase probability of the phrase, the phrase probability being derived from term probabilities of the component terms, the term probability of a component term being a ratio of an average of the generated term frequencies for the component terms per document that contains that component term to an average number of terms per document;
estimating a total phrase frequency for the phrase based on an average phrase frequency for the phrase times the estimated document frequency for the phrase, the average phrase frequency being derived from the phrase probability of the phrase and the average number of terms per document; and
combining the estimated document frequency with the estimated total phrase frequency to generate the weight of the phrase.
60. A computer-readable medium containing instructions that cause a computer system to generate a weight for a phrase within one of a plurality of documents, each document having terms, the phrase having component terms, by:
estimating a number of the plurality of documents that contain the phrase based on term frequencies of the component terms;
estimating a total number of times the phrase occurs in the plurality of documents based on the term frequencies of the component terms; and
combining the estimated number of documents that contain the phrase and the estimated total number of times that the phrase occurs in the plurality of documents to generate the weight for the phrase.
63. A computer-readable medium containing instructions that cause a computer system to estimate a number of a plurality of documents that contain a phrase, each document having terms, the phrase having component terms, by:
calculating a probability that a document contains the phrase based on a number of occurrences of each component term within each document and a total number of occurrences of all terms within the plurality of documents; and
multiplying the calculated probability by the total number of the plurality of documents to estimate the number of documents that contain the phrase.
65. A computer-readable medium containing instructions for causing a computer system to estimate a total number of occurrences of a phrase within a plurality of documents, each document having terms, the phrase having component terms, by:
estimating an average number of occurrences of the phrase in documents that contain the phrase based on a number of occurrences of each component term within each document and a total number of occurrences of all terms within the plurality of documents; and
multiplying the estimated average number of occurrences of the phrase by the number of the plurality of documents that contain the phrase to estimate the total number of occurrences of the phrase within the plurality of documents.
Specification