Method and system for calculating phrase-document importance

US 6,549,897 B1
Filed: 12/17/1998
Issued: 04/15/2003
Est. Priority Date: 10/09/1998
Status: Expired due to Fees

First Claim

1. A method in a computer system for generating a weight for a phrase within one of a plurality of documents, each document having terms, the phrase having component terms, the method comprising:

for each term, providing a term frequency that represents the number of occurrences of that term in the plurality of documents;

estimating a document frequency for the phrase based on an estimated phrase probability of the phrase, the document frequency being the number of the plurality of the documents that contain the phrase, the estimated phrase probability being an estimation of the probability that any phrase in documents that contain each component term is the phrase, the phrase probability being derived from term probabilities of the component terms, the term probability of a component term being a ratio of an average of the provided term frequencies for the component terms per document that contains that component term to an average number of terms per document;

estimating a total phrase frequency for the phrase based on an average phrase frequency for the phrase times the estimated document frequency for the phrase, the average phrase frequency being derived from the phrase probability of the phrase and the average number of terms per document; and

combining the estimated document frequency with the estimated total phrase frequency to generate the weight of the phrase.

Abstract

A method and system for generating a weight for phrases within each document in a collection of documents. Each document has terms such as words and numbers. Each phrase comprises component terms. Each term frequency represents the number of occurrences of a term in a document, and the phrase frequency represents the number of occurrences of a phrase in a document. To generate the weight, the weighting system first estimates a document frequency for the phrase by multiplying an estimated phrase probability of the phrase times the number of documents that contain each component term. The estimated phrase probability is an estimation of the probability that any phrase in documents that contain each component term is the phrase whose weight is to be estimated. The document frequency is the number of the documents that contain the phrase. The weighting system then estimates a total phrase frequency for the phrase as the average phrase frequency for the phrase times the estimated document frequency for the phrase. The weighting system derives the average phrase frequency from the phrase probability of the phrase and average number of terms per document. The weighting system then combines the estimated document frequency with the estimated total phrase frequency to generate the weight of the phrase.
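
Read together with dependent claims 6 and 16, the chain of estimates in the abstract can be written compactly. The notation is an assumption introduced here for illustration and does not come from the patent text: $\overline{TF}_i$ is the average term frequency of component term $i$ per document that contains it, $\bar{L}$ is the average number of terms per document, and $D_t$ is the number of documents that contain each component term of phrase $t$.

```latex
\begin{align*}
\hat{P}_t &= \prod_{i \in t} \frac{\overline{TF}_i}{\bar{L}}
  && \text{(estimated phrase probability, claim 16)} \\
\hat{n}_t &= \hat{P}_t \cdot D_t
  && \text{(estimated document frequency)} \\
\overline{PF}_t &= \hat{P}_t \cdot \bar{L}
  && \text{(average phrase frequency, claim 6)} \\
\widehat{PF}_t &= \overline{PF}_t \cdot \hat{n}_t
  && \text{(estimated total phrase frequency)}
\end{align*}
```

The weight then combines $\hat{n}_t$ and $\widehat{PF}_t$, as the final sentence of the abstract describes.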

139 Citations

66 Claims

1. A method in a computer system for generating a weight for a phrase within one of a plurality of documents, each document having terms, the phrase having component terms, the method comprising:
- for each term, providing a term frequency that represents the number of occurrences of that term in the plurality of documents;
  
  estimating a document frequency for the phrase based on an estimated phrase probability of the phrase, the document frequency being the number of the plurality of the documents that contain the phrase, the estimated phrase probability being an estimation of the probability that any phrase in documents that contain each component term is the phrase, the phrase probability being derived from term probabilities of the component terms, the term probability of a component term being a ratio of an average of the provided term frequencies for the component terms per document that contains that component term to an average number of terms per document;
  
  estimating a total phrase frequency for the phrase based on an average phrase frequency for the phrase times the estimated document frequency for the phrase, the average phrase frequency being derived from the phrase probability of the phrase and the average number of terms per document; and
  
  combining the estimated document frequency with the estimated total phrase frequency to generate the weight of the phrase.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
- - 2. The method of claim 1 wherein the combining includes dividing the number of the plurality of the documents by the estimated document frequency.
  - 3. The method of claim 1 wherein the combining includes dividing the number of occurrences of the phrase within the one document by the estimated total phrase frequency.
  - 4. The method of claim 3 wherein the number of occurrences of the phrase within the one document is estimated based on the average phrase frequency.
  - 5. The method of claim 3 wherein the number of occurrences of the phrase within the one document is generated by counting the number of occurrences within the one document.
  - 6. The method of claim 1 including deriving the average phrase frequency by multiplying the estimated phrase probability by the average number of terms per document.
  - 7. The method of claim 1 wherein the combining is in accordance with the following formula:
    - $W_{tj} = \log_{\alpha}\left((\alpha - 1) + \frac{PF_{tj}}{\Gamma(PF_{t})}\right) \cdot \log_{\beta}\frac{N}{n_{t}}$
  - 8. The method of claim 7 wherein the normalizing term frequency function Γ is a square root function.
  - 9. The method of claim 7 wherein the normalizing term frequency function Γ is a logarithmic function.
  - 10. The method of claim 7 wherein the bases α and β are selected so that each factor of the formula contributes equally on average to the weight.
  - 11. The method of claim 1 wherein the combining is a logarithmic function of a phrase frequency for the document normalized by the estimated total phrase frequency divided by a logarithm of the number of the plurality of documents divided by the estimated document frequency for the phrase.
  - 12. The method of claim 1 including estimating the number of documents that contain each component term by multiplying the number of the plurality of documents by the document probability of the phrase, the document probability of the phrase being a probability that a document contains each component term.
  - 13. The method of claim 12 wherein the document probability of a phrase is a product of the document probabilities of each component term, the document probability of a component term being a probability that a document contains that component term.
  - 14. The method of claim 13 wherein the document probability of a component term is the document frequency of that term divided by the number of the plurality of documents, the document frequency of a term being the number of the plurality of the documents that contain that term.
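
As a rough illustration of the formula in claim 7, the Python sketch below evaluates the weight with Γ defaulting to a square root (claim 8). The function name, the parameter names, and the default bases α = β = 2 are assumptions chosen for the example; the claims do not fix them. The sketch also assumes that $n_t$ and $PF_t$ stand for the estimated document frequency and the estimated total phrase frequency, as claims 2 and 3 suggest.

```python
import math


def phrase_weight(pf_in_doc, est_total_pf, n_docs, est_doc_freq,
                  alpha=2.0, beta=2.0, gamma=math.sqrt):
    """Evaluate log_alpha((alpha - 1) + PF_tj / Gamma(PF_t)) * log_beta(N / n_t).

    pf_in_doc    -- PF_tj, occurrences of the phrase in the one document
    est_total_pf -- PF_t, estimated total phrase frequency (assumed meaning)
    n_docs       -- N, number of documents in the collection
    est_doc_freq -- n_t, estimated document frequency (assumed meaning)
    gamma        -- normalizing function; square root by default (claim 8)
    """
    if est_total_pf <= 0 or est_doc_freq <= 0:
        raise ValueError("estimates must be positive")
    # First factor: normalized phrase frequency on a log scale with base alpha.
    frequency_factor = math.log((alpha - 1.0) + pf_in_doc / gamma(est_total_pf), alpha)
    # Second factor: inverse document frequency on a log scale with base beta.
    rarity_factor = math.log(n_docs / est_doc_freq, beta)
    return frequency_factor * rarity_factor
```

With α = 2, a phrase that does not occur in the document gets weight 0, because the first factor reduces to the base-2 logarithm of 1. A logarithmic Γ (claim 9) would need extra care when the estimated total phrase frequency is at or below 1, where the logarithm is zero or negative.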

15. A method in a computer system for estimating a document frequency of a phrase, the document frequency indicating a number of documents of a plurality of documents that contains the phrase, each document having terms, each term having a term frequency for each document, the term frequency for a term indicating a number of occurrences of that term within the document, the phrase having component terms, the method comprising:
- estimating a phrase probability for the phrase, the estimated phrase probability being an estimation of the probability that any phrase in documents that contain each component term of the phrase is the phrase, the estimated phrase probability being derived from the term frequencies of the component terms; and
  
  multiplying the estimated phrase probability by a number of documents that contain each component term to estimate the document frequency.
- Dependent Claims (16, 17, 18, 19, 20)
- - 16. The method of claim 15 wherein the estimated phrase probability is the product of term probabilities for each component term, the term probability of a component term being the average term frequency for that component term per document that contains that term divided by the average number of terms per document.
  - 17. The method of claim 16 wherein the average number of terms per document is calculated by dividing a total of the term frequencies by the number of the plurality of documents.
  - 18. The method of claim 15 wherein the number of documents that contain each component term is estimated by multiplying the number of the plurality of documents by an estimated document probability of the phrase, the estimated document probability of the phrase being a probability that a document contains each component term of the phrase.
  - 19. The method of claim 18 wherein the estimated document probability of the phrase is a product of document probabilities for each component term.
  - 20. The method of claim 19 wherein the document probability of each component term is the document frequency of the component term divided by the number of the plurality of documents.
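
The estimate in claims 15 to 20 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name, the argument names, and the choice to pass corpus-wide totals as plain lists are all made up for the example.

```python
from math import prod


def estimate_phrase_document_frequency(total_tf, doc_freq, n_docs, total_terms):
    """Return (phrase probability, estimated document frequency) for a phrase.

    total_tf    -- corpus-wide occurrence count of each component term
    doc_freq    -- number of documents containing each component term
    n_docs      -- number of documents in the collection
    total_terms -- total of all term frequencies in the collection
    """
    # Claim 17: average number of terms per document.
    avg_terms_per_doc = total_terms / n_docs
    # Claim 16: term probability = (average TF per containing document) / average length.
    term_probs = [(tf / df) / avg_terms_per_doc for tf, df in zip(total_tf, doc_freq)]
    # Claim 16: phrase probability is the product of the term probabilities.
    phrase_prob = prod(term_probs)
    # Claims 19 and 20: document probability of the phrase is the product of df / N.
    doc_prob = prod(df / n_docs for df in doc_freq)
    # Claim 18: estimated number of documents that contain each component term.
    docs_with_all_terms = n_docs * doc_prob
    # Claim 15: estimated document frequency of the phrase.
    return phrase_prob, phrase_prob * docs_with_all_terms
```

For instance, with 100 documents, 10,000 term occurrences in total, and two component terms occurring 200 and 100 times in 50 and 20 documents respectively, the term probabilities are 0.04 and 0.05, the phrase probability is 0.002, and the estimated number of documents containing both terms is 10.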

21. A method in a computer system for estimating a total phrase frequency of a phrase, the total phrase frequency indicating a total number of occurrences of the phrase within a plurality of documents, each document having terms, each term having a term frequency for each document, the term frequency for a term indicating a number of occurrences of that term within the document, the phrase having component terms, the method comprising:
- estimating a phrase probability for the phrase, the estimated phrase probability being an estimation of the probability that any phrase in documents that contain each component term of the phrase is the phrase, the estimated phrase probability being derived from the term frequencies of the component terms;
  
  estimating an average phrase frequency for the phrase by multiplying the estimated phrase probability by an average number of terms per document; and
  
  multiplying the estimated average phrase frequency by an estimated number of documents that contain the phrase to estimate the total phrase frequency.
- Dependent Claims (22, 23, 24, 25, 26, 27, 28)
- - 22. The method of claim 21 wherein the estimated phrase probability is the product of term probabilities for each component term, the term probability of a component term being the average term frequency for that component term per document that contains that term divided by the average number of terms per document.
  - 23. The method of claim 22 wherein the average number of terms per document is calculated by dividing a total of the term frequencies by the number of the plurality of documents.
  - 24. The method of claim 21 wherein the estimated number of documents that contain the phrase is derived by multiplying the estimated phrase probability by a number of documents that contain each component term to estimate the document frequency.
  - 25. The method of claim 24 wherein the number of documents that contain each component term is estimated by multiplying the number of the plurality of documents by an estimated document probability of the phrase, the estimated document probability of the phrase being a probability that a document contains each component term of the phrase.
  - 26. The method of claim 25 wherein the estimated document probability of the phrase is a product of document probabilities for each component term.
  - 27. The method of claim 26 wherein the document probability of each component term is the document frequency of the component term divided by the number of the plurality of documents.
  - 28. The method of claim 21 wherein the average number of terms per document is derived by totaling all the term frequencies and dividing that total by the number of the plurality of documents.
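
The three steps of claim 21 reduce to two multiplications once a phrase probability and an estimated document frequency are available. The Python sketch below takes both as inputs; the function and argument names are hypothetical, and how the inputs are obtained is left to the caller.

```python
def estimate_total_phrase_frequency(phrase_prob, est_doc_freq, total_terms, n_docs):
    """Estimate the total number of occurrences of a phrase in the collection.

    phrase_prob  -- estimated phrase probability (first step of claim 21)
    est_doc_freq -- estimated number of documents that contain the phrase
    total_terms  -- total of all term frequencies in the collection
    n_docs       -- number of documents in the collection
    """
    # Claims 23 and 28: average number of terms per document.
    avg_terms_per_doc = total_terms / n_docs
    # Claim 21, second step: average phrase frequency per document.
    avg_phrase_freq = phrase_prob * avg_terms_per_doc
    # Claim 21, third step: scale by the estimated document frequency.
    return avg_phrase_freq * est_doc_freq
```

Keeping the document frequency as an input mirrors the claim structure, where claim 24 separately states how that estimate may be derived.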

29. A method in a computer system for generating a weight for a phrase within one of a plurality of documents, each document having terms, the phrase having component terms, the method comprising:
- estimating a number of the plurality of documents that contain the phrase based on term frequencies of the component terms, a term frequency of a term being a number of occurrences of that term in a document;
  
  estimating a total number of times the phrase occurs in the plurality of documents based on the term frequencies of the component terms; and
  
  combining the estimated number of documents that contain the phrase and the estimated total number of times that the phrase occurs in the plurality of documents to generate the weight for the phrase.
- Dependent Claims (30, 31)
- - 30. The method of claim 29 wherein the combining also includes combining a number of occurrences of the phrase within the one document.
  - 31. The method of claim 29 wherein the combining also includes combining the number of the plurality of documents.

32. A method in a computer system for estimating a number of a plurality of documents that contain a phrase, each document having terms, the phrase having component terms, the method comprising:
- providing an indication of a number of occurrences of each component term within each document;
  
  providing an indication of a total number of occurrences of all terms within the plurality of documents;
  
  calculating a probability that a document contains the phrase based on the number of occurrences of each component term within each document and the total number of occurrences of all terms within the plurality of documents; and
  
  multiplying the calculated probability by the total number of the plurality of documents to estimate the number of documents that contain the phrase.
- Dependent Claims (33)
- - 33. The method of claim 32 wherein the probability that a document contains the phrase is calculated as the product, over the component terms, of the ratio of the number of documents that contain that component term to the number of the plurality of documents.
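
The counts that claim 32 takes as given can be gathered in one pass over tokenized documents. The Python sketch below is an assumption-laden illustration: it treats each document as a list of term strings, uses hypothetical function names, and applies the product of claim 33 literally, which uses only the per-term document counts and not the total number of term occurrences.

```python
from collections import Counter
from math import prod


def corpus_statistics(documents):
    """Collect per-document term counts, the total term count, and document frequencies."""
    # Number of occurrences of each term within each document.
    per_doc_counts = [Counter(doc) for doc in documents]
    # Total number of occurrences of all terms within the collection.
    total_terms = sum(len(doc) for doc in documents)
    # Number of documents that contain each term (each document counted once per term).
    doc_freq = Counter()
    for counts in per_doc_counts:
        doc_freq.update(counts.keys())
    return per_doc_counts, total_terms, doc_freq


def estimate_docs_containing(component_terms, doc_freq, n_docs):
    """Product of per-term ratios df / N (claim 33), scaled by N (claim 32)."""
    probability = prod(doc_freq[term] / n_docs for term in component_terms)
    return probability * n_docs
```

The product treats the component terms as occurring independently across documents, so the result is an estimate and need not be a whole number.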

34. A method in a computer system for estimating a total number of occurrences of a phrase within a plurality of documents, each document having terms, the phrase having component terms, the method comprising:
- providing an indication of a number of occurrences of each component term within each document;
  
  providing an indication of a total number of occurrences of all terms within the plurality of documents;
  
  estimating an average number of occurrences of the phrase in documents that contain the phrase based on the number of occurrences of each component term within each document and the total number of occurrences of all terms within the plurality of documents; and
  
  multiplying the estimated average number of occurrences of the phrase by the number of the plurality of documents that contain the phrase to estimate the total number of occurrences of the phrase within the plurality of documents.
- Dependent Claims (35)
- - 35. The method of claim 34 wherein the estimating of an average number of occurrences of the phrase includes calculating a probability that any phrase within the plurality of documents is the phrase and multiplying the calculated probability by an average number of occurrences of terms within a document.

36. A computer system for calculating a document frequency of a phrase, each document having terms, each term having a term frequency for each document, the phrase having component terms, comprising:
- a component that calculates a phrase probability for the phrase, the calculated phrase probability being an estimation of the probability that any phrase in documents that contain each component term of the phrase is the phrase, the calculated phrase probability being derived from the term frequencies of the component terms; and
  
  a component that combines the calculated phrase probability with a number of documents that contain each component term to calculate the document frequency.
- Dependent Claims (37, 38, 39, 40, 41)
- - 37. The system of claim 36 wherein the calculated phrase probability is the product of term probabilities for each component term, the term probability of a component term being the average term frequency for that component term per document that contains that term divided by the average number of terms per document.
  - 38. The system of claim 37 wherein the average number of terms per document is calculated by dividing a total of the term frequencies by the number of the plurality of documents.
  - 39. The system of claim 36 wherein the number of documents that contain each component term is calculated by multiplying the number of the plurality of documents by the document probability of the phrase, the document probability of the phrase being a probability that a document contains each component term of the phrase.
  - 40. The system of claim 39 wherein the document probability of the phrase is a product of document probabilities for each component term.
  - 41. The system of claim 40 wherein the document probability of each component term is the document frequency of the component term divided by the number of the plurality of documents.

42. A computer system for calculating a total phrase frequency of a phrase, each document having terms, each term having a term frequency for each document, the phrase having component terms, comprising:
- a component for calculating a phrase probability for the phrase, the calculated phrase probability being derived from the term frequencies of the component terms;
  
  a component for calculating an average phrase frequency for the phrase by multiplying the calculated phrase probability by an average number of terms per document; and
  
  a component for multiplying the calculated average phrase frequency by a calculated number of documents that contain the phrase to calculate the total phrase frequency.
- Dependent Claims (43, 44, 45, 46, 47, 48, 49)
- - 43. The system of claim 42 wherein the calculated phrase probability is the product of term probabilities for each component term, the term probability of a component term being the average term frequency for that component term per document that contains that term divided by the average number of terms per document.
  - 44. The system of claim 43 wherein the average number of terms per document is calculated by dividing a total of the term frequencies by the number of the plurality of documents.
  - 45. The system of claim 42 wherein the calculated number of documents that contain the phrase is derived by multiplying the calculated phrase probability by a number of documents that contain each component term to calculate the document frequency.
  - 46. The system of claim 45 wherein the number of documents that contain each component term is calculated by multiplying the number of the plurality of documents by a calculated document probability of the phrase, the calculated document probability of the phrase being a probability that a document contains each component term of the phrase.
  - 47. The system of claim 46 wherein the calculated document probability of the phrase is a product of document probabilities for each component term.
  - 48. The system of claim 47 wherein the document probability of each component term is the document frequency of the component term divided by the number of the plurality of documents.
  - 49. The system of claim 42 wherein the average number of terms per document is derived by totaling all the term frequencies and dividing that total by the number of the plurality of documents.

50. A computer-readable medium containing instructions for causing a computer system to generate a weight for a phrase within one of a plurality of documents, each document having terms, the phrase having component terms, by:
- generating a term frequency that represents the number of occurrences of that term in the plurality of documents;
  
  estimating a document frequency for the phrase based on an estimated phrase probability of the phrase, the phrase probability being derived from term probabilities of the component terms, the term probability of a component term being a ratio of an average of the generated term frequencies for the component terms per document that contains that component term to an average number of terms per document;
  
  estimating a total phrase frequency for the phrase based on an average phrase frequency for the phrase times the estimated document frequency for the phrase, the average phrase frequency being derived from the phrase probability of the phrase and the average number of terms per document; and
  
  combining the estimated document frequency with the estimated total phrase frequency to generate the weight of the phrase.
- Dependent Claims (51, 52, 53, 54, 55, 56, 57, 58, 59)
- - 51. The computer-readable medium of claim 50 wherein the combining includes dividing the number of the plurality of the documents by the estimated document frequency.
  - 52. The computer-readable medium of claim 50 wherein the combining includes dividing the number of occurrences of the phrase within the one document by the estimated total phrase frequency.
  - 53. The computer-readable medium of claim 52 wherein the number of occurrences of the phrase within the one document is estimated based on the average phrase frequency.
  - 54. The computer-readable medium of claim 52 wherein the number of occurrences of the phrase within the one document is generated by counting the number of occurrences within the one document.
  - 55. The computer-readable medium of claim 50 including deriving the average phrase frequency by multiplying the estimated phrase probability by the average number of terms per document.
  - 56. The computer-readable medium of claim 50 wherein the combining is a logarithmic function of a phrase frequency for the document normalized by the estimated total phrase frequency divided by a logarithm of the number of the plurality of documents divided by the estimated document frequency for the phrase.
  - 57. The computer-readable medium of claim 50 including estimating the number of documents that contain each component term by multiplying the number of the plurality of documents by the document probability of the phrase, the document probability of the phrase being a probability that a document contains each component term.
  - 58. The computer-readable medium of claim 57 wherein the document probability of a phrase is a product of the document probabilities of each component term, the document probability of a component term being a probability that a document contains that component term.
  - 59. The computer-readable medium of claim 58 wherein the document probability of a component term is the document frequency of that term divided by the number of the plurality of documents, the document frequency of a term being the number of the plurality of the documents that contain that term.

60. A computer-readable medium containing instructions that cause a computer system to generate a weight for a phrase within one of a plurality of documents, each document having terms, the phrase having component terms, by:
- estimating a number of the plurality of documents that contain the phrase based on term frequencies of the component terms;
  
  estimating a total number of times the phrase occurs in the plurality of documents based on the term frequencies of the component terms; and
  
  combining the estimated number of documents that contain the phrase and the estimated total number of times that the phrase occurs in the plurality of documents to generate the weight for the phrase.
- Dependent Claims (61, 62)
- - 61. The computer-readable medium of claim 60 wherein the combining also includes combining a number of occurrences of the phrase within the one document.
  - 62. The computer-readable medium of claim 60 wherein the combining also includes combining the number of the plurality of documents.

63. A computer-readable medium containing instructions that cause a computer system to estimate a number of a plurality of documents that contain a phrase, each document having terms, the phrase having component terms, by:
- calculating a probability that a document contains the phrase based on a number of occurrences of each component term within each document and a total number of occurrences of all terms within the plurality of documents; and
  
  multiplying the calculated probability by the total number of the plurality of documents to estimate the number of documents that contain the phrase.
- Dependent Claims (64)
- - 64. The computer-readable medium of claim 63 wherein the probability that a document contains the phrase is calculated as the product, over the component terms, of the ratio of the number of documents that contain that component term to the number of the plurality of documents.

65. A computer-readable medium containing instructions for causing a computer system to estimate a total number of occurrences of a phrase within a plurality of documents, each document having terms, the phrase having component terms, by:
- estimating an average number of occurrences of the phrase in documents that contain the phrase based on a number of occurrences of each component term within each document and a total number of occurrences of all terms within the plurality of documents; and
  
  multiplying the estimated average number of occurrences of the phrase by the number of the plurality of documents that contain the phrase to estimate the total number of occurrences of the phrase within the plurality of documents.
- Dependent Claims (66)
- - 66. The computer-readable medium of claim 65 wherein the estimating of an average number of occurrences of the phrase includes calculating a probability that any phrase within the plurality of documents is the phrase and multiplying the calculated probability by an average number of occurrences of terms within a document.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Jones, William P.; Katariya, Sanjeev
Primary Examiner(s): Feild, Joseph H.
Assistant Examiner(s): DESAI, RACHNA SINGH
Application Number: US09/215,513
Time in Patent Office: 1,580 Days
Field of Search: 707/500, 707/3, 707/5, 707/6
US Class Current: 1/1
CPC Class Codes:
G06F 16/313   Selection or weighting of t...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
