Domain Dictionary Creation
First Claim
1. A computer-implemented method, comprising:
- determining a topic divergence value, the topic divergence value substantially proportional to a ratio of a first topic word distribution in a topic document corpus to a second topic word distribution in a document corpus, wherein the topic document corpus is a corpus of topic documents related to a topic, and the document corpus is a corpus of documents that includes the topic documents and other documents;
- determining a candidate topic word divergence value for a candidate topic word, the candidate topic word divergence value substantially proportional to a ratio of a first distribution of the candidate topic word in the topic document corpus to a second distribution of the candidate topic word in the document corpus; and
- determining whether the candidate topic word is a new topic word based on the candidate topic word divergence value and the topic divergence value.
Abstract
Methods, systems, and apparatus, including computer program products, to identify topic words in a document corpus that includes topic documents related to a topic are disclosed. A reference topic word divergence value based on the document corpus and the topic document corpus is determined. A candidate topic word divergence value for a candidate topic word is determined based on the document corpus and the topic document corpus. The candidate topic word is determined to be a topic word if the candidate topic word divergence value is greater than the reference topic word divergence value.
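The flow described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the text does not state: word distributions are taken to be relative frequencies over whitespace-separated tokens, the divergence value is taken to be the plain ratio of those frequencies, and the reference value is taken to be the mean divergence of the words already in the topic dictionary. The function names and the sample corpora are hypothetical.

```python
from collections import Counter


def word_probability(word, documents):
    """Relative frequency of `word` over all tokens in `documents` (assumed estimator)."""
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


def divergence(word, topic_docs, all_docs):
    """Ratio of the word's frequency in the topic corpus to its frequency in the full corpus."""
    p_topic = word_probability(word, topic_docs)
    p_all = word_probability(word, all_docs)
    return p_topic / p_all if p_all > 0 else 0.0


def reference_divergence(topic_words, topic_docs, all_docs):
    """Mean divergence of known topic words (one possible choice of reference value)."""
    if not topic_words:
        raise ValueError("at least one known topic word is required")
    values = [divergence(w, topic_docs, all_docs) for w in topic_words]
    return sum(values) / len(values)


def is_topic_word(candidate, topic_words, topic_docs, all_docs):
    """A candidate qualifies if its divergence is greater than the reference divergence."""
    reference = reference_divergence(topic_words, topic_docs, all_docs)
    return divergence(candidate, topic_docs, all_docs) > reference


# Hypothetical data: the document corpus includes the topic documents and other documents.
topic_docs = ["goal striker goal keeper", "striker penalty goal"]
other_docs = ["stock market price goal", "market price stock index"]
all_docs = topic_docs + other_docs

# "penalty": (1/7) / (1/15) = 15/7, which exceeds the reference 45/28 for "goal".
penalty_is_topic = is_topic_word("penalty", ["goal"], topic_docs, all_docs)
# "market" never occurs in the topic documents, so its divergence is 0.
market_is_topic = is_topic_word("market", ["goal"], topic_docs, all_docs)
```

In this sketch a word that is concentrated in the topic documents scores above the reference, while a word spread across the whole corpus, or absent from the topic documents, does not.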
243 Citations
24 Claims
1. A computer-implemented method, comprising:
- determining a topic divergence value, the topic divergence value substantially proportional to a ratio of a first topic word distribution in a topic document corpus to a second topic word distribution in a document corpus, wherein the topic document corpus is a corpus of topic documents related to a topic, and the document corpus is a corpus of documents that includes the topic documents and other documents;
- determining a candidate topic word divergence value for a candidate topic word, the candidate topic word divergence value substantially proportional to a ratio of a first distribution of the candidate topic word in the topic document corpus to a second distribution of the candidate topic word in the document corpus; and
- determining whether the candidate topic word is a new topic word based on the candidate topic word divergence value and the topic divergence value.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computer-implemented method, comprising:
- selecting a topic dictionary comprising topic words related to a topic;
- determining a topic word divergence value based on a topic word, a document corpus and a topic document corpus, wherein the topic document corpus is a corpus of topic documents related to a topic, and the document corpus is a corpus of documents that includes the topic documents and other documents, and the topic word is a word that is related to the topic;
- determining a candidate topic word divergence value for a candidate topic word based on the document corpus and the topic document corpus; and
- determining whether the candidate topic word is a new topic word based on the candidate topic word divergence value and the topic word divergence value.
Dependent claims: 13, 14, 15, 16
17. An apparatus comprising software stored in a computer readable medium, the software comprising computer readable instructions executable by a computer processing device and that upon such execution cause the computer processing device to:
- determine a topic word divergence value based on a topic word, a document corpus and a topic document corpus, wherein the topic document corpus is a corpus of topic documents related to a topic, and the document corpus is a corpus of documents that includes the topic documents and other documents, and the topic word is a word in a topic dictionary that is related to the topic;
- determine a candidate topic word divergence value for a candidate topic word based on the document corpus and the topic document corpus;
- determine whether the candidate topic word is a topic word based on the candidate topic word divergence value and the topic word divergence value; and
- store the candidate topic word in a topic dictionary if the candidate topic word is determined to be a topic word.
18. A system, comprising:
- a data store storing a topic dictionary comprising topic words related to a topic;
- a topic word processing module configured to:
  - determine a topic word divergence value based on a topic word, a document corpus and a topic document corpus, wherein the topic document corpus is a corpus of topic documents related to a topic, the document corpus is a corpus of documents that includes the topic documents and other documents, and the topic word is a word in a topic dictionary that is related to the topic;
  - select a candidate topic word;
  - determine a candidate topic word divergence value for the candidate topic word based on the document corpus and the topic document corpus; and
  - determine whether the candidate topic word is a topic word based on the candidate topic word divergence value and the topic word divergence value; and
- a dictionary updater module configured to store the candidate topic word in the topic dictionary if the candidate topic word is determined to be a topic word.
Dependent claims: 19
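The component split recited in claim 18 (data store, topic word processing module, dictionary updater module) might be organized along the following lines. This is a sketch under assumptions, not the patented implementation: the class names `TopicWordProcessor` and `DictionaryUpdater` are hypothetical, an in-memory `set` stands in for the data store, and the divergence value is assumed to be a plain ratio of relative word frequencies.

```python
from collections import Counter


class TopicWordProcessor:
    """Hypothetical stand-in for the topic word processing module."""

    def __init__(self, topic_docs, other_docs):
        self.topic_counts = Counter(t for d in topic_docs for t in d.lower().split())
        # The document corpus includes the topic documents and the other documents.
        self.corpus_counts = self.topic_counts + Counter(
            t for d in other_docs for t in d.lower().split()
        )

    def divergence(self, word):
        """Assumed divergence: frequency in topic corpus over frequency in full corpus."""
        topic_total = sum(self.topic_counts.values())
        corpus_total = sum(self.corpus_counts.values())
        if topic_total == 0 or self.corpus_counts[word] == 0:
            return 0.0
        p_topic = self.topic_counts[word] / topic_total
        p_corpus = self.corpus_counts[word] / corpus_total
        return p_topic / p_corpus

    def is_topic_word(self, candidate, topic_word):
        """Compare the candidate's divergence with that of a known topic word."""
        return self.divergence(candidate) > self.divergence(topic_word)


class DictionaryUpdater:
    """Hypothetical stand-in for the dictionary updater module."""

    def __init__(self, topic_dictionary):
        self.topic_dictionary = topic_dictionary  # a set acting as the data store

    def store(self, word):
        self.topic_dictionary.add(word)


# Hypothetical data and wiring of the three components.
topic_dictionary = {"goal"}
processor = TopicWordProcessor(
    topic_docs=["goal striker goal keeper", "striker penalty goal"],
    other_docs=["stock market price goal", "market price stock index"],
)
updater = DictionaryUpdater(topic_dictionary)

for candidate in ["penalty", "market"]:
    if processor.is_topic_word(candidate, topic_word="goal"):
        updater.store(candidate)
```

Keeping the comparison logic and the dictionary write in separate classes mirrors the module boundaries in the claim; a real system would likely back the dictionary with persistent storage.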
20. A method, comprising:
- determining a divergence threshold for a topic document corpus, the divergence threshold proportional to the ratio of a first topic word probability for a topic word in the topic document corpus to a second topic word probability for the topic word in the document corpus, wherein the topic document corpus is a corpus of topic documents related to a topic, the topic word is a word in a topic dictionary related to the topic, and the document corpus is a corpus of documents that includes the topic documents and other documents;
- determining a candidate word divergence value for a candidate word, the candidate word divergence value proportional to the ratio of a first candidate word probability for the candidate word with reference to the topic document corpus to a second candidate word probability for the candidate word with reference to the document corpus; and
- determining that the candidate word is a topic word for the topic if the candidate word divergence value exceeds the divergence threshold.
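The test recited in claim 20 can be written compactly. The notation is an assumption introduced here for illustration: $P_T$ and $P_C$ denote word probabilities in the topic document corpus and the document corpus, $w_t$ is a known topic word, $c$ is the candidate word, and $k > 0$ is an unspecified proportionality constant taken to be the same for both values.

```latex
\theta = k \,\frac{P_T(w_t)}{P_C(w_t)}, \qquad
D(c) = k \,\frac{P_T(c)}{P_C(c)}, \qquad
c \text{ is a topic word if } D(c) > \theta .

% Worked example with hypothetical probabilities and k = 1:
% \theta = 0.02 / 0.005 = 4, \quad D(c) = 0.01 / 0.002 = 5, \quad 5 > 4,
% so the candidate word is determined to be a topic word.
```

Because the same constant $k$ appears on both sides, the comparison reduces to comparing the two probability ratios directly.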
21. A system, comprising:
- means for determining a topic divergence value, the topic divergence value substantially proportional to a ratio of a first topic word distribution in a topic document corpus to a second topic word distribution in a document corpus, wherein the topic document corpus is a corpus of topic documents related to a topic, and the document corpus is a corpus of documents that includes the topic documents and other documents;
- means for determining a candidate topic word divergence value for a candidate topic word, the candidate topic word divergence value substantially proportional to a ratio of a first distribution of the candidate topic word in the topic document corpus to a second distribution of the candidate topic word in the document corpus; and
- means for determining whether the candidate topic word is a new topic word based on the candidate topic word divergence value and the topic divergence value.
22. A system, comprising:
- means for selecting a topic dictionary comprising topic words related to a topic;
- means for determining a topic word divergence value based on a topic word, a document corpus and a topic document corpus, wherein the topic document corpus is a corpus of topic documents related to a topic, and the document corpus is a corpus of documents that includes the topic documents and other documents, and the topic word is a word that is related to the topic;
- means for determining a candidate topic word divergence value for a candidate topic word based on the document corpus and the topic document corpus; and
- means for determining whether the candidate topic word is a new topic word based on the candidate topic word divergence value and the topic word divergence value.
23. A computer processing device comprising:
- means for determining a topic word divergence value based on a topic word, a document corpus and a topic document corpus, wherein the topic document corpus is a corpus of topic documents related to a topic, and the document corpus is a corpus of documents that includes the topic documents and other documents, and the topic word is a word in a topic dictionary that is related to the topic;
- means for determining a candidate topic word divergence value for a candidate topic word based on the document corpus and the topic document corpus;
- means for determining whether the candidate topic word is a topic word based on the candidate topic word divergence value and the topic word divergence value; and
- means for storing the candidate topic word in a topic dictionary if the candidate topic word is determined to be a topic word.
24. A system, comprising:
- means for determining a divergence threshold for a topic document corpus, the divergence threshold proportional to the ratio of a first topic word probability for a topic word in the topic document corpus to a second topic word probability for the topic word in the document corpus, wherein the topic document corpus is a corpus of topic documents related to a topic, the topic word is a word in a topic dictionary related to the topic, and the document corpus is a corpus of documents that includes the topic documents and other documents;
- means for determining a candidate word divergence value for a candidate word, the candidate word divergence value proportional to the ratio of a first candidate word probability for the candidate word with reference to the topic document corpus to a second candidate word probability for the candidate word with reference to the document corpus; and
- means for determining that the candidate word is a topic word for the topic if the candidate word divergence value exceeds the divergence threshold.
Specification