Domain dictionary creation by detection of new topic words using divergence value comparison
Abstract
Methods, systems, and apparatus, including computer program products, to identify topic words in a collection of documents that includes topic documents related to a topic are disclosed. A reference topic word divergence value based on a document collection and the topic document collection is determined. A candidate topic word divergence value for a candidate topic word is determined based on the document collection and the topic document collection. The candidate topic word is determined to be a topic word if the candidate topic word divergence value is greater than the reference topic word divergence value.
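The comparison described in the abstract can be sketched as follows. This is only an illustrative sketch, not the patented implementation: it assumes that a word's divergence value is the plain ratio of its relative frequency in the topic documents to its relative frequency in the full document collection, and that a single dictionary word supplies the reference value. The function names, whitespace tokenization, and toy documents are all hypothetical.

```python
from collections import Counter


def word_probability(word, documents):
    """Relative frequency of `word` among all tokens in `documents`."""
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


def divergence(word, topic_docs, all_docs):
    """Assumed divergence value: ratio of topic frequency to collection frequency."""
    p_all = word_probability(word, all_docs)
    if p_all == 0.0:
        return 0.0  # word never occurs in the collection
    return word_probability(word, topic_docs) / p_all


def is_topic_word(candidate, reference_word, topic_docs, all_docs):
    """True if the candidate diverges more than the reference topic word."""
    return (divergence(candidate, topic_docs, all_docs)
            > divergence(reference_word, topic_docs, all_docs))


# Hypothetical toy data: two topic documents plus two unrelated documents.
topic_docs = ["the striker scored a goal", "the goal came from the striker"]
other_docs = ["the market fell and the index fell", "our goal is the profit"]
all_docs = topic_docs + other_docs

print(is_topic_word("striker", "goal", topic_docs, all_docs))  # True
print(is_topic_word("the", "goal", topic_docs, all_docs))      # False
```

Here "goal" plays the role of an existing dictionary word. A word concentrated in the topic documents ("striker") exceeds its divergence value, while a word spread across the whole collection ("the") does not.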
25 Claims
1. A computer-implemented method, comprising:
determining a topic divergence value, the topic divergence value proportional to a ratio of a first topic word distribution in a first collection of documents to a second topic word distribution in a second collection of documents, wherein the first collection of documents is a collection of topic documents related to a particular topic, and the second collection of documents is a collection of documents that includes other documents related to other topics;
determining a candidate topic word divergence value for a candidate topic word, the candidate topic word divergence value proportional to a ratio of a first distribution of the candidate topic word in the first collection of documents to a second distribution of the candidate topic word in the second collection of documents, wherein the candidate topic word is a candidate for being identified as a new topic word for the particular topic and is not a word in a topic dictionary for the particular topic; and
determining whether the candidate topic word is a new topic word for the particular topic based on the candidate topic word divergence value and the topic divergence value.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A computer-implemented method, comprising:
selecting a topic dictionary comprising topic words related to a particular topic;
determining a topic word divergence value based on a topic word, a second collection of documents and a first collection of documents, wherein the first collection of documents is a collection of topic documents related to the particular topic, and the second collection of documents is a collection of documents that includes documents related to a plurality of topics, and the topic word is a word that is related to the topic;
determining a candidate topic word divergence value for a candidate topic word based on the second collection of documents and the first collection of documents, wherein the candidate topic word is a candidate for being identified as a new topic word for the particular topic and is not a word in the topic dictionary for the particular topic; and
determining whether the candidate topic word is a new topic word for the particular topic based on the candidate topic word divergence value and the topic word divergence value.
Dependent claims: 14, 15, 16, 17
18. An apparatus comprising software stored in a non-transitory computer readable medium, the software comprising computer readable instructions executable by a computer processing device and that upon such execution cause the computer processing device to:
determine a topic word divergence value based on a topic word, a second collection of documents and a first collection of documents, wherein the first collection of documents is a collection of topic documents related to a particular topic, and the second collection of documents is a collection of documents that includes the topic documents and other documents that are related to a plurality of topics, and the topic word is a word that is in a topic dictionary that is related to the particular topic;
determine a candidate topic word divergence value for a candidate topic word based on the second collection of documents and the first collection of documents, wherein the candidate topic word is a candidate for being identified as a new topic word for the particular topic and is not a word in the topic dictionary for the particular topic;
determine whether the candidate topic word is a topic word for the particular topic based on the candidate topic word divergence value and the topic word divergence value; and
store the candidate topic word in the topic dictionary if the candidate topic word is determined to be a topic word.
19. A system, comprising:
a data store storing a topic dictionary comprising topic words related to a topic;
a topic word processing module configured to:
determine a topic word divergence value based on a topic word, a second collection of documents and a first collection of documents, wherein the first collection of documents is a collection of topic documents related to a topic, the second collection of documents is a collection of documents that includes documents related to a plurality of topics, and the topic word is a word in a topic dictionary that is related to the topic;
select a candidate topic word that is not a word in the topic dictionary;
determine a candidate topic word divergence value for the candidate topic word based on the second collection of documents and the first collection of documents; and
determine whether the candidate topic word is a topic word for the particular topic based on the candidate topic word divergence value and the topic word divergence value; and
a dictionary updater module configured to store the candidate topic word in the topic dictionary if the candidate topic word is determined to be a topic word.
Dependent claims: 20
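The three components named in claim 19 (data store, topic word processing module, dictionary updater module) could be arranged roughly as below. This is a minimal sketch under stated assumptions: the class names, the in-memory set used as the data store, whitespace tokenization, and the use of a plain frequency ratio as the divergence value are all hypothetical choices made for illustration and are not taken from the patent.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TopicDictionaryStore:
    """Hypothetical stand-in for the data store holding the topic dictionary."""
    words: set = field(default_factory=set)


class TopicWordProcessor:
    """Hypothetical stand-in for the topic word processing module."""

    def __init__(self, topic_docs, all_docs):
        self.topic_counts = self._count(topic_docs)
        self.all_counts = self._count(all_docs)
        self.topic_total = sum(self.topic_counts.values())
        self.all_total = sum(self.all_counts.values())

    @staticmethod
    def _count(docs):
        return Counter(tok for doc in docs for tok in doc.lower().split())

    def divergence(self, word):
        # Assumed divergence: topic-collection frequency over full-collection frequency.
        if not self.topic_total or not self.all_counts[word]:
            return 0.0
        p_topic = self.topic_counts[word] / self.topic_total
        p_all = self.all_counts[word] / self.all_total
        return p_topic / p_all

    def candidates(self, store):
        # Candidate words occur in the topic documents but are not in the dictionary.
        return sorted(set(self.topic_counts) - store.words)

    def is_topic_word(self, candidate, topic_word):
        return self.divergence(candidate) > self.divergence(topic_word)


class DictionaryUpdater:
    """Hypothetical stand-in for the dictionary updater module."""

    def __init__(self, store):
        self.store = store

    def update(self, candidate, accepted):
        # Store the candidate only if it was determined to be a topic word.
        if accepted:
            self.store.words.add(candidate)


# Toy usage with made-up documents; "goal" is the existing dictionary word.
topic_docs = ["striker goal penalty", "goal striker keeper"]
all_docs = topic_docs + ["market goal profit", "keeper market index"]

store = TopicDictionaryStore({"goal"})
processor = TopicWordProcessor(topic_docs, all_docs)
updater = DictionaryUpdater(store)

for candidate in processor.candidates(store):
    updater.update(candidate, processor.is_topic_word(candidate, "goal"))

print(sorted(store.words))  # ['goal', 'penalty', 'striker']
```

In this toy run "keeper" is rejected because it appears as often outside the topic documents as inside them, so its ratio falls below that of the dictionary word "goal".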
21. A method, comprising:
determining a divergence threshold for a first collection of documents, the divergence threshold proportional to the ratio of a first topic word probability for a topic word in the first collection of documents to a second topic word probability for the topic word in the second collection of documents, wherein the first collection of documents is a first collection of topic documents related to a topic, the topic word is a word in a topic dictionary related to the topic, and the second collection of documents is a collection of documents related to a plurality of topics;
determining a candidate word divergence value for a candidate word that is not a word in the topic dictionary, the candidate word divergence value proportional to the ratio of a first candidate word probability for the candidate word with reference to the first collection of documents to a second candidate word probability for the candidate word with reference to the second collection of documents; and
determining that the candidate word is a topic word for the topic if the candidate word divergence value exceeds the divergence threshold.
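The threshold test of claim 21 can be written out with a small worked example. The symbols below are introduced here for illustration only and do not appear in the claims: $P_1$ and $P_2$ denote word probabilities in the first and second collections, $t$ is a dictionary topic word, $c$ is the candidate word, and $k > 0$ is an assumed common proportionality constant. The probability values are hypothetical.

```latex
\begin{align*}
D_{\mathrm{thr}} &= k \, \frac{P_1(t)}{P_2(t)}, &
D(c) &= k \, \frac{P_1(c)}{P_2(c)}, &
c \text{ is a topic word if } D(c) &> D_{\mathrm{thr}}. \\[4pt]
\intertext{With hypothetical values $P_1(t) = 0.020$, $P_2(t) = 0.004$, $P_1(c) = 0.012$, $P_2(c) = 0.002$:}
D_{\mathrm{thr}} &= k \cdot \frac{0.020}{0.004} = 5k, &
D(c) &= k \cdot \frac{0.012}{0.002} = 6k, &
6k &> 5k \quad (k > 0).
\end{align*}
```

Under these assumed numbers the candidate word exceeds the divergence threshold and would be treated as a topic word, even though its absolute probability in the topic documents is lower than that of the dictionary word.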
22. A system, comprising:
means for determining a topic divergence value, the topic divergence value proportional to a ratio of a first topic word distribution in a first collection of documents to a second topic word distribution in a second collection of documents, wherein the first collection of documents is a collection of topic documents related to a particular topic, and the second collection of documents is a collection of documents that includes documents related to a plurality of topics;
means for determining a candidate topic word divergence value for a candidate topic word, the candidate topic word divergence value proportional to a ratio of a first distribution of the candidate topic word in the first collection of documents to a second distribution of the candidate topic word in the second collection of documents, wherein the candidate topic word is a candidate for being identified as a new topic word for the particular topic and is not a word in a topic dictionary for the particular topic; and
means for determining whether the candidate topic word is a new topic word for the topic based on the candidate topic word divergence value and the topic divergence value.
23. A system, comprising:
means for selecting a topic dictionary comprising topic words related to a topic;
means for determining a topic word divergence value based on a topic word, a second collection of documents and a first collection of documents, wherein the first collection of documents is a collection of topic documents related to a particular topic, and the second collection of documents is a collection of documents that includes documents related to a plurality of topics, and the topic word is a word that is in the topic dictionary;
means for determining a candidate topic word divergence value for a candidate topic word based on the second collection of documents and the first collection of documents, wherein the candidate topic word is a candidate for being identified as a new topic word for the particular topic and is not a word in a topic dictionary for the topic; and
means for determining whether the candidate topic word is a new topic word for the topic based on the candidate topic word divergence value and the topic word divergence value.
24. A computer processing device comprising:
means for determining a topic word divergence value based on a topic word, a second collection of documents and a first collection of documents, wherein the first collection of documents is a collection of topic documents related to a topic, and the second collection of documents is a collection of documents that includes documents related to a plurality of topics, and the topic word is a word that is in a topic dictionary that is related to the topic;
means for determining a candidate topic word divergence value for a candidate topic word based on the second collection of documents and the first collection of documents, wherein the candidate topic word is not a word in the topic dictionary;
means for determining whether the candidate topic word is a topic word based on the candidate topic word divergence value and the topic word divergence value; and
means for storing the candidate topic word in the topic dictionary if the candidate topic word is determined to be a topic word.
25. A system, comprising:
means for determining a divergence threshold for a first collection of documents, the divergence threshold proportional to the ratio of a first topic word probability for a topic word in the first collection of documents to a second topic word probability for the topic word in the second collection of documents, wherein the first collection of documents is a collection of topic documents related to a topic, the topic word is a word in a topic dictionary related to the topic, and the second collection of documents is a collection of documents that includes documents related to a plurality of topics;
means for determining a candidate word divergence value for a candidate word that is a candidate for being identified as a new topic word for the particular topic and that is not a word in the topic dictionary for the topic, the candidate word divergence value proportional to the ratio of a first candidate word probability for the candidate word with reference to the first collection of documents to a second candidate word probability for the candidate word with reference to the second collection of documents; and
means for determining that the candidate word is a topic word for the topic if the candidate word divergence value exceeds the divergence threshold.
Specification