Method for learning to infer the topical content of documents based upon their lexical content
First Claim
1. An unsupervised method of learning relationships between words and unspecified topics using a processing unit coupled to a memory, the memory storing in machine readable form a first multiplicity of documents including a second multiplicity of words and a third multiplicity of unspecified topics, the memory also storing in machine readable form a lexicon including a fourth multiplicity of words, the fourth multiplicity of words being less than the second multiplicity of words, the method comprising the processing unit implemented steps of:
a) generating an observed feature vector for each document of the first multiplicity of documents, each observed feature vector indicating which words of the lexicon are present in an associated document;
b) initializing a fifth multiplicity of association strength values to initial values, each association strength value indicating a relationship between a word-cluster word pair, each word-cluster word pair including a word of the lexicon and a word-cluster of a sixth multiplicity of word-clusters, each word-cluster being representative of a one of the third multiplicity of unspecified topics;
c) for each document in the first multiplicity of documents;
1) predicting a predicted topical content of the document using the fifth multiplicity of association strength values and the observed feature vector for the document, the predicted topical content being represented by a topic belief vector;
2) predicting which words of the lexicon appear in the document using the topic belief vector and the fifth multiplicity of association strength values, which words of lexicon predicted to appear in the document being represented via a predicted feature vector;
3) determining whether the topic belief vector permits adequate prediction of which words of the lexicon appear in the document by calculating a document cost using the observed feature vector and the predicted feature vector;
4) if the topic belief vector did not permit adequate prediction of which words of the lexicon appear in the document;
A) determining how to modify the topic belief vector and modifying the topic belief vector accordingly;
B) repeating steps c2) through c4) until the topic belief vector permits adequate prediction of which words of the lexicon appear in the document;
d) determining whether the fifth multiplicity of association strength values permit adequate prediction of which words of the lexicon appear in the first multiplicity of documents using a total cost generated by summing together the document cost for each document of the first multiplicity of documents;
e) if the fifth multiplicity of association strength values do not permit adequate prediction of which words of the lexicon appear in all documents of the first multiplicity documents;
1) determining how to improve the prediction of which words in the lexicon appear in the first multiplicity of documents by determining how the values of the fifth multiplicity of association strength values should be modified;
2) adjusting the values of the fifth multiplicity of association strength values as determined in step e1);
3) repeating steps c) through e) until the fifth multiplicity of association strength values permit adequate prediction of which words of the lexicon appear in the first multiplicity of documents;
f) storing in the memory the association strength values that permit adequate prediction of which words of the lexicon appear in the first multiplicity of documents; and
g) predicting a topical content included in a selected document using the fifth multiplicity of association strength values and the sixth multiplicity of word-clusters, the selected document being presented to the processing unit in machine readable form, the selected document not being included in the first multiplicity of documents.
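The per-document part of the claim, steps a) and c1) through c4), can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claim does not say how words are predicted from topic beliefs, how the document cost is computed, or how the topic belief vector is modified. The sketch assumes a noisy-OR mixing function, a cross-entropy cost, and finite-difference gradient descent with a fixed starting belief of 0.5. All function and parameter names are hypothetical.

```python
import math


def observed_feature_vector(document, lexicon):
    """Step a): 1 if the lexicon word occurs in the document, else 0."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in lexicon]


def predict_features(beliefs, strengths):
    """Step c2): predicted feature vector from the topic belief vector.

    Assumes noisy-OR mixing: strengths[k][j] is the association strength
    between word-cluster k and lexicon word j, each in (0, 1).
    """
    predicted = []
    for j in range(len(strengths[0])):
        p_absent = 1.0
        for k, belief in enumerate(beliefs):
            p_absent *= 1.0 - belief * strengths[k][j]
        predicted.append(1.0 - p_absent)
    return predicted


def document_cost(observed, predicted, eps=1e-9):
    """Step c3): cross-entropy between observed and predicted vectors (assumed)."""
    cost = 0.0
    for d, r in zip(observed, predicted):
        r = min(max(r, eps), 1.0 - eps)  # keep log() finite
        cost -= d * math.log(r) + (1 - d) * math.log(1.0 - r)
    return cost


def infer_beliefs(observed, strengths, rate=0.1, max_steps=200,
                  tolerance=1e-6, h=1e-4):
    """Steps c1) to c4): adjust the topic belief vector until the document
    cost stops improving (the stopping rule is an assumption)."""
    n_clusters = len(strengths)
    beliefs = [0.5] * n_clusters  # c1) simplified initial prediction
    cost = document_cost(observed, predict_features(beliefs, strengths))
    for _ in range(max_steps):
        gradient = []
        for k in range(n_clusters):
            up, down = beliefs[:], beliefs[:]
            up[k] += h
            down[k] -= h
            c_up = document_cost(observed, predict_features(up, strengths))
            c_down = document_cost(observed, predict_features(down, strengths))
            gradient.append((c_up - c_down) / (2 * h))
        # c4A) move against the gradient, keeping beliefs in [0, 1]
        beliefs = [min(1.0, max(0.0, m - rate * g))
                   for m, g in zip(beliefs, gradient)]
        new_cost = document_cost(observed, predict_features(beliefs, strengths))
        improvement = cost - new_cost
        cost = new_cost
        if improvement < tolerance:  # c4B) stop once the cost has settled
            break
    return beliefs, cost


if __name__ == "__main__":
    lexicon = ["stock", "market", "goal", "match"]
    strengths = [[0.9, 0.9, 0.05, 0.05],   # hypothetical "finance" cluster
                 [0.05, 0.05, 0.9, 0.9]]   # hypothetical "sport" cluster
    observed = observed_feature_vector("The stock market fell", lexicon)
    print(infer_beliefs(observed, strengths))
```

With fixed association strengths, the same routine also illustrates step g): a document outside the learning set is characterized by inferring its topic belief vector.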
Abstract
An unsupervised method of learning the relationships between words and unspecified topics in documents using a computer is described. The computer represents these relationships via word clusters and association strength values, which can be used later during topical characterization of documents. The computer learns the relationships iteratively from a set of learning documents. It first preprocesses the learning documents by generating an observed feature vector for each document in the set and by setting the association strength values to initial values. The computer then determines how well the current association strength values predict the topical content of all of the learning documents by generating a cost for each document and summing the individual costs to generate a total cost. If the total cost is excessive, the association strength values are modified and the total cost is recalculated. The computer continues calculating the total cost and modifying the association strength values until it finds a set of association strength values that adequately predicts the topical content of the entire set of learning documents.
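The outer loop described in the abstract, summing per-document costs into a total cost and modifying the association strength values while that total can still be lowered, might look like the Python sketch below. It is illustrative only. The noisy-OR predictor, the cross-entropy cost, the exhaustive search over binary topic belief vectors, the finite-difference gradient, and the stopping tolerance are all assumptions that do not come from the abstract, and every name is hypothetical.

```python
import itertools
import math


def predict(beliefs, strengths):
    """Noisy-OR prediction (assumed): word j is absent only if every
    believed word cluster fails to produce it."""
    return [1.0 - math.prod(1.0 - m * row[j] for m, row in zip(beliefs, strengths))
            for j in range(len(strengths[0]))]


def document_cost(observed, predicted, eps=1e-9):
    """Cross-entropy cost for one document (assumed form of the cost)."""
    cost = 0.0
    for d, r in zip(observed, predicted):
        r = min(max(r, eps), 1.0 - eps)  # keep log() finite
        cost -= d * math.log(r) + (1 - d) * math.log(1.0 - r)
    return cost


def total_cost(vectors, strengths):
    """Sum of document costs, each taken under the document's best
    binary topic belief vector (a simplification of the inner search)."""
    candidates = list(itertools.product([0.0, 1.0], repeat=len(strengths)))
    return sum(min(document_cost(v, predict(m, strengths)) for m in candidates)
               for v in vectors)


def learn_strengths(vectors, strengths, rate=0.05, rounds=50,
                    tolerance=1e-4, h=1e-4):
    """Modify association strengths until the total cost stops improving."""
    strengths = [row[:] for row in strengths]  # leave the caller's values intact
    cost = total_cost(vectors, strengths)
    for _ in range(rounds):
        # Finite-difference estimate of how each strength affects the total cost.
        gradient = [[0.0] * len(row) for row in strengths]
        for k, row in enumerate(strengths):
            for j in range(len(row)):
                original = row[j]
                row[j] = original + h
                up = total_cost(vectors, strengths)
                row[j] = original - h
                down = total_cost(vectors, strengths)
                row[j] = original
                gradient[k][j] = (up - down) / (2 * h)
        trial = [[min(0.99, max(0.01, c - rate * g)) for c, g in zip(row, grow)]
                 for row, grow in zip(strengths, gradient)]
        trial_cost = total_cost(vectors, trial)
        if trial_cost >= cost:
            rate /= 2  # step did not help; retry with a smaller step
            continue
        improvement = cost - trial_cost
        strengths, cost = trial, trial_cost
        if improvement < tolerance:
            break
    return strengths, cost


if __name__ == "__main__":
    # Observed feature vectors for four hypothetical learning documents.
    vectors = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
    initial = [[0.6, 0.5, 0.4, 0.5], [0.4, 0.5, 0.6, 0.5]]
    print(learn_strengths(vectors, initial))
```

Because a step is accepted only when it lowers the total cost, the returned cost never exceeds the cost of the initial association strength values.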
19 Claims
Claim 1 (independent) appears in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
Specification