Method for learning to infer the topical content of documents based upon their lexical content

  • US 5,687,364 A
  • Filed: 09/16/1994
  • Issued: 11/11/1997
  • Est. Priority Date: 09/16/1994
  • Status: Expired due to Term
First Claim

1. An unsupervised method of learning relationships between words and unspecified topics using a processing unit coupled to a memory, the memory storing in machine readable form a first multiplicity of documents including a second multiplicity of words and a third multiplicity of unspecified topics, the memory also storing in machine readable form a lexicon including a fourth multiplicity of words, the fourth multiplicity of words being less than the second multiplicity of words, the method comprising the processing unit implemented steps of:

    a) generating an observed feature vector for each document of the first multiplicity of documents, each observed feature vector indicating which words of the lexicon are present in an associated document;

    b) initializing a fifth multiplicity of association strength values to initial values, each association strength value indicating a relationship between a word-cluster word pair, each word-cluster word pair including a word of the lexicon and a word-cluster of a sixth multiplicity of word-clusters, each word-cluster being representative of a one of the third multiplicity of unspecified topics;

    c) for each document in the first multiplicity of documents:

        1) predicting a predicted topical content of the document using the fifth multiplicity of association strength values and the observed feature vector for the document, the predicted topical content being represented by a topic belief vector;

        2) predicting which words of the lexicon appear in the document using the topic belief vector and the fifth multiplicity of association strength values, which words of the lexicon predicted to appear in the document being represented via a predicted feature vector;

        3) determining whether the topic belief vector permits adequate prediction of which words of the lexicon appear in the document by calculating a document cost using the observed feature vector and the predicted feature vector;

        4) if the topic belief vector did not permit adequate prediction of which words of the lexicon appear in the document:

            A) determining how to modify the topic belief vector and modifying the topic belief vector accordingly;

            B) repeating steps c2) through c4) until the topic belief vector permits adequate prediction of which words of the lexicon appear in the document;

    d) determining whether the fifth multiplicity of association strength values permit adequate prediction of which words of the lexicon appear in the first multiplicity of documents using a total cost generated by summing together the document cost for each document of the first multiplicity of documents;

    e) if the fifth multiplicity of association strength values do not permit adequate prediction of which words of the lexicon appear in all documents of the first multiplicity of documents:

        1) determining how to improve the prediction of which words in the lexicon appear in the first multiplicity of documents by determining how the values of the fifth multiplicity of association strength values should be modified;

        2) adjusting the values of the fifth multiplicity of association strength values as determined in step e1);

        3) repeating steps c) through e) until the fifth multiplicity of association strength values permit adequate prediction of which words of the lexicon appear in the first multiplicity of documents;

    f) storing in the memory the association strength values that permit adequate prediction of which words of the lexicon appear in the first multiplicity of documents; and

    g) predicting a topical content included in a selected document using the fifth multiplicity of association strength values and the sixth multiplicity of word-clusters, the selected document being presented to the processing unit in machine readable form, the selected document not being included in the first multiplicity of documents.
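
The claimed steps a) through g) can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not state how word predictions are formed, what the cost is, or how values are adjusted, so the sketch assumes a noisy-OR combination of word-clusters, a cross-entropy document cost, and plain gradient descent. It also replaces the "adequate prediction" tests with fixed iteration counts. All function names, parameters, and constants (`learn`, `infer_topics`, `EPS`, the learning rates) are hypothetical.

```python
import math
import random

EPS = 1e-3  # assumed bound keeping probabilities away from 0 and 1


def clamp(x):
    return min(max(x, EPS), 1.0 - EPS)


def feature_vector(doc_words, lexicon):
    """Step a): 1 where a lexicon word occurs in the document, else 0."""
    present = set(doc_words)
    return [1 if word in present else 0 for word in lexicon]


def predict_words(belief, strengths):
    """Step c2), assuming noisy-OR mixing: r_j = 1 - prod_k (1 - m_k * c_kj)."""
    predicted = []
    for j in range(len(strengths[0])):
        absent = 1.0
        for m_k, row in zip(belief, strengths):
            absent *= 1.0 - m_k * row[j]
        predicted.append(1.0 - absent)
    return predicted


def document_cost(observed, predicted):
    """Step c3), assuming a cross-entropy cost (lower is better)."""
    cost = 0.0
    for d, r in zip(observed, predicted):
        r = clamp(r)
        cost -= d * math.log(r) + (1 - d) * math.log(1.0 - r)
    return cost


def gradients(observed, belief, strengths):
    """Gradients of the document cost w.r.t. the beliefs and the strengths."""
    predicted = [clamp(r) for r in predict_words(belief, strengths)]
    g_belief = [0.0] * len(belief)
    g_strength = [[0.0] * len(observed) for _ in belief]
    for k, (m_k, row) in enumerate(zip(belief, strengths)):
        for j, (d, r) in enumerate(zip(observed, predicted)):
            common = (r - d) / (r * (1.0 - m_k * row[j]))
            g_belief[k] += common * row[j]
            g_strength[k][j] = common * m_k
    return g_belief, g_strength


def infer_topics(observed, strengths, steps=50, rate=0.1):
    """Steps c1)-c4) and g): adjust the topic belief vector by gradient descent."""
    belief = [0.5] * len(strengths)
    for _ in range(steps):  # fixed step count stands in for the adequacy test
        g_belief, _ = gradients(observed, belief, strengths)
        belief = [clamp(m - rate * g) for m, g in zip(belief, g_belief)]
    return belief


def learn(docs, lexicon, n_clusters, epochs=100, rate=0.05, seed=0):
    """Steps b)-f): returns (association strengths, total cost of the last pass)."""
    rng = random.Random(seed)
    observed = [feature_vector(doc, lexicon) for doc in docs]
    # Step b): small random initial association strengths (an assumption).
    strengths = [[rng.uniform(0.1, 0.5) for _ in lexicon] for _ in range(n_clusters)]
    total_cost = float("inf")
    for _ in range(epochs):  # fixed epoch count stands in for the test in step d)
        total_cost = 0.0
        total_grad = [[0.0] * len(lexicon) for _ in range(n_clusters)]
        for obs in observed:
            belief = infer_topics(obs, strengths)
            total_cost += document_cost(obs, predict_words(belief, strengths))
            _, g_strength = gradients(obs, belief, strengths)
            for k in range(n_clusters):
                for j in range(len(lexicon)):
                    total_grad[k][j] += g_strength[k][j]
        # Steps e1)-e2): move every strength against its summed gradient.
        strengths = [[clamp(c - rate * g) for c, g in zip(row, grow)]
                     for row, grow in zip(strengths, total_grad)]
    return strengths, total_cost
```

Under these assumptions, `learn(docs, lexicon, n_clusters)` covers steps b) through f), where `docs` is a list of word lists. Calling `infer_topics(feature_vector(new_doc, lexicon), strengths)` on a document outside the training set corresponds to step g). Because the clusters are learned without labels, which index ends up representing which topic depends on the random initialization.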
