Method for learning to infer the topical content of documents based upon their lexical content

US 5,687,364 A
Filed: 09/16/1994
Issued: 11/11/1997
Est. Priority Date: 09/16/1994
Status: Expired due to Term

Abstract

An unsupervised method of learning the relationships between words and unspecified topics in documents using a computer is described. The computer represents the relationships between words and unspecified topics via word clusters and association strength values, which can be used later during topical characterization of documents. The computer learns the relationships between words and unspecified topics in an iterative fashion from a set of learning documents. The computer preprocesses the learning documents by generating an observed feature vector for each document of the set and by setting association strengths to initial values. The computer then determines how well the current association strength values predict the topical content of all of the learning documents by generating a cost for each document and summing the individual costs together to generate a total cost. If the total cost is excessive, the association strength values are modified and the total cost is recalculated. The computer continues calculating total cost and modifying association strength values until a set of association strength values is discovered that adequately predicts the topical content of the entire set of learning documents.
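
The preprocessing step described in the abstract can be illustrated with a minimal sketch, assuming a simple lowercase tokenizer and a small hand-picked lexicon. Neither is specified in this record. The function name `observed_feature_vector` and the `min_count` parameter are hypothetical; `min_count` is only a fixed stand-in for the thresholding function of claims 10 and 18, and the sketch does not attempt the histogram-knee computation of claims 11 and 19.

```python
import re
from collections import Counter

def observed_feature_vector(text, lexicon, min_count=1):
    """Return a binary list d where d[j] is 1 if lexicon[j] occurs in
    `text` at least `min_count` times, and 0 otherwise."""
    # Assumed tokenizer (not from the patent): runs of lowercase letters.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [1 if counts[word] >= min_count else 0 for word in lexicon]

# Hypothetical lexicon and documents, for illustration only.
lexicon = ["topic", "word", "cluster", "patent"]
docs = [
    "Word clusters relate each word to a topic.",
    "A patent claims a method.",
]
features = [observed_feature_vector(doc, lexicon) for doc in docs]
```

Each entry of `features` has one binary dimension per lexicon word, matching the observed feature vector described in claims 6, 13, and 17. Exact string matching is used here, so "clusters" does not count as an occurrence of "cluster"; a real system might stem words first, but that is outside what this record states.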

19 Claims

1. An unsupervised method of learning relationships between words and unspecified topics using a processing unit coupled to a memory, the memory storing in machine readable form a first multiplicity of documents including a second multiplicity of words and a third multiplicity of unspecified topics, the memory also storing in machine readable form a lexicon including a fourth multiplicity of words, the fourth multiplicity of words being less than the second multiplicity of words, the method comprising the processing unit implemented steps of:
- a) generating an observed feature vector for each document of the first multiplicity of documents, each observed feature vector indicating which words of the lexicon are present in an associated document;
  
  b) initializing a fifth multiplicity of association strength values to initial values, each association strength value indicating a relationship between a word-cluster word pair, each word-cluster word pair including a word of the lexicon and a word-cluster of a sixth multiplicity of word-clusters, each word-cluster being representative of a one of the third multiplicity of unspecified topics;
  
  c) for each document in the first multiplicity of documents;
  
  1) predicting a predicted topical content of the document using the fifth multiplicity of association strength values and the observed feature vector for the document, the predicted topical content being represented by a topic belief vector;
  
  2) predicting which words of the lexicon appear in the document using the topic belief vector and the fifth multiplicity of association strength values, which words of the lexicon predicted to appear in the document being represented via a predicted feature vector;
  
  3) determining whether the topic belief vector permits adequate prediction of which words of the lexicon appear in the document by calculating a document cost using the observed feature vector and the predicted feature vector;
  
  4) if the topic belief vector did not permit adequate prediction of which words of the lexicon appear in the document;
  
  A) determining how to modify the topic belief vector and modifying the topic belief vector accordingly;
  
  B) repeating steps c2) through c4) until the topic belief vector permits adequate prediction of which words of the lexicon appear in the document;
  
  d) determining whether the fifth multiplicity of association strength values permit adequate prediction of which words of the lexicon appear in the first multiplicity of documents using a total cost generated by summing together the document cost for each document of the first multiplicity of documents;
  
  e) if the fifth multiplicity of association strength values do not permit adequate prediction of which words of the lexicon appear in all documents of the first multiplicity of documents;
  
  1) determining how to improve the prediction of which words in the lexicon appear in the first multiplicity of documents by determining how the values of the fifth multiplicity of association strength values should be modified;
  
  2) adjusting the values of the fifth multiplicity of association strength values as determined in step e1);
  
  3) repeating steps c) through e) until the fifth multiplicity of association strength values permit adequate prediction of which words of the lexicon appear in the first multiplicity of documents;
  
  f) storing in the memory the association strength values that permit adequate prediction of which words of the lexicon appear in the first multiplicity of documents; and
  
  g) predicting a topical content included in a selected document using the fifth multiplicity of association strength values and the sixth multiplicity of word-clusters, the selected document being presented to the processing unit in machine readable form, the selected document not being included in the first multiplicity of documents.
- - 2. The method of claim 1 wherein each predicted feature vector includes a dimension associated with a word included in the lexicon, and wherein each dimension of the predicted feature vector is associated with a scalar.
  - 3. The method of claim 2 wherein step c2) comprises determining a value of the scalar associated with each dimension of the predicted feature vector using a soft disjunctive mixing function.
  - 4. The method of claim 3 wherein the soft disjunctive mixing function is of the form:
    - $$r_j = 1 - \prod_k \left(1 - m_k\, c_{j,k}\right),$$
      where;
      
      r_j denotes the value of a scalar associated with a jth dimension of the predicted feature vector;
      
      k denotes an index for the first multiplicity of word clusters;
      
      m_k denotes a scalar value of the topic belief vector associated with the kth word cluster; and
      
      c_j,k denotes an association strength value relating the jth word to the kth word cluster.
  - 5. The method of claim 3 wherein a cost per document is calculated using a standard log likelihood function, the predicted feature vector associated with the document, and the observed feature vector associated with the document.
  - 6. The method of claim 5 wherein the observed feature vector includes a dimension for each word included in the lexicon, and wherein each dimension of the observed feature vector is associated with a binary valued scalar, the scalar having a first value indicating that the associated word of the lexicon is present in the document and a second value indicating that the associated word of the lexicon is not present in the document.
  - 7. The method of claim 6 wherein the standard log likelihood function is of the form:
    - $$g = \sum_j \log_2\left[d_j r_j + (1 - d_j)(1 - r_j)\right]$$
      where;
      
      g denotes the document cost;
      
      d_j denotes the value of a scalar associated with a jth dimension of the observed feature vector; and
      
      r_j denotes the value of a scalar associated with a jth dimension of the predicted feature vector.
  - 8. The method of claim 7 wherein step c4A) comprises the computer implemented step of:
    - performing conjugate gradient descent on the document cost.
  - 9. The method of claim 8 wherein step e1) comprises the computer implemented step of:
    - performing conjugate gradient descent on the total cost.
  - 10. The method of claim 9 wherein step a) includes using a thresholding function for each dimension of the observed feature vector to determine a value of the scalar associated with the dimension.
  - 11. The method of claim 10 wherein the thresholding function relates a number of times a word of the lexicon occurs in the document compared to a knee of a word count histogram for the document.
  - 12. The method of claim 1 wherein the document cost is calculated using a standard log likelihood function, the predicted feature vector associated with the document, and the observed feature vector associated with the document.
  - 13. The method of claim 12 wherein the observed feature vector includes a dimension for each word included in the lexicon, and wherein each dimension of the observed feature vector is associated with a binary valued scalar, each scalar having a first value indicating that an associated word of the lexicon is present in the document and a second value indicating that the associated word of the lexicon is not present in the document.
  - 14. The method of claim 13 wherein the standard log likelihood function is of the form:
    - $$g = \sum_j \log_2\left[d_j r_j + (1 - d_j)(1 - r_j)\right]$$
      where;
      
      g denotes the document cost;
      
      d_j denotes the value of a scalar associated with a jth dimension of the observed feature vector; and
      
      r_j denotes the value of a scalar associated with a jth dimension of the predicted feature vector.
  - 15. The method of claim 1 wherein the step of determining how to modify the topic belief vector for the document comprises the processing unit implemented step of:
    - performing conjugate gradient descent on the document cost.
  - 16. The method of claim 1 wherein the step of determining how to modify the fifth multiplicity of association strength values comprises the processing unit implemented step of:
    - performing conjugate gradient descent on the total cost.
  - 17. The method of claim 1 wherein the observed feature vector is multidimensional and each dimension is associated with a word included in the lexicon, wherein each dimension of the observed feature vector is associated with a binary valued scalar, each scalar having a value indicative of whether the associated word of the lexicon is present in the document.
  - 18. The method of claim 17 wherein the step of generating the observed feature vector for each document in the first multiplicity of documents includes using a thresholding function for each dimension of the observed feature vector to determine a value of the scalar associated with the dimension.
  - 19. The method of claim 18 wherein the thresholding function relates a number of times the associated word of the lexicon occurs in the document compared to a knee of a word count histogram for the document.
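
The formulas in claims 4, 7, and 14 and the per-document loop of steps c2) through c4) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation. The claims name conjugate gradient descent; the sketch swaps in plain finite-difference gradient ascent, clipped to [0, 1], to stay short and dependency-free. As written in the claims, g is a sum of log probabilities, so it is at most zero; the sketch assumes that better prediction means g closer to zero. All function names, the step count, the learning rate, and the example strengths are hypothetical.

```python
import math

def predict_features(m, c):
    """Soft disjunctive mixing: r_j = 1 - prod_k (1 - m_k * c[j][k])."""
    return [1.0 - math.prod(1.0 - m_k * c_jk for m_k, c_jk in zip(m, row))
            for row in c]

def document_cost(d, r, eps=1e-9):
    """g = sum_j log2[d_j r_j + (1 - d_j)(1 - r_j)]; eps guards log2(0)."""
    return sum(math.log2(max(dj * rj + (1 - dj) * (1 - rj), eps))
               for dj, rj in zip(d, r))

def infer_topic_beliefs(d, c, steps=200, lr=0.1, h=1e-5):
    """Adjust the topic belief vector m to raise g for one document.

    Assumption: finite-difference gradient ascent stands in for the
    conjugate gradient method named in claims 8 and 15.
    """
    m = [0.5] * len(c[0])  # assumed starting beliefs, one per word cluster
    for _ in range(steps):
        base = document_cost(d, predict_features(m, c))
        grad = []
        for k in range(len(m)):
            bumped = m[:k] + [m[k] + h] + m[k + 1:]
            grad.append(
                (document_cost(d, predict_features(bumped, c)) - base) / h)
        # Keep each belief inside [0, 1] after the update.
        m = [min(1.0, max(0.0, mk + lr * gk)) for mk, gk in zip(m, grad)]
    return m

# Hypothetical association strengths c[j][k]: 3 lexicon words x 2 clusters.
c = [[0.9, 0.1], [0.8, 0.1], [0.1, 0.9]]
d = [1, 1, 0]  # observed feature vector: words 0 and 1 present, word 2 absent
m = infer_topic_beliefs(d, c)
```

With these example values, the belief for the first word cluster should rise and the belief for the second should fall, because the two observed words are strongly associated with the first cluster only. Learning the association strengths themselves (steps d and e) would wrap this per-document loop in an outer loop that sums the document costs and adjusts `c`; that outer loop is omitted here.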

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Saund, Eric; Hearst, Marti A.
Primary Examiner(s): Black, Thomas G.
Assistant Examiner(s): Coby, Frantz
Application Number: US08/308,037
Time in Patent Office: 1,152 Days
Field of Search: 364/419.19, 364/900, 364/419, 364/200, 395/600, 395/605, 395/606
US Class Current: 704/5
CPC Class Codes:
G06F 16/355 Class or cluster creation o...
Y10S 707/99936 Pattern matching access
