Topic indexing method
11 Assignments
0 Petitions
Abstract
A method for improving the association of articles of information, or stories, with topics related to specific subjects (subject topics) and with a general topic of words that are not associated with any subject. The inventive method is trained using Hidden Markov Models (HMMs), with one HMM representing each story and each state in the HMM representing a topic. A standard Expectation-Maximization (EM) algorithm, as is known in this art field, can be used to maximize the expected likelihood of the words associated with each topic belonging to that topic. In the method, the probability that each word in a story is related to a subject topic is determined and evaluated, and the subject topics with the lowest probabilities are discarded. The remaining subject topics are evaluated, and the subset of subject topics with the highest probabilities over all the words in a story is taken to be the “correct” subject topic set. The method uses only positive information: words related to other topics are not taken as negative evidence against the topic being evaluated. The technique has particular application to text derived from speech via a speech recognizer, or by any other technique that results in a text file. The use of a general topic category improves the results, since most words in any story are not keywords associated with any given subject topic. Removing the general words reduces the number of words considered as keywords for any given subject topic, and this reduced number of words allows the method to discriminate better among the remaining words as they relate to the subject topics. The topics can range from the general, for example “the U.S. economy”, to the very specific, for example, “the relationship of the yen to the dollar in the U.S. economy.”
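The selection procedure the abstract describes can be sketched as follows. This is only an illustration: the topic names, word probabilities, priors, and the uniform mixing with the general topic are invented, not taken from the patent.

```python
import itertools
import math

# Hypothetical per-topic word probabilities, including a catch-all GENERAL
# topic for words that are not keywords of any subject topic (values invented).
WORD_GIVEN_TOPIC = {
    "economy": {"yen": 0.05, "dollar": 0.06, "rose": 0.01, "the": 0.02},
    "sports":  {"score": 0.07, "rose": 0.02, "the": 0.02},
    "GENERAL": {"the": 0.10, "rose": 0.03, "yen": 0.001,
                "dollar": 0.001, "score": 0.001},
}
PRIOR = {"economy": 0.3, "sports": 0.3}  # prior probability of each subject topic

def log_score(story, subject_topics):
    """Score a combination of subject topics: every word is explained as a
    uniform mixture of the chosen subject topics plus the general topic."""
    topics = list(subject_topics) + ["GENERAL"]
    s = sum(math.log(PRIOR[t]) for t in subject_topics)
    for w in story:
        p = sum(WORD_GIVEN_TOPIC[t].get(w, 1e-6) for t in topics) / len(topics)
        s += math.log(p)
    return s

def best_topic_set(story):
    """Choose the combination of subject topics with the highest score."""
    candidates = []
    for r in range(1, len(PRIOR) + 1):
        candidates.extend(itertools.combinations(PRIOR, r))
    return max(candidates, key=lambda c: log_score(story, c))

print(best_topic_set(["the", "yen", "rose", "the", "dollar"]))
```

Enumerating every combination of topics, as the claims recite, is exponential in the number of topics; that is one motivation for the pruned second set of topics in the dependent claims.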
81 Citations
26 Claims
1. A method of topic determination and discrimination of stories composed of words, said method comprising the steps of:
compiling a first set of topics;
obtaining prior probabilities for each topic in said first set of topics;
for all words in a story, determining probabilities associating each word in said story with each topic of said first set of topics, yielding word probabilities;
calculating individual topic probabilities from said word probabilities;
responsive to said individual topic probabilities, determining probabilities of all possible combinations of topics from said first set of topics associated with said story, yielding a posterior probability for each combination of topics from said first set of topics; and
choosing a combination of topics from said first set of topics having the highest posterior probability.

2. The method of claim 1 further comprising the steps of:
determining that some words in said story are not related to any topic in said first set of topics; and
determining that other words in said story are related to one or more different topics in said first set of topics.
3. The method as defined in claim 1 wherein determining word probabilities comprises the steps of:
performing the following steps for each combination of topics from said first set of topics;
summing, over the topics in the combination, the probabilities that a particular word is associated with each topic of said combination, yielding summed probabilities; and
multiplying said summed probabilities by said posterior probability for the combination of topics, said multiplying resulting in a score for that combination of topics from said first set of topics.
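Written with symbols that do not appear in the claims (S for a combination of topics, P(S) for the probability attached to that combination, and P(w | t) for the word probabilities), the scoring of claim 3 reads as:

```latex
\operatorname{score}(S) = P(S) \prod_{w \in \text{story}} \sum_{t \in S} P(w \mid t),
\qquad
\hat{S} = \operatorname*{arg\,max}_{S} \operatorname{score}(S)
```

That is, each word's probabilities are summed over the topics in the combination, the per-word sums are multiplied together over the story (the accumulation recited in claims 4 and 24), and claim 1 then chooses the highest-scoring combination.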
4. The method as defined in claim 1 further comprising the steps of:
independently determining the probability that each topic of said first set of topics is related to a particular story;
selecting a second set of topics from said first set of topics, said second set of topics comprising the topics in the combination of topics from the first set of topics having the highest posterior probability score;
scoring said second set of topics by summing, over said second set of topics, the probability that a particular word is related to each topic in said second set of topics, yielding summed probabilities for the second set of topics; and
multiplying said summed probabilities for the second set of topics over all words in the story and then multiplying by said prior probabilities of the combination of topics in said second set of topics.
5. The method of claim 4 wherein the step of determining the probability of a particular word in a story, given a particular topic, depends on the preceding words in said story.
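Claim 5 lets the probability of a word, given a topic, depend on the preceding words of the story. A minimal sketch of one way to do that, using a per-topic bigram model with add-alpha smoothing (the counts, vocabulary size, and smoothing are invented for illustration; the patent does not prescribe this estimator):

```python
# Hypothetical bigram counts per topic: COUNTS[topic][previous_word][word].
COUNTS = {
    "economy": {"the": {"yen": 8, "dollar": 6, "game": 0}},
    "sports":  {"the": {"yen": 0, "dollar": 0, "game": 9}},
}

def p_word_given_topic_and_context(word, prev, topic, vocab_size=1000, alpha=1.0):
    """P(word | preceding word, topic) with add-alpha smoothing, so that the
    word probability depends on the story's preceding words (claim 5)."""
    row = COUNTS[topic].get(prev, {})
    total = sum(row.values())
    return (row.get(word, 0) + alpha) / (total + alpha * vocab_size)

p_econ = p_word_given_topic_and_context("yen", "the", "economy")
p_sport = p_word_given_topic_and_context("yen", "the", "sports")
print(p_econ > p_sport)  # "the yen" is more probable under the economy topic
```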
6. The method of claim 1, wherein said stories are stored as computer files.
7. A method of topic determination and discrimination of stories, composed of words, said method comprising the steps of:
compiling a first set of topics, said first set of topics including a set of subject topics and a general topic;
forming said general topic by automatically compiling words that are not generally associated with any particular subject;
obtaining prior probabilities for each topic in said first set of topics;
for all words in a story, determining probabilities associating each word in said story to each topic in said set of subject topics and said general topic, yielding word probabilities;
calculating individual topic probabilities from said word probabilities; and
responsive to said individual topic probabilities, determining the probabilities of all possible combinations of topics from said set of subject topics associated with said story, yielding a posterior probability for each combination of topics in said first set of topics.

8. The method of claim 7 further comprising the steps of:
determining that some words in said story are not related to any subject topic in the first set of topics; and
determining that other words in said story are related to one or more different subject topics in the first set of topics.
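The general topic of claim 7 is compiled automatically from words that are not associated with any particular subject. One simple realization (a sketch only: the counts, the peakedness test, and the 2.0 ratio are invented, not the patent's criterion) flags words whose training counts are spread nearly evenly across the subject topics:

```python
# word -> training count under each subject topic (hypothetical numbers)
TOPIC_COUNTS = {
    "the":   {"economy": 100, "sports": 98, "weather": 102},
    "yen":   {"economy": 40,  "sports": 1,  "weather": 1},
    "score": {"economy": 2,   "sports": 55, "weather": 3},
}

def is_general(word, ratio=2.0):
    """Treat a word as general when even its most topic-specific count is not
    much larger than its average count over the subject topics."""
    counts = list(TOPIC_COUNTS[word].values())
    return max(counts) < ratio * (sum(counts) / len(counts))

general_topic = sorted(w for w in TOPIC_COUNTS if is_general(w))
print(general_topic)  # only the function word ends up in the general topic
```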
9. The method as defined in claim 7 wherein determining word probabilities comprises the steps of:
performing the following steps for each combination of topics from said first set of topics;
summing, over the topics in the combination, the probabilities that a particular word is associated with each topic of said combination, yielding summed probabilities; and
multiplying said summed probabilities by said posterior probability for the combination of topics, said multiplying resulting in a score for that combination of topics from said first set of topics.
10. The method as defined in claim 7 further comprising the steps of:
independently determining the probability that each topic of said first set of topics is related to a particular story;
selecting a second set of topics from said first set of topics, the probability that each topic of the second set of topics is related to said particular story exceeding a threshold;
scoring said second set of topics by summing, over said second set of topics, the probability that a particular word is related to each topic in said second set of topics, yielding summed probabilities for said second set of topics; and
multiplying said summed probabilities for said second set of topics over all words in the story and then multiplying by said posterior probabilities of the combination of topics in said second set of topics.
11. The method of claim 10 wherein the step of determining the probability of a particular word in a story, given a particular topic, depends on the preceding words in said story.
12. The method of claim 7 further comprising the steps of:
determining the probabilities associating the words in said story to said general topic and to said subject topics; and
based on a preset criterion responsive to said probabilities, determining a second set of subject topics chosen from the set of subject topics associated with said story.
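The pruning recited in claims 10 and 12, selecting a second set of subject topics by a preset criterion on the per-topic probabilities, can be sketched as a simple threshold (the probabilities and the 0.5 cutoff are invented):

```python
def second_set(topic_probs, threshold=0.5):
    """Keep every subject topic whose independently determined story
    probability exceeds the threshold (claim 10's selection step)."""
    return {t for t, p in topic_probs.items() if p > threshold}

story_probs = {"economy": 0.92, "sports": 0.15, "weather": 0.61}
print(sorted(second_set(story_probs)))
```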
13. A method of modeling the relationship between words and topics in stories comprising the steps of:
compiling a set of training stories;
determining a set of topics associated with at least one of said training stories;
for each training story, relating each word of said training story with each topic of said set of topics, the relating comprising forming a Hidden Markov Model for each story, said Hidden Markov Model relates, for a topic related to a given story, the expected percentage of words in said story that relates to said topic, and the probability that a given word in said story, related to a topic, is the word being considered; and
maximizing the joint likelihood of the topics and words related to those respective topics in said set of training stories.

14. The method of claim 13 wherein forming said Hidden Markov Model comprises the steps of:
forming each state in said Hidden Markov Model as a topic contained in that story; and
forming the transitions to such states as the probabilities that the words in the story are related to that topic.
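Claims 13 and 14 describe one HMM per story, with a state for each topic in the story and transitions weighted by the fraction of the story's words related to each topic; the maximization step is the standard EM re-estimation the abstract mentions. A minimal sketch re-estimating those per-story topic weights, with invented word distributions and a fixed number of iterations:

```python
# Hypothetical word probabilities for the topics present in one story.
WORD_GIVEN_TOPIC = {
    "economy": {"yen": 0.4, "rose": 0.1, "the": 0.5},
    "GENERAL": {"yen": 0.01, "rose": 0.09, "the": 0.9},
}

def em_topic_weights(story, topics, iters=20):
    """Re-estimate the transition weights (expected fraction of the story's
    words drawn from each topic state) by EM."""
    weights = {t: 1.0 / len(topics) for t in topics}  # uniform start
    for _ in range(iters):
        expected = {t: 0.0 for t in topics}
        for w in story:
            # E-step: posterior probability that this word came from each state
            z = sum(weights[t] * WORD_GIVEN_TOPIC[t].get(w, 1e-6) for t in topics)
            for t in topics:
                expected[t] += weights[t] * WORD_GIVEN_TOPIC[t].get(w, 1e-6) / z
        # M-step: new weight = expected fraction of words from each state
        weights = {t: expected[t] / len(story) for t in topics}
    return weights

weights = em_topic_weights(["the", "yen", "rose", "the", "yen"],
                           ["economy", "GENERAL"])
print(weights)  # the economy weight grows on the repeated keyword "yen"
```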
15. A method of modeling the relationship between words and topics in stories comprising the steps of:
compiling a set of training stories;
determining a set of subject topics associated with at least one of said stories;
determining a general topic of words that are not associated with any subject topic; and
relating each word of said training story to each topic, the relating comprising forming a Hidden Markov Model for each story, said Hidden Markov Model relates, for a topic related to a given story, the expected percentage of words in said story that relates to said topic, and the probability that a given word in said story, related to a topic, is the word being considered.

16. The method of claim 15 wherein forming said Hidden Markov Model comprises the steps of:
forming each state in said Hidden Markov Model as a topic contained in that story; and
forming the transitions to such states as the probabilities that the words in the story are related to that topic.
17. The method of claim 15 further comprising the step of maximizing the joint likelihood of the topics and the words related to those respective topics in said training set of stories.
18. An apparatus for topic determination and discrimination of stories composed of words, said apparatus comprising:
means for compiling a first set of topics;
means for obtaining prior probabilities for each topic in said first set of topics;
for all words in a story, means for determining the probabilities associating each word in said story with each topic of said first set of topics yielding word probabilities;
means for calculating individual topic probabilities from said word probabilities; and
responsive to individual topic probabilities, means for determining the probabilities of all possible combinations of topics from said first set of topics associated with said story, yielding a posterior probability for each combination of topics from said first set of topics.

19. The apparatus of claim 18 further comprising:
means for determining that some words in said story are not related to any topic in said first set of topics; and
means for determining that other words in said story are related to one or more different topics in said first set of topics.
20. The apparatus of claim 18 wherein said means for determining word probabilities further comprises:
means for summing, over the topics in the combination, the probabilities that a particular word is associated with each topic of said combination of topics from said first set of topics, yielding summed probabilities;
means for multiplying said summed probabilities by said posterior probability for the combination of topics from said first set of topics, said multiplying resulting in a score for each combination of topics from said first set of topics; and
means to control the apparatus such that each combination of topics from said first set of topics is operated upon.
21. The apparatus as defined in claim 18 further comprising:
means for independently determining the probability that each topic of said first set of topics is related to a particular story;
means for selecting a second set of topics from said first set of topics, said second set of topics comprising the topics in said combination of topics from said first set of topics having the highest posterior probability score;
means for scoring said second set of topics by summing, over said second set of topics, the probability that a particular word is related to each topic in said second set of topics, yielding summed probabilities for the second set of topics; and
means for multiplying said summed probabilities for the second set of topics over all words in the story and then multiplying by said prior probabilities of the combination of topics matching said second set of topics.
22. An apparatus for topic determination and discrimination of stories composed of words, said apparatus comprising:
means for compiling a first set of topics, said first set of topics including a set of subject topics and a general topic;
means for forming said general topic by compiling words that are not generally associated with any subject topic;
for all words in a story, means for determining probabilities associating each word in said story to each topic in said set of subject topics and said general topic, yielding word probabilities;
means for determining individual topic probabilities from said word probabilities; and
means for selecting, based on the individual topic probabilities, a second set of topics from said first set of topics associated with said story.

23. The apparatus of claim 22 further comprising:
means for determining that some words in said story are not related to any subject topic in said first set of topics; and
means for determining that other words in said story are related to one or more different subject topics in the first set of topics.
24. The apparatus of claim 22 wherein said means for determining word probabilities further comprises:
means for summing, over all combinations of topics from said first set of topics, the probabilities that a particular word is associated with each topic of said combination of topics from said first set of topics, yielding summed probabilities;
means for multiplying said summed probabilities over all said words in said story and multiplying by said prior probability of each combination of topics from said first set of topics, said multiplying resulting in a score for each combination of topics from said first set of topics; and
means to control said apparatus such that each combination of topics from said first set of topics is operated upon.
25. The apparatus as defined in claim 22 further comprising:
means for independently determining the probability that each topic of said first set of topics is related to a particular story;
means for selecting a second set of topics from said first set of topics, said second set of topics comprising the topics in said combination of topics from said first set of topics having the highest probability score;
means for scoring said second set of topics by summing, over said second set of topics, the probability that a particular word is related to each topic in said second set of topics, yielding summed probabilities for the second set of topics; and
means for multiplying said summed probabilities for the second set of topics over all words in the story and then multiplying by said prior probabilities of the combination of topics matching said second set of topics.
26. Apparatus for modeling the relationship between words and topics in stories, the apparatus comprising:
means for compiling a set of training stories;
means for determining a set of topics, said topics comprising a set of subject topics associated with at least one of said training stories and a general topic of words that are not associated with any subject topic, associated with each of said stories; and
means for relating each word of each training story with each topic of said set of topics, the means for relating comprising means for forming a Hidden Markov Model for each story, said Hidden Markov Model relates, for a topic related to a given story, the expected percentage of words in said story that relates to said topic, and the probability that a given word in said story, related to a topic, is the word being considered.
Specification