Text segmentation and identification of topic using language models

US 6,052,657 A
Filed: 11/25/1997
Issued: 04/18/2000
Est. Priority Date: 09/09/1997
Status: Expired due to Term

Abstract

A system for segmenting text and identifying segment topics that match a user-specified topic. A topic tracking system creates a set of topic models from training text containing topic boundaries using a clustering algorithm. A user supplies topic text. The system creates a topic model of the topic text and adds the topic model to the set of topic models. User-supplied test text is segmented according to the set of topic models. Segments relating to the same topic as the topic text are selected.

45 Claims

1. A method for segmenting a stream of text into segments using a plurality of language models, the stream of text including a sequence of blocks of text, the method comprising:
- scoring the blocks of text against the language models to generate language model scores for the blocks of text, the language model score for a block of text against a language model indicating a correlation between the block of text and the language model;
  
  generating language model sequence scores for different sequences of language models to which a sequence of blocks of text may correspond, a language model sequence score being a function of the scores of a sequence of blocks of text against a sequence of language models;
  
  selecting a sequence of language models that satisfies a predetermined condition; and
  
  identifying segment boundaries in the stream of text that correspond to language model transitions in the selected sequence of language models.
  - 2. The method of claim 1, wherein generating a language model sequence score for a sequence of language models comprises summing language model scores for the sequence of blocks of text corresponding to the sequence of language models.
  - 3. The method of claim 2, further comprising:
    - for each language model transition in the sequence of language models, adding to the language model sequence score a switch penalty.
  - 4. The method of claim 3, wherein the switch penalty is the same for each language model transition in the sequence of language models.
  - 5. The method of claim 4, wherein the switch penalty is determined by:
    - selecting a stream of text for which the number of language model transitions is known;
      
      repeatedly segmenting the stream of text into segments using a plurality of switch penalties; and
      
      selecting a switch penalty resulting in a number of language model transitions that is similar to the known number of language model transitions.
  - 6. The method of claim 1, wherein generating language model sequence scores comprises:
    - generating multiple language model sequence scores for a subsequence of the sequence of blocks of text;
      
      eliminating poorly scoring sequences of language models; and
      
      adding a block of text to the subsequence and repeating the generating and eliminating steps.
  - 7. The method of claim 6, wherein:
    - a poorly scoring sequence of language models is a sequence of language models with a language model sequence score that is worse than another language model sequence score by more than a fall-behind amount.
  - 8. The method of claim 7, wherein:
    - generating a language model sequence score for a sequence of language models comprises, for each language model transition in the sequence of language models, adding to the language model sequence score a switch penalty; and
      
      the fall-behind amount equals the switch penalty.
  - 9. The method of claim 1, wherein selecting a language model sequence based on a predetermined condition comprises:
    - selecting a language model sequence with a language model sequence score that is the minimum of the calculated language model sequence scores.
  - 10. The method of claim 1, wherein a block of text comprises a sentence.
  - 11. The method of claim 1, wherein a block of text comprises a paragraph.
  - 12. The method of claim 1, wherein a block of text comprises an utterance identified by a speech recognizer.
  - 13. The method of claim 12, wherein an utterance comprises a sequence of words.
  - 14. The method of claim 1, wherein the language models are generated by:
    - clustering a stream of training text into a specified number of clusters; and
      
      generating a language model for each cluster.
  - 15. The method of claim 1, wherein the language models comprise unigram language models.
  - 16. The method of claim 1, wherein the language models comprise bigram language models.
  - 17. The method of claim 1, further comprising scoring the blocks of text against a language model for a topic of interest.
  - 18. The method of claim 17, further comprising identifying segments that correspond to the language model for the topic of interest as corresponding to the topic of interest.
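
The steps of claims 1 to 5 and 9 can be sketched as a dynamic program over blocks and language models. This is a minimal illustration, not the patented implementation. It assumes that lower scores are better (consistent with the "minimum" in claim 9), that the switch penalty is non-negative and uniform (claim 4), and that `scores[i][j]` already holds the score of block `i` against language model `j`. The names `segment`, `boundaries`, and `calibrate_switch_penalty` are hypothetical.

```python
from typing import List, Sequence


def segment(scores: Sequence[Sequence[float]], switch_penalty: float) -> List[int]:
    """Return, for each block, the model index on the minimum-score sequence.

    scores[i][j] is the score of block i against language model j (lower is
    better). switch_penalty (assumed >= 0) is added at every model transition.
    """
    n_models = len(scores[0])
    best = list(scores[0])  # best[j]: lowest sequence score ending in model j
    back = []               # back-pointers, one list per block after the first
    for block_scores in scores[1:]:
        g = min(range(n_models), key=best.__getitem__)  # cheapest predecessor
        new_best, pointers = [], []
        for j in range(n_models):
            # Either stay in model j, or switch from the cheapest model g.
            if best[j] <= best[g] + switch_penalty:
                prev, cost = j, best[j]
            else:
                prev, cost = g, best[g] + switch_penalty
            new_best.append(cost + block_scores[j])
            pointers.append(prev)
        back.append(pointers)
        best = new_best
    # Trace back from the best final model to recover the whole sequence.
    j = min(range(n_models), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path


def boundaries(path: Sequence[int]) -> List[int]:
    """Indices of blocks that start a new segment (a model transition)."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


def calibrate_switch_penalty(scores, known_transitions, candidates):
    """Pick the candidate penalty whose transition count is closest to the
    known count for this text (one reading of claim 5)."""
    return min(
        candidates,
        key=lambda p: abs(len(boundaries(segment(scores, p))) - known_transitions),
    )
```

With a penalty of zero every block simply takes its own best model, so a single noisy block produces spurious boundaries. A larger penalty suppresses them, which is why the penalty is worth calibrating against text with a known number of transitions.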

19. A method for identifying a block of text as relating to a topic of interest, in a system comprising a plurality of language models, including a language model for the topic of interest, the method comprising:
- obtaining a stream of text comprising text segments;
  
  scoring the text segments against the plurality of language models to generate language model scores for the text segments;
  
  identifying a text segment from among the text segments as a block of text relating to the topic of interest if the score of the text segment against the language model for the topic of interest satisfies a predetermined condition.
  - 20. The method of claim 19, wherein the predetermined condition requires the score of the text segment against the language model for the topic of interest to differ from the lowest score among the scores of the text segment against the plurality of language models by less than a predetermined amount, or to be the lowest score.
  - 21. The method of claim 19, wherein the predetermined condition requires the score of the text segment against the language model for the topic of interest to be the lowest score among the scores of the text segment against the plurality of language models, and that the next lowest score among the scores of the text segment against the plurality of language models be greater than the score of the text segment against the language model for the topic of interest by more than a predetermined amount.
  - 22. The method of claim 21, wherein the predetermined amount is zero.
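
The two predetermined conditions in claims 20 and 21 can be written as simple predicates over one segment's scores. This sketch assumes lower scores are better, that `scores` lists the segment's score against every language model, that `topic` is the index of the topic-of-interest model, and that the margin is non-negative. The function names `within_margin` and `wins_by_margin` are hypothetical.

```python
from typing import Sequence


def within_margin(scores: Sequence[float], topic: int, margin: float) -> bool:
    """Claim-20-style test: the topic score is the lowest score, or differs
    from the lowest score by less than `margin`."""
    lowest = min(scores)
    return scores[topic] == lowest or scores[topic] - lowest < margin


def wins_by_margin(scores: Sequence[float], topic: int, margin: float = 0.0) -> bool:
    """Claim-21-style test: the topic score is the lowest score, and the next
    lowest score exceeds it by more than `margin`.

    Requires at least two language models. With margin 0 (claim 22) this
    reduces to the topic model being the strict best match.
    """
    next_lowest = min(s for j, s in enumerate(scores) if j != topic)
    return scores[topic] <= next_lowest and next_lowest - scores[topic] > margin
```

The first test is the more permissive one: it accepts near misses, which trades precision for recall. The second accepts only clear wins.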

23. A computer program tangibly stored on a computer-readable medium and operable to cause a computer to segment a stream of text into segments using a plurality of language models, the stream of text including a sequence of blocks of text, comprising instructions to:
- score the blocks of text against the language models to generate language model scores for the blocks of text, the language model score for a block of text against a language model indicating a correlation between the block of text and the language model;
  
  generate language model sequence scores for different sequences of language models to which a sequence of blocks of text may correspond, a language model sequence score being a function of the scores of a sequence of blocks of text against a sequence of language models;
  
  select a sequence of language models based on a predetermined condition; and
  
  identify segment boundaries in the stream of text that correspond to language model transitions in the selected sequence of language models.
  - 24. The computer program of claim 23, wherein instructions to generate a language model sequence score for a sequence of language models comprise instructions to sum language model scores for the sequence of blocks of text corresponding to the sequence of language models.
  - 25. The computer program of claim 24, further comprising instructions to, for each language model transition in the sequence of language models, add to the language model sequence score a switch penalty.
  - 26. The computer program of claim 25, wherein the switch penalty is the same for each language model transition in the sequence of language models.
  - 27. The computer program of claim 26, wherein the switch penalty is determined by instructions to:
    - select a stream of text for which the number of language model transitions is known;
      
      repeatedly segment the stream of text into segments using a plurality of switch penalties; and
      
      select a switch penalty resulting in a number of language model transitions that is similar to the known number of language model transitions.
  - 28. The computer program of claim 23, wherein instructions to generate language model sequence scores comprise instructions to:
    - generate multiple language model sequence scores for a subsequence of the sequence of blocks of text;
      
      eliminate poorly scoring sequences of language models; and
      
      add a block of text to the subsequence and repeat the instructions to generate and eliminate.
  - 29. The computer program of claim 28, wherein a poorly scoring sequence of language models is a sequence of language models with a language model sequence score that is worse than another language model sequence score by more than a fall-behind amount.
  - 30. The computer program of claim 29, wherein instructions to generate a language model sequence score comprise instructions, for each language model transition in the sequence of language models, to add to the language model sequence score a switch penalty, and wherein the fall-behind amount equals the switch penalty.
  - 31. The computer program of claim 23, wherein instructions to select a language model sequence based on the predetermined condition comprise instructions to select a language model sequence with a language model sequence score that is the minimum of the calculated language model sequence scores.
  - 32. The computer program of claim 23, wherein a block of text comprises a sentence.
  - 33. The computer program of claim 23, wherein a block of text comprises a paragraph.
  - 34. The computer program of claim 23, wherein a block of text comprises an utterance identified by a speech recognizer.
  - 35. The computer program of claim 34, wherein an utterance comprises a sequence of words.
  - 36. The computer program of claim 23, wherein the language models are generated by instructions to:
    - cluster a stream of training text into a specified number of clusters; and
      
      generate a language model for each cluster.
  - 37. The computer program of claim 23, wherein the language models comprise unigram language models.
  - 38. The computer program of claim 23, wherein the language models comprise bigram language models.
  - 39. The computer program of claim 23, further comprising instructions to score the blocks of text against a language model for a topic of interest.
  - 40. The computer program of claim 39, further comprising instructions to identify segments that correspond to the language model for the topic of interest as corresponding to the topic of interest.
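
The incremental generate, eliminate, and extend loop of claims 28 to 30 can be sketched as a pruned search over hypotheses. This is an illustration under stated assumptions: lower scores are better, the switch penalty is non-negative, and only the best hypothesis per final model is kept at each step (a recombination shortcut that the claims do not mention). The name `segment_with_pruning` and the `Hypothesis` tuple layout are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# (language model sequence score, model index chosen for each block so far)
Hypothesis = Tuple[float, Tuple[int, ...]]


def segment_with_pruning(
    scores: Sequence[Sequence[float]],
    switch_penalty: float,
    fall_behind: Optional[float] = None,
) -> Hypothesis:
    """Extend hypotheses one block at a time, dropping any hypothesis whose
    score is worse than the current best by more than `fall_behind`."""
    if fall_behind is None:
        fall_behind = switch_penalty  # the choice recited in claim 30
    hyps: List[Hypothesis] = [(0.0, ())]
    for block_scores in scores:
        extended: Dict[int, Hypothesis] = {}
        for total, path in hyps:
            for j, s in enumerate(block_scores):
                # Pay the switch penalty whenever the model changes.
                penalty = switch_penalty if path and path[-1] != j else 0.0
                candidate = total + s + penalty
                # Assumption: keep only the best hypothesis ending in model j.
                if j not in extended or candidate < extended[j][0]:
                    extended[j] = (candidate, path + (j,))
        best = min(total for total, _ in extended.values())
        # Eliminate poorly scoring sequences (the fall-behind test).
        hyps = [h for h in extended.values() if h[0] <= best + fall_behind]
    return min(hyps)
```

One way to see why setting the fall-behind amount equal to the switch penalty is a natural choice: a hypothesis that trails the best one by more than a single switch penalty cannot catch up, because the best hypothesis could switch to the same model, pay one penalty, and still be ahead.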

41. A computer program tangibly stored on a computer-readable medium and operable to cause a computer to identify a block of text relating to a topic of interest, in a system comprising a plurality of language models, including a language model for a topic of interest, comprising instructions to:
- obtain a stream of text comprising text segments;
  
  score the text segments against the plurality of language models to generate language model scores for the segments of text; and
  
  identify a text segment from among the text segments as a block of text relating to the topic of interest if the score of the text segment against the language model for the topic of interest satisfies a predetermined condition.
  - 42. The computer program of claim 41, wherein the predetermined condition requires the score of the text segment against the language model for the topic of interest to differ from the lowest score among the scores of the text segment against the plurality of language models by less than a predetermined amount, or to be the lowest score.
  - 43. The computer program of claim 41, wherein the predetermined condition requires that the score of the text segment against the language model for the topic of interest be the lowest score among the scores of the text segment against the plurality of language models, and that the next lowest score among the scores of the text segment against the plurality of language models be greater than the score of the text segment against the language model for the topic of interest by more than a predetermined amount.
  - 44. The computer program of claim 43, wherein the predetermined amount is zero.

45. A method for identifying text relating to a topic of interest, in a system comprising a plurality of language models, lm_j, where j ranges from 1 to n, and n is a maximum number of language models, including a language model lm_t relating to a topic of interest t, the method comprising:
- obtaining a stream of text comprising text segments s_i, where i ranges from 1 to m, and m is a maximum number of text segments in the stream of text;
  
  scoring the text segments s_i against the plurality of language models lm_j to generate language model scores score_i,j for each of the segments of text s_i, where score_i,j is a score of text segment i of the stream of text against language model number j;
  
  for a text segment s_k from among the set of text segments s_i for i ∈ {1, m}, relating that text segment s_k to the topic of interest t if the score score_k,t of the text segment against the language model lm_t for the topic of interest t satisfies a predetermined condition.
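
The score matrix score_i,j of claim 45 can be illustrated with unigram language models (the model type named in claims 15 and 37). The claims do not say how scores are computed, so everything below is an assumption made for the sketch: whitespace tokenization, add-one smoothing with an `<unk>` slot, negative log-probability as the score (lower is better), and "lowest score in the row" as the predetermined condition. Indices are 0-based here, whereas the claim counts from 1. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def train_unigram(text: str, vocab: Set[str]) -> Dict[str, float]:
    """Add-one-smoothed unigram model over `vocab` plus an unknown-word slot."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + len(vocab) + 1
    lm = {w: (counts[w] + 1) / total for w in vocab}
    lm["<unk>"] = 1 / total
    return lm


def score(segment: str, lm: Dict[str, float]) -> float:
    """Negative log-probability of the segment under the model."""
    return -sum(math.log(lm.get(w, lm["<unk>"])) for w in segment.lower().split())


def score_matrix(segments: Sequence[str], lms: Sequence[Dict[str, float]]) -> List[List[float]]:
    """matrix[i][j] is the score of segment i against language model j."""
    return [[score(s, lm) for lm in lms] for s in segments]


def segments_on_topic(matrix: Sequence[Sequence[float]], t: int) -> List[int]:
    """Indices k whose score against model t is the lowest in their row."""
    return [k for k, row in enumerate(matrix) if row[t] == min(row)]


# Toy usage: model 0 is trained on finance text, model 1 on sports text.
training = ["stocks market shares rose", "team won the game score"]
vocab = set(" ".join(training).split())
lms = [train_unigram(text, vocab) for text in training]
matrix = score_matrix(["market shares rose", "the team won"], lms)
```

In this toy setup `segments_on_topic(matrix, 1)` selects the second segment, since its words are more probable under the sports model than under the finance model.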

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: Dragon Systems, Inc. (Microsoft Corporation)
Inventors: Yamron, Jonathan P.; Barnett, James; Bamberg, Paul G.; Gillick, Laurence S.; van Mulbregt, Paul A.
Primary Examiner(s): Isen, Forester W.
Assistant Examiner(s): Edouard, Patrick Nestor
Application Number: US08/978,487
Time in Patent Office: 875 Days
Field of Search: 704/1, 704/9, 704/10, 704/255, 704/257, 707/530, 707/531
US Class Current: 704/9
CPC Class Codes: G06F 16/313 Selection or weighting of t...
