Text segmentation and identification of topic using language models
Abstract
System for segmenting text and identifying segment topics that match a user-specified topic. Topic tracking system creates a set of topic models from training text containing topic boundaries using a clustering algorithm. User supplies topic text. System creates a topic model of the topic text and adds the topic model to the set of topic models. User-supplied test text is segmented according to the set of topic models. Segments relating to the same topic as the topic text are selected.
45 Claims
1. A method for segmenting a stream of text into segments using a plurality of language models, the stream of text including a sequence of blocks of text, the method comprising:
scoring the blocks of text against the language models to generate language model scores for the blocks of text, the language model score for a block of text against a language model indicating a correlation between the block of text and the language model;
generating language model sequence scores for different sequences of language models to which a sequence of blocks of text may correspond, a language model sequence score being a function of the scores of a sequence of blocks of text against a sequence of language models;
selecting a sequence of language models that satisfies a predetermined condition; and
identifying segment boundaries in the stream of text that correspond to language model transitions in the selected sequence of language models.
Dependent claims: 2-18.

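The four steps of claim 1 can be illustrated with a small dynamic-programming sketch. It is not taken from the patent: it assumes that each language model score is a cost (for example, a negative log-probability, so lower means a closer match), that the sequence score is the sum of the block costs plus a hypothetical `switch_penalty` for every model transition, and that the "predetermined condition" is the minimum total cost. The name `segment` and its parameters are illustrative only.

```python
def segment(costs, switch_penalty=2.0):
    """Pick the lowest-cost sequence of language models for a sequence of blocks.

    costs[t][j] is the (assumed) cost of block t against language model j.
    Returns (path, boundaries): the chosen model index per block, and the
    block indices at which the chosen model changes.
    """
    if not costs:
        return [], []
    n_models = len(costs[0])
    best = list(costs[0])   # best[j]: cheapest sequence ending in model j
    back = []               # back[t-1][j]: predecessor model for block t
    for t in range(1, len(costs)):
        new_best, ptr = [], []
        for j in range(n_models):
            # Cost of arriving in model j from each possible previous model.
            cands = [best[i] + (0.0 if i == j else switch_penalty)
                     for i in range(n_models)]
            i_min = min(range(n_models), key=cands.__getitem__)
            new_best.append(cands[i_min] + costs[t][j])
            ptr.append(i_min)
        best = new_best
        back.append(ptr)
    # Select the sequence with minimum total cost, then trace it backwards.
    j = min(range(n_models), key=best.__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    # Segment boundaries fall where the selected model changes.
    boundaries = [t for t in range(1, len(path)) if path[t] != path[t - 1]]
    return path, boundaries
```

With two models and four blocks whose costs are `[[1, 5], [1, 5], [5, 1], [5, 1]]`, the sketch assigns the first two blocks to model 0 and the last two to model 1, placing a single boundary before block 2. Raising `switch_penalty` makes transitions rarer and segments longer.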
19. A method for identifying a block of text as relating to a topic of interest, in a system comprising a plurality of language models, including a language model for the topic of interest, the method comprising:
obtaining a stream of text comprising text segments;
scoring the text segments against the plurality of language models to generate language model scores for the text segments;
identifying a text segment from among the text segments as a block of text relating to the topic of interest if the score of the text segment against the language model for the topic of interest satisfies a predetermined condition.
Dependent claims: 20-22.

23. A computer program tangibly stored on a computer-readable medium and operable to cause a computer to segment a stream of text into segments using a plurality of language models, the stream of text including a sequence of blocks of text, comprising instructions to:
score the blocks of text against the language models to generate language model scores for the blocks of text, the language model score for a block of text against a language model indicating a correlation between the block of text and the language model;
generate language model sequence scores for different sequences of language models to which a sequence of blocks of text may correspond, a language model sequence score being a function of the scores of a sequence of blocks of text against a sequence of language models;
select a sequence of language models based on a predetermined condition; and
identify segment boundaries in the stream of text that correspond to language model transitions in the selected sequence of language models.
Dependent claims: 24-40.

41. A computer program tangibly stored on a computer-readable medium and operable to cause a computer to identify a block of text relating to a topic of interest, in a system comprising a plurality of language models, including a language model for a topic of interest, comprising instructions to:
obtain a stream of text comprising text segments;
score the text segments against the plurality of language models to generate language model scores for the segments of text; and
identify a text segment from among the text segments as a block of text relating to the topic of interest if the score of the text segment against the language model for the topic of interest satisfies a predetermined condition.
Dependent claims: 42-44.

45. A method for identifying text relating to a topic of interest, in a system comprising a plurality of language models, $lm_j$, where $j$ ranges from 1 to $n$, and $n$ is a maximum number of language models, including a language model $lm_t$ relating to a topic of interest $t$, the method comprising:
obtaining a stream of text comprising text segments $s_i$, where $i$ ranges from 1 to $m$, and $m$ is a maximum number of text segments in the stream of text;
scoring the text segments $s_i$ against the plurality of language models $lm_j$ to generate language model scores $\mathrm{score}_{i,j}$ for each of the segments of text $s_i$, where $\mathrm{score}_{i,j}$ is a score of text segment $i$ of the stream of text against language model number $j$;
for a text segment $s_k$ from among the set of text segments $s_i$ for $i \in \{1, \ldots, m\}$, relating that text segment $s_k$ to the topic of interest $t$ if the score $\mathrm{score}_{k,t}$ of the text segment against the language model $lm_t$ for the topic of interest $t$ satisfies a predetermined condition.

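The topic-identification claims (19, 41 and 45) can be illustrated with a short sketch. None of it comes from the patent text: it assumes smoothed unigram language models, a score defined as the average negative log-probability per in-vocabulary word (lower means a closer match), and a "predetermined condition" that holds when the topic model scores at least as well as every other model. The function names `train_unigram`, `score` and `segments_on_topic` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(text, vocab, alpha=1.0):
    """Build an add-alpha smoothed unigram model over a fixed vocabulary."""
    counts = Counter(w for w in text.lower().split() if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def score(segment, model):
    """Average negative log-probability per in-vocabulary word (lower is better)."""
    words = [w for w in segment.lower().split() if w in model]
    if not words:
        return math.inf  # no evidence at all for this segment
    return -sum(math.log(model[w]) for w in words) / len(words)


def segments_on_topic(segments, models, topic):
    """Return indices k of segments whose score against models[topic]
    satisfies the assumed condition: no other model scores better."""
    hits = []
    for k, seg in enumerate(segments):
        row = {name: score(seg, lm) for name, lm in models.items()}
        if not math.isinf(row[topic]) and row[topic] == min(row.values()):
            hits.append(k)
    return hits
```

Here `row` plays the role of one row of the score matrix in claim 45, and `row[topic]` corresponds to the score of segment $k$ against the topic model. A fixed threshold on `row[topic]`, or a required margin over the next-best model, would be equally valid readings of the "predetermined condition".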
Specification