Systems and methods for determining the topic structure of a portion of text
Abstract
Systems and methods for determining the topic structure of a document including text utilize a Probabilistic Latent Semantic Analysis (PLSA) model and select segmentation points based on similarity values between pairs of adjacent text blocks. PLSA forms a framework for both text segmentation and topic identification. The use of PLSA provides an improved representation for the sparse information in a text block, such as a sentence or a sequence of sentences. Topic characterization of each text segment is derived from PLSA parameters that relate words to “topics”, latent variables in the PLSA model, and “topics” to text segments. A system executing the method exhibits significant performance improvement. Once determined, the topic structure of a document may be employed for document retrieval and/or document summarization.
33 Claims
1. A method for determining the topic structure of a portion of text, comprising:
identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
applying a folding-in process to each text block to determine a distribution of probabilities over a plurality of latent variables for each text block;
using the determined distributions to estimate a distribution of words for each text block;
making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values; and
selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.
(Dependent claims: 2-21)
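
The steps of claim 1 can be illustrated with a minimal Python sketch. It assumes a PLSA model is already available as a list of per-topic word probabilities P(w|z); the toy model values, the cosine similarity metric, the local-minimum-below-threshold selection rule, and all function names are assumptions made for illustration and are not taken from the claim.

```python
import math
from collections import Counter


def fold_in(counts, p_w_given_z, iterations=50):
    """Estimate P(z | block) by EM while keeping the trained P(w | z) fixed."""
    num_topics = len(p_w_given_z)
    p_z = [1.0 / num_topics] * num_topics  # uniform starting point
    for _ in range(iterations):
        expected = [0.0] * num_topics
        for word, n in counts.items():
            # E-step: posterior over latent variables for this word.
            joint = [p_w_given_z[z].get(word, 0.0) * p_z[z] for z in range(num_topics)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # word has no mass under any topic; skip it
            for z in range(num_topics):
                expected[z] += n * joint[z] / norm
        mass = sum(expected)
        if mass == 0.0:
            break  # nothing in the block is known to the model
        # M-step: renormalise the expected counts into P(z | block).
        p_z = [value / mass for value in expected]
    return p_z


def word_distribution(p_z, p_w_given_z, vocabulary):
    """P(w | block) = sum over z of P(w | z) * P(z | block)."""
    return [sum(p_w_given_z[z].get(w, 0.0) * p_z[z] for z in range(len(p_z)))
            for w in vocabulary]


def cosine(p, q):
    """Cosine similarity (an assumed choice of similarity metric)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def select_boundaries(similarities, threshold):
    """Keep candidate i (between blocks i and i+1) at local minima below threshold."""
    chosen = []
    for i, value in enumerate(similarities):
        left = similarities[i - 1] if i > 0 else math.inf
        right = similarities[i + 1] if i + 1 < len(similarities) else math.inf
        if value < threshold and value <= left and value <= right:
            chosen.append(i)
    return chosen


def segment(blocks, p_w_given_z, threshold=0.5):
    """Run fold-in, word estimation, comparison, and selection over text blocks."""
    vocabulary = sorted({w for topic in p_w_given_z for w in topic})
    dists = []
    for block in blocks:
        counts = Counter(block.lower().split())
        p_z = fold_in(counts, p_w_given_z)
        dists.append(word_distribution(p_z, p_w_given_z, vocabulary))
    similarities = [cosine(dists[i], dists[i + 1]) for i in range(len(dists) - 1)]
    return select_boundaries(similarities, threshold)


# Toy model with two latent variables (illustrative values, not trained).
toy_model = [{"cat": 0.5, "dog": 0.5}, {"stock": 0.5, "bond": 0.5}]
toy_blocks = ["cat dog", "dog cat cat", "stock bond", "bond bond stock"]
boundaries = segment(toy_blocks, toy_model)
```

In this toy run the only selected candidate is the one between the second and third blocks, where the estimated word distributions stop overlapping.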
22. A method for determining the topic structure of a portion of text, comprising:
applying a folding-in process to each of a plurality of segments of the portion of text to determine a probability distribution over latent variables for each segment;
using each determined distribution to estimate a distribution of words for each segment; and
identifying at least one topic for each segment based on the distribution of words for each segment.
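
As a hedged illustration of claim 22, the sketch below folds a segment into a fixed PLSA model, estimates P(w|segment), and reports the highest-probability words as a stand-in for the segment's topic. Using top-ranked words as the topic label, the toy model values, and the names `fold_in` and `top_words` are assumptions for this example only.

```python
from collections import Counter


def fold_in(counts, p_w_given_z, iterations=50):
    """Estimate P(z | segment) by EM while keeping the trained P(w | z) fixed."""
    num_topics = len(p_w_given_z)
    p_z = [1.0 / num_topics] * num_topics
    for _ in range(iterations):
        expected = [0.0] * num_topics
        for word, n in counts.items():
            joint = [p_w_given_z[z].get(word, 0.0) * p_z[z] for z in range(num_topics)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # word unknown to the model
            for z in range(num_topics):
                expected[z] += n * joint[z] / norm
        mass = sum(expected)
        if mass == 0.0:
            break
        p_z = [value / mass for value in expected]
    return p_z


def top_words(segment_text, p_w_given_z, k=3):
    """Rank vocabulary words by the estimated P(w | segment) and keep the top k."""
    counts = Counter(segment_text.lower().split())
    p_z = fold_in(counts, p_w_given_z)
    vocabulary = {w for topic in p_w_given_z for w in topic}
    p_w = {w: sum(p_w_given_z[z].get(w, 0.0) * p_z[z] for z in range(len(p_z)))
           for w in vocabulary}
    # Sort by descending probability, breaking ties alphabetically.
    return sorted(p_w, key=lambda w: (-p_w[w], w))[:k]


# Toy model with two latent variables (illustrative values, not trained).
toy_model = [{"cat": 0.6, "dog": 0.3, "pet": 0.1}, {"stock": 0.7, "bond": 0.3}]
keywords = top_words("cat cat dog", toy_model)
```

Here `keywords` includes "pet" even though that word never occurs in the segment, because the estimated word distribution draws on the latent variable rather than on raw counts alone.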
23. A system for determining the topic structure of a portion of text, comprising:
an input device for inputting a portion of a text; and
at least one processor for applying a folding-in process to each of a plurality of segments of the portion of text to determine a probability distribution over latent variables for each segment, estimating a distribution of words for each segment based on the determined probability distribution and identifying at least one topic for each segment based on the distribution of words for each segment.
(Dependent claim: 24)
25. A method for determining the topic structure of a portion of text, comprising:
identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
determining a plurality of PLSA models;
applying a folding-in process with each model to each text block to determine a distribution of probabilities over a plurality of latent variables for each text block;
using the determined distributions to estimate a distribution of words for each text block;
averaging the word distributions for each text block;
making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values based on the averaged word distributions; and
selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.
(Dependent claims: 26-28)
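
The model-combination step of claim 25 can be sketched as follows. The example assumes the per-model word distributions for each text block have already been estimated and are stored as equal-length lists over a shared vocabulary; the cosine metric, the data layout, and the function names are assumptions for illustration.

```python
import math


def average_distributions(per_model_dists):
    """Element-wise mean of several word distributions for one text block."""
    count = len(per_model_dists)
    return [sum(values) / count for values in zip(*per_model_dists)]


def cosine(p, q):
    """Cosine similarity (an assumed choice of similarity metric)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def similarities_from_averaged(blocks_by_model):
    """blocks_by_model[m][b] is the word distribution of block b under model m."""
    num_blocks = len(blocks_by_model[0])
    # Average across models first, then compare adjacent blocks.
    averaged = [average_distributions([model[b] for model in blocks_by_model])
                for b in range(num_blocks)]
    return [cosine(averaged[b], averaged[b + 1]) for b in range(num_blocks - 1)]


# Two hypothetical models, two text blocks, two-word vocabulary.
model_a = [[0.75, 0.25], [0.25, 0.75]]
model_b = [[0.25, 0.75], [0.25, 0.75]]
sims = similarities_from_averaged([model_a, model_b])
```

Averaging the distributions before comparison yields one similarity value per candidate point, computed from the combined view of all models.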
29. A method for determining the topic structure of a portion of text, comprising:
identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
determining a plurality of PLSA models;
applying a folding-in process with each model to each text block to determine a distribution of probabilities over a plurality of latent variables for each text block;
using the determined distributions to estimate a distribution of words for each text block;
making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values;
averaging the similarity values; and
selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.
(Dependent claims: 30-32)
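
Claim 29 combines models at a later stage than claim 25: each model produces its own similarity values, and those values are averaged. A minimal sketch, assuming the per-model similarity values are already computed (the input numbers and function name are hypothetical):

```python
def average_similarities(per_model_similarities):
    """Mean similarity at each candidate point across models.

    per_model_similarities[m][i] is the similarity between blocks i and i+1
    as computed with model m.
    """
    count = len(per_model_similarities)
    return [sum(values) / count for values in zip(*per_model_similarities)]


# Hypothetical similarity values from two models over three candidate points.
averaged = average_similarities([[1.0, 0.25, 0.75], [0.5, 0.25, 0.75]])

# The candidate with the lowest averaged similarity is the strongest boundary.
lowest = min(range(len(averaged)), key=averaged.__getitem__)
```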
33. A method for determining the number of training iterations for obtaining a PLSA model, comprising:
determining a likelihood value according to the formula for each iteration;
comparing the likelihood value of a current iteration to the likelihood value of a previous iteration; and
determining the number of iterations as the current iteration when the likelihood value of the current iteration is within a selected threshold value of the likelihood value of the previous iteration.
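
The stopping rule of claim 33 can be sketched independently of how the likelihood itself is computed. The example below takes the per-iteration likelihood values as given inputs; the function name, the absolute-difference test, and the sample numbers are assumptions for illustration.

```python
def choose_num_iterations(likelihoods, threshold):
    """Return the 1-based iteration at which the likelihood stops changing.

    likelihoods[0] is the value after iteration 1, likelihoods[1] after
    iteration 2, and so on. Returns None if no iteration qualifies.
    """
    for index in range(1, len(likelihoods)):
        # Compare the current iteration with the previous one.
        if abs(likelihoods[index] - likelihoods[index - 1]) <= threshold:
            return index + 1
    return None


# Hypothetical likelihood values recorded over five training iterations.
history = [-100.0, -80.0, -75.0, -74.5, -74.4]
stop_at = choose_num_iterations(history, threshold=1.0)
```

With these sample values the change first falls within the threshold at the fourth iteration, so training would stop there.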
Specification