Systems and methods for determining the topic structure of a portion of text
First Claim
1. A computerized method for determining the topic structure of a portion of text, comprising:
identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
determining a distribution of probabilities over a plurality of latent classes for each text block;
using the determined distributions to estimate a distribution of words for each text block;
making comparisons of the distribution of words in adjacent text blocks using a similarity metric to determine similarity values;
smoothing the similarity values after making comparisons; and
selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.
Abstract
Systems and methods for determining the topic structure of a document including text utilize a Probabilistic Latent Semantic Analysis (PLSA) model and select segmentation points based on similarity values between pairs of adjacent text blocks. PLSA forms a framework for both text segmentation and topic identification. The use of PLSA provides an improved representation for the sparse information in a text block, such as a sentence or a sequence of sentences. Topic characterization of each text segment is derived from PLSA parameters that relate words to “topics”, latent variables in the PLSA model, and “topics” to text segments. A system executing the method exhibits significant performance improvement. Once determined, the topic structure of a document may be employed for document retrieval and/or document summarization.
45 Citations
25 Claims
-
1. A computerized method for determining the topic structure of a portion of text, comprising:
-
identifying candidate segmentation points of the portion of text corresponding to locations between text blocks; determining a distribution of probabilities over a plurality of latent classes for each text block; using the determined distributions to estimate a distribution of words for each text block; making comparisons of the distribution of words in adjacent text blocks using a similarity metric to determine similarity values; smoothing the similarity values after making comparisons; and selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments. (Dependent claims: 2, 3, 4)
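The smoothing step of claim 1 can be illustrated with a short sketch. The claim does not say which filter is used, so the centered moving average below, the function name `smooth`, and the `radius` parameter are assumptions made only for illustration.

```python
def smooth(values, radius=1):
    """Centered moving average over the similarity values.

    Assumption: the claim only says the values are smoothed; a moving
    average is one simple choice. The window is truncated at both ends.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical similarity values between six adjacent text blocks.
similarities = [0.9, 0.8, 0.2, 0.85, 0.9]
print(smooth(similarities))
```

Smoothing before selecting segmentation points reduces the chance that a single noisy comparison between two short text blocks is mistaken for a topic change.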
-
-
5. A computerized method for determining the topic structure of a portion of text, comprising:
-
identifying candidate segmentation points of the portion of text corresponding to locations between text blocks; determining a distribution of probabilities over a plurality of latent classes for each text block; using the determined distributions to estimate a distribution of words for each text block; making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values; and selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments; wherein making comparisons of the distributions of words in adjacent text blocks using a similarity metric is based on at least one of a Hellinger distance or a Jensen-Shannon divergence.
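The two measures named in claim 5 have standard definitions, sketched below for word distributions stored as equal-length lists of probabilities. The function names and the base-2 logarithm are choices made for this sketch. How a distance is turned into a similarity value is not stated in the claim; subtracting it from 1 is one possible convention.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions (range 0..1)."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p(w) = 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits (range 0..1 with log base 2)."""
    # The mixture m is positive wherever p or q is, so no division by zero.
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical word distributions for two adjacent text blocks.
block_a = [0.7, 0.2, 0.1]
block_b = [0.1, 0.3, 0.6]
print(hellinger_distance(block_a, block_b))
print(jensen_shannon_divergence(block_a, block_b))
```

Both measures are symmetric and bounded, and both remain finite when a word has zero probability in one block.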
-
-
6. A computerized method for determining the topic structure of a portion of text having one or more segments, comprising:
-
applying a folding-in process to each segment to determine a probability distribution over latent classes for each segment; using each determined distribution to estimate a distribution of word groups for each segment; and identifying at least one topic for each segment based on the distribution of word groups for the segment; wherein identifying at least one topic for each segment is also based on at least one of a measure of an occurrence of word groups in each segment, term vectors for each segment, on parts of speech of the word groups in the segments, and on mutual information between word groups in each segment and each segment. (Dependent claims: 7, 8, 9, 10, 11, 12, 13, 14)
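One simple way to read the topic identification step of claim 6 is to report the most probable word groups of each segment as its topic keywords. The sketch below assumes the estimated distribution is a dictionary from word group to probability. The function name `top_word_groups` and the cutoff `k` are hypothetical, and the additional criteria listed in the claim (occurrence counts, term vectors, parts of speech, mutual information) are not modeled.

```python
def top_word_groups(distribution, k=3):
    """Return the k word groups with the highest estimated probability.

    Assumption: ranking by probability alone; ties are broken
    alphabetically so the result is deterministic.
    """
    ranked = sorted(distribution.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical estimated distribution of word groups for one segment.
segment_distribution = {"interest rate": 0.30, "bank": 0.25, "goal": 0.05, "loan": 0.40}
print(top_word_groups(segment_distribution, k=2))
```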
-
-
15. A computerized method for retrieving a portion of text, comprising:
-
determining topic structures of a plurality of portions of text by: determining a distribution of probabilities over a plurality of latent classes for each text block; using the determined distributions to estimate a distribution of words for each text block; making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values; and selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments; and retrieving at least one of the plurality of portions of text using the topic structures of the portions of text. (Dependent claims: 16)
-
-
17. A system for determining the topic structure of a portion of text, comprising:
-
an input device for inputting a portion of a text; and at least one processor for applying a folding-in process to each of a plurality of segments of the portion of text to determine a probability distribution over latent classes for each segment, estimating a distribution of word groups for each segment based on the determined probability distribution and identifying at least one topic for each segment based on the distribution of word groups for each segment, wherein the at least one processor identifies the plurality of segments of the portion of text.
-
-
18. A computerized method for determining the topic structure of a portion of text, comprising:
-
identifying candidate segmentation points of the portion of text corresponding to locations between text blocks; determining a plurality of PLSA models; applying a folding-in process with each model to each text block to determine a distribution of probabilities over a plurality of latent classes for each text block; using the determined distributions to estimate a plurality of word group distributions for each text block; averaging the plurality of word group distributions for each text block; making comparisons of the plurality of word group distributions in adjacent text blocks using a similarity metric to determine similarity values based on the averaged plurality of word group distributions; and selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments; wherein determining a plurality of PLSA models uses at least one of different initializations of the models prior to training each model and different numbers of latent variables.
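The averaging step of claim 18 combines, for one text block, the word group distributions estimated by several PLSA models. A minimal sketch follows, assuming each distribution is a dictionary from word group to probability and that the average is an unweighted arithmetic mean; the claim does not specify a weighting. The function name `average_distributions` is hypothetical.

```python
def average_distributions(distributions):
    """Unweighted mean of several word group distributions for one text block.

    Assumption: a word group missing from a model's distribution is
    treated as having probability 0 under that model.
    """
    vocabulary = set().union(*distributions)
    count = len(distributions)
    return {
        word: sum(dist.get(word, 0.0) for dist in distributions) / count
        for word in vocabulary
    }


# Hypothetical estimates for the same text block from two PLSA models.
model_a = {"bank": 0.6, "loan": 0.4}
model_b = {"bank": 0.2, "loan": 0.8}
print(average_distributions([model_a, model_b]))
```

Because each input sums to 1, the unweighted mean also sums to 1 and can be passed directly to the similarity metric.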
-
-
19. A computerized method for determining the topic structure of a portion of text, comprising:
-
identifying candidate segmentation points of the portion of text corresponding to locations between text blocks; determining a plurality of PLSA models; applying a folding-in process with each model to each text block to determine a distribution of probabilities over a plurality of latent classes for each text block; using the determined distributions to estimate a plurality of word group distributions for each text block; averaging the plurality of word group distributions for each text block; smoothing the similarity values after averaging; making comparisons of the plurality of word group distributions in adjacent text blocks using a similarity metric to determine similarity values based on the averaged plurality of word group distributions; and selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.
-
-
20. A computerized method for determining the topic structure of a portion of text, comprising:
-
identifying candidate segmentation points of the portion of text corresponding to locations between text blocks; determining a plurality of PLSA models; applying a folding-in process with each model to each text block to determine a distribution of probabilities over a plurality of latent classes for each text block; using the determined distributions to estimate a plurality of word group distributions for each text block; making comparisons of the plurality of word group distributions in adjacent text blocks using a similarity metric to determine similarity values; averaging the similarity values; and selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments; wherein determining a plurality of PLSA models uses at least one of different initializations of the models prior to training each model and different numbers of latent variables.
-
-
21. A computerized method for determining the topic structure of a portion of text, comprising:
-
identifying candidate segmentation points of the portion of text corresponding to locations between text blocks; determining a plurality of PLSA models; applying a folding-in process with each model to each text block to determine a distribution of probabilities over a plurality of latent classes for each text block; using the determined distributions to estimate a plurality of word group distributions for each text block; making comparisons of the plurality of word group distributions in adjacent text blocks using a similarity metric to determine similarity values; averaging the similarity values; smoothing the similarity values after averaging; and selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.
-
-
22. A computerized method for determining the number of training iterations for obtaining a PLSA model, comprising:
determining a likelihood value according to the formula
-
23. A computerized method for determining the topic structure of a portion of text, comprising:
-
identifying candidate segmentation points of the portion of text corresponding to locations between text blocks; determining a distribution of probabilities over a plurality of latent classes for each text block; using the determined distributions to estimate a distribution of words for each text block; making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values; and selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments; wherein determining a distribution of probabilities is performed by applying a folding-in process to each text block.
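Folding-in, as named in claim 23, is commonly described as running the PLSA expectation-maximization updates for a new text block while the trained word probabilities P(w|z) stay fixed. The sketch below follows that common description and is not taken from the specification. The names `fold_in` and `word_distribution`, the uniform initialization, and the fixed iteration count are assumptions.

```python
def fold_in(counts, p_w_given_z, iterations=50):
    """Estimate P(z | block) for a new text block with P(w | z) held fixed.

    counts: dict mapping word -> count in the text block.
    p_w_given_z: list with one dict per latent class, mapping word -> P(w | z).
    """
    num_classes = len(p_w_given_z)
    p_z = [1.0 / num_classes] * num_classes  # assumed uniform start
    for _ in range(iterations):
        new_p_z = [0.0] * num_classes
        for word, n in counts.items():
            # E-step: posterior over latent classes for this word.
            joint = [p_w_given_z[z].get(word, 0.0) * p_z[z] for z in range(num_classes)]
            total = sum(joint)
            if total == 0.0:
                continue  # word has no probability under any class; skip it
            # M-step accumulation: expected counts per latent class.
            for z in range(num_classes):
                new_p_z[z] += n * joint[z] / total
        norm = sum(new_p_z)
        if norm == 0.0:
            break  # no usable words; keep the previous estimate
        p_z = [value / norm for value in new_p_z]
    return p_z


def word_distribution(p_z, p_w_given_z):
    """Estimate P(w | block) = sum over z of P(w | z) * P(z | block)."""
    vocabulary = set().union(*p_w_given_z)
    return {
        word: sum(p_w_given_z[z].get(word, 0.0) * p_z[z] for z in range(len(p_z)))
        for word in vocabulary
    }


# Hypothetical trained model with two latent classes.
model = [
    {"goal": 0.5, "match": 0.5},
    {"stock": 0.5, "market": 0.5},
]
block_counts = {"goal": 2, "match": 1}
p_z = fold_in(block_counts, model)
print(p_z)
print(word_distribution(p_z, model))
```

The estimated word distribution assigns probability to every word the latent classes support, not only to words that occur in the block, which is how the abstract's "improved representation for the sparse information in a text block" can be understood.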
-
-
24. A computerized method for determining the topic structure of a portion of text, comprising:
-
identifying candidate segmentation points of the portion of text corresponding to locations between text blocks; determining a distribution of probabilities over a plurality of latent classes for each text block; using the determined distributions to estimate a distribution of words for each text block; making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values; determining local dips in the similarity values after making comparisons; and selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.
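Claim 24 selects segmentation points after finding local dips in the similarity values. The claim does not define a dip precisely; the sketch below assumes a dip is an interior position whose value is strictly lower than both neighbors. The function name `local_dips` is hypothetical.

```python
def local_dips(similarities):
    """Return the indices of strict local minima, ignoring both endpoints.

    Assumption: index i refers to the candidate segmentation point between
    text block i and text block i + 1.
    """
    return [
        i
        for i in range(1, len(similarities) - 1)
        if similarities[i] < similarities[i - 1]
        and similarities[i] < similarities[i + 1]
    ]


# Hypothetical similarity values at seven candidate segmentation points.
similarities = [0.9, 0.8, 0.2, 0.85, 0.9, 0.4, 0.7]
print(local_dips(similarities))
```

A practical system would probably also require a minimum dip depth so that shallow fluctuations are not reported as topic boundaries; that threshold is left out here because the claim does not mention one.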
-
-
25. A computerized method for determining the topic structure of a portion of text, comprising:
-
identifying candidate segmentation points of the portion of text corresponding to locations between text blocks; determining a distribution of probabilities over a plurality of latent classes for each text block; using the determined distributions to estimate a distribution of words for each text block; making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values; and selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments; wherein making comparisons of the distribution of words in adjacent text blocks using a similarity metric is based on a cosine distance or a variational distance.
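The two measures named in claim 25 can be sketched for word distributions stored as equal-length lists. Cosine distance is taken here as one minus the cosine of the angle between the vectors, and variational distance as the sum of absolute differences (the L1 distance); some texts halve that sum, so the scaling is an assumption. The function names are chosen for this sketch.

```python
import math


def cosine_distance(p, q):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_p * norm_q)


def variational_distance(p, q):
    """Sum of absolute differences (L1); ranges from 0 to 2 for distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical word distributions for two adjacent text blocks.
block_a = [0.7, 0.2, 0.1]
block_b = [0.1, 0.3, 0.6]
print(cosine_distance(block_a, block_b))
print(variational_distance(block_a, block_b))
```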
-
Specification