Systems and methods for determining the topic structure of a portion of text

US 7,130,837 B2
Filed: 03/22/2002
Issued: 10/31/2006
Est. Priority Date: 03/22/2002
Status: Active Grant

Abstract

Systems and methods for determining the topic structure of a document including text utilize a Probabilistic Latent Semantic Analysis (PLSA) model and select segmentation points based on similarity values between pairs of adjacent text blocks. PLSA forms a framework for both text segmentation and topic identification. The use of PLSA provides an improved representation for the sparse information in a text block, such as a sentence or a sequence of sentences. Topic characterization of each text segment is derived from PLSA parameters that relate words to “topics”, latent variables in the PLSA model, and “topics” to text segments. A system executing the method exhibits significant performance improvement. Once determined, the topic structure of a document may be employed for document retrieval and/or document summarization.
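
The sparse-block representation described in the abstract can be illustrated with the standard PLSA mixture, in which a block's word distribution is P(w|block) = sum over z of P(w|z) P(z|block). The sketch below assumes that standard form; the function name and the toy two-class model are hypothetical and are not taken from the patent.

```python
def word_distribution(p_w_given_z, p_z_given_block):
    """Mix per-class word distributions: P(w|b) = sum_z P(w|z) * P(z|b)."""
    vocab_size = len(p_w_given_z[0])
    return [
        sum(p_z * p_w_given_z[z][w] for z, p_z in enumerate(p_z_given_block))
        for w in range(vocab_size)
    ]


# Hypothetical toy model: 2 latent classes over a 3-word vocabulary.
P_W_GIVEN_Z = [
    [0.5, 0.5, 0.0],  # class 0 never emits word 2
    [0.0, 0.5, 0.5],  # class 1 never emits word 0
]

# A block that is an even mix of both classes gets mass on all three words,
# even if it literally contained only one of them.
mixed = word_distribution(P_W_GIVEN_Z, [0.5, 0.5])
```

Because every word receives probability through the latent classes, two short blocks can be compared even when they share few literal words.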

25 Claims

1. A computerized method for determining the topic structure of a portion of text, comprising:
- identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
  
  determining a distribution of probabilities over a plurality of latent classes for each text block;
  
  using the determined distributions to estimate a distribution of words for each text block;
  
  making comparisons of the distribution of words in adjacent text blocks using a similarity metric to determine similarity values;
  
  smoothing the similarity values after making comparisons; and
  
  selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.
- Dependent claims (2, 3, 4):
  - 2. The method of claim 1, wherein smoothing the similarity values is based on a weighted sum.
  - 3. The method of claim 1, wherein smoothing the similarity values is based on a geometric mean.
  - 4. The method of claim 1, wherein smoothing the similarity values is based on an n-point median smoother.
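
The three smoothing options named in claims 2 to 4 can be sketched as sliding-window filters over the sequence of similarity values. The window size, the default weights, and the handling of the sequence ends (shrinking the window) are assumptions made for illustration; the claims do not specify them.

```python
import math
import statistics


def _window(values, i, n):
    """Return up to n values centred on index i, clipped at the ends."""
    half = n // 2
    return values[max(0, i - half): i + half + 1]


def weighted_sum_smooth(values, weights=(0.25, 0.5, 0.25)):
    """Weighted sum over a centred window; weights are renormalised at the ends."""
    half = len(weights) // 2
    smoothed = []
    for i in range(len(values)):
        total, norm = 0.0, 0.0
        for k, weight in enumerate(weights):
            j = i + k - half
            if 0 <= j < len(values):
                total += weight * values[j]
                norm += weight
        smoothed.append(total / norm)
    return smoothed


def geometric_mean_smooth(values, n=3):
    """Geometric mean over a centred window; assumes strictly positive values."""
    smoothed = []
    for i in range(len(values)):
        window = _window(values, i, n)
        smoothed.append(math.exp(sum(math.log(v) for v in window) / len(window)))
    return smoothed


def median_smooth(values, n=3):
    """n-point median smoother over a centred window."""
    return [statistics.median(_window(values, i, n)) for i in range(len(values))]
```

A median smoother suppresses a single outlying similarity value without blurring a genuine, wider dip, whereas the weighted sum and geometric mean spread every value into its neighbours.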

5. A computerized method for determining the topic structure of a portion of text, comprising:
- identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
  
  determining a distribution of probabilities over a plurality of latent classes for each text block;
  
  using the determined distributions to estimate a distribution of words for each text block;
  
  making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values; and
  
  selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments;
  
  wherein making comparisons of the distributions of words in adjacent text blocks using a similarity metric is based on at least one of a Hellinger distance or a Jensen-Shannon divergence.
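
The two metrics named in claim 5 have standard textbook definitions, sketched below for two discrete distributions over the same vocabulary. The normalisation of the Hellinger distance (the factor 1/2 under the square root, giving a range of 0 to 1) and the use of base-2 logarithms are common conventions assumed here, not details confirmed by the patent text.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def _kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p(w) == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits, in [0, 1]; symmetric in p and q."""
    midpoint = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl_divergence(p, midpoint) + 0.5 * _kl_divergence(q, midpoint)
```

Both functions return 0 for identical distributions and 1 for distributions with disjoint support, so a similarity value can be taken as one minus the result.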

6. A computerized method for determining the topic structure of a portion of text having one or more segments, comprising:
- applying a folding-in process to each segment to determine a probability distribution over latent classes for each segment;
  
  using each determined distribution to estimate a distribution of word groups for each segment; and
  
  identifying at least one topic for each segment based on the distribution of word groups for the segment;
  
  wherein identifying at least one topic for each segment is also based on at least one of a measure of an occurrence of word groups in each segment, term vectors for each segment, on parts of speech of the word groups in the segments, and on mutual information between word groups in each segment and each segment.
- Dependent claims (7, 8, 9, 10, 11, 12, 13, 14):
  - 7. The method of claim 6, wherein, when identifying at least one topic for each segment is also based on the measure of the occurrence of word groups in each segment, identifying at least one topic for each segment is based on the distribution of word groups for each segment and on inverse segment frequency of word groups in the segments.
  - 8. The method of claim 6, wherein, when identifying at least one topic for each segment is also based on the term vectors for each segment, identifying at least one topic for each segment is also based on a measure of an occurrence of word groups in the segments.
  - 9. The method of claim 8, wherein identifying at least one topic for each segment is based on the distribution of word groups for each segment and an inverse segment frequency of word groups in the segments.
  - 10. The method of claim 6, wherein, when identifying at least one topic for each segment is also based on the parts of speech of the word groups in the segments, identifying at least one topic for each segment is also based on a measure of an occurrence of word groups in each segment.
  - 11. The method of claim 10, wherein identifying at least one topic for each segment is based on the distribution of word groups for each segment and an inverse segment frequency of word groups in the segments.
  - 12. The method of claim 6, wherein when identifying at least one topic for each segment is also based on the mutual information between word groups in each segment and each segment, identifying at least one topic for each segment is also based on a measure of an occurrence of word groups in each segment.
  - 13. The method of claim 11, wherein identifying at least one topic for each segment is based on the distribution of word groups for each segment and an inverse segment frequency of word groups in the segments.
  - 14. The method of claim 6, wherein each word group includes one or more words.
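
Claims 7, 9, 11 and 13 combine the per-segment word-group distribution with an inverse segment frequency. The claims name these two ingredients but not how they are combined; the sketch below assumes a TF-IDF-style product, P(w|segment) * log(N / sf(w)), where sf(w) is the number of segments containing w. All function names and the toy data are hypothetical.

```python
import math


def inverse_segment_frequency(segments):
    """Map each word group to log(N / number of segments containing it)."""
    num_segments = len(segments)
    containing = {}
    for segment in segments:
        for word in set(segment):
            containing[word] = containing.get(word, 0) + 1
    return {word: math.log(num_segments / n) for word, n in containing.items()}


def top_topic_terms(p_w_given_segment, isf, k=2):
    """Rank word groups by P(w|segment) * isf(w); ties are broken alphabetically."""
    scores = {w: p * isf.get(w, 0.0) for w, p in p_w_given_segment.items()}
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical two-segment example.
SEGMENTS = [["the", "cat", "purrs"], ["the", "dog", "barks"]]
ISF = inverse_segment_frequency(SEGMENTS)
topics = top_topic_terms({"the": 0.6, "cat": 0.3, "purrs": 0.1}, ISF)
```

A word group that occurs in every segment, such as "the" here, gets an inverse segment frequency of zero and so never surfaces as a topic, however probable it is within the segment.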

15. A computerized method for retrieving a portion of text, comprising:
- determining topic structures of a plurality of portions of text by:
  
  determining a distribution of probabilities over a plurality of latent classes for each text block;
  
  using the determined distributions to estimate a distribution of words for each text block;
  
  making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values; and
  
  selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments; and
  
  retrieving at least one of the plurality of portions of text using the topic structures of the portions of text.
- Dependent claims (16):
  - 16. The method of claim 15, wherein retrieving at least one of the plurality of portions of text using the topic structures of the portions of text is based on at least one key word.
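
One simple way to picture the key-word retrieval of claim 16 is a lookup against the topic terms already identified for each segment. The index layout (text id mapped to a list of per-segment topic-term sets) and the exact-match rule are assumptions chosen for illustration; the claim does not say how the topic structure is stored or matched.

```python
def retrieve(topic_index, keyword):
    """Return ids of texts having any segment whose topic terms include keyword."""
    key = keyword.lower()
    return sorted(
        text_id
        for text_id, segment_topics in topic_index.items()
        if any(key in topics for topics in segment_topics)
    )


# Hypothetical index: each text maps to one set of topic terms per segment.
TOPIC_INDEX = {
    "report_a": [{"plsa", "segmentation"}, {"retrieval"}],
    "report_b": [{"summarization"}],
}
```

Matching against per-segment topics rather than whole-document text lets a query hit a document in which the key word characterises only one segment.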

17. A system for determining the topic structure of a portion of text, comprising:
- an input device for inputting a portion of a text; and
  
  at least one processor for applying a folding-in process to each of a plurality of segments of the portion of text to determine a probability distribution over latent classes for each segment, estimating a distribution of word groups for each segment based on the determined probability distribution and identifying at least one topic for each segment based on the distribution of word groups for each segment, wherein the at least one processor identifies the plurality of segments of the portion of text.

18. A computerized method for determining the topic structure of a portion of text, comprising:
- identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
  
  determining a plurality of PLSA models;
  
  applying a folding-in process with each model to each text block to determine a distribution of probabilities over a plurality of latent classes for each text block;
  
  using the determined distributions to estimate a plurality of word group distributions for each text block;
  
  averaging the plurality of word group distributions for each text block;
  
  making comparisons of the plurality of word group distributions in adjacent text blocks using a similarity metric to determine similarity values based on the averaged plurality of word group distributions; and
  
  selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments;
  
  wherein determining a plurality of PLSA models uses at least one of different initializations of the models prior to training each model and different numbers of latent variables.
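
The averaging step of claim 18 can be sketched as an element-wise mean of the word-group distributions that the different PLSA models produce for the same text block. An unweighted mean is assumed here; the claim does not state whether the models are weighted.

```python
def average_distributions(distributions):
    """Element-wise mean of several distributions over the same vocabulary."""
    count = len(distributions)
    return [sum(column) / count for column in zip(*distributions)]


# Hypothetical estimates of one block's word distribution from two models
# that differ in initialisation or in number of latent variables.
per_model = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
]
averaged = average_distributions(per_model)
```

Since each input sums to one, the mean also sums to one, so the result can be passed directly to the similarity metric.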

19. A computerized method for determining the topic structure of a portion of text, comprising:
- identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
  
  determining a plurality of PLSA models;
  
  applying a folding-in process with each model to each text block to determine a distribution of probabilities over a plurality of latent classes for each text block;
  
  using the determined distributions to estimate a plurality of word group distributions for each text block;
  
  averaging the plurality of word group distributions for each text block;
  
  smoothing the similarity values after averaging;
  
  making comparisons of the plurality of word group distributions in adjacent text blocks using a similarity metric to determine similarity values based on the averaged plurality of word group distributions; and
  
  selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.

20. A computerized method for determining the topic structure of a portion of text, comprising:
- identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
  
  determining a plurality of PLSA models;
  
  applying a folding-in process with each model to each text block to determine a distribution of probabilities over a plurality of latent classes for each text block;
  
  using the determined distributions to estimate a plurality of word group distributions for each text block;
  
  making comparisons of the plurality of word group distributions in adjacent text blocks using a similarity metric to determine similarity values;
  
  averaging the similarity values; and
  
  selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments;
  
  wherein determining a plurality of PLSA models uses at least one of different initializations of the models prior to training each model and different numbers of latent variables.
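
Claim 20 differs from claim 18 in where the averaging happens: each model's pair of adjacent-block distributions is compared first, and the resulting similarity values are then averaged. A minimal sketch of that ordering follows; the `overlap` metric is only a stand-in chosen to keep the example short and is not one of the metrics named in the claims.

```python
def overlap(p, q):
    """Stand-in similarity: shared probability mass, in [0, 1]."""
    return sum(min(a, b) for a, b in zip(p, q))


def averaged_similarity(left_by_model, right_by_model, metric=overlap):
    """Compare two adjacent blocks under each model, then average the values."""
    values = [metric(p, q) for p, q in zip(left_by_model, right_by_model)]
    return sum(values) / len(values)


# Hypothetical estimates for two adjacent blocks under two models.
left = [[1.0, 0.0], [1.0, 0.0]]
right = [[1.0, 0.0], [0.0, 1.0]]
similarity = averaged_similarity(left, right)
```

Averaging after comparison keeps each model's disagreement visible in the result, whereas averaging the distributions first can cancel it out before the metric is applied.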

21. A computerized method for determining the topic structure of a portion of text, comprising:
- identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
  
  determining a plurality of PLSA models;
  
  applying a folding-in process with each model to each text block to determine a distribution of probabilities over a plurality of latent classes for each text block;
  
  using the determined distributions to estimate a plurality of word group distributions for each text block;
  
  making comparisons of the plurality of word group distributions in adjacent text blocks using a similarity metric to determine similarity values;
  
  averaging the similarity values;
  
  smoothing the similarity values after averaging; and
  
  selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.

22. A computerized method for determining the number of training iterations for obtaining a PLSA model, comprising:
- determining a likelihood value according to the formula

23. A computerized method for determining the topic structure of a portion of text, comprising:
- identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
  
  determining a distribution of probabilities over a plurality of latent classes for each text block;
  
  using the determined distributions to estimate a distribution of words for each text block;
  
  making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values; and
  
  selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments;
  
  wherein determining a distribution of probabilities is performed by applying a folding-in process to each text block.
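
Claim 23 ties the latent-class step to a folding-in process. In the PLSA literature, folding-in usually means running EM on the new text with the trained word distributions P(w|z) held fixed, so that only the mixing weights P(z|block) are re-estimated. The sketch below follows that reading; the function name, the dict-based model layout, and the iteration count are illustrative assumptions, not details from the patent.

```python
def fold_in(word_counts, p_w_given_z, iterations=50):
    """Estimate P(z|block) by EM with the trained P(w|z) held fixed."""
    num_classes = len(p_w_given_z)
    p_z = [1.0 / num_classes] * num_classes  # uniform start
    for _ in range(iterations):
        expected = [0.0] * num_classes
        for word, count in word_counts.items():
            # E-step: posterior over latent classes for this word.
            joint = [p_z[z] * p_w_given_z[z].get(word, 0.0) for z in range(num_classes)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # word has zero probability under every class
            for z in range(num_classes):
                expected[z] += count * joint[z] / norm
        total = sum(expected)
        if total == 0.0:
            break  # no usable words; keep the current estimate
        # M-step: renormalise expected counts into mixing weights.
        p_z = [value / total for value in expected]
    return p_z


# Hypothetical trained model with two latent classes.
MODEL = [
    {"cat": 0.5, "dog": 0.5},
    {"stock": 0.5, "bond": 0.5},
]
```

For example, `fold_in({"cat": 2, "dog": 1}, MODEL)` puts essentially all weight on the first class, since the second class assigns those words zero probability.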

24. A computerized method for determining the topic structure of a portion of text, comprising:
- identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
  
  determining a distribution of probabilities over a plurality of latent classes for each text block;
  
  using the determined distributions to estimate a distribution of words for each text block;
  
  making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values;
  
  determining local dips in the similarity values after making comparisons; and
  
  selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.
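
The local-dip step of claim 24 can be sketched as a search for local minima in the sequence of similarity values, where value i compares block i with block i + 1. The depth threshold and the exclusion of the first and last positions are assumptions added for illustration; the claim only requires that local dips be determined.

```python
def local_dips(similarities, min_depth=0.0):
    """Return indices i where similarities[i] is a strict local minimum.

    Depth is the smaller of the drops from the two neighbours; dips
    shallower than min_depth are ignored.
    """
    dips = []
    for i in range(1, len(similarities) - 1):
        left, here, right = similarities[i - 1], similarities[i], similarities[i + 1]
        if here < left and here < right:
            depth = min(left - here, right - here)
            if depth >= min_depth:
                dips.append(i)
    return dips
```

A dip at index i suggests a segmentation point between block i and block i + 1, because adjacent blocks on either side of a topic change tend to have dissimilar word distributions.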

25. A computerized method for determining the topic structure of a portion of text, comprising:
- identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
  
  determining a distribution of probabilities over a plurality of latent classes for each text block;
  
  using the determined distributions to estimate a distribution of words for each text block;
  
  making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values; and
  
  selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments;
  
  wherein making comparisons of the distribution of words in adjacent text blocks using a similarity metric is based on a cosine distance or a variational distance.
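
The two metrics named in claim 25 can be written down directly for discrete distributions. The sketch assumes the usual definitions: cosine distance as one minus the cosine of the angle between the two probability vectors, and variational distance as the sum of absolute differences (the L1 distance). Some texts halve the latter; that scaling is not applied here.

```python
import math


def cosine_distance(p, q):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def variational_distance(p, q):
    """Sum of absolute differences (L1 distance), in [0, 2] for distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

Both distances are zero for identical distributions and maximal for distributions with disjoint support.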

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Tsochantaridis, Ioannis; Chen, Francine R.; Brants, Thorsten H.
Primary Examiner(s): Hirl, Joseph P.
Application Number: US10/103,053
Publication Number: US 20030182631A1
Time in Patent Office: 1,684 Days
Field of Search: 706/55, 706/46, 706/45
US Class Current: 706/55
CPC Class Codes: G06F 16/313 Selection or weighting of t...
