Systems and methods for determining the topic structure of a portion of text

US 20030182631A1
Filed: 03/22/2002
Published: 09/25/2003
Est. Priority Date: 03/22/2002
Status: Active Grant

8 Assignments

0 Petitions

Abstract

Systems and methods for determining the topic structure of a document including text utilize a Probabilistic Latent Semantic Analysis (PLSA) model and select segmentation points based on similarity values between pairs of adjacent text blocks. PLSA forms a framework for both text segmentation and topic identification. The use of PLSA provides an improved representation for the sparse information in a text block, such as a sentence or a sequence of sentences. Topic characterization of each text segment is derived from PLSA parameters that relate words to “topics”, latent variables in the PLSA model, and “topics” to text segments. A system executing the method exhibits significant performance improvement. Once determined, the topic structure of a document may be employed for document retrieval and/or document summarization.
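
The abstract credits PLSA with a better representation of sparse text blocks. The sketch below illustrates one way this might work, assuming the folding-in process named in claim 1 is the usual EM update of P(z|q) with the trained P(w|z) held fixed. The function names, the uniform starting point, the iteration count, and the toy model are illustrative assumptions, not taken from the patent.

```python
def fold_in(counts, p_w_given_z, iterations=50):
    """Estimate P(z|q) for one text block by EM, keeping P(w|z) fixed.

    counts: dict mapping word -> count f(w, q) in the block.
    p_w_given_z: list with one dict per latent variable z, word -> P(w|z).
    """
    k = len(p_w_given_z)
    p_z = [1.0 / k] * k  # uniform starting point (an assumption)
    for _ in range(iterations):
        new_p_z = [0.0] * k
        for w, f in counts.items():
            # E-step: P(z | w, q) is proportional to P(w|z) * P(z|q)
            joint = [p_w_given_z[z].get(w, 0.0) * p_z[z] for z in range(k)]
            total = sum(joint)
            if total == 0.0:
                continue  # word unknown to the model: skip it
            for z in range(k):
                # M-step accumulation: expected count of z in the block
                new_p_z[z] += f * joint[z] / total
        norm = sum(new_p_z)
        if norm == 0.0:
            break  # no known words: keep the current estimate
        p_z = [v / norm for v in new_p_z]
    return p_z


def word_distribution(p_z, p_w_given_z):
    """Smoothed block representation: P(w|q) = sum over z of P(w|z) * P(z|q)."""
    vocab = set()
    for dist in p_w_given_z:
        vocab.update(dist)
    k = len(p_z)
    return {w: sum(p_w_given_z[z].get(w, 0.0) * p_z[z] for z in range(k))
            for w in vocab}


# Hypothetical two-topic model and a three-word text block.
model = [
    {"stock": 0.5, "market": 0.4, "goal": 0.1},
    {"stock": 0.1, "market": 0.1, "goal": 0.8},
]
block = {"stock": 2, "market": 1}
p_z = fold_in(block, model)
p_w = word_distribution(p_z, model)
```

Note that `p_w` assigns some probability to "goal" even though the block never contains it, which is the sense in which the latent variables fill in sparse counts.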

33 Claims

1. A method for determining the topic structure of a portion of text, comprising:
- identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
  
  applying a folding-in process to each text block to determine a distribution of probabilities over a plurality of latent variables for each text block;
  
  using the determined distributions to estimate a distribution of words for each text block;
  
  making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values; and
  
  selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.
- Dependent claims (2 to 21):
  - 2. The method of claim 1, further comprising smoothing the similarity values after making comparisons.
  - 3. The method of claim 1, wherein making comparisons of the distributions of words in adjacent text blocks using a similarity metric is based on a Hellinger or Bhattacharyya distance.
  - 4. The method of claim 1, wherein making comparisons of the distributions of words in adjacent text blocks using a similarity metric is based on a Jensen-Shannon divergence.
  - 5. The method of claim 1, wherein making comparisons of the distributions of words in adjacent text blocks using a similarity metric is based on a weighted sum.
  - 6. The method of claim 1, wherein making comparisons of the distributions of words in adjacent text blocks using a similarity metric is based on a geometric mean.
  - 7. The method of claim 1, wherein making comparisons of the distributions of words in adjacent text blocks using a similarity metric is based on relative sizes of dips.
  - 8. The method of claim 1, further comprising:
    - applying a folding-in process to each segment to determine a probability distribution over latent variables for each segment;
      
      using each determined distribution to estimate a distribution of words for each segment; and
      
      identifying at least one topic for each segment based on the distribution of words for each segment.
  - 9. The method of claim 8, wherein identifying at least one topic for each segment is also based on a measure of an occurrence of a word in each segment.
  - 10. The method of claim 9, wherein identifying at least one topic for each segment is based on the distribution of words for each segment and an inverse segment frequency of the words in the segments.
  - 11. The method of claim 8, wherein identifying at least one topic for each segment is also based on term vectors for each segment.
  - 12. The method of claim 11, wherein identifying at least one topic for each segment is also based on a measure of an occurrence of the words in the segments.
  - 13. The method of claim 12, wherein identifying at least one topic for each segment is based on the distribution of words for each segment and an inverse segment frequency of the words in the segments.
  - 14. The method of claim 8, wherein identifying at least one topic for each segment is also based on parts of speech of the words in the segments.
  - 15. The method of claim 14, wherein identifying at least one topic for each segment is also based on a measure of an occurrence of a word in each segment.
  - 16. The method of claim 15, wherein identifying at least one topic for each segment is based on the distribution of words for each segment and an inverse segment frequency of the words in the segments.
  - 17. The method of claim 8, wherein identifying at least one topic for each segment is also based on mutual information between words in each segment and each segment.
  - 18. The method of claim 17, wherein identifying at least one topic for each segment is also based on a measure of an occurrence of a word in each segment.
  - 19. The method of claim 18, wherein identifying at least one topic for each segment is based on the distribution of words for each segment and an inverse segment frequency of the words in the segments.
  - 20. A method for retrieving a portion of text, comprising:
    - determining topic structures of a plurality of portions of text using the method of claim 1; and
      
      retrieving at least one of the plurality of portions of text using the topic structures of the portions of text.
  - 21. The method of claim 20, wherein retrieving at least one of the plurality of portions of text using the topic structures of the portions of text is based on at least one key word.
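
The comparison, smoothing, and selection steps of claims 1 to 7 can be sketched as follows. This is a minimal illustration under stated assumptions: the claims name the Hellinger/Bhattacharyya distance and the Jensen-Shannon divergence but do not fix formulas, window sizes, or thresholds, so the conversion to a similarity score, the moving-average smoother, and the `min_depth` dip criterion are all hypothetical choices.

```python
import math


def hellinger_similarity(p, q):
    """1 minus the Hellinger distance; close to 1.0 for identical distributions."""
    words = set(p) | set(q)
    # Bhattacharyya coefficient: sum over words of sqrt(p(w) * q(w))
    bc = sum(math.sqrt(p.get(w, 0.0) * q.get(w, 0.0)) for w in words)
    return 1.0 - math.sqrt(max(0.0, 1.0 - bc))


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    words = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}

    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0.0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def smooth(values, radius=1):
    """Moving average over a window of up to 2 * radius + 1 values."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def select_boundaries(similarities, min_depth=0.1):
    """Keep candidate points that are local minima with a deep enough dip.

    similarities[i] is the similarity across candidate point i. The dip depth
    is measured against the highest value on each side (an assumption).
    """
    last = len(similarities) - 1
    boundaries = []
    for i, s in enumerate(similarities):
        is_min = (i == 0 or s <= similarities[i - 1]) and \
                 (i == last or s <= similarities[i + 1])
        depth = (max(similarities[: i + 1]) - s) + (max(similarities[i:]) - s)
        if is_min and depth >= min_depth:
            boundaries.append(i)
    return boundaries


sims = [0.9, 0.8, 0.2, 0.85, 0.9]
chosen = select_boundaries(sims, min_depth=0.5)
```

With these toy values only the candidate point at index 2 is selected, since it is the single local minimum and its dip is far deeper than the threshold.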

22. A method for determining the topic structure of a portion of text, comprising:
- applying a folding-in process to each of a plurality of segments of the portion of text to determine a probability distribution over latent variables for each segment;
  
  using each determined distribution to estimate a distribution of words for each segment; and
  
  identifying at least one topic for each segment based on the distribution of words for each segment.
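
One way to read the topic identification step, combined with the inverse segment frequency of claim 10, is to score each word by its estimated probability in a segment and discount words that occur in many segments. The sketch below assumes that reading; the scoring formula, the add-one smoothing, and all names are hypothetical and not stated in the claims.

```python
import math


def identify_topic_words(word_dists, segment_tokens, top_n=2):
    """Rank words per segment by P(w|segment) * inverse segment frequency.

    word_dists: one dict per segment, word -> estimated P(w|segment).
    segment_tokens: one list of tokens per segment, used only to count in
        how many segments each word actually occurs.
    """
    n = len(segment_tokens)
    seg_freq = {}
    for tokens in segment_tokens:
        for w in set(tokens):
            seg_freq[w] = seg_freq.get(w, 0) + 1

    results = []
    for dist in word_dists:
        scores = {}
        for w, p in dist.items():
            # Add-one smoothing is an assumption; a word that occurs in
            # every segment gets weight log(1) = 0.
            isf = math.log((n + 1) / (seg_freq.get(w, 0) + 1))
            scores[w] = p * isf
        # Sort by descending score, then alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:top_n])
    return results


tokens = [["stock", "market", "the"], ["goal", "match", "the"]]
dists = [
    {"the": 0.5, "stock": 0.3, "market": 0.2},
    {"the": 0.5, "goal": 0.3, "match": 0.2},
]
topics = identify_topic_words(dists, tokens)
```

In this toy input "the" is the most probable word in both segments but is scored zero because it occurs in every segment, so the content words rank first.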

23. A system for determining the topic structure of a portion of text, comprising:
- an input device for inputting a portion of a text; and
  
  at least one processor for applying a folding-in process to each of a plurality of segments of the portion of text to determine a probability distribution over latent variables for each segment, estimating a distribution of words for each segment based on the determined probability distribution and identifying at least one topic for each segment based on the distribution of words for each segment.
- Dependent claim (24):
  - 24. The system of claim 23, wherein the at least one processor identifies the plurality of segments of the portion of text.

25. A method for determining the topic structure of a portion of text, comprising:
- identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
  
  determining a plurality of PLSA models;
  
  applying a folding-in process with each model to each text block to determine a distribution of probabilities over a plurality of latent variables for each text block;
  
  using the determined distributions to estimate a distribution of words for each text block;
  
  averaging the word distributions for each text block;
  
  making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values based on the averaged word distributions; and
  
  selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.
- Dependent claims (26 to 28):
  - 26. The method of claim 25, wherein determining a plurality of PLSA models uses different initializations of the models prior to training each model.
  - 27. The method of claim 25, wherein determining a plurality of PLSA models uses different numbers of latent variables.
  - 28. The method of claim 25, further comprising smoothing the similarity values after averaging.
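
The averaging step of claim 25 combines, for each text block, the word distributions produced by the different PLSA models before any comparison is made. A minimal sketch, assuming an unweighted arithmetic mean over models (the claim does not specify the kind of average) and a hypothetical nested-list layout for the inputs:

```python
def average_word_distributions(per_model_dists):
    """Average word distributions across models, block by block.

    per_model_dists[m][b] is a dict word -> P(w|block b) under model m.
    Returns one averaged dict per block.
    """
    n_models = len(per_model_dists)
    n_blocks = len(per_model_dists[0])
    averaged = []
    for b in range(n_blocks):
        dists = [per_model_dists[m][b] for m in range(n_models)]
        vocab = set().union(*dists)  # union of the words seen by any model
        averaged.append(
            {w: sum(d.get(w, 0.0) for d in dists) / n_models for w in vocab}
        )
    return averaged


# Two hypothetical models, one text block each.
model_a = [{"stock": 0.6, "market": 0.4}]
model_b = [{"stock": 0.2, "market": 0.8}]
avg = average_word_distributions([model_a, model_b])
```

Because each input distribution sums to one, the unweighted mean is again a probability distribution and can be passed directly to the similarity metric.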

29. A method for determining the topic structure of a portion of text, comprising:
- identifying candidate segmentation points of the portion of text corresponding to locations between text blocks;
  
  determining a plurality of PLSA models;
  
  applying a folding-in process with each model to each text block to determine a distribution of probabilities over a plurality of latent variables for each text block;
  
  using the determined distributions to estimate a distribution of words for each text block;
  
  making comparisons of the distributions of words in adjacent text blocks using a similarity metric to determine similarity values;
  
  averaging the similarity values; and
  
  selecting segmentation points from the candidate segmentation points of the portion of text based on the comparison to define a plurality of segments.
- Dependent claims (30 to 32):
  - 30. The method of claim 29, wherein determining a plurality of PLSA models uses different initializations of the models prior to training each model.
  - 31. The method of claim 29, wherein determining a plurality of PLSA models uses different numbers of latent variables.
  - 32. The method of claim 29, further comprising smoothing the similarity values after averaging.
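
Claim 29 differs from claim 25 in where the averaging happens: each model yields its own similarity values at the candidate segmentation points, and those values are averaged afterwards. A minimal sketch, assuming an unweighted mean and one list of similarity values per model (both are illustrative assumptions):

```python
def average_similarities(per_model_sims):
    """Average similarity values across models at each candidate point.

    per_model_sims[m][i] is the similarity across candidate point i
    computed with model m; all lists must have the same length.
    """
    n_models = len(per_model_sims)
    return [sum(values) / n_models for values in zip(*per_model_sims)]


# Similarities at three candidate points from two hypothetical models.
sims_model_a = [0.9, 0.2, 0.8]
sims_model_b = [0.7, 0.4, 0.6]
mean_sims = average_similarities([sims_model_a, sims_model_b])
```

The averaged curve can then be smoothed (claim 32) and used to select segmentation points in the same way as a single-model curve.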

33. A method for determining the number of training iterations for obtaining a PLSA model, comprising:
- determining a likelihood value according to the formula $L_{fi}(Q) = \sum_{q \in Q} \sum_{w \in q} f(w, q) \log \sum_{z} P(w \mid z) \, P_{fi}(z \mid q)$ for each iteration;
  
  comparing the likelihood value of a current iteration to the likelihood value of a previous iteration; and
  
  determining the number of iterations as the current iteration when the likelihood value of the current iteration is within a selected threshold value of the likelihood value of the previous iteration.
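
The stopping rule of claim 33 can be sketched as below. The first function evaluates the folding-in likelihood from the formula in the claim; the second picks the first iteration whose likelihood is within the threshold of the previous one. Treating "within a selected threshold value" as an absolute difference, skipping zero-probability words, and the data layout are assumptions made for illustration.

```python
import math


def folding_in_likelihood(blocks, p_w_given_z, p_z_given_q):
    """L_fi(Q): sum over q and w of f(w, q) * log(sum_z P(w|z) * P_fi(z|q)).

    blocks: one dict per held-out block q, word -> count f(w, q).
    p_w_given_z: one dict per latent variable z, word -> P(w|z).
    p_z_given_q: one list per block with the folded-in P_fi(z|q).
    """
    total = 0.0
    for counts, p_z in zip(blocks, p_z_given_q):
        for w, f in counts.items():
            p_w = sum(p_w_given_z[z].get(w, 0.0) * p_z[z]
                      for z in range(len(p_z)))
            if p_w > 0.0:  # skip words the model cannot explain (assumption)
                total += f * math.log(p_w)
    return total


def choose_num_iterations(likelihoods, threshold):
    """Return the 1-based iteration at which the likelihood stops changing.

    likelihoods[i] is L_fi(Q) after training iteration i + 1.
    """
    for i in range(1, len(likelihoods)):
        if abs(likelihoods[i] - likelihoods[i - 1]) <= threshold:
            return i + 1
    return len(likelihoods)  # never converged within the recorded run


# Hypothetical likelihood values recorded after each training iteration.
history = [-120.0, -100.0, -95.0, -94.9, -94.89]
n_iter = choose_num_iterations(history, threshold=0.5)
```

Here the change between the third and fourth values is the first to fall below the threshold, so the fourth iteration is selected.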

Current Assignee
Xerox Corporation (Xerox Holdings Corp.)
Original Assignee
Xerox Corporation (Xerox Holdings Corp.)
Inventors
Chen, Francine R.; Tsochantaridis, Ioannis; Brants, Thorsten H.

Granted Patent

US 7,130,837 B2
Field of Search
US Class Current

715/531
CPC Class Codes

G06F 16/313 Selection or weighting of t...
