Clustering of Text for Structuring of Text Documents and Training of Language Models

US 20070244690A1
Filed: 11/11/2004
Published: 10/18/2007
Est. Priority Date: 11/21/2003
Status: Abandoned Application

Abstract

The present invention relates to a method, a text segmentation system and a computer program product for clustering of text into text clusters representing a distinct semantic meaning. The text clustering method identifies text portions and assigns text portions to different clusters in such a way that each text cluster refers to one or several semantic topics. The clustering method incorporates an optimization procedure based on a re-clustering procedure evaluating a target function being indicative of the correlation between a text unit and a cluster. The text clustering method makes use of a text emission model and a cluster transition model and makes further use of various smoothing techniques.

20 Claims

1. A method of text clustering for the generation of language models, a text (300) featuring a plurality of text units (320, 322, . . . ), each of which having at least one word (302, 304, . . . ), the method of text clustering comprising the steps of:
- assigning each of the text units (320, 322, . . . ) to one of a plurality of provided clusters (330, 332, . . . ),
- determining for each text unit a set of emission probabilities (340, 350), each emission probability (342, 344, . . . , 352, 354, . . . ) being indicative of a correlation between the text unit (320, 322, . . . ) and a cluster (330, 332, . . . ), the set of emission probabilities being indicative of the correlations between the text unit and the plurality of clusters,
- determining a transition probability (362, 364, . . . ) being indicative that a first cluster (330) being assigned to a first text unit (320) in the text is followed by a second cluster (332) being assigned to a second text unit (322) in the text, the second text unit (322) subsequently following the first text unit (320) within the text,
- performing an optimization procedure based on the emission probability and the transition probability in order to assign each text unit to a cluster.
  - 2. The method according to claim 1, wherein the optimization procedure comprises evaluating a target function by making use of statistical parameters based on the emission and transition probability, the statistical parameters comprising word counts, transition counts, cluster sizes and cluster frequencies.
  - 3. The method according to claim 2, wherein the optimization procedure comprises a re-clustering procedure, the re-clustering procedure comprising the steps of:
    - (a) performing a modification by assigning a first text unit (320) that has been assigned to a first cluster (330) to a second cluster (332),
    - (b) evaluating the target function by making use of the statistical parameters accounting for the performed modification,
    - (c) assigning the text unit (320) to the second cluster (332) when the result of the target function has improved compared to the corresponding result based on the first text unit (320) being assigned to the first cluster (330),
    - (d) repeating steps (a) through (c) for each of the plurality of clusters (330, 332, . . . ) being the second cluster,
    - (e) repeating steps (a) through (d), for each of the plurality of text units (320, 322, . . . ) being the first text unit.
  - 4. The method according to claim 2, wherein a smoothing procedure is applied to the target function, the smoothing procedure comprising a discount technique, a backing-off technique, or an add-one smoothing technique.
  - 5. The method according to claim 1, comprising a weighting functionality in order to decrease or increase the impact of the transition or emission probability on the target function.
  - 6. The method according to claim 4, wherein the smoothing procedure further comprises an add-x smoothing technique making use of adding a number x to the word counts and adding a number y to the transition counts in order to modify the smoothing procedure and/or the weighting functionality.
  - 7. The method according to claim 2, wherein evaluating the target function further comprises making use of modified emission (340, 350) and transition probabilities (360) in the form of a leaving-one-out technique.
  - 8. The method according to claim 1, wherein a text unit (320) either comprises a single word (302), a set of words (302, 304, . . . ), a sentence or a set of sentences.
  - 9. The method according to claim 1, wherein the number of clusters (330, 332, . . . ) does not exceed a predefined maximum number of clusters.
  - 10. The method according to claim 1, wherein the text (300) comprises a weakly annotated structure with a number of labels assigned to at least one text unit (320) or to a set of text units (320, 322, . . . ), the method of text clustering further comprising assigning the same cluster to text units having assigned the same label.
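
The re-clustering steps (a) through (e) of claim 3 can be illustrated with a minimal Python sketch. The claims do not state a concrete target function, so this sketch assumes one: the log-likelihood of the text under add-one-smoothed emission and transition estimates built from word counts and transition counts. The names `target` and `recluster`, the round-robin initial assignment, and the `max_sweeps` limit are hypothetical choices made for illustration, not part of the application.

```python
import math
from collections import Counter


def target(units, assign, k, vocab_size):
    """Assumed target function: log-likelihood of the text under
    add-one-smoothed emission and transition models."""
    word_counts = [Counter() for _ in range(k)]    # word counts per cluster
    for unit, c in zip(units, assign):
        word_counts[c].update(unit)
    trans = Counter(zip(assign, assign[1:]))       # cluster transition counts
    out_counts = Counter(assign[:-1])              # how often each cluster is left

    total = 0.0
    for unit, c in zip(units, assign):
        size = sum(word_counts[c].values())        # cluster size in words
        for w in unit:
            total += math.log((word_counts[c][w] + 1) / (size + vocab_size))
    for a, b in zip(assign, assign[1:]):
        total += math.log((trans[(a, b)] + 1) / (out_counts[a] + k))
    return total


def recluster(units, k, assign=None, max_sweeps=20):
    """Move one text unit at a time to another cluster and keep the move
    only if the target function improves."""
    vocab_size = len({w for u in units for w in u})
    if assign is None:
        assign = [i % k for i in range(len(units))]  # assumed initial assignment
    else:
        assign = list(assign)
    best = target(units, assign, k, vocab_size)

    for _ in range(max_sweeps):                    # repeated sweeps are an assumption
        changed = False
        for i in range(len(units)):                # step (e): every text unit
            for c in range(k):                     # step (d): every candidate cluster
                if c == assign[i]:
                    continue
                old = assign[i]
                assign[i] = c                      # step (a): tentative move
                score = target(units, assign, k, vocab_size)  # step (b)
                if score > best:                   # step (c): keep if improved
                    best = score
                    changed = True
                else:
                    assign[i] = old                # otherwise undo the move
        if not changed:
            break
    return assign, best


if __name__ == "__main__":
    text_units = [["price", "cost"], ["price", "cost"], ["dog", "cat"], ["dog", "cat"]]
    print(recluster(text_units, k=2))
```

The sketch recomputes the target function from scratch for every candidate move, which keeps it short but is slow for long texts. Claim 2 suggests working from the statistical parameters directly, so an incremental update of the affected counts would be the natural refinement.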

11. A computer program product for text clustering for the generation of language models, a text (300) featuring a plurality of text units (320, 322, . . . ), each of which having at least one word (302, 304, . . . ), the computer program product comprising program means for:
- assigning each of the text units (320, 322, . . . ) to one of a plurality of provided clusters (330, 332, . . . ),
- determining for each text unit a set of emission probabilities (340, 350), each emission probability (342, 344, . . . , 352, 354, . . . ) being indicative of a correlation between the text unit (320, 322, . . . ) and a cluster (330, 332, . . . ), the set of emission probabilities being indicative of the correlations between the text unit and the plurality of clusters,
- determining a transition probability (362, 364, . . . ) being indicative that a first cluster (330) being assigned to a first text unit (320) in the text is followed by a second cluster (332) being assigned to a second text unit (322) in the text, the second text unit (322) subsequently following the first text unit (320) within the text,
- performing an optimization procedure based on the emission probability and the transition probability in order to assign each text unit to a cluster.
  - 12. The computer program product according to claim 11, wherein the program means for performing the optimization procedure further comprise evaluating a target function by making use of statistical parameters based on the emission and transition probability, the statistical parameters comprising word counts, transition counts, cluster sizes and cluster frequencies.
  - 13. The computer program product according to claim 11, wherein the program means for performing the optimization procedure further comprise program means for re-clustering, the re-clustering program means are adapted to perform the steps of:
    - (a) performing a modification by assigning a first text unit (320) that has been assigned to a first cluster (330) to a second cluster (332),
    - (b) evaluating the target function by making use of the statistical parameters accounting for the performed modification,
    - (c) assigning the text unit (320) to the second cluster (332) when the result of the target function has improved compared to the corresponding result based on the first text unit (320) being assigned to the first cluster (330),
    - (d) repeating steps (a) through (c) for each of the plurality of clusters (330, 332, . . . ) being the second cluster,
    - (e) repeating steps (a) through (d), for each of the plurality of text units (320, 322, . . . ) being the first text unit.
  - 14. The computer program product according to claim 12, further comprising program means being adapted to perform a smoothing procedure for the target function, the smoothing procedure comprising a discount technique, a backing-off technique, an add-one smoothing technique or separate add-x and add-y smoothing techniques for the word and cluster transition counts.
  - 15. The computer program product according to claim 11, further comprising program means providing a weighting functionality in order to decrease or increase the impact of the transition or emission probability on the target function.
  - 16. The computer program product according to claim 11, wherein a text unit (320) either comprises a single word (302), a set of words (302, 304, . . . ), a sentence or a set of sentences.
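
The smoothing and weighting of claims 14 and 15 can be sketched in the same spirit. The claims name add-x and add-y smoothing for the word and cluster transition counts and a weighting functionality, but give no formulas, so the forms below are assumptions: standard additive smoothing for both estimates, and a multiplicative weight on the transition log-probability. All function and parameter names are hypothetical.

```python
import math


def emission_logprob(word_count, cluster_size, vocab_size, x=1.0):
    """Add-x smoothed log P(word | cluster); x = 1 gives add-one smoothing."""
    return math.log((word_count + x) / (cluster_size + x * vocab_size))


def transition_logprob(trans_count, from_count, num_clusters, y=1.0):
    """Add-y smoothed log P(next cluster | current cluster)."""
    return math.log((trans_count + y) / (from_count + y * num_clusters))


def weighted_score(emission_lp, transition_lp, weight=1.0):
    """Assumed weighting: weight > 1 increases and weight < 1 decreases the
    impact of the transition model relative to the emission model."""
    return emission_lp + weight * transition_lp


if __name__ == "__main__":
    # A word unseen in a cluster of 4 words (vocabulary of 3) still gets
    # a non-zero probability under add-x smoothing.
    unseen = emission_logprob(word_count=0, cluster_size=4, vocab_size=3, x=0.5)
    stay = transition_logprob(trans_count=5, from_count=5, num_clusters=2, y=2.0)
    print(weighted_score(unseen, stay, weight=0.5))
```

Choosing x and y separately lets the word model and the transition model be smoothed to different degrees, which matters because a vocabulary is usually far larger than the set of clusters.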

17. A text clustering system for the generation of language models, a text (300) featuring a plurality of text units (320, 322, . . . ), each of which having at least one word (302, 304, . . . ), the text clustering system comprising:
- means for assigning each of the text units (320, 322, . . . ) to one of a plurality of provided clusters (330, 332, . . . ),
- means for determining for each text unit a set of emission probabilities (340, 350), each emission probability (342, 344, . . . , 352, 354) being indicative of a correlation between the text unit (320, 322, . . . ) and a cluster (330, 332, . . . ), the set of emission probabilities being indicative of the correlations between the text unit and the plurality of clusters,
- means for determining a transition probability (362, 364, . . . ) being indicative that a first cluster (330) being assigned to a first text unit (320) in the text is followed by a second cluster (332) being assigned to a second text unit (322) in the text, the second text unit (322) subsequently following the first text unit (320) within the text,
- means for performing an optimization procedure based on the emission probability and the transition probability in order to assign each text unit to a cluster.
  - 18. The text clustering system according to claim 17, wherein means for performing the optimization procedure are adapted to evaluate a target function and to perform a re-clustering procedure by making use of statistical parameters based on the emission and transition probability, the statistical parameters comprising word counts, transition counts, cluster sizes and cluster frequencies, the re-clustering procedure comprising the steps of:
    - (a) performing a modification by assigning a first text unit (320) that has been assigned to a first cluster (330) to a second cluster (332),
    - (b) evaluating the target function by making use of the statistical parameters accounting for the performed modification,
    - (c) assigning the text unit (320) to the second cluster (332) when the result of the target function has improved compared to the corresponding result based on the first text unit (320) being assigned to the first cluster (330),
    - (d) repeating steps (a) through (c) for each of the plurality of clusters (330, 332, . . . ) being the second cluster,
    - (e) repeating steps (a) through (d), for each of the plurality of text units (320, 322, . . . ) being the first text unit.
  - 19. The text clustering system according to claim 18, further comprising means being adapted to apply a smoothing procedure to the target function, the smoothing procedure comprising a discount technique, a backing-off technique, an add-one smoothing technique or separate add-x and add-y smoothing techniques for the word and cluster transition counts.
  - 20. The text clustering system according to claim 17, wherein a text unit (320) can either comprise a single word (302), a set of words (302, 304, . . . ), a sentence or a set of sentences, the clustering further comprising means being adapted to provide a weighting functionality in order to decrease or increase the impact of the transition and emission probability on the target function.

Current Assignee: Koninklijke Philips Electronics N.V. (Koninklijke Philips N.V.)
Original Assignee: Koninklijke Philips N.V.
Inventors: Peters, Jochen
Application Number: US10/595,829
Publication Number: US 20070244690A1

Field of Search
US Class Current: 704/8
CPC Class Codes:
G06F 16/353   into predefined classes
G06F 40/279   Recognition of textual enti...
G06F 40/30   Semantic analysis