Clustering of Text for Structuring of Text Documents and Training of Language Models
Abstract
The present invention relates to a method, a text segmentation system and a computer program product for clustering text into text clusters, each representing a distinct semantic meaning. The text clustering method identifies text portions and assigns them to different clusters in such a way that each text cluster refers to one or several semantic topics. The clustering method incorporates an optimization procedure based on a re-clustering procedure that evaluates a target function indicative of the correlation between a text unit and a cluster. The method uses a text emission model and a cluster transition model, together with various smoothing techniques.
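The re-clustering idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the function name `recluster`, the unigram emission model, the bigram cluster transition model, add-alpha smoothing as the smoothing technique, the round-robin initial assignment, and a target function defined as the sum of log emission and log transition probabilities.

```python
import math
from collections import Counter

def recluster(units, n_clusters, alpha=0.1, max_iters=20):
    """Iteratively reassign text units (lists of words) to clusters.

    Hypothetical sketch: models are re-estimated from the current
    assignment, then each unit moves to the cluster that maximizes an
    assumed target function (log emission + log transition to neighbors).
    """
    vocab_size = len({w for u in units for w in u})
    # Assumed initialization: round-robin over the provided clusters.
    assign = [i % n_clusters for i in range(len(units))]

    for _ in range(max_iters):
        # Emission model: per-cluster word counts (unigram).
        word_counts = [Counter() for _ in range(n_clusters)]
        for u, c in zip(units, assign):
            word_counts[c].update(u)
        totals = [sum(wc.values()) for wc in word_counts]

        # Transition model: counts of cluster a followed by cluster b.
        trans = [[0] * n_clusters for _ in range(n_clusters)]
        for a, b in zip(assign, assign[1:]):
            trans[a][b] += 1

        def log_emit(u, c):
            # Add-alpha smoothing keeps unseen words from scoring log(0).
            return sum(
                math.log((word_counts[c][w] + alpha)
                         / (totals[c] + alpha * vocab_size))
                for w in u
            )

        def log_trans(a, b):
            return math.log((trans[a][b] + alpha)
                            / (sum(trans[a]) + alpha * n_clusters))

        changed = False
        for i, u in enumerate(units):
            def score(c):
                s = log_emit(u, c)
                if i > 0:
                    s += log_trans(assign[i - 1], c)
                if i + 1 < len(units):
                    s += log_trans(c, assign[i + 1])
                return s

            best = max(range(n_clusters), key=score)
            if best != assign[i]:
                assign[i] = best
                changed = True
        if not changed:  # no unit moved: a local optimum was reached
            break
    return assign
```

The models are held fixed during each sweep over the units and re-estimated afterwards, which keeps the sketch short; a production re-clustering step would more likely update counts incrementally or use leave-one-out estimates.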
105 Citations
20 Claims
1. A method of text clustering for the generation of language models, a text (300) featuring a plurality of text units (320, 322, . . . ), each of which having at least one word (302, 304, . . . ), the method of text clustering comprising the steps of:
- assigning each of the text units (320, 322, . . . ) to one of a plurality of provided clusters (330, 332, . . . ),
- determining for each text unit a set of emission probabilities (340, 350), each emission probability (342, 344, . . . , 352, 354, . . . ) being indicative of a correlation between the text unit (320, 322, . . . ) and a cluster (330, 332, . . . ), the set of emission probabilities being indicative of the correlations between the text unit and the plurality of clusters,
- determining a transition probability (362, 364, . . . ) being indicative that a first cluster (330) being assigned to a first text unit (320) in the text is followed by a second cluster (332) being assigned to a second text unit (322) in the text, the second text unit (322) subsequently following the first text unit (320) within the text,
- performing an optimization procedure based on the emission probability and the transition probability in order to assign each text unit to a cluster.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
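One way to read the final step of claim 1 is as a search for the cluster sequence that jointly maximizes emission and transition scores. The sketch below uses Viterbi decoding for that search. This is a swapped-in, standard technique chosen for illustration; the claim does not name it. The function name `viterbi_assign`, the log-probability table layout, and the uniform default for the first unit are all assumptions.

```python
import math

def viterbi_assign(log_emit, log_trans, log_init=None):
    """Return the best-scoring cluster index for each text unit.

    log_emit[i][c]  : log emission probability of unit i under cluster c
    log_trans[a][b] : log probability that cluster a is followed by b
    log_init[c]     : log prior for the first unit (assumed uniform if None)
    """
    n = len(log_emit)
    if n == 0:
        return []
    k = len(log_emit[0])
    init = log_init if log_init is not None else [-math.log(k)] * k

    # score[c]: best total log score of any path ending in cluster c.
    score = [init[c] + log_emit[0][c] for c in range(k)]
    back = []  # back[i-1][c]: best predecessor of cluster c at unit i

    for i in range(1, n):
        new_score, ptr = [], []
        for c in range(k):
            prev = max(range(k), key=lambda a: score[a] + log_trans[a][c])
            ptr.append(prev)
            new_score.append(score[prev] + log_trans[prev][c] + log_emit[i][c])
        score = new_score
        back.append(ptr)

    # Trace the best path backwards from the best final cluster.
    c = max(range(k), key=lambda a: score[a])
    path = [c]
    for ptr in reversed(back):
        c = ptr[c]
        path.append(c)
    path.reverse()
    return path
```

With transitions that strongly favor staying in the same cluster, a unit whose emissions weakly prefer another cluster is still kept with its neighbors. This is the effect of combining both probabilities rather than assigning each unit by emission alone.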
11. A computer program product for text clustering for the generation of language models, a text (300) featuring a plurality of text units (320, 322, . . . ), each of which having at least one word (302, 304, . . . ), the computer program product comprising program means for:
- assigning each of the text units (320, 322, . . . ) to one of a plurality of provided clusters (330, 332, . . . ),
- determining for each text unit a set of emission probabilities (340, 350), each emission probability (342, 344, . . . , 352, 354, . . . ) being indicative of a correlation between the text unit (320, 322, . . . ) and a cluster (330, 332, . . . ), the set of emission probabilities being indicative of the correlations between the text unit and the plurality of clusters,
- determining a transition probability (362, 364, . . . ) being indicative that a first cluster (330) being assigned to a first text unit (320) in the text is followed by a second cluster (332) being assigned to a second text unit (322) in the text, the second text unit (322) subsequently following the first text unit (320) within the text,
- performing an optimization procedure based on the emission probability and the transition probability in order to assign each text unit to a cluster.
Dependent claims: 12, 13, 14, 15, 16
17. A text clustering system for the generation of language models, a text (300) featuring a plurality of text units (320, 322, . . . ), each of which having at least one word (302, 304, . . . ), the text clustering system comprising:
- means for assigning each of the text units (320, 322, . . . ) to one of a plurality of provided clusters (330, 332, . . . ),
- means for determining for each text unit a set of emission probabilities (340, 350), each emission probability (342, 344, . . . , 352, 354) being indicative of a correlation between the text unit (320, 322, . . . ) and a cluster (330, 332, . . . ), the set of emission probabilities being indicative of the correlations between the text unit and the plurality of clusters,
- means for determining a transition probability (362, 364, . . . ) being indicative that a first cluster (330) being assigned to a first text unit (320) in the text is followed by a second cluster (332) being assigned to a second text unit (322) in the text, the second text unit (322) subsequently following the first text unit (320) within the text,
- means for performing an optimization procedure based on the emission probability and the transition probability in order to assign each text unit to a cluster.
Dependent claims: 18, 19, 20
Specification