Text segmentation with multiple granularity levels
First Claim
1. A method of text processing, comprising:
- training, using a processor, a classifier for classifying text, wherein:
the training is based on a plurality of training sample entries;
a training sample entry in the plurality of training sample entries includes:
a character count;
an independent use rate;
a phrase structure rule value indicating whether the training sample entry complies with phrase structure rules;
a semantic attribute value indicating an inclusion state of the training sample entry in a predetermined set of enumerated entries;
an overlap attribute value indicating overlap of the training sample entry with another entry in the predetermined set of enumerated entries; and
a classification result indicating whether the training sample entry is a compound semantic unit or a smallest semantic unit;
building, using the processor, a lexicon of smallest semantic units, comprising:
receiving an entry to be classified;
using the trained classifier to determine whether the entry to be classified is a smallest semantic unit or a compound semantic unit; and
in the event that the entry is determined to be a smallest semantic unit, adding the entry to the lexicon of smallest semantic units;
segmenting, using the processor, received text based on the lexicon of smallest semantic units to obtain medium-grained segmentation results;
merging, using the processor, the medium-grained segmentation results to obtain coarse-grained segmentation results, the coarse-grained segmentation results having coarser granularity than the medium-grained segmentation results;
looking up, using the processor, in the lexicon of smallest semantic units respective search elements that correspond to segments in the medium-grained segmentation results; and
forming, using the processor, fine-grained segmentation results based on the respective search elements, the fine-grained segmentation results having finer granularity than the medium-grained segmentation results.
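The training and lexicon-building steps of the claim can be sketched as follows. This is a minimal illustration, not the patented implementation: the claim does not name a classifier type, so a nearest-centroid classifier is swapped in as an assumption, and the `Entry` class, its field names, the toy entries, and all feature values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One entry with the attributes listed in the claim (field names are hypothetical)."""
    text: str
    independent_use_rate: float       # independent use rate, assumed to lie in [0, 1]
    follows_phrase_rules: bool        # phrase structure rule value
    in_enumerated_set: bool           # semantic attribute value
    overlaps_enumerated_entry: bool   # overlap attribute value

    def features(self):
        # The character count is derived from the text itself.
        return [
            float(len(self.text)),
            self.independent_use_rate,
            float(self.follows_phrase_rules),
            float(self.in_enumerated_set),
            float(self.overlaps_enumerated_entry),
        ]


def train(samples):
    """samples: list of (Entry, is_smallest_unit). Returns one centroid per class."""
    centroids = {}
    for label in (True, False):
        rows = [entry.features() for entry, y in samples if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def is_smallest_unit(centroids, entry):
    """Assumed decision rule: pick the class whose centroid is nearer."""
    x = entry.features()

    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))

    return dist(True) < dist(False)


def build_lexicon(centroids, candidates):
    """Add an entry to the lexicon only if it is classified as a smallest semantic unit."""
    return {e.text for e in candidates if is_smallest_unit(centroids, e)}


# Hypothetical training sample entries: True = smallest unit, False = compound unit.
TRAINING = [
    (Entry("steel", 0.9, False, True, False), True),
    (Entry("pipe", 0.8, False, True, False), True),
    (Entry("steelpipe", 0.2, True, False, True), False),
    (Entry("ironpipe", 0.1, True, False, True), False),
]
MODEL = train(TRAINING)
CANDIDATES = [
    Entry("glass", 0.7, False, True, False),
    Entry("glasspipe", 0.3, True, False, True),
]
LEXICON = build_lexicon(MODEL, CANDIDATES)
print(LEXICON)
```

With these toy values, only `glass` lands near the smallest-unit centroid, so it is the only candidate added to the lexicon. Any trainable binary classifier could stand in for the centroid rule.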
Abstract
Text processing includes: segmenting received text based on a lexicon of smallest semantic units to obtain medium-grained segmentation results; merging the medium-grained segmentation results to obtain coarse-grained segmentation results, the coarse-grained segmentation results having coarser granularity than the medium-grained segmentation results; looking up in the lexicon of smallest semantic units respective search elements that correspond to segments in the medium-grained segmentation results; and forming fine-grained segmentation results based on the respective search elements, the fine-grained segmentation results having finer granularity than the medium-grained segmentation results.
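The three granularity levels described in the abstract can be illustrated with a small sketch. Several assumptions are made here that the source does not state: medium-grained segmentation uses forward maximum matching, the lexicon is modeled as a dictionary mapping each smallest semantic unit to its search elements, and merging consults a hypothetical set of known compounds. All names and sample data are illustrative only.

```python
def segment_medium(text, lexicon):
    """Forward maximum matching against the lexicon keys (assumed strategy).

    Characters that match no lexicon entry become single-character segments.
    """
    max_len = max(map(len, lexicon), default=1)
    segments, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                segments.append(piece)
                i += size
                break
    return segments


def merge_coarse(segments, compounds):
    """Merge the longest run of adjacent segments whose concatenation is a known compound."""
    merged, i = [], 0
    while i < len(segments):
        for j in range(len(segments), i, -1):
            candidate = "".join(segments[i:j])
            if j == i + 1 or candidate in compounds:
                merged.append(candidate)
                i = j
                break
    return merged


def split_fine(segments, lexicon):
    """Replace each medium-grained segment with the search elements stored for it."""
    fine = []
    for seg in segments:
        fine.extend(lexicon.get(seg, [seg]))
    return fine


# Hypothetical lexicon: smallest semantic unit -> its search elements.
LEXICON = {
    "stainlesssteel": ["stainless", "steel"],
    "pipe": ["pipe"],
}
# Hypothetical set of compound units used only for merging.
COMPOUNDS = {"stainlesssteelpipe"}

medium = segment_medium("stainlesssteelpipe", LEXICON)
coarse = merge_coarse(medium, COMPOUNDS)
fine = split_fine(medium, LEXICON)
print(medium)  # ['stainlesssteel', 'pipe']
print(coarse)  # ['stainlesssteelpipe']
print(fine)    # ['stainless', 'steel', 'pipe']
```

Both the coarse and the fine results are derived from the same medium-grained pass, so the text is matched against the lexicon only once.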
16 Claims
1. A method of text processing (see First Claim above).
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system for text processing, comprising:
one or more processors configured to:
train a classifier for classifying text, wherein:
the training is based on a plurality of training sample entries;
a training sample entry in the plurality of training sample entries includes:
a character count;
an independent use rate;
a phrase structure rule value indicating whether the training sample entry complies with phrase structure rules;
a semantic attribute value indicating an inclusion state of the training sample entry in a predetermined set of enumerated entries;
an overlap attribute value indicating overlap of the training sample entry with another entry in the predetermined set of enumerated entries; and
a classification result indicating whether the training sample entry is a compound semantic unit or a smallest semantic unit; and
build a lexicon of smallest semantic units, comprising:
receiving an entry to be classified;
using the trained classifier to determine whether the entry to be classified is a smallest semantic unit or a compound semantic unit; and
in the event that the entry is determined to be a smallest semantic unit, adding the entry to the lexicon of smallest semantic units;
segment received text based on the lexicon of smallest semantic units to obtain medium-grained segmentation results;
merge the medium-grained segmentation results to obtain coarse-grained segmentation results, the coarse-grained segmentation results having coarser granularity than the medium-grained segmentation results;
look up in the lexicon of smallest semantic units respective search elements that correspond to segments in the medium-grained segmentation results; and
form fine-grained segmentation results based on the respective search elements, the fine-grained segmentation results having finer granularity than the medium-grained segmentation results; and
one or more memories coupled to the one or more processors, configured to provide the one or more processors with instructions.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A computer program product for text processing, the computer program product being embodied in a non-transitory computer readable storage medium and comprising computer instructions for:
training a classifier for classifying text, wherein:
the training is based on a plurality of training sample entries;
a training sample entry in the plurality of training sample entries includes:
a character count;
an independent use rate;
a phrase structure rule value indicating whether the training sample entry complies with phrase structure rules;
a semantic attribute value indicating an inclusion state of the training sample entry in a predetermined set of enumerated entries;
an overlap attribute value indicating overlap of the training sample entry with another entry in the predetermined set of enumerated entries; and
a classification result indicating whether the training sample entry is a compound semantic unit or a smallest semantic unit; and
building a lexicon of smallest semantic units, comprising:
receiving an entry to be classified;
using the trained classifier to determine whether the entry to be classified is a smallest semantic unit or a compound semantic unit; and
in the event that the entry is determined to be a smallest semantic unit, adding the entry to a lexicon of smallest semantic units;
segmenting received text based on the lexicon of smallest semantic units to obtain medium-grained segmentation results;
merging the medium-grained segmentation results to obtain coarse-grained segmentation results, the coarse-grained segmentation results having coarser granularity than the medium-grained segmentation results;
looking up in the lexicon of smallest semantic units respective search elements that correspond to segments in the medium-grained segmentation results; and
forming fine-grained segmentation results based on the respective search elements, the fine-grained segmentation results having finer granularity than the medium-grained segmentation results.
16. A system for text processing, comprising:
one or more processors configured to:
train a classifier for classifying text, wherein:
the training is based on a plurality of training sample entries;
a training sample entry in the plurality of training sample entries includes:
a character count;
an independent use rate;
a phrase structure rule value indicating whether the training sample entry complies with phrase structure rules;
a semantic attribute value indicating an inclusion state of the training sample entry in a predetermined set of enumerated entries;
an overlap attribute value indicating overlap of the training sample entry with another entry in the predetermined set of enumerated entries; and
a classification result indicating whether the training sample entry is a compound semantic unit or a smallest semantic unit; and
build a lexicon of smallest semantic units, comprising:
receiving an entry to be classified;
using the trained classifier to determine whether the entry to be classified is a smallest semantic unit or a compound semantic unit; and
in the event that the entry is determined to be a smallest semantic unit, adding the entry to the lexicon of smallest semantic units; and
one or more memories coupled to the one or more processors, configured to provide the one or more processors with instructions.
Specification