Automatic segmentation of continuous text using statistical approaches
Abstract
An automatic segmenter for continuous text segments such text in a rapid, consistent and semantically accurate manner. Two statistical methods for segmenting continuous text are used. The first method, called "forward-backward matching", is simple and fast but can produce occasional errors in long phrases. The second method, called "statistical stack search segmenter", uses statistical language models to generate more accurate segmentation output, at the expense of twice the execution time of the "forward-backward matching" method. In applications where speed is a major concern, "forward-backward matching" can be used; in applications where highly accurate output is desired, the "statistical stack search segmenter" is ideal.
178 Citations
9 Claims
1. A computer implemented method of segmenting continuous text comprising the steps of:
a) determining a phrase from a string of characters in a first direction;
b) determining from a beginning of the phrase a longest possible word beginning at the beginning of the phrase;
c) repeating steps a) and b) until the phrase is completed;
d) repeating steps a), b) and c) in a direction opposite said first direction, beginning with the end of the phrase and working backwards; and
e) choosing a result having a higher likelihood than other possible results.
Dependent claims: 2, 3
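The forward-backward matching recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the vocabulary, its probabilities, the `UNKNOWN_PROB` floor and all function names are hypothetical, "likelihood" is assumed to be a simple product of unigram probabilities, and the phrase is assumed to have been isolated already (for example at punctuation).

```python
import math

# Hypothetical toy vocabulary with made-up unigram probabilities.
VOCAB = {
    "研究": 0.02, "研究生": 0.005, "生命": 0.01,
    "命": 0.001, "起源": 0.004,
}
MAX_LEN = max(len(w) for w in VOCAB)
UNKNOWN_PROB = 1e-8  # assumed floor for single characters missing from VOCAB


def forward_match(phrase):
    """Greedy longest match, scanning from the beginning of the phrase."""
    words, i = [], 0
    while i < len(phrase):
        # Try the longest candidate first; a single character is the fallback.
        for n in range(min(MAX_LEN, len(phrase) - i), 0, -1):
            if phrase[i:i + n] in VOCAB or n == 1:
                words.append(phrase[i:i + n])
                i += n
                break
    return words


def backward_match(phrase):
    """Greedy longest match, scanning from the end of the phrase backwards."""
    words, j = [], len(phrase)
    while j > 0:
        for n in range(min(MAX_LEN, j), 0, -1):
            if phrase[j - n:j] in VOCAB or n == 1:
                words.append(phrase[j - n:j])
                j -= n
                break
    return words[::-1]  # restore reading order


def log_likelihood(words):
    """Assumed scoring: sum of unigram log probabilities."""
    return sum(math.log(VOCAB.get(w, UNKNOWN_PROB)) for w in words)


def segment(phrase):
    """Run both passes and keep the result with the higher likelihood."""
    return max(forward_match(phrase), backward_match(phrase), key=log_likelihood)
```

With this toy vocabulary, `forward_match("研究生命起源")` yields `['研究生', '命', '起源']`, `backward_match` yields `['研究', '生命', '起源']`, and `segment` keeps the backward result because its assumed likelihood is higher.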
4. A computer implemented method of segmenting continuous text comprising the steps of:
searching for every possible word that begins with a first character in the phrase and putting the words in a stack in order of language model likelihood;
expanding a word at a top of the stack with words from a vocabulary by
a) starting with a highest likelihood result, searching for every possible word beginning with the character immediately following that word;
b) for each next word, computing a probability of a word stream containing that word and preceding words, and putting that word and the preceding words in the stack;
c) sorting and pruning the stack based upon the computed probability;
d) repeating steps a), b) and c) until a top entry in the stack matches an input string; and
outputting the top of the stack entry as a result.
Dependent claims: 5, 6
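The stack search recited in claim 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the language model is assumed to be a unigram model over a hypothetical toy vocabulary, the pruning width `STACK_SIZE` and the `UNKNOWN_PROB` floor are invented for the example, and all function names are hypothetical.

```python
import math

# Hypothetical toy vocabulary with made-up unigram probabilities.
VOCAB = {
    "研究": 0.02, "研究生": 0.005, "生命": 0.01,
    "命": 0.001, "起源": 0.004,
}
MAX_LEN = max(len(w) for w in VOCAB)
UNKNOWN_PROB = 1e-8  # assumed floor for characters missing from VOCAB
STACK_SIZE = 10      # assumed pruning width


def word_logp(word):
    """Assumed unigram language model score for one word."""
    return math.log(VOCAB.get(word, UNKNOWN_PROB))


def candidates(text, start):
    """Every vocabulary word beginning at `start`; a lone character is the fallback."""
    found = [text[start:start + n]
             for n in range(1, min(MAX_LEN, len(text) - start) + 1)
             if text[start:start + n] in VOCAB]
    return found or [text[start]]


def stack_search(text):
    if not text:
        return []
    # Each stack entry is (log probability of the word stream, words so far).
    stack = [(word_logp(w), [w]) for w in candidates(text, 0)]
    while True:
        # Sort by probability, best first, then prune to the assumed width.
        stack.sort(key=lambda entry: entry[0], reverse=True)
        del stack[STACK_SIZE:]
        logp, words = stack[0]
        consumed = sum(len(w) for w in words)
        if consumed == len(text):
            return words  # the top entry covers the whole input string
        # Expand the top entry with every word that starts right after it.
        stack.pop(0)
        for w in candidates(text, consumed):
            stack.append((logp + word_logp(w), words + [w]))
```

With this toy vocabulary, `stack_search("研究生命起源")` returns `['研究', '生命', '起源']`. A model that conditions on preceding words, such as a bigram model, could be swapped in by changing how the score of each extended word stream is computed.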
7. A computer implemented method of segmenting continuous text comprising the steps of:
a) inputting unsegmented text;
b) inputting an initial vocabulary and an initial language model;
c) segmenting the input unsegmented text;
d) testing the segmented text to determine if a satisfactory result has been obtained;
e) outputting the segmented text if the result is satisfactory;
f) otherwise, refining the vocabulary and rebuilding the language model; and
g) repeating steps a) and c) using the refined vocabulary and rebuilt language model and again repeating steps d) and f) until the segmented text result is satisfactory.
Dependent claims: 8, 9
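The iterative loop recited in claim 7 can be sketched in Python as below. The claim does not say how the result is tested, how the vocabulary is refined, or what data the language model is rebuilt from, so the sketch takes those steps as caller-supplied functions (`segment`, `satisfactory`, `refine`, all hypothetical names), assumes a unigram model rebuilt from the most recent segmentation, and adds a `max_rounds` guard that is not part of the claim.

```python
from collections import Counter


def build_language_model(segmented_lines):
    """Assumed rebuild step: unigram relative frequencies from segmented text."""
    counts = Counter(word for line in segmented_lines for word in line)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def iterative_segmentation(text_lines, vocab, lm, segment, satisfactory, refine,
                           max_rounds=10):
    """Segment, test, and refine until the result is judged satisfactory.

    segment(line, vocab, lm) -> list of words      (step c)
    satisfactory(segmented)  -> bool               (step d)
    refine(vocab, segmented) -> refined vocabulary (step f)
    """
    segmented = []
    for _ in range(max_rounds):  # guard against a loop that never converges
        segmented = [segment(line, vocab, lm) for line in text_lines]
        if satisfactory(segmented):
            break                                  # step e: output the result
        vocab = refine(vocab, segmented)           # step f: refine the vocabulary
        lm = build_language_model(segmented)       # step f: rebuild the model
    return segmented, vocab, lm
```

Either of the segmenters recited in claims 1 and 4 could be passed as `segment`, provided it is wrapped to accept the vocabulary and language model as arguments.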
Specification