Method and apparatus for efficient segmentation of compound words using probabilistic breakpoint traversal
Abstract
A method for segmenting a compound word in an unrestricted natural-language input is disclosed. The method comprises receiving a natural-language input consisting of a plurality of characters. Next, a set of probabilistic breakpoints based on a probabilistic breakpoint analysis is constructed in the natural-language input. A plurality of linkable components is identified by traversal of substrings of the natural-language input delimited by the set of probabilistic breakpoints. Finally, a segmented string consisting of a plurality of linkable components spanning the natural-language input is returned. The segmented string can be interpreted as a compound word.
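The steps summarized in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the bigram-keyed `boundary_prob` table, the `threshold` value, the `lexicon` of linkable components, and the longest-first depth-first traversal are all assumptions introduced here for illustration.

```python
from typing import Dict, List, Optional, Set


def find_breakpoints(text: str, boundary_prob: Dict[str, float],
                     threshold: float = 0.5) -> List[int]:
    """Return candidate split positions, always including 0 and len(text).

    Assumption: position i is scored by the two characters that straddle it.
    """
    inner = [i for i in range(1, len(text))
             if boundary_prob.get(text[i - 1:i + 1], 0.0) >= threshold]
    return [0] + inner + [len(text)]


def segment(text: str, breakpoints: List[int],
            lexicon: Set[str]) -> Optional[List[str]]:
    """Traverse substrings delimited by breakpoints (depth-first, longest
    piece first) and return the first sequence of lexicon entries that
    spans the whole input, or None if no such sequence exists."""

    def walk(start_idx: int) -> Optional[List[str]]:
        if start_idx == len(breakpoints) - 1:
            return []  # reached the end of the input
        for end_idx in range(len(breakpoints) - 1, start_idx, -1):
            piece = text[breakpoints[start_idx]:breakpoints[end_idx]]
            if piece in lexicon:
                rest = walk(end_idx)
                if rest is not None:
                    return [piece] + rest
        return None

    return walk(0)


# Hypothetical data for the German compound "apfelbaum" (apple tree).
word = "apfelbaum"
points = find_breakpoints(word, {"lb": 0.9, "fe": 0.6})
parts = segment(word, points, {"apfel", "baum"})
```

Restricting the traversal to substrings that begin and end at a breakpoint is what keeps the search small: only pairs of breakpoints are tested against the lexicon, not every possible substring of the input.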
9 Claims
1. A method for segmenting compound words in an unrestricted natural-language input, the method comprising:
receiving a natural-language input consisting of a plurality of characters;
constructing a set of probabilistic breakpoints in the natural-language input based on probabilistic breakpoint analysis;
identifying a plurality of linkable components by traversal of substrings of the natural-language input delimited by the set of probabilistic breakpoints; and
returning a segmented string consisting of a plurality of linkable components spanning the natural-language input, wherein the segmented string is interpretable as a compound word.
3. An apparatus for segmenting compound words in a natural-language input, the apparatus comprising:
a startpoint probability matrix;
an endpoint probability matrix;
a probabilistic breakpoint analyzer coupled to the startpoint probability matrix, the endpoint probability matrix and the natural-language input, the probabilistic breakpoint analyzer being operative to generate a breakpoint-annotated input from the natural-language input; and
a probabilistic breakpoint processor coupled to the probabilistic breakpoint analyzer, the probabilistic breakpoint processor being operative to generate a segmented string for the compound words in the natural-language input in response to the breakpoint-annotated input.
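The apparatus elements recited above could be wired together roughly as follows. This is a hedged sketch only: the claim does not say how the startpoint and endpoint probability matrices are indexed or combined, so the character-pair keys, the product score, the `threshold`, and the class names `BreakpointAnalyzer` and `BreakpointProcessor` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical layout: (char, char) -> probability.
Matrix = Dict[Tuple[str, str], float]


@dataclass
class BreakpointAnalyzer:
    """Coupled to both matrices; produces a breakpoint-annotated input."""
    start_matrix: Matrix  # assumed: P(a component starts with this pair)
    end_matrix: Matrix    # assumed: P(a component ends with this pair)

    def annotate(self, text: str) -> List[Tuple[int, float]]:
        scores = []
        # Score each position that has two characters on either side.
        for i in range(2, len(text) - 1):
            p_end = self.end_matrix.get((text[i - 2], text[i - 1]), 0.0)
            p_start = self.start_matrix.get((text[i], text[i + 1]), 0.0)
            scores.append((i, p_end * p_start))
        return scores


@dataclass
class BreakpointProcessor:
    """Coupled to the analyzer; turns annotated input into segments."""
    analyzer: BreakpointAnalyzer
    threshold: float = 0.25  # assumed cut-off, not taken from the claim

    def segment(self, text: str) -> List[str]:
        annotated = self.analyzer.annotate(text)
        cuts = [0] + [i for i, p in annotated if p >= self.threshold]
        cuts.append(len(text))
        return [text[a:b] for a, b in zip(cuts, cuts[1:])]


# Hypothetical matrices for the German compound "apfelbaum".
analyzer = BreakpointAnalyzer(start_matrix={("b", "a"): 0.9},
                              end_matrix={("e", "l"): 0.8})
processor = BreakpointProcessor(analyzer)
pieces = processor.segment("apfelbaum")
```

In this sketch the processor simply cuts at every position whose score clears the threshold; a fuller version would check the resulting pieces against a lexicon of linkable components before accepting a segmentation.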
Specification