Method and apparatus for breaking words in a stream of text
First Claim
1. A method for locating unidentified breaks between words in an input character string formed of a plurality of characters, the method comprising the successive steps of storing said input character string in a computer memory element, identifying at least one morpheme in a first segment of said stored character string, reducing the number of unidentified word breaks in said stored character string by locating a first word break in said first segment of said stored character string based upon said at least one morpheme, said first word break dividing said first segment into a first sub-segment and a second sub-segment, and locating further unidentified word breaks in said first and second sub-segments by comparing said first and second sub-segments to entries in a dictionary.
Abstract
A word breaker utilizing a lexicon module and a processing module to identify word breaks in a stream of Asian (e.g. Japanese, Chinese, or Korean) language text. The lexicon module is a dictionary or database containing words native to the language of the input text. The processing module includes a plurality of analysis modules which operate on the input text. In particular, the processing module can include modules that analyze the input text using heuristic rules and statistical analysis to identify a first set of word breaks, thereby reducing the size of segments with undefined word breaks. The processing module also includes a database analysis module that identifies the remaining undefined word breaks in those smaller segments that have undergone heuristic or statistical analysis.
41 Claims
-
1. A method for locating unidentified breaks between words in an input character string formed of a plurality of characters, the method comprising the successive steps of
storing said input character string in a computer memory element,
identifying at least one morpheme in a first segment of said stored character string,
reducing the number of unidentified word breaks in said stored character string by locating a first word break in said first segment of said stored character string based upon said at least one morpheme, said first word break dividing said first segment into a first sub-segment and a second sub-segment, and
locating further unidentified word breaks in said first and second sub-segments by comparing said first and second sub-segments to entries in a dictionary.
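The successive steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the "split just after the earliest known morpheme" rule, and the greedy longest-match dictionary lookup are all assumptions chosen to keep the example small, and the morpheme set and dictionary are toy data.

```python
def split_on_morpheme(segment, morphemes):
    """Return (left, right) split just after the earliest known morpheme, or None.

    Assumption: a break is placed immediately after the morpheme. The claim
    only requires that the break be located "based upon" the morpheme.
    """
    best = None  # (start index, morpheme) of the earliest match found so far
    for m in morphemes:
        i = segment.find(m)
        if i != -1 and (best is None or i < best[0]):
            best = (i, m)
    if best is None:
        return None
    cut = best[0] + len(best[1])
    return segment[:cut], segment[cut:]


def longest_match(segment, dictionary):
    """Greedy longest-match lookup; unknown characters become one-character words."""
    words = []
    i = 0
    max_len = max(map(len, dictionary), default=1)
    while i < len(segment):
        for n in range(min(max_len, len(segment) - i), 0, -1):
            if segment[i:i + n] in dictionary or n == 1:
                words.append(segment[i:i + n])
                i += n
                break
    return words


def break_words(text, morphemes, dictionary):
    """Stage 1: morpheme-based break. Stage 2: dictionary lookup per sub-segment."""
    parts = split_on_morpheme(text, morphemes)
    subsegments = [text] if parts is None else [p for p in parts if p]
    words = []
    for sub in subsegments:
        words.extend(longest_match(sub, dictionary))
    return words


# Toy data (hypothetical): one particle as the morpheme, four dictionary entries.
print(break_words("私は学生です", {"は"}, {"私", "は", "学生", "です"}))
```

The point of the first stage is that the dictionary comparison then runs on shorter sub-segments rather than on the whole input string.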
-
27. A programmable computer apparatus for locating unidentified breaks between words in an input character string, comprising
A) a computer memory element for storing the input character string,
B) first memory means for storing a character-transition table including character segments of morphemes,
C) second memory means for storing a dictionary, said dictionary including lexical entries,
D) a statistical analysis module operably coupled with said first memory means storing character-transition table for reducing the number of unidentified word breaks by locating a first word break in a first segment of said input character string as a function of at least one statistical morpheme in said first segment, said first word break dividing said first segment into a first sub-segment and a second sub-segment, and
E) a database analysis module operably coupled with said dictionary for locating substantially all of the remaining unidentified word breaks in said first and second sub-segments by comparing said first and second sub-segments with entries in said dictionary.
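As an illustration of element D, the following Python sketch shows one way a statistical module might consult a character-transition table to place a first word break. The table layout (two-character strings mapped to a break score), the threshold, and the function name are hypothetical; the claim text here does not specify how the table is organized or how scores are derived.

```python
def first_statistical_break(segment, transition_table, threshold=0.5):
    """Return the index of the highest-scoring break above threshold, or None.

    transition_table (assumed layout): maps a two-character string "xy" to a
    score for a word break falling between x and y.
    """
    best_pos, best_score = None, threshold
    for i in range(1, len(segment)):
        score = transition_table.get(segment[i - 1:i + 1], 0.0)
        if score > best_score:
            best_pos, best_score = i, score
    return best_pos


# Toy table with made-up scores, using Latin letters for readability.
table = {"ec": 0.9, "ts": 0.8}
segment = "thecatsat"
pos = first_statistical_break(segment, table)
if pos is not None:
    first_sub, second_sub = segment[:pos], segment[pos:]
    print(first_sub, second_sub)
```

The two resulting sub-segments would then be passed to the database analysis module of element E for comparison against the dictionary.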
-
40. A machine readable data storage medium, comprising
means for reducing the number of unidentified word breaks in a character string by locating a first word break in a first segment of said character string as a function of at least one statistical morpheme in said first segment, said first word break dividing said first segment into a first sub-segment and a second sub-segment, and
means for locating substantially all of the remaining unidentified word breaks in said first and second sub-segments by comparing said first and second sub-segments with entries in a dictionary of lexical entries.
Specification