Method and apparatus for breaking words in a stream of text

US 6,035,268 A
Filed: 08/21/1997
Issued: 03/07/2000
Est. Priority Date: 08/22/1996
Status: Expired due to Fees

Abstract

A word breaker uses a lexicon module and a processing module to identify word breaks in a stream of Asian-language (e.g., Japanese, Chinese, or Korean) text. The lexicon module is a dictionary or database containing words native to the language of the input text. The processing module includes a plurality of analysis modules that operate on the input text. In particular, the processing module can include modules that analyze the input text using heuristic rules and statistical analysis to identify a first set of word breaks, thereby reducing the size of the segments with undefined word breaks. The processing module also includes a database analysis module that identifies the remaining undefined word breaks in the smaller segments left by the heuristic or statistical analysis.
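
The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the toy `LEXICON`, the code-point ranges in `char_class`, the choice of which class transitions count as breaks, and the greedy longest-match lookup in `dictionary_breaks` are all assumptions made for the example.

```python
# Hypothetical toy lexicon; a real system would load a full dictionary.
LEXICON = {"東京", "大学", "に", "行く", "バス", "で"}

# Character classes whose boundaries are treated as certain breaks (assumed).
HARD = {"digit", "latin", "other"}

def char_class(ch):
    """Coarse character class based on code-point ranges (an assumption)."""
    cp = ord(ch)
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "latin"
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"

def heuristic_segments(text):
    """Stage 1: cut only at transitions the rules are sure about."""
    segments, start = [], 0
    for i in range(1, len(text)):
        a, b = char_class(text[i - 1]), char_class(text[i])
        sure = a in HARD or b in HARD or {a, b} == {"kanji", "katakana"}
        if a != b and sure:
            segments.append(text[start:i])
            start = i
    if text:
        segments.append(text[start:])
    return segments

def dictionary_breaks(segment, lexicon, max_len=4):
    """Stage 2: greedy longest match; unknown characters fall out singly."""
    words, i = [], 0
    while i < len(segment):
        for n in range(min(max_len, len(segment) - i), 0, -1):
            if segment[i:i + n] in lexicon or n == 1:
                words.append(segment[i:i + n])
                i += n
                break
    return words

def break_words(text, lexicon=LEXICON):
    words = []
    for seg in heuristic_segments(text):
        if char_class(seg[0]) in HARD:
            words.append(seg)  # numbers, Roman letters, punctuation stay whole
        else:
            words.extend(dictionary_breaks(seg, lexicon))
    return words
```

The point of the first stage is that the dictionary lookup only ever sees short segments, so the number of candidate character combinations it has to consider stays small.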

41 Claims

1. A method for locating unidentified breaks between words in an input character string formed of a plurality of characters, the method comprising the successive steps of storing said input character string in a computer memory element, identifying at least one morpheme in a first segment of said stored character string, reducing the number of unidentified word breaks in said stored character string by locating a first word break in said first segment of said stored character string based upon said at least one morpheme, said first word break dividing said first segment into a first sub-segment and a second sub-segment, and locating further unidentified word breaks in said first and second sub-segments by comparing said first and second sub-segments to entries in a dictionary.
  - 2. The method of claim 1, wherein said reducing step further includes verifying said first word break by matching a word preceding said first word break with a first entry in said dictionary and by matching a word following said first word break with a second entry in said dictionary.
  - 3. The method of claim 1, wherein said identifying step includes the steps of locating word breaks and character-transitions by applying a set of rules to said stored character string to identify said at least one morpheme.
  - 4. The method of claim 3, wherein said applying step further comprises forming a window of successive characters from said stored character string, comparing said window of successive characters to entries in a character-transition table, and identifying said window of successive characters that matches an entry in the character-transition table as said at least one morpheme.
  - 5. The method of claim 4, further comprising the step of decreasing the size of said window of characters if no entries in said character-transition table match said window of successive characters.
  - 6. The method of claim 4, further comprising the step of sliding the window of successive characters across said stored character string if no entries in said character-transition table match said window of successive characters.
  - 7. The method of claim 4, including the step of forming the character-transition table by generating a minimum spanning set of character strings necessary to identify character-transitions.
  - 8. The method of claim 7, wherein the spanning set of character strings includes a plurality of character strings having different lengths.
  - 9. The method of claim 1, wherein said reducing step includes the steps of detecting a first character-transition in said stored character string based upon said at least one morpheme, and locating said first word break as a function of said at least one morpheme and said first character-transition.
  - 10. The method of claim 9, wherein said locating step includes the step of concatenating a first character and a second character together when said first character-transition indicates the existence of a connection between characters.
  - 11. The method of claim 9, wherein said locating step further comprises the step of identifying a break between a first character and a second character when said first character-transition indicates the existence of a break between characters.
  - 12. The method of claim 1, wherein said locating step further comprises the steps of creating a lookup string from characters within said first sub-segment, identifying a dictionary entry that substantially matches said lookup string, and marking a second word break between the matched lookup string and a character that precedes the lookup string and marking a third word break between the matched lookup string and a character that follows the lookup string.
  - 13. A method according to claim 12, further comprising the steps of creating a candidate word list from a dictionary as a function of said lookup string, and wherein said identifying step includes comparing an entry in said candidate word list with said lookup string.
  - 14. The method of claim 12, further comprising the step of validating that the matched lookup string is a word.
  - 15. The method of claim 14, wherein the step of validating the matched lookup string includes selecting an identified word, from the matched lookup string, and comparing said matched lookup string to a dictionary for determining the validity of the identified word.
  - 16. The method of claim 1, further comprising the step, prior to said identifying step, of applying a set of heuristic rules to said stored character string to identify a character-transition in said first segment of said stored character string, said identification of a character-transition reducing the number of possible character combinations forming words in said stored character string.
  - 17. The method of claim 16 further comprising the step of identifying a concatenation between characters in said first segment as a function of said heuristic rules.
  - 18. The method of claim 16 further comprising the step of selecting said heuristic rules for identifying a break between characters in said first segment.
  - 19. The method of claim 16, wherein said step of applying the set of heuristic rules further comprises locating a number in said stored character string, and identifying a character-transition that precedes and a character-transition that follows said located number.
  - 20. The method of claim 16, wherein said step of applying the set of heuristic rules further comprises locating identifying punctuation in said stored character string, and identifying a character-transition that precedes and a character-transition that follows said located punctuation.
  - 21. The method of claim 16, wherein said step of applying the set of heuristic rules further comprises locating identifying Roman letters in said stored character string, and identifying a character-transition that precedes and a character-transition that follows said located Roman letters.
  - 22. The method of claim 16, wherein said step of applying the set of heuristic rules further comprises locating identifying classifiers in said stored character string; and identifying a character-transition that precedes and a character-transition that follows said located classifiers.
  - 23. The method of claim 16, wherein said step of applying the set of heuristic rules further comprises locating identifying particles in said stored character string, and identifying a character-transition that precedes and a character-transition that follows said located particles.
  - 24. The method of claim 16, wherein said step of applying the set of heuristic rules further comprises locating identifying honorific prefixes in said stored character string, and identifying a character-transition that precedes and a character-transition that follows said located honorific prefixes.
  - 25. The method of claim 16, wherein said step of applying the set of heuristic rules further comprises locating an identifying emperor year in said stored character string, and identifying a character-transition that precedes and a character-transition that follows said located emperor year.
  - 26. The method of claim 16, wherein said step of applying the set of heuristic rules further comprises locating identifying Kanji-Katakana character-transitions in said stored character string, and identifying a character-transition that occurs at said located Kanji-Katakana character-transition.
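
The windowing steps recited in claims 4 to 6 (form a window, compare it to a character-transition table, shrink the window on a miss, then slide it) might look something like the sketch below. The table format is an assumption: here each entry maps a character string to the offsets, relative to the start of the string, at which a break is taken to occur. The entries themselves and the names `TRANSITION_TABLE`, `find_breaks`, and `split_at` are hypothetical.

```python
# Hypothetical character-transition table: string -> break offsets within it
# (0 means "break before the first character", len(s) "after the last").
TRANSITION_TABLE = {
    "ました": (0, 3),
    "を": (0, 1),
    "は": (0, 1),
}

def find_breaks(text, table=TRANSITION_TABLE, max_window=3):
    """Return sorted interior break positions found via window matching."""
    breaks = set()
    i = 0
    while i < len(text):
        size = min(max_window, len(text) - i)
        # Shrink the window until it matches a table entry or is empty.
        while size > 0 and text[i:i + size] not in table:
            size -= 1
        if size == 0:
            i += 1  # nothing matched at this position: slide the window
            continue
        breaks.update(i + off for off in table[text[i:i + size]])
        i += size  # continue after the matched string
    # Positions 0 and len(text) are the string boundaries, not new breaks.
    return sorted(b for b in breaks if 0 < b < len(text))

def split_at(text, breaks):
    """Cut text at the given break positions."""
    cuts = [0, *breaks, len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]
```

For example, `split_at("本を読みました", find_breaks("本を読みました"))` cuts the string around the two table entries it contains and leaves the unmatched stretches as segments for a later dictionary pass.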

27. A programmable computer an apparatus for locating unidentified breaks between words in an input character string, comprising A) a computer memory element for storing the input character string, B) first memory means for storing a character-transition table including character segments of morphemes, C) second memory means for storing a dictionary, said dictionary including lexical entries, D) a statistical analysis module operably coupled with said first memory means storing character-transition table for reducing the number of unidentified word breaks by locating a first word break in a first segment of said input character string as a function of at least one statistical morpheme in said first segment, said first word break dividing said first segment into a first sub-segment and a second sub-segment, and E) a database analysis module operably coupled with said dictionary for locating substantially all of the remaining unidentified word breaks in said first and second sub-segments by comparing said first and second sub-segments with entries in said dictionary.
  - 28. The apparatus of claim 27, wherein said statistical analysis module further comprises first processing means for identifying said at least one statistical morpheme in said first segment by comparing said first segment with entries in said character-transition table and for detecting a character-transition associated with said at least one statistical morpheme, and second processing means for locating a first word break in said first segment as a function of said at least one statistical morpheme and said character-transition.
  - 29. The apparatus of claim 28, wherein said first processing means further comprises a windowing module for forming a window of successive characters from said first segment such that said window of characters can be compared with entries in said character-transition table.
  - 30. The apparatus of claim 29, wherein said first processor module includes means for sliding said window of successive characters along said first segment of said input character string.
  - 31. The apparatus of claim 29, further comprising means for changing the size of said window of characters.
  - 32. The apparatus of claim 28, further comprising means for associating a character-transition tag with characters in said input string.
  - 33. The apparatus of claim 32, wherein said means for associating a character-transition tag includes means for indicating a concatenation between successive characters.
  - 34. The apparatus of claim 32, wherein said character-transition tag indicates a break between successive characters.
  - 35. The apparatus of claim 27, wherein said database analysis module further comprises:
    - third processing means for identifying a match between said first sub-segment and an entry in said dictionary, and fourth processing means for locating a second word break in said first sub-segment as a function of said matched entry.
  - 36. The apparatus of claim 27, further comprising:
    - a heuristic rule table including a set of heuristic rules, a heuristic rule module operably coupled with said heuristic rule table for identifying a character-transition in said first segment of said stored character string, such that the number of possible character combinations forming words in said stored character string are reduced.
  - 37. The apparatus of claim 27, further comprising a word verification module, operably coupled with said dictionary, for verifying matches between an identified word in said input character string and dictionary entries.
  - 38. The apparatus of claim 27, wherein said character-transition table includes character strings of morphemes that form a minimum spanning set necessary to identify character-transitions in said input character string.
  - 39. The apparatus of claim 38, wherein the spanning set includes a plurality of character strings having different lengths.
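
Claims 32 to 34 describe tagging the transition between successive characters as either a concatenation or a break. One possible representation, offered only as a sketch, keeps one tag per gap between characters and leaves gaps that no module has decided yet as unknown. The `Transition` enum, the `UNKNOWN` state, and the helper names are assumptions introduced for illustration.

```python
from enum import Enum

class Transition(Enum):
    UNKNOWN = "?"   # no module has decided this gap yet (assumed state)
    BREAK = "|"     # a word break lies between the two characters
    CONNECT = "+"   # the two characters belong to the same word

def new_tags(text):
    """One tag per gap between successive characters, all undecided."""
    return [Transition.UNKNOWN] * max(len(text) - 1, 0)

def segments(text, tags):
    """Split at BREAK tags; return (segment, fully_resolved) pairs.

    A segment is fully resolved when none of its internal gaps is UNKNOWN,
    i.e. it needs no further dictionary analysis.
    """
    out, start, resolved = [], 0, True
    for i, tag in enumerate(tags):
        if tag is Transition.BREAK:
            out.append((text[start:i + 1], resolved))
            start, resolved = i + 1, True
        elif tag is Transition.UNKNOWN:
            resolved = False
    if text:
        out.append((text[start:], resolved))
    return out
```

Under this layout, a heuristic or statistical module would set some tags to `BREAK` or `CONNECT`, and a dictionary-based module would then only need to visit the segments still reported as unresolved.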

40. A machine readable data storage medium, comprising means for reducing the number of unidentified word breaks in a character string by locating a first word break in a first segment of said character string as a function of at least one statistical morpheme in said first segment, said first word break dividing said first segment into a first sub-segment and a second sub-segment, and means for locating substantially all of the remaining unidentified word breaks in said first and second sub-segments by comparing said first and second sub-segments with entries in a dictionary of lexical entries.
  - 41. The machine readable data storage medium of claim 40, further comprising a character-transition table including character segments of morphemes.

Current Assignee: Vantage Technology Holdings LLC
Original Assignee: Lernout & Hauspie Speech Products NV (Intel Corporation)
Inventors: Carus, Alwin B.; Wiesner, Michael; Krause, Deborah
Primary Examiner(s): Isen, Forester W.
Assistant Examiner(s): Edouard, Patrick N.
Application Number: US08/915,628
Time in Patent Office: 929 Days
Field of Search: 704/1, 704/9, 704/10, 707/530, 707/531, 707/532, 707/535, 341/28, 382/185
US Class Current: 704/9
CPC Class Codes:
G06F 40/253 Grammatical analysis; Style...
G06F 40/53 Processing of non-Latin tex...
