Multi-lingual word hyphenation using inductive machine learning on training data
First Claim
1. A method comprising:
receiving training data at a server computer system that include a plurality of hyphenated words;
inductively generating hyphenation patterns at the server computer system from the received training data, via machine learning, comprising substrings occurring within the plurality of hyphenated words, and hyphenation codes identifying hyphenation points within the hyphenation patterns and associated respectively with characters occurring in the substrings, the hyphenation codes each corresponding to an action to be performed after a given input character occurs within a word, the action being selected from at least one of deleting a letter before the hyphen, adding a specified letter before the hyphen, changing the letter before the hyphen to some specified letter, changing a letter after the hyphen to some other specified letter, or deleting the letter before the hyphen and changing the letter after the hyphen to another specified letter;
building at least one data structure at the server computer system comprising representations of the individual characters appearing within the substrings, wherein the data structure indicates how many times the substrings occur with the training data;
receiving, at the server computer system, at least one induction parameter applicable to generating the hyphenation patterns, wherein the at least one induction parameter specifies a precision applicable to generating the hyphenation patterns, wherein the precision is related to a lower bound on accuracy with which input words not included in the training data may be hyphenated;
storing at least one inductively generated hyphenation pattern at the server computer system comprising the substrings and the hyphenation codes as entries into a language-specific lexicon file; and
receiving at least one request from a client computer system to hyphenate at least one input word occurring in a human language based on the stored at least one inductively generated hyphenation pattern.
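The "building at least one data structure" element can be pictured as a character trie whose nodes carry occurrence counts. The following is a minimal sketch under stated assumptions, not the patented implementation: the names `TrieNode`, `build_count_trie`, `substring_count`, and the `max_len` cap on substring length are hypothetical and do not appear in the claim.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Number of times the substring spelled by the path to this node
    # occurs in the training words (hyphens removed).
    count: int = 0
    children: dict = field(default_factory=dict)


def build_count_trie(hyphenated_words, max_len=3):
    """Insert every substring of up to max_len letters (an assumed cap)."""
    root = TrieNode()
    for hyphenated in hyphenated_words:
        word = hyphenated.replace("-", "")
        for start in range(len(word)):
            node = root
            for ch in word[start:start + max_len]:
                node = node.children.setdefault(ch, TrieNode())
                node.count += 1
    return root


def substring_count(root, substring):
    """Return how often substring occurs in the training data (0 if never)."""
    node = root
    for ch in substring:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


trie = build_count_trie(["hy-phen", "hy-brid", "phe-nol"])
print(substring_count(trie, "hy"))
```

Each node represents one individual character, and the count at the end of a path answers "how many times does this substring occur", which is the only property of the data structure that the claim spells out.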
Abstract
Tools and techniques are described for providing multi-lingual word hyphenation using inductive machine learning on training data. Methods provided by these techniques may receive training data that includes hyphenated words, and may inductively generate hyphenation patterns that represent substrings of these words. The hyphenation patterns may include the substrings and hyphenation codes associated with characters occurring in the substrings. The methods may receive induction parameters applicable to generating the hyphenation patterns, and may store the hyphenation patterns into a language-specific lexicon file. These methods may also receive requests to hyphenate input words that occur in a human language, and may evaluate how to process the request based on the language. The methods may search for hyphenation patterns occurring in the input words, with the hyphenation patterns being stored in the lexicon file. Finally, the methods may respond to the request, indicating whether the hyphenation patterns occurred in the input words.
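As an illustration of the flow the abstract describes (hyphenated training words in, patterns out, patterns then applied to input words), here is a deliberately simplified sketch. It assumes that a pattern is the `k` letters on each side of a candidate break point and that the induction parameter is a precision threshold; the function names, the `"left|right"` key format, and the `"."` word-boundary marker are all hypothetical and are not taken from the patent.

```python
from collections import Counter


def split_points(hyphenated):
    """Return (word without hyphens, set of break positions)."""
    points, n = set(), 0
    for ch in hyphenated:
        if ch == "-":
            points.add(n)
        else:
            n += 1
    return hyphenated.replace("-", ""), points


def context_key(word, i, k):
    """Key for the gap before word[i]: k letters left and right of it."""
    padded = "." + word + "."  # "." marks a word boundary (assumption)
    cut = i + 1                # index in padded of the letter after the gap
    return padded[max(0, cut - k):cut] + "|" + padded[cut:cut + k]


def induce_patterns(training, k=2, precision=0.9):
    """Keep contexts whose share of hyphenated occurrences >= precision."""
    hits, totals = Counter(), Counter()
    for hyphenated in training:
        word, points = split_points(hyphenated)
        for i in range(1, len(word)):
            key = context_key(word, i, k)
            totals[key] += 1
            if i in points:
                hits[key] += 1
    return {key for key, n in hits.items() if n / totals[key] >= precision}


def hyphenate(word, patterns, k=2):
    """Insert a hyphen at every gap whose context is a stored pattern."""
    out = []
    for i, ch in enumerate(word):
        if i > 0 and context_key(word, i, k) in patterns:
            out.append("-")
        out.append(ch)
    return "".join(out)


patterns = induce_patterns(["hy-phen", "hy-brid", "ty-phoon"])
print(hyphenate("hybris", patterns))
```

Raising `precision` discards contexts that are hyphenated only some of the time in the training data, which trades coverage for fewer wrong breaks on unseen words. A real system would store the selected patterns in a per-language lexicon file rather than an in-memory set.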
19 Claims
1. A method comprising:
receiving training data at a server computer system that include a plurality of hyphenated words;
inductively generating hyphenation patterns at the server computer system from the received training data, via machine learning, comprising substrings occurring within the plurality of hyphenated words, and hyphenation codes identifying hyphenation points within the hyphenation patterns and associated respectively with characters occurring in the substrings, the hyphenation codes each corresponding to an action to be performed after a given input character occurs within a word, the action being selected from at least one of deleting a letter before the hyphen, adding a specified letter before the hyphen, changing the letter before the hyphen to some specified letter, changing a letter after the hyphen to some other specified letter, or deleting the letter before the hyphen and changing the letter after the hyphen to another specified letter;
building at least one data structure at the server computer system comprising representations of the individual characters appearing within the substrings, wherein the data structure indicates how many times the substrings occur with the training data;
receiving, at the server computer system, at least one induction parameter applicable to generating the hyphenation patterns, wherein the at least one induction parameter specifies a precision applicable to generating the hyphenation patterns, wherein the precision is related to a lower bound on accuracy with which input words not included in the training data may be hyphenated;
storing at least one inductively generated hyphenation pattern at the server computer system comprising the substrings and the hyphenation codes as entries into a language-specific lexicon file; and
receiving at least one request from a client computer system to hyphenate at least one input word occurring in a human language based on the stored at least one inductively generated hyphenation pattern.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. One of an optical storage device, a semiconductor storage device or a magnetic storage device having computer-executable instructions stored thereon which, when executed by a computer, cause the computer to perform a method comprising:
receiving at least one request from a client computer system to hyphenate at least one input word occurring in a human language;
evaluating at a server computer system how to process the request based at least in part on the human language;
searching, at the server computer system, for at least one hyphenation pattern comprising a substring occurring in the at least one input word and at least one hyphenation code identifying hyphenation points within the at least one hyphenation pattern and associated respectively with characters occurring in the substrings, the at least one hyphenation code corresponding to an action to be performed after a given input character occurs within a word, the action being selected from at least one of deleting a letter before the hyphen, adding a specified letter before the hyphen, changing the letter before the hyphen to some specified letter, changing a letter after the hyphen to some other specified letter, or deleting the letter before the hyphen and changing the letter after the hyphen to another specified letter, wherein at least the one hyphenation pattern is stored in a language-specific lexicon file that is created inductively, via machine learning at the server computer system, based on training data; and
responding from the server computer system to the at least one request from the client computer system, wherein the response at least indicates whether the at least one hyphenation pattern occurred in the at least one input word.
Dependent claims: 13, 14, 15, 16, 17
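The five actions that a hyphenation code may select can be made concrete with a small sketch. Everything named here (`Action`, `apply_break`, the parameter names, and the added `PLAIN` case for an ordinary break) is hypothetical, and the sample words are illustrative examples from pre-1996 German and from Dutch spelling rather than from the patent itself.

```python
from enum import Enum


class Action(Enum):
    PLAIN = 0                       # ordinary break, no spelling change
    DELETE_BEFORE = 1               # delete the letter before the hyphen
    ADD_BEFORE = 2                  # add a specified letter before the hyphen
    CHANGE_BEFORE = 3               # change the letter before the hyphen
    CHANGE_AFTER = 4                # change the letter after the hyphen
    DELETE_BEFORE_CHANGE_AFTER = 5  # combine deletion and change


def apply_break(word, pos, action=Action.PLAIN, before_letter="", after_letter=""):
    """Break word before word[pos] and apply the spelling change, if any."""
    left, right = word[:pos], word[pos:]
    if action in (Action.DELETE_BEFORE, Action.DELETE_BEFORE_CHANGE_AFTER):
        left = left[:-1]
    if action is Action.ADD_BEFORE:
        left += before_letter
    if action is Action.CHANGE_BEFORE:
        left = left[:-1] + before_letter
    if action in (Action.CHANGE_AFTER, Action.DELETE_BEFORE_CHANGE_AFTER):
        right = after_letter + right[1:]
    return left + "-" + right


# Pre-1996 German "ck" broke as "k-k".
print(apply_break("Zucker", 3, Action.CHANGE_BEFORE, before_letter="k"))
# Pre-1996 German restored the dropped third "f" at a break.
print(apply_break("Schiffahrt", 5, Action.ADD_BEFORE, before_letter="f"))
# Dutch drops the doubled vowel before the diminutive suffix.
print(apply_break("omaatje", 4, Action.DELETE_BEFORE))
```

Encoding the action alongside the pattern is what lets one lexicon format serve languages whose spelling changes at a line break as well as languages where a break never alters the letters.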
18. A word hyphenation system comprising:
at least one server adapted to:
receive training data that includes a plurality of hyphenated words;
receive at least one induction parameter applicable to generating hyphenation patterns;
based on the at least one induction parameter, inductively generate the hyphenation patterns, from the training data via machine learning, comprising substrings occurring within the plurality of hyphenated words, and hyphenation codes identifying hyphenation points within the hyphenation patterns and associated respectively with characters occurring in the substrings, the hyphenation codes each corresponding to an action to be performed after a given input character occurs within a word, the action being selected from at least one of deleting a letter before the hyphen, adding a specified letter before the hyphen, changing the letter before the hyphen to some specified letter, changing a letter after the hyphen to some other specified letter, or deleting the letter before the hyphen and changing the letter after the hyphen to another specified letter; and
store at least the substrings and the hyphenation codes as entries into a language-specific lexicon file specific to a human language;
wherein the server is further adapted to:
receive at least one request to hyphenate of an input word occurring in the human language;
determine the human language in the context of the at least one request;
evaluate how to process the at least one request based at least in part on the human language;
search for at least one hyphenation pattern occurring in the input word, wherein a substring of at least one hyphenation pattern is stored in the language-specific lexicon file; and
respond to the at least one request, the response at least indicates based in part by determining whether the substring of the at least one hyphenation pattern occurred in the input word,
wherein the at least one server may achieve complete accuracy in hyphenating input words that occur within the training data, and
wherein the at least one server may achieve at least a lower bound on accuracy in hyphenating input words that do not occur in the training data, wherein the lower bound is based on a precision applicable to generating the hyphenation patterns; and
at least one client system adapted to send the request and to receive the response thereto.
Dependent claims: 19
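The request-handling half of the system claim (determine the language, pick the matching lexicon, search for stored substrings in the input word, and report whether any occurred) might be sketched as follows. This is an assumption-laden illustration: `HyphenationResponse`, `handle_request`, the in-memory `lexicons` mapping standing in for per-language lexicon files, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HyphenationResponse:
    language: str
    word: str
    matched_substrings: list = field(default_factory=list)
    pattern_found: bool = False


def handle_request(word, language, lexicons):
    """Route the request to the lexicon for its language and search it."""
    lexicon = lexicons.get(language)
    if lexicon is None:
        # No lexicon for this language: report that nothing matched.
        return HyphenationResponse(language, word)
    lowered = word.lower()
    matches = sorted(s for s in lexicon if s in lowered)
    return HyphenationResponse(language, word, matches, bool(matches))


# Toy stand-ins for language-specific lexicon files (illustrative entries).
lexicons = {"en": {"yph", "ybr"}, "de": {"uck"}}
print(handle_request("hyphen", "en", lexicons))
```

The same word can yield different answers depending on the language attached to the request, which is why the claim has the server determine the language before it searches.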
Specification