Multi-lingual word hyphenation using inductive machine learning on training data

  • US 8,996,994 B2
  • Filed: 01/16/2008
  • Issued: 03/31/2015
  • Est. Priority Date: 01/16/2008
  • Status: Active Grant
First Claim

1. A method comprising:

  • receiving training data at a server computer system that include a plurality of hyphenated words;

    inductively generating hyphenation patterns at the server computer system from the received training data, via machine learning, comprising substrings occurring within the plurality of hyphenated words, and hyphenation codes identifying hyphenation points within the hyphenation patterns and associated respectively with characters occurring in the substrings, the hyphenation codes each corresponding to an action to be performed after a given input character occurs within a word, the action being selected from at least one of deleting a letter before the hyphen, adding a specified letter before the hyphen, changing the letter before the hyphen to some specified letter, changing a letter after the hyphen to some other specified letter, or deleting the letter before the hyphen and changing the letter after the hyphen to another specified letter;

    building at least one data structure at the server computer system comprising representations of the individual characters appearing within the substrings, wherein the data structure indicates how many times the substrings occur with the training data;

    receiving, at the server computer system, at least one induction parameter applicable to generating the hyphenation patterns, wherein the at least one induction parameter specifies a precision applicable to generating the hyphenation patterns, wherein the precision is related to a lower bound on accuracy with which input words not included in the training data may be hyphenated;

    storing at least one inductively generated hyphenation pattern at the server computer system comprising the substrings and the hyphenation codes as entries into a language-specific lexicon file; and

    receiving at least one request from a client computer system to hyphenate at least one input word occurring in a human language based on the stored at least one inductively generated hyphenation pattern.
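
The counting and precision steps recited in the claim can be illustrated with a small Python sketch. It is not the patented implementation: the function names (`count_patterns`, `select_patterns`, `hyphenate`), the `max_len` window, and the use of a plain dictionary in place of a character-level data structure are all assumptions made for illustration. The hyphenation codes are reduced here to a simple "break at this offset" marker, so the spelling-changing actions (deleting, adding, or changing letters around the hyphen) are not modeled, and the `precision` value is only a ratio measured on the training data, not a guaranteed bound on unseen words.

```python
from collections import defaultdict


def parse_hyphenated(word):
    """Split 'win-dow' into ('window', {3}); a break position is the index
    of the letter that follows the hyphen."""
    letters, breaks = [], set()
    for ch in word:
        if ch == "-":
            breaks.add(len(letters))
        else:
            letters.append(ch)
    return "".join(letters), breaks


def count_patterns(training_words, max_len=4):
    """Count, for every (substring, offset) pair, how often a hyphen falls
    at that offset versus how often the pair occurs at all.
    Returns {(substring, offset): [hyphenated_count, total_count]}."""
    counts = defaultdict(lambda: [0, 0])
    for hyphenated in training_words:
        plain, breaks = parse_hyphenated(hyphenated)
        padded = "." + plain + "."  # '.' marks the word boundaries
        for start in range(len(padded)):
            for length in range(2, max_len + 1):
                sub = padded[start:start + length]
                if len(sub) < length:
                    break
                for offset in range(1, length):
                    pos = start + offset - 1  # position in the unpadded word
                    if 0 < pos < len(plain):
                        entry = counts[(sub, offset)]
                        entry[1] += 1
                        if pos in breaks:
                            entry[0] += 1
    return counts


def select_patterns(counts, precision):
    """Keep patterns whose hyphenated/total ratio meets the precision
    parameter (hypothetical stand-in for the claimed induction parameter)."""
    return {
        key: good / total
        for key, (good, total) in counts.items()
        if good > 0 and good / total >= precision
    }


def hyphenate(word, patterns, max_len=4):
    """Insert a hyphen wherever any stored pattern matches the input word."""
    padded = "." + word + "."
    breaks = set()
    for start in range(len(padded)):
        for length in range(2, max_len + 1):
            sub = padded[start:start + length]
            if len(sub) < length:
                break
            for offset in range(1, length):
                pos = start + offset - 1
                if (sub, offset) in patterns and 0 < pos < len(word):
                    breaks.add(pos)
    return "".join(("-" if i in breaks else "") + ch for i, ch in enumerate(word))


training = ["win-dow", "win-ter", "wind"]
counts = count_patterns(training)
strict = select_patterns(counts, precision=1.0)
print(hyphenate("winter", strict))  # win-ter
print(hyphenate("wind", strict))    # wind
```

In this toy data the substring "nd" is hyphenated in "win-dow" but not in "wind", so its ratio is 0.5 and it is dropped at a precision of 1.0. Lowering the precision to 0.5 would admit it and wrongly split "wind", which is the trade-off the induction parameter is meant to control.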
