
System and method for disambiguating non diacritized Arabic words in a text

  • US 8,041,559 B2
  • Filed: 12/09/2005
  • Issued: 10/18/2011
  • Est. Priority Date: 12/10/2004
  • Status: Active Grant
First Claim

1. A method in a specific language, for lexically disambiguating non diacritized words in a text and restoring vowels, said method comprising the steps of:

  • automatically generating, by a lexicon generation sub-system, a domain specific lexicon based on a diacritized training corpus pertaining to a specific domain, including, for each word in the diacritized training corpus:

    removing the diacritics from the word;

    obtaining all possible valid vowelization patterns for the word, each vowelization pattern belonging to a different stem;

    selecting from the obtained vowelization patterns, the vowelization pattern that matches the vowelization pattern of the word before the diacritics have been removed from the word;

    identifying in a generic lexicon, a stem associated with the vowelization pattern that matches vowelization pattern of the word; and

  • disambiguating, by a lexicon generation sub-system, non diacritized words in a text pertaining to the specific domain and restoring vowels by means of the previously generated domain specific lexicon thereby converting non diacritized words in the text to diacritized words.
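
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the toy `GENERIC_LEXICON`, the stem labels, and the function names are hypothetical, and keeping the most frequent vowelization per bare form is an assumption made here to resolve conflicts, not something the claim states.

```python
import re
from collections import Counter, defaultdict

# Arabic tanwin, short vowels, shadda, sukun (U+064B..U+0652) and superscript alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

def strip_diacritics(word):
    """Remove the diacritics from a word, leaving only its base letters."""
    return DIACRITICS.sub("", word)

# Two vowelizations of the same bare letters (kaf, ta, ba).
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # "books"

# Hypothetical generic lexicon: bare form -> {vowelization pattern: stem label}.
GENERIC_LEXICON = {
    strip_diacritics(KATABA): {KATABA: "katab (verb)", KUTUB: "kutub (noun)"},
}

def build_domain_lexicon(corpus_words, generic_lexicon):
    """Build a bare form -> (vowelized form, stem) map from a diacritized corpus."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        bare = strip_diacritics(word)
        # All valid vowelization patterns known for this bare form.
        candidates = generic_lexicon.get(bare, {})
        # Keep the pattern that matches the word as it appeared in the corpus.
        if word in candidates:
            counts[bare][(word, candidates[word])] += 1
    # Assumption: the most frequent pattern in the domain wins.
    return {bare: c.most_common(1)[0][0] for bare, c in counts.items()}

def restore_vowels(text_words, domain_lexicon):
    """Replace each known bare word with its domain-specific vowelized form."""
    return [domain_lexicon[w][0] if w in domain_lexicon else w for w in text_words]

domain_lexicon = build_domain_lexicon([KUTUB, KUTUB, KATABA], GENERIC_LEXICON)
restored = restore_vowels([strip_diacritics(KUTUB)], domain_lexicon)
```

In this toy corpus the "books" reading occurs more often than the "he wrote" reading, so the bare form is restored to the noun. Words absent from the domain lexicon are passed through unchanged.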
