System and method for disambiguating non diacritized Arabic words in a text
First Claim
1. A method in a specific language, for lexically disambiguating non diacritized words in a text and restoring vowels, said method comprising the steps of:
- automatically generating, by a lexicon generation sub-system, a domain specific lexicon based on a diacritized training corpus pertaining to a specific domain, including, for each word in the diacritized training corpus:
removing the diacritics from the word;
obtaining all possible valid vowelization patterns for the word, each vowelization pattern belonging to a different stem;
selecting from the obtained vowelization patterns, the vowelization pattern that matches the vowelization pattern of the word before the diacritics have been removed from the word;
identifying in a generic lexicon, a stem associated with the vowelization pattern that matches the vowelization pattern of the word; and
disambiguating, by a lexicon generation sub-system, non diacritized words in a text pertaining to the specific domain and restoring vowels by means of the previously generated domain specific lexicon thereby converting non diacritized words in the text to diacritized words.
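The steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The generic lexicon is a hypothetical two-entry dictionary, the stem labels "katab" and "kutub" are placeholders, and the function names are invented for illustration. A vowelization pattern is simplified here to the fully diacritized word form. Where the training corpus contains several vowelizations of the same bare word, the sketch keeps the most frequent one, which is an assumption the claim does not state.

```python
import re
from collections import Counter, defaultdict

# Arabic diacritics: fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(word: str) -> str:
    """Remove the diacritics from a word, leaving only the base letters."""
    return DIACRITICS.sub("", word)


# Two vowelizations of the same three consonants (kaf, teh, beh).
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # "books"
BARE_KTB = strip_diacritics(KATABA)

# Hypothetical generic lexicon: bare word -> {vowelized form: stem label}.
GENERIC_LEXICON = {BARE_KTB: {KATABA: "katab", KUTUB: "kutub"}}


def build_domain_lexicon(corpus, generic):
    """Map each bare word to the (vowelized form, stem) seen in the domain corpus."""
    counts = defaultdict(Counter)
    for word in corpus:
        bare = strip_diacritics(word)        # remove the diacritics
        candidates = generic.get(bare, {})   # all valid vowelization patterns
        if word in candidates:               # pattern matching the original word
            counts[bare][(word, candidates[word])] += 1  # its associated stem
    # Assumption: keep the most frequent vowelization for each bare word.
    return {bare: c.most_common(1)[0][0] for bare, c in counts.items()}


def disambiguate(words, domain_lexicon):
    """Restore vowels for known words; leave unknown words unchanged."""
    return [domain_lexicon[w][0] if w in domain_lexicon else w for w in words]


# A toy "domain" corpus in which the noun reading dominates.
corpus = [KUTUB, KATABA, KUTUB]
domain_lexicon = build_domain_lexicon(corpus, GENERIC_LEXICON)
restored = disambiguate([BARE_KTB], domain_lexicon)
```

In this toy corpus the bare word is restored to the "books" reading because that vowelization is the more frequent of the two in the training data.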
Abstract
The present invention proposes a solution to the problem of lexical disambiguation of words in Arabic texts. The solution is based on domain-specific knowledge of the text, which facilitates the automatic vowel restoration of Modern Standard Arabic scripts. Texts that are similar in content, restricted to a specific field, or share common knowledge can be grouped into a specific category or domain (for example, sport, art, economics, or science). The present invention discloses a method, system and computer program for lexically disambiguating non diacritized Arabic words in a text, based on a learning approach that exploits Arabic lexical look-up and Arabic morphological analysis to train the system on a corpus of diacritized Arabic text pertaining to a specific domain. The contextual relationships of the words related to a specific domain are thereby identified, based on the valid assumption that there is less lexical variability in the use of words and their morphological variants within a domain than in unrestricted text.
15 Citations
10 Claims
Specification