Method and system for generating new entries in natural language dictionary
First Claim
1. A computer system to create a new entry in morphological electronic dictionary for a natural language, the computer system comprising:
a processor; and
an electronic memory configured with electronic instructions to cause the computer system to perform steps, the electronic instructions including:
identifying a word token in a text corpus;
applying one or more morphological paradigm rules to the word token to generate one or more hypotheses about a base form of the word token;
generating other word forms for the base form, where the other word forms correspond to the generated one or more hypotheses;
verifying at least one hypothesis of the one or more hypotheses for at least one of the other word forms of the word token;
estimating the at least one hypothesis to get rating scores by checking in the text corpus for the generated other word forms;
identifying a best verified hypothesis, wherein the best verified hypothesis is a verified hypothesis with the highest rating scores;
adding an inflection paradigm and a grammatical value to the base form of the word token based on the best verified hypothesis; and
adding a new entry in a morphological electronic dictionary, the new entry comprising the base form of the word token according to the best verified hypothesis.
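The hypothesis steps recited in this claim can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `PARADIGMS` table, the suffix-only inflection model, the assumption that the base form equals the bare stem, and the scoring rule (number of other generated forms attested in the corpus) are all hypothetical choices made for the example.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical paradigm rules (assumption): each paradigm maps a grammatical
# value to the suffix appended to the stem. The base form is the bare stem.
PARADIGMS = {
    "noun_regular": {"singular": "", "plural": "s"},
    "verb_regular": {"infinitive": "", "3sg": "s", "past": "ed", "gerund": "ing"},
}


@dataclass(frozen=True)
class Hypothesis:
    base_form: str
    paradigm: str
    grammatical_value: str  # grammatical value assumed for the observed token


def generate_hypotheses(token):
    """Apply every paradigm rule to the token and propose candidate base forms."""
    hypotheses = []
    for name, suffixes in PARADIGMS.items():
        for value, suffix in suffixes.items():
            if token.endswith(suffix) and len(token) > len(suffix):
                stem = token[: len(token) - len(suffix)]
                hypotheses.append(Hypothesis(stem, name, value))
    return hypotheses


def other_word_forms(hyp, token):
    """Generate the remaining forms of the paradigm, excluding the token itself."""
    forms = {hyp.base_form + suffix for suffix in PARADIGMS[hyp.paradigm].values()}
    forms.discard(token)
    return forms


def rate(hyp, token, corpus_counts):
    """Rating score (assumption): number of other generated forms found in the corpus."""
    return sum(1 for form in other_word_forms(hyp, token) if corpus_counts.get(form, 0) > 0)


def best_verified_hypothesis(token, corpus_counts):
    """Return the verified hypothesis with the highest score, or None if none is verified."""
    scored = [(rate(h, token, corpus_counts), h) for h in generate_hypotheses(token)]
    verified = [(score, h) for score, h in scored if score > 0]
    if not verified:
        return None
    return max(verified, key=lambda pair: pair[0])[1]


# Usage with an invented word so that no existing dictionary entry is implied.
corpus_text = "she blorked the door while he blorks and they are blorking"
corpus_counts = Counter(corpus_text.split())
best = best_verified_hypothesis("blorked", corpus_counts)
print(best)
```

In this toy corpus, only the hypothesis that treats "blorked" as the past form of a regular verb "blork" has other forms ("blorks", "blorking") attested, so it is the one selected. A real system would need richer paradigm rules than simple suffix concatenation.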
2 Assignments
0 Petitions
Abstract
A method and computer system for analyzing a text corpus in a natural language are provided. A linguist creates an initial morphological description containing word inflection rules for various groups of words in the natural language. A plurality of text corpora are analyzed to obtain information on the occurrence of a plurality of word forms for each word token in each text corpus. A morphological dictionary is then generated that contains, for each word token with a verified hypothesis, information about its base form and its word inflection rules.
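The corpus analysis described in the abstract, gathering occurrence information for word forms across several corpora, might be sketched as below. The tokenizer, the function names, and the corpus labels are assumptions introduced for illustration; the source does not specify how occurrences are collected or stored.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (letters only; a simplifying assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def count_word_forms(corpora):
    """Return per-corpus counts and a merged total for every word form.

    `corpora` maps a hypothetical corpus label to its raw text.
    """
    per_corpus = {name: Counter(tokenize(text)) for name, text in corpora.items()}
    total = Counter()
    for counts in per_corpus.values():
        total.update(counts)
    return per_corpus, total


# Usage: two tiny invented corpora.
corpora = {
    "news": "The cats sat. A cat sat.",
    "blog": "Cats!",
}
per_corpus, total = count_word_forms(corpora)
print(total["cats"], per_corpus["blog"]["cats"])
```

Keeping the per-corpus counts alongside the merged total lets a later verification step check whether a generated word form is attested in one corpus or in several.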
116 Citations
24 Claims
1. A computer system to create a new entry in morphological electronic dictionary for a natural language, the computer system comprising:
a processor; and
an electronic memory configured with electronic instructions to cause the computer system to perform steps, the electronic instructions including:
identifying a word token in a text corpus;
applying one or more morphological paradigm rules to the word token to generate one or more hypotheses about a base form of the word token;
generating other word forms for the base form, where the other word forms correspond to the generated one or more hypotheses;
verifying at least one hypothesis of the one or more hypotheses for at least one of the other word forms of the word token;
estimating the at least one hypothesis to get rating scores by checking in the text corpus for the generated other word forms;
identifying a best verified hypothesis, wherein the best verified hypothesis is a verified hypothesis with the highest rating scores;
adding an inflection paradigm and a grammatical value to the base form of the word token based on the best verified hypothesis; and
adding a new entry in a morphological electronic dictionary, the new entry comprising the base form of the word token according to the best verified hypothesis.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A method for creating a new entry in morphological electronic dictionary for a natural language, using a computer system comprising:
one or more processors; and
an electronic memory;
the method comprising:
identifying a word token in a text corpus;
applying one or more morphological paradigm rules to the word token to generate one or more hypotheses about a base form of the word token;
generating other word forms for the base form, where the other word forms correspond to the generated one or more hypotheses;
verifying at least one hypothesis of the one or more hypotheses for at least one of the other word forms of the word token;
estimating the at least one hypothesis to get rating scores by checking in the text corpus for the generated other word forms;
identifying a best verified hypothesis, wherein the best verified hypothesis is a verified hypothesis with the highest rating scores;
adding an inflection paradigm and a grammatical value to the base form of the word token based on the best verified hypothesis; and
adding a new entry in a morphological electronic dictionary, the new entry comprising the base form of the word token according to the best verified hypothesis.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
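The final two steps of the independent claims, attaching an inflection paradigm and a grammatical value to the base form and then adding the entry to the dictionary, could look something like the sketch below. The `DictionaryEntry` and `MorphologicalDictionary` classes, their field names, and the duplicate check are hypothetical; the claims do not prescribe a storage layout.

```python
from dataclasses import dataclass


@dataclass
class DictionaryEntry:
    base_form: str
    inflection_paradigm: str  # name of the paradigm chosen by the best hypothesis
    grammatical_value: str    # e.g. a part-of-speech label (assumption)


class MorphologicalDictionary:
    """Minimal in-memory dictionary keyed by (base form, paradigm)."""

    def __init__(self):
        self._entries = {}

    def add_entry(self, base_form, inflection_paradigm, grammatical_value):
        """Add a new entry; return False if the same base form and paradigm already exist."""
        key = (base_form, inflection_paradigm)
        if key in self._entries:
            return False
        self._entries[key] = DictionaryEntry(base_form, inflection_paradigm, grammatical_value)
        return True

    def lookup(self, base_form):
        """Return all entries stored for a base form."""
        return [entry for (form, _), entry in self._entries.items() if form == base_form]


# Usage: record the outcome of a (hypothetical) best verified hypothesis.
dictionary = MorphologicalDictionary()
dictionary.add_entry("blork", "verb_regular", "verb")
print(dictionary.lookup("blork"))
```

Keying on both the base form and the paradigm allows homonyms with different inflection patterns to coexist, which is one plausible design choice among several.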
Specification