Method and system for natural language dictionary generation
Abstract
A method and computer system for analyzing a text corpus in a natural language are provided. A linguist creates an initial morphological description containing word inflection rules for various groups of words in the natural language. A plurality of text corpuses is analyzed to obtain information on the occurrence of word forms for each word token in each text corpus. A morphological dictionary is then generated that contains information about the base form and word inflection rules for each word token with a verified hypothesis.
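As a rough illustration of the two inputs the abstract mentions, the sketch below models an inflection rule group as a small record and counts word-form occurrences per corpus. The `Paradigm` class, the two toy English paradigms, and `count_word_forms` are hypothetical names chosen for this example; the patent text shown here does not prescribe any particular data layout.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Paradigm:
    """Hypothetical record for one group of word inflection rules."""
    name: str
    part_of_speech: str
    endings: tuple  # endings appended to the base form; "" yields the base form

    def forms(self, base):
        """Return every inflected form this paradigm predicts for a base form."""
        return [base + ending for ending in self.endings]


# Toy English paradigms, assumed for illustration only.
NOUN_S = Paradigm("noun-s", "noun", ("", "s"))
VERB_REGULAR = Paradigm("verb-regular", "verb", ("", "s", "ed", "ing"))


def count_word_forms(corpora):
    """Count lowercase word-form occurrences separately for each corpus."""
    return [Counter(re.findall(r"[a-z]+", text.lower())) for text in corpora]
```

Calling `VERB_REGULAR.forms("walk")` yields the four forms a later step would look for in the per-corpus counts.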
30 Claims
1. A method for a computer system to create a morphological dictionary for a natural language, the method comprising:
identifying a word token in a text corpus;
applying by the computer system one or more paradigm rules to the word token;
generating by the computer system one or more hypotheses about a part of speech for a base form of the word token;
searching by the computer system for one or more word inflected forms corresponding to the base form of the word token;
verifying by the computer system a hypothesis of the one or more hypotheses for the base form of the word token;
adding by the computer system at least one grammatical value and at least one inflection paradigm to the base form of the word token based at least in part on the verified hypothesis;
obtaining by the computer system one or more morphological descriptions for the word token based at least in part on the verified hypothesis; and
adding the base form of the word token with the one or more morphological descriptions to the morphological dictionary of the natural language.
Dependent claims: 2-17.

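The steps of claim 1 can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the patented implementation: the `PARADIGMS` table, the ending-stripping used to propose base forms, and the acceptance test (at least `min_forms` predicted forms attested in the corpus) are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical toy paradigm rules: name -> (part of speech, endings added to the base form).
PARADIGMS = {
    "noun-s": ("noun", ["", "s"]),
    "verb-regular": ("verb", ["", "s", "ed", "ing"]),
}


def generate_hypotheses(token):
    """Apply each paradigm rule to the token to propose (base form, part of speech, paradigm)."""
    hypotheses = []
    for name, (pos, endings) in PARADIGMS.items():
        for ending in endings:
            if not token.endswith(ending):
                continue
            base = token[: len(token) - len(ending)]
            if base:
                hypotheses.append((base, pos, name))
    return hypotheses


def attested_forms(base, paradigm, counts):
    """Search the corpus counts for inflected forms predicted from the base form."""
    return [base + e for e in PARADIGMS[paradigm][1] if counts[base + e] > 0]


def build_dictionary(text, min_forms=2):
    """Identify tokens, test hypotheses, and record verified base forms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    dictionary = {}
    for token in tokens:
        hypotheses = generate_hypotheses(token)
        if not hypotheses:
            continue
        # Assumed verification rule: keep the hypothesis with the most attested forms.
        best = max(hypotheses, key=lambda h: len(attested_forms(h[0], h[2], counts)))
        forms = attested_forms(best[0], best[2], counts)
        if len(forms) >= min_forms:
            base, pos, paradigm = best
            dictionary[base] = {"part_of_speech": pos, "paradigm": paradigm, "forms": forms}
    return dictionary


entries = build_dictionary("She walked home. He walks daily. They walk and keep walking.")
```

In this toy corpus only `walk` gathers enough attested forms to be recorded, and the verb hypothesis wins over the noun hypothesis because four of its predicted forms occur rather than two.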
18. A method for a computer system to generate a morphological dictionary for a natural language, the method comprising:
creating by the computer system an initial morphological description having word inflection rules for groups of words in the natural language;
analyzing by the computer system a plurality of text corpuses in the natural language, wherein the analyzing includes:
  identifying a word token in the plurality of text corpuses;
  applying one or more paradigm rules to the word token;
  generating one or more hypotheses about one or more parts of speech of a base form of the word token;
  searching for one or more word inflected forms corresponding to the base form of the word token;
  verifying a hypothesis of the one or more hypotheses for the base form of the word token based on ratings;
  adding at least one grammatical value and at least one inflection paradigm to the base form of the word token based at least in part on the verified hypothesis; and
  obtaining one or more morphological descriptions for the word token with a verified hypothesis; and
adding the base form of the word token with the one or more morphological descriptions to the morphological dictionary.
Dependent claims: 19, 20, 21, 22.

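Claim 18 verifies a hypothesis "based on ratings" across a plurality of corpuses, but the claim text here does not say how a rating is computed. The sketch below assumes one simple possibility: a hypothesis earns one point for every predicted form found in each corpus, and the top-rated hypothesis is accepted only if it reaches a threshold. The function names, the scoring rule, and `min_rating` are hypothetical.

```python
import re


def vocabulary(text):
    """Return the set of lowercase word forms that occur in one corpus."""
    return set(re.findall(r"[a-z]+", text.lower()))


def rate_hypotheses(hypotheses, corpora):
    """Rate each hypothesis by how many of its predicted forms each corpus attests.

    hypotheses maps a label to the list of inflected forms that hypothesis predicts.
    """
    vocabularies = [vocabulary(text) for text in corpora]
    return {
        label: sum(len(set(forms) & vocab) for vocab in vocabularies)
        for label, forms in hypotheses.items()
    }


def verify_by_rating(ratings, min_rating=3):
    """Return the top-rated hypothesis label, or None if no rating reaches the threshold."""
    if not ratings:
        return None
    best = max(ratings, key=ratings.get)
    return best if ratings[best] >= min_rating else None


corpora = ["He walks daily and walked home.", "They walk; she is walking."]
hypotheses = {
    "walk (noun)": ["walk", "walks"],
    "walk (verb)": ["walk", "walks", "walked", "walking"],
}
ratings = rate_hypotheses(hypotheses, corpora)
verified = verify_by_rating(ratings)
```

Summing evidence per corpus means a hypothesis supported in several corpuses outranks one supported in a single text, which is one plausible reason to analyze more than one corpus.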
23. A computer readable non-transitory medium comprising instructions for causing a computing system to carry out operations for analyzing a text corpus in a natural language, the operations comprising:
identifying a word token in the text corpus;
applying one or more paradigm rules to the word token;
generating one or more hypotheses about a part of speech of a base form of the word token;
searching for one or more word inflected forms corresponding to the base form of the word token;
verifying a hypothesis of the one or more hypotheses for the base form of the word token;
adding at least one grammatical value and at least one inflection paradigm to the base form of the word token based at least in part on the verified hypothesis; and
obtaining one or more morphological descriptions for the word token based at least in part on the verified hypothesis.
Dependent claims: 24.

25. A computer readable non-transitory medium comprising instructions for causing a computing system to carry out operations to generate a morphological dictionary for a natural language, the instructions comprising:
analyzing a plurality of text corpuses in the natural language, wherein the analyzing includes:
  identifying a word token in the plurality of text corpuses;
  applying one or more paradigm rules to the word token;
  generating one or more hypotheses based in part on a part of speech of a base form of the word token;
  searching for one or more word inflected forms corresponding to the base form of the word token;
  verifying a hypothesis of the one or more hypotheses for the base form of the word token based on ratings;
  adding at least one grammatical value and at least one inflection paradigm to the base form of the word token; and
  obtaining one or more morphological descriptions for the word token based at least in part on the verified hypothesis; and
adding the base form of the word token with the one or more morphological descriptions to the morphological dictionary.
Dependent claims: 26.

27. A system for capturing data from a document image, the system comprising:
a processor; and
a memory coupled to the processor and in electronic communication with the imaging component, the memory configured with instructions for causing the processor to:
  identify a word token in a text corpus;
  apply by the computer system one or more paradigm rules to the word token;
  generate by the computer system one or more hypotheses about a part of speech for a base form of the word token;
  search by the computer system for one or more word inflected forms corresponding to the base form of the word token;
  verify by the computer system a hypothesis of the one or more hypotheses for the base form of the word token;
  add by the computer system at least one grammatical value and at least one inflection paradigm to the base form of the word token based at least in part on the verified hypothesis;
  obtain by the computer system one or more morphological descriptions for the word token based at least in part on the verified hypothesis; and
  add the base form of the word token with the one or more morphological descriptions to the morphological dictionary.
Dependent claims: 28, 29, 30.

Specification