Method and apparatus for efficient morphological text analysis using a high-level language for compact specification of inflectional paradigms
First Claim
1. A computer implemented method for performing morphological analysis in a computer system by describing inflectional operations on a natural language, the computer system comprising an input device, a memory and a computer processor capable of manipulating a data structure, the natural language comprising sequences of surface strings encoding grammatical properties, the computer implemented method comprising the steps of:
a. providing a language syntax that defines statement forms for statements that describe the inflectional morphology of the natural language;
b. accepting as input at the input device, which is coupled to the computer processor, a set of language statements that follow said language syntax to describe the inflectional morphology of the natural language, said input comprising rule statements;
i. defining a set of morpho-syntactic features corresponding to grammatical distinctions within the parts of speech categories in the natural language;
ii. defining a set of inflectional morphological paradigms, said inflectional morphological paradigms comprising form rule statements to describe the construction of word forms and associate with each construction pre-selected ones of said morpho-syntactic features, said form rule statements comprising stem and affix components that describe the inflectional morphology corresponding to grammatical construction rules commonly known in the natural language;
iii. defining a lexicon, said lexicon containing a set of word entries containing representative forms of each word of a language that are associated to a pre-selected one of said inflectional morphological paradigms;
c. creating a computer-manipulatable data structure from the set of language statements; and
d. performing a morphological analysis utilizing said computer-manipulatable data structure.
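The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the paradigm names, feature labels, the English examples, and the helpers `compile_index` and `analyze` are all hypothetical, and the "computer-manipulatable data structure" is assumed here to be a plain dictionary from surface forms to analyses.

```python
from collections import defaultdict

# Hypothetical paradigms: each form rule pairs a suffix with the
# morpho-syntactic features that the resulting word form carries.
PARADIGMS = {
    "regular_verb": [
        ("", {"pos": "verb", "tense": "present"}),
        ("s", {"pos": "verb", "tense": "present", "person": "3sg"}),
        ("ed", {"pos": "verb", "tense": "past"}),
    ],
    "regular_noun": [
        ("", {"pos": "noun", "number": "singular"}),
        ("s", {"pos": "noun", "number": "plural"}),
    ],
}

# Hypothetical lexicon: representative form of each word plus its paradigm.
LEXICON = [
    ("walk", "regular_verb"),
    ("walk", "regular_noun"),
    ("book", "regular_noun"),
]


def compile_index(paradigms, lexicon):
    """Step c (assumed form): expand every lexicon entry through its
    paradigm and index the resulting surface forms."""
    index = defaultdict(list)
    for stem, paradigm_name in lexicon:
        for suffix, features in paradigms[paradigm_name]:
            index[stem + suffix].append((stem, features))
    return dict(index)


def analyze(index, word):
    """Step d: return every (stem, features) analysis of a surface form."""
    return index.get(word, [])


INDEX = compile_index(PARADIGMS, LEXICON)
```

Calling `analyze(INDEX, "walks")` yields two analyses in this sketch, one verbal and one nominal, which shows how a single surface string can encode different grammatical properties.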
3 Assignments
0 Petitions
Abstract
A language syntax defines statement forms for statements that describe the inflectional morphology of the natural language. A set of language statements that follow the language syntax to describe the inflectional morphology of the natural language is accepted as input to a computer. Rule statements define a set of morpho-syntactic features corresponding to grammatical distinctions within the parts of speech categories in the natural language and define a set of inflectional morphological paradigms. The inflectional morphological paradigms include form rule statements to describe the construction of word forms and associate with each construction pre-selected morpho-syntactic features.
95 Citations
18 Claims
16. For use in computer-based morphological text analysis of natural languages, a computer implemented method for creating a data structure for computer-based generating and recognizing of word forms in a natural language, the method comprising the steps of:
a. providing a morphological description of a natural language stored in a memory device, said description comprising statements in a morphological description language, said morphological description language comprising statements arranged according to a predetermined syntax, said syntax permitting the specification of inflectional morphologic paradigms, said morphologic paradigms comprising form rules including surface form rules and intermediate form rules, said form rules comprising a left-hand-side identifier and a right-hand-side specifying a word stem and the concatenation or removal of an affix, including a prefix or a suffix, said stems and affixes comprising strings of characters, said syntax capable of specifying that the form rules of one paradigm are inherited by another paradigm, said syntax permitting the stem of any form rule that does not contain a form set reference to be the left-hand-side identifier of another form rule, said syntax permitting the stem in a form rule to be a reference to a string in the lexicon, said syntax permitting a form set variable to identify a plurality of left-hand-side form rule identifiers and the form set variable to be used as the stem in the right hand side of a form rule, said syntax permitting an affix set variable to identify a set of affix strings with the affix set variable being used as an affix in a right-hand-side of a form rule;
b. disambiguating the stem components of the right-hand-sides of the form rules in each paradigm, said disambiguation process comprising the steps of;
i. determining in each form rule whether the stem component is an identifier of another form rule;
ii. replacing each stem component that is an identifier with a link to the identified form rule;
c. determining for each paradigm whether there is a declaration stating that the paradigm inherits the form rules of another parent paradigm;
d. creating form rules for the paradigms that will inherit the form rule from a parent paradigm by sharing references to the rules of the parent paradigm;
e. replacing in the memory device, for each form rule that contains a right-hand-side reference to a form set, the form rule with a set of form rules, one for each form in the corresponding form set, each created form rule corresponding to the form set rule containing the right-hand-side reference to the form set;
f. checking each surface form for cycles, said cycle check process comprising the steps of;
i. creating a cycle check list initialized to empty;
ii. locating a surface form rule;
iii. checking stem components on the right-hand-side to determine if the stem is an identifier to another form rule;
iv. comparing the stem that is an identifier of another form rule to the entries on the cycle check list;
v. adding the stem that is an identifier to the cycle check list unless the identifier is included in the cycle check list;
vi. checking the form rule referenced by the identifier for cycles;
g. providing a set of orthographic rules;
h. conflating the set of orthographic rules, said process of conflation comprising the steps of;
i. finding the set of form rules that match an orthographic rule in the set of orthographic rules in terms of operator type, affix and affix type;
ii. creating an inner form rule variant, the form rule variant comprising the stem form rule from the RHS of the matching form rule as the RHS stem and as the affix, an affix sequence comprising character strings and string variables, indicating the correct context determined by the orthographic rule, and as the operator a minus; and
iii. creating an outer form rule variant, said outer form rule variant comprising a newly created outer form rule as the right-hand-side stem and as the affix, an affix sequence comprising character strings and string variables, indicating the correct spelling as determined by the orthographic rule and as the operator a plus.
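Steps b and f of claim 16 (linking stem identifiers to the rules they name, then checking for cycles) can be sketched as follows. This is an illustrative assumption, not the claimed implementation: the `FormRule` class, the `"LEX"` placeholder for a lexicon stem reference, the helper names, and the Spanish example are all hypothetical, and only suffixes are handled. The cycle check also records the starting rule itself on the check list, a small simplification of the claimed procedure.

```python
from dataclasses import dataclass


@dataclass(eq=False, repr=False)
class FormRule:
    name: str     # left-hand-side identifier
    stem: object  # identifier string, or a FormRule once linked
    op: str       # "+" concatenates the affix, "-" removes it
    affix: str    # suffix string (prefixes omitted in this sketch)


def link_stems(rules):
    """Step b: replace each stem that names another rule with a link to it."""
    for rule in rules.values():
        if isinstance(rule.stem, str) and rule.stem in rules:
            rule.stem = rules[rule.stem]


def has_cycle(rule):
    """Step f: follow stem links while keeping a cycle check list."""
    check_list = []  # initialized to empty
    current = rule
    while isinstance(current, FormRule):
        if current.name in check_list:
            return True  # identifier already seen: the rules are circular
        check_list.append(current.name)
        current = current.stem
    return False


def realize(rule, lexical_stem):
    """Build a word form by applying the linked rules from the inside out."""
    if isinstance(rule.stem, FormRule):
        base = realize(rule.stem, lexical_stem)
    else:
        base = lexical_stem  # the stem refers to a string in the lexicon
    if rule.op == "+":
        return base + rule.affix
    if rule.affix and base.endswith(rule.affix):
        return base[: len(base) - len(rule.affix)]
    return base


# Hypothetical paradigm: an intermediate rule strips "ar", a surface rule adds "o".
RULES = {
    "root": FormRule("root", "LEX", "-", "ar"),
    "pres1sg": FormRule("pres1sg", "root", "+", "o"),
}
link_stems(RULES)
```

With these rules, `realize(RULES["pres1sg"], "hablar")` first removes the suffix `ar` through the intermediate rule and then concatenates `o`. Running the cycle check before any rule is realized keeps the recursion in `realize` from looping forever on a circular description.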
17. For use in performing morphological analysis in a computer system, morphological analysis comprising inflectional operation on words found in a natural language, the computer system comprised of hardware and software elements capable of manipulating a data structure stored in a memory storage device, a method for creating a computer-manipulatable data structure, the method comprising the steps of:
a. providing a syntax for a description of the inflectional morphology of a natural language, said description comprising a set of statements made according to the syntax;
b. accepting as input to the computer system a set of statements for the description of the inflectional morphology, the set of statements specified according to the syntax; and
c. creating a computer-manipulatable data structure, using the set of statements made according to the syntax, the data structure comprising a set of interconnected nodes, said nodes comprising information on the statements of the natural language and the nodes being linked by a hierarchical structure and a plurality of interconnecting references.
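One way to picture the node structure of step c is a tree whose nodes also carry cross-references. The sketch below is an assumption made for illustration only: the `Node` class, the `build` and `find_rule` helpers, the tuple format of the input statements, and the requirement that a parent paradigm be declared before its children are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(eq=False, repr=False)
class Node:
    kind: str                                     # "description", "paradigm" or "form_rule"
    name: str
    data: dict = field(default_factory=dict)      # information from the statement
    children: list = field(default_factory=list)  # hierarchical links
    refs: dict = field(default_factory=dict)      # interconnecting references

    def add(self, child):
        self.children.append(child)
        return child


def build(statements):
    """Turn (paradigm, parent_or_None, [(rule_name, affix), ...]) tuples into a tree.

    Assumes every parent paradigm appears before the paradigms that inherit from it.
    """
    root = Node("description", "root")
    paradigms = {}
    for paradigm_name, parent, rules in statements:
        paradigm = root.add(Node("paradigm", paradigm_name))
        paradigms[paradigm_name] = paradigm
        if parent is not None:
            paradigm.refs["inherits"] = paradigms[parent]  # cross-reference
        for rule_name, affix in rules:
            paradigm.add(Node("form_rule", rule_name, {"affix": affix}))
    return root, paradigms


def find_rule(paradigm, rule_name):
    """Look a rule up locally, then through the chain of inherited paradigms."""
    node = paradigm
    while node is not None:
        for child in node.children:
            if child.name == rule_name:
                return child
        node = node.refs.get("inherits")
    return None


ROOT, PARADIGMS_BY_NAME = build([
    ("verb", None, [("base", ""), ("past", "ed")]),
    ("verb_e", "verb", [("past", "d")]),
])
```

Here `verb_e` overrides only the `past` rule; a lookup of `base` follows the `inherits` reference and returns the very node owned by `verb`, so inherited rules are shared rather than copied.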
18. For use in a computer-based morphological text analyzer used to analyze inflected word forms in a natural language, a data structure coupled to and used in conjunction with morphological text analyzer modules, said data structure stored in a memory storage device and comprising:
a. a computer-manipulatable data structure comprising a hierarchical tree with interconnected nodes, the nodes containing computer-manipulatable information concerning the inflectional morphology of a natural language, wherein the computer-manipulatable data structure is created from a mapping of a high-level description of the inflectional morphology of a natural language;
said high-level language comprising a syntax to provide for the specification of inflectional paradigms;
the mapping being a compiling process to transform a set of statements of the high-level language into the computer-manipulatable data structure; and
b. a lexicon comprising a set of word entries, each word entry comprising inflection information for a word of the natural language.
Specification