Paradigm-based morphological text analysis for natural languages
Abstract
A computer method is disclosed for analyzing text by employing a model known as a paradigm, which provides all the inflectional forms of a word. A file structure is created consisting of two components: a list of words (a dictionary), each word of which is associated with a set of paradigm references, and a file of paradigms consisting of grammatical categories paired with their corresponding ending or affix portions (known as desinences) specifying tense, mood, number, gender or other linguistic attributes. A computer method is also disclosed for generating the file structure of the dictionary by generating all forms of the words from a list of standard forms of the words (known as lemmas), a lemma generally being the infinitive of a verb or the singular form of a noun, the lemmas being listed with their corresponding paradigms.
5 Claims
1. In a computer system, a method for generating a file structure for enabling paradigm-based morphological text analysis upon words in an input word stream of natural language text, the computer system including a memory, comprising the steps of:
compiling in said computer system from a starting list of lemmas associated with their corresponding paradigm references, a list of all the word forms generated by each lemma and its paradigm and associating each word form with its paradigm reference;
ordering in said computer system said compiled list of word forms and paradigm references in any desired collating sequence;
consolidating in said computer system duplicate word entries generated from multiple lemma and paradigm combinations and compiling a file structure of unique word forms, each of which is associated with a list of paradigm references, and a list of paradigm tables which may be accessed by these references.
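The three steps of claim 1 (compile, order, consolidate) can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `PARADIGMS`, `LEMMAS`, `generate_forms` and `build_file_structure`, the paradigm references `V_AR` and `N_O`, and the convention that the first entry of each paradigm table is the lemma form are all assumptions made for illustration, using a small unaccented Spanish sample.

```python
from itertools import groupby

# Hypothetical paradigm tables: reference -> list of (category, desinence).
# Assumption: the first entry of each table describes the lemma form.
PARADIGMS = {
    "V_AR": [("verb infinitive", "ar"), ("verb 1sg present", "o"),
             ("verb 3sg present", "a"), ("verb 1pl present", "amos")],
    "N_O": [("noun singular", "o"), ("noun plural", "os")],
}

# Hypothetical starting list of lemmas with their paradigm references.
LEMMAS = [("cantar", "V_AR"), ("canto", "N_O"), ("libro", "N_O")]


def generate_forms(lemma, ref, paradigms):
    """Return (word form, paradigm reference) pairs for one lemma."""
    table = paradigms[ref]
    lemma_desinence = table[0][1]
    if not lemma.endswith(lemma_desinence):
        raise ValueError(f"{lemma!r} does not fit paradigm {ref!r}")
    stem = lemma[:len(lemma) - len(lemma_desinence)]
    return [(stem + desinence, ref) for _, desinence in table]


def build_file_structure(lemmas, paradigms):
    """Compile, order and consolidate word forms into a dictionary list."""
    # Step 1: compile every word form with its paradigm reference.
    compiled = []
    for lemma, ref in lemmas:
        compiled.extend(generate_forms(lemma, ref, paradigms))
    # Step 2: order the list (here: Python's default string ordering).
    compiled.sort()
    # Step 3: consolidate duplicates into one entry per unique word form.
    dictionary = [
        (form, sorted({ref for _, ref in group}))
        for form, group in groupby(compiled, key=lambda pair: pair[0])
    ]
    return dictionary, paradigms


dictionary, paradigm_file = build_file_structure(LEMMAS, PARADIGMS)
for form, refs in dictionary:
    print(form, refs)
```

In this sample the form "canto" is generated both by the verb lemma "cantar" and by the noun lemma "canto", so the consolidated entry carries both paradigm references.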
2. A computer method for matching in a computer system an input word against a dictionary structured as a list of lemmas and their associated paradigms, comprising the steps of:
selecting in said computer system a set of entries in said dictionary to scan based on the collating sequence of the input word;
testing in said computer system the input word against paradigm screens to avoid useless comparisons if a dictionary entry has an associated paradigm;
comparing in said computer system the input word against the words generated by application of the paradigm to the dictionary entry and retrieving associated grammatical information when there is a match.
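The matching steps of claim 2 can be sketched as follows. The claim does not define how entries are selected or what a paradigm screen tests, so both are assumptions here: entries are selected by sharing the first letter of the input word, and the screen is a cheap check that the word ends in at least one desinence of the paradigm. `PARADIGMS`, `LEMMA_DICT`, `candidate_entries`, `passes_screen` and `analyze` are hypothetical names.

```python
from bisect import bisect_left

# Hypothetical paradigm tables; the first entry is assumed to be the lemma form.
PARADIGMS = {
    "V_AR": [("verb infinitive", "ar"), ("verb 1sg present", "o"),
             ("verb 3sg present", "a"), ("verb 1pl present", "amos")],
    "N_O": [("noun singular", "o"), ("noun plural", "os")],
}

# Hypothetical dictionary of (lemma, paradigm reference), kept in sorted order.
LEMMA_DICT = sorted([("cantar", "V_AR"), ("canto", "N_O"), ("libro", "N_O")])


def candidate_entries(word, lemma_dict):
    """Select the slice of entries that share the word's first letter."""
    first = word[:1]
    lo = bisect_left(lemma_dict, (first,))
    hi = bisect_left(lemma_dict, (first + "\uffff",))
    return lemma_dict[lo:hi]


def passes_screen(word, table):
    """Assumed screen: the word must end in some desinence of the paradigm."""
    return any(word.endswith(desinence) for _, desinence in table)


def analyze(word, lemma_dict, paradigms):
    """Return (lemma, paradigm reference, category) for every match."""
    matches = []
    for lemma, ref in candidate_entries(word, lemma_dict):
        table = paradigms[ref]
        if not passes_screen(word, table):
            continue  # skip generating forms that cannot match
        stem = lemma[:len(lemma) - len(table[0][1])]
        for category, desinence in table:
            if stem + desinence == word:
                matches.append((lemma, ref, category))
    return matches


print(analyze("canto", LEMMA_DICT, PARADIGMS))
print(analyze("libros", LEMMA_DICT, PARADIGMS))
```

The screen only prunes work; it never decides a match by itself, so a word that passes the screen is still compared against each generated form.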
3. A computer method for matching in a computer system an input word against a dictionary structured as a list of lemmas and their associated paradigms, comprising the steps of:
selecting in said computer system a set of lemmas to scan;
testing in said computer system the input word against its associated paradigm;
comparing in said computer system the input word against the words generated by application of the paradigm to the lemma and retrieving associated grammatical information when there is a match.
4. A computer method for generating in a computer system the lemma of a word from a file structure consisting of two components, the first component being a dictionary list of words, each word of which is associated with a set of paradigm references and the second component being a file of paradigms consisting of grammatical categories paired with their corresponding desinences, comprising:
matching in said computer system the input word against said dictionary;
accessing in said computer system from said file of paradigms a set of paradigms using the paradigm references from the matched word found in said dictionary;
matching in said computer system the desinences of the paradigm corresponding to the matched word, against the input word;
recording in said computer system the grammatical categories corresponding to each matching desinence and generating the lemma by replacing the matching desinence of the input word with the desinence of the lemma.
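Lemma generation as described in claim 4 can be illustrated with a short Python sketch over the two-component file structure. `DICTIONARY`, `PARADIGMS` and `lemmatize` are hypothetical names, desinence matching is assumed to be a plain suffix test, and the first entry of each paradigm table is assumed to hold the desinence of the lemma.

```python
# Hypothetical first component: word form -> list of paradigm references.
DICTIONARY = {
    "canta": ["V_AR"], "cantamos": ["V_AR"], "cantar": ["V_AR"],
    "canto": ["N_O", "V_AR"], "cantos": ["N_O"],
    "libro": ["N_O"], "libros": ["N_O"],
}

# Hypothetical second component: paradigm reference -> (category, desinence).
# Assumption: the first entry of each table is the lemma form.
PARADIGMS = {
    "V_AR": [("verb infinitive", "ar"), ("verb 1sg present", "o"),
             ("verb 3sg present", "a"), ("verb 1pl present", "amos")],
    "N_O": [("noun singular", "o"), ("noun plural", "os")],
}


def lemmatize(word, dictionary, paradigms):
    """Return (lemma, paradigm reference, category) for each analysis."""
    analyses = []
    # Match the input word against the dictionary.
    for ref in dictionary.get(word, []):
        # Access the paradigm through the reference stored with the word.
        table = paradigms[ref]
        lemma_desinence = table[0][1]
        for category, desinence in table:
            # Match each desinence against the end of the input word.
            if word.endswith(desinence):
                stem = word[:len(word) - len(desinence)]
                # Replace the matched desinence with the lemma's desinence.
                analyses.append((stem + lemma_desinence, ref, category))
    return analyses


print(lemmatize("canto", DICTIONARY, PARADIGMS))
print(lemmatize("cantamos", DICTIONARY, PARADIGMS))
```

A plain suffix test is the simplest reading of the matching step. A paradigm containing an empty desinence would match every word under this test, so a fuller implementation would need an additional check that the sketch omits.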
5. A method for generating in a computer system a specific grammatical form of a word from the lemma and the grammatical category, using a file structure consisting of two components, the first component being a dictionary list of words, each word of which is associated with a set of paradigm references and the second component being a file of paradigms consisting of grammatical categories paired with their corresponding desinences, comprising:
matching in said computer system the lemma against said dictionary;
accessing in said computer system a set of paradigms from said file of paradigms corresponding to the paradigm reference of the matched lemma;
matching in said computer system the desinences of the paradigm against the lemma;
selecting in said computer system the desinence corresponding to the specified grammatical category and generating the specific grammatical form by replacing the desinence of the lemma with the desinence of the desired grammatical form.
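The generation direction of claim 5 can be sketched in the same way. `DICTIONARY`, `PARADIGMS` and `inflect` are hypothetical names, the category labels are illustrative strings, and the sketch assumes that the first entry of each paradigm table holds the desinence of the lemma, so that a paradigm whose lemma desinence does not fit the given word is skipped.

```python
# Hypothetical first component: word form -> list of paradigm references.
DICTIONARY = {
    "cantar": ["V_AR"],
    "canto": ["N_O", "V_AR"],
    "libro": ["N_O"],
}

# Hypothetical second component: paradigm reference -> (category, desinence).
# Assumption: the first entry of each table is the lemma form.
PARADIGMS = {
    "V_AR": [("verb infinitive", "ar"), ("verb 1sg present", "o"),
             ("verb 3sg present", "a"), ("verb 1pl present", "amos")],
    "N_O": [("noun singular", "o"), ("noun plural", "os")],
}


def inflect(lemma, category, dictionary, paradigms):
    """Return every form of the lemma that carries the given category."""
    forms = []
    # Match the lemma against the dictionary and follow its references.
    for ref in dictionary.get(lemma, []):
        table = paradigms[ref]
        lemma_desinence = table[0][1]
        # Match the lemma's desinence; skip paradigms where the word is
        # an inflected form rather than the lemma itself.
        if not lemma.endswith(lemma_desinence):
            continue
        stem = lemma[:len(lemma) - len(lemma_desinence)]
        # Select the desinence for the requested category and attach it.
        for cat, desinence in table:
            if cat == category:
                forms.append(stem + desinence)
    return forms


print(inflect("cantar", "verb 1pl present", DICTIONARY, PARADIGMS))
print(inflect("canto", "noun plural", DICTIONARY, PARADIGMS))
```

The function returns a list because a lemma may be listed under more than one paradigm, and a category that no paradigm of the lemma provides yields an empty list.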
Specification