Method of text processing
First Claim
1. A method of text processing, comprising the steps of:
a) receiving at least one textual unit, where each textual unit includes at least one sub-textual unit;
b) selecting a language;
c) if a plurality of textual units are received, and said plurality of textual units do not include delimiters, then inserting delimiters between adjacent textual units;
d) selecting one of said delimited textual units;
e) if the selected language is selected from the group of languages consisting of Russian, Somali, user-definable language, then;
i) getting a morphology of the selected textual unit from a corresponding look-up table;
ii) outputting the corresponding input in the look-up table; and
iii) if there are any unprocessed textual units from the textual units received in step (a), selecting one of said unprocessed textual units and returning to substep (i), otherwise stopping;
f) setting a value n equal to the total number of sub-textual units in the selected textual unit and setting a value s equal to n;
g) setting a test-suffix equal to the rightmost s sub-textual units in said selected textual unit and setting stem equal to n − s leftmost sub-textual units in said selected textual unit;
h) comparing the test-suffix to an inflected suffix field for each entry within a rules database, where each entry in said rules database further includes a base-suffix field, a model number field, a part of speech field, and a morphological feature field;
i) if no match is made in step (h) then setting s equal to s − 1 and returning to step (g);
j) identifying all model numbers from the model number field of the rules database that correspond to the inflected suffixes in the rules database that matched test-suffix in step (h);
k) identifying all base suffixes from the base-suffix field of the rules database that correspond to the model numbers identified in step (j);
l) combining stem with each base suffix identified in step (k) to create at least one test-lemma;
m) comparing the at least one test-lemma to a lemma field for each entry in a lexicon database where each entry in said lexicon database further includes a model number field, a part of speech field, a morphological feature field, a definition field, and an exception field;
n) if no match is found in step (m) then outputting a message to that effect, selecting the next unprocessed textual unit if any and returning to step (f), otherwise stopping;
o) identifying a model number for each lemma that matches the test lemma;
p) identifying each entry in the rules database that has a model number that matches one of the model numbers identified in step (o);
q) combining stem with each inflected suffix field of each entry identified in step (p) to form inflected forms of the textual unit;
r) outputting a user-definable subset of the result of step (q) and a user-definable subset of the corresponding entries in the rules database and the lexicon database; and
s) if there are any unprocessed textual units selecting an unprocessed textual unit and returning to step (f).
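Steps (f) through (q) can be read as a longest-suffix-first search followed by two table look-ups. The following Python sketch is one possible reading, not the patented implementation. The names `Rule`, `LexEntry`, `RULES`, `LEXICON`, and `inflected_forms` are hypothetical, the toy English entries are invented for illustration, and stopping the loop once s drops below zero is an assumption, since the claim does not state a termination condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One rules-database entry (field names are assumptions)."""
    inflected_suffix: str
    base_suffix: str
    model: int
    pos: str
    features: str


@dataclass(frozen=True)
class LexEntry:
    """One lexicon-database entry (field names are assumptions)."""
    lemma: str
    model: int
    pos: str
    features: str
    definition: str
    exception: str = ""


# Toy data, invented for illustration only.
RULES = [
    Rule("", "", 1, "noun", "singular"),
    Rule("s", "", 1, "noun", "plural"),
    Rule("e", "e", 2, "verb", "base form"),
    Rule("es", "e", 2, "verb", "third person singular"),
    Rule("ed", "e", 2, "verb", "past"),
    Rule("ing", "e", 2, "verb", "present participle"),
]
LEXICON = [
    LexEntry("cat", 1, "noun", "common", "a small domesticated feline"),
    LexEntry("bake", 2, "verb", "regular", "to cook with dry heat"),
]


def inflected_forms(unit, rules=RULES, lexicon=LEXICON):
    """Return the inflected forms of `unit`, or [] if nothing matches."""
    n = len(unit)            # step (f): n sub-textual units (characters here)
    s = n
    matched = []
    while s >= 0:            # assumption: give up once s drops below zero
        test_suffix = unit[n - s:]   # step (g): rightmost s characters
        stem = unit[:n - s]          # step (g): leftmost n - s characters
        matched = [r for r in rules if r.inflected_suffix == test_suffix]  # (h)
        if matched:
            break
        s -= 1               # step (i): shorten the test-suffix and retry
    if not matched:
        return []
    models = {r.model for r in matched}                                  # (j)
    base_suffixes = {r.base_suffix for r in rules if r.model in models}  # (k)
    test_lemmas = {stem + b for b in base_suffixes}                      # (l)
    hits = [e for e in lexicon if e.lemma in test_lemmas]                # (m)
    if not hits:
        return []            # step (n): the caller would report "no match"
    lemma_models = {e.model for e in hits}                               # (o)
    selected = [r for r in rules if r.model in lemma_models]             # (p)
    return sorted({stem + r.inflected_suffix for r in selected})         # (q)


print(inflected_forms("baked"))
```

With the toy data, "baked" is split into the stem "bak" and the test-suffix "ed", which leads to the lemma "bake" and then to every inflected suffix that shares its model number. Steps (r) and (s), which output a user-definable subset and loop over the remaining units, are left to the caller.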
Abstract
A method of text processing by receiving textual units. Then, select a language and a textual unit. Identify the selected textual unit's stem and suffix. Search a rules database for the suffix. If a base suffix is found in the rules database, combine it with the stem to form a lemma. Search a lexicon database for the lemma. If the lemma is found, a model number from the lexicon database is retrieved and cross-referenced with the rules database to obtain all inflected suffixes for the selected textual unit. Combine the inflected suffixes with the stem to form inflected forms. Output a subset of inflected-forms and information associated with the lemma and inflected suffixes. The method is repeated for unprocessed textual units. If the language selected is Russian or Somali, the textual units are processed separately.
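The separate handling of Russian and Somali amounts to a dispatch on the selected language: those languages go through a look-up table, while all others go through the rules and lexicon databases. The sketch below illustrates that dispatch together with delimiter insertion, under stated assumptions. `LOOKUP_LANGUAGES`, `LOOKUP_TABLES`, `delimit`, and `process` are hypothetical names, the two Russian entries are toy data, and the rule-based analyzer is passed in as a callable rather than defined here.

```python
from typing import Callable

# Languages handled by direct look-up; a user-definable language could be
# added to this set (assumption about how that option would be wired in).
LOOKUP_LANGUAGES = {"russian", "somali"}

# Toy look-up table, invented for illustration only.
LOOKUP_TABLES = {
    "russian": {
        "книга": "noun, feminine, nominative singular",
        "книгу": "noun, feminine, accusative singular",
    },
}


def delimit(units: list[str], delimiter: str = " ") -> str:
    """Insert a delimiter between adjacent textual units."""
    return delimiter.join(units)


def process(text: str, language: str,
            rule_based: Callable[[str], object],
            delimiter: str = " ") -> dict:
    """Analyze each delimited unit by look-up table or by the rule-based path."""
    results = {}
    for unit in text.split(delimiter):
        if language in LOOKUP_LANGUAGES:
            # Look-up path: return the table entry for the unit, if any.
            table = LOOKUP_TABLES.get(language, {})
            results[unit] = table.get(unit, "not found")
        else:
            # Rule-based path: defer to the supplied analyzer.
            results[unit] = rule_based(unit)
    return results


text = delimit(["книга", "книгу"])
print(process(text, "russian", rule_based=lambda unit: []))
```

Passing the rule-based analyzer as a parameter keeps the dispatch independent of how the rules and lexicon databases are stored.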
16 Claims
Specification