Method of text processing

US 7,409,334 B1
Filed: 07/22/2004
Issued: 08/05/2008
Est. Priority Date: 07/22/2004
Status: Active Grant

Abstract

A method of text processing by receiving textual units. Then, select a language and a textual unit. Identify the selected textual unit's stem and suffix. Search a rules database for the suffix. If a base suffix is found in the rules database, combine it with the stem to form a lemma. Search a lexicon database for the lemma. If the lemma is found, a model number from the lexicon database is retrieved and cross-referenced with the rules database to obtain all inflected suffixes for the selected textual unit. Combine the inflected suffixes with the stem to form inflected forms. Output a subset of inflected-forms and information associated with the lemma and inflected suffixes. The method is repeated for unprocessed textual units. If the language selected is Russian or Somali, the textual units are processed separately.
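
The flow described in the abstract can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the `Rule` and `LexEntry` records, the `analyze` function, and the toy English data are hypothetical, and the look-up-table branch for Russian or Somali is left out. The sketch tries the longest suffix first, as in steps (f) through (i) of claim 1, and it assumes the search ends once the empty suffix has been tried, a case the claim does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One rules-database entry (field names are assumptions)."""
    inflected_suffix: str
    base_suffix: str
    model: int
    pos: str
    features: str


@dataclass(frozen=True)
class LexEntry:
    """One lexicon-database entry (field names are assumptions)."""
    lemma: str
    model: int
    pos: str
    definition: str


# Hypothetical toy data, invented for illustration only.
RULES = [
    Rule("", "", 1, "noun", "singular"),
    Rule("s", "", 1, "noun", "plural"),
    Rule("y", "y", 2, "noun", "singular"),
    Rule("ies", "y", 2, "noun", "plural"),
]
LEXICON = [
    LexEntry("cat", 1, "noun", "a small feline"),
    LexEntry("pony", 2, "noun", "a small horse"),
]


def analyze(word, rules=RULES, lexicon=LEXICON):
    """Return the stem, lemmas, and inflected forms for word, or None."""
    n = len(word)
    # Steps (f)-(i): start with s = n and shrink the test suffix until it
    # matches an inflected-suffix field in the rules database.
    for s in range(n, -1, -1):
        stem, test_suffix = word[:n - s], word[n - s:]
        matched = [r for r in rules if r.inflected_suffix == test_suffix]
        if matched:
            break
    else:
        return None  # assumption: give up when no suffix matches at all
    # Steps (j)-(l): model numbers -> base suffixes -> candidate lemmas.
    models = {r.model for r in matched}
    base_suffixes = {r.base_suffix for r in rules if r.model in models}
    test_lemmas = {stem + base for base in base_suffixes}
    # Steps (m)-(n): look the candidate lemmas up in the lexicon.
    hits = [e for e in lexicon if e.lemma in test_lemmas]
    if not hits:
        return None
    # Steps (o)-(q): rebuild every inflected form for the matching models.
    hit_models = {e.model for e in hits}
    forms = sorted({stem + r.inflected_suffix
                    for r in rules if r.model in hit_models})
    return {"stem": stem,
            "lemmas": sorted(e.lemma for e in hits),
            "forms": forms}
```

With this toy data, `analyze("ponies")` strips `ies`, rebuilds the lemma `pony`, and returns both `pony` and `ponies` as inflected forms; a word absent from the toy lexicon, such as `dog`, yields `None`.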

16 Claims

1. A method of text processing, comprising the steps of:
   - a) receiving at least one textual unit, where each textual unit includes at least one sub-textual unit;
   - b) selecting a language;
   - c) if a plurality of textual units are received, and said plurality of textual units do not include delimiters, then inserting delimiters between adjacent textual units;
   - d) selecting one of said delimited textual units;
   - e) if the selected language is selected from the group of languages consisting of Russian, Somali, user-definable language, then;
     - i) getting a morphology of the selected textual unit from a corresponding look-up table;
     - ii) outputting the corresponding input in the look-up table; and
     - iii) if there are any unprocessed textual units from the textual units received in step (a), selecting one of said unprocessed textual units and returning to substep (i), otherwise stopping;
   - f) setting a value n equal to the total number of sub-textual units in the selected textual unit and setting a value s equal to n;
   - g) setting a test-suffix equal to the rightmost s sub-textual units in said selected textual unit and setting stem equal to n−s leftmost sub-textual units in said selected textual unit;
   - h) comparing the test-suffix to an inflected suffix field for each entry within a rules database, where each entry in said rules database further includes a base-suffix field, a model number field, a part of speech field, and a morphological feature field;
   - i) if no match is made in step (h) then setting s equal to s−1 and returning to step (g);
   - j) identifying all model numbers from the model number field of the rules database that correspond to the inflected suffixes in the rules database that matched test-suffix in step (h);
   - k) identifying all base suffixes from the base-suffix field of the rules database that correspond to the model numbers identified in step (j);
   - l) combining stem with each base suffix identified in step (k) to create at least one test-lemma;
   - m) comparing the at least one test-lemma to a lemma field for each entry in a lexicon database where each entry in said lexicon database further includes a model number field, a part of speech field, a morphological feature field, a definition field, and an exception field;
   - n) if no match is found in step (m) then outputting a message to that effect, selecting the next unprocessed textual unit if any and returning to step (f), otherwise stopping;
   - o) identifying a model number for each lemma that matches the test lemma;
   - p) identifying each entry in the rules database that has a model number that matches one of the model numbers identified in step (o);
   - q) combining stem with each inflected suffix field of each entry identified in step (p) to form inflected forms of the textual unit;
   - r) outputting a user-definable subset of the result of step (q) and a user-definable subset of the corresponding entries in the rules database and the lexicon database; and
   - s) if there are any unprocessed textual units selecting an unprocessed textual unit and returning to step (f).
2. The method of claim 1, further including the step of modifying stem if there is an entry in the exception field associated with the lemma in the lexicon database with the corresponding model number, the modification to the stem being in accordance with the entry in the exception field.
3. The method of claim 2, wherein each textual unit is selected from the group of textual units consisting of a word, character, and a symbol.
4. The method of claim 3, wherein each sub-textual unit is selected from the group of sub-textual units consisting of letters and symbols.
5. The method of claim 4, wherein the step of outputting a user-definable subset of the result of step (q) further comprises outputting a user-definable subset of the result of step (q) and a user-definable subset of the corresponding entries in the rules database and the lexicon database in a format selected from the group of formats consisting of a table and a narrative.
6. The method of claim 5, wherein the step of outputting a user-definable subset of the result of step (q) further comprises outputting a user-definable subset of the result of step (q) and a user-definable subset of the corresponding entries in the rules database and the lexicon database in a format selected from the group of formats consisting of Romanized format, UNICODE, XML, HTML, ASCII, and plaintext.
7. The method of claim 6, further including the step of outputting the result of each step of steps (b)-(r) of the method.
8. The method of claim 7, further including the step of identifying the inflected forms that are identical to the textual unit selected in step (d).
9. The method of claim 8, wherein the step of outputting a user-definable subset of the result of step (q) further comprises outputting only at least one inflected form that is identical to the textual unit selected in step (d) and all fields associated with said inflected suffixes included in said at least one inflected form that is identical to said received textual unit in both of said lexicon database and said rules database.
10. The method of claim 1, wherein each textual unit is selected from the group of textual units consisting of a word, character, and a symbol.
11. The method of claim 1, wherein each sub-textual unit is selected from the group of sub-textual units consisting of letters and symbols.
12. The method of claim 1, wherein the step of outputting a user-definable subset of the result of step (q) further comprises outputting a user-definable subset of the result of step (q) and a user-definable subset of the corresponding entries in the rules database and the lexicon database in a format selected from the group of formats consisting of a table and a narrative.
13. The method of claim 1, wherein the step of outputting a user-definable subset of the result of step (q) further comprises outputting a user-definable subset of the result of step (q) and a user-definable subset of the corresponding entries in the rules database and the lexicon database in a format selected from the group of formats consisting of Romanized format, UNICODE, XML, HTML, ASCII, and plaintext.
14. The method of claim 1, further including the step of outputting the result of each step of steps (b)-(r) of the method.
15. The method of claim 1, further including the step of identifying the inflected forms that are identical to the textual unit selected in step (d).
16. The method of claim 15, wherein the step of outputting a user-definable subset of the result of step (q) further comprises outputting only at least one inflected form that is identical to the textual unit selected in step (d) and all fields associated with said inflected suffixes included in said at least one inflected form that is identical to said received textual unit in both of said lexicon database and said rules database.
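
Claims 12, 15, and 16 narrow the output to the inflected forms that are identical to the selected textual unit and allow a table as the output format. A rough Python sketch of that filtering and formatting step follows. It assumes that step (q) has already produced one tuple per inflected form; the tuple layout, the function names `matching_forms` and `as_table`, the column headers, and the sample rows are all hypothetical and are not taken from the patent.

```python
def matching_forms(unit, candidates):
    """Keep only the candidates whose inflected form equals the input unit."""
    return [c for c in candidates if c[0] == unit]


def as_table(rows, headers=("form", "lemma", "part of speech", "features")):
    """Render rows as a plain-text table with columns padded to equal width."""
    table = [headers, *rows]
    widths = [max(len(str(r[i])) for r in table) for i in range(len(headers))]
    return "\n".join(
        " | ".join(str(cell).ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    )


# Hypothetical result of step (q): (inflected form, lemma, part of speech, features).
candidates = [
    ("pony", "pony", "noun", "singular"),
    ("ponies", "pony", "noun", "plural"),
]

print(as_table(matching_forms("ponies", candidates)))
```

Keeping the filter separate from the renderer mirrors the claim structure: claim 15 only identifies the identical forms, while claims 12 and 16 concern what is output and in which format.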

Current Assignee: National Security Agency
Original Assignee: The United States of America As Represented By The Director National Security Agency
Inventors: Shoemaker, James Edward
Primary Examiner(s): Smits; Talivaldis Ivars
Assistant Examiner(s): Kovacek; David
Application Number: US10/896,803
Time in Patent Office: 1,475 Days
Field of Search: 704/1-10, 704/258-260, 704/270, 704/276-277, 704/E15.003-E15.006, 704/E11.011, 704/E13.001-E13.014, 341/1-10
US Class Current: 704/8
CPC Class Codes: G06F 40/268 Morphological analysis
