Document-based query data for information retrieval

  • US 6,396,951 B1
  • Filed: 12/23/1998
  • Issued: 05/28/2002
  • Est. Priority Date: 12/29/1997
  • Status: Expired due to Term
First Claim

1. A method of using documents with text to obtain data for use in information retrieval, the method comprising:

  • (A) scanning a document that includes text in a first language to obtain text image data defining an image of a segment of the text;

    (B) performing automatic recognition on at least part of the text image data to obtain text code data, the text code data including a series of element codes, each indicating an element that occurs in the first language, the series of element codes defining a first set of expressions, each of which occurs in the first language;

    (C) performing automatic translation on a version of the text code data to obtain translation data, the translation data indicating a second set of expressions, each of the second set of expressions being a counterpart in the second language of one or more of the first set of expressions, wherein performing automatic translation further comprises:

    (C1) using the version of the text code data to access a translation dictionary with each of the first set of expressions, the translation dictionary providing the translation data, such that the series of element codes define a first set of words that occur in the first language, and wherein (C1) further comprises:

    (C1a) tokenizing the text code data to obtain token data indicating tokens that occur in the sequence of element codes, the tokens including the first set of words;

    (C1b) disambiguating the token data to obtain disambiguated data, the disambiguated data including, for each of the first set of words, a part-of-speech indicator indicating the word's part of speech;

    (C1c) lemmatizing the disambiguated data to obtain lemmatized data, the lemmatized data indicating, for each of the first set of words, either the word or a lemma for the word; and

    (C1d) translating the words and lemmas indicated by the lemmatized data to obtain the translation data, the translation data indicating possible counterparts in the second language for a subset of the words and lemmas indicated by the lemmatized data; and

    (D) using the second set of expressions to automatically obtain query data defining a query for use in retrieving a list of documents.
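
The text-processing steps (C1a) through (D) can be illustrated with a minimal sketch. It is not the patented implementation: the scanning and recognition steps (A) and (B) are omitted, the lexicons `POS_LEXICON`, `LEMMAS`, and `EN_FR` are hypothetical toy data, the disambiguation rule is a deliberately simple stand-in, and the Boolean AND/OR query syntax is an assumption, since the claim does not specify a query format.

```python
import re

# Hypothetical toy resources; a real system would use full lexicons.
POS_LEXICON = {
    "the": ["DET"], "a": ["DET"],
    "cats": ["NOUN"], "dog": ["NOUN"],
    "saw": ["VERB", "NOUN"],  # ambiguous between verb and noun
}
LEMMAS = {("cats", "NOUN"): "cat", ("saw", "VERB"): "see"}
EN_FR = {
    ("cat", "NOUN"): ["chat"],
    ("dog", "NOUN"): ["chien"],
    ("see", "VERB"): ["voir", "apercevoir"],
    ("saw", "NOUN"): ["scie"],
}

def tokenize(text):
    """(C1a) Split recognized text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def disambiguate(tokens):
    """(C1b) Attach one part-of-speech tag per token.

    Assumed rule: after a determiner prefer NOUN, otherwise take the
    first tag listed in the lexicon; unknown words default to NOUN.
    """
    tagged, prev = [], None
    for tok in tokens:
        candidates = POS_LEXICON.get(tok, ["NOUN"])
        tag = "NOUN" if prev == "DET" and "NOUN" in candidates else candidates[0]
        tagged.append((tok, tag))
        prev = tag
    return tagged

def lemmatize(tagged):
    """(C1c) Replace each word by its lemma when one is known, else keep the word."""
    return [(LEMMAS.get((word, tag), word), tag) for word, tag in tagged]

def translate(lemmas):
    """(C1d) Look up counterparts; entries missing from the dictionary are skipped,
    so only a subset of the words and lemmas receives translations."""
    groups = [EN_FR.get(pair, []) for pair in lemmas]
    return [g for g in groups if g]

def build_query(groups):
    """(D) Assumed query form: OR within alternatives, AND across terms."""
    return " AND ".join("(" + " OR ".join(g) + ")" for g in groups)

def text_to_query(text):
    return build_query(translate(lemmatize(disambiguate(tokenize(text)))))

if __name__ == "__main__":
    print(text_to_query("The cats saw a dog"))
    print(text_to_query("the saw"))
```

In this sketch the part-of-speech tag decides which dictionary entry is used, which is why "saw" yields the verb counterparts in the first sentence and the noun counterpart in the second.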
