Document-based query data for information retrieval

US 6,396,951 B1
Filed: 12/23/1998
Issued: 05/28/2002
Est. Priority Date: 12/29/1997
Status: Expired due to Term

Abstract

To obtain a query for use in information retrieval, a document is scanned. The resulting text image data define an image of a segment of text in a first language. Automatic recognition is then performed on at least part of the text image data to obtain text code data including a series of element codes. Each element code indicates an element that occurs in the first language, and the series of element codes defines a set of expressions that also occur in the first language. Automatic translation is then performed on a version of the text code data to obtain translation data indicating a set of counterpart expressions in a second language. The counterpart expressions are used to automatically obtain query data defining the query. The query can then be provided to an information retrieval engine.
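As a rough illustration of the last step in the abstract, where counterpart expressions are turned into query data, the sketch below joins translation alternatives into a Boolean query string. The function name `build_query`, the list-of-lists input, and the `AND`/`OR` syntax are assumptions made for this example; the patent does not prescribe a query format.

```python
def build_query(counterparts):
    """Combine translation alternatives into one Boolean query string.

    counterparts: list of lists; each inner list holds the second-language
    candidates for one first-language expression (hypothetical structure).
    """
    groups = []
    for alternatives in counterparts:
        # Drop duplicate candidates while keeping first-seen order.
        unique = list(dict.fromkeys(alternatives))
        if not unique:
            continue  # this expression had no counterpart, so skip it
        # Quote multi-word candidates so they are treated as phrases.
        quoted = [f'"{t}"' if " " in t else t for t in unique]
        if len(quoted) == 1:
            groups.append(quoted[0])
        else:
            # Alternatives for the same source expression are OR-ed together.
            groups.append("(" + " OR ".join(quoted) + ")")
    # Different source expressions are AND-ed together.
    return " AND ".join(groups)


query = build_query([["retrieval", "recovery"], ["information"], []])
print(query)  # (retrieval OR recovery) AND information
```

OR-ing the alternatives keeps every candidate translation in play, which is one common way to cope with ambiguous dictionary entries; a real retrieval engine may expect a different syntax.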

14 Claims

1. A method of using documents with text to obtain data for use in information retrieval, the method comprising:
- (A) scanning a document that includes text in a first language to obtain text image data defining an image of a segment of the text;
  
  (B) performing automatic recognition on at least part of the text image data to obtain text code data, the text code data including a series of element codes, each indicating an element that occurs in the first language, the series of element codes defining a first set of expressions, each of which occurs in the first language;
  
  (C) performing automatic translation on a version of the text code data to obtain translation data, the translation data indicating a second set of expressions, each of the second set of expressions being a counterpart in the second language of one or more of the first set of expressions, wherein performing automatic translation further comprises:
  
  (C1) using the version of the text code data to access a translation dictionary with each of the first set of expressions, the translation dictionary providing the translation data, such that the series of element codes define a first set of words that occur in the first language, and wherein (C1) further comprises:
  
  (C1a) tokenizing the text code data to obtain token data indicating tokens that occur in the sequence of element codes, the tokens including the first set of words;
  
  (C1b) disambiguating the token data to obtain disambiguated data, the disambiguated data including, for each of the first set of words, a part-of-speech indicator indicating the word's part of speech;
  
  (C1c) lemmatizing the disambiguated data to obtain lemmatized data, the lemmatized data indicating, for each of the first set of words, either the word or a lemma for the word; and
  
  (C1d) translating the words and lemmas indicated by the lemmatized data to obtain the translation data, the translation data indicating possible counterparts in the second language for a subset of the words and lemmas indicated by the lemmatized data; and
  
  (D) using the second set of expressions to automatically obtain query data defining a query for use in retrieving a list of documents.
2. The method of claim 1 in which the document includes manual markings indicating the segment of the text and in which (A) comprises:
3. The method of claim 1 in which (B) comprises:
- performing optical character recognition on at least part of the text image data;
  
  the element codes including character codes indicating characters that occur in the first language.
4. The method of claim 3 in which (B) further comprises:
- performing automatic language identification to obtain a language identifier indicating a candidate language that is likely to be the predominant language of the segment of the text;
  
  the optical character recognition being specific to the candidate language.
5. The method of claim 3, further comprising, after (B):
- presenting the elements indicated by the series of element codes to a user;
  
  receiving signals from the user indicating modifications of the presented elements; and
  
  modifying the series of element codes in accordance with the signals from the user to obtain the version of the text code data on which automatic translation is performed.
6. The method of claim 1 in which (C1d) comprises looking up the words and lemmas indicated by the lemmatized data in a bilingual translation dictionary to obtain counterparts in the second language.
7. The method of claim 1 in which the query data define the query in a format suitable for an information retrieval engine;
- the method further comprising:
  
  (E) providing the query data to the information retrieval engine.
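Steps (C1a) through (C1d) of claim 1 can be sketched as a small pipeline. This is a toy illustration under stated assumptions: the French lexicon, the bilingual dictionary, the part-of-speech tags, and the single "prefer a noun after a determiner" rule are all invented for the example and stand in for whatever tokenizer, tagger, lemmatizer, and dictionary a real system would use.

```python
import re

# Toy French resources; every entry is illustrative, not taken from the patent.
# surface form -> list of (part of speech, lemma) readings, first one is the default
LEXICON = {
    "les": [("DET", "le")],
    "portes": [("VERB", "porter"), ("NOUN", "porte")],
    "fermées": [("ADJ", "fermé")],
}
# (lemma, part of speech) -> candidate English counterparts
BILINGUAL = {
    ("porte", "NOUN"): ["door", "gate"],
    ("porter", "VERB"): ["carry", "wear"],
    ("fermé", "ADJ"): ["closed"],
}


def tokenize(text):
    # (C1a) split the recognized character sequence into word tokens
    return re.findall(r"\w+", text.lower())


def disambiguate(tokens):
    # (C1b) choose one part of speech per token using one crude rule:
    # directly after a determiner, prefer the noun reading if there is one
    tagged, prev_pos = [], None
    for tok in tokens:
        readings = LEXICON.get(tok, [("UNK", tok)])
        choice = readings[0]
        if prev_pos == "DET":
            choice = next((r for r in readings if r[0] == "NOUN"), readings[0])
        tagged.append((tok, choice[0]))
        prev_pos = choice[0]
    return tagged


def lemmatize(tagged):
    # (C1c) map each word to the lemma of its chosen reading, else keep the word
    result = []
    for tok, pos in tagged:
        lemma = next((l for p, l in LEXICON.get(tok, []) if p == pos), tok)
        result.append((lemma, pos))
    return result


def translate(lemmas):
    # (C1d) look up counterparts; only a subset of lemmas has dictionary entries
    return {lemma: BILINGUAL[(lemma, pos)]
            for lemma, pos in lemmas if (lemma, pos) in BILINGUAL}


tagged = disambiguate(tokenize("Les portes fermées"))
print(translate(lemmatize(tagged)))
```

In this example the determiner "les" steers "portes" to its noun reading, so the lookup returns counterparts for "porte" (door, gate) rather than "porter" (carry, wear), and "le" is dropped because the toy dictionary has no entry for it.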

8. A system for using documents with text to obtain data for use in information retrieval, the system comprising:
- a scanning device for scanning documents and providing image data;
  
  a processor connected for receiving image data from the scanning device, after receiving text image data defining an image of a segment of text in a first language from a document scanned by the scanning device, the processor operating to:
  
  (A) perform automatic recognition on at least part of the text image data to obtain text code data, the text code data including a series of element codes, each indicating an element that occurs in the first language, the series of element codes defining a first set of expressions, each of which occurs in the first language;
  
  (B) perform automatic translation on a version of the text code data to obtain translation data, the translation data indicating a second set of expressions, each of the second set of expressions being a counterpart in the second language of one or more of the first set of expressions,
  
  (B1) wherein during the automatic translation, the processor uses the version of the text code data to access a translation dictionary with each of the first set of expressions, the translation dictionary providing the translation data, such that the sequence of element codes define a first set of words that occur in the first language, and wherein the processor in (B1) further operates to:
  
  (B1a) tokenize the text code data to obtain token data indicating tokens that occur in the sequence of element codes, the tokens including the first set of words;
  
  (B1b) disambiguate the token data to obtain disambiguated data, the disambiguated data including, for each of the first set of words, a part-of-speech indicator indicating the word's part of speech;
  
  (B1c) lemmatize the disambiguated data to obtain lemmatized data, the lemmatized data indicating, for each of the first set of words, either the word or a lemma for the word; and
  
  (B1d) translate the words and lemmas indicated by the lemmatized data to obtain the translation data, the translation data indicating possible counterparts in the second language for a subset of the words and lemmas indicated by the lemmatized data; and
  
  (C) use the second set of expressions to automatically obtain query data defining a query for use in retrieving a list of documents.
9. The system of claim 8, wherein the document includes manual markings indicating the segment of the text and in which the processor further operates to use the document image data provided by the scanning device to obtain text image data by extracting the segment indicated by the manual markings.
10. The system of claim 8 in which the processor operates in (A) to further perform optical character recognition on at least part of the text image data;
- the element codes including character codes indicating characters that occur in the first language.
11. The system of claim 10 in which the processor operates in (A) to further perform automatic language identification to obtain a language identifier indicating a candidate language that is likely to be the predominant language of the segment of the text;
- the optical character recognition being specific to the candidate language.
12. The system of claim 10, further comprising, after processing (A), the processor operates to;
13. The system of claim 8 in which the processor in (B1d) operates to look up the words and lemmas indicated by the lemmatized data in a bilingual translation dictionary to obtain counterparts in the second language.
14. The system of claim 8, wherein the query data define the query in a format suitable for an information retrieval engine.
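Claims 4 and 11 recite automatic language identification that yields a candidate language for language-specific character recognition. The claims do not say how the identification is done; the sketch below uses a simple function-word count, which is a swapped-in technique chosen for brevity. It assumes a rough first-pass recognition result is available as text, and the word lists, language codes, and the name `identify_language` are hypothetical.

```python
# Small function-word lists per language; illustrative only.
STOPWORDS = {
    "en": {"the", "of", "and", "to", "in", "is"},
    "fr": {"le", "la", "les", "de", "et", "des", "est"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
}


def identify_language(text, default="en"):
    """Return the language whose function words occur most often in text."""
    words = text.lower().split()
    # Count how many tokens are function words of each language.
    scores = {lang: sum(w in vocab for w in words)
              for lang, vocab in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default when no function word matched at all.
    return best if scores[best] > 0 else default


candidate = identify_language("le chat est sur la table")
print(candidate)  # fr
```

The returned identifier could then select a language-specific recognizer or dictionary for a second recognition pass; how that selection works is outside what this sketch covers.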

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Grefenstette, Gregory
Primary Examiner(s): Boudreau, Leo
Assistant Examiner(s): Mariam, Daniel G.
Application Number: US09/218,357
Time in Patent Office: 1,252 Days
Field of Search: 382/181,186,187,190,305,185,229,231 704/2,3,9,10,5,7 707/3,4,1,5 358/403
US Class Current: 382/187
CPC Class Codes:
G06F 16/5846   using extracted text
G06F 40/211   Syntactic parsing, e.g. bas...
G06F 40/268   Morphological analysis
G06F 40/55   Rule-based translation
G06V 30/10   Character recognition
G06V 30/262   using context analysis, e.g...
H04N 1/00204   with a digital computer or ...
H04N 1/00241   using an image reading devi...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
