Document-based query data for information retrieval
Abstract
To obtain a query for use in information retrieval, a document is scanned. The resulting text image data define an image of a segment of text in a first language. Automatic recognition is then performed on at least part of the text image data to obtain text code data including a series of element codes. Each element code indicates an element that occurs in the first language, and the series of element codes defines a set of expressions that also occur in the first language. Automatic translation is then performed on a version of the text code data to obtain translation data indicating a set of counterpart expressions in a second language. The counterpart expressions are used to automatically obtain query data defining the query. The query can then be provided to an information retrieval engine.
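The translation and query-building stages described above can be illustrated with a minimal sketch. It starts from already-recognized text (the text code data) and is not the patented implementation: the tiny `POS`, `LEMMAS`, and `FR_EN` tables, the function names, and the Boolean `AND`/`OR` query syntax are all assumptions made for illustration.

```python
import re

# Hypothetical toy resources; a real system would use full lexicons and a tagger.
POS = {"les": "DET", "chats": "NOUN", "noirs": "ADJ", "dorment": "VERB"}
LEMMAS = {("chats", "NOUN"): "chat", ("noirs", "ADJ"): "noir",
          ("dorment", "VERB"): "dormir"}
FR_EN = {("chat", "NOUN"): ["cat"], ("noir", "ADJ"): ["black", "dark"],
         ("dormir", "VERB"): ["sleep"]}


def tokenize(text):
    """Split recognized text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def disambiguate(tokens):
    """Attach a part-of-speech tag to each token (table lookup stands in for a tagger)."""
    return [(tok, POS.get(tok, "UNK")) for tok in tokens]


def lemmatize(tagged):
    """Replace each word by its lemma when one is known, else keep the word."""
    return [(LEMMAS.get((word, pos), word), pos) for word, pos in tagged]


def translate(lemmas):
    """Look up counterparts; entries missing from the dictionary are skipped."""
    groups = []
    for lemma, pos in lemmas:
        counterparts = FR_EN.get((lemma, pos))
        if counterparts:
            groups.append(counterparts)
    return groups


def build_query(groups):
    """Join alternative counterparts with OR and distinct terms with AND."""
    return " AND ".join("(" + " OR ".join(g) + ")" for g in groups)


def text_to_query(text):
    return build_query(translate(lemmatize(disambiguate(tokenize(text)))))


print(text_to_query("Les chats noirs dorment"))
# (cat) AND (black OR dark) AND (sleep)
```

Note that the determiner "les" has no dictionary entry and is dropped, so counterparts are obtained only for a subset of the words and lemmas.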
14 Claims
1. A method of using documents with text to obtain data for use in information retrieval, the method comprising:
(A) scanning a document that includes text in a first language to obtain text image data defining an image of a segment of the text;
(B) performing automatic recognition on at least part of the text image data to obtain text code data, the text code data including a series of element codes, each indicating an element that occurs in the first language, the series of element codes defining a first set of expressions, each of which occurs in the first language;
(C) performing automatic translation on a version of the text code data to obtain translation data, the translation data indicating a second set of expressions, each of the second set of expressions being a counterpart in the second language of one or more of the first set of expressions, wherein performing automatic translation further comprises:
(C1) using the version of the text code data to access a translation dictionary with each of the first set of expressions, the translation dictionary providing the translation data, such that the series of element codes define a first set of words that occur in the first language, and wherein (C1) further comprises:
(C1a) tokenizing the text code data to obtain token data indicating tokens that occur in the sequence of element codes, the tokens including the first set of words;
(C1b) disambiguating the token data to obtain disambiguated data, the disambiguated data including, for each of the first set of words, a part-of-speech indicator indicating the word's part of speech;
(C1c) lemmatizing the disambiguated data to obtain lemmatized data, the lemmatized data indicating, for each of the first set of words, either the word or a lemma for the word; and
(C1d) translating the words and lemmas indicated by the lemmatized data to obtain the translation data, the translation data indicating possible counterparts in the second language for a subset of the words and lemmas indicated by the lemmatized data; and
(D) using the second set of expressions to automatically obtain query data defining a query for use in retrieving a list of documents.
scanning the document to obtain document image data defining an image of the document including the text; and
using the document image data to obtain the text image data by extracting the segment indicated by the manual markings.
3. The method of claim 1 in which (B) comprises:
performing optical character recognition on at least part of the text image data;
the element codes including character codes indicating characters that occur in the first language.
4. The method of claim 3 in which (B) further comprises:
performing automatic language identification to obtain a language identifier indicating a candidate language that is likely to be the predominant language of the segment of the text;
the optical character recognition being specific to the candidate language.
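One simple way to picture the language identification step is a stopword-overlap count, sketched below. This is an assumed technique chosen for illustration, not one stated in the claim; the `STOPWORDS` table and the `identify_language` name are hypothetical, and the input is assumed to be a rough, language-independent first-pass reading of the segment.

```python
# Hypothetical, deliberately tiny stopword lists for three candidate languages.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is"},
    "fr": {"le", "la", "les", "et", "de", "des", "est"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
}


def identify_language(words):
    """Return the candidate language whose stopwords match most often, or None."""
    counts = {
        lang: sum(word.lower() in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


candidate = identify_language("Les chats et les chiens de la ville".split())
print(candidate)  # fr
```

The returned identifier could then select a language-specific recognition model or lexicon for the full recognition pass.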
5. The method of claim 3, further comprising, after (B):
presenting the elements indicated by the series of element codes to a user;
receiving signals from the user indicating modifications of the presented elements; and
modifying the series of element codes in accordance with the signals from the user to obtain the version of the text code data on which automatic translation is performed.
6. The method of claim 1 in which (C1d) comprises looking up the words and lemmas indicated by the lemmatized data in a bilingual translation dictionary to obtain counterparts in the second language.
7. The method of claim 1 in which the query data define the query in a format suitable for an information retrieval engine;
the method further comprising:
(E) providing the query data to the information retrieval engine.
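As a hedged illustration of putting query data into a format an engine can accept, the sketch below URL-encodes a query string for a hypothetical HTTP search interface. The parameter names `q` and `n` and the function `to_engine_request` are assumptions; the claim does not specify any particular engine or format.

```python
from urllib.parse import urlencode, parse_qs


def to_engine_request(query, max_results=10):
    """Encode the query and a result limit as a URL query string."""
    return urlencode({"q": query, "n": max_results})


encoded = to_engine_request("(cat) AND (black OR dark)")
# Decoding the string recovers the original query unchanged.
print(parse_qs(encoded)["q"][0])
```

The encoded string would be appended to whatever search URL the chosen engine exposes.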
8. A system for using documents with text to obtain data for use in information retrieval, the system comprising:
a scanning device for scanning documents and providing image data;
a processor connected for receiving image data from the scanning device, after receiving text image data defining an image of a segment of text in a first language from a document scanned by the scanning device, the processor operating to:
(A) perform automatic recognition on at least part of the text image data to obtain text code data, the text code data including a series of element codes, each indicating an element that occurs in the first language, the series of element codes defining a first set of expressions, each of which occurs in the first language;
(B) perform automatic translation on a version of the text code data to obtain translation data, the translation data indicating a second set of expressions, each of the second set of expressions being a counterpart in the second language of one or more of the first set of expressions,
(B1) wherein during the automatic translation, the processor uses the version of the text code data to access a translation dictionary with each of the first set of expressions, the translation dictionary providing the translation data, such that the sequence of element codes define a first set of words that occur in the first language, and wherein the processor in (B1) further operates to:
(B1a) tokenize the text code data to obtain token data indicating tokens that occur in the sequence of element codes, the tokens including the first set of words;
(B1b) disambiguate the token data to obtain disambiguated data, the disambiguated data including, for each of the first set of words, a part-of-speech indicator indicating the word's part of speech;
(B1c) lemmatize the disambiguated data to obtain lemmatized data, the lemmatized data indicating, for each of the first set of words, either the word or a lemma for the word; and
(B1d) translate the words and lemmas indicated by the lemmatized data to obtain the translation data, the translation data indicating possible counterparts in the second language for a subset of the words and lemmas indicated by the lemmatized data; and
(C) use the second set of expressions to automatically obtain query data defining a query for use in retrieving a list of documents.
present the elements indicated by the series of element codes to a user;
receive signals from the user indicating modifications of the presented elements; and
modify the series of element codes in accordance with the signals from the user to obtain the version of the text code data on which automatic translation is performed.
13. The system of claim 8 in which the processor in (B1d) operates to look up the words and lemmas indicated by the lemmatized data in a bilingual translation dictionary to obtain counterparts in the second language.
14. The system of claim 8, wherein the query data define the query in a format suitable for an information retrieval engine.
Specification