Mining transliterations for out-of-vocabulary query terms
First Claim
1. A method, implemented using electrical data processing functionality, for retrieving information, comprising:
receiving a query in a source language, the query having one or more query terms;
determining whether each of the query terms is present in a translation dictionary, the translation dictionary mapping terms from the source language to a target language, each query term that is present in the translation dictionary comprising an in-vocabulary term, and each query term that is not present in the translation dictionary comprising an out-of-vocabulary (OOV) term;
translating each in-vocabulary term to a translated term in the target language using the translation dictionary, to provide a set of one or more translated terms;
identifying at least one document that is selected from a collection of documents in the target language based on the set of translated terms, said at least one document including a plurality of candidate words in the target language;
performing mining analysis to attempt to extract a viable transliteration of each OOV term of the query from said at least one document by:
(a) selecting an OOV term in the query for analysis, to provide a selected OOV term;
(b) determining, after said selecting of the OOV term, whether the OOV term is a qualifying OOV term;
(c) selecting a candidate word in said at least one document, to provide a selected candidate word;
(d) determining, after said selecting of the candidate word, whether the selected candidate word is a qualifying candidate word;
(e) determining a transliteration measure between the selected candidate word and the selected OOV term, without first having generated a transliteration for the selected OOV term, said determining of the transliteration measure being performed when the selected candidate word is a qualifying candidate word and the selected OOV term is a qualifying OOV term;
(f) determining whether the selected candidate word is a viable transliteration of the selected OOV term based on the transliteration measure; and
performing operations (a) through (f) for each possible other pairing of an OOV term in the query and a candidate word in said at least one document;
updating the translation dictionary to include each viable transliteration identified by the mining analysis, to provide an updated translation dictionary and an updated set of translated terms for the query; and
repeating said identifying, performing the mining analysis, and updating, at least one time, said receiving, determining whether each of the query terms is present in a translation dictionary, translating, identifying, performing the mining analysis, updating, and repeating being performed by the electrical data processing functionality.
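Steps (a) through (f) of the mining analysis can be sketched as a nested loop over pairings of OOV terms and candidate words. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `mine_transliterations`, `toy_measure`, and `threshold` are hypothetical, the qualification tests are passed in as caller-supplied predicates because the claim does not define them, and the string-similarity score from Python's `difflib` merely stands in for whatever transliteration measure a real system would use. The toy data assumes both sides are written in Latin script.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Iterable, List


def toy_measure(oov_term: str, candidate: str) -> float:
    """Stand-in transliteration measure (an assumption, not the patent's model).

    Scores surface similarity between two strings in the range [0, 1].
    """
    return SequenceMatcher(None, oov_term.lower(), candidate.lower()).ratio()


def mine_transliterations(
    oov_terms: Iterable[str],
    candidate_words: Iterable[str],
    is_qualifying_oov: Callable[[str], bool],
    is_qualifying_candidate: Callable[[str, str], bool],
    measure: Callable[[str, str], float] = toy_measure,
    threshold: float = 0.7,  # arbitrary cut-off chosen for this sketch
) -> Dict[str, List[str]]:
    """Return, per OOV term, the candidate words judged viable transliterations."""
    candidates = list(candidate_words)
    viable: Dict[str, List[str]] = {}
    for oov in oov_terms:                                 # (a) select an OOV term
        if not is_qualifying_oov(oov):                    # (b) qualify the OOV term
            continue
        for cand in candidates:                           # (c) select a candidate word
            if not is_qualifying_candidate(cand, oov):    # (d) qualify the candidate
                continue
            score = measure(oov, cand)                    # (e) score the pair directly
            if score >= threshold:                        # (f) viability decision
                viable.setdefault(oov, []).append(cand)
    return viable


# Usage with toy data: "ainstain" is the OOV query term, and the candidate
# words come from one retrieved target-language document.
words = "theory of relativity by einstein".split()
found = mine_transliterations(
    ["ainstain", "ke"],
    words,
    is_qualifying_oov=lambda term: len(term) >= 3,
    is_qualifying_candidate=lambda cand, term: len(cand) >= 3,
)
```

Note that the pair is scored directly in step (e); no transliteration of the OOV term is generated beforehand, which mirrors the wording of the claim.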
2 Assignments
0 Petitions
Abstract
An approach is described for using a query expressed in a source language to retrieve information expressed in a target language. The approach uses a translation dictionary to convert terms in the query from the source language to appropriate terms in the target language. The approach determines viable transliterations for out-of-vocabulary (OOV) query terms by retrieving a body of information based on an in-vocabulary component of the query, and then mining the body of information to identify the viable transliterations for the OOV query terms. The approach then adds the viable transliterations to the translation dictionary. The retrieval, mining, and adding operations can be repeated one or more times.
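The repeated cycle of retrieval, mining, and adding can be illustrated with a small sketch that shows why a second pass may help: a transliteration added in the first pass widens retrieval in the next pass, which can surface a transliteration for a further OOV term. Everything here is an assumption made for illustration: the helper names `retrieve`, `mine`, and `translate_query`, the word-overlap retrieval, the `difflib` similarity score used in place of a real transliteration measure, the 0.7 threshold, and the Latin-script toy data.

```python
from difflib import SequenceMatcher


def retrieve(documents, translated_terms):
    """Toy retrieval: keep documents that share a word with the translated terms."""
    wanted = set(translated_terms)
    return [doc for doc in documents if wanted & set(doc.split())]


def mine(oov_terms, docs, threshold=0.7):
    """Toy mining: map each OOV term to the first sufficiently similar word."""
    found = {}
    for oov in oov_terms:
        for doc in docs:
            for word in doc.split():
                if SequenceMatcher(None, oov, word).ratio() >= threshold:
                    found.setdefault(oov, word)  # keep the first viable word
    return found


def translate_query(query_terms, dictionary, documents, rounds=2):
    """Run the retrieve, mine, add cycle up to `rounds` times."""
    dictionary = dict(dictionary)  # work on a copy of the caller's dictionary
    for _ in range(rounds):
        translated = [dictionary[t] for t in query_terms if t in dictionary]
        oov = [t for t in query_terms if t not in dictionary]
        if not oov:
            break
        new_entries = mine(oov, retrieve(documents, translated))
        if not new_entries:
            break  # nothing new was mined, so another pass would not change anything
        dictionary.update(new_entries)
    translated = [dictionary[t] for t in query_terms if t in dictionary]
    return translated, dictionary


documents = [
    "theory of relativity by einstein",
    "einstein debated bohr on quantum",
]
query = ["teoria", "ainstain", "bor"]
translated, updated = translate_query(query, {"teoria": "theory"}, documents)
```

In this toy run the first pass retrieves only the first document and mines "einstein" for "ainstain". The second pass then also retrieves the second document, where "bohr" is mined for "bor".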
242 Citations
20 Claims
-
1. A method, implemented using electrical data processing functionality, for retrieving information, as set out in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
-
-
10. A tangible computer-readable storage medium for storing computer-readable instructions, the computer-readable instructions providing an information retrieval system when executed by one or more processing devices, the computer-readable instructions performing a method comprising:
-
receiving a query in a source language, the query having one or more query terms;
determining whether each of the query terms is present in a translation dictionary, the translation dictionary mapping terms from the source language to a target language, each query term that is present in the translation dictionary comprising an in-vocabulary term, and each query term that is not present in the translation dictionary comprising an out-of-vocabulary (OOV) term;
translating each in-vocabulary term to a translated term in the target language using the translation dictionary, to provide a set of one or more translated terms;
identifying at least one document that is selected from a collection of documents in the target language based on the set of translated terms, said at least one document including a plurality of candidate words in the target language;
performing mining analysis to attempt to extract a viable transliteration of each OOV term of the query from said at least one document by:
(a) selecting an OOV term in the query for analysis, to provide a selected OOV term;
(b) determining, after said selecting of the OOV term, whether the OOV term is transliteratable;
(c) selecting a candidate word in said at least one document, to provide a selected candidate word;
(d) determining, after said selecting of the candidate word, whether the selected candidate word is transliteratable, and whether the selected candidate word has a length that differs from a length of the selected OOV term by no more than a prescribed number of characters;
(e) determining a transliteration measure between the selected candidate word and the selected OOV term, without first having generated a transliteration for the selected OOV term, said determining of the transliteration measure being performed when the selected OOV term and the selected candidate word are each transliteratable, and when the length of the selected OOV differs from the length of the selected OOV term by no more than a prescribed number of characters;
(f) determining whether the selected candidate word is a viable transliteration of the selected OOV term based on the transliteration measure; and
performing operations (a) through (f) for each possible other pairing of an OOV term in the query and a candidate word in said at least one document;
updating the translation dictionary to include each viable transliteration identified by the mining analysis, to provide an updated translation dictionary and an updated set of translated terms for the query; and
repeating said identifying, performing the mining analysis, and updating, at least one time.
Dependent claims: 11
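The qualification tests that claim 10 adds in steps (b) and (d) can be sketched as a cheap filter that runs before any transliteration measure is computed. The claim does not define "transliteratable" or fix the "prescribed number of characters", so the stop-word list, the minimum length, the alphabetic check, and the default `max_length_diff` below are all assumptions chosen for illustration, as are the function names. The length test follows step (d), which compares the candidate word's length with the OOV term's length.

```python
# Hypothetical stop-word list; a real system would use a per-language list.
STOP_WORDS = frozenset({"the", "of", "and", "in", "on", "by"})


def is_transliteratable(word: str, min_length: int = 3) -> bool:
    """Toy test: long enough, purely alphabetic, and not a stop word.

    str.isalpha() rejects combining marks, so scripts that rely on them
    would need a script-aware check instead of this one.
    """
    return (
        len(word) >= min_length
        and word.isalpha()
        and word.lower() not in STOP_WORDS
    )


def is_qualifying_pair(oov_term: str, candidate: str, max_length_diff: int = 2) -> bool:
    """Return True when the pair should go on to the transliteration measure."""
    if not is_transliteratable(oov_term):      # step (b)
        return False
    if not is_transliteratable(candidate):     # step (d), first test
        return False
    # step (d), second test: lengths differ by at most the prescribed amount
    return abs(len(candidate) - len(oov_term)) <= max_length_diff
```

Filtering on length and transliteratability first keeps the number of scored pairs small, which matters because every OOV term is otherwise paired with every word of every retrieved document.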
-
-
12. An information retrieval system, implemented by electrical data processing functionality, for retrieving information, comprising:
-
a data store that provides a translation dictionary for correlating terms in a source language to corresponding terms in a target language; and
a transliteration processing module configured to convert queries in the source language to respective counterparts in the target language, the queries encompassing a query that includes an in-vocabulary component and an out-of-vocabulary (OOV) component, the in-vocabulary component comprising at least one in-vocabulary term that is included in the translation dictionary, and the OOV component comprising at least one OOV term that is not included in the translation dictionary, the transliteration processing module comprising:
an in-vocabulary determination module configured to determine, using the translation dictionary, a translated term for each in-vocabulary term of the query, to provide a set of one or more translated terms;
a mining module configured to attempt to identify, within a body of information, a viable transliteration of each OOV term in the query, the body of information being identified based on the set of translated terms provided by the in-vocabulary determination module, wherein the body of information includes a plurality of candidate words in the target language, and wherein the mining module is configured to analyze each pairing of a selected OOV term in the query and a selected candidate word in the body of information without first having generated a transliteration for the selected OOV term and without consideration of whether the OOV term is a named entity, by attempting to:
determine a transliteration measure between the selected candidate word and the selected OOV term when the selected candidate word is a qualifying candidate word and the selected OOV term is a qualifying OOV term; and
determine whether the selected candidate word is a viable transliteration of the selected OOV term based on the transliteration measure; and
an updating module configured to add each viable transliteration identified by the mining module to the translation dictionary to provide an updated translation dictionary.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20
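The module decomposition recited in claim 12 can be pictured as four small cooperating components. The sketch below is one possible arrangement under stated assumptions and is not taken from the specification: the class and method names are hypothetical, the `difflib` similarity score and the 0.7 threshold stand in for an unspecified transliteration measure, and keeping only the best-scoring candidate per OOV term is a simplification. Every OOV term is treated alike, with no named-entity check, in line with the claim.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple


@dataclass
class TranslationStore:
    """Data store holding the source-to-target translation dictionary."""
    entries: Dict[str, str] = field(default_factory=dict)


@dataclass
class InVocabularyModule:
    store: TranslationStore

    def split(self, query_terms: List[str]) -> Tuple[List[str], List[str]]:
        """Return (translated in-vocabulary terms, untranslated OOV terms)."""
        entries = self.store.entries
        translated = [entries[t] for t in query_terms if t in entries]
        oov = [t for t in query_terms if t not in entries]
        return translated, oov


def default_measure(oov_term: str, candidate: str) -> float:
    """Stand-in similarity score (an assumption, not the patent's measure)."""
    return SequenceMatcher(None, oov_term, candidate).ratio()


@dataclass
class MiningModule:
    measure: Callable[[str, str], float] = default_measure
    threshold: float = 0.7  # arbitrary cut-off chosen for this sketch

    def mine(self, oov_terms: List[str], candidate_words: List[str]) -> Dict[str, str]:
        """Pair every OOV term with every candidate; keep the best viable one."""
        viable: Dict[str, str] = {}
        for oov in oov_terms:
            scored = [(self.measure(oov, word), word) for word in candidate_words]
            if scored:
                best_score, best_word = max(scored)
                if best_score >= self.threshold:
                    viable[oov] = best_word
        return viable


@dataclass
class UpdatingModule:
    store: TranslationStore

    def add(self, viable: Dict[str, str]) -> None:
        """Write mined transliterations back into the shared dictionary."""
        self.store.entries.update(viable)


# Usage with toy Latin-script data.
store = TranslationStore({"teoria": "theory"})
translated, oov = InVocabularyModule(store).split(["teoria", "ainstain"])
viable = MiningModule().mine(oov, "theory of relativity by einstein".split())
UpdatingModule(store).add(viable)
```

Sharing one `TranslationStore` between the in-vocabulary and updating components means a later query sees the mined entries without any extra synchronization step.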
-
Specification