Multilingual document retrieval system and method using semantic vector matching
Abstract
A document retrieval system where a user can enter a query, including a natural language query, in a desired one of a plurality of supported languages, and retrieve documents from a database that includes documents in at least one other language of the plurality of supported languages. The user need not have any knowledge of the other languages. Each document in the database is subjected to a set of processing steps to generate a language-independent conceptual representation of the subject content of the document. This is normally done before the query is entered. The query is also subjected to a (possibly different) set of processing steps to generate a language-independent conceptual representation of the subject content of the query. The documents and queries can also be subjected to additional analysis to provide additional term-based representations, such as the extraction of information-rich terms and phrases (such as proper nouns). Documents are matched to queries based on the conceptual-level contents of the document and query, and, optionally, on the basis of the term-based representation. The query's representation is then compared to each document's representation to generate a measure of relevance of the document to the query.
35 Claims
1. A method of representing documents in a database that includes documents in a plurality of languages, the method carried out for each document, comprising:
determining a set of potential conceptual-level meanings of at least some words in the document from a language-independent multilingual concept database that reflects the plurality of languages and comprises a collection of concept groups;
mapping the sets of potential conceptual-level meanings, so determined, to respective single language-independent conceptual-level meanings; and
generating a language-independent conceptual representation of the subject content of the document based on the language-independent conceptual-level meanings determined in said mapping step.
Dependent claims: 2-11.
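The three steps of claim 1 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the patent: the toy `CONCEPT_DB` and `RELATED` tables, the concept-group names, and the disambiguation heuristic (prefer the candidate related to the most concepts suggested by the other words) are all hypothetical.

```python
from collections import Counter

# Hypothetical toy concept database: (language, word) -> candidate concept groups.
CONCEPT_DB = {
    ("en", "bank"): ["FINANCIAL_INSTITUTION", "RIVER_EDGE"],
    ("en", "loan"): ["LENDING"],
    ("fr", "banque"): ["FINANCIAL_INSTITUTION"],
    ("fr", "prêt"): ["LENDING", "READY"],
}

# Hypothetical relatedness between concept groups, used only to disambiguate.
RELATED = {
    "FINANCIAL_INSTITUTION": {"LENDING"},
    "LENDING": {"FINANCIAL_INSTITUTION"},
}


def candidate_concepts(words, lang):
    """Step 1: look up the potential concepts of each word found in the database."""
    return [CONCEPT_DB[(lang, w)] for w in words if (lang, w) in CONCEPT_DB]


def disambiguate(candidate_sets):
    """Step 2: map each candidate set to a single concept.

    Assumed heuristic: choose the candidate related to the most concepts
    suggested by the other words; ties go to the first listed candidate.
    """
    chosen = []
    for i, candidates in enumerate(candidate_sets):
        context = {
            c
            for j, other in enumerate(candidate_sets)
            if j != i
            for c in other
        }
        best = max(candidates, key=lambda c: len(RELATED.get(c, set()) & context))
        chosen.append(best)
    return chosen


def concept_representation(words, lang):
    """Step 3: build a bag-of-concepts representation of the document."""
    return Counter(disambiguate(candidate_concepts(words, lang)))


english = concept_representation(["the", "bank", "approved", "the", "loan"], "en")
french = concept_representation(["la", "banque", "a", "accordé", "le", "prêt"], "fr")
```

In this toy setup the English and French sentences yield the same bag of concepts, which is the property that makes the representation usable across languages.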
12. A method of retrieving documents in response to a query, the query being in a user-selected language of a plurality of languages, the method comprising:
providing a corpus of documents, each in a language of said plurality of languages, at least one of the documents being in a language other than the user-selected language;
for each document, determining a set of multilingual concepts of at least some words in the document using a language-independent multilingual concept database that reflects the plurality of languages and comprises a collection of concept groups;
mapping the sets of multilingual concepts, so determined, to respective single language-independent conceptual-level meanings; and
generating a language-independent conceptual representation of the subject content of the document based on the language-independent conceptual-level meanings determined in said mapping;
generating a language-independent conceptual representation of the subject content of the query; and
for each document, generating a measure of relevance of the document to the query using the conceptual representation of the subject content of the document and the conceptual representation of the subject content of the query.
Dependent claims: 13-33.
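The final step of claim 12 compares two conceptual representations to produce a measure of relevance. The claim does not say how the measure is computed. The sketch below assumes cosine similarity over sparse concept-weight vectors, a common choice for vector matching, and uses hypothetical document IDs and concept names.

```python
import math


def cosine_relevance(doc_vec, query_vec):
    """Cosine similarity between two sparse concept vectors (dict: concept -> weight)."""
    dot = sum(weight * query_vec.get(concept, 0.0) for concept, weight in doc_vec.items())
    norm_doc = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_query = math.sqrt(sum(w * w for w in query_vec.values()))
    if norm_doc == 0 or norm_query == 0:
        return 0.0  # an empty representation matches nothing
    return dot / (norm_doc * norm_query)


def rank(corpus, query_vec):
    """Score every document against the query and sort by descending relevance."""
    scored = [(doc_id, cosine_relevance(vec, query_vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical corpus: the original language of each document no longer matters,
# because documents and query share the same concept space.
corpus = {
    "fr-1": {"FINANCIAL_INSTITUTION": 2, "LENDING": 1},
    "de-1": {"WEATHER": 3},
    "en-1": {"LENDING": 1, "CURRENCY": 1},
}
query = {"LENDING": 1}

ranking = rank(corpus, query)
```

Because scoring operates only on concept identifiers, a query entered in one language can rank documents written in any other supported language.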
34. A method of retrieving documents in response to a query in a user-selected language of a plurality of languages, the method comprising:
(a) providing a corpus of documents, each in a language of said plurality of languages, at least some of the documents being in a language other than the user-selected language;
(b) processing each document by determining the language of the document, determining conceptual-level meaning of at least some words in the document from a language-independent multilingual concept database comprising a collection of concept groups, mapping the conceptual-level meanings into language-independent concepts, and generating a conceptual-level vector representing the subject content of the document;
(c) processing the query by mapping words or phrases in the query into language-independent concepts, and generating a conceptual-level vector representing the subject content of the query; and
(d) for each document, determining a measure of relevance to the query.
Dependent claims: 35.
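Step (b) of claim 34 adds a language-determination stage before concept lookup and calls for a conceptual-level vector. The sketch below is one hypothetical way to arrange those stages. The stopword-overlap language guess, the tiny `LEXICON` (already reduced to one concept per word for brevity) and the fixed `CONCEPTS` dimension order are all assumptions for illustration, not the method described in the patent.

```python
# Hypothetical per-language stopword lists used for a crude language guess.
STOPWORDS = {
    "en": {"the", "of", "and", "a", "is"},
    "fr": {"le", "la", "de", "et", "est"},
}

# Hypothetical lexicon, already disambiguated to one concept per word.
LEXICON = {
    "en": {"bank": "FINANCIAL_INSTITUTION", "loan": "LENDING"},
    "fr": {"banque": "FINANCIAL_INSTITUTION", "prêt": "LENDING"},
}

# Fixed dimension order of the conceptual-level vector.
CONCEPTS = ["FINANCIAL_INSTITUTION", "LENDING"]


def guess_language(tokens):
    """Pick the language whose stopword list overlaps the tokens the most."""
    return max(STOPWORDS, key=lambda lang: sum(t in STOPWORDS[lang] for t in tokens))


def concept_vector(text):
    """Determine the language, map words to concepts, and count concepts per dimension."""
    tokens = text.lower().split()
    lang = guess_language(tokens)
    vector = [0] * len(CONCEPTS)
    for token in tokens:
        concept = LEXICON[lang].get(token)
        if concept is not None:
            vector[CONCEPTS.index(concept)] += 1
    return lang, vector


doc_en = concept_vector("The bank approved the loan")
doc_fr = concept_vector("La banque accorde le prêt")
```

The same function could serve step (c) by passing it the query text, since the query is mapped into the same concept dimensions as the documents.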
Specification