System and method for context-based document retrieval
Abstract
A system and method for document retrieval are disclosed. The invention addresses a major problem in text-based document retrieval: rapidly finding a small subset of documents in a large document collection (e.g., Web pages on the Internet) that are relevant to a limited set of query terms supplied by the user. The invention uses information contained in the document collection about the statistics of word relationships (“context”) to facilitate the specification of search queries and document comparison. The method first compiles word relationships into a context database that captures the statistics of word proximity and occurrence throughout the document collection. At retrieval time, a search matrix is computed from a set of user-supplied keywords and the context database. For each document in the collection, a similar matrix is computed using the contents of the document and the context database. Document relevance is determined by comparing the search and document matrices for similarity. The disclosed system therefore retrieves documents by contextual similarity rather than word frequency similarity, simplifying search specification while allowing greater search precision.
333 Citations
11 Claims
1. A computer-implemented context-based document retrieval method comprising:
a) for each document, i, in a document collection, computing a word relationship matrix D(i) comprising proximity statistics of word pairs in the document; and
calculating for each document a word frequency vector;
b) generating a context database comprising an N×N matrix C, wherein C is computed from the word relationship matrices D(i) of all the documents in the document collection, where N is a total number of unique words in the document collection;
c) computing a search matrix S from a search query and the matrix C;
d) for each document, i, computing a weight W(i) from the search matrix S and the word relationship matrix D(i) for the document; and
e) retrieving and ranking documents based on the weights W(i).
Dependent claims: 2, 3, 4, 5
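Steps a) through e) of claim 1 can be illustrated with a minimal Python sketch. The claim fixes none of the formulas, so everything concrete here is an assumption made only for illustration: the whitespace `tokenize` helper, the `window` size, the 1/distance proximity score, forming C by summing the D(i), forming S by selecting the entries of C whose first word is a query word, and scoring W(i) as a cosine similarity. All function names are hypothetical.

```python
from collections import Counter, defaultdict
import math


def tokenize(text):
    # Assumption: lowercase whitespace tokenization.
    return text.lower().split()


def word_relationship_matrix(tokens, window=3):
    """Step a): sparse D(i), mapping (w1, w2) to a proximity score.

    Assumption: each pair within `window` positions scores 1/distance,
    and the matrix is kept symmetric.
    """
    d = defaultdict(float)
    for pos, w1 in enumerate(tokens):
        for dist in range(1, window + 1):
            if pos + dist < len(tokens):
                w2 = tokens[pos + dist]
                d[(w1, w2)] += 1.0 / dist
                d[(w2, w1)] += 1.0 / dist
    return dict(d)


def context_matrix(doc_matrices):
    """Step b): C, assumed here to be the elementwise sum of all D(i)."""
    c = defaultdict(float)
    for d in doc_matrices:
        for pair, value in d.items():
            c[pair] += value
    return dict(c)


def search_matrix(query_tokens, c):
    """Step c): S, assumed to be the entries of C whose first word is a query word."""
    wanted = set(query_tokens)
    return {pair: v for pair, v in c.items() if pair[0] in wanted}


def weight(s, d):
    """Step d): W(i), assumed to be the cosine similarity of S and D(i)."""
    dot = sum(v * d.get(pair, 0.0) for pair, v in s.items())
    norm_s = math.sqrt(sum(v * v for v in s.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_s == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_s * norm_d)


def index_collection(docs, window=3):
    token_lists = [tokenize(text) for text in docs]
    # Word frequency vectors from step a). The claim does not say how they
    # enter the later steps, so this sketch only stores them.
    freqs = [Counter(tokens) for tokens in token_lists]
    ds = [word_relationship_matrix(tokens, window) for tokens in token_lists]
    return freqs, ds, context_matrix(ds)


def rank(query, ds, c):
    """Step e): document indices sorted by descending W(i)."""
    s = search_matrix(tokenize(query), c)
    weights = [weight(s, d) for d in ds]
    order = sorted(range(len(ds)), key=lambda i: weights[i], reverse=True)
    return [(i, weights[i]) for i in order]
```

Because S is drawn from the collection-wide matrix C rather than from the query words alone, a document can score well by containing words that tend to appear near the query terms elsewhere in the collection, which is the contextual effect the abstract describes.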
6. A computer-implemented context-based document retrieval method comprising:
a) for each document, i, in a document collection, computing a word-phrase relationship matrix D(i) comprising proximity statistics of word n-tuples in the document;
wherein n is a number of words in each word-phrase relationship; and
calculating for each document a word-phrase frequency vector;
b) generating a context database comprising an N×N matrix C, wherein C is computed from the word-phrase relationship matrices D(i) of all the documents in the document collection, where N is a total number of unique word-phrases in the document collection;
c) computing a search matrix S from a search query and the matrix C;
d) for each document, i, computing a weight W(i) from the search matrix S and the word-phrase relationship matrix D(i) for the document; and
e) retrieving and ranking documents based on the weights W(i).
Dependent claims: 7, 8, 9, 10
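Claim 6 replaces single words with word-phrases as the unit of analysis. The sketch below shows one possible reading of step a) only, and it rests on assumptions the claim does not state: a word-phrase is taken to be a contiguous run of n words, and the proximity statistic is a plain count of phrase pairs that start within `window` phrase positions of each other. The function names are hypothetical.

```python
from collections import Counter


def word_phrases(tokens, n=2):
    # Assumption: a word-phrase is a contiguous run of n words.
    return [tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]


def phrase_frequency_vector(phrases):
    # Word-phrase frequency vector: occurrences of each unique phrase.
    return Counter(phrases)


def phrase_relationship_matrix(phrases, window=2):
    """Sparse D(i) over phrase pairs.

    Assumption: count each ordered pair (p1, p2) where p2 starts at most
    `window` phrase positions after p1.
    """
    d = Counter()
    for pos, p1 in enumerate(phrases):
        for p2 in phrases[pos + 1:pos + 1 + window]:
            d[(p1, p2)] += 1
    return d
```

Under this reading, N in step b) is the number of distinct keys in the frequency vectors taken over the whole collection.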
11. A computer-implemented context-based document retrieval method comprising:
a) for each document, i, in a document collection, computing a word-phrase relationship tensor D(i) comprising proximity statistics of word-phrase n-tuples in the document, where n is at least two;
wherein n is a number of words in each word-phrase relationship; and
calculating for each document a word-phrase frequency vector;
b) generating a context database comprising a tensor C, wherein C is computed from the word-phrase relationship tensors D(i) of all the documents in the document collection;
c) computing a search tensor S from a search query and the tensor C;
d) for each document, i, computing a weight W(i) from the search tensor S and the word-phrase relationship tensor D(i) for the document; and
e) retrieving and ranking documents based on the weights W(i).
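Claim 11 generalizes the matrices to tensors indexed by n-tuples. A sparse dictionary keyed by n-tuples is one way to hold such a tensor without allocating a dense array. The sketch below is illustrative only and assumes details the claim leaves open: single words serve as the units, the statistic is a count of ordered n-tuples that fall inside a sliding `window`, C is the sum of the D(i), and W(i) is an unnormalized inner product. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def relationship_tensor(tokens, n=3, window=4):
    """Sparse order-n tensor D(i): n-tuple of words -> co-occurrence count.

    Each tuple is anchored on the first word of the window, so every set of
    n positions lying within one window is counted exactly once.
    """
    t = Counter()
    for start in range(len(tokens)):
        span = tokens[start:start + window]
        if len(span) < n:
            break  # every later span is shorter still
        for rest in combinations(span[1:], n - 1):
            t[(span[0],) + rest] += 1
    return t


def context_tensor(tensors):
    # Assumption: C is the elementwise sum of all D(i).
    c = Counter()
    for t in tensors:
        c.update(t)
    return c


def search_tensor(query_tokens, c):
    # Assumption: S keeps the entries of C that mention any query word.
    wanted = set(query_tokens)
    return {key: v for key, v in c.items() if wanted.intersection(key)}


def inner_product(s, d):
    # Assumption: W(i) is the unnormalized inner product of S and D(i).
    return sum(v * d.get(key, 0) for key, v in s.items())
```

With n=2 the same code reduces to ordered word-pair counts, the matrix case of the earlier claims.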
Specification