System and method for context-based document retrieval

US 6,633,868 B1
Filed: 07/28/2000
Issued: 10/14/2003
Est. Priority Date: 07/28/2000
Status: Expired due to Fees

Abstract

A system and method for document retrieval is disclosed. The invention addresses a major problem in text-based document retrieval: rapidly finding a small subset of documents in a large document collection (e.g. Web pages on the Internet) that are relevant to a limited set of query terms supplied by the user. The invention is based on utilizing information contained in the document collection about the statistics of word relationships (“context”) to facilitate the specification of search queries and document comparison. The method consists of first compiling word relationships into a context database that captures the statistics of word proximity and occurrence throughout the document collection. At retrieval time, a search matrix is computed from a set of user-supplied keywords and the context database. For each document in the collection, a similar matrix is computed using the contents of the document and the context database. Document relevance is determined by comparing the similarity of the search and document matrices. The disclosed system therefore retrieves documents with contextual similarity rather than word frequency similarity, simplifying search specification while allowing greater search precision.

333 Citations

11 Claims

1. A computer-implemented context-based document retrieval method comprising:
- a) for each document, i, in a document collection, computing a word relationship matrix D(i) comprising proximity statistics of word pairs in the document; and calculating for each document a word frequency vector;
- b) generating a context database comprising an N×N matrix C, wherein C is computed from the word relationship matrices D(i) of all the documents in the document collection, where N is a total number of unique words in the document collection;
- c) computing a search matrix S from a search query and the matrix C;
- d) for each document, i, computing a weight W(i) from the search matrix S and the word relationship matrix D(i) for the document; and
- e) retrieving and ranking documents based on the weights W(i).

2. The method of claim 1 wherein the word relationship matrix D(i) further comprises word pair incidence counts.

3. The method of claim 1 wherein computing C comprises adding the matrices D(i) of all the documents in the document collection.

4. The method of claim 1 wherein computing S comprises selecting column vectors of C corresponding to keywords of the search query, and forming S from the column vectors.

5. The method of claim 1 wherein computing W(i) comprises element-by-element multiplication of D(i) and S, followed by a summation of all resulting elements.
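
Claims 1 and 3 to 5 can be read as a short pipeline: build D(i) per document, add the D(i) to get C, keep the keyword columns of C as S, then sum the element-by-element product of D(i) and S. The following is a minimal sketch of that reading, not the specification's implementation. The claims do not fix the proximity statistic, so the sketch assumes a symmetric count of word pairs that fall within a small window; the function names, the `window` parameter, and the sparse dictionary storage are all illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relationship_matrix(tokens, window=2):
    """Sparse D(i): (row_word, col_word) -> number of times the two words
    occur within `window` positions of each other (assumed statistic)."""
    d = Counter()
    for pos, word in enumerate(tokens):
        for other in tokens[pos + 1 : pos + 1 + window]:
            if other != word:
                d[(word, other)] += 1
                d[(other, word)] += 1  # keep the matrix symmetric
    return d


def context_matrix(doc_matrices):
    """C as the element-wise sum of all D(i), as in claim 3."""
    c = Counter()
    for d in doc_matrices:
        c.update(d)  # Counter.update adds counts
    return c


def search_matrix(c, keywords):
    """S keeps only the columns of C that correspond to query keywords (claim 4)."""
    keys = set(keywords)
    return {(row, col): v for (row, col), v in c.items() if col in keys}


def weight(d, s):
    """W(i): element-by-element product of D(i) and S, then summed (claim 5)."""
    return sum(v * s[cell] for cell, v in d.items() if cell in s)


def rank(docs, query, window=2):
    """Return document indices ordered by descending W(i), plus the weights
    and the per-document word frequency vectors from step a)."""
    token_lists = [tokenize(text) for text in docs]
    frequencies = [Counter(tokens) for tokens in token_lists]
    matrices = [relationship_matrix(tokens, window) for tokens in token_lists]
    s = search_matrix(context_matrix(matrices), tokenize(query))
    weights = [weight(d, s) for d in matrices]
    order = sorted(range(len(docs)), key=lambda i: weights[i], reverse=True)
    return order, weights, frequencies


docs = [
    "the river bank flooded",
    "the bank approved the loan",
    "a loan from the bank",
]
order, weights, frequencies = rank(docs, "loan")
print(order, weights)  # [2, 1, 0] [0, 3, 4]
```

In this toy run the third document outranks the second because its neighbours of "loan" include "the", which is the pair that occurs most often near "loan" across the collection. The word frequency vectors are computed because step a) recites them, but claims 3 to 5 do not use them, so the sketch simply returns them.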

6. A computer-implemented context-based document retrieval method comprising:
- a) for each document, i, in a document collection, computing a word-phrase relationship matrix D(i) comprising proximity statistics of word n-tuples in the document; wherein n is a number of words in each word-phrase relationship; and calculating for each document a word-phrase frequency vector;
- b) generating a context database comprising an N×N matrix C, wherein C is computed from the word-phrase relationship matrices D(i) of all the documents in the document collection, where N is a total number of unique word-phrases in the document collection;
- c) computing a search matrix S from a search query and the matrix C;
- d) for each document, i, computing a weight W(i) from the search matrix S and the word-phrase relationship matrix D(i) for the document; and
- e) retrieving and ranking documents based on the weights W(i).

7. The method of claim 6 wherein the word-phrase relationship matrix D(i) further comprises word-phrase incidence counts.

8. The method of claim 6 wherein computing C comprises adding the matrices D(i) of all the documents in the document collection.

9. The method of claim 6 wherein computing S comprises selecting column vectors of C corresponding to keyword-phrases of the search query, and forming S from the column vectors.

10. The method of claim 6 wherein computing W(i) comprises element-by-element multiplication of D(i) and S, followed by a summation of all resulting elements.
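
Claims 6 to 10 repeat the matrix method with word-phrases in place of single words as the row and column labels. One possible reading, sketched below, treats a word-phrase as a run of n consecutive words. That interpretation, the helper names `word_phrases` and `phrase_relationship_matrix`, and the windowed pair count used as the proximity statistic are assumptions made for illustration; the claims themselves do not define how phrases are extracted.

```python
from collections import Counter


def word_phrases(tokens, n=2):
    """All runs of n consecutive words, each joined into one phrase label
    (assumed definition of a word-phrase)."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def phrase_relationship_matrix(phrases, window=2):
    """Sparse D(i) over phrases: (row_phrase, col_phrase) -> number of times
    the two phrases start within `window` positions of each other."""
    d = Counter()
    for pos, phrase in enumerate(phrases):
        for other in phrases[pos + 1 : pos + 1 + window]:
            if other != phrase:
                d[(phrase, other)] += 1
                d[(other, phrase)] += 1  # keep the matrix symmetric
    return d


tokens = "new york stock exchange opens in new york".split()
phrases = word_phrases(tokens, n=2)
frequency = Counter(phrases)            # word-phrase frequency vector
d = phrase_relationship_matrix(phrases)

print(frequency["new york"])            # 2
print(d[("opens in", "new york")])      # 1
```

Once documents are reduced to phrase labels this way, C, S, and W(i) would follow claims 8 to 10 unchanged: sum the D(i), select the columns for the query's keyword-phrases, and sum the element-by-element product.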

11. A computer-implemented context-based document retrieval method comprising:
- a) for each document, i, in a document collection, computing a word-phrase relationship tensor D(i) comprising proximity statistics of word-phrase n-tuples in the document, where n is at least two; wherein n is a number of words in each word-phrase relationship; and calculating for each document a word-phrase frequency vector;
- b) generating a context database comprising a tensor C, wherein C is computed from the word-phrase relationship tensors D(i) of all the documents in the document collection;
- c) computing a search tensor S from a search query and the tensor C;
- d) for each document, i, computing a weight W(i) from the search tensor S and the word-phrase relationship tensor D(i) for the document; and
- e) retrieving and ranking documents based on the weights W(i).
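
Claim 11 generalizes the matrices to tensors but does not say how C, S, or W(i) are computed. The sketch below carries the operations of claims 3 to 5 over by analogy, which is an assumption rather than something the claim states. It stores an order-3 tensor sparsely, keyed by the sorted triple of distinct words that co-occur within a window; the window size, the sorted-key storage, and all function names are hypothetical choices for illustration.

```python
from collections import Counter
from itertools import combinations


def relationship_tensor(tokens, n=3, window=4):
    """Sparse order-n tensor: sorted n-tuple of distinct words -> number of
    times those words appear together within `window` consecutive positions."""
    t = Counter()
    for pos, word in enumerate(tokens):
        following = tokens[pos + 1 : pos + window]
        for rest in combinations(following, n - 1):
            group = (word,) + rest
            if len(set(group)) == n:  # skip tuples with repeated words
                t[tuple(sorted(group))] += 1
    return t


def context_tensor(tensors):
    """C as the element-wise sum of all D(i) (assumed, by analogy with claim 3)."""
    c = Counter()
    for t in tensors:
        c.update(t)
    return c


def search_tensor(c, keywords):
    """S keeps the entries of C whose tuple contains a query keyword
    (assumed, by analogy with the column selection of claim 4)."""
    keys = set(keywords)
    return {group: v for group, v in c.items() if keys & set(group)}


def weight(t, s):
    """W(i): element-by-element product of D(i) and S, then summed."""
    return sum(v * s[group] for group, v in t.items() if group in s)


docs = [
    "solar panel energy",
    "solar panel energy storage",
    "wind energy storage",
]
tensors = [relationship_tensor(text.split()) for text in docs]
s = search_tensor(context_tensor(tensors), ["solar"])
weights = [weight(t, s) for t in tensors]
ranking = sorted(range(len(docs)), key=lambda i: weights[i], reverse=True)

print(weights, ranking)  # [2, 4, 0] [1, 0, 2]
```

Storing only the sorted tuple keeps one entry per unordered word group instead of one per permutation, which is a space-saving choice of this sketch and not a requirement of the claim.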

Current Assignee: Shermann Loyall Min
Original Assignee: Shermann Loyall Min
Inventors: Min, Shermann Loyall; Tanno, Constantin Lorenzo; Mainen, Zachary Frank; Softky, William Russell
Primary Examiner(s): Rones, Charles
Assistant Examiner(s): Wu, Yicun
Application Number: US09/627,617
Time in Patent Office: 1,173 Days
Field of Search: 705/5, 345/440, 400/63, 707/101, 707/3, 707/102, 707/4
US Class Current: 1/1
CPC Class Codes:
G06F 16/951   Indexing; Web crawling tech...
G06F 16/9532   Query formulation
G06F 16/9538   Presentation of query results
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
