Compressed document surrogates

US 7,240,056 B2
Filed: 12/12/2003
Issued: 07/03/2007
Est. Priority Date: 07/30/1999
Status: Expired due to Term


Abstract

Disclosed is a method and device for storing information about Web documents such as pages or sites in a manner which may be used in conjunction with inverted term lists to facilitate the retrieval of documents of interest from the Web. The method involves constructing compressed surrogates for documents, such that various operations may be performed without the need to retrieve a copy of the document from the Web. The method permits the efficient updating of inverted term lists when documents on the Web have been modified or deleted, and also permits the efficient processing of search queries in a variety of circumstances.
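
As an illustration of the idea in the abstract, here is a minimal Python sketch of a document surrogate and of how it could let an index drop a deleted document without fetching the page again. It is not the patented implementation: the names `TermEntry`, `build_surrogate`, `index_document`, and `remove_document` are hypothetical, the surrogate keeps only three of the fields listed in claim 1 (term identification number, occurrence count, occurrence locations), and no actual compression is performed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TermEntry:
    """One surrogate entry per term of interest (hypothetical layout)."""
    term_id: int          # term identification number
    count: int            # number of times the term occurs in the document
    positions: List[int]  # location of each occurrence (token offsets)


def build_surrogate(tokens: List[str], term_ids: Dict[str, int]) -> List[TermEntry]:
    """Record every term of interest that occurs in a tokenized document."""
    by_term: Dict[int, List[int]] = {}
    for pos, tok in enumerate(tokens):
        if tok in term_ids:
            by_term.setdefault(term_ids[tok], []).append(pos)
    return [TermEntry(tid, len(ps), ps) for tid, ps in sorted(by_term.items())]


def index_document(doc_id: str, surrogate: List[TermEntry],
                   inverted: Dict[int, Dict[str, int]]) -> None:
    """Add the document to the inverted term list of each term it contains."""
    for entry in surrogate:
        inverted.setdefault(entry.term_id, {})[doc_id] = entry.count


def remove_document(doc_id: str, surrogates: Dict[str, List[TermEntry]],
                    inverted: Dict[int, Dict[str, int]]) -> None:
    """Delete a document using only its stored surrogate, with no re-fetch."""
    for entry in surrogates.pop(doc_id, []):
        postings = inverted.get(entry.term_id, {})
        postings.pop(doc_id, None)
        if not postings:
            inverted.pop(entry.term_id, None)
```

Because the surrogate already names every term list that mentions the document, removal touches only those lists rather than scanning the whole index.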


8 Claims

1. A method for returning a list of a number of documents N in order of predicted utility, from among a collection of documents, as predicted by a search query containing terms to be present or absent, the method comprising:
- (a) creating a compressed document surrogate for each document in the database, where the compressed document surrogate contains information about each term, from among the terms of interest in the database, which occurs in the document, and which compressed document surrogate is created with top and remainder inverted term lists that contain information about the terms of interest in the database, and where the information about each term included in the compressed document surrogate for a document includes at least one of:
  
  the term identification number of the term, the location in a lookup table of an entry for the term, the number of times the term occurs in the document, the location in the document of each occurrence of the term, the address of the inverted term list of the term which contains the document, and the address of the location in the inverted term list of the document;
  
  (b) choosing, from among the terms in the search query which are to be found in documents, the term whose top inverted term list has not yet been considered, which occurs in the fewest documents in the collection;
  
  (c) consulting the top inverted term list for said term, calculating the score for each document found in the top inverted term list;
  
  (i) if the document has not previously been found on an inverted term list, assigning the document the calculated score;
  
  (ii) if the document has previously been found on an inverted term list, increasing its previously-calculated score by the calculated score;
  
  (d) calculating a maximum score, S_Max, achieved by a document, not already found on a top inverted term list, if it is found on all top inverted term lists, for terms to be found in documents, not yet consulted;
  
  (e) calculating a maximum score, S_Sub, to be subtracted from a document score, as a result of said document being found to contain terms to be absent from a document;
  
  (f) determining whether there are N or more documents already found, with scores such that if S_Sub were subtracted from their scores, the remainder would be greater than S_Max;
  
  (g) if there are N or more such documents, determining by use of the compressed document surrogate for each document a final score for the documents that have so far been found in any inverted term list of a desired term, and providing a list of the N documents with the highest scores, ranked in order of score;
  
  (h) if there are not N or more such documents, repeating (b) through (f) until either N or more such documents are found, or until no top inverted term list of a term to be found in the document has not been analyzed;
  
  (i) if there are not N or more such documents, and the top inverted term lists of all terms desired to be found in the document have been analyzed, repeating (b) through (h) utilizing remainder inverted term lists instead of top inverted term lists, until either N or more such documents are found, or until no remainder inverted term list of a term desired to be found in the document has not been analyzed; and
  
  (j) determining by use of the compressed document surrogate for each document the final score for the documents found on the inverted term lists of the desired terms, and providing a list of the documents ranked in order of score.

2. The method of claim 1, wherein the documents are Web pages.

3. The method of claim 1, wherein the documents are Web sites.

4. The method of claim 1, wherein only terms desired to be found are contained in a search query, so that S_Sub is zero.
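
Steps (b) through (j) of claim 1 can be sketched in Python as follows. This is a simplified illustration under several assumptions that the claim does not state: the per-term score of a document is taken to be the raw occurrence count, the top list of a term holds the `top_k` documents with the highest counts, each absent term found in a document costs a fixed `penalty`, and a surrogate is reduced to a plain `{term: count}` dictionary. The names `build_lists`, `final_ranking`, `search`, and `SURROGATES` are hypothetical.

```python
def build_lists(surrogates, top_k):
    """Split each term's postings into a top list and a remainder list."""
    postings = {}
    for doc, terms in surrogates.items():
        for term, count in terms.items():
            postings.setdefault(term, []).append((doc, count))
    top, remainder = {}, {}
    for term, plist in postings.items():
        plist.sort(key=lambda p: (-p[1], p[0]))  # highest count first
        top[term] = dict(plist[:top_k])
        remainder[term] = dict(plist[top_k:])
    return top, remainder


def final_ranking(found, present, absent, surrogates, penalty, n=None):
    """Steps (g) and (j): rescore found documents from their surrogates."""
    def score(doc):
        terms = surrogates[doc]
        gained = sum(terms.get(t, 0) for t in present)
        lost = penalty * sum(1 for t in absent if t in terms)
        return gained - lost
    ranked = sorted(((d, score(d)) for d in found), key=lambda p: (-p[1], p[0]))
    return ranked if n is None else ranked[:n]


def search(present, absent, top, remainder, surrogates, n, penalty=1.0):
    s_sub = penalty * len(absent)  # (e) largest possible deduction

    def doc_freq(term):
        return len(top.get(term, {})) + len(remainder.get(term, {}))

    found = {}  # running partial score per document
    for lists in (top, remainder):  # (i) second pass uses remainder lists
        pending = sorted(present, key=doc_freq)  # (b) rarest term first
        while pending:
            term = pending.pop(0)
            for doc, s in lists.get(term, {}).items():  # (c)
                found[doc] = found.get(doc, 0) + s
            # (d) best score an unseen document could still reach
            s_max = sum(max(lists.get(t, {}).values(), default=0) for t in pending)
            # (f) documents that stay ahead even after the full deduction
            safe = sum(1 for s in found.values() if s - s_sub > s_max)
            if safe >= n:
                return final_ranking(found, present, absent, surrogates, penalty, n)  # (g)
    return final_ranking(found, present, absent, surrogates, penalty)  # (j)


SURROGATES = {
    "d1": {"cat": 5, "dog": 4},
    "d2": {"cat": 3, "fish": 1},
    "d3": {"dog": 1},
    "d4": {"cat": 1, "dog": 2, "fish": 2},
}
TOP, REM = build_lists(SURROGATES, top_k=1)
best = search(["cat", "dog"], ["fish"], TOP, REM, SURROGATES, n=1)
```

With this sample data and `n=1`, the loop stops during the top-list pass and never reads a remainder list; the final score of `d1` still reflects both query terms because it is recomputed from the surrogate rather than from the partial sums.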

5. A device for returning a list of a number of documents N in order of predicted utility, from among a collection of documents, as predicted by a search query containing terms to be present or absent, the device comprising:
- (a) a processor;
  
  (b) means for creating a compressed document surrogate for each document in the database, where the compressed document surrogate contains information about each term, from among the terms of interest in the database, which occurs in the document, and where the compressed document surrogate is created with top and remainder inverted term lists that contain information about the terms of interest in the database, and where the information about each term included in the compressed document surrogate for a document includes at least one of:
  
  the term identification number of the term, the location in a lookup table of an entry for the term, the number of times the term occurs in the document, the location in the document of each occurrence of the term, the address of the inverted term list of the term which contains the document, and the address of the location in the inverted term list of the document;
  
  (c) means for choosing, from among the terms in the search query which are to be found in documents, the term whose top inverted term list has not yet been considered, which occurs in the fewest documents in the collection;
  
  (d) means for consulting the top inverted term list for said term, and calculating the score for each document found in the top inverted term list;
  
  (i) means for assigning the document the calculated score, in response to the document not having previously been found on an inverted term list;
  
  (ii) means for increasing the document's previously-calculated score by the calculated score, in response to the document having previously been found on an inverted term list;
  
  (e) means for calculating a maximum score, S_Max, achieved by a document, not already found on a top inverted term list, in response to it being found on all top inverted term lists, for terms to be found in documents, not yet consulted;
  
  (f) means for calculating a maximum score, S_Sub, to be subtracted from a document score, as a result of said document being found to contain terms to be absent from a document;
  
  (g) means for determining whether there are N or more documents already found, with scores such that if S_Sub were subtracted from their scores, the remainder would be greater than S_Max;
  
  (h) means for determining by use of the compressed document surrogate for each document a final score for the documents that have so far been found in any inverted term list of a desired term, and providing a list of the N documents with the highest scores, ranked in order of score, in response to there being N or more documents already found with scores such that if S_Sub were subtracted from their scores, the remainder would be greater than S_Max;
  
  (i) means for repeating (c) through (g) until either N or more such documents are found, or until no top inverted term list of a term desired to be found in the document has not been analyzed, in response to there not being N or more such documents;
  
  (j) means for repeating (c) through (i) utilizing remainder inverted term lists instead of top inverted term lists, until either N or more such documents are found, or until no remainder inverted term list of a term desired to be found in the document has not been analyzed, in response to there not being N or more such documents, and the top inverted term lists of all terms desired to be found in the document having been analyzed; and
  
  (k) means for determining by use of the compressed document surrogate for each document the final score for the documents found on the inverted term lists of the desired terms, and providing a list of the documents ranked in order of score.

6. The device of claim 5, wherein the documents are Web pages.

7. The device of claim 5, wherein the documents are Web sites.

8. The device of claim 5, wherein only terms desired to be found are contained in a search query, so that S_Sub is zero.
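
The stopping test in step (f) of claim 1 and element (g) of claim 5 can be stated compactly. The symbols F (the set of documents found so far) and s(d) (the score accumulated so far by document d) are notation introduced here for illustration and do not appear in the claims.

```latex
\[
\bigl|\{\, d \in F : s(d) - S_{\mathrm{Sub}} > S_{\mathrm{Max}} \,\}\bigr| \;\ge\; N
\]
% Claims 4 and 8: the query has no terms to be absent, so S_Sub = 0
\[
S_{\mathrm{Sub}} = 0 \;\Longrightarrow\;
\bigl|\{\, d \in F : s(d) > S_{\mathrm{Max}} \,\}\bigr| \;\ge\; N
\]
```

In the case covered by claims 4 and 8, the test only asks whether N documents already outscore the best total that any not-yet-found document could still reach.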


Current Assignee: Knapp Investment Company Limited
Original Assignee: Verizon Laboratories Incorporated (Verizon Communications Inc.)
Inventors: Ponte, Jay Michael
Primary Examiner(s): Ali; Mohammad
Assistant Examiner(s): Ahn; Sangwoo
Application Number: US10/735,609
Publication Number: US 20060184521A1
Time in Patent Office: 1,299 Days
Field of Search: 707/3-5, 707/7
US Class Current: 707/693
CPC Class Codes:
G06F 16/9574   of access to content, e.g. ...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99937   Sorting
