
Compressed document surrogates

  • US 7,240,056 B2
  • Filed: 12/12/2003
  • Issued: 07/03/2007
  • Est. Priority Date: 07/30/1999
  • Status: Expired due to Term
First Claim

1. A method for returning a list of a number of documents N in order of predicted utility, from among a collection of documents, as predicted by a search query containing terms to be present or absent, the method comprising:

  • (a) creating a compressed document surrogate for each document in the database, where the compressed document surrogate contains information about each term, from among the terms of interest in the database, which occurs in the document, and which compressed document surrogate is created with top and remainder inverted term lists that contain information about the terms of interest in the database, and where the information about each term included in the compressed document surrogate for a document includes at least one of:

    the term identification number of the term, the location in a lookup table of an entry for the term, the number of times the term occurs in the document, the location in the document of each occurrence of the term, the address of the inverted term list of the term which contains the document, and the address of the location in the inverted term list of the document;

    (b) choosing, from among the terms in the search query which are to be found in documents, the term whose top inverted term list has not yet been considered, which occurs in the fewest documents in the collection;

    (c) consulting the top inverted term list for said term, calculating the score for each document found in the top inverted term list;

    (i) if the document has not previously been found on an inverted term list, assigning the document the calculated score;

    (ii) if the document has previously been found on an inverted term list, increasing its previously-calculated score by the calculated score;

    (d) calculating a maximum score, SMax, achieved by a document, not already found on a top inverted term list, if it is found on all top inverted term lists, for terms to be found in documents, not yet consulted;

    (e) calculating a maximum score, SSub, to be subtracted from a document score, as a result of said document being found to contain terms to be absent from a document;

    (f) determining whether there are N or more documents already found, with scores such that if SSub were subtracted from their scores, the remainder would be greater than SMax;

    (g) if there are N or more such documents, determining by use of the compressed document surrogate for each document a final score for the documents that have so far been found in any inverted term list of a desired term, and providing a list of the N documents with the highest scores, ranked in order of score;

    (h) if there are not N or more such documents, repeating (b) through (f) until either N or more such documents are found, or until no top inverted term list of a term to be found in the document has not been analyzed;

    (i) if there are not N or more such documents, and the top inverted term lists of all terms desired to be found in the document have been analyzed, repeating (b) through (h) utilizing remainder inverted term lists instead of top inverted term lists, until either N or more such documents are found, or until no remainder inverted term list of a term desired to be found in the document has not been analyzed; and

    (j) determining by use of the compressed document surrogate for each document the final score for the documents found on the inverted term lists of the desired terms, and providing a list of the documents ranked in order of score.
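
The flow of steps (a) through (j) may be easier to follow as a small Python sketch. This is an illustration only, not the patented implementation. The claim does not fix a scoring function, so the sketch assumes that a term's score contribution is simply its occurrence count in the document and that each top list holds a term's highest-count postings. The names `TermLists`, `build_surrogates`, `search`, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TermLists:
    """Postings for one term (doc id -> occurrence count), split in two lists."""
    top: Dict[str, int]        # assumed: the highest-count postings
    remainder: Dict[str, int]  # assumed: all other postings

    def postings(self, phase: str) -> Dict[str, int]:
        return self.top if phase == "top" else self.remainder

    def doc_count(self) -> int:
        return len(self.top) + len(self.remainder)

    def max_count(self) -> int:
        return max([*self.top.values(), *self.remainder.values()], default=0)


def build_surrogates(index: Dict[str, TermLists]) -> Dict[str, Dict[str, int]]:
    """Step (a): one surrogate per document, here just term -> occurrence count."""
    surrogates: Dict[str, Dict[str, int]] = {}
    for term, lists in index.items():
        for phase in ("top", "remainder"):
            for doc, count in lists.postings(phase).items():
                surrogates.setdefault(doc, {})[term] = count
    return surrogates


def search(index, surrogates, wanted: List[str], unwanted: List[str], n: int) -> List[str]:
    def final_score(doc: str) -> int:
        # Steps (g)/(j): the surrogate gives the full score without
        # walking any further inverted term lists.
        counts = surrogates[doc]
        return (sum(counts.get(t, 0) for t in wanted)
                - sum(counts.get(t, 0) for t in unwanted))

    # Step (e): SSub, the most that unwanted terms could subtract.
    s_sub = sum(index[t].max_count() for t in unwanted if t in index)
    found: Dict[str, int] = {}  # doc id -> partial score so far

    for phase in ("top", "remainder"):  # step (i) repeats with remainder lists
        # Step (b): consult the rarest wanted term first.
        pending = sorted((t for t in wanted if t in index),
                         key=lambda t: index[t].doc_count())
        while pending:  # step (h): loop until no list is left in this phase
            term = pending.pop(0)
            # Step (c): assign or increase each document's score.
            for doc, count in index[term].postings(phase).items():
                found[doc] = found.get(doc, 0) + count
            # Step (d): SMax, the best score an unseen document could still reach.
            s_max = sum(max(index[t].postings(phase).values(), default=0)
                        for t in pending)
            # Step (f): documents that no unseen document can overtake.
            safe = [d for d, s in found.items() if s - s_sub > s_max]
            if len(safe) >= n:  # step (g): stop early and rank
                return sorted(found, key=final_score, reverse=True)[:n]

    # Step (j): every list was consulted; rank everything that was found.
    return sorted(found, key=final_score, reverse=True)


# Hypothetical sample collection.
index = {
    "cat": TermLists(top={"d1": 5, "d2": 4}, remainder={"d3": 2}),
    "dog": TermLists(top={"d1": 3, "d4": 2}, remainder={"d2": 1, "d5": 1}),
    "fish": TermLists(top={"d4": 2}, remainder={}),
}
surrogates = build_surrogates(index)
print(search(index, surrogates, ["cat", "dog"], ["fish"], 2))  # ['d1', 'd2']
```

With `n = 2` the search stops after the two top lists, because two documents already score higher than anything an unseen document could reach. With `n = 3` that test never succeeds, so the sketch falls through to the remainder lists and finally to step (j).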
