Identifying duplicate documents from search results without comparing document content
Abstract
A computer system has a document collection of one or more documents and one or more indexes that each include an inverted file with one or more terms. Each of the terms is associated with one or more document identifiers. The index further includes a document catalog that associates each of the document identifiers with one or more attributes, either intrinsic or non intrinsic. A search engine process produces a hit list having one or more hit list entries. Each hit list entry has one or more hit list attributes and is associated with one of the documents that the search engine determines to be relevant to the query. A formatter processor selects one or more of the hit list attributes, identified by a hit list attribute selector, and then compares the selected attributes of two or more entries on the hit list to determine whether the documents associated with these entries are duplicate instances of one another. The determination can be made without examining the content of the documents associated with the entries.
35 Claims
1. A method of automatically determining duplicate documents on a hit-list containing one or more duplicate documents and document instances, the hit-list having a hit-list record for each instance of the documents, each hit-list record having one or more attribute fields, each attribute field containing one or more attributes of the documents, the method comprising the steps of:
selecting one or more of the attributes that are intrinsic attributes, the intrinsic attributes being established at a time of document creation and that are invariant with a location and replication of the document;
generating a pair of the hit-list records associated with the documents and intrinsic attributes;
comparing one or more of the intrinsic attributes of the pair of hit-list records;
using the comparison of the intrinsic attributes of the pair of hit-list records to determine if the documents are instances of the same document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
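The steps of claim 1 can be illustrated with a minimal sketch. It is not the patent's implementation: the `HitRecord` type, the function names, and the choice of title, size, and creation date as intrinsic attributes are all assumptions made for illustration. The point it shows is that a pair of hit-list records is judged using only the selected attribute fields, never the document content.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class HitRecord:
    """Hypothetical hit-list record; the field names are assumptions."""
    url: str          # location of this instance (not intrinsic)
    title: str        # assumed intrinsic attribute
    size_bytes: int   # assumed intrinsic attribute
    created: str      # assumed intrinsic attribute (creation date)


# Step 1: select the attributes treated as intrinsic (an assumed selection).
INTRINSIC = ("title", "size_bytes", "created")


def same_document(a, b, attrs=INTRINSIC):
    """Steps 3 and 4: compare the selected attributes of one pair of records."""
    return all(getattr(a, name) == getattr(b, name) for name in attrs)


def duplicate_pairs(hit_list, attrs=INTRINSIC):
    """Step 2: generate every pair of records and keep the matching pairs."""
    return [
        (a, b)
        for a, b in combinations(hit_list, 2)
        if same_document(a, b, attrs)
    ]
```

Calling `duplicate_pairs` on a hit list that holds the same report at two different URLs would return that one pair, because the URL is not among the compared fields. Generating all pairs costs quadratic time in the length of the hit list, which is acceptable for a sketch but is one reason a sorted variant can be preferable.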
10. A method of automatically determining duplicate documents on a hit-list containing one or more documents and document instances, the hit-list having a hit-list record for each instance of the documents, each hit-list record having one or more attribute fields, each attribute field containing one or more attribute of the documents, the method comprising the steps of:
selecting one or more of the attributes that are intrinsic attributes, the intrinsic attributes being established at a time of document creation and that are invariant with a location and replication of the document;
generating a pair of the hit-list records associated with the documents and the intrinsic attributes;
comparing one or more of the intrinsic attributes of the pair of hit-list records;
selecting one or more of the attributes that are non intrinsic attributes, the non intrinsic attributes being variable with one or more document instance;
comparing one or more of the non intrinsic attributes of the pair of hit-list records previously compared;
noting that the pair of hit-list records failing a comparison test of the comparing of the intrinsic and non intrinsic attributes results in instances of not the same document.
Dependent claims: 11, 12, 13, 14, 15
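A hedged sketch of the two-stage test in claim 10 follows. The claim does not say which non intrinsic attribute is compared or how, so the test used here (the final path segment of the URL must match) is purely an assumption, as are the record layout and all function names.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed examples of intrinsic attributes; the claim does not enumerate them.
INTRINSIC = ("title", "size_bytes")


def file_name(url):
    """Last path segment of a URL, e.g. 'report.html'."""
    return posixpath.basename(urlsplit(url).path)


def intrinsic_test(a, b):
    """Compare the selected intrinsic attributes of a pair of records."""
    return all(a[name] == b[name] for name in INTRINSIC)


def non_intrinsic_test(a, b):
    """Assumed comparison on a non intrinsic attribute (the URL).

    URLs differ between instances, so they are not compared for equality;
    this hypothetical test only requires the file names to match.
    """
    return file_name(a["url"]) == file_name(b["url"])


def not_same_document(a, b):
    """Note a pair as 'not the same document' when either test fails."""
    return not (intrinsic_test(a, b) and non_intrinsic_test(a, b))
```

Under these assumptions, two records with equal title and size but different file names are noted as different documents, so the non intrinsic test acts as a second filter on pairs that already passed the intrinsic comparison.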
16. A method of automatically determining duplicate documents on a hit-list containing one or more documents and document instances, the hit-list having a hit-list record for each instance of the documents, each hit-list record having one or more attribute fields, each attribute field containing an attribute of the documents, the method comprising the steps of:
a. selecting one or more of the attributes that are intrinsic attributes, intrinsic attributes being attributes that are established at a time of document creation and that are invariant with a location and replication of the document;
b. sorting the hit-list using all of the intrinsic attributes as sort keys;
c. comparing one or more intrinsic attributes of one or more of the adjacent documents on the sorted hit-list; and
d. noting that documents with attributes that are not equal are not instances of the same document.
Dependent claims: 17
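Steps a through d of claim 16 can be sketched as a sort followed by a single pass over adjacent records. This is an illustration under assumptions, not the patented implementation: records are modeled as dictionaries, the key names passed in are hypothetical, and each key's values are assumed to be mutually comparable so that sorting works.

```python
def group_adjacent_duplicates(hit_list, intrinsic_keys):
    """Sort by the intrinsic attributes, then compare neighbours.

    Returns a list of groups; records in the same group have equal
    intrinsic attributes, and a new group starts wherever two adjacent
    records differ (step d: they are not instances of the same document).
    """
    def sort_key(record):
        return tuple(record[name] for name in intrinsic_keys)

    # Step b: all selected intrinsic attributes act as sort keys.
    ordered = sorted(hit_list, key=sort_key)

    groups = []
    for record in ordered:
        # Step c: compare with the previous (adjacent) record only.
        if groups and sort_key(groups[-1][-1]) == sort_key(record):
            groups[-1].append(record)
        else:
            groups.append([record])
    return groups
```

Sorting brings records with equal intrinsic attributes next to each other, so one linear pass replaces the all-pairs comparison. Because Python's `sorted` is stable, records inside a group keep their original hit-list order.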
18. A computer system of one or more computers comprising:
one or more memory storage devices containing a document collection of one or more documents;
an index including an inverted file with one or more terms, each term associated with one or more document identifiers, the index further including a document catalog that associates each of the document identifiers with one or more attributes;
a search engine process that processes a query with one or more query elements to produce a hit list having one or more hit list entry, each hit list entry associated with one of the documents that is determined by the search engine to be relevant to the query;
one or more hit list attributes associated with each of the hit list entries, each of the hit list attributes being one of the attributes; and
a formatter processor that identifies duplicate hit list entries by selecting one or more of the hit list attributes as selected attributes, identified by a hit list attribute selector, the formatter processor further selecting two or more hit list entries, called compared entries, that are each associated with one of the documents, each called a compared document, and the formatter process comparing the selected attributes to determine if the compared documents are duplicate instances of one another.
Dependent claims: 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34
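The components recited in claim 18 can be pictured with a toy model. Everything below is an assumption made for illustration: the class and function names, whitespace tokenization, AND-semantics for multi-term queries, and the attribute names used as the selector. It shows only how an inverted file, a document catalog, a search step, and a formatter might fit together.

```python
from collections import defaultdict


class Index:
    """Toy index: an inverted file plus a document catalog."""

    def __init__(self):
        self.inverted = defaultdict(set)  # term -> set of document identifiers
        self.catalog = {}                 # document identifier -> attributes

    def add(self, doc_id, text, attributes):
        for term in text.lower().split():
            self.inverted[term].add(doc_id)
        self.catalog[doc_id] = dict(attributes)


def search(index, query):
    """Produce a hit list: one entry per matching document, carrying
    the catalog attributes. The document text is not stored or returned."""
    terms = query.lower().split()
    if not terms:
        return []
    ids = set.intersection(*(index.inverted.get(t, set()) for t in terms))
    return [{"doc_id": d, **index.catalog[d]} for d in sorted(ids)]


def format_hits(hit_list, selected_attributes):
    """Formatter: group entries whose selected attributes are all equal.

    `selected_attributes` plays the role of the hit list attribute selector.
    """
    groups = {}
    for entry in hit_list:
        key = tuple(entry[name] for name in selected_attributes)
        groups.setdefault(key, []).append(entry)
    return list(groups.values())
```

In this sketch the formatter sees only hit list entries, so duplicate detection happens after retrieval and without fetching or comparing document content, which matches the separation of roles the claim describes.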
35. A computer system of one or more computers comprising:
one or more memory storage devices containing a document collection of one or more documents;
an index including an inverted file with one or more terms, each term associated with one or more document identifiers, the index further including a document catalog that associates each of the document identifiers with one or more attributes;
search engine process means for processing a query with one or more query elements to produce hit list means having one or more hit list entries, each hit list entry associated with one of the documents that is determined by the search engine to be relevant to the query;
one or more hit list attributes means for identifying the documents and associated with each of the hit list entries, each of the hit list attributes being one of the attributes; and
formatter processor means for identifying duplicate hit entries by selecting one or more of the hit list attributes as selected attributes, identified by a hit list attribute selector, the formatter processor means further selecting two or more hit list entries, called compared entries, that are each associated with one of the documents, each called a compared document, and the formatter process comparing the selected attributes to determine if the compared documents are duplicate instances of one another.
Specification