Generic architecture for indexing document groups in an inverted text index
First Claim
1. A method for indexing a plurality of documents, the method comprising the steps of:
a) identifying a duplicate group of documents from among the plurality of documents, each of the documents in the duplicate group comprising respective content and metadata, wherein the respective content of each document in the duplicate group is substantially similar and corresponds to a content for the duplicate group;
b) creating one index of content for the duplicate group;
c) indexing the metadata for each of the documents in the duplicate group;
d) receiving a query and executing said query as if duplicated content was indexed for each document of the duplicate group, and
e) outputting results of said query.
Abstract
A method for indexing a plurality of documents that includes a plurality of duplicate documents first identifies one or more duplicate groups of documents from among the plurality of documents. One index of content is then created for each duplicate group, instead of indexing the content of every document within the group. Metadata, in contrast, is indexed individually for each of the documents in the duplicate group. The content of each duplicate group is thus indexed only once, while a search engine using such indexing techniques retains the capability to answer queries as if the duplicated content were indexed for each document of the group.
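The scheme described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: duplicate detection here is an assumption (an exact hash of case- and whitespace-normalized content, whereas the text only requires content that is "substantially similar"), and the names `build_indexes`, `search`, `content_index`, `metadata_index`, and `members` are hypothetical. Content postings point to a group, metadata postings point to individual documents, and a query expands each matching group back to its member documents.

```python
import hashlib
from collections import defaultdict


def build_indexes(docs):
    """docs maps doc_id -> (content, metadata_dict)."""
    members = defaultdict(list)        # group_id -> member doc_ids
    content_index = defaultdict(set)   # term -> group_ids (one posting per group)
    metadata_index = defaultdict(set)  # (field, value) -> doc_ids (one per document)

    for doc_id, (content, metadata) in docs.items():
        # Assumption: documents are duplicates when their normalized text is identical.
        normalized = " ".join(content.lower().split())
        group_id = hashlib.sha1(normalized.encode("utf-8")).hexdigest()

        # Index the content only the first time this group is seen.
        if group_id not in members:
            for term in normalized.split():
                content_index[term].add(group_id)
        members[group_id].append(doc_id)

        # Metadata is indexed for every document, duplicate or not.
        for field, value in metadata.items():
            metadata_index[(field, value)].add(doc_id)

    return content_index, metadata_index, members


def search(term, content_index, metadata_index, members, field=None, value=None):
    """Answer a query as if content had been indexed for every document."""
    hits = set()
    for group_id in content_index.get(term.lower(), ()):
        hits.update(members[group_id])  # expand the group to its documents
    if field is not None:
        hits &= metadata_index.get((field, value), set())
    return sorted(hits)


docs = {
    "a": ("Quarterly report attached", {"sender": "alice"}),
    "b": ("quarterly  report attached", {"sender": "bob"}),
    "c": ("Lunch on Friday?", {"sender": "alice"}),
}
content_index, metadata_index, members = build_indexes(docs)
print(search("report", content_index, metadata_index, members))
print(search("report", content_index, metadata_index, members, "sender", "alice"))
```

In this sketch documents "a" and "b" fall into one group, so the term "report" has a single content posting, yet a plain content query returns both documents and a query restricted by sender metadata returns only the matching one.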
26 Claims
1. A method for indexing a plurality of documents, the method comprising the steps of:
a) identifying a duplicate group of documents from among the plurality of documents, each of the documents in the duplicate group comprising respective content and metadata, wherein the respective content of each document in the duplicate group is substantially similar and corresponds to a content for the duplicate group;
b) creating one index of content for the duplicate group;
c) indexing the metadata for each of the documents in the duplicate group;
d) receiving a query and executing said query as if duplicated content was indexed for each document of the duplicate group, and
e) outputting results of said query.
Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
16. An apparatus for indexing a plurality of documents, the apparatus comprising:
at least one processor;
a memory coupled with the at least one processor;
a plurality of documents stored within said memory, each document including respective content and metadata; and
a program code residing in the memory and executed by the at least one processor, the program code configured to:
a) identify a duplicate group of documents from among the plurality of documents, wherein the respective content of each document in the duplicate group is substantially similar and corresponds to a content for the duplicate group;
b) create one index of content for the duplicate group;
c) index the metadata for each of the documents in the duplicate group;
d) store the created indices in the memory;
e) receive and execute a query as if duplicated content was indexed for each document of the duplicate group; and
f) output results of said query.
Dependent Claims (17, 18, 19, 20, 21, 22)
23. A program product comprising a computer storage medium having computer readable program code embodied therein which implements indexing of a plurality of documents, each document including respective content and metadata, said medium comprising:
a) computer readable program code identifying a duplicate group of documents from among the plurality of documents, wherein the respective content of each of the documents in the duplicate group is substantially similar and corresponds to a content for the duplicate group;
b) computer readable program code creating one index of content for the duplicate group;
c) computer readable program code indexing the metadata for each of the documents in the duplicate group;
d) computer readable program code aiding in receiving a query and executing said query as if duplicated content was indexed for each document of the duplicate group, and
e) computer readable program code outputting results of said query.
Dependent Claims (24, 25, 26)
Specification