Generic architecture for indexing document groups in an inverted text index

US 8,131,726 B2
Filed: 01/12/2005
Issued: 03/06/2012
Est. Priority Date: 01/12/2005
Status: Active Grant

First Claim

1. A method for indexing a plurality of documents, the method comprising the steps of:

a) identifying a duplicate group of documents from among the plurality of documents, each of the documents in the duplicate group comprising respective content and metadata, wherein the respective content of each document in the duplicate group is substantially similar and corresponds to a content for the duplicate group;

b) creating one index of content for the duplicate group;

c) indexing the metadata for each of the documents in the duplicate group;

d) receiving a query and executing said query as if duplicated content was indexed for each document of the duplicate group, and

e) outputting results of said query.

Abstract

A method for indexing a plurality of documents that includes a plurality of duplicate documents first identifies one or more duplicate groups of documents from among the plurality of documents. Then, one index of content is created for each duplicate group instead of indexing the content from every document within the duplicate group. In contrast to the content index, however, an index of metadata is created for each of the documents in the duplicate group. Thus the content of each duplicate group is indexed only once, while a search engine using such indexing techniques retains the capability to answer queries as if the duplicated content was indexed for each document of the group.
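
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the toy corpus, the function names `build_indexes` and `search`, the use of exact string equality in place of "substantially similar" content, and the choice of the lowest document ID as the group's representative are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: (doc_id, content, metadata). Docs 1 and 2 are duplicates.
docs = [
    (1, "inverted index basics", {"url": "http://a.example/x"}),
    (2, "inverted index basics", {"url": "http://b.example/x"}),
    (3, "query processing", {"url": "http://c.example/y"}),
]

def build_indexes(docs):
    # Assumption: exact content equality stands in for "substantially similar".
    groups = defaultdict(list)            # content -> doc ids in that duplicate group
    for doc_id, content, _ in docs:
        groups[content].append(doc_id)

    content_index = defaultdict(set)      # term -> representative doc ids only
    members = {}                          # representative id -> all doc ids in its group
    for content, ids in groups.items():
        representative = min(ids)         # assumption: lowest ID represents the group
        members[representative] = sorted(ids)
        for term in content.split():
            content_index[term].add(representative)   # content indexed once per group

    # Metadata is indexed for every document, duplicates included.
    metadata_index = {doc_id: meta for doc_id, _, meta in docs}
    return content_index, members, metadata_index

def search(term, content_index, members):
    # Expand each matching representative to its whole group, so the answer
    # is the same as if the content had been indexed for every document.
    hits = []
    for representative in sorted(content_index.get(term, ())):
        hits.extend(members[representative])
    return hits

content_index, members, metadata_index = build_indexes(docs)
print(search("index", content_index, members))   # [1, 2]
```

In this sketch the posting for the term "index" holds a single entry (document 1), yet the query still reports both duplicates, and each of them keeps its own metadata record.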

26 Claims

1. A method for indexing a plurality of documents, the method comprising the steps of:
- a) identifying a duplicate group of documents from among the plurality of documents, each of the documents in the duplicate group comprising respective content and metadata, wherein the respective content of each document in the duplicate group is substantially similar and corresponds to a content for the duplicate group;
  
  b) creating one index of content for the duplicate group;
  
  c) indexing the metadata for each of the documents in the duplicate group;
  
  d) receiving a query and executing said query as if duplicated content was indexed for each document of the duplicate group, and
  
  e) outputting results of said query.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
- - 2. The method of claim 1, wherein the step of creating one index includes the steps of:
    - identifying a master document from the documents in the duplicate group; and
      
      indexing the content of the master document but not indexing the content of other documents in the duplicate group.
  - 3. The method of claim 1, further comprising the step of:
    - repeating steps a), b), and c) for multiple duplicate groups of documents.
  - 4. The method of claim 3, further comprising the steps of:
    - for each duplicate group of documents, identifying a respective master document; and
      
      associating with each of the plurality of documents, its respective master document.
  - 5. The method of claim 4, wherein the step of associating includes the steps of:
    - creating a master posting list comprised of a plurality of entries corresponding to each of the plurality of documents, wherein each entry comprises a first identifier for a document and a second identifier for its associated master document.
  - 6. The method of claim 5, wherein the respective first identifiers for documents of a duplicate group are consecutively ordered.
  - 7. The method of claim 5, wherein the first identifier and second identifier for a master document are equal.
  - 8. The method of claim 5, wherein the first identifier of a master document in a particular duplicate group is less than the respective first identifiers of other documents of that particular duplicate group.
  - 9. The method of claim 1, wherein the step of outputting results comprises the step of returning a result set of matching documents.
  - 10. The method of claim 9, wherein the result set includes not more than one document from the duplicate group.
  - 11. The method of claim 9, further comprising the steps of:
    - determining if a master document of the duplicate group includes matching metadata and content based on the query; and
      
      if so, returning the master document in the result set.
  - 12. The method of claim 11, further comprising the steps of:
    - if the master document of the duplicate group includes metadata that does not match the query, then determining if another document in the duplicate group includes matching metadata; and
      
      if so, returning the other document in the result set.
  - 13. The method of claim 9, wherein the result set comprises a list of data sources output from a search engine.
  - 14. The method of claim 13, wherein the data sources are web pages.
  - 15. The method of claim 1, wherein the metadata comprises one or more of a Uniform Resource Locator (URL), a document rank, security flags, an author, a creation time, a modification time, and a document type.
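
The result-set behaviour recited in claims 9 to 12 can be sketched as follows, in Python. This is an illustrative reading of the claims rather than the patentee's code: the function `select_results`, the `members` and `metadata_index` structures, and the `doc_type` predicate are hypothetical, and the sketch assumes the content match against each master document has already been resolved.

```python
def select_results(matching_masters, members, metadata_index, meta_matches):
    """Return at most one document per duplicate group.

    matching_masters: master IDs whose content matched the query (assumed given).
    members: master ID -> group member IDs, with the master listed first.
    meta_matches: predicate applied to a document's metadata.
    """
    results = []
    for master in matching_masters:
        # The master is tried first (claim 11); if its metadata does not
        # match, the other group members are tried in turn (claim 12).
        for doc_id in members[master]:
            if meta_matches(metadata_index[doc_id]):
                results.append(doc_id)
                break                      # not more than one per group (claim 10)
    return results

# Hypothetical data: group {1, 2, 3} with master 1, and singleton group {4}.
members = {1: [1, 2, 3], 4: [4]}
metadata_index = {
    1: {"doc_type": "pdf"},
    2: {"doc_type": "html"},
    3: {"doc_type": "html"},
    4: {"doc_type": "html"},
}

html_only = select_results([1, 4], members, metadata_index,
                           lambda m: m["doc_type"] == "html")
pdf_only = select_results([1, 4], members, metadata_index,
                          lambda m: m["doc_type"] == "pdf")
print(html_only)   # [2, 4]
print(pdf_only)    # [1]
```

For the `html` query the master of the first group fails the metadata test, so document 2 is returned in its place and document 3 is suppressed as a duplicate.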

16. An apparatus for indexing a plurality of documents, the apparatus comprising:
- at least one processor;
  
  a memory coupled with the at least one processor;
  
  a plurality of documents stored within said memory, each document including respective content and metadata; and
  
  a program code residing in the memory and executed by the at least one processor, the program code configured to:
  
  a) identify a duplicate group of documents from among the plurality of documents, wherein the respective content of each document in the duplicate group is substantially similar and corresponds to a content for the duplicate group;
  
  b) create one index of content for the duplicate group;
  
  c) index the metadata for each of the documents in the duplicate group;
  
  d) store the created indices in the memory;
  
  e) receive and execute a query as if duplicated content was indexed for each document of the duplicate group; and
  
  f) output results of said query.
- Dependent Claims (17, 18, 19, 20, 21, 22)
- - 17. The apparatus of claim 16, wherein the program code is further configured to:
    - identify a master document from the documents in the duplicate group; and
      
      index the content of the master document but not index the content of other documents in the duplicate group.
  - 18. The apparatus of claim 16, wherein the program code is further configured to:
    - repeat steps a), b), c) and d) for multiple duplicate groups of documents.
  - 19. The apparatus of claim 18, wherein the program code is further configured to:
    - for each duplicate group of documents, identify a respective master document;
      
      associate with each of the plurality of documents, its respective master document; and
      
      create, in the memory, a master posting list comprised of a plurality of entries corresponding to each of the plurality of documents, wherein each entry comprises a first identifier for a document and a second identifier for its associated master document.
  - 20. The apparatus of claim 19, wherein the respective first identifiers for documents of a duplicate group are consecutively ordered and wherein the first identifier of a master document in a particular duplicate group is less than the respective first identifiers of other documents of that particular duplicate group.
  - 21. The apparatus of claim 16, wherein the program code is further configured to return a result set of matching documents as said output.
  - 22. The apparatus of claim 21, wherein the result set includes not more than one document from the duplicate group.
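
The master posting list of claims 19 and 20 (and of claims 5 to 8 above) can be illustrated with a short Python sketch. The function name `build_master_posting_list`, the file names, the zero-based dense numbering, and the convention that the first document of each input group is its master are assumptions made for this example only.

```python
def build_master_posting_list(groups):
    """Assign document IDs and build (doc_id, master_id) entries.

    groups: list of duplicate groups; each group is a list of document names
            whose first element is taken to be the master (an assumption).
    """
    posting = []      # one (first identifier, second identifier) entry per document
    names = {}        # doc_id -> document name
    next_id = 0
    for group in groups:
        master_id = next_id               # master receives the smallest ID in its group
        for name in group:
            posting.append((next_id, master_id))
            names[next_id] = name
            next_id += 1                  # IDs within a group are consecutive
    return posting, names

groups = [
    ["a.html", "a-mirror.html", "a-copy.html"],
    ["b.html"],
    ["c.html", "c-mirror.html"],
]
posting, names = build_master_posting_list(groups)
print(posting)   # [(0, 0), (1, 0), (2, 0), (3, 3), (4, 4), (5, 4)]

def master_of(doc_id):
    # Works because IDs in this sketch are dense and start at 0.
    return posting[doc_id][1]
```

Each master's entry has equal first and second identifiers, so a master can be recognised by comparing the two fields, and the members of a group are exactly the run of consecutive entries that share a second identifier.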

23. A program product comprising a computer storage medium having computer readable program code embodied therein which implements indexing of a plurality of documents, each document including respective content and metadata, said medium comprising:
- a) computer readable program code identifying a duplicate group of documents from among the plurality of documents, wherein the respective content of each of the documents in the duplicate group are substantially similar and corresponds to a content for the duplicate group;
  
  b) computer readable program code creating one index of content for the duplicate group;
  
  c) computer readable program code indexing the metadata for each of the documents in the duplicate group;
  
  d) computer readable program code aiding in receiving a query and executing said query as if duplicated content was indexed for each document of the duplicate group, and
  
  e) computer readable program code outputting results of said query.
- Dependent Claims (24, 25, 26)
- - 24. The program product of claim 23, wherein the program code is further configured to:
    - identify a master document from the documents in the duplicate group; and
      
      index the content of the master document but not index the content of other documents in the duplicate group.
  - 25. The program product of claim 23, wherein the program code is further configured to return a result set of matching documents as said output.
  - 26. The program product of claim 23, wherein the result set includes not more than one document from the duplicate group.

Current Assignee
Daedalus Blue LLC
Original Assignee
International Business Machines Corporation
Inventors
Broder, Andrei Z.; Fontoura, Marcus Felipe; Herscovici, Michael; Lempel, Ronny; McPherson, John Ai Jr.; Neumann, Andreas; Qi, Runping; Shekita, Eugene Jon
Primary Examiner(s)
Mofiz, Apu
Assistant Examiner(s)
Bibbee, Jared

Application Number

US10/905,604
Publication Number

US 20060155739A1
Time in Patent Office

2,610 Days
Field of Search

707/3, 707/101, 707/102, 707/673, 707/736, 707/741
US Class Current

707/741
CPC Class Codes

G06F 16/319 Inverted lists
