Representative document selection for a set of duplicate documents

US 8,868,559 B2
Filed: 08/30/2012
Issued: 10/21/2014
Est. Priority Date: 07/03/2003
Status: Expired due to Fees

Abstract

Systems and methods for indexing a representative document from a set of duplicate documents are disclosed. Disclosed systems and methods comprise selecting a first document in a plurality of documents on the basis that the first document is associated with a query independent score. Each respective document in the plurality of documents has a fingerprint that indicates that the respective document has substantially identical content to every other document in the plurality of documents. Disclosed systems and methods further comprise indexing, in accordance with the query independent score, the first document thereby producing an indexed first document. With respect to the plurality of documents, only the indexed first document is included in a document index.
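
As a rough illustration of the abstract, the sketch below groups documents by fingerprint, picks one representative per group using its query independent score, and indexes only that representative. The `Doc` type, the SHA-256 content hash used as a fingerprint, and the highest-score-wins rule are assumptions made for this example; the text above does not specify a hash function or a tie-breaking rule.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    url: str
    content: str
    score: float  # query independent score (e.g. a page-importance value)


def fingerprint(doc: Doc) -> str:
    # Assumption: an exact-content hash stands in for the patent's fingerprint.
    return hashlib.sha256(doc.content.encode("utf-8")).hexdigest()


def build_index(docs: list[Doc]) -> dict[str, Doc]:
    """Return {fingerprint: representative}, one entry per duplicate set."""
    groups: dict[str, list[Doc]] = defaultdict(list)
    for doc in docs:
        groups[fingerprint(doc)].append(doc)
    # Only the highest-scoring member of each duplicate set is indexed.
    return {fp: max(members, key=lambda d: d.score) for fp, members in groups.items()}


docs = [
    Doc("http://a.example/page", "hello world", 0.4),
    Doc("http://mirror.example/page", "hello world", 0.9),
    Doc("http://b.example/other", "something else", 0.2),
]
index = build_index(docs)
indexed_urls = sorted(d.url for d in index.values())
```

Here the two "hello world" pages share a fingerprint, so only the higher-scored mirror is kept alongside the unrelated third page.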

37 Claims

1. A method, comprising:
- at a computing device having one or more processors and memory:
  
  obtaining a plurality of documents, wherein a respective document in the plurality of documents is associated with a query independent score;
  
  selecting a first document in the plurality of documents in accordance with a query independent score associated with the first document, wherein the first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents;
  
  indexing, in accordance with the query independent score, the first document thereby producing an indexed first document; and
  
  with respect to the plurality of documents, including only the indexed first document in a document index.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The method of claim 1, wherein the query independent score includes a document ranking value indicative of document importance.
  - 3. The method of claim 1, wherein indexing the first document includes:
    - identifying a canonical document for the plurality of documents, from one of:
      
      (i) the first document, or (ii) a second document in the plurality of documents, by:
      
      comparing the query independent score associated with the first document to a query independent score associated with the second document; and
      
      selecting the first document as the canonical document when the query independent score associated with the first document is higher than the query independent score associated with the second document by more than a predefined threshold.
  - 4. The method of claim 3, wherein the query independent score associated with the second document is the highest among a subset of the plurality of documents, and wherein the subset of the plurality of documents excludes the first document.
  - 5. The method of claim 1, further comprising removing at least one document from the plurality of documents when a total number of documents in the plurality of documents exceeds a predefined value.
  - 6. The method of claim 5, wherein each document in the plurality of documents is independently assigned a query independent score, and the query independent score associated with the at least one document is lower than the query independent scores associated with every other document in the plurality of documents.
  - 7. The method of claim 1, wherein the first document has substantially identical content to the second document in the plurality of documents, when:
    - (i) the first document and the second document share the same page content;
      
      (ii) the first document and the second document share the same target network identifier;
      
      (iii) a network identifier of the first document is the same as the target network identifier of the second document;
      
      or (iv) the target network identifier of the first document is the same as the network identifier of the second document.
  - 8. The method of claim 1, further comprising:
    - obtaining a second document not included in the plurality of documents; and
      
      in accordance with a determination that the second document has substantially identical content to each document in the plurality of documents, adding the second document to the plurality of documents.
  - 9. The method of claim 1, wherein the fingerprint of a document in the plurality of documents is a function of (i) the content of the document, or (ii) a network address of the document.
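
Claims 3 through 6 can be read as a small bookkeeping routine: keep a bounded set of duplicates, evict the lowest-scored member when the set grows past a limit, and treat a document as canonical only when it beats the best other member by more than a threshold. The following is a minimal sketch under that reading; `DuplicateSet`, `max_size`, and `threshold` are hypothetical names and values, not terms from the claims.

```python
from dataclasses import dataclass, field


@dataclass
class DuplicateSet:
    max_size: int = 3        # hypothetical "predefined value" (claim 5)
    threshold: float = 0.1   # hypothetical "predefined threshold" (claim 3)
    scores: dict[str, float] = field(default_factory=dict)  # url -> score

    def add(self, url: str, score: float) -> None:
        """Add a duplicate; evict the lowest-scored member while over capacity."""
        self.scores[url] = score
        while len(self.scores) > self.max_size:
            lowest = min(self.scores, key=self.scores.get)
            del self.scores[lowest]

    def is_canonical(self, url: str) -> bool:
        """True if url beats the best other member by more than the threshold."""
        others = [s for u, s in self.scores.items() if u != url]
        if not others:
            return True  # assumption: a lone member is canonical by default
        return self.scores[url] - max(others) > self.threshold


dup = DuplicateSet()
dup.add("u1", 0.50)
dup.add("u2", 0.55)
dup.add("u3", 0.90)
dup.add("u4", 0.20)  # set now exceeds max_size, so the lowest score (u4) is evicted
```

Comparing against the best score among the other members mirrors claim 4, where the "second document" is the highest-scored document in a subset that excludes the first.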

10. A computing system, comprising:
- one or more processors;
  
  memory; and
  
  one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the one or more processors, the one or more programs comprising instructions for:
  
  obtaining, using the one or more processors, a plurality of documents, wherein a respective document in the plurality of documents is associated with a query independent score;
  
  selecting a first document in the plurality of documents in accordance with a query independent score associated with the first document, wherein the first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents;
  
  indexing, in accordance with the query independent score, the first document thereby producing an indexed first document; and
  
  with respect to the plurality of documents, including only the indexed first document in a document index.
- Dependent claims (11, 12, 13, 14, 15, 16, 17, 18):
  - 11. The computing system of claim 10, wherein the query independent score includes a document ranking value indicative of document importance.
  - 12. The computing system of claim 10, wherein indexing the first document includes:
    - identifying a canonical document for the plurality of documents, from one of:
      
      (i) the first document, or (ii) a second document in the plurality of documents, by:
      
      comparing the query independent score associated with the first document to a query independent score associated with the second document; and
      
      selecting the first document as the canonical document when the query independent score associated with the first document is higher than the query independent score associated with the second document by more than a predefined threshold.
  - 13. The computing system of claim 12, wherein the query independent score associated with the second document is the highest among a subset of the plurality of documents, wherein the subset of the plurality of documents excludes the first document.
  - 14. The computing system of claim 10, the one or more programs further comprising instructions for:
    - removing at least one document from the plurality of documents when a total number of documents in the plurality of documents exceeds a predefined value.
  - 15. The computing system of claim 14, wherein each document in the plurality of documents is independently assigned a query independent score, and the query independent score associated with the at least one document is lower than the query independent scores associated with every other document in the plurality of documents.
  - 16. The computing system of claim 10, wherein the first document has substantially identical content to the second document in the plurality of documents, when:
    - (i) the first document and the second document share the same page content;
      
      (ii) the first document and the second document share the same target network identifier;
      
      (iii) a network identifier of the first document is the same as the target network identifier of the second document;
      
      or (iv) the target network identifier of the first document is the same as the network identifier of the second document.
  - 17. The computing system of claim 10, further comprising:
    - obtaining a second document not included in the plurality of documents; and
      
      in accordance with a determination that the second document has substantially identical content to each document in the plurality of documents, adding the second document to the plurality of documents.
  - 18. The computing system of claim 10, wherein the fingerprint of a document in the plurality of documents is a function of (i) the content of the document, or (ii) a network address of the document.
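
The four alternatives in claims 7 and 16 amount to a simple predicate over two documents. The sketch below assumes that a "network identifier" is a URL and that a "target network identifier" is the destination of a redirect, which is one plausible reading and not something the claims state outright; the `Page` type is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    url: str                      # network identifier (assumed to be a URL)
    content: str                  # page content
    target: Optional[str] = None  # target network identifier (assumed redirect destination)


def substantially_identical(a: Page, b: Page) -> bool:
    same_content = a.content == b.content                         # (i)
    same_target = a.target is not None and a.target == b.target   # (ii)
    a_is_target_of_b = a.url == b.target                          # (iii)
    b_is_target_of_a = a.target == b.url                          # (iv)
    return same_content or same_target or a_is_target_of_b or b_is_target_of_a


home = Page("http://example.com/", "welcome")
redirect = Page("http://example.com/old", "", target="http://example.com/")
other = Page("http://example.org/", "different")
```

With these values, `home` and `redirect` count as duplicates in either argument order (alternatives iii and iv), while `home` and `other` do not.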

19. A non-transitory computer readable storage medium storing one or more programs to be executed by one or more processing units of a computing device, the one or more programs comprising instructions that, when executed by the one or more processing units, cause the computing device to:
- obtain a plurality of documents, wherein a respective document in the plurality of documents is associated with a query independent score;
  
  select a first document in the plurality of documents in accordance with a query independent score associated with the first document, wherein the first document has a fingerprint that indicates that the first document has substantially identical content to every other document in the plurality of documents;
  
  index, in accordance with the query independent score, the first document thereby producing an indexed first document; and
  
  with respect to the plurality of documents, include only the indexed first document in a document index.
- Dependent claims (20):
  - 20. The computer readable storage medium of claim 19, wherein the query independent score includes a document ranking value indicative of document importance.

21. A method, comprising:
- at a computing device having one or more processors and memory:
  
  identifying a plurality of newly crawled documents, wherein a respective document in the plurality of newly crawled documents is associated with a query independent score;
  
  identifying (i) a newly crawled document in the plurality of newly crawled documents, and (ii) a first query independent score associated with the newly crawled document; and
  
  determining a first equivalence class, from among a plurality of equivalence classes, for the newly crawled document based upon a fingerprint for the newly crawled document, wherein the first equivalence class comprises a plurality of documents, each document in the plurality of documents is uniquely associated with a respective query independent score, and the fingerprint of the newly crawled document indicates that the newly crawled document is substantially identical to each document in the plurality of documents.
- Dependent claims (22, 23, 24):
  - 22. The method of claim 21, further comprising:
    - determining which document in the plurality of documents is to be included in a document index, wherein the document index is constrained to have a single document from the plurality of documents, wherein, when the newly crawled document is not in the document index, the newly crawled document is included in the index provided that the first query independent score exceeds by a threshold a query independent score of a document in the plurality of documents that is presently in the document index, and when the newly crawled document is in the document index, the newly created document remains in the index provided that the first query independent score does not exceed by a threshold the largest query independent score associated with a document in the plurality of documents.
  - 23. The method of claim 21, wherein the first query independent score includes a document ranking value indicative of document importance.
  - 24. The method of claim 21, wherein the fingerprint of the newly created document is a function of (i) the content of the document or (ii) a network address of the document.
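
Claims 21 and 22 describe routing a newly crawled document to an equivalence class by its fingerprint and then applying a threshold so that the single indexed member of the class does not flip on small score changes. The sketch below is one possible reading. `THRESHOLD`, `classes`, `indexed`, and `process_crawled` are hypothetical names, and the "already indexed" branch follows the more explicit wording of claim 30 (the best rival must win by the threshold), because the wording of claim 22 on that branch is terse.

```python
from collections import defaultdict

THRESHOLD = 0.1  # hypothetical value; the claims only say "a threshold"

# fingerprint -> {url: query independent score}; one dict per equivalence class
classes: dict[str, dict[str, float]] = defaultdict(dict)
# fingerprint -> url of the single class member currently in the index
indexed: dict[str, str] = {}


def process_crawled(fp: str, url: str, score: float) -> bool:
    """Record a newly crawled document; return True if it is (now) the indexed one."""
    members = classes[fp]   # equivalence class determined by the fingerprint
    members[url] = score    # add the document or refresh its score
    current = indexed.get(fp)
    if current is None:
        indexed[fp] = url   # assumption: the first member seen is indexed
    elif current != url:
        # Not in the index: replace the incumbent only on a clear win.
        if score > members[current] + THRESHOLD:
            indexed[fp] = url
    else:
        # Already in the index: stay unless the best rival wins clearly.
        rival_url, rival_score = max(
            ((u, s) for u, s in members.items() if u != url),
            key=lambda item: item[1],
            default=(url, score),
        )
        if rival_score > score + THRESHOLD:
            indexed[fp] = rival_url
    return indexed[fp] == url
```

The threshold acts as hysteresis: a challenger scoring only slightly above the incumbent does not displace it, which avoids churn in the index between crawls.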

25. A computing system, comprising:
- one or more processors;
  
  memory; and
  
  one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the one or more processors, the one or more programs comprising instructions for:
  
  identifying a plurality of newly crawled documents, wherein a respective document in the plurality of newly crawled documents is associated with a query independent score;
  
  identifying (i) a newly crawled document in the plurality of newly crawled documents, and (ii) a first query independent score associated with the newly crawled document; and
  
  determining a first equivalence class, from among a plurality of equivalence classes, for the newly crawled document based upon a fingerprint for the newly crawled document, wherein the first equivalence class comprises a plurality of documents, each document in the plurality of documents is uniquely associated with a respective query independent score, and the fingerprint of the newly crawled document indicates that the newly crawled document is substantially identical to each document in the plurality of documents.
- Dependent claims (26, 27, 28):
  - 26. The computing system of claim 25, wherein the one or more programs further comprise instructions for:
    - determining which document in the plurality of documents is to be included in a document index, wherein the document index is constrained to have a single document from the plurality of documents, wherein, when the newly crawled document is not in the document index, the newly crawled document is included in the index provided that the first query independent score exceeds by a threshold a query independent score of a document in the plurality of documents that is presently in the document index, and when the newly crawled document is in the document index, the newly created document remains in the index provided that the first query independent score does not exceed by a threshold the largest query independent score associated with a document in the plurality of documents.
  - 27. The computing system of claim 25, wherein the first query independent score includes a document ranking value indicative of document importance.
  - 28. The computing system of claim 25, wherein the fingerprint of the newly created document is a function of (i) the content of the document or (ii) a network address of the document.

29. A method of associating anchor text to a web page, the method comprising:
- at a computing device having one or more processors and memory:
  
  identifying a plurality of newly crawled documents, wherein a respective document in the plurality of newly crawled documents is associated with a query independent score;
  
  identifying (i) a newly crawled document in the plurality of newly crawled documents, and (ii) a first query independent score associated with the newly crawled document;
  
  determining whether the newly crawled document is a canonical document for a first equivalence class, from among a plurality of equivalence classes, based upon a fingerprint for the newly crawled document, wherein the first equivalence class comprises a plurality of documents, wherein each document in the plurality of documents is uniquely associated with a respective query independent score and wherein the fingerprint of the newly crawled document indicates that the newly crawled document is substantially identical to each document in the plurality of documents;
  
  providing, when the newly crawled document is the canonical document, a list of documents in a first plurality of documents that are ranked by respective query independent scores; and
  
  indexing, when the newly crawled document is the canonical document, the newly crawled page, wherein the indexing comprises (i) retrieving anchor text of links to pages in the list of pages and (ii) associating this anchor text with the newly crawled page in a document index.
- Dependent claims (30, 31, 32, 33):
  - 30. The method of claim 29, wherein the determining whether the newly crawled document is the canonical document comprises:
    - verifying that the newly crawled document is in the plurality of documents, wherein the verifying comprises:
      
      updating the query independent score associated with the newly crawled document to be the first query independent score when the newly crawled document is already in the plurality of documents, and adding the newly crawled document to the plurality of documents when the plurality of documents does not contain the newly crawled document; and
      
      determining which document in the plurality of documents is to be included in the document index, wherein the document index is constrained to have a single document from the plurality of documents, wherein, when the newly crawled document is not in the document index, the newly crawled document is deemed to be the canonical document when (i) the first query independent score is greater, by an additive threshold, than the query independent score of a document in the plurality of documents that is presently in the document index and (ii) a ratio of the first query independent score and the query independent score of the document in the plurality of documents that is presently in the document index exceeds a multiplicative threshold, and when the newly crawled document is in the document index, the newly created document is deemed to be the canonical document when (i) the largest query independent score associated with a document in the plurality of documents is not greater, by the additive threshold, than the first query independent score and (ii) a ratio of the largest query independent score associated with a document in the plurality of documents and the first query independent score does not exceed the multiplicative threshold.
  - 31. The method of claim 29, wherein the anchor text includes text in one or more different languages.
  - 32. The method of claim 29, wherein the query independent score includes a document ranking value indicative of document importance.
  - 33. The method of claim 29, wherein the fingerprint of the newly created document is a function of (i) the content of the document or (ii) a network address of the document.
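
Claims 29 and 30 combine two ideas: a canonical test that uses both an additive and a multiplicative threshold, and an indexing step that attaches anchor text from links pointing at any duplicate to the canonical page. A minimal sketch of both follows, assuming strictly positive scores so the ratio is well defined. `ADDITIVE`, `MULTIPLICATIVE`, `is_canonical`, `index_canonical`, and the dictionary-based index are hypothetical and chosen only for illustration.

```python
ADDITIVE = 0.1        # hypothetical additive threshold
MULTIPLICATIVE = 1.5  # hypothetical multiplicative threshold


def is_canonical(new_score: float, in_index: bool,
                 indexed_score: float, largest_score: float) -> bool:
    """Two-threshold test from claim 30, assuming strictly positive scores."""
    if not in_index:
        # A challenger must beat the incumbent on BOTH tests.
        return (new_score > indexed_score + ADDITIVE
                and new_score / indexed_score > MULTIPLICATIVE)
    # An incumbent stays canonical only if the best score passes NEITHER test.
    return (largest_score <= new_score + ADDITIVE
            and largest_score / new_score <= MULTIPLICATIVE)


def index_canonical(url: str, ranked_duplicates: list[str],
                    anchors_by_target: dict[str, list[str]],
                    index: dict[str, list[str]]) -> None:
    """Attach anchor text of links to any duplicate to the canonical page."""
    texts: list[str] = []
    for dup in ranked_duplicates:
        texts.extend(anchors_by_target.get(dup, []))
    index[url] = texts


anchors = {
    "http://a.example/": ["Example home"],
    "http://mirror.example/": ["Beispiel", "exemple"],
}
index: dict[str, list[str]] = {}
index_canonical(
    "http://a.example/",
    ["http://a.example/", "http://mirror.example/", "http://c.example/"],
    anchors,
    index,
)
```

Note the asymmetry between the two branches: displacing an indexed document requires passing both thresholds, while keeping one requires that the best score in the class pass neither. The anchor text gathered this way may span several languages, as claim 31 allows.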

34. A computing system, comprising:
- one or more processors;
  
  memory; and
  
  one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the one or more processors, the one or more programs comprising instructions for:
  
  identifying a plurality of newly crawled documents, wherein a respective document in the plurality of newly crawled documents is associated with a query independent score;
  
  identifying (i) a newly crawled document in the plurality of newly crawled documents, and (ii) a first query independent score associated with the newly crawled document;
  
  determining whether the newly crawled document is a canonical document for a first equivalence class, from among a plurality of equivalence classes, based upon a fingerprint for the newly crawled document, wherein the first equivalence class comprises a plurality of documents, wherein each document in the plurality of documents is uniquely associated with a respective query independent score and wherein the fingerprint of the newly crawled document indicates that the newly crawled document is substantially identical to each document in the plurality of documents;
  
  providing, when the newly crawled document is the canonical document, a list of documents in a first plurality of documents that are ranked by respective query independent scores; and
  
  indexing, when the newly crawled document is the canonical document, the newly crawled page, wherein the indexing comprises (i) retrieving anchor text of links to pages in the list of pages and (ii) associating this anchor text with the newly crawled page in a document index.
- Dependent claims (35, 36, 37):
  - 35. The computing system of claim 34, wherein the anchor text includes text in one or more different languages.
  - 36. The computing system of claim 34, wherein the query independent score includes a document ranking value indicative of document importance.
  - 37. The computing system of claim 34, wherein the fingerprint of the newly created document is a function of (i) the content of the document or (ii) a network address of the document.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Dulitz, Daniel; Verstak, Alexandre A.; Ghemawat, Sanjay; Dean, Jeffrey A.
Primary Examiner(s): MORRISON, JAY A
Application Number: US13/599,707
Publication Number: US 20120323896A1
Time in Patent Office: 782 Days
Field of Search: 707/999.203, 707/736
US Class Current: 707/736
CPC Class Codes:
G06F 16/24578   using ranking
G06F 16/355   Class or cluster creation o...
G06F 16/951   Indexing; Web crawling tech...
G06F 16/9535   Search customisation based ...
G06F 16/9538   Presentation of query results
Y10S 707/99931   Database or file accessing
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99954   Version management
