Anchor tag indexing in a web crawler system

US 10,210,256 B2
Filed: 04/01/2016
Issued: 02/19/2019
Est. Priority Date: 07/03/2003
Status: Active Grant


Abstract

Provided is a method and system for indexing documents in a collection of linked documents. A link log, including one or more pairings of source documents and target documents, is accessed. A sorted anchor map, containing one or more target-document-to-source-document pairings, is generated. The pairings in the sorted anchor map are ordered based on target document identifiers.
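
The step from link log to sorted anchor map described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `build_sorted_anchor_map`, the use of integer document identifiers, and the in-memory list layout are all assumptions made for the example.

```python
from collections import defaultdict


def build_sorted_anchor_map(link_log):
    """Group (source_id, target_id) pairs by target, ordered by target id.

    link_log: iterable of (source_id, target_id) pairs (assumed layout).
    """
    by_target = defaultdict(list)
    for source_id, target_id in link_log:
        by_target[target_id].append(source_id)
    # Entries are ordered by target document identifier, as in the abstract.
    return [(target_id, sorted(by_target[target_id]))
            for target_id in sorted(by_target)]


# Hypothetical link log: documents 1, 2 and 3 link to documents 10 and 20.
link_log = [(3, 20), (1, 10), (2, 20), (1, 20)]
anchor_map = build_sorted_anchor_map(link_log)
```

Ordering by target identifier means all sources that point at one target end up in a single entry, which is convenient for an indexer that processes one target document at a time.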

20 Claims

1. A system comprising:
- at least one processor;
  
  an index for searching documents, the index including terms associated with documents; and
  
  memory storing instructions that, when executed by the at least one processor, perform operations including:
  
  obtaining, via a web crawler, a source document,
  
  identifying, in the source document, annotation text, the annotation text being text within a predetermined distance of an outbound link to a target document and the annotation text including at least one term,
  
  storing in the index an association between the term and the source document,
  
  storing in the index, responsive to identifying the annotation text, an association between the term and the target document,
  
  identifying, responsive to receiving a query that includes the term, the source document and the target document as associated with the term in the index,
  
  responsive to identifying the associations, including the source document and the target document in a list of documents responsive to the query, and
  
  returning the list of documents responsive to the query as a search result for the query.
- - 2. The system of claim 1, wherein the target document has not yet been crawled.
  - 3. The system of claim 1, wherein the outbound link is an anchor tag in the source document and the annotation is anchor text associated with the anchor tag.
  - 4. The system of claim 1, further including an anchor map accessed by an indexer, the anchor map including at least one entry that identifies:
    - a respective target document;
      
      a plurality of source document identifiers, wherein each source document includes an outbound link to the respective target document; and
      
      at least one annotation for each source document identifier, the annotation includes a text passage extracted from a respective source document, wherein the text passage is within a predetermined distance of a respective outbound link.
  - 5. The system of claim 4, the anchor map further identifying an attribute of at least one annotation.
  - 6. The system of claim 1, wherein the annotation is a continuous block of text from the source document.
  - 7. The system of claim 1, wherein the annotation includes text outside of an anchor tag in the source document.
  - 8. The system of claim 1, the memory further storing instructions that, when executed by the at least one processor, perform operations including:
    - computing a query-independent relevance metric for the target document, wherein the query-independent relevance metric includes a sum of partial query-independent relevance metric contributions from each source document that includes an outbound link to the target document.
  - 9. The system of claim 8, wherein the query-independent relevance metric is a page rank.
  - 10. The system of claim 1, wherein the contents of the target document lack textual information.
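
The indexing and query operations recited in claim 1 can be illustrated with a small sketch. It is only one possible reading under stated assumptions: the "predetermined distance" is modelled as a token window (`WINDOW`), the source document is already tokenized, and the names `index_source` and `search` and the example URLs are hypothetical.

```python
from collections import defaultdict

WINDOW = 3  # hypothetical "predetermined distance", measured in tokens


def index_source(index, source_url, tokens, links):
    """Index one crawled source document.

    tokens: the document's words in order.
    links:  (token_position, target_url) pairs for outbound links.
    """
    # Every term in the source is associated with the source itself.
    for term in tokens:
        index[term.lower()].add(source_url)
    # Terms near an outbound link are also associated with the link target.
    for pos, target_url in links:
        start = max(0, pos - WINDOW)
        for term in tokens[start:pos + WINDOW + 1]:
            index[term.lower()].add(target_url)


def search(index, query):
    """Return documents associated with every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in terms)))


index = defaultdict(set)
tokens = "see this photo of a red panda here".split()
index_source(index, "http://example.com/zoo.html", tokens,
             links=[(7, "http://example.com/panda.jpg")])
results = search(index, "panda")
```

In this example the image target is never crawled and has no text of its own, yet a query for "panda" returns both the source page and the image, because the term sits inside the window around the link.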

11. A method comprising:
- obtaining, via a web crawler, a source document;
  
  identifying, in the source document, annotation text, the annotation text being text within a predetermined distance of an outbound link to a target document and the annotation text including at least one term;
  
  storing in an index an association between the term and the source document;
  
  storing in the index, responsive to identifying the annotation text, an association between the term and the target document;
  
  identifying, responsive to receiving a query that includes the term, the source document and the target document as associated with the term in the index;
  
  responsive to identifying the associations, including the source document and the target document in a list of documents responsive to the query; and
  
  returning the list of documents responsive to the query as a search result for the query.
- - 12. The method of claim 11, wherein the target document has not been crawled prior to receiving the query.
  - 13. The method of claim 11, wherein the outbound link is an anchor tag in the source document and the annotation is anchor text associated with the anchor tag.
  - 14. The method of claim 11, wherein the target document is an image file.
  - 15. The method of claim 11, wherein the target document is an audio file.
  - 16. The method of claim 11, wherein the annotation is a continuous block of text from the source document.
  - 17. The method of claim 11, wherein the annotation includes text outside of an anchor tag in the source document.
  - 18. The method of claim 11, further comprising:
    - computing a query-independent relevance metric for the target document, wherein the query-independent relevance metric includes a sum of partial query-independent relevance metric contributions from each source document that includes an outbound link to the target document.
  - 19. The method of claim 18, wherein the query-independent relevance metric is a page rank.
  - 20. The method of claim 11, wherein the target document lacks textual information.
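
Claims 8 and 18 describe the query-independent relevance metric as a sum of partial contributions from each linking source document. The claims do not say how a single contribution is computed, so the sketch below assumes the familiar page-rank-style share, a source's own score divided by its number of outbound links. The function name and the sample scores are hypothetical.

```python
def relevance(target, inbound, score, out_degree):
    """Sum partial contributions from every source that links to target.

    inbound:    target id -> list of source ids linking to it.
    score:      source id -> current relevance score of that source.
    out_degree: source id -> number of outbound links on that source.
    The score / out_degree share is an assumption, not taken from the claims.
    """
    return sum(score[s] / out_degree[s] for s in inbound.get(target, []))


# Hypothetical graph: A (two outbound links) and B (one) both link to T.
inbound = {"T": ["A", "B"]}
score = {"A": 0.5, "B": 0.25}
out_degree = {"A": 2, "B": 1}
t_score = relevance("T", inbound, score, out_degree)
```

A full page rank computation would repeat this step over all documents until the scores settle, usually with a damping factor; the sketch shows a single evaluation only.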

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google LLC (Alphabet Inc.)
Inventors
Zhu, Huican; Dean, Jeffrey; Ghemawat, Sanjay; Yang, Bwolen Po-Jen; Acharya, Anurag
Primary Examiner(s)
Bashore, William L
Assistant Examiner(s)
Faber, David

Application Number
US15/088,670
Publication Number
US 20160321252A1
Time in Patent Office
1,054 Days
CPC Class Codes

G06F 16/2228   Indexing structures

G06F 16/94   Hypermedia Hyperlinking G06...

G06F 16/951   Indexing; Web crawling tech...

G06F 40/134   Hyperlinking

G06F 40/205   Parsing
