Anchor tag indexing in a web crawler system
Abstract
Provided is a method and system for indexing documents in a collection of linked documents. A link log, including one or more pairings of source documents and target documents, is accessed. A sorted anchor map, containing one or more target document to source document pairings, is generated. The pairings in the sorted anchor map are ordered based on target document identifiers.
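The inversion described in the abstract can be illustrated with a minimal sketch. It assumes integer document identifiers and an in-memory link log; the names `LinkRecord`, `AnchorRecord`, and `build_sorted_anchor_map` are hypothetical and do not come from the patent, which does not prescribe any particular implementation.

```python
from collections import defaultdict
from typing import List, NamedTuple


class LinkRecord(NamedTuple):
    """One link log entry: a source document and the targets it links to."""
    source_id: int
    target_ids: List[int]


class AnchorRecord(NamedTuple):
    """One anchor map entry: a target document and the sources linking to it."""
    target_id: int
    inbound_source_ids: List[int]


def build_sorted_anchor_map(link_log: List[LinkRecord]) -> List[AnchorRecord]:
    """Invert source -> targets pairings into target -> sources pairings."""
    inbound = defaultdict(list)
    for record in link_log:
        for target_id in record.target_ids:
            inbound[target_id].append(record.source_id)
    # Order the anchor records by target document identifier.
    # Deduplicating and sorting the sources is a choice made for this sketch.
    return [
        AnchorRecord(target_id, sorted(set(sources)))
        for target_id, sources in sorted(inbound.items())
    ]


# Illustrative data only: three documents that link to one another.
link_log = [
    LinkRecord(source_id=30, target_ids=[10, 20]),
    LinkRecord(source_id=10, target_ids=[20]),
    LinkRecord(source_id=20, target_ids=[10]),
]
anchor_map = build_sorted_anchor_map(link_log)
for record in anchor_map:
    print(record.target_id, record.inbound_source_ids)
```

A production crawler would more likely sort on disk or across machines; the in-memory dictionary here only serves to show the shape of the two record types.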
30 Claims
1. A system for processing information about documents in a collection of linked documents, the system comprising:
one or more processors;
memory storing one or more programs for execution by the one or more processors;
a link log, the link log comprising a plurality of link records, each link record identifying a source document and a list of one or more target documents pointed to by one or more outbound links in the source document;
the link record including a source document identifier for the identified source document and one or more target document identifiers for the identified list of target documents, wherein the link records are based, at least in part, on information extracted from crawled documents in the collection of linked documents; and
the one or more programs including a global state manager configured to access the link log and to output a sorted anchor map, the sorted anchor map comprising a plurality of anchor records, each anchor record comprising a respective target document identifier and a respective list of inbound links, the list of inbound links including source document identifiers;
wherein the plurality of anchor records are ordered in the sorted anchor map based, at least in part, on their respective target document identifiers; and
wherein, for at least one anchor record, a document located at a source document address corresponding to a source document identifier in the list of inbound links contains at least one outbound link, the at least one outbound link pointing to a corresponding target document address, the target document address corresponding to the respective target document identifier for the at least one anchor record.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A system for processing information about documents in a collection of linked documents, the system comprising:
one or more processors;
memory storing one or more programs for execution by the one or more processors;
a link log, the link log comprising a plurality of link records, each link record identifying a source document and a list of one or more target documents pointed to by one or more outbound links in the source document;
the link record including a source document identifier for the identified source document and one or more target document identifiers for the identified list of target documents, wherein the link records are based, at least in part, on information extracted from crawled documents in the collection of linked documents; and
the one or more programs including a global state manager configured to access the link log and to output a sorted anchor map that corresponds to the link log, the sorted anchor map comprising a plurality of anchor records, each anchor record identifying a respective target document and a list of inbound links, the list of inbound links identifying source documents that contain links to the respective target document;
each anchor record including a respective target document identifier;
wherein the plurality of anchor records are ordered in the sorted anchor map based, at least in part, on their respective target document identifiers; and
wherein each respective target document identifier in the plurality of anchor records corresponds to one of the one or more target document identifiers in the link log.
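The correspondence between the sorted anchor map and the link log recited in claims 1 and 14 can be expressed as a consistency check. The sketch below is illustrative only: the function name `check_anchor_map` and the dictionary and tuple layouts are assumptions, and the check is deliberately stricter than claim 1, which requires the inbound-link property for at least one anchor record rather than for all of them.

```python
from typing import Dict, List, Tuple

# Hypothetical layouts: link log as source_id -> target_ids,
# anchor map as a list of (target_id, inbound_source_ids) pairs.
LinkLog = Dict[int, List[int]]
AnchorMap = List[Tuple[int, List[int]]]


def check_anchor_map(link_log: LinkLog, anchor_map: AnchorMap) -> bool:
    """Return True if the anchor map is sorted and consistent with the link log."""
    known_targets = {t for targets in link_log.values() for t in targets}

    # Anchor records must be ordered by target document identifier.
    target_ids = [target_id for target_id, _ in anchor_map]
    if target_ids != sorted(target_ids):
        return False

    for target_id, sources in anchor_map:
        # Every target identifier must also appear in the link log.
        if target_id not in known_targets:
            return False
        # Every listed source must really contain a link to the target.
        if any(target_id not in link_log.get(s, []) for s in sources):
            return False
    return True


# Illustrative data only.
link_log = {30: [10, 20], 10: [20], 20: [10]}
anchor_map = [(10, [20, 30]), (20, [10, 30])]
print(check_anchor_map(link_log, anchor_map))
```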
15. A system for processing information about documents in a collection of linked documents, the system comprising:
one or more processors;
memory storing one or more programs, the one or more programs including instructions for:
crawling at least a subset of the documents in the collection of linked documents, and extracting from the crawled documents information concerning outbound links between documents in the collection of linked documents;
generating, based on the extracted information, a link log that comprises a plurality of link records, each link record identifying a respective source document and a list of one or more target documents pointed to by outbound links in the respective source document;
generating an anchor map that corresponds to the link log and that comprises a plurality of anchor records, each anchor record identifying a respective target document, a list of inbound links, the list of inbound links identifying source documents that contain links to the respective target document, and a list of annotations associated with the links in the source documents that point to the respective target document;
wherein each respective target document identified in the plurality of anchor records corresponds to a target document identified in the link log; and
processing at least a plurality of the anchor records, including, for each anchor record in the plurality of anchor records, adding to a document index entries for terms in the list of annotations in the anchor record, wherein the entries are associated with the target document identified by the anchor record.
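The final step of claim 15, adding index entries for annotation terms and associating them with the target document, might look like the following sketch. It assumes the annotations are anchor text strings and that terms are obtained by lowercasing and splitting on whitespace; both are simplifications chosen for illustration, and the name `index_anchor_terms` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical anchor record layout:
# (target_id, [(source_id, annotation_text), ...])
AnchorRecord = Tuple[int, List[Tuple[int, str]]]


def index_anchor_terms(anchor_map: List[AnchorRecord]) -> Dict[str, Set[int]]:
    """Map each annotation term to the target documents it describes."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for target_id, inbound in anchor_map:
        for _source_id, annotation in inbound:
            # Naive tokenization, assumed for this sketch only.
            for term in annotation.lower().split():
                # The entry is credited to the target, not the source.
                index[term].add(target_id)
    return index


# Illustrative data only.
anchor_map = [
    (10, [(20, "Crawler design notes"), (30, "notes on crawling")]),
    (20, [(10, "link log format")]),
]
index = index_anchor_terms(anchor_map)
print(sorted(index["notes"]))
```

The point of the design is that a query for a term can match document 10 even if the term never appears in document 10 itself, because the term was used in links pointing to it.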
16. A non-transitory computer readable storage medium storing one or more programs, for processing information about documents in a collection of linked documents, for execution by a computer system, the one or more programs comprising:
a link log data structure, the link log data structure comprising a plurality of link records, each link record identifying a source document and a list of one or more target documents pointed to by one or more outbound links in the source document;
the link record including a source document identifier for the identified source document and one or more target document identifiers for the identified list of target documents, wherein the link records are based, at least in part, on information extracted from crawled documents in the collection of linked documents;
a global state manager module including instructions for accessing the link log data structure; and
a sorted anchor map data structure corresponding to the link log data structure, the sorted anchor map data structure comprising a plurality of anchor records, each anchor record comprising a respective target document identifier and a respective list of inbound links, the list of inbound links including source document identifiers;
wherein the global state manager module contains instructions for writing to the sorted anchor map data structure, wherein the plurality of anchor records are ordered in the sorted anchor map data structure based, at least in part, on respective target document identifiers; and
wherein, for at least one anchor record, a document located at a source document address corresponding to a source document identifier in the list of inbound links contains at least one outbound link, the at least one outbound link pointing to a corresponding target document address, the target document address corresponding to the respective target document identifier for the at least one anchor record.
Dependent claims: 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28
29. A non-transitory computer readable storage medium storing one or more programs, for processing information about documents in a collection of linked documents, for execution by a computer system, the one or more programs comprising:
a link log, the link log comprising a plurality of link records, each link record identifying a source document and a list of one or more target documents pointed to by one or more outbound links in the source document;
the link record including a source document identifier for the identified source document and one or more target document identifiers for the identified list of target documents;
wherein the link records are based, at least in part, on information extracted from crawled documents in the collection of linked documents; and
a global state manager configured to access the link log and to output a sorted anchor map that corresponds to the link log, the sorted anchor map comprising a plurality of anchor records, each anchor record identifying a respective target document and a list of inbound links, the list of inbound links identifying source documents that contain links to the respective target document;
each anchor record including a respective target document identifier,
wherein the plurality of anchor records are ordered in the sorted anchor map based, at least in part, on their respective target document identifiers, and
wherein each respective target document identifier in the plurality of anchor records corresponds to one of the one or more target document identifiers in the link log.
30. A non-transitory computer readable storage medium storing one or more programs, for processing information about documents in a collection of linked documents, for execution by a computer system, the one or more programs comprising instructions for:
crawling at least a subset of the documents in the collection of linked documents, and extracting from the crawled documents information concerning outbound links between documents in the collection of linked documents;
generating, based on the extracted information, a link log that comprises a plurality of link records, each link record identifying a respective source document and a list of one or more target documents pointed to by outbound links in the respective source document;
generating an anchor map that corresponds to the link log and that comprises a plurality of anchor records, each anchor record identifying a respective target document, a list of inbound links, the list of inbound links identifying source documents that contain links to the respective target document, and a list of annotations associated with the links in the source documents that point to the respective target document;
wherein each respective target document identified in the plurality of anchor records corresponds to a target document identified in the link log; and
processing at least a plurality of the anchor records, including, for each anchor record in the plurality of anchor records, adding to a document index entries for terms in the list of annotations in the anchor record, wherein the entries are associated with the target document identified by the anchor record.
Specification