Collaborative team crawling:Large scale information gathering over the internet
First Claim
1. A method to gather data objects from an information system, comprising:
- constructing logical web-graphs defining logical relationships between specified objects in the information system;
partitioning the web-graph into sub-graphs; and
processing the specified objects to locate desired objects by;
balancing the processing of the specified objects by mapping the sub-graphs to one or more preferred processors, wherein a processor is preferred if it reduces processing time or overhead; and
re-balancing the processing if the processing load becomes unbalanced.
1 Assignment
0 Petitions
Accused Products
Abstract
A distributed collection of web-crawlers to gather information over a large portion of the cyberspace. These crawlers share the overall crawling through a cyberspace partition scheme. They also collaborate with each other through load balancing to maximally utilize the computing resources of each of the crawlers. The invention takes advantage of the hierarchical nature of the cyberspace namespace and uses the syntactic components of the URL structure as the main vehicle for dividing and assigning crawling workload to individual crawler. The partition scheme is completely distributed in which each crawler makes the partitioning decision based on its own crawling status and a globally replicated partition tree data structure.
171 Citations
43 Claims
-
1. A method to gather data objects from an information system, comprising:
-
constructing logical web-graphs defining logical relationships between specified objects in the information system;
partitioning the web-graph into sub-graphs; and
processing the specified objects to locate desired objects by;
balancing the processing of the specified objects by mapping the sub-graphs to one or more preferred processors, wherein a processor is preferred if it reduces processing time or overhead; and
re-balancing the processing if the processing load becomes unbalanced. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
coarsening the web-graph to create sub-graphs, the sub-graphs comprising identifiable groupings of objects, where coarsening comprises contracting the vertices in the web-graph; and
dynamically recognizing each identifiable sub-graph.
-
-
5. The method recited in claim 4, further comprising creating sub-graphs comprising clusters of resources having physical closeness to one another.
-
6. The method recited in claim 5, further comprising creating sub-graphs comprising clusters of resources having hyperlink closeness to one another.
-
7. The method recited in claim 6, further comprising estimating a weight for each node of a graph, the weight reflecting a number of visits to the node, a cumulative total of the weights for nodes contained in a sub-graph being used for processing (load) balancing.
-
8. The method recited in claim 7, load balancing including:
-
maintaining information regarding average access time and processing time of an object or a sub-graph, these access and processing times forming a processing rate;
mapping sub-graphs to desired processors based upon processing rate differences.
-
-
9. The method recited in claim 8, further comprising:
-
identifying communities of sub-graphs that have numerous connections; and
mapping sub-graphs having numerous connections to a same processor.
-
-
10. The method recited in claim 9, further comprising using a hierarchically structured namespace having syntactic components to identify objects.
-
11. The method recited in claim 9, further comprising:
-
maintaining a queue and a hit-hash for each processor of sub-graphs, the hit-hash maintained for objects having a potentially large in-degree;
checking the processors hit-hash for the object if a new object is encountered by a processor, and processing the object if is the hit-hash, otherwise;
moving the object to a temporary location;
having each another processor check its hit-hash and queue for the object; and
further processing the object.
-
-
12. The method recited in claim 9, further comprising using non-preferred (lightweight) processors and preferred processors for processing, wherein a lightweight processor can be removed from processing at any time.
-
13. The method recited in claim 12, further comprising not re-balancing the processing load between processors when a lightweight processor is added or removed from processing.
-
14. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method for gathering objects from an information system, said method comprising:
-
constructing logical web-graphs defining logical relationships between specified objects in the information system;
partitioning the web-graph into sub-graphs; and
processing the specified objects to locate desired objects by;
balancing the processing of the specified objects by mapping the sub-graphs to one or more preferred processors, wherein a processor is preferred if it reduces processing time or overhead; and
re-balancing the processing if the processing load becomes unbalanced. - View Dependent Claims (15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26)
coarsening the web-graph to create sub-graphs, the sub-graphs comprising identifiable groupings of objects, where coarsening comprises contracting the vertices in the web-graph; and
dynamically recognizing each identifiable sub-graph.
-
-
18. The medium recited in claim 17, further comprising creating sub-graphs comprising clusters of resources having physical closeness to one another.
-
19. The medium recited in claim 18, further comprising creating sub-graphs comprising clusters of resources having hyperlink closeness to one another.
-
20. The medium recited in claim 18, load balancing including:
-
maintaining information regarding average access time and processing time of an object or a sub-graph, these access and processing times forming a processing rate;
mapping sub-graphs to desired processors based upon processing rate differences.
-
-
21. The medium recited in claim 20, further comprising:
-
identifying communities of sub-graphs that have numerous connections; and
mapping sub-graphs having numerous connections to a same processor.
-
-
22. The medium recited in claim 21, further comprising using a hierarchically structured namespace having syntactic components to identify objects.
-
23. The medium recited in claim 21, further comprising:
-
maintaining a queue and a hit-hash for each processor of sub-graphs, the hit-hash maintained for objects having a potentially large in-degree;
checking the processors hit-hash for the object if a new object is encountered by a processor, and processing the object if is the hit-hash, otherwise;
moving the object to a temporary location;
having each another processor check its hit-hash and queue for the object; and
further processing the object.
-
-
24. The medium recited in claim 21, further comprising using non-preferred (lightweight) processors and preferred processors for processing, wherein a lightweight processor can be removed from processing at any time.
-
25. The medium recited in claim 24, further comprising not re-balancing the processing load between processors when a lightweight processor is added or removed from processing.
-
26. The medium recited in claim 17, further comprising estimating a weight for each node of a graph, the weight reflecting a number of visits to the node, a cumulative total of the weights for nodes contained in a sub-graph being used for processing (load) balancing.
-
27. An apparatus for traversing an information system to retrieve designated objects, the apparatus comprising:
-
a recognizer to determine a format for each object retrieved;
a summarizer to read varying data formats communicatively coupled to the recognizer;
an expander for expanding objects having a compressed format communicatively coupled to the recognizer and the summarizer;
storage and processors included in one or more of the above, the processors capable of reading digital signals to traverse the information system to retrieve designated objects by;
constructing logical web-graphs defining logical relationships between specified objects in the information system;
partitioning the web-graph into sub-graphs; and
processing the specified objects to locate desired objects by;
balancing the processing of the specified objects by mapping the sub-graphs to one or more preferred processors, wherein a processor is preferred if it reduces processing time or overhead; and
re-balancing the processing if the processing load becomes unbalanced. - View Dependent Claims (28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39)
coarsening the web-graph to create sub-graphs, the sub-graphs comprising identifiable groupings of objects, where coarsening comprises contracting the vertices in the web-graph; and
dynamically recognizing each identifiable sub-graph.
-
-
31. The apparatus recited in claim 30, further comprising creating sub-graphs comprising clusters of resources having physical closeness to one another.
-
32. The apparatus recited in claim 31, further comprising creating sub-graphs comprising clusters of resources having hyperlink closeness to one another.
-
33. The apparatus recited in claim 32, further comprising estimating a weight for each node of a graph, the weight reflecting a number of visits to the node, a cumulative total of the weights for nodes contained in a sub-graph being used for processing (load) balancing.
-
34. The apparatus recited in claim 33, load balancing including:
-
maintaining information regarding average access time and processing time of an object or a sub-graph, these access and processing times forming a processing rate;
mapping sub-graphs to desired processors based upon processing rate differences.
-
-
35. The apparatus recited in claim 34, further comprising:
-
identifying communities of sub-graphs that have numerous connections; and
mapping sub-graphs having numerous connections to a same processor.
-
-
36. The apparatus recited in claim 35, further comprising using a hierarchically structured namespace having syntactic components to identify objects.
-
37. The apparatus recited in claim 35, further comprising:
-
maintaining a queue and a hit-hash for each processor of sub-graphs, the hit-hash maintained for objects having a potentially large in-degree;
checking the processors hit-hash for the object if a new object is encountered by a processor, and processing the object if is the hit-hash, otherwise;
moving the object to a temporary location;
having each another processor check its hit-hash and queue for the object; and
further processing the object.
-
-
38. The apparatus recited in claim 35, further comprising using non-preferred (lightweight) processors and preferred processors for processing, wherein a lightweight processor can be removed from processing at any time.
-
39. The apparatus recited in claim 38, further comprising not re-balancing the processing load between processors when a lightweight processor is added or removed from processing.
-
40. An apparatus for traversing an information system to retrieve designated objects, the apparatus comprising:
-
a means for recognizing a format for an object retrieved;
a means for summarizing and reading varying data formats communicatively coupled to the recognizing means;
a means for expanding objects having a compressed format and communicatively coupled to the recognizing means and the summarizing means;
storage means and processor means included in one or more of the above, the storage means for storing digital signals, the processor means for reading digital signals and traversing the information system to retrieve designated objects by;
constructing logical web-graphs defining logical relationships between specified objects in the information system;
partitioning the web-graph into sub-graphs; and
processing the specified objects to locate desired objects by;
balancing the processing of the specified objects by mapping the sub-graphs to one or more preferred processors, wherein a processor is preferred if it reduces processing time or overhead; and
re-balancing the processing if the processing load becomes unbalanced. - View Dependent Claims (41, 42, 43)
a means for coarsening the web-graph to create sub-graphs, the sub-graphs comprising identifiable groupings of objects, where coarsening comprises contracting the vertices in the web-graph; and
a means for dynamically recognizing each identifiable sub-graph.
-
Specification