System and method for storing connectivity information in a web database
Abstract
A web crawler system includes a central processing unit for performing computations in accordance with stored procedures and a network interface for accessing remotely located computers via a network. A web crawler module downloads pages from remotely located servers via the network interface. A first link processing module obtains page link information from the downloaded pages; the page link information includes for each downloaded page a row of page identifiers of other pages. A second link processing module encodes the rows of page identifiers in a space-efficient manner. It arranges the rows of page identifiers in a particular order. For each respective row it identifies a prior row, if any, that best matches the respective row in accordance with predefined row match criteria, determines a set of deletes representing page identifiers in the identified prior row not in the respective row, and determines a set of adds representing page identifiers in the respective row not in the identified prior row. The second link processing module delta encodes the set of deletes and delta encodes the set of adds for each respective row, and then Huffman codes the delta encoded set of deletes and delta encoded set of adds for each respective row.
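The final two steps of the abstract, delta encoding followed by Huffman coding, can be illustrated with a minimal sketch. It assumes that each set of adds or deletes is held as a sorted list of integer page identifiers and that "delta encoding" means storing the gap between consecutive identifiers; the function names are hypothetical and the abstract does not prescribe this particular construction.

```python
import heapq
from collections import Counter
from typing import Dict, List


def delta_encode(ids: List[int]) -> List[int]:
    """Replace each sorted identifier by its gap from the previous one."""
    gaps, prev = [], 0
    for x in ids:
        gaps.append(x - prev)
        prev = x
    return gaps


def delta_decode(gaps: List[int]) -> List[int]:
    """Invert delta_encode by accumulating the gaps."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids


def huffman_code(symbols: List[int]) -> Dict[int, str]:
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)  # two lightest subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]
```

Small gaps tend to recur when linked pages have nearby identifiers, so a Huffman code built over the gap values can assign them short bit strings.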
24 Claims
1. A method of storing page link information comprising:
obtaining page link information for a set of pages, the page link information including for each page in the set a row of page identifiers of other pages;
arranging the rows of page identifiers in a particular order;
for each respective row:
identifying a reference row, if any, that best matches the respective row in accordance with predefined row match criteria; and
encoding the respective row as an identifier for the identified reference row, if any, a set of deletes representing page identifiers in the identified reference row not in the respective row, and a set of adds representing page identifiers in the respective row not in the identified reference row.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
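The encoding recited in claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: rows are sorted lists of integer page identifiers, candidate reference rows are limited to a hypothetical window of `WINDOW` preceding rows (the abstract speaks of a "prior row"), and the "predefined row match criteria" are assumed to be the smallest combined number of deletes and adds.

```python
from typing import List, Optional, Tuple

Row = List[int]  # sorted page identifiers linked from one page
Encoded = Tuple[Optional[int], List[int], List[int]]  # (reference index, deletes, adds)

WINDOW = 8  # hypothetical: number of preceding rows considered as references


def diff_row(row: Row, ref: Row) -> Tuple[List[int], List[int]]:
    """Return (deletes, adds) of `row` relative to the reference row `ref`."""
    ref_set, row_set = set(ref), set(row)
    deletes = sorted(ref_set - row_set)  # in the reference row, not in this row
    adds = sorted(row_set - ref_set)     # in this row, not in the reference row
    return deletes, adds


def encode_rows(rows: List[Row]) -> List[Encoded]:
    encoded: List[Encoded] = []
    for i, row in enumerate(rows):
        # Baseline: no reference row, so every identifier is an add.
        best: Encoded = (None, [], sorted(set(row)))
        for j in range(max(0, i - WINDOW), i):
            deletes, adds = diff_row(row, rows[j])
            # Assumed match criterion: fewest deletes plus adds.
            if len(deletes) + len(adds) < len(best[1]) + len(best[2]):
                best = (j, deletes, adds)
        encoded.append(best)
    return encoded


def decode_rows(encoded: List[Encoded]) -> List[Row]:
    """Rebuild every row; references always point to already decoded rows."""
    rows: List[Row] = []
    for ref_idx, deletes, adds in encoded:
        base = set(rows[ref_idx]) if ref_idx is not None else set()
        rows.append(sorted((base - set(deletes)) | set(adds)))
    return rows
```

For example, with rows `[1, 2, 3, 4]` and `[1, 2, 3, 5]`, the second row is encoded as reference 0 with deletes `[4]` and adds `[5]`, which is shorter than listing all four identifiers again.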
9. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
a first module for obtaining page link information for a set of pages, the page link information including for each page in the set a row of page identifiers of other pages; and
a second module for storing the page link information, including instructions for:
arranging the rows of page identifiers in a particular order;
for each respective row:
identifying a reference row, if any, that best matches the respective row in accordance with predefined row match criteria; and
encoding the respective row as an identifier for the identified reference row, if any, a set of deletes representing page identifiers in the identified reference row not in the respective row, and a set of adds representing page identifiers in the respective row not in the identified reference row.
Dependent claims: 10, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 23, 24
17. A web crawler system, comprising:
a central processing unit for performing computations in accordance with stored procedures;
a network interface for accessing remotely located computers via a network;
memory, coupled to the central processing unit, for storing procedures and data, including:
a web crawler module, executable by the central processing unit, for downloading a set of pages from remotely located servers via the network interface;
a first module for obtaining page link information from the set of pages, the page link information including for each page in the set a row of page identifiers of other pages; and
a second module for storing the page link information, including instructions for:
arranging the rows of page identifiers in a particular order;
for each respective row:
identifying a reference row, if any, that best matches the respective row in accordance with predefined row match criteria; and
encoding the respective row as an identifier for the identified reference row, if any, a set of deletes representing page identifiers in the identified reference row not in the respective row, and a set of adds representing page identifiers in the respective row not in the identified reference row.
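The "first module" of claim 17 turns downloaded pages into rows of page identifiers. The sketch below shows one way this might look, assuming the pages are already available as a mapping from URL to HTML text (downloading is omitted) and that identifiers are assigned in order of first appearance; the names `LinkCollector` and `build_rows` are hypothetical and not taken from the patent.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute href targets of <a> tags in one page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def build_rows(pages: Dict[str, str]) -> Tuple[Dict[str, int], Dict[int, List[int]]]:
    """Return (URL -> identifier, identifier -> sorted row of linked identifiers)."""
    ids: Dict[str, int] = {}

    def page_id(url: str) -> int:
        # Assumed scheme: the next unused integer for each newly seen URL.
        return ids.setdefault(url, len(ids))

    rows: Dict[int, List[int]] = {}
    for url, html in pages.items():
        collector = LinkCollector(url)
        collector.feed(html)
        src = page_id(url)
        rows[src] = sorted({page_id(link) for link in collector.links})
    return ids, rows
```

The resulting rows are in the form the second module expects: one sorted list of page identifiers per downloaded page.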
Specification