Web page connectivity server construction
First Claim
1. A process for constructing a Host Database comprising:
- creating a preliminary Host Table in a connectivity server that collects, arranges and stores data to define the connectivity of pages on the Web, sorting URLs in a URL database concurrently with the creating of the preliminary Host Table occurring; and
creating a Host Table and a Host Index, the Host Table containing information about the URLs from the preliminary Host Table.
11 Assignments
0 Petitions
Accused Products
Abstract
A process for constructing a server for collecting, arranging and storing data that defines the connectivity of pages on the World Wide Web (Web). The process input is a set of compressed ASCII links files, wherein each links file is a series of source URLs and corresponding destination URLs. A temporary URLs_info Table is created and initialized. The links files and URLs metadata are read. Buffers of unique URLs are sorted and written from the links files into URL runs. An ID Index is created from the URL_info table. CS_ids are assigned to URLs and written to the ID Index. Both a compressed URL data structure and a URL Index are created. A Host Table is created. URL fingerprints are converted to CS_ids, and preliminary outstarts to CS_ids and preliminary outstarts and outlinks tables are created. Compressed outstarts and outlinks tables are created from the preliminary tables. Subsequently, compressed instarts and inlinks tables are created based on the outstarts and outlinks tables.
-
Citations
33 Claims
-
1. A process for constructing a Host Database comprising:
-
creating a preliminary Host Table in a connectivity server that collects, arranges and stores data to define the connectivity of pages on the Web, sorting URLs in a URL database concurrently with the creating of the preliminary Host Table occurring; and
creating a Host Table and a Host Index, the Host Table containing information about the URLs from the preliminary Host Table. - View Dependent Claims (2, 3, 4)
(i) sorting the preliminary Host Table by CS_id;
(ii) copying entries from the preliminary Host Table to the Host Table;
(iii) creating an index on the preliminary Host Table;
(iv) sorting the index; and
(v) for each entry, storing the related HostFP from the preliminary Host Table, the HostFP for identifying the entry.
-
-
5. A process for constructing a URL Database for a connectivity server that collects, arranges and stores data to define the connectivity of pages on the Web, the process comprising:
-
(a) reading a set of links files, each of which contains a series of source URLs;
(b) calculating a fingerprint for each URL;
(c) creating a temporary URL_info Table in the form of a hash table having as keys the most significant N bits of a URL fingerprint;
(d) creating an ID Index from the URLs_info Table;
(e) assigning CS_ids to URLs;
(f) routing the CS_ids to URLs;
(g) creating a URL Index; and
(h) converting URL fingerprints to CS_ids. - View Dependent Claims (6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
(i) discarding the URL scheme;
(ii) performing a prefix compression; and
(iii) performing a second compression.
-
-
14. A process as defined in claim 13, wherein performing the prefix compression comprises the steps:
-
(ii.a) writing a number followed by a first URL, and (ii.b) for each URLi subsequent to the first URL, writing a one-byte integer followed by a remainder, where the one-byte integer represents the length of a common prefix shared by a URLi and a URL(i−
1) and where the remainder is the portion of URLi following the common prefix.
-
-
15. A process as defined in claim 5, wherein URLs are compressed by, for each URL:
-
(i) discarding the URL scheme;
(ii) performing a prefix compression; and
(iii) performing a second compression.
-
-
16. A process as defined in claim 15, wherein performing the prefix compression comprises the steps:
-
(ii.a) writing a number followed by a first URL; and
(ii.b) for each URLi subsequent to the first URL, writing a one-byte integer followed by a remainder, where the one-byte integer represents the length of a common prefix shared by a URLi and a URL(i−
1) and where the remainder is the portion of URLi following the common prefix.
-
-
17. A process as defined in claim 5, wherein a fingerprint is calculated for each URL in accordance with a deterministic mathematical function.
-
18. A process as defined in claim 17, wherein a fingerprint is calculated for each URL according to a hash function that returns a unique integer value corresponding to each unique URL string.
-
19. A process as defined in claim 17, wherein each record in the URLs_info Table contains remaining M least significant bits of the URL fingerprint.
-
20. A process as defined in claim 19, wherein each record additionally contains metadata comprising:
-
(1) the indegree of each URL;
(2) the outdegree of each URL; and
(3) a set of Boolean values.
-
-
21. A process as defined in claim 20, wherein the set of Boolean values contains a Boolean value that indicates whether a respective URL has been a source URL.
-
22. A process as defined in claim 20, wherein the set of Boolean values contains a Boolean value that indicates whether the respective URL has appeared in an input file.
-
23. A process as defined in claim 20, wherein the set of Boolean values contains a value that indicates whether the URL has appeared as a source URL in an input file.
-
24. An information storage device for storing, arranging and presenting data defining the connectivity of pages on the Web, the device comprising:
-
a URL Database that stores URLs and that associates a fingerprint and a CS_id with, each stored URL, the URL Database comprising;
a URL Database API, a URL Index Array, and a plurality of partitions that store URLs according to a predetermined criteria;
a Host Database that associates a Host_id with each distinct hostname in the URL Database, the host comprising;
a Host Database API that operates to accept a CS_id and return a corresponding Host_id and to accept a Host_id and return the CS_ids on the corresponding host; and
a Link Database that stores links between source URLs and destination URLs, the link Database comprising;
a Link Database API that operates to retrieve, for a given CS_id, the number of outlinks from the URL corresponding to the CS_id and the number of inlinks to that URL, wherein the device is constructed in accordance with a method comprising;
(a) reading a set of links files;
(b) creating a temporary URLs_info Table;
(c) creating an B) Index from the URLs_info Table;
(d) assigning CS_ids to URLs;
(e) writing the CS_ids to the ID Index;
(f) compressing URLs;
(g) creating a URL Index;
(h) creating a Host Table;
(i) converting URL fingerprints to CS_ids;
(j) creating outstarts and outlinks tables; and
(k) writing instarts and inlink tables to a partitioned URL Database.- View Dependent Claims (25, 26, 27, 28, 29, 30, 31, 32, 33)
(i) discarding the URL scheme;
(ii) performing a first, prefix, compression; and
(iii) performing a second compression.
-
-
27. A device as defined in claim 26, wherein the prefix compression comprises the steps:
-
(ii.a) writing a number followed by a first URL; and
(ii.b) for each URLi subsequent to the first URL, writing a one-byte integer followed by a remainder, where the one-byte integer represents the length of a common prefix shared by a URLi and a URL(i−
1) and where the remainder is the portion of URLi following the common prefix.
-
-
28. A device as defined in claim 27, wherein the Host Database comprises a Host Table, the Host Table in turn comprising a plurality of rows containing information regarding:
-
(1) a starting CS_id of a consecutive series of CS_ids on the same host;
(2) the number of CS_ids in the series;
(3) the Host_id for the series; and
(4) the row number of the next highest row containing the same Host.
-
-
29. A device as defined in claim 28, wherein the Host Database comprises a Host Index, where an ith entry in the Host Index contains the largest Host Table row number whose starting CS_id is less than or equal to i*P.
-
30. A device as defined in claim 24, wherein in Step (b) links files are read and are decompressed;
- a fingerprint is computed for each URL and added to the URL_info Table; and
upon the first instance of reading a source URL, the URL is written to a URL_sort buffer.
- a fingerprint is computed for each URL and added to the URL_info Table; and
-
31. A device as defined in claim 30, wherein in Step (c), fingerprints are copied to the ID Index from the temporary URLs_info Table if:
-
(1) the fingerprint corresponds to a source URL;
or(2) the fingerprint corresponds to a special URL that “
appears;
”
or(3) the fingerprint corresponds to a URL with a n indegree greater than or equal to a predetermined number and that “
appears,”
where a URL “
appears”
if(i) the URL is a destination URL and appears in the links files at least a predetermined number of times, or (ii) the URL is a destination URL, is a special URL, and appears in the links files at least once.
-
-
32. A device as defined in claim 31, wherein a fingerprint is calculated for each URL in accordance with a deterministic mathematical function.
-
33. A device as defined in claim 32, wherein a fingerprint is calculated for each URL according to a hash function that returns a unique integer value corresponding to each unique URL string.
Specification