Identification of web sites that contain session identifiers
Abstract
Web sites are analyzed to determine whether the web sites are embedding session identifiers in web documents. The analysis is based on a comparison of in-host links of multiple copies of a document from a web site. Rules governing the insertion of session identifiers for the web site may be determined and used to assist in crawling the web site.
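The comparison of in-host links described above can be illustrated with a minimal sketch. It assumes the two copies of the document are already available as HTML strings and treats "in-host" as meaning "same network location as the document's URL". The names `LinkCollector`, `in_host_links`, and `changed_links` are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_host_links(html, base_url):
    """Return absolute links in `html` that share the host of `base_url`."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlsplit(base_url).netloc
    absolute = [urljoin(base_url, href) for href in parser.hrefs]
    return [url for url in absolute if urlsplit(url).netloc == host]


def changed_links(copy_a, copy_b, base_url):
    """Return in-host links that appear in exactly one of the two copies."""
    links_a = set(in_host_links(copy_a, base_url))
    links_b = set(in_host_links(copy_b, base_url))
    return links_a ^ links_b  # symmetric difference
```

Links that stay identical across both copies, and links to other hosts, drop out of the result. What remains is the set of in-host links that differ between downloads, which is where an embedded session identifier would be expected to show up.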
30 Claims
1. A method for crawling documents, performed by one or more server devices, the method comprising:
- receiving, by one or more processors associated with the one or more server devices, a uniform resource locator (URL);
- receiving, by one or more processors associated with the one or more server devices, at least two different copies of a document associated with the URL; and
- determining, by one or more processors associated with the one or more server devices, whether a web site corresponding to the URL uses session identifiers based on a comparison of URLs that are within the document and that change between the at least two different copies of the document, where the web site is determined to use session identifiers when a portion of the URLs that change between the at least two different copies of the document is greater than a threshold.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10

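The threshold test in claim 1 can be sketched as follows. This is one possible reading, not the claimed implementation: it assumes "portion of the URLs that change" means the fraction of distinct URLs in the first copy that are absent from the second copy. The function names and the default `threshold` of 0.5 are hypothetical; the claim does not state a value.

```python
def changed_fraction(urls_a, urls_b):
    """Fraction of distinct URLs in the first copy missing from the second."""
    set_a, set_b = set(urls_a), set(urls_b)
    if not set_a:
        return 0.0  # no URLs to compare, so nothing changed
    return len(set_a - set_b) / len(set_a)


def uses_session_ids(urls_a, urls_b, threshold=0.5):
    """Guess that the site uses session identifiers when the changed
    fraction exceeds `threshold` (an assumed, illustrative cutoff)."""
    return changed_fraction(urls_a, urls_b) > threshold
```

For example, if two of the three links extracted from the first copy carry a `sid` value that differs in the second copy, the changed fraction is 2/3 and the sketch reports that the site uses session identifiers under the assumed cutoff.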
11. A method for identifying web sites that use session identifiers, performed by one or more server devices, the method comprising:
- downloading, by one or more processors associated with the one or more server devices, at least two different copies of at least one document from a web site;
- extracting, by one or more processors associated with the one or more server devices, uniform resource locators (URLs) from the two different copies of the web document;
- comparing, by one or more processors associated with the one or more server devices, the extracted URLs of the two different copies of the document; and
- determining, by one or more processors associated with the one or more server devices, whether the web site uses session identifiers when the comparison indicates that at least a portion of the URLs change between the two different copies.
Dependent claims: 12, 13, 14

15. A device comprising:
- a memory to store instructions; and
- a processor to execute the instructions to implement:
  - a spider component configured to crawl web documents associated with at least one web site; and
  - a session identifier component configured to determine whether the web site uses session identifiers based on a comparison of a portion of uniform resource locators (URLs) that change between different copies of at least one web document downloaded from the web site.
Dependent claims: 16, 17, 18, 19, 20

21. A device comprising:
- means for downloading at least two different copies of at least one web document from a web site;
- means for extracting uniform resource locators (URLs) from the two different copies of the web document;
- means for comparing the extracted URLs of the two different copies of the web document; and
- means for determining whether the web site uses session identifiers when the comparison indicates that at least a portion of the URLs change between the two different copies.
Dependent claims: 22, 23, 24

25. One or more memory devices containing programming instructions that, when executed by at least one processor cause the processor to perform a method for identifying web sites that use session identifiers, the one or more memory devices including:
- one or more instructions to download at least two different copies of at least one document from a web site;
- one or more instructions to extract uniform resource locators (URLs) from the two different copies of the document;
- one or more instructions to compare the extracted URLs of the two different copies of the web document; and
- one or more instructions to determine whether the web site uses session identifiers when the comparison indicates that at least a portion of the URLs change between the two different copies.
Dependent claims: 26, 27, 28, 29, 30

Specification