Content retrieval from sites that use session identifiers
Abstract
Session identifiers are automatically identified in uniform resource locators (URLs). The session identifiers may be identified using classification techniques based on whether identical sub-strings are identified in multiple URLs downloaded from a web site. The URLs may then have the session identifiers extracted to generate clean versions of the URLs.
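
The approach described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes session identifiers appear as query-string values, and it uses one hypothetical rule (a run of 16 or more hexadecimal digits). The names `TOKEN_RULE`, `find_session_ids`, and `clean_url` are invented for this example.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed rule (not from the patent): a session identifier looks like
# a run of at least 16 hexadecimal digits.
TOKEN_RULE = re.compile(r"[0-9a-fA-F]{16,}")


def find_session_ids(urls, min_occurrences=2):
    """Return query values that satisfy the rule and recur across several URLs."""
    counts = Counter()
    for url in urls:
        # Count each value at most once per URL.
        values = {value for _, value in parse_qsl(urlsplit(url).query)}
        for value in values:
            if TOKEN_RULE.fullmatch(value):
                counts[value] += 1
    return {value for value, n in counts.items() if n >= min_occurrences}


def clean_url(url, session_ids):
    """Drop every query parameter whose value is an identified session identifier."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if value not in session_ids
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    extracted = [
        "http://example.com/a?sid=0123456789abcdef0123&p=1",
        "http://example.com/b?sid=0123456789abcdef0123",
        "http://example.com/c?id=7",
    ]
    ids = find_session_ids(extracted)
    for url in extracted:
        print(clean_url(url, ids))
```

Requiring both a structural rule and multiple occurrences mirrors the two conditions named in the claims; a real crawler would likely need additional rules for identifiers embedded in the path rather than the query string.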
22 Claims

1. A method performed by a computer system, the method comprising:
- extracting, by one or more processors associated with the computer system, a set of uniform resource locators (URLs) from one document or from multiple documents downloaded from a web site;
- identifying, by the one or more processors associated with the computer system, a sub-string occurring in the set of URLs as a session identifier, based on at least one of a plurality of rules and based on multiple occurrences of the sub-string occurring in the set of URLs;
- generating, by the one or more processors associated with the computer system, a clean set of URLs, derived from the set of URLs, by removing the session identifier;
- determining, by the one or more processors associated with the computer system, additional URLs that have already been crawled based on a comparison of a clean set of the additional URLs to the clean set of generated URLs; and
- where the clean set of the additional URLs is generated by removing another session identifier, or the identified session identifier, of the additional URLs.
Dependent claims: 2, 3, 4, 5, 6, 7

8. A method performed by a computer system, the method comprising:
- downloading, by a communication interface associated with the computer system, one or more documents from a web site;
- extracting, by one or more processors associated with the computer system, a set of uniform resource locators (URLs) from the downloaded one or more documents;
- identifying, by the one or more processors associated with the computer system, a sub-string occurring in the extracted set of URLs as a session identifier, based on the sub-string having a structure consistent with session identifiers and based on multiple occurrences of the sub-string in the extracted set of URLs;
- generating, by the one or more processors associated with the computer system, a clean set of URLs from the extracted set of URLs by removing the identified session identifier;
- determining, by the one or more processors associated with the computer system, whether additional URLs have already been crawled based on a comparison of a clean set of the additional URLs to the generated clean set of URLs;
- where the clean set of the additional URLs is generated by removing another session identifier, or the identified session identifier, of the additional URLs.
Dependent claims: 9, 10, 11

12. A device comprising:
- a memory to store instructions; and
- a processor to execute the instructions to implement:
- at least one fetch bot to download content on a network from a single web site;
- extract URLs from the downloaded content;
- identify a sub-string as a session identifier from the URLs extracted from the downloaded content based on at least one of a plurality of rules and based on multiple occurrences of the sub-string in the extracted URLs;
- create a clean set of URLs by removing the session identifier from the extracted URLs;
- store the clean set of URLs; and
- determine whether additional URLs have already been crawled based on a comparison of a clean set of the additional URLs to the created clean set of URLs, where the clean set of the additional URLs is generated by removing another session identifier, or the identified session identifier, from the additional URLs.
Dependent claims: 13, 14, 15, 16, 17

18. A system comprising:
- one or more server devices comprising one or more processors to:
- download one or more documents from a web site;
- extract a set of uniform resource locators (URLs) from the one or more documents downloaded from the website;
- identify a sub-string occurring in the set of URLs as a session identifier, based on the sub-string including characters that are structured consistent with session identifiers and based on multiple occurrences of the sub-string in the set of URLs;
- generate a clean set of URLs from the set of URLs by removing the identified sub-string;
- determine whether additional URLs have already been crawled based on a comparison of a clean set of the additional URLs to the generated clean set of URLs;
- where the clean set of the additional URLs is generated by removing another session identifier, or the identified session identifier, of the additional URLs.
Dependent claims: 19, 20

21. One or more memory devices that include programming instructions executed by one or more processors, where the instructions cause the one or more processors to:
- extract a set of uniform resource locators (URLs) from one document or from multiple documents associated with a single web host;
- identify, in the set of URLs, a sub-string as a session identifier based on the sub-string having at least a specified measure of randomness and based on multiple occurrences of the sub-string in the extracted set of URLs; and
- generate a clean set of URLs from the extracted set of URLs by removing the identified session identifier;
- determine, by the one or more processors associated with the computer system, additional URLs have already been crawled based on a comparison of a clean set of the additional URLs to the clean set of URLs;
- wherein the clean set of the additional URLs is generated by removing another session identifier, or the identified session identifier, of the additional URLs.
Dependent claim: 22
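
Two ideas recur in the claims above: a "specified measure of randomness" (claim 21) and a check for already-crawled URLs by comparing clean URLs. The sketch below illustrates both under stated assumptions. The claims do not say which randomness measure is used; Shannon entropy over character frequencies is swapped in here purely as an example, and the thresholds and the names `shannon_entropy`, `looks_random`, and `already_crawled` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(token):
    """Bits per character, estimated from character frequencies within the token."""
    if not token:
        return 0.0
    total = len(token)
    counts = Counter(token)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_random(token, min_length=16, min_bits_per_char=3.0):
    """Assumed rule: long enough and sufficiently high-entropy (thresholds are guesses)."""
    return len(token) >= min_length and shannon_entropy(token) >= min_bits_per_char


def already_crawled(url, session_id, crawled_clean_urls):
    """Remove the URL's own session identifier, then look it up in the clean set."""
    return url.replace(session_id, "") in crawled_clean_urls


if __name__ == "__main__":
    first_visit = "http://example.com/a?sid=0123456789abcdef0123"
    # Clean set built by removing the identified sub-string from crawled URLs.
    crawled = {first_visit.replace("0123456789abcdef0123", "")}

    later_visit = "http://example.com/a?sid=ffeeddccbbaa99887766"
    print(looks_random("ffeeddccbbaa99887766"))
    print(already_crawled(later_visit, "ffeeddccbbaa99887766", crawled))
```

Because the later URL carries a different session identifier, the comparison only works after each URL has had its own identifier removed, which is the situation the final "where" clause of each independent claim addresses.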
Specification