Content retrieval from sites that use session identifiers
Abstract
Session identifiers are automatically identified in uniform resource locators (URLs). The session identifiers may be identified using classification techniques based on whether identical sub-strings are identified in multiple URLs downloaded from a web site. The URLs may then have the session identifiers extracted to generate clean versions of the URLs.
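The repeated-sub-string idea in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that candidate sub-strings are query-parameter values; the helper name `shared_values`, the `min_urls` threshold, and the example URLs are hypothetical and do not come from the patent.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def shared_values(urls, min_urls=2):
    """Return query values that appear unchanged under at least min_urls distinct paths.

    Assumption: session identifiers travel as query-parameter values.
    """
    seen = defaultdict(set)  # value -> set of URL paths that carry it
    for url in urls:
        parts = urlsplit(url)
        for _name, value in parse_qsl(parts.query):
            seen[value].add(parts.path)
    return {value for value, paths in seen.items() if len(paths) >= min_urls}


# Hypothetical URLs downloaded from one site during one session.
urls = [
    "http://example.com/a?sid=8fA3kQ9zX1&page=1",
    "http://example.com/b?sid=8fA3kQ9zX1&page=2",
    "http://example.com/c?sid=8fA3kQ9zX1&page=3",
]
candidates = shared_values(urls)
print(candidates)
```

This is only a coarse filter: an ordinary value that repeats across pages (for example a language code) would also be returned, so a real classifier would need further evidence before treating a candidate as a session identifier.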
20 Claims
1. A method performed by a computer system, the method comprising:
extracting, by one or more processors associated with the computer system, a set of uniform resource locators (URLs) from at least one document;
identifying, by the one or more processors, a sub-string occurring in the set of URLs as a session identifier based on the sub-string being associated with at least a particular measure of randomness, identifying the sub-string including:
determining the particular measure of randomness associated with the sub-string based on a quantity of times characters in the sub-string alternate between numbers, lower case letters, or upper case letters;
generating, by the one or more processors, a clean set of URLs from the set of URLs by removing the session identifier; and
determining that a second URL has already been crawled based on the second URL matching a URL in the clean set of URLs.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
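The randomness measure recited in claim 1 can be read as a count of transitions between character classes. The following is a minimal sketch under that reading, not the patented implementation: the function names, the choice to skip characters outside the three classes, and the `threshold` value of 4 are all assumptions made for illustration.

```python
def char_class(ch):
    """Map a character to 'digit', 'lower', 'upper', or None for anything else."""
    if ch.isdigit():
        return "digit"
    if ch.islower():
        return "lower"
    if ch.isupper():
        return "upper"
    return None


def alternation_count(s):
    """Count how often consecutive classified characters change class."""
    count = 0
    prev = None
    for ch in s:
        cls = char_class(ch)
        if cls is None:
            continue  # assumption: punctuation and symbols are ignored
        if prev is not None and cls != prev:
            count += 1
        prev = cls
    return count


def looks_like_session_id(s, threshold=4):
    """Assumed decision rule: enough alternations means 'random enough'."""
    return alternation_count(s) >= threshold


print(alternation_count("8fA3kQ9zX1"))  # 9
print(alternation_count("index"))       # 0
print(looks_like_session_id("page2"))   # False
```

An ordinary path segment such as `index` or `page2` changes class rarely, while a generated token such as `8fA3kQ9zX1` changes class at almost every character, which is why a simple count can separate the two.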
12. A device comprising:
a memory to store instructions; and
a processor to execute the instructions to:
receive content from a web site;
extract URLs from the received content;
identify a sub-string occurring in the set of URLs as a session identifier based on the sub-string being associated with at least a particular measure of randomness, when identifying the sub-string, the processor being further to:
determine the particular measure of randomness associated with the sub-string based on a quantity of times characters in the sub-string alternate between numbers, lower case letters, or upper case letters;
generate a clean set of URLs from the extracted set of URLs by removing the session identifier;
store the clean set of URLs; and
determine that a second URL has already been crawled based on the second URL matching a URL in the clean set of URLs.
Dependent claims: 13, 14, 15
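The "generate", "store", and "determine" steps of claim 12 might look like the sketch below. It assumes the session identifier has already been identified and is carried as a query-parameter value, and that the second URL is cleaned the same way before it is compared; `clean_url`, `already_crawled`, and the example URLs are hypothetical names chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def clean_url(url, session_id):
    """Drop every query parameter whose value equals the given session identifier."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if value != session_id
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def already_crawled(url, session_id, clean_set):
    """Assumed matching rule: compare the cleaned form against the stored clean set."""
    return clean_url(url, session_id) in clean_set


# URLs crawled in a first session (hypothetical).
crawled = ["http://example.com/a?sid=8fA3kQ9zX1&page=1"]
clean_set = {clean_url(u, "8fA3kQ9zX1") for u in crawled}

# The same page seen later under a different session identifier.
second = "http://example.com/a?sid=Zz91Lm0pQ4&page=1"
print(already_crawled(second, "Zz91Lm0pQ4", clean_set))  # True
```

Storing the clean set as a hash set keeps the "already crawled" check to a single lookup per URL, which matters when a crawler revisits a site under many different sessions.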
16. A non-transitory computer-readable storage medium storing computer-executable program instructions which, when executed by a processor, perform a method, the instructions comprising:
one or more instructions to extract a set of uniform resource locators (URLs) from at least one document;
one or more instructions to identify a sub-string occurring in the set of URLs as a session identifier based on the sub-string being associated with at least a particular measure of randomness, the one or more instructions to identify the sub-string including:
one or more instructions to determine the particular measure of randomness associated with the sub-string based on a quantity of times characters in the sub-string alternate between numbers, lower case letters, or upper case letters;
one or more instructions to generate a clean set of URLs from the extracted set of URLs by removing the session identifier; and
one or more instructions to determine that a second URL has already been crawled based on the second URL matching a URL in the clean set of URLs.
Dependent claims: 17, 18, 19, 20
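The first step of claim 16, extracting a set of URLs from a document, could be sketched as follows. The sketch assumes the document is HTML and that only `href` attributes of anchor tags are of interest; the class name `LinkExtractor`, the function `extract_urls`, and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                self.urls.append(urljoin(self.base_url, value))


def extract_urls(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls


page = '<a href="/a?sid=8fA3kQ9zX1">A</a><a href="b.html">B</a>'
links = extract_urls(page, "http://example.com/dir/")
print(links)
```

Resolving each link to an absolute URL at extraction time means the later identification and cleaning steps operate on a uniform representation, regardless of how the page wrote its links.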
Specification