Content retrieval from sites that use session identifiers

  • US 8,307,076 B1
  • Filed: 11/03/2010
  • Issued: 11/06/2012
  • Est. Priority Date: 12/23/2003
  • Status: Active Grant
First Claim

1. A method performed by a computer system, the method comprising:

  • extracting, by one or more processors associated with the computer system, a set of uniform resource locators (URLs) from at least one document;

  • identifying, by the one or more processors, a sub-string occurring in the set of URLs as a session identifier based on the sub-string being associated with at least a particular measure of randomness, identifying the sub-string including:

      • determining the particular measure of randomness associated with the sub-string based on a quantity of times characters in the sub-string alternate between numbers, lower case letters, or upper case letters;

  • generating, by the one or more processors, a clean set of URLs from the set of URLs by removing the session identifier; and

  • determining that a second URL has already been crawled based on the second URL matching a URL in the clean set of URLs.
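
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the `href` regular expression, the choice of query-parameter values as candidate sub-strings, and the threshold of 5 alternations are all assumptions made for illustration. The claim itself specifies only that randomness is measured by how often characters alternate between numbers, lower case letters, and upper case letters.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical cutoff: the claim requires "a particular measure of
# randomness" but does not state a value.
ALTERNATION_THRESHOLD = 5


def extract_urls(document):
    """Extract URLs from a document (assumption: double-quoted href attributes)."""
    return re.findall(r'href="([^"]+)"', document)


def char_class(ch):
    """Classify a character as a number, lower case letter, or upper case letter."""
    if ch.isdigit():
        return "digit"
    if ch.islower():
        return "lower"
    if ch.isupper():
        return "upper"
    return None  # punctuation and other characters are ignored (assumption)


def alternation_count(sub_string):
    """Count how many times consecutive characters switch character class."""
    count = 0
    previous = None
    for ch in sub_string:
        current = char_class(ch)
        if current is None:
            continue
        if previous is not None and current != previous:
            count += 1
        previous = current
    return count


def looks_like_session_id(sub_string, threshold=ALTERNATION_THRESHOLD):
    """Treat a sub-string as a session identifier if it alternates often enough."""
    return alternation_count(sub_string) >= threshold


def clean_url(url):
    """Remove query parameters whose values look like session identifiers."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not looks_like_session_id(value)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def already_crawled(second_url, clean_set):
    """Check whether a URL, once cleaned, matches a URL in the clean set."""
    return clean_url(second_url) in clean_set


document = (
    '<a href="http://example.com/page?id=7&sid=a8F3kQ9zT2">page</a> '
    '<a href="http://example.com/about?sid=a8F3kQ9zT2">about</a>'
)
urls = extract_urls(document)
clean_set = {clean_url(url) for url in urls}
```

In this sketch, a later URL such as `http://example.com/page?id=7&sid=Zx91QpL0mN4` carries a different session identifier but reduces to the same clean URL, so `already_crawled` reports it as a duplicate. Restricting candidates to query-parameter values keeps the example short; the claim's wording would also cover session identifiers embedded elsewhere in the URL, such as in the path.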
