
Content retrieval from sites that use session identifiers

  • US 7,886,032 B1
  • Filed: 12/23/2003
  • Issued: 02/08/2011
  • Est. Priority Date: 12/23/2003
  • Status: Active Grant
First Claim

1. A method performed by a computer system, the method comprising:

    extracting, by one or more processors associated with the computer system, a set of uniform resource locators (URLs) from one document or from multiple documents downloaded from a web site;

    identifying, by the one or more processors associated with the computer system, a sub-string occurring in the set of URLs as a session identifier, based on at least one of a plurality of rules and based on multiple occurrences of the sub-string occurring in the set of URLs;

    generating, by the one or more processors associated with the computer system, a clean set of URLs, derived from the set of URLs, by removing the session identifier;

    determining, by the one or more processors associated with the computer system, additional URLs that have already been crawled based on a comparison of a clean set of the additional URLs to the clean set of generated URLs; and

    where the clean set of the additional URLs is generated by removing another session identifier, or the identified session identifier, of the additional URLs.
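
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the single rule used here (a run of 16 or more letters and digits containing at least one digit), the two-occurrence threshold, the function names, and the example.com URLs are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumed rule: a candidate session identifier is a run of 16 or more
# letters/digits that contains at least one digit.
CANDIDATE = re.compile(r"[A-Za-z0-9]{16,}")


def find_session_ids(urls, min_occurrences=2):
    """Return sub-strings that match the rule and occur in several URLs."""
    counts = Counter()
    for url in urls:
        # Count each candidate at most once per URL.
        for token in set(CANDIDATE.findall(url)):
            if any(ch.isdigit() for ch in token):
                counts[token] += 1
    return {tok for tok, n in counts.items() if n >= min_occurrences}


def clean_url(url, session_ids):
    """Remove every known session identifier from a URL."""
    for sid in session_ids:
        url = url.replace(sid, "")
    return url


def already_crawled(additional_urls, crawled_clean, known_ids):
    """Return the additional URLs whose clean form was crawled before.

    The additional URLs are cleaned with the previously identified
    identifiers plus any new identifier found among them.
    """
    ids = known_ids | find_session_ids(additional_urls)
    return [u for u in additional_urls if clean_url(u, ids) in crawled_clean]


# Hypothetical URLs extracted from documents of one site (first visit).
crawled = [
    "http://example.com/index.html?sid=a1b2c3d4e5f6a7b8c9d0",
    "http://example.com/products/list?sid=a1b2c3d4e5f6a7b8c9d0&page=2",
    "http://example.com/about.html?sid=a1b2c3d4e5f6a7b8c9d0",
]
session_ids = find_session_ids(crawled)
crawled_clean = {clean_url(u, session_ids) for u in crawled}

# Hypothetical URLs seen on a later visit, carrying a different identifier.
additional = [
    "http://example.com/about.html?sid=ffee0011223344556677",
    "http://example.com/contact.html?sid=ffee0011223344556677",
]
already = already_crawled(additional, crawled_clean, session_ids)
```

In this sketch, `already` holds only the `about.html` URL: once both identifiers are removed, its clean form matches a clean URL from the first visit, while `contact.html` does not. A real crawler would likely combine several rules and normalize leftover parameters such as the empty `sid=`.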
