Content retrieval from sites that use session identifiers

US 8,307,076 B1
Filed: 11/03/2010
Issued: 11/06/2012
Est. Priority Date: 12/23/2003
Status: Active Grant

Abstract

Session identifiers are automatically identified in uniform resource locators (URLs). The session identifiers may be identified using classification techniques based on whether identical sub-strings are identified in multiple URLs downloaded from a web site. The URLs may then have the session identifiers extracted to generate clean versions of the URLs.
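
As a rough illustration of the "identical sub-strings in multiple URLs" signal, the sketch below splits each URL's path and query into tokens and keeps those that recur across URLs. The delimiter set, the function name `repeated_substrings`, and the `min_urls` threshold are assumptions made for this example and do not come from the patent. Repetition only produces candidates (ordinary tokens such as `html` also repeat), so a further classification step is still needed.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical delimiter set used to break a URL's path and query into tokens.
_TOKEN_SPLIT = re.compile(r"[/?&=;.]+")


def repeated_substrings(urls, min_urls=2):
    """Return tokens that appear in at least `min_urls` distinct URLs."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        # Use a set so a token is counted once per URL.
        tokens = set(_TOKEN_SPLIT.split(parts.path + "?" + parts.query))
        tokens.discard("")
        counts.update(tokens)
    return {token for token, n in counts.items() if n >= min_urls}


urls = [
    "http://example.com/a.html?sid=Ab3dE9xQ7",
    "http://example.com/b.html?sid=Ab3dE9xQ7",
]
candidates = repeated_substrings(urls)  # candidate set includes "Ab3dE9xQ7"
```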

20 Claims

1. A method performed by a computer system, the method comprising:
- extracting, by one or more processors associated with the computer system, a set of uniform resource locators (URLs) from at least one document;
- identifying, by the one or more processors, a sub-string occurring in the set of URLs as a session identifier based on the sub-string being associated with at least a particular measure of randomness, identifying the sub-string including:
  - determining the particular measure of randomness associated with the sub-string based on a quantity of times characters in the sub-string alternate between numbers, lower case letters, or upper case letters;
- generating, by the one or more processors, a clean set of URLs from the set of URLs by removing the session identifier; and
- determining that a second URL has already been crawled based on the second URL matching a URL in the clean set of URLs.
  - 2. The method of claim 1, where determining that the second URL has already been crawled comprises:
    - obtaining a URL;
    - removing the session identifier from the obtained URL to generate a clean URL;
    - identifying the clean URL as being present in the clean set of URLs; and
    - determining that the obtained URL has already been crawled based on the clean URL being present in the clean set of URLs.
  - 3. The method of claim 2, where identifying the clean URL as being present in the clean set of URLs comprises:
    - comparing a fingerprint value of the clean URL to fingerprint values associated with the clean set of URLs.
  - 4. The method of claim 1, where identifying the sub-string occurring in the set of URLs as a session identifier includes:
    - determining the measure of randomness associated with the sub-string by comparing the sub-string to terms in a dictionary.
  - 5. The method of claim 1, where a higher quantity of times the characters in the sub-string alternate between numbers, lower case letters, or upper case letters is associated with a higher measure of randomness.
  - 6. The method of claim 1, where identifying the sub-string occurring in the set of URLs as a session identifier is further based on multiple occurrences of the sub-string in the set of URLs.
  - 7. The method of claim 1, where identifying the sub-string occurring in the set of URLs as a session identifier is further based on determining that the sub-string does not reference content.
  - 8. The method of claim 1, where identifying the sub-string occurring in the set of URLs as a session identifier is further based on determining that the sub-string includes at least a particular quantity of characters.
  - 9. The method of claim 1, where identifying the sub-string occurring in the set of URLs as a session identifier is further based on determining that the sub-string is not part of a domain name.
  - 10. The method of claim 1, where a plurality of URLs in the set of URLs are associated with a same web host.
  - 11. The method of claim 1, further comprising:
    - adding the session identifier to a particular URL, from the clean set of URLs, when the particular URL is to be used to access a document.
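
The randomness measure of claims 1 and 5 and the crawl check of claim 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `alternation_count` and `clean_url`, the threshold of 4 alternations, and the choice to treat every alphanumeric run in the URL as a candidate sub-string are all hypothetical.

```python
import re

_ALNUM_RUN = re.compile(r"[A-Za-z0-9]+")


def _char_class(ch):
    """Classify a character as a number, lower case letter, or upper case letter."""
    if ch.isdigit():
        return "digit"
    if ch.islower():
        return "lower"
    if ch.isupper():
        return "upper"
    return None


def alternation_count(sub):
    """Count how often consecutive characters switch between the three classes."""
    count = 0
    previous = None
    for ch in sub:
        current = _char_class(ch)
        if current is None:
            continue  # characters outside the three classes are ignored
        if previous is not None and current != previous:
            count += 1
        previous = current
    return count


def clean_url(url, min_alternations=4):
    """Remove every alphanumeric run whose alternation count meets the assumed threshold."""
    def strip_if_random(match):
        token = match.group()
        return "" if alternation_count(token) >= min_alternations else token
    return _ALNUM_RUN.sub(strip_if_random, url)


# Clean set built from a first crawl, then a check on a URL seen later.
clean_set = {clean_url("http://example.com/a.html?sid=Ab3dE9xQ7")}
already_crawled = clean_url("http://example.com/a.html?sid=Zq8Lm2Rt5") in clean_set
```

In this sketch a higher alternation count is read as a higher measure of randomness, so `Ab3dE9xQ7` (8 alternations) is stripped while `products` (0 alternations) is kept. Both example URLs therefore reduce to the same clean URL and the second is treated as already crawled.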

12. A device comprising:
- a memory to store instructions; and
- a processor to execute the instructions to:
  - receive content from a web site;
  - extract URLs from the received content;
  - identify a sub-string occurring in the set of URLs as a session identifier based on the sub-string being associated with at least a particular measure of randomness, when identifying the sub-string, the processor being further to:
    - determine the particular measure of randomness associated with the sub-string based on a quantity of times characters in the sub-string alternate between numbers, lower case letters, or upper case letters;
  - generate a clean set of URLs from the extracted set of URLs by removing the session identifier;
  - store the clean set of URLs; and
  - determine that a second URL has already been crawled based on the second URL matching a URL in the clean set of URLs.
  - 13. The device of claim 12, where, when determining that the second URL has already been crawled based on the second URL matching the clean set of URLs, the processor is further to:
    - obtain a URL;
    - remove the session identifier from the obtained URL to generate a clean URL;
    - identify the clean URL as being present in the clean set of URLs; and
    - determine that the obtained URL has already been crawled based on the clean URL being present in the clean set of URLs.
  - 14. The device of claim 12, where, when identifying the sub-string occurring in the set of URLs as a session identifier, the processor is further to:
    - determine the measure of randomness associated with the sub-string by comparing the sub-string to terms in a dictionary.
  - 15. The device of claim 12, where, when identifying the sub-string occurring in the set of URLs as a session identifier, the processor is further to:
    - identify a sub-string occurring in the set of URLs as a session identifier based on at least one of:
      - determining multiple occurrences of the sub-string in the set of URLs,
      - determining that the sub-string does not reference content,
      - determining that the sub-string includes at least a particular quantity of characters, or
      - determining that the sub-string is not part of a domain name.
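
Several of the additional criteria in claims 14 and 15 can be expressed as independent signals, as in the hedged sketch below. The function name `extra_signals`, the tiny `DICTIONARY` set, and the thresholds `min_length=8` and `min_occurrences=2` are placeholders chosen for this example. The "does not reference content" test is omitted because it would require fetching documents, and how the signals are combined is left open, since the claim only requires at least one of them.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for a real dictionary of known terms.
DICTIONARY = {"index", "products", "search", "html", "login"}


def extra_signals(sub, urls, min_length=8, min_occurrences=2):
    """Evaluate each additional criterion separately for a candidate sub-string."""
    hostnames = [(urlsplit(u).hostname or "") for u in urls]
    return {
        # The sub-string occurs in several of the extracted URLs.
        "multiple_occurrences": sum(sub in u for u in urls) >= min_occurrences,
        # The sub-string has at least a particular quantity of characters.
        "long_enough": len(sub) >= min_length,
        # The sub-string is not part of a domain name (hostnames are lower case).
        "outside_domain": all(sub.lower() not in host for host in hostnames),
        # The sub-string does not match a dictionary term.
        "not_in_dictionary": sub.lower() not in DICTIONARY,
    }


urls = [
    "http://example.com/a.html?sid=Ab3dE9xQ7",
    "http://example.com/b.html?sid=Ab3dE9xQ7",
]
session_signals = extra_signals("Ab3dE9xQ7", urls)
domain_signals = extra_signals("example", urls)
```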

16. A non-transitory computer-readable storage medium storing computer-executable program instructions which, when executed by a processor, perform a method, the instructions comprising:
- one or more instructions to extract a set of uniform resource locators (URLs) from at least one document;
- one or more instructions to identify a sub-string occurring in the set of URLs as a session identifier based on the sub-string being associated with at least a particular measure of randomness, the one or more instructions to identify the sub-string including:
  - one or more instructions to determine the particular measure of randomness associated with the sub-string based on a quantity of times characters in the sub-string alternate between numbers, lower case letters, or upper case letters;
- one or more instructions to generate a clean set of URLs from the extracted set of URLs by removing the session identifier; and
- one or more instructions to determine that a second URL has already been crawled based on the second URL matching a URL in the clean set of URLs.
  - 17. The computer-readable storage medium of claim 16, where the one or more instructions to determine that a second URL has already been crawled further comprise:
    - one or more instructions to obtain a URL;
    - one or more instructions to remove the session identifier from the obtained URL to generate a clean URL;
    - one or more instructions to identify the clean URL as being present in the clean set of URLs; and
    - one or more instructions to determine that the obtained URL has already been crawled based on the clean URL being present in the clean set of URLs.
  - 18. The computer-readable storage medium of claim 16, where the one or more instructions to identify a sub-string occurring in the set of URLs as a session identifier include:
    - one or more instructions to identify a sub-string occurring in the set of URLs as a session identifier based on at least one of:
      - determining multiple occurrences of the sub-string in the set of URLs,
      - determining that the sub-string does not reference content,
      - determining that the sub-string includes at least a particular quantity of characters, or
      - determining that the sub-string is not part of a domain name.
  - 19. The computer-readable storage medium of claim 17, where the one or more instructions to identify the clean URL as being present in the clean set of URLs further comprise:
    - one or more instructions to identify the clean URL as being present in the clean set of URLs by comparing a fingerprint value of the clean URL to fingerprint values associated with the clean set of URLs.
  - 20. The computer-readable storage medium of claim 16, where the one or more instructions to identify the sub-string include:
    - one or more instructions to determine the measure of randomness associated with the sub-string by comparing the sub-string to terms in a dictionary.
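
The fingerprint comparison of claims 3 and 19 might look like the sketch below. The claims do not name a fingerprint function, so the use of a truncated SHA-256 digest, the 64-bit width, and the class name `CrawledIndex` are assumptions made purely for illustration.

```python
import hashlib


def fingerprint(clean_url):
    """Map a clean URL to a 64-bit integer (assumed scheme: truncated SHA-256)."""
    digest = hashlib.sha256(clean_url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class CrawledIndex:
    """Stores fingerprints of clean URLs instead of the URLs themselves."""

    def __init__(self):
        self._fingerprints = set()

    def add(self, clean_url):
        self._fingerprints.add(fingerprint(clean_url))

    def contains(self, clean_url):
        # Membership is decided by comparing fingerprint values only.
        return fingerprint(clean_url) in self._fingerprints


index = CrawledIndex()
index.add("http://example.com/a.html?sid=")
seen = index.contains("http://example.com/a.html?sid=")
unseen = index.contains("http://example.com/b.html?sid=")
```

Storing fixed-size fingerprints keeps the index compact, at the cost of a small chance that two different clean URLs share a fingerprint and one is wrongly reported as already crawled.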

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Louz-On, Michal
Primary Examiner(s): TANG, KAREN C
Application Number: US12/938,671
Time in Patent Office: 734 Days
Field of Search: 709/204, 709/224, 709/225
US Class Current: 709/224
CPC Class Codes:
G06F 16/951 Indexing; Web crawling tech...
G06F 16/9566 URL specific, e.g. using al...
