Content retrieval from sites that use session identifiers

US 7,886,032 B1
Filed: 12/23/2003
Issued: 02/08/2011
Est. Priority Date: 12/23/2003
Status: Active Grant

Abstract

Session identifiers are automatically identified in uniform resource locators (URLs). The session identifiers may be identified using classification techniques based on whether identical sub-strings are identified in multiple URLs downloaded from a web site. The URLs may then have the session identifiers extracted to generate clean versions of the URLs.
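
As a rough illustration of the idea in the abstract, the sketch below treats a query-parameter value as a candidate session identifier when the identical value appears in several URLs from the same site, then strips it to produce clean URLs. The function names, the `min_occurrences` and `min_length` thresholds, and the restriction to query-string values are assumptions made for this example; the patent does not specify them, and session identifiers can also be embedded in URL paths.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def find_session_values(urls, min_occurrences=2, min_length=8):
    """Return query values that repeat across URLs (hypothetical heuristic)."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # Count each distinct value at most once per URL.
        values = {v for _, v in parse_qsl(query, keep_blank_values=True)}
        counts.update(values)
    return {
        value
        for value, n in counts.items()
        if n >= min_occurrences and len(value) >= min_length
    }


def clean_url(url, session_values):
    """Drop every query parameter whose value is a suspected session identifier."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if v not in session_values
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Example URLs are invented for illustration.
urls = [
    "http://example.com/a.html?sid=9f8a7c6b5d4e3f21&page=1",
    "http://example.com/b.html?sid=9f8a7c6b5d4e3f21&page=2",
    "http://example.com/c.html?sid=9f8a7c6b5d4e3f21",
]
session_values = find_session_values(urls)
clean = [clean_url(u, session_values) for u in urls]
```

Short values such as `1` and `2` fall below the assumed length threshold, so only the repeated 16-character value is removed.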

22 Claims

1. A method performed by a computer system, the method comprising:
- extracting, by one or more processors associated with the computer system, a set of uniform resource locators (URLs) from one document or from multiple documents downloaded from a web site;
  
  identifying, by the one or more processors associated with the computer system, a sub-string occurring in the set of URLs as a session identifier, based on at least one of a plurality of rules and based on multiple occurrences of the sub-string occurring in the set of URLs;
  
  generating, by the one or more processors associated with the computer system, a clean set of URLs, derived from the set of URLs, by removing the session identifier;
  
  determining, by the one or more processors associated with the computer system, additional URLs that have already been crawled based on a comparison of a clean set of the additional URLs to the clean set of generated URLs; and
  
  where the clean set of the additional URLs is generated by removing another session identifier, or the identified session identifier, of the additional URLs.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, where the comparison of the clean set of the additional URLs to the clean set of generated URLs comprises:
    - calculating a first fingerprint value derived from the clean set of additional URLs and a second fingerprint value derived from the clean set of generated URLs, and where the comparison is based on a comparison of the first fingerprint value with the second fingerprint value.
  - 3. The method of claim 1, where the at least one of the plurality of rules comprises:
    - determining that the sub-string does not reference content.
  - 4. The method of claim 1, where the at least one of the plurality of rules comprises:
    - determining that the sub-string contains characters consistent with a session identifier.
  - 5. The method of claim 1, further comprising:
    - downloading content from the additional URLs when the additional URLs are determined to not already have been crawled.
  - 6. The method of claim 1, further comprising:
    - storing information based on the clean set of URLs for use in later determining whether the additional URLs have already been extracted; and
      
      storing the set of URLs, including embedded session identifiers, for use in later accessing the set of URLs.
  - 7. The method of claim 1, where the at least one of the plurality of rules comprises:
    - determining that the sub-string exhibits at least a specified measure of randomness.
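
The fingerprint comparison of claim 2 and the dual storage of claim 6 might be sketched as follows. This is only one possible reading: the use of SHA-256 as the fingerprint function, the `CrawlHistory` class, and its method names are assumptions for illustration, not details taken from the patent.

```python
import hashlib


def fingerprint(clean_url):
    """Fingerprint a clean URL; SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(clean_url.encode("utf-8")).hexdigest()


class CrawlHistory:
    """Hypothetical store mapping clean-URL fingerprints to the raw URL."""

    def __init__(self):
        self._raw_by_fingerprint = {}

    def mark_crawled(self, clean_url, raw_url):
        # Keep the raw URL (session identifier included) for later access.
        self._raw_by_fingerprint[fingerprint(clean_url)] = raw_url

    def already_crawled(self, clean_url):
        return fingerprint(clean_url) in self._raw_by_fingerprint


history = CrawlHistory()
history.mark_crawled(
    "http://example.com/a.html?page=1",
    "http://example.com/a.html?sid=9f8a7c6b5d4e3f21&page=1",
)

# Clean forms of newly discovered URLs (session identifiers already removed).
candidates = [
    "http://example.com/a.html?page=1",
    "http://example.com/b.html?page=2",
]
to_download = [u for u in candidates if not history.already_crawled(u)]
```

Comparing fingerprints rather than full strings keeps the lookup structure compact, at the cost of a (very small) collision probability.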

8. A method performed by a computer system, the method comprising:
- downloading, by a communication interface associated with the computer system, one or more documents from a web site;
  
  extracting, by one or more processors associated with the computer system, a set of uniform resource locators (URLs) from the downloaded one or more documents;
  
  identifying, by the one or more processors associated with the computer system, a sub-string occurring in the extracted set of URLs as a session identifier, based on the sub-string having a structure consistent with session identifiers and based on multiple occurrences of the sub-string in the extracted set of URLs;
  
  generating, by the one or more processors associated with the computer system, a clean set of URLs from the extracted set of URLs by removing the identified session identifier;
  
  determining, by the one or more processors associated with the computer system, whether additional URLs have already been crawled based on a comparison of a clean set of the additional URLs to the generated clean set of URLs;
  
  where the clean set of the additional URLs is generated by removing another session identifier, or the identified session identifier, of the additional URLs.
- Dependent claims (9, 10, 11):
  - 9. The method of claim 8, further comprising:
    - storing the generated clean set of URLs.
  - 10. The method of claim 9, further comprising:
    - adding a generated session identifier to each URL in the generated clean set of URLs.
  - 11. The method of claim 8, where identifying the sub-string occurring in the extracted set of URLs as a session identifier includes identifying the sub-string as having at least a specified measure of randomness.
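
The claims do not say how the "specified measure of randomness" is computed. One plausible stand-in is per-character Shannon entropy, sketched below; the function names and the 3.5-bit threshold are assumptions chosen for this example.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Return the Shannon entropy of s in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def looks_random(s, threshold=3.5):
    """Assumed rule: treat s as random-looking at or above the threshold."""
    return shannon_entropy(s) >= threshold


token_entropy = shannon_entropy("9f8a7c6b5d4e3f21")  # 3.875 bits per character
word_entropy = shannon_entropy("products")           # 3.0 bits per character
```

Entropy is bounded by the base-2 logarithm of the string length, so a measure like this only separates tokens from ordinary words once the sub-string is reasonably long; a practical rule would likely combine it with a minimum length.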

12. A device comprising:
- a memory to store instructions; and
  
  a processor to execute the instructions to implement:
  
  at least one fetch bot to download content on a network from a single web site;
  
  extract URLs from the downloaded content;
  
  identify a sub-string as a session identifier from the URLs extracted from the downloaded content based on at least one of a plurality of rules and based on multiple occurrences of the sub-string in the extracted URLs;
  
  create a clean set of URLs by removing the session identifier from the extracted URLs;
  
  store the clean set of URLs; and
  
  determine whether additional URLs have already been crawled based on a comparison of a clean set of the additional URLs to the created clean set of URLs, where the clean set of the additional URLs is generated by removing another session identifier, or the identified session identifier, from the additional URLs.
- Dependent claims (13, 14, 15, 16, 17):
  - 13. The device of claim 12, where the processor is further to identify the sub-string as a session identifier based on locating characters consistent with a session identifier in the URLs extracted from the downloaded content.
  - 14. The device of claim 12, further comprising:
    - a database to store the downloaded content.
  - 15. The device of claim 12, where the processor is further to determine whether the additional URLs have previously been stored by comparing the clean set of the additional URLs to the stored clean set of URLs.
  - 16. The device of claim 12, where the session identifier includes characters from the extracted URLs that do not reference content.
  - 17. The device of claim 12, where the processor is further to:
    - identify the session identifier from the extracted URLs based on identifying that the sub-string exhibits at least a specified measure of randomness.
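
A rule based on "characters consistent with a session identifier" (claims 4 and 13) could take many forms. The sketch below assumes two: a parameter name that follows a common session-naming convention (such as PHP's default `PHPSESSID` or the Java servlet `jsessionid`), or a value that is a long run of hexadecimal digits. Both patterns and the function name are illustrative assumptions, not rules stated in the patent.

```python
import re

# Assumed list of conventional session parameter names (case-insensitive).
SESSION_PARAM_NAME = re.compile(
    r"^(sid|session_?id|phpsessid|jsessionid)$", re.IGNORECASE
)
# Assumed value shape: 16 or more hexadecimal digits.
HEX_TOKEN = re.compile(r"^[0-9a-fA-F]{16,}$")


def has_session_structure(name, value):
    """Return True if the parameter name or value matches an assumed pattern."""
    return bool(SESSION_PARAM_NAME.match(name)) or bool(HEX_TOKEN.match(value))


checks = [
    has_session_structure("PHPSESSID", "abc"),
    has_session_structure("token", "9f8a7c6b5d4e3f21"),
    has_session_structure("page", "2"),
]
```

A structural rule like this is cheap but prone to false positives on its own, which may be why the claims pair it with the requirement that the sub-string occur multiple times in the extracted URLs.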

18. A system comprising:
- one or more server devices comprising one or more processors to:
  
  download one or more documents from a web site;
  
  extract a set of uniform resource locators (URLs) from the one or more documents downloaded from the website;
  
  identify a sub-string occurring in the set of URLs as a session identifier, based on the sub-string including characters that are structured consistent with session identifiers and based on multiple occurrences of the sub-string in the set of URLs;
  
  generate a clean set of URLs from the set of URLs by removing the identified sub-string;
  
  determine whether additional URLs have already been crawled based on a comparison of a clean set of the additional URLs to the generated clean set of URLs;
  
  where the clean set of the additional URLs is generated by removing another session identifier, or the identified session identifier, of the additional URLs.
- Dependent claims (19, 20):
  - 19. The system of claim 18, where the one or more processors are further to:
    - add a generated session identifier to each URL in the generated clean set of URLs.
  - 20. The system of claim 18, where the one or more processors are further to:
    - identify the sub-string occurring in the set of URLs as a session identifier based on the sub-string having at least a specified measure of randomness.
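
Claims 10 and 19 add a generated session identifier back onto each clean URL. The claims do not say where that identifier comes from or how it is attached; the sketch below assumes it is appended as a query parameter named `sid` and, for demonstration only, that a fresh random token is acceptable to the site. The function name and parameter name are hypothetical.

```python
import secrets
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_session_id(clean_url, session_id, param="sid"):
    """Append a session identifier to a clean URL as a query parameter."""
    parts = urlsplit(clean_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


# A random 16-hex-digit token stands in for a "generated" identifier.
new_id = secrets.token_hex(8)
fetchable = add_session_id("http://example.com/a.html?page=1", new_id)
```

In practice a site would probably only honor an identifier it issued itself, so a crawler following this pattern would more likely reuse an identifier obtained from a recent response than invent one.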

21. One or more memory devices that include programming instructions executed by one or more processors, where the instructions cause the one or more processors to:
- extract a set of uniform resource locators (URLs) from one document or from multiple documents associated with a single web host;
  
  identify, in the set of URLs, a sub-string as a session identifier based on the sub-string having at least a specified measure of randomness and based on multiple occurrences of the sub-string in the extracted set of URLs; and
  
  generate a clean set of URLs from the extracted set of URLs by removing the identified session identifier;
  
  determine, by the one or more processors associated with the computer system, additional URLs that have already been crawled based on a comparison of a clean set of the additional URLs to the clean set of URLs;
  
  wherein the clean set of the additional URLs is generated by removing another session identifier, or the identified session identifier, of the additional URLs.
- Dependent claims (22):
  - 22. The one or more memory devices of claim 21, where the instructions further cause the one or more processors to:
    - add a generated session identifier to URLs in the clean set of URLs when the URLs are to be used to access a web document.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Louz-On, Michal
Primary Examiner(s): Tang; Karen C
Application Number: US10/743,547
Time in Patent Office: 2,604 Days
Field of Search: 709/223, 709/225, 709/226, 709/227, 709/228, 709/229, 709/230, 705/35
US Class Current: 709/223
CPC Class Codes:
G06F 16/951 Indexing; Web crawling tech...
G06F 16/9566 URL specific, e.g. using al...
