ADAPTING CONTENT REPOSITORIES FOR CRAWLING AND SERVING

US 20130332443A1
Filed: 12/20/2012
Published: 12/12/2013
Est. Priority Date: 06/07/2012
Status: Active Grant

Abstract

A system for searching files stored in a closed file source that is not accessible via a web crawler obtains file identifiers for files stored in the file source and creates a unique URL for each of the identifiers. Each URL may be based on a file identifier and a domain portion of a URL associated with the system. The system may provide the unique URLs to a search engine. The system may respond to a crawl request from the search engine for a particular URL by converting the URL back into a file identifier, obtaining the contents of the file, creating an HTTP response from the contents of the file, and returning the response to the search engine. The system may respond to a request for a seed URL with a plurality of URLs as links in a single HTTP response.
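
The round trip between file identifiers and URLs described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the base URL `http://adaptor.example.com:5678/doc/` and the function names `id_to_url` and `url_to_id` are hypothetical, and percent-encoding is only one possible way to make an identifier HTTP compatible.

```python
from urllib.parse import quote, unquote, urlsplit

# Hypothetical base URL of the adaptor; every generated URL shares its domain portion.
ADAPTOR_BASE = "http://adaptor.example.com:5678/doc/"


def id_to_url(file_id: str) -> str:
    """Build a unique, HTTP-compatible URL from a file identifier."""
    # Percent-encode every reserved character (including "/") so the
    # identifier can be recovered exactly from the URL path.
    return ADAPTOR_BASE + quote(file_id, safe="")


def url_to_id(url: str) -> str:
    """Convert a URL produced by id_to_url back into the file identifier."""
    path = urlsplit(url).path
    prefix = urlsplit(ADAPTOR_BASE).path
    if not path.startswith(prefix):
        raise ValueError(f"not an adaptor URL: {url}")
    return unquote(path[len(prefix):])
```

Because the encoding is reversible, the adaptor in this sketch needs no lookup table to map a crawled URL back to the file it names.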

20 Claims

1. A computer-implemented method comprising:
- obtaining, using an adaptor being executed by at least one processor, file identifiers for files available in a file source, the file source being unavailable to a web crawler of a search engine that is remote from the file source and the adaptor;
  
  creating a uniform resource locator (URL) for each of the file identifiers using the at least one processor, the URL being HTTP compatible;
  
  providing each URL to the search engine;
  
  receiving, by the adaptor, a request for contents associated with a particular URL of the provided URLs from the search engine;
  
  obtaining file content using a file identifier determined based on the particular URL from the file source; and
  
  providing an HTTP response to the search engine, the response comprising the content of the file identified by the file identifier corresponding to the particular URL.
- Dependent claims (2-12):
  - 2. The method of claim 1, wherein the timing and frequency of requests for contents are controlled by the search engine.
  - 3. The method of claim 1, wherein file content is only provided to the search engine after a request is received.
  - 4. The method of claim 1, wherein obtaining the file identifiers includes obtaining file identifiers for files that are modified/updated to the search engine.
  - 5. The method of claim 1, wherein the URLs are pushed to the search engine without the corresponding content of the files.
  - 6. The method of claim 1, further comprising:
    - receiving, by the adaptor, a second request for contents of the particular URL;
      
      determining whether a user initiating the request is authorized to access the file identified by the file identifier corresponding to the particular URL;
      
      providing an HTTP response indicating the user is not authorized when it is determined that the user is not authorized; and
      
      obtaining the file content when it is determined that the user is authorized.
  - 7. The method of claim 1, further comprising:
    - providing a URL for a seed file to the search engine;
      
      receiving, by the adaptor, a request for content associated with the URL for the seed file; and
      
      obtaining the file identifiers in response to receiving the request for content associated with the URL for the seed file, wherein providing each URL to the search engine comprises providing each URL as a link in a response sent to the search engine.
  - 8. The method of claim 7, further comprising:
    - receiving a second request for content associated with the URL for the seed file;
      
      obtaining second file identifiers in response to receiving the second request, wherein the second file identifiers differ from the file identifiers obtained in response to the previous request; and
      
      providing a second HTTP response to the search engine, the second HTTP response comprising URLs created for the file identifiers as links.
  - 9. The method of claim 1, wherein the file source is a database.
  - 10. The method of claim 9, wherein the files are table rows in the database.
  - 11. The method of claim 1, wherein each URL is created from a file identifier and from a domain portion of a URL for the adaptor.
  - 12. The method of claim 1, wherein the adaptor includes a plurality of modules, each module corresponding to a different file source.
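
The request handling recited in claims 1, 6, and 7 can be sketched as a single function. This is a hedged illustration only: the in-memory `FILE_SOURCE`, the `/seed` and `/doc/` paths, the `handle_request` name, and the choice of status code 403 for an unauthorized user are all assumptions, since the claims do not name specific paths or status codes.

```python
from html import escape
from urllib.parse import quote, unquote

BASE = "http://adaptor.example.com"  # assumed adaptor address
SEED_PATH = "/seed"                  # assumed path of the seed URL
DOC_PREFIX = "/doc/"                 # assumed prefix of per-file URLs

# Hypothetical stand-in for a closed file source: id -> (content, allowed users).
FILE_SOURCE = {
    "hr/policy.txt": (b"Vacation policy", {"alice", "crawler"}),
    "finance/q1.txt": (b"Q1 numbers", {"crawler"}),
}


def doc_url(file_id: str) -> str:
    """Create the URL for one file identifier."""
    return BASE + DOC_PREFIX + quote(file_id, safe="")


def handle_request(path: str, user: str):
    """Return (status, content_type, body) for a GET request to the adaptor."""
    if path == SEED_PATH:
        # Seed request: list the identifiers and return every URL as a link.
        links = [
            f'<a href="{escape(doc_url(fid))}">{escape(fid)}</a>'
            for fid in sorted(FILE_SOURCE)
        ]
        body = "<html><body>\n" + "\n".join(links) + "\n</body></html>"
        return 200, "text/html", body.encode("utf-8")
    if path.startswith(DOC_PREFIX):
        # Content request: recover the identifier from the URL path.
        file_id = unquote(path[len(DOC_PREFIX):])
        if file_id not in FILE_SOURCE:
            return 404, "text/plain", b"unknown document"
        content, allowed = FILE_SOURCE[file_id]
        if user not in allowed:
            # Respond that the user is not authorized (403 is an assumption).
            return 403, "text/plain", b"forbidden"
        return 200, "application/octet-stream", content
    return 404, "text/plain", b"not found"
```

In this sketch the file content is read only after the authorization check passes, which mirrors the ordering of the steps in claim 6.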

13. A system comprising:
- at least one processor; and
  
  a memory storing modules comprising:
  
  a lister module configured to cause the at least one processor to provide identifiers for files stored in a file source accessible by the system but unavailable to a web crawler of a search engine that is remote from the file source,
  
  a retriever module configured to cause the at least one processor to provide content of the files stored in the file source using the identifiers, and
  
  an adaptor module configured to cause the at least one processor to perform the following operations:
  
  invoke the lister for a particular file source,
  
  receive file identifiers for files in the file source,
  
  create a uniform resource locator (URL) for each of the file identifiers, each URL including a domain portion of a URL for the system; and
  
  provide each URL to the search engine.
- Dependent claims (14-17):
  - 14. The system of claim 13, wherein the adaptor module further:
    - receives a request for content associated with a particular URL of the provided URLs;
      
      obtains file content associated with a file identifier associated with the URL from the file source; and
      
      provides a valid web response to the search engine, the response comprising the file content.
  - 15. The system of claim 14, wherein the memory further stores a security module that, when executed by the at least one processor, determines whether a user is authorized to access the file associated with the file identifier associated with the particular URL.
  - 16. The system of claim 13, wherein the URLs are provided to the search engine without the corresponding content of the files.
  - 17. The system of claim 13, wherein the file source is a database and the files are table rows from the database.
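
The division into lister, retriever, and adaptor modules in claims 13 and 17 might look like the sketch below, which treats each row of a database table as a file. Everything named here is hypothetical: the `Lister` and `Retriever` protocols, the `SqliteRows` class, the `articles` table, and the base URL are assumptions chosen for illustration, and SQLite merely stands in for whatever database the file source actually is.

```python
import sqlite3
from typing import Iterable, Protocol
from urllib.parse import quote


class Lister(Protocol):
    def list_ids(self) -> Iterable[str]: ...


class Retriever(Protocol):
    def get_content(self, file_id: str) -> bytes: ...


class SqliteRows:
    """Hypothetical lister and retriever treating each `articles` row as a file."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def list_ids(self) -> list:
        rows = self.conn.execute("SELECT id FROM articles ORDER BY id")
        return [str(row[0]) for row in rows]

    def get_content(self, file_id: str) -> bytes:
        row = self.conn.execute(
            "SELECT title, body FROM articles WHERE id = ?", (int(file_id),)
        ).fetchone()
        if row is None:
            raise KeyError(file_id)
        return f"{row[0]}\n{row[1]}".encode("utf-8")


class Adaptor:
    def __init__(self, base_url: str, lister: Lister, retriever: Retriever) -> None:
        self.base_url = base_url
        self.lister = lister
        self.retriever = retriever

    def urls(self) -> list:
        # Invoke the lister, then create one URL per identifier that
        # includes the domain portion of the adaptor's own URL.
        return [self.base_url + quote(fid, safe="") for fid in self.lister.list_ids()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(1, "Intro", "Hello"), (2, "Notes", "World")],
)
source = SqliteRows(conn)
adaptor = Adaptor("http://adaptor.example.com/doc/", source, source)
```

Keeping the lister and retriever behind small interfaces means a different file source could be supported by swapping in another pair of modules without changing the adaptor.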

18. A computer-implemented method of crawling and indexing a closed file source comprising:
- receiving, using a search engine being executed by at least one processor, uniform resource locators (URLs) from an adaptor associated with the file source, wherein the URLs are received without corresponding file contents;
  
  adding the URLs to a crawl list; and
  
  sending a request to the adaptor for the contents associated with a particular URL of the URLs,
  
  wherein the adaptor can identify a file based on the particular URL, and
  
  wherein a web crawler of the search engine cannot access the file without using the adaptor.
- Dependent claims (19, 20):
  - 19. The method of claim 18, wherein the closed file source is a database and the files are table rows from the database.
  - 20. The method of claim 18, wherein the closed file source is a document management system.
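
The search engine side of claim 18 can be sketched as a small crawl list that receives URLs without content and later requests each one from the adaptor. The `SearchEngine` class, its method names, and the `fetch` callable (which stands in for an HTTP GET sent to the adaptor) are assumptions for illustration only.

```python
from collections import deque
from typing import Callable, Iterable


class SearchEngine:
    """Toy search engine holding a crawl list and a URL -> content index."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self.fetch = fetch          # stands in for an HTTP GET to the adaptor
        self.crawl_list = deque()   # URLs waiting to be crawled
        self.seen = set()           # URLs already queued at least once
        self.index = {}             # URL -> fetched content

    def receive_urls(self, urls: Iterable[str]) -> None:
        """Add URLs received from the adaptor (no content attached) to the crawl list."""
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.crawl_list.append(url)

    def crawl_next(self) -> bool:
        """Request the content of the next URL; return False when nothing is queued."""
        if not self.crawl_list:
            return False
        url = self.crawl_list.popleft()
        self.index[url] = self.fetch(url)
        return True


# Hypothetical adaptor stub: maps each URL to the content it would serve.
adaptor_files = {
    "http://adaptor.example.com/doc/1": b"first",
    "http://adaptor.example.com/doc/2": b"second",
}
engine = SearchEngine(fetch=adaptor_files.__getitem__)
engine.receive_urls(adaptor_files)
```

Here the engine decides when `crawl_next` runs, so the timing of content requests stays under the search engine's control rather than the adaptor's.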

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Opalinski, Pawel; Iles, Brandon Player; Anderson, Eric Jon; Felton, John

Granted Patent: US 8,972,375 B2
US Class Current: 707/709
CPC Class Codes

G06F 16/14   Details of searching files ...

G06F 16/951   Indexing; Web crawling tech...

G06F 16/9566   URL specific, e.g. using al...

G06F 16/972   Access to data in other rep...
