Method and apparatus for an application crawler

US 8,954,416 B2
Filed: 03/18/2009
Issued: 02/10/2015
Est. Priority Date: 11/22/2004
Status: Active Grant

Abstract

A computer-implemented method is provided for searching for files on the Internet. In one embodiment, the method may provide an application crawler that assembles and dynamically instantiates all components of a web page. The instantiated web application may then be analyzed to locate desired components on the web page. This may involve finding and analyzing all clickable items in the application, driving the web application by injecting events, and extracting information from the application and writing it to a file or database.

28 Claims

1. A computer-implemented method comprising:
- loading multiple web page components;
  
  assembling the multiple web page components;
  
  identifying a crawling template based on the multiple web page components;
  
  identifying a period of time specified by the crawling template;
  
  crawling, using one or more processors, an object model of the web page components;
  
  identifying and indexing one or more objects that are loaded during crawling;
  
  simulating a user event;
  
  in response to the simulated user event, pausing crawling of the object model for the identified period of time;
  
  continuing to crawl the object model after the identified period of time has elapsed; and
  
  identifying and indexing one or more objects that loaded during the identified period of time that the crawling of the object model was paused.
- View Dependent Claims (2, 3, 4, 5, 27, 28)
- - 2. The method of claim 1, wherein crawling the object model comprises at least one of:
    - following a tree of clickable items and activating items in an automated manner to simulate behavior of a human user;
      
      or following a seed list of pages or applications.
  - 3. The method of claim 1, wherein identifying and indexing comprises at least one of:
    - traversing an object tree in the object model;
      
      traversing an object tree in the object model and simulating mouse, keyboard, or other user events;
      
      traversing an object tree in the object model and recording a location and contents of an object of the object tree;
      
      saving a plurality of uniform resource locators (URLs) associated with media into a database;
      
      or instantiating at least one of video files or video streams.
  - 4. The method of claim 1, wherein identifying and indexing comprises reaching into a node in the object model and performing at least one of:
    - recording the node in a database;
      
      or saving a pointer to the node in a database.
  - 5. The method of claim 1, wherein crawling the object model comprises at least one of:
    - crawling one or more of an intranet, a single machine, or multiple applications on a single machine;
      
      crawling the Internet;
      
      crawling a device on a TCP/IP network;
      
      crawling a device on a public network;
      
      or crawling a device on a private network.
  - 27. The method of claim 1, wherein the period of time is based on the simulated user event.
  - 28. The method of claim 27, wherein the period of time is 30 seconds.
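
The crawl, pause, and resume sequence recited in claim 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: `Node`, `CrawlTemplate`, `walk`, `crawl`, and `build_demo_page` are hypothetical names, the object model is reduced to a small in-memory tree, and the simulated user event is a plain callback. The 30-second figure used in the usage note comes from claim 28.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for one object in a page's object model."""
    name: str
    children: List["Node"] = field(default_factory=list)
    on_click: Optional[Callable[[], None]] = None  # set only for clickable items


@dataclass
class CrawlTemplate:
    """Hypothetical crawling template; only the wait period is modeled."""
    pause_seconds: float


def walk(node: Node) -> Iterator[Node]:
    """Yield every node in the tree, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def crawl(root: Node, template: CrawlTemplate,
          sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Index objects, simulate clicks, pause, then index what loaded meanwhile."""
    index: List[str] = []   # names of indexed objects, in discovery order
    seen = set()            # ids of nodes that were already indexed

    def index_new_objects() -> None:
        for node in walk(root):
            if id(node) not in seen:
                seen.add(id(node))
                index.append(node.name)

    index_new_objects()  # objects loaded during the initial crawl
    # Simplification: only items clickable at the start are activated.
    clickable = [n for n in walk(root) if n.on_click is not None]
    for item in clickable:
        item.on_click()                # simulate a user event
        sleep(template.pause_seconds)  # pause for the template's period
        index_new_objects()            # resume and index newly loaded objects
    return index


def build_demo_page() -> Node:
    """A page whose play button loads a video object when clicked."""
    root = Node("page")
    button = Node("play_button",
                  on_click=lambda: root.children.append(Node("video_stream")))
    root.children.append(button)
    return root
```

Calling `crawl(build_demo_page(), CrawlTemplate(pause_seconds=30.0))` would wait 30 seconds after the simulated click. Passing a replacement for `sleep` lets the sketch run without actually waiting.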

6. A computer-implemented method comprising:
- loading multiple web page components;
  
  assembling the multiple web page components;
  
  executing the loaded and assembled multiple web page components to instantiate at least a portion of the web page components;
  
  identifying a crawling template based on the multiple web page components;
  
  identifying a period of time specified by the crawling template;
  
  crawling, using one or more processors, an object model of the web page components;
  
  indexing one or more objects during crawling;
  
  relating information gathered from crawling and indexing with objects;
  
  simulating a user event;
  
  in response to the simulated user event, pausing crawling of the object model for the identified period of time;
  
  continuing to crawl, using the one or more processors, an updated object model after the identified period of time has elapsed;
  
  indexing at least one object that was loaded during the period of time that the crawling of the object model was paused; and
  
  relating information gathered from crawling and indexing the updated object model with objects that are displayed during the period of time.
- View Dependent Claims (7, 8, 9, 10, 11)
- - 7. The method of claim 6, wherein crawling the object model comprises at least one of:
    - following a tree of clickable items and activating items in an automated manner to simulate behavior of a human user;
      
      or following a seed list of pages or applications.
  - 8. The method of claim 6, wherein indexing the one or more objects comprises at least one of:
    - traversing an object tree in the object model of the at least a portion of the web page components;
      
      reaching into a node in the object model of the at least a portion of the web page components and either saving or recording the node in a database;
      
      saving a plurality of uniform resource locators (URLs) associated with media into a database;
      
      or simulating mouse, keyboard, or other user events.
  - 9. The method of claim 6, wherein crawling the object model comprises at least one of:
    - crawling one or more of an intranet, a single machine, or multiple applications on a single machine;
      
      crawling the Internet;
      
      crawling a device on a TCP/IP network;
      
      crawling a device on a public network;
      
      or crawling a device on a private network.
  - 10. The method of claim 6, wherein indexing one or more objects further comprises instantiating at least one of video files or video streams.
  - 11. The method of claim 6, further comprising adding data-query interfaces to objects in the at least a portion of the web page components.
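
Claim 8 mentions traversing an object tree and saving URLs associated with media. A rough sketch of that traversal follows, assuming the object model is available as nested dictionaries with `tag`, `src`, and `children` keys. That representation, the suffix list, and the function names are all illustrative assumptions, not taken from the patent.

```python
from typing import Iterator, List, Tuple

# Illustrative list of media file suffixes; a real crawler would need more.
MEDIA_SUFFIXES = (".mp4", ".flv", ".wmv", ".mov")


def traverse(node: dict, path: str = "", index: int = 0) -> Iterator[Tuple[str, dict]]:
    """Yield (location, node) for every node, depth first.

    The location is a path of tag names, each with the node's position
    among its siblings, for example /html[0]/body[0]/embed[1].
    """
    here = f"{path}/{node['tag']}[{index}]"
    yield here, node
    for i, child in enumerate(node.get("children", [])):
        yield from traverse(child, here, i)


def media_urls(root: dict) -> List[Tuple[str, str]]:
    """Return (location, url) for each node whose src looks like a media file."""
    found = []
    for location, node in traverse(root):
        url = node.get("src", "")
        if url.lower().endswith(MEDIA_SUFFIXES):
            found.append((location, url))
    return found


# A tiny hypothetical page: one image and one embedded video.
PAGE = {"tag": "html", "children": [
    {"tag": "body", "children": [
        {"tag": "img", "src": "http://example.com/logo.png"},
        {"tag": "embed", "src": "http://example.com/clip.flv"},
    ]},
]}
```

Here `media_urls(PAGE)` returns the single pair `("/html[0]/body[0]/embed[1]", "http://example.com/clip.flv")`, recording both where the object sits in the tree and what it points to.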

12. A computer-implemented method comprising:
- executing multiple web page components to instantiate the multiple web page components;
  
  identifying a crawling template based on the multiple web page components;
  
  identifying a period of time specified by the crawling template;
  
  crawling, using one or more processors, an object model of the web page components;
  
  indexing an object model of the web page components including identifying objects;
  
  simulating a user event;
  
  in response to the simulated user event, pausing crawling of the object model for the identified period of time;
  
  continuing to crawl the object model after the identified period of time has elapsed;
  
  locating video files after the identified period of time has elapsed that loaded during the identified period of time that the crawling of the object model was paused;
  
  indexing the located video files by saving pointers to the video files in a database;
  
  extracting first data about the video files from the object model;
  
  saving the first data in the database;
  
  detecting when a video file has been initiated for playing;
  
  extracting second data as the video file is played; and
  
  relating the second data with objects that were displayed at the same time that the second data was extracted.
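
The database steps in claim 12 (saving pointers to video files, saving first data, and relating second data to objects displayed at the same time) might be sketched as follows. The table layout, column names, and the `timeline` structure of `(start_s, end_s, object_name)` tuples are assumptions made for illustration; the claim does not specify a schema.

```python
import sqlite3
from typing import List, Tuple

Timeline = List[Tuple[float, float, str]]  # (start_s, end_s, object_name)


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Create the two hypothetical tables used by this sketch."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS videos "
               "(url TEXT PRIMARY KEY, title TEXT, duration_s REAL)")
    db.execute("CREATE TABLE IF NOT EXISTS play_events "
               "(url TEXT, at_s REAL, payload TEXT, displayed TEXT)")
    return db


def save_video(db: sqlite3.Connection, url: str, title: str, duration_s: float) -> None:
    """Save a pointer (the URL) plus first data read from the object model."""
    db.execute("INSERT OR REPLACE INTO videos VALUES (?, ?, ?)",
               (url, title, duration_s))


def displayed_at(timeline: Timeline, at_s: float) -> List[str]:
    """Names of objects on screen at playback time at_s."""
    return [name for start, end, name in timeline if start <= at_s < end]


def save_play_event(db: sqlite3.Connection, url: str, at_s: float,
                    payload: str, timeline: Timeline) -> None:
    """Store second data together with the objects displayed at that moment."""
    names = ",".join(displayed_at(timeline, at_s))
    db.execute("INSERT INTO play_events VALUES (?, ?, ?, ?)",
               (url, at_s, payload, names))


db = open_index()
save_video(db, "http://example.com/clip.flv", "Clip", 42.0)
timeline = [(0.0, 10.0, "intro_banner"), (5.0, 40.0, "caption_box")]
save_play_event(db, "http://example.com/clip.flv", 7.5, "caption text", timeline)
row = db.execute("SELECT displayed FROM play_events").fetchone()
```

At 7.5 seconds both hypothetical objects are on screen, so `row` holds `("intro_banner,caption_box",)`.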

13. A computer-implemented method comprising:
- identifying a video-rich website;
  
  identifying a crawling template based on the video-rich website;
  
  identifying a period of time specified by the crawling template;
  
  crawling, using one or more processors, a web page of the identified video-rich website, wherein the crawling comprises:
  
  dynamically instantiating and assembling components of the web page to create an instantiated web application;
  
  identifying specific parts of the instantiated web application that contain useful information;
  
  providing logic for extracting the information into a metadata record by applying data-query interfaces to media player objects in the instantiated web application;
  
  using the data-query interfaces to query the media player objects for media player properties and for metadata about downloaded audio or video streams;
  
  analyzing the instantiated web application to extract information from the web application;
  
  writing the extracted information to a file or database;
  
  simulating a user event;
  
  in response to the simulated user event, pausing crawling of the web page; and
  
  continuing to crawl the web page after the period of time has elapsed.
- View Dependent Claims (14, 15, 16)
- - 14. The method of claim 13, wherein crawling the web page of the identified video-rich website includes executing code for at least one of the following:
    - a Document Object Model (DOM) implementation for a browser;
      
      a scripting engine capable of executing JavaScript, JScript, ECMAScript or VBScript;
      
      an XML parsing engine;
      
      a Cascading Style Sheet engine;
      
      a network I/O library;
      
      an HTML parsing and rendering engine;
      
      an engine for executing embedded controls;
      
      or an engine for rendering web applications.
  - 15. The method of claim 13 wherein crawling the web page of the identified video-rich website includes executing code for at least one of the following:
    - an XSL engine;
      
      an XPath implementation;
      
      a regular expression engine;
      
      a script execution engine;
      
      an embedded object inspector for components;
      
      a network transport proxy;
      
      a multimedia stream proxy;
      
      a software bridge to process data with class libraries of external programming frameworks;
      
      a taxonomy engine for categorizing metadata;
      
      or a text parsing and processing engine.
  - 16. The method of claim 13, wherein crawling the web page of the identified video-rich website includes executing code for at least one of the following:
    - a file I/O library;
      
      a network I/O library;
      
      or a library for generating and storing logfiles.
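
Claim 13 recites applying data-query interfaces to media player objects and querying them for player properties and stream metadata. One way to picture this is a thin adapter around a player object, sketched below. `FakePlayer`, `PlayerQueryAdapter`, and `build_metadata_record` are hypothetical names, and the player's fields are invented for the example; real embedded players expose their own, different interfaces.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol


class DataQuery(Protocol):
    """Anything the crawler can ask for a named value."""
    def query(self, key: str) -> Any: ...


@dataclass
class FakePlayer:
    """Hypothetical media player object embedded in a page."""
    url: str
    state: str
    stream_info: Dict[str, Any]


class PlayerQueryAdapter:
    """Adds a uniform query method on top of a player object."""

    _PROPERTIES = ("url", "state")  # player properties exposed directly

    def __init__(self, player: FakePlayer) -> None:
        self._player = player

    def query(self, key: str) -> Any:
        if key in self._PROPERTIES:
            return getattr(self._player, key)
        # Anything else is looked up in the stream metadata; None if absent.
        return self._player.stream_info.get(key)


def build_metadata_record(source: DataQuery, keys: List[str]) -> Dict[str, Any]:
    """Collect the requested values into one metadata record."""
    return {key: source.query(key) for key in keys}


player = FakePlayer("http://example.com/stream.wmv", "playing",
                    {"title": "Evening News", "bitrate_kbps": 300})
record = build_metadata_record(PlayerQueryAdapter(player),
                               ["url", "state", "title", "author"])
```

The resulting `record` maps `author` to `None` because the hypothetical stream carries no such field. A real crawler would decide per template whether a missing value is acceptable.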

17. A non-transitory computer-readable storage medium including a set of instructions that, when executed, cause at least one processor to perform steps comprising:
- loading multiple web page components;
  
  assembling the multiple web page components;
  
  identifying a crawling template based on the multiple web page components;
  
  identifying a period of time specified by the crawling template;
  
  executing the loaded and assembled multiple web page components to instantiate at least a portion of the application;
  
  crawling an object model of the web page components;
  
  indexing one or more objects loaded during crawling;
  
  relating information gathered from crawling and indexing the object model with objects that are displayed;
  
  simulating a user event;
  
  in response to the simulated user event, pausing crawling of the object model for the identified period of time;
  
  continuing to crawl and index an object model after the identified period of time has elapsed;
  
  indexing at least one object that loaded during the period of time that the crawling of the object model was paused; and
  
  relating information gathered from crawling and indexing the updated object model with objects that displayed during the period of time that the crawling of the object model was paused.
- View Dependent Claims (18, 19, 20, 21, 22)
- - 18. The computer-readable storage medium as recited in claim 17, further comprising instructions that, when executed, cause the at least one processor to perform the step of identifying one or more web sites for inspection.
  - 19. The computer-readable storage medium as recited in claim 18, wherein the one or more web sites identified for inspection contain at least one video file or at least one media file.
  - 20. The computer-readable storage medium as recited in claim 17, further comprising using the crawling template to identify specific parts of the web page components that contain useful information.
  - 21. The computer-readable storage medium as recited in claim 17, further comprising instructions that, when executed, cause the at least one processor to perform the steps of:
    - inspecting a tree of clickable items; and
      
      activating each item in an automated manner to simulate behavior of a human user.
  - 22. The computer-readable storage medium as recited in claim 17, further comprising instructions that, when executed, cause the at least one processor to perform the step of:
    - using the crawling template for at least one of the following:
      
      data extraction, timing of when to follow a link, depth to crawl, how to skip a commercial, where to start crawling, finding links, location of title, location of media file metadata, temporal synchronization, or waiting certain time intervals before crawling the object model again.
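
Claim 22 lists several things a crawling template may govern, such as where to start, how deep to crawl, where the title is, and how long to wait. A minimal sketch of such a template as a data structure, with a lookup step, is given below. The field names and example hosts are assumptions. Selecting a template by host name is a simplification chosen for brevity, since the claims identify the template based on the web page components.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import List
from urllib.parse import urlparse


@dataclass(frozen=True)
class CrawlTemplate:
    """Hypothetical template; fields loosely follow the uses listed in claim 22."""
    host_pattern: str                  # which sites this template applies to
    start_path: str = "/"              # where to start crawling
    max_depth: int = 2                 # depth to crawl
    wait_seconds: float = 30.0         # interval before crawling the model again
    title_location: str = ".//title"   # location of title
    link_location: str = ".//a"        # where to find links


TEMPLATES: List[CrawlTemplate] = [
    CrawlTemplate("*.video.example.com", start_path="/latest", max_depth=3),
    CrawlTemplate("news.example.org", wait_seconds=10.0),
]

# Fallback used when no site-specific template matches.
DEFAULT_TEMPLATE = CrawlTemplate("*", max_depth=1, wait_seconds=5.0)


def select_template(url: str, templates: List[CrawlTemplate] = TEMPLATES) -> CrawlTemplate:
    """Return the first template whose host pattern matches the URL's host."""
    host = urlparse(url).hostname or ""
    for template in templates:
        if fnmatch(host, template.host_pattern):
            return template
    return DEFAULT_TEMPLATE
```

For example, `select_template("http://tv.video.example.com/watch?id=1")` returns the first template, whose `wait_seconds` is 30.0, while an unknown host falls back to `DEFAULT_TEMPLATE`.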

23. A computer system comprising:
- at least one processor; and
  
  at least one non-transitory computer readable storage medium storing instructions thereon that, when executed by the at least one processor, cause the system to:
  
  identify a video-rich website;
  
  identify a crawling template based on the video-rich website;
  
  identify a period of time specified by the crawling template;
  
  crawl a video-rich website, wherein the crawling comprises:
  
  dynamically instantiate and assemble components of a web page at the video-rich website to create an instantiated web application;
  
  identify specific parts of the instantiated web application that contain useful information in accordance with a template;
  
  provide logic for extracting that information into a metadata record;
  
  analyze the instantiated web application to extract information from the instantiated web application;
  
  write the extracted information to a file or database;
  
  simulate a user event;
  
  in response to the simulated user event, pause crawling of the web page; and
  
  continue to crawl the web page after the period of time has elapsed.
- View Dependent Claims (24, 25, 26)
- - 24. The computer system of claim 23, wherein the system includes executable code for at least one of the following:
    - a Document Object Model (DOM) implementation for a browser;
      
      a scripting engine capable of executing JavaScript, JScript, ECMAScript or VBScript;
      
      an XML parsing engine;
      
      a Cascading Style Sheet engine;
      
      a network I/O library;
      
      an HTML parsing and rendering engine;
      
      an engine for executing embedded controls;
      
      or an engine for rendering web applications.
  - 25. The computer system of claim 23, wherein the system includes executable code for at least one of the following:
    - an XSL engine;
      
      an XPath implementation;
      
      a regular expression engine;
      
      a script execution engine;
      
      an embedded object inspector for components;
      
      a network transport proxy;
      
      a multimedia stream proxy;
      
      a software bridge to process data with class libraries of external programming frameworks;
      
      a taxonomy engine for categorizing metadata;
      
      or a text parsing and processing engine.
  - 26. The computer system of claim 23, wherein the system includes executable code for at least one of the following:
    - a file I/O library;
      
      a network I/O library;
      
      or a library for generating and storing logfiles.
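
Claims 23 through 26 describe template-guided extraction into a metadata record, with XPath and regular expression engines among the listed components. The sketch below shows one possible shape of that step using only the Python standard library. The template keys, the page markup, and the duration pattern are invented for illustration. `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup, so a real crawler would need a more tolerant HTML parser.

```python
import json
import re
import xml.etree.ElementTree as ET
from typing import Any, Dict

# Hypothetical template: where the useful parts of the page are located.
TEMPLATE = {
    "title": ".//h1[@class='title']",
    "media": ".//embed",
}

# Matches a running time written as minutes:seconds, for example 3:07.
DURATION_RE = re.compile(r"(\d+):(\d{2})")


def extract_record(markup: str, template: Dict[str, str] = TEMPLATE) -> Dict[str, Any]:
    """Extract title, media URL, and duration into one metadata record."""
    root = ET.fromstring(markup)
    title_el = root.find(template["title"])
    media_el = root.find(template["media"])
    match = DURATION_RE.search("".join(root.itertext()))
    return {
        "title": title_el.text.strip()
        if title_el is not None and title_el.text else None,
        "media_url": media_el.get("src") if media_el is not None else None,
        "duration_s": int(match.group(1)) * 60 + int(match.group(2))
        if match else None,
    }


PAGE = """<html><body>
<h1 class="title"> Launch Highlights </h1>
<p>Length 3:07</p>
<embed src="http://example.com/launch.flv"/>
</body></html>"""

record = extract_record(PAGE)
# Serialized form that could be written to a file or database.
record_json = json.dumps(record, sort_keys=True)
```

For the sample page, `record` contains the title `Launch Highlights`, the media URL from the `embed` element, and a duration of 187 seconds.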

Current Assignee: Meta Platforms, Inc. (f/k/a Facebook, Inc.)
Original Assignee: Meta Platforms, Inc. (f/k/a Facebook, Inc.)
Inventors: Tuttle, Timothy D.; Beguelin, Adam L.; Kocks, Peter F.
Primary Examiner(s): Lewis, Alicia
Application Number: US12/406,404
Publication Number: US 20090216758A1
Time in Patent Office: 2,155 Days
Field of Search: 707/709, 707/711
US Class Current: 707/709
CPC Class Codes: G06F 16/951 Indexing; Web crawling tech...
