Automatic online video discovery and indexing

US 8,473,574 B2
Filed: 05/20/2010
Issued: 06/25/2013
Est. Priority Date: 05/20/2010
Status: Active Grant

Abstract

A classifier may be integrated into a pipeline of a general web crawler. The classifier may classify crawled webpages as either video pages or non-video pages. Video pages and information regarding domain importance may be aggregated. Ones of the domains of the video pages may be selected based on domain importance rankings. Webpages of the selected domains may be randomly sampled. The sampled webpages may be structurally analyzed and hint information may be generated with respect to each of the selected domains. The hint information may guide a deep crawling operation for discovering all video pages within the selected domains. Video links within the video pages may be found, one or more videos may be downloaded, and one or more representations of the one or more videos may be indexed.
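The domain selection and random sampling steps described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `select_domains` and `sample_pages` are hypothetical, and it assumes domain importance is approximated only by the number of classified video pages per domain, which is one of several signals the claims mention.

```python
import random
from collections import Counter
from urllib.parse import urlparse


def select_domains(video_page_urls, top_k):
    """Rank domains by how many classified video pages they host; keep top_k.

    Assumption: importance is approximated by the video page count alone.
    """
    counts = Counter(urlparse(url).netloc for url in video_page_urls)
    return [domain for domain, _ in counts.most_common(top_k)]


def sample_pages(pages_by_domain, domains, sample_size, seed=None):
    """Randomly sample up to sample_size known pages from each selected domain."""
    rng = random.Random(seed)
    samples = {}
    for domain in domains:
        pages = pages_by_domain.get(domain, [])
        samples[domain] = rng.sample(pages, min(sample_size, len(pages)))
    return samples
```

The sampled pages would then feed the structural analysis that produces the per-domain hint information.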

20 Claims

1. A method for discovering and indexing online video content, the method comprising:
- classifying each of a plurality of webpages as either a video page or a non-video page;
  
  aggregating ones of the plurality of webpages classified as a video page;
  
  determining a respective domain importance ranking for each of a plurality of domains of the ones of the plurality of webpages classified as a video page;
  
  selecting ones of the plurality of domains based on the respective domain importance rankings;
  
  randomly sampling webpages of the selected ones of the plurality of domains;
  
  automatically, for each of the selected ones of the plurality of domains, forming page groups, based, at least in part, on layouts and visual patterns of the randomly sampled webpages;
  
  generating hint information for each of the selected ones of the plurality of domains based, at least in part, on attributes of corresponding page groups for guiding a deep crawling operation of the selected ones of the plurality of domains;
  
  using the hint information to guide the deep crawling operation of the selected ones of the plurality of domains to discover video pages in each of the selected ones of the plurality of domains; and
  
  indexing the discovered video pages, wherein the method is implemented on an electronic computing device.
- - 2. The method of claim 1, wherein the indexing the discovered video pages further comprises:
    - sending the discovered video pages to a service executing within a dynamic rendering environment,
      
      receiving results from the service,
      
      downloading video based on the received results, and
      
      adding a representation of the video to a video search index.
  - 3. The method of claim 2, wherein:
    - the dynamic rendering environment includes a browser executing in a virtual machine, and
      
      the dynamic rendering environment is implemented on either the electronic computing device or a second electronic computing device having a communication connection with the electronic computing device.
  - 4. The method of claim 3, further comprising:
    - using, by the service, the browser included in the dynamic rendering environment to discover at least one video link within the discovered video pages; and
      
      providing, by the service to the electronic computing device, the results including the at least one discovered video link.
  - 5. The method of claim 1, further comprising:
    - crawling the plurality of webpages with a general web crawler, wherein the classifying is performed on each of the crawled plurality of webpages.
  - 6. The method of claim 1, wherein the classifying each of a plurality of webpages further comprises:
    - determining whether a respective webpage of the plurality of webpages includes video player information,
      
      determining whether the video player information included in the respective webpage is to be used for playing an advertisement, and
      
      classifying the respective webpage as a video page when the respective webpage is determined to include the video player information and the video player information is not to be used for playing an advertisement.
  - 7. The method of claim 1, wherein the determining a respective domain importance ranking for each of a plurality of domains of the ones of the plurality of webpages classified as a video page further comprises:
    - determining the respective domain importance ranking for each of the selected ones of the plurality of domains based on at least one item from a group including a number of video pages discovered in a respective domain of the plurality of domains, a number of download requests made for video pages included in the respective domain, and information obtained through public application program interfaces.
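The classification rule of claim 6 can be illustrated with a short sketch. It assumes that "video player information" is available as a list of player embed URLs extracted from the page, and that advertisement players are recognized by host name. The `AD_PLAYER_HOSTS` set and both function names are hypothetical; the claim does not say how an advertisement player is identified.

```python
from urllib.parse import urlparse

# Hypothetical list of hosts that serve advertisement players.
AD_PLAYER_HOSTS = frozenset({"ads.example.com"})


def is_ad_player(player_url):
    """Return True when the player embed URL points at a known ad host."""
    return urlparse(player_url).netloc in AD_PLAYER_HOSTS


def classify_page(player_urls):
    """Label a page "video" if it embeds at least one non-advertisement player."""
    if any(not is_ad_player(url) for url in player_urls):
        return "video"
    return "non-video"
```

A page with no player information at all falls through to "non-video", matching the two-way classification in claim 1.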

8. An electronic computing device for discovering and indexing online video content, the electronic computing device comprising at least one processor programmed to implement instructions to:
- crawl a plurality of webpages;
  
  classify each of the plurality of webpages as either a video page or a non-video page;
  
  aggregate ones of the plurality of webpages classified as a video page by the classifier;
  
  select ones of a plurality of domains based on respective domain importance rankings, each of the plurality of domains including at least one of the plurality of webpages classified as a video page;
  
  sample webpages of each of the selected ones of the plurality of domains;
  
  assign each of the sampled webpages to a respective one of a plurality of page groups of a respective one of the selected ones of the plurality of domains based on a layout and a visual pattern of each of the sampled webpages;
  
  analyze a structure of each of the selected ones of the plurality of domains to determine relationships among the plurality of page groups of each of the selected ones of the plurality of domains;
  
  generate hint information for guiding a deep crawling operation with respect to each of the selected ones of the plurality of domains based on attributes and relationships among the sampled webpages of each of the selected ones of the plurality of domains; and
  
  index video pages discovered during the deep crawling operation.
- - 9. The electronic computing device of claim 8, wherein the instructions to index include instructions to:
    - provide the video pages discovered during the deep crawling operation to a service executing in a virtual machine, and
      
      receive results from the service, the results including at least one video link for downloading a video.
  - 10. The electronic computing device of claim 9, wherein the instructions to index include instructions to:
    - generate a predetermined amount of a smart motion thumbnail based on using one of the at least one video link to download the video; and
      
      index the generated predetermined amount of the smart motion thumbnail.
  - 11. The electronic computing device of claim 9, wherein the service is implemented in a dynamic rendering environment comprising a browser executing on the virtual machine.
  - 12. The electronic computing device of claim 8, wherein the instructions to crawl include instructions to crawl the plurality of webpages in an order based on a respective static rank of each of the plurality of webpages.
  - 13. The electronic computing device of claim 8, wherein the instructions to classify include instructions to classify a webpage as a video page when the webpage includes video player information and a footprint of the video player information included in the webpage indicates that the video player information is for playing video content other than an advertisement.
  - 14. The electronic computing device of claim 8, wherein the instructions to index include instructions to detect duplicate video pages in order to prevent indexing of the duplicate video pages.
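Claim 8 assigns each sampled webpage to a page group based on its layout and visual pattern. A rough sketch of layout-only grouping is shown below. It assumes that two pages share a layout when their sequences of opening HTML tags are identical, which is a much cruder test than the claim implies and ignores visual patterns entirely. The names `layout_signature` and `group_pages` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Records the name of every opening tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def layout_signature(html):
    """Return the page's opening-tag sequence as a hashable signature."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return tuple(collector.tags)


def group_pages(pages):
    """Group URLs whose pages have the same layout signature.

    pages maps URL -> HTML source; the result is a list of URL lists.
    """
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[layout_signature(html)].append(url)
    return list(groups.values())
```

Attributes of the resulting groups, such as which group contains the known video pages, could then serve as input to hint generation.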

15. At least one machine-readable storage device having information recorded thereon for at least one processor of an electronic computing device, the information comprising:
- instructions for web crawling a plurality of webpages;
  
  instructions for classifying each of the plurality of webpages as either a video page or a non-video page, each of the plurality of webpages being included in a corresponding domain;
  
  instructions for aggregating each of the plurality of webpages classified as a video page and corresponding domain importance ranking information;
  
  instructions for selecting a plurality of domains based on the aggregated corresponding domain importance ranking information;
  
  instructions for randomly sampling webpages of each of the selected plurality of domains to obtain structural information;
  
  instructions for grouping, for each of the selected plurality of domains, ones of the randomly sampled webpages of a corresponding domain into a plurality of page groups based on similarities among the ones of the randomly sampled webpages;
  
  instructions for analyzing the obtained structural information of the randomly sampled webpages of each of the selected plurality of domains;
  
  instructions for generating hint information for each of the selected plurality of domains based on the obtained structural information and corresponding ones of the plurality of page groups;
  
  instructions for using the generated hint information to perform a deep crawling operation of each of the selected plurality of domains to discover all video pages in each of the selected plurality of domains; and
  
  instructions for indexing representations of all of the discovered video pages in each of the selected plurality of domains.
- - 16. The at least one device of claim 15, further comprising:
    - instructions for providing the discovered video pages of each of the selected plurality of domains to a service executing on a virtual machine;
      
      instructions for receiving results from the service, the results including at least one video link obtained from the discovered video pages of ones of the selected plurality of domains;
      
      instructions for using the at least one video link to download a video;
      
      instructions for generating a smart motion thumbnail of a predetermined length based on the downloaded video; and
      
      instructions for indexing the generated smart motion thumbnail.
  - 17. The at least one device of claim 16, further comprising:
    - instructions for implementing the service on a virtual machine.
  - 18. The at least one device of claim 17, wherein the virtual machine is reloaded after every predetermined period of time.
  - 19. The at least one device of claim 17, further comprising:
    - instructions for the service to use a browser, executing on the virtual machine, to make a request based on a video link of the at least one video link;
      
      instructions for the service to examine a file extension included in the request and a type of a file included in a response to the request to determine whether the response includes the downloaded video; and
      
      instructions for including, in the results of the service, the video link when the response is determined to include the downloaded video.
  - 20. The at least one device of claim 19, wherein the browser includes an Internet Explorer® browser.
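The check described in claim 19, which examines the file extension in the request and the type of file in the response, might look like the sketch below. It assumes the response type is available as an HTTP `Content-Type` header value and that both signals must agree. The claim does not specify how the two are combined, and the `VIDEO_EXTENSIONS` set and the function name are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Assumed set of extensions treated as video files.
VIDEO_EXTENSIONS = frozenset({".mp4", ".flv", ".wmv", ".webm"})


def response_contains_video(request_url, content_type):
    """Return True when both the URL extension and the response type indicate video."""
    # Take the extension from the URL path only, ignoring any query string.
    extension = posixpath.splitext(urlparse(request_url).path)[1].lower()
    # Drop parameters such as "; charset=..." from the Content-Type value.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return extension in VIDEO_EXTENSIONS and media_type.startswith("video/")
```

Only links that pass this check would be included in the results returned by the service.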

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Kong, Xiao; Yu, Shouqiu; Wang, Wei; Yang, Jiang-Ming; Cai, Rui; Li, Haifeng; Yang, Xiaosong
Primary Examiner(s): Gorney, Boris
Application Number: US12/783,620
Publication Number: US 20110289182A1
Time in Patent Office: 1,132 Days
Field of Search: 709/217, 707/706, 707/709, 707/711
US Class Current: 709/217
CPC Class Codes:
G06F 16/70 of video data
G06F 16/951 Indexing; Web crawling tech...
