Automatic online video discovery and indexing
First Claim
1. A method for discovering and indexing online video content, the method comprising:
classifying each of a plurality of webpages as either a video page or a non-video page;
aggregating ones of the plurality of webpages classified as a video page;
determining a respective domain importance ranking for each of a plurality of domains of the ones of the plurality of webpages classified as a video page;
selecting ones of the plurality of domains based on the respective domain importance rankings;
randomly sampling webpages of the selected ones of the plurality of domains;
automatically, for each of the selected ones of the plurality of domains, forming page groups, based, at least in part, on layouts and visual patterns of the randomly sampled webpages;
generating hint information for each of the selected ones of the plurality of domains based, at least in part, on attributes of corresponding page groups for guiding a deep crawling operation of the selected ones of the plurality of domains;
using the hint information to guide the deep crawling operation of the selected ones of the plurality of domains to discover video pages in each of the selected ones of the plurality of domains; and
indexing the discovered video pages, wherein the method is implemented on an electronic computing device.
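The aggregation, domain selection, and random sampling steps of the claim can be sketched as follows. This is only an illustrative sketch: the function names, the `(url, domain, is_video)` tuple layout, the numeric importance scores, and the `top_k` and `sample_size` parameters are assumptions made for the example and are not taken from the claim.

```python
import random
from collections import defaultdict


def group_video_pages_by_domain(classified_pages):
    """Aggregate URLs classified as video pages, keyed by their domain.

    classified_pages: iterable of (url, domain, is_video) tuples (assumed layout).
    """
    by_domain = defaultdict(list)
    for url, domain, is_video in classified_pages:
        if is_video:
            by_domain[domain].append(url)
    return dict(by_domain)


def select_domains(video_pages_by_domain, importance, top_k):
    """Keep the top_k domains by importance score (higher means more important).

    The scoring scheme itself is hypothetical; domains without a score get 0.0.
    """
    ranked = sorted(
        video_pages_by_domain,
        key=lambda domain: importance.get(domain, 0.0),
        reverse=True,
    )
    return ranked[:top_k]


def sample_pages(all_pages_by_domain, selected_domains, sample_size, seed=0):
    """Randomly sample up to sample_size known pages from each selected domain."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    samples = {}
    for domain in selected_domains:
        pages = all_pages_by_domain.get(domain, [])
        samples[domain] = rng.sample(pages, min(sample_size, len(pages)))
    return samples
```

A caller would pass the classifier output to `group_video_pages_by_domain`, rank the resulting domains with `select_domains`, and hand the sampled pages to the page-grouping step.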
Abstract
A classifier may be integrated into a pipeline of a general web crawler. The classifier may classify crawled webpages as either video pages or non-video pages. Video pages and information regarding domain importance may be aggregated. Ones of the domains of the video pages may be selected based on domain importance rankings. Webpages of the selected domains may be randomly sampled. The sampled webpages may be structurally analyzed and hint information may be generated with respect to each of the selected domains. The hint information may guide a deep crawling operation for discovering all video pages within the selected domains. Video links within the video pages may be found, one or more videos may be downloaded, and one or more representations of the one or more videos may be indexed.
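As a rough illustration of finding video links within a page, the sketch below scans HTML for a few common video-bearing tags. The abstract does not say which features the classifier uses, so the tag list, the file extensions, and the rule "a page with at least one video link is a video page" are all assumptions for this example; for instance, a `source` tag inside an `audio` element would be counted too.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed set of file extensions that indicate a direct video link.
VIDEO_EXTENSIONS = (".mp4", ".webm", ".flv", ".mov")


class VideoLinkParser(HTMLParser):
    """Collects absolute URLs that look like video links (heuristic only)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.video_links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        src = None
        if tag in ("video", "source", "embed"):
            src = attr_map.get("src")
        elif tag == "a":
            href = attr_map.get("href")
            if href and href.lower().endswith(VIDEO_EXTENSIONS):
                src = href
        if src:
            # Resolve relative links against the URL of the page itself.
            self.video_links.append(urljoin(self.base_url, src))


def find_video_links(html, base_url):
    """Return the video links found in html, in document order."""
    parser = VideoLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.video_links


def is_video_page(html, base_url):
    """Toy stand-in for the classifier: any video link makes a video page."""
    return bool(find_video_links(html, base_url))
```

A production classifier would likely rely on richer signals than tag presence; this heuristic only shows where such a component sits between crawling and aggregation.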
20 Claims
8. An electronic computing device for discovering and indexing online video content, the electronic computing device comprising at least one processor programmed to implement instructions to:
crawl a plurality of webpages;
classify each of the plurality of webpages as either a video page or a non-video page;
aggregate ones of the plurality of webpages classified as a video page by the classifier;
select ones of a plurality of domains based on respective domain importance rankings, each of the plurality of domains including at least one of the plurality of webpages classified as a video page;
sample webpages of each of the selected ones of the plurality of domains;
assign each of the sampled webpages to a respective one of a plurality of page groups of a respective one of the selected ones of the plurality of domains based on a layout and a visual pattern of each of the sampled webpages;
analyze a structure of each of the selected ones of the plurality of domains to determine relationships among the plurality of page groups of each of the selected ones of the plurality of domains;
generate hint information for guiding a deep crawling operation with respect to each of the selected ones of the plurality of domains based on attributes and relationships among the sampled webpages of each of the selected ones of the plurality of domains; and
index video pages discovered during the deep crawling operation.
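The page-group assignment and the relationship analysis in claim 8 can be pictured with a small sketch. It uses the set of distinct HTML tags on a page as a crude layout signature and counts links between groups as their "relationships". Both choices are assumptions made for illustration: the claim does not define how layout is compared, and visual patterns are not modeled here at all.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every start tag seen in a page."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def layout_signature(html):
    """Crude layout signature: the sorted set of distinct tag names (assumed)."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return tuple(sorted(set(collector.tags)))


def form_page_groups(pages):
    """Group sampled pages of one domain by layout signature.

    pages: dict mapping url -> html. Returns dict mapping signature -> urls.
    """
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[layout_signature(html)].append(url)
    return dict(groups)


def group_relationships(groups, links):
    """Count links from pages of one group to pages of another group.

    links: dict mapping url -> list of linked urls (hypothetical input).
    Links to pages outside the sample are ignored.
    """
    group_of = {url: sig for sig, urls in groups.items() for url in urls}
    edges = Counter()
    for src, targets in links.items():
        for dst in targets:
            if src in group_of and dst in group_of:
                edges[(group_of[src], group_of[dst])] += 1
    return edges
```

An edge count such as "list pages link to watch pages" is the kind of relationship that could later feed hint generation, for example by marking list-style groups as worth following during the deep crawl.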
15. At least one machine-readable storage device having information recorded thereon for at least one processor of an electronic computing device, the information comprising:
instructions for web crawling a plurality of webpages;
instructions for classifying each of the plurality of webpages as either a video page or a non-video page, each of the plurality of webpages being included in a corresponding domain;
instructions for aggregating each of the plurality of webpages classified as a video page and corresponding domain importance ranking information;
instructions for selecting a plurality of domains based on the aggregated corresponding domain importance ranking information;
instructions for randomly sampling webpages of each of the selected plurality of domains to obtain structural information;
instructions for grouping, for each of the selected plurality of domains, ones of the randomly sampled webpages of a corresponding domain into a plurality of page groups based on similarities among the ones of the randomly sampled webpages;
instructions for analyzing the obtained structural information of the randomly sampled webpages of each of the selected plurality of domains;
instructions for generating hint information for each of the selected plurality of domains based on the obtained structural information and corresponding ones of the plurality of page groups;
instructions for using the generated hint information to perform a deep crawling operation of each of the selected plurality of domains to discover all video pages in each of the selected plurality of domains; and
instructions for indexing representations of all of the discovered video pages in each of the selected plurality of domains.
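One way to picture a hint-guided deep crawl followed by indexing is the breadth-first sketch below. The claims do not specify what the hint information contains, so the hint format used here (one regular expression for pages worth following and one for video pages), the `fetch_links` callback, and the dictionary-based index are all hypothetical.

```python
import re
from collections import deque


def deep_crawl(start_url, fetch_links, hints, max_pages=1000):
    """Breadth-first crawl of one domain that only follows hinted links.

    fetch_links: callable returning the outgoing links of a URL (assumed).
    hints: dict with "follow_pattern" and "video_pattern" regex strings (assumed).
    Returns the discovered video page URLs in crawl order.
    """
    follow = re.compile(hints["follow_pattern"])
    video = re.compile(hints["video_pattern"])
    seen = {start_url}
    queue = deque([start_url])
    discovered = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        if video.search(url):
            discovered.append(url)
        for link in fetch_links(url):
            if link in seen:
                continue
            # Links matching neither hint are pruned and never fetched.
            if follow.search(link) or video.search(link):
                seen.add(link)
                queue.append(link)
    return discovered


def index_pages(urls, index=None):
    """Store a placeholder representation of each discovered video page."""
    index = {} if index is None else index
    for url in urls:
        index[url] = {"url": url}  # a real index would hold richer metadata
    return index
```

The point of the hints in this sketch is pruning: pages such as login or about pages match neither pattern, so the crawler spends its page budget on listing pages and video pages only.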
Specification