IDENTIFYING SOURCES OF MEDIA CONTENT HAVING A HIGH LIKELIHOOD OF PRODUCING ON-TOPIC CONTENT

  • US 20080114755A1
  • Filed: 11/12/2007
  • Published: 05/15/2008
  • Est. Priority Date: 11/15/2006
  • Status: Abandoned Application
First Claim

1. A method of identifying sources of media content having a high likelihood of producing on-topic content, the method comprising:

  • responsive to receiving a definition of a topic area of interest of a plurality of topic areas of interest, identifying a set of candidate seed sites from which a current set of seeds are selected for deep crawling to locate on-topic content relevant to the topic area of interest by correlating relevancy scores or key-word search results from a plurality of search engines; and

    selecting the current set of seeds from the candidate seed sites based at least in part on on-topic scores associated with the candidate seed sites;

    periodically executing a topic net corresponding to the topic area of interest to locate sources of media content relevant to the topic area of interest by building a graph in which nodes of the graph represent pages and edges of the graph represent links among pages by performing an iterative crawl until a predetermined degree of separation is achieved to find a list of pages linking to any seed of the current set of seeds and a list of pages to which any seed of the current set of seeds links;

    assigning initial graph scores to each node of the graph;

    computing final graph scores for each node based on the initial graph scores by performing link analysis on the graph;

    computing a site graph score for each site represented in the graph by its set of pages by aggregating and averaging the node graph scores associated with the site; and

    identifying a set of sites with the highest site graph scores and configuring them to be scraped; and

    scraping and downloading pages associated with the sites configured to be scraped.
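
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the applicant's implementation, and every concrete choice is an assumption made for illustration: the function names (`select_seeds`, `build_graph`, `link_analysis`, `site_scores`) are hypothetical, reciprocal rank stands in for the on-topic score, a PageRank-style propagation stands in for the unspecified link analysis, and an in-memory `out_links` dictionary stands in for a live crawl. Seeds are given as page URLs for simplicity.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


def select_seeds(engine_results, top_n):
    """Correlate ranked result lists from several engines; keep the top_n sites."""
    totals = defaultdict(float)
    for ranked_sites in engine_results:
        for rank, site in enumerate(ranked_sites):
            # Assumption: reciprocal rank stands in for an on-topic score.
            totals[site] += 1.0 / (rank + 1)
    return sorted(totals, key=lambda s: (-totals[s], s))[:top_n]


def build_graph(seeds, out_links, max_depth):
    """Breadth-first crawl over in-links and out-links, up to max_depth hops."""
    in_links = defaultdict(set)
    for src, dsts in out_links.items():
        for dst in dsts:
            in_links[dst].add(src)
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if depth[page] == max_depth:
            continue  # predetermined degree of separation reached
        for nxt in set(out_links.get(page, ())) | in_links[page]:
            if nxt not in depth:
                depth[nxt] = depth[page] + 1
                queue.append(nxt)
    nodes = set(depth)
    edges = {(s, d) for s in nodes for d in out_links.get(s, ()) if d in nodes}
    return nodes, edges


def link_analysis(nodes, edges, initial, damping=0.85, iterations=50):
    """Propagate initial scores along links (PageRank-style, an assumption)."""
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1
    total = sum(initial.values()) or 1.0
    base = {n: initial.get(n, 0.0) / total for n in nodes}
    score = dict(base)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * base[n] for n in nodes}
        for src, dst in edges:
            nxt[dst] += damping * score[src] / out_degree[src]
        score = nxt
    return score


def site_scores(page_scores):
    """Average the page (node) scores of each site, keyed by host name."""
    by_site = defaultdict(list)
    for url, value in page_scores.items():
        by_site[urlparse(url).netloc].append(value)
    return {site: sum(vals) / len(vals) for site, vals in by_site.items()}
```

In this sketch, the sites to configure for scraping would be the highest-valued entries of `site_scores(link_analysis(...))`. Pages without out-links simply absorb score here, so the values are relative rankings rather than probabilities.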
