Specific online resource identification and extraction

  • US 9,390,166 B2
  • Filed: 12/31/2012
  • Issued: 07/12/2016
  • Est. Priority Date: 12/31/2012
  • Status: Active Grant
First Claim

1. A method of automatically identifying and extracting distributed online resources, the method comprising:

    locating in a website a candidate entry list page;

    verifying the candidate entry list page as an entry list page using repeated pattern discovery;

    segmenting the entry list page into a plurality of entry items;

    extracting from the plurality of entry items a plurality of candidate target pages, wherein extracting from the plurality of entry items a plurality of candidate target pages comprises, for each corresponding entry item, identifying one or more links in the entry item, the one or more links each pointing to a corresponding one of the plurality of candidate target pages;

    verifying at least one of the candidate target pages as a target page including:

    fetching the at least one of the candidate target pages using a corresponding one of the one or more links;

    analyzing a visual structure and presentation of the at least one of the candidate target pages to identify one or more information blocks in the at least one of the candidate target pages;

    extracting one or more keyword features from the one or more information blocks;

    classifying the at least one of the candidate target pages as belonging to a particular genre of pages based at least on the one or more keyword features;

    extracting metadata from the target page; and

    organizing the target page and the metadata in one or more databases.
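
The segmenting, link-extraction, and genre-classification steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes well-formed HTML, treats a small set of container tags (`li`, `tr`, `div`, `article`) as possible entry items, approximates repeated pattern discovery by counting identical tag/class signatures, and substitutes a simple keyword-overlap score for whatever classifier the patent contemplates. The names `segment_entry_list`, `classify_genre`, and `GENRE_KEYWORDS`, along with the genre labels, are hypothetical. Fetching pages and analyzing visual structure are left out.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: entry items are rendered with one of these container tags.
CONTAINER_TAGS = {"li", "tr", "div", "article"}

# Hypothetical genres and keywords, for illustration only.
GENRE_KEYWORDS = {
    "course": {"syllabus", "lecture", "instructor", "homework"},
    "publication": {"abstract", "authors", "journal", "doi"},
}


class EntryListParser(HTMLParser):
    """Records each container element together with the links found inside it."""

    def __init__(self):
        super().__init__()
        self.open_items = []    # containers currently open: (signature, links)
        self.closed_items = []  # containers already closed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in CONTAINER_TAGS:
            signature = (tag, attrs.get("class") or "")
            self.open_items.append((signature, []))
        elif tag == "a" and attrs.get("href"):
            # A link belongs to every container that is still open around it.
            for _, links in self.open_items:
                links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Assumes containers are properly nested and explicitly closed.
        if tag in CONTAINER_TAGS and self.open_items:
            self.closed_items.append(self.open_items.pop())


def segment_entry_list(html, min_repeats=3):
    """Return one list of links per entry item, or [] if no repeated pattern is found."""
    parser = EntryListParser()
    parser.feed(html)
    parser.close()
    with_links = [(sig, links) for sig, links in parser.closed_items if links]
    counts = Counter(sig for sig, _ in with_links)
    if not counts:
        return []
    signature, repeats = counts.most_common(1)[0]
    if repeats < min_repeats:
        return []  # too few repetitions to treat the page as an entry list
    return [links for sig, links in with_links if sig == signature]


def classify_genre(block_texts, genre_keywords, min_hits=2):
    """Pick the genre whose keywords overlap most with the information blocks."""
    words = set(re.findall(r"[a-z]+", " ".join(block_texts).lower()))
    scores = {genre: len(words & kws) for genre, kws in genre_keywords.items()}
    genre, hits = max(scores.items(), key=lambda kv: kv[1])
    return genre if hits >= min_hits else None


page = (
    '<div id="nav"><a href="/home">Home</a></div>'
    '<ul>'
    '<li class="entry"><a href="/p/1">One</a></li>'
    '<li class="entry"><a href="/p/2">Two</a></li>'
    '<li class="entry"><a href="/p/3">Three</a></li>'
    '</ul>'
)
candidate_links = segment_entry_list(page)
genre = classify_genre(["Course syllabus", "Instructor: J. Doe"], GENRE_KEYWORDS)
```

In this sketch the navigation `div` appears only once, so it falls below `min_repeats` and is ignored, while the three `li` elements sharing the `entry` class are taken as the entry items. Each returned link would then be fetched and checked as a candidate target page.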
