Specific online resource identification and extraction
First Claim
1. A method of automatically identifying and extracting distributed online resources, the method comprising:
locating in a website a candidate entry list page;
verifying the candidate entry list page as an entry list page using repeated pattern discovery;
segmenting the entry list page into a plurality of entry items;
extracting from the plurality of entry items a plurality of candidate target pages, wherein extracting from the plurality of entry items a plurality of candidate target pages comprises, for each corresponding entry item, identifying one or more links in the entry item, the one or more links each pointing to a corresponding one of the plurality of candidate target pages;
verifying at least one of the candidate target pages as a target page, including:
    fetching the at least one of the candidate target pages using a corresponding one of the one or more links;
    analyzing a visual structure and presentation of the at least one of the candidate target pages to identify one or more information blocks in the at least one of the candidate target pages;
    extracting one or more keyword features from the one or more information blocks; and
    classifying the at least one of the candidate target pages as belonging to a particular genre of pages based at least on the one or more keyword features;
extracting metadata from the target page; and
organizing the target page and the metadata in one or more databases.
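The segmentation and link-identification steps of the claim can be illustrated with a minimal sketch. It assumes, purely for illustration, that each entry item is an HTML `<li>` element and that "repeated pattern discovery" can be approximated by checking that most items share the same sequence of child tags. The names `EntryListParser`, `segment_entry_list`, `min_items`, and `min_share` are hypothetical and do not come from the patent, which does not specify an implementation here.

```python
from collections import Counter
from html.parser import HTMLParser


class EntryListParser(HTMLParser):
    """Collects, for each <li> entry item, its child-tag signature and links."""

    def __init__(self):
        super().__init__()
        self.items = []        # one dict per completed entry item
        self._current = None   # item being parsed, or None outside <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._current = {"signature": [], "links": []}
        elif self._current is not None:
            self._current["signature"].append(tag)
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self._current["links"].append(href)

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.items.append(self._current)
            self._current = None


def segment_entry_list(html, min_items=3, min_share=0.6):
    """Return (looks_like_entry_list, items).

    Assumption: a page counts as an entry list when it has at least
    `min_items` items and at least `min_share` of them repeat the same
    tag signature. Both thresholds are illustrative.
    """
    parser = EntryListParser()
    parser.feed(html)
    items = parser.items
    if len(items) < min_items:
        return False, items
    signatures = Counter(tuple(item["signature"]) for item in items)
    _, top_count = signatures.most_common(1)[0]
    return top_count / len(items) >= min_share, items


SAMPLE = """
<ul>
  <li><a href="/people/ada">Ada</a> <span>Professor</span></li>
  <li><a href="/people/bob">Bob</a> <span>Lecturer</span></li>
  <li><a href="/people/cy">Cy</a> <span>Professor</span></li>
</ul>
"""
is_entry_list, entry_items = segment_entry_list(SAMPLE)
candidate_links = [link for item in entry_items for link in item["links"]]
```

Each link in `candidate_links` points to one candidate target page, which the later verification step would fetch and classify. Nested lists and non-`<li>` layouts are outside the scope of this sketch.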
Abstract
A method of automatically identifying and extracting distributed online resources may include locating in a website a candidate entry list page. The method may also include verifying the candidate entry list page as an entry list page using repeated pattern discovery. The method may also include segmenting the entry list page into a plurality of entry items. The method may also include extracting from the plurality of entry items a plurality of candidate target pages. The method may also include verifying at least some of the candidate target pages as target pages including analyzing a visual structure and presentation of the candidate target pages. The method may also include extracting metadata from the target pages. The method may also include organizing the target pages and/or the metadata in one or more databases.
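The verification step summarized in the abstract is detailed in claim 1 as extracting keyword features from information blocks and classifying the page into a genre. The sketch below shows one simple way this could look, assuming the information blocks have already been reduced to plain text strings. The genre names, keyword lists, and the `min_hits` threshold are hypothetical placeholders; the source does not state which features or classifier are used.

```python
import re
from collections import Counter

# Hypothetical keyword lists per genre; the source does not specify any.
GENRE_KEYWORDS = {
    "faculty_profile": {"professor", "research", "publications", "teaching"},
    "product_page": {"price", "cart", "shipping", "reviews"},
}


def keyword_features(blocks):
    """Count lowercase word tokens across all information blocks."""
    counts = Counter()
    for block in blocks:
        counts.update(re.findall(r"[a-z]+", block.lower()))
    return counts


def classify_genre(blocks, min_hits=2):
    """Return the best-scoring genre, or None if no genre reaches min_hits.

    The score is the total number of genre keyword occurrences, which is
    a stand-in for whatever classifier an actual system would use.
    """
    features = keyword_features(blocks)
    scores = {
        genre: sum(features[word] for word in words)
        for genre, words in GENRE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None


blocks = [
    "Professor of Computer Science",
    "Research interests: data mining",
    "Selected publications",
]
genre = classify_genre(blocks)
```

A candidate page for which `classify_genre` returns `None` would simply not be verified as a target page in this sketch.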
19 Claims
1. A method of automatically identifying and extracting distributed online resources, the method comprising:
locating in a website a candidate entry list page;
verifying the candidate entry list page as an entry list page using repeated pattern discovery;
segmenting the entry list page into a plurality of entry items;
extracting from the plurality of entry items a plurality of candidate target pages, wherein extracting from the plurality of entry items a plurality of candidate target pages comprises, for each corresponding entry item, identifying one or more links in the entry item, the one or more links each pointing to a corresponding one of the plurality of candidate target pages;
verifying at least one of the candidate target pages as a target page, including:
    fetching the at least one of the candidate target pages using a corresponding one of the one or more links;
    analyzing a visual structure and presentation of the at least one of the candidate target pages to identify one or more information blocks in the at least one of the candidate target pages;
    extracting one or more keyword features from the one or more information blocks; and
    classifying the at least one of the candidate target pages as belonging to a particular genre of pages based at least on the one or more keyword features;
extracting metadata from the target page; and
organizing the target page and the metadata in one or more databases.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A system for automatically identifying and extracting distributed online resources, the system comprising:
a processor;
one or more databases communicatively coupled to the processor; and
a tangible computer-readable storage medium communicatively coupled to the processor and having computer-executable instructions stored thereon that are executable by the processor to perform operations comprising:
    locating in a website a candidate entry list page;
    verifying the candidate entry list page as an entry list page using repeated pattern discovery;
    segmenting the entry list page into a plurality of entry items;
    extracting from the plurality of entry items a plurality of candidate target pages, wherein extracting from the plurality of entry items a plurality of candidate target pages comprises, for each corresponding entry item, identifying one or more links in the entry item, the one or more links each pointing to a corresponding one of the plurality of candidate target pages;
    verifying at least one of the candidate target pages as a target page, including:
        fetching the at least one of the candidate target pages using a corresponding one of the one or more links;
        analyzing a visual structure and presentation of the at least one of the candidate target pages to identify one or more information blocks in the at least one of the candidate target pages;
        extracting one or more keyword features from the one or more information blocks; and
        classifying the at least one of the candidate target pages as belonging to a particular genre of pages based at least on the one or more keyword features;
    extracting metadata from the target page; and
    organizing the target page and the metadata in one or more databases.
Dependent claims: 14, 15, 16, 17, 18, 19
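The final operation of the system claim, organizing the target page and its metadata in one or more databases, might be sketched as follows. This is an illustration only: the use of SQLite, the `target_pages` table, its columns, and the helper names `open_store`, `save_target_page`, and `load_metadata` are all assumptions, since the claim does not name a database or schema.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store with one hypothetical table of target pages."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target_pages ("
        " url TEXT PRIMARY KEY,"
        " genre TEXT NOT NULL,"
        " html TEXT NOT NULL,"
        " metadata TEXT NOT NULL)"
    )
    return conn


def save_target_page(conn, url, genre, html, metadata):
    """Insert or overwrite one verified target page and its metadata."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO target_pages (url, genre, html, metadata)"
            " VALUES (?, ?, ?, ?)",
            (url, genre, html, json.dumps(metadata, sort_keys=True)),
        )


def load_metadata(conn, url):
    """Return the stored metadata dict for a URL, or None if absent."""
    row = conn.execute(
        "SELECT metadata FROM target_pages WHERE url = ?", (url,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = open_store()
save_target_page(
    conn,
    url="/people/ada",
    genre="faculty_profile",
    html="<html>...</html>",
    metadata={"name": "Ada", "title": "Professor"},
)
```

Keying the table on the page URL means that re-crawling the same target page replaces its earlier record rather than duplicating it, which is one possible design choice among several.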
Specification