SPECIFIC ONLINE RESOURCE IDENTIFICATION AND EXTRACTION
First Claim
1. A method of automatically identifying and extracting distributed online resources, the method comprising:
locating in a website a candidate entry list page;
verifying the candidate entry list page as an entry list page using repeated pattern discovery;
segmenting the entry list page into a plurality of entry items;
extracting from the plurality of entry items a plurality of candidate target pages;
verifying at least some of the candidate target pages as target pages including analyzing a visual structure and presentation of the candidate target pages;
extracting metadata from the target pages; and
organizing the target pages and/or the metadata in one or more databases.
Abstract
A method of automatically identifying and extracting distributed online resources may include locating in a website a candidate entry list page. The method may also include verifying the candidate entry list page as an entry list page using repeated pattern discovery. The method may also include segmenting the entry list page into a plurality of entry items. The method may also include extracting from the plurality of entry items a plurality of candidate target pages. The method may also include verifying at least some of the candidate target pages as target pages including analyzing a visual structure and presentation of the candidate target pages. The method may also include extracting metadata from the target pages. The method may also include organizing the target pages and/or the metadata in one or more databases.
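The verification, segmentation, and candidate-extraction steps can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes that "repeated pattern discovery" can be approximated by looking for a parent element whose children share the same tag and class, and the names `find_entry_items`, `candidate_target_pages`, and the `min_repeats` threshold are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not become the open node.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class _Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []

    def signature(self):
        # Assumption: siblings with the same tag and class follow the same pattern.
        return (self.tag, self.attrs.get("class", ""))

    def links(self):
        # Collect every href found in this subtree, in document order.
        found = [self.attrs["href"]] if self.tag == "a" and "href" in self.attrs else []
        for child in self.children:
            found.extend(child.links())
        return found


class _TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = _Node("root", [])
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = _Node(tag, attrs, self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open element; ignore unmatched end tags.
        node = self._current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self._current = node.parent


def find_entry_items(html, min_repeats=3):
    """Return the largest group of same-signature siblings, or [] if none repeats enough."""
    builder = _TreeBuilder()
    builder.feed(html)
    best = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        counts = Counter(child.signature() for child in node.children)
        if not counts:
            continue
        signature, repeats = counts.most_common(1)[0]
        if repeats >= min_repeats and repeats > len(best):
            best = [c for c in node.children if c.signature() == signature]
    return best


def candidate_target_pages(html, min_repeats=3):
    """One list of candidate target URLs per entry item."""
    return [item.links() for item in find_entry_items(html, min_repeats)]
```

In this sketch a page counts as an entry list page when `find_entry_items` returns a non-empty list, and each returned element is one entry item. Verifying the candidate target pages by their visual structure and presentation is outside the scope of the sketch.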
20 Claims
1. A method of automatically identifying and extracting distributed online resources, the method comprising:
locating in a website a candidate entry list page;
verifying the candidate entry list page as an entry list page using repeated pattern discovery;
segmenting the entry list page into a plurality of entry items;
extracting from the plurality of entry items a plurality of candidate target pages;
verifying at least some of the candidate target pages as target pages including analyzing a visual structure and presentation of the candidate target pages;
extracting metadata from the target pages; and
organizing the target pages and/or the metadata in one or more databases.
Dependent claims: 2-13
14. A system for automatically identifying and extracting distributed online resources, the system comprising:
a processor;
one or more databases communicatively coupled to the processor; and
a tangible computer-readable storage medium communicatively coupled to the processor and having computer-executable instructions stored thereon that are executable by the processor to perform operations comprising:
locating in a website a candidate entry list page;
verifying the candidate entry list page as an entry list page using repeated pattern discovery;
segmenting the entry list page into a plurality of entry items;
extracting from the plurality of entry items a plurality of candidate target pages;
verifying at least some of the candidate target pages as target pages including analyzing a visual structure and presentation of the candidate target pages;
extracting metadata from the target pages; and
organizing the target pages and/or the metadata in one or more databases.
Dependent claims: 15-20
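The final operation, organizing the target pages and/or the metadata in one or more databases, might look like the following Python sketch. The claims do not specify a schema or a database engine; the use of SQLite, the two-table layout, and the names `open_store`, `store_target_page`, and `lookup` are assumptions made only for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store with one table for pages and one for metadata."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target_pages ("
        "url TEXT PRIMARY KEY, entry_list_url TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata ("
        "url TEXT NOT NULL, field TEXT NOT NULL, value TEXT NOT NULL, "
        "PRIMARY KEY (url, field))"
    )
    return conn


def store_target_page(conn, url, entry_list_url, metadata):
    """Record a verified target page and its metadata fields in one transaction."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO target_pages VALUES (?, ?)",
            (url, entry_list_url),
        )
        conn.executemany(
            "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)",
            [(url, field, value) for field, value in metadata.items()],
        )


def lookup(conn, url):
    """Return the stored metadata for one target page as a dict."""
    rows = conn.execute(
        "SELECT field, value FROM metadata WHERE url = ? ORDER BY field", (url,)
    )
    return dict(rows.fetchall())
```

Keying metadata by `(url, field)` keeps the sketch independent of which metadata fields are extracted, since the claims leave that open. Storing the same page twice overwrites the earlier record instead of duplicating it.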
Specification