Specific online resource identification and extraction

US 9,390,166 B2
Filed: 12/31/2012
Issued: 07/12/2016
Est. Priority Date: 12/31/2012
Status: Active Grant

Abstract

A method of automatically identifying and extracting distributed online resources may include locating in a website a candidate entry list page. The method may also include verifying the candidate entry list page as an entry list page using repeated pattern discovery. The method may also include segmenting the entry list page into a plurality of entry items. The method may also include extracting from the plurality of entry items a plurality of candidate target pages. The method may also include verifying at least some of the candidate target pages as target pages including analyzing a visual structure and presentation of the candidate target pages. The method may also include extracting metadata from the target pages. The method may also include organizing the target pages and/or the metadata in one or more databases.
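
The first step of the abstract, locating a candidate entry list page, is detailed in claim 2 as a keyword search over the anchor text and URL of each home page link. The following is a minimal sketch of that step only, using Python's standard `html.parser`. The names `LinkCollector` and `candidate_entry_list_links`, the case-insensitive substring match, and the sample markup are illustrative assumptions and do not come from the patent.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, anchor_text) pairs from an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def candidate_entry_list_links(home_page_html, target_keyword):
    """Return the links whose URL or anchor text contains the target keyword.

    Assumption: a case-insensitive substring test stands in for the
    "keyword search" named in the claim.
    """
    collector = LinkCollector()
    collector.feed(home_page_html)
    keyword = target_keyword.lower()
    return [
        (url, text)
        for url, text in collector.links
        if keyword in url.lower() or keyword in text.lower()
    ]


HOME = (
    '<a href="/about.html">About</a>'
    '<a href="/people/faculty.html">Our People</a>'
    '<a href="/dir.html">Faculty Directory</a>'
)
link_list = candidate_entry_list_links(HOME, "faculty")
```

Each link in `link_list` points to a page that would still need to be verified as an entry list page by the later steps.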

19 Claims

1. A method of automatically identifying and extracting distributed online resources, the method comprising:
- locating in a website a candidate entry list page;
  
  verifying the candidate entry list page as an entry list page using repeated pattern discovery;
  
  segmenting the entry list page into a plurality of entry items;
  
  extracting from the plurality of entry items a plurality of candidate target pages, wherein extracting from the plurality of entry items a plurality of candidate target pages comprises, for each corresponding entry item, identifying one or more links in the entry item, the one or more links each pointing to a corresponding one of the plurality of candidate target pages;
  
  verifying at least one of the candidate target pages as a target page including:
  
  fetching the at least one of the candidate target pages using a corresponding one of the one or more links;
  
  analyzing a visual structure and presentation of the at least one of the candidate target pages to identify one or more information blocks in the at least one of the candidate target pages;
  
  extracting one or more keyword features from the one or more information blocks;
  
  classifying the at least one of the candidate target pages as belonging to a particular genre of pages based at least on the one or more keyword features;
  
  extracting metadata from the target page; and
  
  organizing the target page and the metadata in one or more databases.
- - 2. The method of claim 1, wherein locating in the website the candidate entry list page comprises:
    - running a keyword search for a target keyword in links in a home page of the website, each link including at least one of anchor text and a uniform resource locator (URL);
      
      creating a link list including links in which the target keyword is found; and
      
      selecting a link from the link list, the link pointing to the candidate entry list page.
  - 3. The method of claim 2, wherein verifying the candidate entry list page as the entry list page using repeated pattern discovery comprises:
    - fetching the candidate entry list page using the selected link;
      
      parsing the candidate entry list page into a document object model (DOM) tree;
      
      scanning the DOM tree for a preselected repeated pattern;
      
      detecting the preselected repeated pattern in the DOM tree; and
      
      generating a list of detected instances of the preselected repeated pattern.
  - 4. The method of claim 3, wherein segmenting the entry list page into the plurality of entry items comprises:
    - detecting a DOM level in the DOM tree where each of the detected instances is in a different branch of the DOM tree;
      
      selecting the detected DOM level as a border of the plurality of entry items; and
      
      segmenting the entry list page into the different branches of the DOM tree, each of the different branches corresponding to a different one of the plurality of entry items.
  - 5. The method of claim 1, wherein the one or more keywords are extracted from the one or more information blocks based on a visual position of each of the one or more keywords within a corresponding one of the one or more information blocks.
  - 6. The method of claim 5, wherein:
    - each of the one or more keywords extracted from the one or more information blocks matches a corresponding keyword in a preselected keyword set; and
      
      each of the one or more keywords extracted from the one or more information blocks includes:
      
      an information block title including text having fewer than a first threshold number of characters and visually presented on its own line within a corresponding one of the one or more information blocks;
      
      text having fewer than a second threshold number of characters and immediately followed by a colon within a corresponding one of the one or more information blocks; and
      
      a head of a table within a corresponding one of the one or more information blocks.
  - 7. The method of claim 5, wherein verifying the at least one of the candidate target pages further comprises:
    - generating one or more anchor text features including:
      
      extracting one or more anchor texts from the entry list page; and
      
      unifying each of the one or more anchor texts;
      
      generating one or more uniform resource locator (URL) features including:
      
      splitting a URL of the at least one of the candidate target pages into one or more URL tokens; and
      
      unifying each of the one or more URL tokens; and
      
      classifying the at least one of the candidate target pages as belonging to the particular genre of pages further based on the one or more anchor text features and the one or more URL features.
  - 8. The method of claim 1, wherein the target page comprises a detailed product page of a corresponding product associated with a corresponding merchant and wherein extracting metadata from the target page comprises extracting at least one of:
    - a name of the corresponding product, a price of the corresponding product, a name of the corresponding merchant, specifications associated with the corresponding product, or customized prices associated with the corresponding product.
  - 9. The method of claim 8, wherein the one or more databases include a database of products and associated prices, the method further comprising providing online access to the database of products and associated prices for product and price comparison queries by users.
  - 10. The method of claim 1, wherein the target page comprises a home page of a corresponding faculty member associated with a corresponding educational institution and wherein extracting metadata from the target page comprises extracting at least one of:
    - a name of the corresponding faculty member, a name of the corresponding educational institution, a course information block including one or more courses taught by the corresponding faculty member, and/or one or more publications by the corresponding faculty member.
  - 11. The method of claim 10, further comprising:
    - segmenting the course information block into an entry list of candidate course pages;
      
      extracting and verifying at least some of the candidate course pages as course pages including analyzing a visual structure and presentation of the candidate course pages;
      
      extracting course metadata from the course pages including, for at least one of the course pages, a learning materials information block;
      
      segmenting the learning materials information block into an entry list of candidate learning materials;
      
      extracting and verifying at least some of the candidate learning materials as learning materials; and
      
      organizing the course metadata and learning materials in the one or more databases.
  - 12. The method of claim 11, wherein the one or more databases include at least one of:
    - a database of faculty members, a database of courses, and a database of learning materials, the method further comprising providing online access to the database of faculty members, the database of courses, and/or the database of learning materials for scholar queries, course queries, publication queries, and/or learning material queries by users.
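
Claims 3 and 4 describe verifying an entry list page by repeated pattern discovery in the DOM tree and then segmenting the page at the DOM level where each detected instance sits in a different branch. The sketch below is one possible reading of those steps, not the patented implementation. It assumes that the preselected repeated pattern is simply an element with a given tag name (`a` by default), that the page is well-formed enough for `xml.etree.ElementTree` to parse, and that the border level is the one just below the deepest ancestor shared by all instances. The function names and sample page are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_instances(root, tag="a"):
    """Return the root-to-node path of every element matching the pattern."""
    paths = []

    def walk(node, path):
        path = path + [node]
        if node.tag == tag:
            paths.append(path)
        for child in node:
            walk(child, path)

    walk(root, [])
    return paths


def segment_entries(root, tag="a"):
    """Split the page into entry items, one per DOM branch holding an instance."""
    paths = find_instances(root, tag)
    if len(paths) < 2:
        return []  # nothing repeats, so the page is not treated as an entry list

    # Count how many leading nodes are shared by every path.
    common = 0
    while (all(len(p) > common for p in paths)
           and len({id(p[common]) for p in paths}) == 1):
        common += 1

    # The nodes at index `common` are the branches that separate the instances.
    branches, seen = [], set()
    for p in paths:
        if len(p) > common and id(p[common]) not in seen:
            seen.add(id(p[common]))
            branches.append(p[common])
    return branches


PAGE = """<html><body><div id="nav"><p>Faculty</p></div><ul>
<li><a href="/people/a.html">Ann</a> <span>Professor</span></li>
<li><a href="/people/b.html">Bob</a> <span>Lecturer</span></li>
<li><a href="/people/c.html">Cy</a> <span>Professor</span></li>
</ul></body></html>"""

entries = segment_entries(ET.fromstring(PAGE))
candidate_links = [e.find(".//a").get("href") for e in entries]
```

Here each `li` element becomes one entry item, and the link inside it points to a candidate target page as in claim 1. Real pages are rarely valid XML, so a tolerant HTML parser would be needed in practice.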

13. A system for automatically identifying and extracting distributed online resources, the system comprising:
- a processor;
  
  one or more databases communicatively coupled to the processor; and
  
  a tangible computer-readable storage medium communicatively coupled to the processor and having computer-executable instructions stored thereon that are executable by the processor to perform operations comprising:
  
  locating in a website a candidate entry list page;
  
  verifying the candidate entry list page as an entry list page using repeated pattern discovery;
  
  segmenting the entry list page into a plurality of entry items;
  
  extracting from the plurality of entry items a plurality of candidate target pages, wherein extracting from the plurality of entry items a plurality of candidate target pages comprises, for each corresponding entry item, identifying one or more links in the entry item, the one or more links each pointing to a corresponding one of the plurality of candidate target pages;
  
  verifying at least one of the candidate target pages as a target page including:
  
  fetching the at least one of the candidate target pages using a corresponding one of the one or more links;
  
  analyzing a visual structure and presentation of the at least one of candidate target pages to identify one or more information blocks in the at least one of the candidate target pages;
  
  extracting one or more keyword features from the one or more information blocks;
  
  classifying the at least one of the candidate target pages as belonging to a particular genre of pages based at least on the one or more keyword features;
  
  extracting metadata from the target page; and
  
  organizing the target page and the metadata in one or more databases.
- - 14. The system of claim 13, wherein the target page comprises a detailed product page of a corresponding product associated with a corresponding merchant and wherein extracting metadata from the target page comprises extracting at least one of:
    - a name of the corresponding product, a price of the corresponding product, a name of the corresponding merchant, specifications associated with the corresponding product, or customized prices associated with the corresponding product.
  - 15. The system of claim 14, wherein the one or more databases include a database of products and associated prices, the operations further comprising providing online access to the database of products and associated prices for product and price comparison queries by users.
  - 16. The system of claim 13, wherein the target page comprises a home page of a corresponding faculty member associated with a corresponding educational institution and wherein extracting metadata from the target page comprises extracting at least one of:
    - a name of the corresponding faculty member, a name of the corresponding educational institution, a course information block including one or more courses taught by the corresponding faculty member, and/or one or more publications by the corresponding faculty member.
  - 17. The system of claim 16, the operations further comprising:
    - segmenting the course information block into an entry list of candidate course pages;
      
      extracting and verifying at least some of the candidate course pages as course pages including analyzing a visual structure and presentation of the candidate course pages;
      
      extracting course metadata from the course pages including, for at least one of the course pages, a learning materials information block;
      
      segmenting the learning materials information block into an entry list of candidate learning materials;
      
      extracting and verifying at least some of the candidate learning materials as learning materials; and
      
      organizing the course metadata and learning materials in the one or more databases.
  - 18. The system of claim 17, wherein the one or more databases include a database of faculty members, a database of courses, and a database of learning materials, the operations further comprising providing online access to the database of faculty members, the database of courses, and/or the database of learning materials for scholar queries, course queries, publication queries, and/or learning material queries by users.
  - 19. The system of claim 13, wherein:
    - analyzing the visual structure and presentation of the at least one of candidate target pages comprises:
      
      generating the one or more keyword features, wherein the one or more keywords are extracted from the information blocks based on a visual position of each of the one or more keywords within a corresponding one of the information blocks;
      
      generating one or more anchor text features including:
      
      extracting one or more anchor texts from the entry list page; and
      
      unifying each of the one or more anchor texts; and
      
      generating one or more uniform resource locator (URL) features including:
      
      splitting a URL of the at least one of candidate target pages into one or more URL tokens; and
      
      unifying each of the one or more URL tokens; and
      
      classifying the at least one of candidate target pages as belonging to the particular genre of pages is further based on the one or more keyword features, the one or more anchor text features, and the one or more URL features.
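
Claims 7 and 19 generate anchor text features and URL features by extracting anchor texts, splitting a URL into tokens, and "unifying" each result. The claims do not define unification, so the sketch below assumes it means lowercasing, collapsing whitespace, and replacing digit runs with a `<num>` placeholder. The function names, the tokenization rule (split on non-alphanumeric characters), and the example URL are likewise illustrative assumptions.

```python
import re
from urllib.parse import urlsplit


def unify(text):
    """Normalize a token or anchor text (assumed meaning of "unifying")."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    return re.sub(r"\d+", "<num>", text)


def url_features(url):
    """Split a URL into tokens on non-alphanumeric characters and unify each."""
    parts = urlsplit(url)
    joined = " ".join([parts.netloc, parts.path, parts.query])
    return [unify(tok) for tok in re.split(r"[^A-Za-z0-9]+", joined) if tok]


def anchor_text_features(anchor_texts):
    """Unify each anchor text taken from the entry list page."""
    return [unify(text) for text in anchor_texts if text.strip()]


url_tokens = url_features("http://www.example.edu/~jsmith/CS101/index.html")
anchors = anchor_text_features(["  Prof. Jane  Smith ", "CS 101"])
```

Features of this kind could then be passed, together with the keyword features, to whatever classifier assigns the page to a genre. The claims leave the choice of classifier open.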

Current Assignee
Fujitsu Limited
Original Assignee
Fujitsu Limited
Inventors
Wang, Jun; Uchino, Kanji
Primary Examiner(s)
Syed, Farhan

Application Number

US13/732,036
Publication Number

US 20140188882A1
Time in Patent Office

1,289 Days
Field of Search
US Class Current

1/1
CPC Class Codes

G06F 16/353 into predefined classes

G06F 16/951 Indexing; Web crawling tech...
