SPECIFIC ONLINE RESOURCE IDENTIFICATION AND EXTRACTION

US 20140188882A1
Filed: 12/31/2012
Published: 07/03/2014
Est. Priority Date: 12/31/2012
Status: Active Grant

Abstract

A method of automatically identifying and extracting distributed online resources may include locating in a website a candidate entry list page. The method may also include verifying the candidate entry list page as an entry list page using repeated pattern discovery. The method may also include segmenting the entry list page into a plurality of entry items. The method may also include extracting from the plurality of entry items a plurality of candidate target pages. The method may also include verifying at least some of the candidate target pages as target pages including analyzing a visual structure and presentation of the candidate target pages. The method may also include extracting metadata from the target pages. The method may also include organizing the target pages and/or the metadata in one or more databases.
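As a rough illustration of the first step, locating a candidate entry list page, the sketch below follows the keyword search over home-page links that claim 2 describes. It uses only the Python standard library. The names `LinkCollector` and `candidate_list_links`, the sample HTML, the `example.edu` host, and the keyword `faculty` are all assumptions made for this example and do not come from the patent.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor_text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def candidate_list_links(home_html, base_url, keyword):
    """Return absolute URLs of links whose anchor text or URL contains keyword."""
    collector = LinkCollector()
    collector.feed(home_html)
    keyword = keyword.lower()
    return [
        urljoin(base_url, href)
        for href, text in collector.links
        if keyword in href.lower() or keyword in text.lower()
    ]


# Hypothetical home page used only for this example.
home = """
<a href="/about">About us</a>
<a href="/people/faculty.html">Faculty</a>
<a href="/dir?id=7">Faculty and staff directory</a>
<a href="/news">News</a>
"""
links = candidate_list_links(home, "http://example.edu/", "faculty")
print(links)
```

Each URL in the resulting link list is only a candidate; the method still has to verify the page it points to using repeated pattern discovery.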

20 Claims

1. A method of automatically identifying and extracting distributed online resources, the method comprising:
- locating in a website a candidate entry list page;
  
  verifying the candidate entry list page as an entry list page using repeated pattern discovery;
  
  segmenting the entry list page into a plurality of entry items;
  
  extracting from the plurality of entry items a plurality of candidate target pages;
  
  verifying at least some of the candidate target pages as target pages including analyzing a visual structure and presentation of the candidate target pages;
  
  extracting metadata from the target pages; and
  
  organizing the target pages and/or the metadata in one or more databases.
- - 2. The method of claim 1, wherein locating in a website a candidate entry list page comprises:
    - running a keyword search for a target keyword in links in a home page of the website, each link including at least one of anchor text and a uniform resource locator (URL);
      
      creating a link list including links in which the target keyword is found; and
      
      selecting a link from the link list, the link pointing to the candidate entry list page.
  - 3. The method of claim 2, wherein verifying the candidate entry list page as an entry list page using repeated pattern discovery comprises:
    - fetching the candidate entry list page using the selected link;
      
      parsing the candidate entry list page into a document object model (DOM) tree;
      
      scanning the DOM tree for a preselected repeated pattern;
      
      detecting the preselected repeated pattern in the DOM tree; and
      
      generating a list of detected instances of the preselected repeated pattern.
  - 4. The method of claim 3, wherein segmenting the entry list page into a plurality of entry items comprises:
    - detecting a DOM level in the DOM tree where each of the detected instances is in a different branch of the DOM tree;
      
      selecting the detected DOM level as a border of the plurality of entry items; and
      
      segmenting the entry list page into the different branches of the DOM tree, each of the different branches corresponding to a different one of the plurality of entry items.
  - 5. The method of claim 1, wherein extracting from the plurality of entry items a plurality of candidate target pages comprises, for each corresponding entry item, identifying one or more links in the entry item, the one or more links each pointing to a corresponding one of the plurality of candidate target pages.
  - 6. The method of claim 5, wherein verifying at least some of the candidate target pages as target pages including analyzing a visual structure and presentation of the candidate target pages comprises, for each corresponding candidate target page:
    - fetching the candidate target page using a corresponding one of the one or more links;
      
      analyzing the visual structure and presentation of the candidate target page to identify information blocks in the candidate target page;
      
      extracting one or more keywords from the information blocks based on a visual position of each of the one or more keywords within a corresponding one of the information blocks to generate keyword features; and
      
      classifying the candidate target page as belonging to a particular genre of pages based at least on the keyword features.
  - 7. The method of claim 6, wherein:
    - each of the one or more keywords extracted from the information blocks matches a corresponding keyword in a preselected keyword set; and

      each of the one or more keywords extracted from the information blocks includes:
      - an information block title including text having fewer than a first threshold number of characters and visually presented on its own line within a corresponding one of the information blocks;
      - text having fewer than a second threshold number of characters and immediately followed by a colon within a corresponding one of the information blocks; and
      - a head of a table within a corresponding one of the information blocks.
  - 8. The method of claim 6, wherein verifying at least some of the candidate target pages further comprises:
    - generating one or more anchor text features including:
      - extracting one or more anchor texts from the entry list page; and
      - unifying each of the one or more anchor texts;
    - generating one or more uniform resource locator (URL) features including:
      - splitting a URL of the candidate target page into one or more URL tokens; and
      - unifying each of the one or more URL tokens; and
    - classifying the candidate target page as belonging to a particular genre of pages additionally based on the one or more anchor text features and the one or more URL features.
  - 9. The method of claim 1, wherein the target pages each comprise a detailed product page of a corresponding product associated with a corresponding merchant such that extracting metadata from the target pages comprises extracting at least one of:
    - a name of the corresponding product, a price of the corresponding product, a name of the corresponding merchant, specifications associated with the corresponding product, or customized prices associated with the corresponding product.
  - 10. The method of claim 9, wherein the one or more databases include a database of products and associated prices, the method further comprising providing online access to the database of products and associated prices for product and price comparison queries by users.
  - 11. The method of claim 1, wherein the target pages each comprise a home page of a corresponding faculty member associated with a corresponding educational institution such that extracting metadata from the target pages comprises extracting at least one of:
    - a name of the corresponding faculty member, a name of the corresponding educational institution, a course information block including one or more courses taught by the corresponding faculty member, and/or one or more publications by the corresponding faculty member.
  - 12. The method of claim 11, further comprising:
    - segmenting the course information block into an entry list of candidate course pages;
      
      extracting and verifying at least some of the candidate course pages as course pages including analyzing a visual structure and presentation of the candidate course pages;
      
      extracting course metadata from the course pages including, for at least one of the course pages, a learning materials information block;
      
      segmenting the learning materials information block into an entry list of candidate learning materials;
      
      extracting and verifying at least some of the candidate learning materials as learning materials; and
      
      organizing the course metadata and learning materials in the one or more databases.
  - 13. The method of claim 12, wherein the one or more databases include at least one of:
    - a database of faculty members, a database of courses, and a database of learning materials, the method further comprising providing online access to the database of faculty members, the database of courses, and/or the database of learning materials for scholar queries, course queries, publication queries, and/or learning material queries by users.
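The verification, segmentation, and link extraction steps of claims 3 through 5 can be sketched on a small, well-formed page as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the standard-library `xml.etree.ElementTree` stands in for a real HTML DOM parser, the "preselected repeated pattern" is assumed to be an `<a>` element whose `href` starts with `/~`, and `MIN_REPEATS`, `find_instances`, `path_from_root`, and `segment` are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

MIN_REPEATS = 3  # assumed threshold for accepting a page as an entry list page


def find_instances(root, matches):
    """List every element for which the preselected pattern `matches` holds."""
    return [el for el in root.iter() if matches(el)]


def path_from_root(el, parent):
    """Return [root, ..., el] by walking the child -> parent map upwards."""
    path = [el]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


def segment(root, instances):
    """Return (depth, branches) for the shallowest DOM level at which every
    detected instance lies in a different branch, or None if there is none."""
    if not instances:
        return None
    parent = {child: p for p in root.iter() for child in p}
    paths = [path_from_root(el, parent) for el in instances]
    for depth in range(min(len(p) for p in paths)):
        branches = [p[depth] for p in paths]
        if len(set(branches)) == len(branches):
            return depth, branches
    return None


def is_profile_link(el):
    """Assumed pattern: a link to a personal page such as /~alice."""
    return el.tag == "a" and el.get("href", "").startswith("/~")


# Hypothetical candidate entry list page.
page = """
<html><body>
  <div id="nav"><a href="/home">Home</a></div>
  <ul id="people">
    <li><span>Prof.</span> <a href="/~alice">Alice</a> <a href="mailto:alice@example.edu">mail</a></li>
    <li><span>Prof.</span> <a href="/~bob">Bob</a></li>
    <li><span>Dr.</span> <a href="/~carol">Carol</a></li>
  </ul>
</body></html>
""".strip()

root = ET.fromstring(page)
instances = find_instances(root, is_profile_link)
is_entry_list = len(instances) >= MIN_REPEATS
depth, entries = segment(root, instances)

# Claim 5: every link inside an entry item points to a candidate target page.
links = [[a.get("href") for a in entry.iter("a")] for entry in entries]
print(is_entry_list, depth, links)
```

In this example the three detected links share the `html`, `body`, and `ul` ancestors and first diverge at the `li` elements, so the `li` level is taken as the border between entry items.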

14. A system for automatically identifying and extracting distributed online resources, the system comprising:
- a processor;

  one or more databases communicatively coupled to the processor; and

  a tangible computer-readable storage medium communicatively coupled to the processor and having computer-executable instructions stored thereon that are executable by the processor to perform operations comprising:
  - locating in a website a candidate entry list page;
  - verifying the candidate entry list page as an entry list page using repeated pattern discovery;
  - segmenting the entry list page into a plurality of entry items;
  - extracting from the plurality of entry items a plurality of candidate target pages;
  - verifying at least some of the candidate target pages as target pages including analyzing a visual structure and presentation of the candidate target pages;
  - extracting metadata from the target pages; and
  - organizing the target pages and/or the metadata in one or more databases.
- - 15. The system of claim 14, wherein the target pages each comprise a detailed product page of a corresponding product associated with a corresponding merchant such that extracting metadata from the target pages comprises extracting at least one of:
    - a name of the corresponding product, a price of the corresponding product, a name of the corresponding merchant, specifications associated with the corresponding product, or customized prices associated with the corresponding product.
  - 16. The system of claim 15, wherein the one or more databases include a database of products and associated prices, the operations further comprising providing online access to the database of products and associated prices for product and price comparison queries by users.
  - 17. The system of claim 14, wherein the target pages each comprise a home page of a corresponding faculty member associated with a corresponding educational institution such that extracting metadata from the target pages comprises extracting at least one of:
    - a name of the corresponding faculty member, a name of the corresponding educational institution, a course information block including one or more courses taught by the corresponding faculty member, and/or one or more publications by the corresponding faculty member.
  - 18. The system of claim 17, the operations further comprising:
    - segmenting the course information block into an entry list of candidate course pages;
      
      extracting and verifying at least some of the candidate course pages as course pages including analyzing a visual structure and presentation of the candidate course pages;
      
      extracting course metadata from the course pages including, for at least one of the course pages, a learning materials information block;
      
      segmenting the learning materials information block into an entry list of candidate learning materials;
      
      extracting and verifying at least some of the candidate learning materials as learning materials; and
      
      organizing the course metadata and learning materials in the one or more databases.
  - 19. The system of claim 18, wherein the one or more databases include a database of faculty members, a database of courses, and a database of learning materials, the operations further comprising providing online access to the database of faculty members, the database of courses, and/or the database of learning materials for scholar queries, course queries, publication queries, and/or learning material queries by users.
  - 20. The system of claim 14, wherein:
    - extracting from the plurality of entry items a plurality of candidate target pages comprises, for each corresponding entry item, identifying one or more links in the entry item, the one or more links each pointing to a corresponding one of the plurality of candidate target pages; and

      verifying at least some of the candidate target pages as target pages including analyzing a visual structure and presentation of the candidate target pages comprises, for each corresponding candidate target page:
      - generating one or more keyword features including:
        - fetching the candidate target page using a corresponding one of the one or more links;
        - analyzing the visual structure and presentation of the candidate target page to identify information blocks in the candidate target page;
        - extracting one or more keywords from the information blocks based on a visual position of each of the one or more keywords within a corresponding one of the information blocks;
      - generating one or more anchor text features including:
        - extracting one or more anchor texts from the entry list page; and
        - unifying each of the one or more anchor texts;
      - generating one or more uniform resource locator (URL) features including:
        - splitting a URL of the candidate target page into one or more URL tokens; and
        - unifying each of the one or more URL tokens; and
      - classifying the candidate target page as belonging to a particular genre of pages based on the keyword features, the one or more anchor text features, and the one or more URL features.
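The three feature families named in claims 7, 8, and 20 might be generated along the following lines. This is a sketch under several assumptions that the claims do not settle: "unifying" is taken to mean lowercasing and stripping non-alphanumeric characters, an information block is assumed to arrive as a list of already-rendered text lines, the thresholds 30 and 20 and the `KEYWORDS` set are placeholders, and the table-head heuristic of claim 7 is left out. All function names are hypothetical.

```python
import re
from urllib.parse import urlparse

KEYWORDS = {"courses", "teaching", "publications", "office"}  # hypothetical set


def unify(token):
    """Assumed meaning of 'unifying': lowercase and drop non-alphanumerics."""
    return re.sub(r"[^a-z0-9]+", "", token.lower())


def keyword_features(block_lines, title_max=30, label_max=20):
    """Keywords found as short stand-alone lines or as short labels before a colon."""
    found = set()
    for line in block_lines:
        text = line.strip()
        candidates = []
        if 0 < len(text) < title_max and ":" not in text:
            candidates.append(text)            # short text on its own line
        label = re.match(r"([^:]+):", text)
        if label and len(label.group(1)) < label_max:
            candidates.append(label.group(1))  # short text followed by a colon
        for candidate in candidates:
            if unify(candidate) in KEYWORDS:
                found.add(unify(candidate))
    return found


def url_features(url):
    """Split host and path into tokens on common separators, then unify them."""
    parts = urlparse(url)
    tokens = re.split(r"[/._~?=&-]+", parts.netloc + parts.path)
    return {unify(t) for t in tokens if unify(t)}


def anchor_features(anchor_texts):
    """Unified words from the anchor texts that pointed at the candidate page."""
    return {unify(w) for text in anchor_texts for w in text.split() if unify(w)}


# Hypothetical information block and link data for one candidate page.
block = [
    "Teaching",
    "Office: Room 12",
    "I joined the department in 2005 and work on information retrieval.",
]
kw = keyword_features(block)
url = url_features("http://www.example.edu/~alice/index.html")
anchor = anchor_features(["Prof. Alice Smith", "Home Page"])
print(kw, url, anchor)
```

The resulting sets could be encoded as a feature vector for whatever genre classifier is used; the claims do not name a particular classifier, so none is shown here.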

Current Assignee: Fujitsu Limited
Original Assignee: Fujitsu Limited
Inventors: WANG, Jun; UCHINO, Kanji
Granted Patent: US 9,390,166 B2
US Class Current: 707/737
CPC Class Codes:
G06F 16/353 into predefined classes
G06F 16/951 Indexing; Web crawling tech...
