Method and system for crawling, mapping and extracting information associated with a business using heuristic and semantic analysis

US 8,166,013 B2
Filed: 11/04/2008
Issued: 04/24/2012
Est. Priority Date: 11/05/2007
Status: Active Grant

Abstract

A method and system for crawling multiple websites containing one or more web pages having information relevant to a particular domain of interest, such as details about local restaurants, extracting content from such websites, such as hours, location and phone number as well as reviews, review dates and other business specific information, and associating the extracted content with a specific business entity.
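
As a rough illustration of the extraction and association steps summarized above, the following minimal sketch applies an extraction template to a stored page and attaches the extracted elements to a business identification. The template format (field name mapped to a regular expression), the `class="phone"` and `class="hours"` markup, and the function names `extract` and `associate` are all assumptions made for this example; the patent text here does not specify how a template is represented.

```python
import re
from typing import Dict, Pattern

# Hypothetical template: one regular expression per field, keyed to the
# assumed markup of a structured site. A real template would be derived
# from the structure of the site being crawled.
TEMPLATE: Dict[str, Pattern[str]] = {
    "phone": re.compile(r'<span class="phone">(.*?)</span>', re.S),
    "hours": re.compile(r'<span class="hours">(.*?)</span>', re.S),
}


def extract(html: str, template: Dict[str, Pattern[str]]) -> Dict[str, str]:
    """Return the first match for every template field found in the page."""
    elements = {}
    for field, pattern in template.items():
        match = pattern.search(html)
        if match:
            elements[field] = match.group(1).strip()
    return elements


def associate(business_id: str, elements: Dict[str, str]) -> Dict[str, str]:
    """Attach extracted elements to an (already resolved) business id."""
    return {"business_id": business_id, **elements}


page = ('<div><span class="phone"> (415) 555-0100 </span>'
        '<span class="hours">Mon-Fri 9-5</span></div>')
record = associate("biz-1", extract(page, TEMPLATE))
```

Fields that the template does not find are simply omitted from the record, which keeps partial pages from producing empty values.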

87 Citations

15 Claims

1. A system for collecting information from a web page, comprising:
- a file system; and
  
  a processor operatively connected to the file system and having functionality to execute instructions for:
  
  obtaining and storing contents of the web page;
  
  evaluating the contents to identify a unique identifier within the contents;
  
  transforming the contents to a normalized form by analyzing the contents to identify at least one selected from a group consisting of a street name, a street number, a street direction, a house number, a neighborhood, a city name, a state name, a zip code and a point of interest;
  
  parsing the normalized form of the contents to identify at least one token, wherein the at least one token comprises a portion of a physical address and an associated telephone number;
  
  semantically analyzing, using a plurality of heuristic rules, the at least one token to identify a plurality of possible business identifications;
  
  assigning, based on at least the portion of the physical address and the associated telephone number, a plurality of confidence scores to the plurality of possible business identifications;
  
  identifying a highest confidence score of the plurality of confidence scores;
  
  identifying, in the plurality of possible business identifications, a business identification corresponding to the highest confidence score;
  
  mapping the unique identifier to the business identification;
  
  extracting, after mapping the unique identifier to the business identification, at least one element from the contents of the web page using an extraction template, the extraction template generated based on a structure of the web page, the at least one element comprising data related to a business identified by the business identification;
  
  associating the at least one element related to the business with the business identification; and
  
  publishing results of the association of the at least one element with the business identification.
- View Dependent Claims (2, 3, 4, 5)
- - 2. The system of claim 1, wherein evaluating the contents to identify the unique identifier within the contents further comprises:
    - analyzing an underlying construction of the web page to identify at least one link to other web pages; and
      
      storing the at least one link to other web pages.
  - 3. The system of claim 1, wherein the web page is from a structured web site comprising web pages formatted in a pre-defined manner.
  - 4. The system of claim 1, wherein the web page is from an unstructured web site.
  - 5. The system of claim 1, wherein the instructions are further for identifying when the contents of the web page are modified, and wherein the web page is crawled again to store new contents.
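
The normalization, token parsing, and confidence-scoring steps recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the abbreviation table, the regular expressions, the `Token` and `Business` types, and the point weights (3 for a phone match, 1 each for street number and street name) are all hypothetical and chosen only to make the flow concrete.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical abbreviation table ("st" could also mean "saint"; a real
# system would need context to decide).
ABBREVIATIONS = {"st": "street", "ave": "avenue", "blvd": "boulevard", "rd": "road"}

PHONE_RE = re.compile(r"\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})")
ADDRESS_RE = re.compile(
    r"\b(\d+)\s+([a-z]+(?:\s[a-z]+)*?)\s+(street|avenue|boulevard|road)\b")


@dataclass(frozen=True)
class Token:
    street_number: str
    street_name: str
    phone: str  # digits only


@dataclass(frozen=True)
class Business:
    business_id: str
    street_number: str
    street_name: str
    phone: str


def normalize(text: str) -> str:
    """Lowercase, drop commas, collapse whitespace, expand abbreviations."""
    words = re.sub(r"[,\s]+", " ", text.lower()).strip().split(" ")
    return " ".join(ABBREVIATIONS.get(w.rstrip("."), w) for w in words)


def parse_token(normalized: str) -> Optional[Token]:
    """Find a partial street address and a telephone number, if both exist."""
    addr = ADDRESS_RE.search(normalized)
    phone = PHONE_RE.search(normalized)
    if not addr or not phone:
        return None
    street = f"{addr.group(2)} {addr.group(3)}"
    return Token(addr.group(1), street, "".join(phone.groups()))


def score(token: Token, candidate: Business) -> int:
    """Heuristic confidence in points; the weights are assumptions."""
    points = 0
    if token.phone == candidate.phone:
        points += 3
    if token.street_number == candidate.street_number:
        points += 1
    if token.street_name == candidate.street_name:
        points += 1
    return points


def best_match(token: Token, candidates: List[Business]) -> Tuple[int, Business]:
    """Score every candidate and return the highest-scoring one."""
    return max(((score(token, c), c) for c in candidates), key=lambda p: p[0])


text = "Luigi's Pizza, 123 Main St., San Francisco, CA 94105. Call (415) 555-0100"
token = parse_token(normalize(text))
candidates = [
    Business("biz-1", "123", "main street", "4155550100"),
    Business("biz-2", "123", "main street", "4155559999"),
]
points, best = best_match(token, candidates)
```

Integer points are used instead of fractional weights so that equal scores compare exactly; a page's unique identifier could then be mapped to `best.business_id` with an ordinary dictionary entry.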

6. A method for collecting information, comprising:
- identifying, using a processor, a plurality of web pages that are likely to comprise information about a business;
  
  crawling, using the processor, the plurality of web pages to collect and store contents of the plurality of web pages;
  
  evaluating, using the processor, the contents by:
  
  transforming the contents to a normalized form by analyzing the contents to identify at least one selected from a group consisting of a street name, a street number, a street direction, a house number, a neighborhood, a city name, a state name, a zip code and a point of interest;
  
  parsing the normalized form of the contents to identify at least one token, wherein the at least one token comprises a portion of a physical address and an associated telephone number;
  
  semantically analyzing, using a plurality of heuristic rules, the at least one token to identify a plurality of possible business identifications;
  
  assigning, based on at least the portion of the physical address and the associated telephone number, a plurality of confidence scores to the plurality of possible business identifications;
  
  identifying a highest confidence score of the plurality of confidence scores;
  
  identifying, in the plurality of possible business identifications, a business identification corresponding to the highest confidence score; and
  
  mapping the business identification to the business;
  
  extracting, using the processor and after mapping the business identification to the business, the information about the business from the contents; and
  
  publishing, using the processor, the information about the business.
- View Dependent Claims (12, 13, 14, 15)
- - 12. The method of claim 6, wherein evaluating the contents further comprises:
    - analyzing an underlying construction of each of the plurality of web pages to identify at least one link to other web pages; and
      
      storing the at least one link to other web pages.
  - 13. The method of claim 6, wherein at least one of the plurality of web pages is from a structured web site comprising structured web pages formatted in a pre-defined manner.
  - 14. The method of claim 6, wherein at least one of the plurality of web pages is from an unstructured web site.
  - 15. The method of claim 6, further comprising:
    - identifying when the contents of the plurality of web pages is modified, wherein the plurality of web pages is crawled again to store new contents.
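
Claims 12 and 15 describe identifying links in a page's underlying construction and detecting when stored contents have been modified. A minimal sketch of both, using only the Python standard library, might look like the following. Comparing SHA-256 digests to detect modification is an assumption of this example (the claims do not say how changes are detected), and the names `LinkCollector`, `extract_links`, `fingerprint`, and `needs_recrawl` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> List[str]:
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def fingerprint(contents: str) -> str:
    """Digest of the page contents, used only to compare versions."""
    return hashlib.sha256(contents.encode("utf-8")).hexdigest()


def needs_recrawl(stored: Dict[str, str], url: str, contents: str) -> bool:
    """Return True when contents differ from the stored fingerprint.

    The stored fingerprint is updated as a side effect.
    """
    new = fingerprint(contents)
    changed = stored.get(url) != new
    stored[url] = new
    return changed


html = '<a href="/menu">Menu</a> <a href="https://example.org/r">Reviews</a>'
links = extract_links(html, "https://example.com/biz/1")
```

A page that has never been seen counts as changed, so the same function covers both the first crawl and later re-crawls.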

7. A system for crawling websites, comprising:
- a file system; and
  
  a processor operatively connected to the file system and having functionality to execute instructions for:
  
  identifying a plurality of websites comprising information about a plurality of businesses;
  
  crawling a seed uniform resource locator (URL) located within a website of the plurality of websites;
  
  obtaining and storing contents of the website at a first web page identified by the seed URL;
  
  analyzing the contents to identify links from the website to other URLs and to identify at least one other URL of the other URLs that links to a second web page comprising an attribute of interest about a business of the plurality of businesses;
  
  storing the at least one other URL for use as an additional seed URL;
  
  extracting a business identification code from the contents, wherein the business identification code is a unique identifier used to organize information about the business on the website; and
  
  analyzing the contents to associate the contents with the business by:
  
  transforming the contents to a normalized form by analyzing the contents to identify at least one selected from a group consisting of a street name, a street number, a street direction, a house number, a neighborhood, a city name, a state name, a zip code and a point of interest;
  
  parsing the normalized form of the contents to identify at least a portion of a physical address and an associated telephone number;
  
  semantically analyzing, using a plurality of heuristic rules, the portion of the physical address and the associated telephone number to identify a plurality of possible business identifications;
  
  assigning, based on at least the portion of the physical address and the associated telephone number, a plurality of confidence scores to the plurality of possible business identifications;
  
  identifying a highest confidence score of the plurality of confidence scores;
  
  identifying, in the plurality of possible business identifications, a business identification corresponding to the highest confidence score;
  
  mapping the business identification code to the business identification;
  
  extracting, after mapping the business identification code to the business identification, other information about the business from the contents and associating the other information with the business; and
  
  publishing the other information about the business.
- View Dependent Claims (8, 9, 10, 11)
- - 8. The system of claim 7, wherein the business is a restaurant and wherein the second web page comprises at least one selected from a group consisting of reviews, ratings, directions, and bibliographic information.
  - 9. The system of claim 8, wherein extracting the other information about the business from the contents is performed using a heuristically derived extraction template.
  - 10. The system of claim 7, wherein the seed URL identifies a blog.
  - 11. The system of claim 10, wherein identifying the plurality of websites comprising the information about the plurality of businesses further comprises:
    - creating a plurality of query strings, each of the plurality of query strings comprising a name and a city of operation of the business;
      
      submitting each of the plurality of query strings to a notification service that monitors publication of new or altered information in blogs; and
      
      receiving an indication from the notification service when an updated web page is identified as matching a query string of the plurality of query strings, the indication including the URL of the updated web page, wherein the updated web page is identified as comprising the information about the business.
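
The seed-URL crawl in claim 7 and the query strings in claim 11 can be sketched roughly as below. This is an illustration under assumptions rather than the claimed system: fetching, link discovery, and the "attribute of interest" test are passed in as plain callables so the example runs without network access, and the quoted `"name" "city"` query format and the `max_pages` limit are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple


def build_query_strings(businesses: Iterable[Tuple[str, str]]) -> List[str]:
    """One query per business, combining its name and city of operation."""
    return [f'"{name}" "{city}"' for name, city in businesses]


def crawl(
    seed_urls: Iterable[str],
    fetch: Callable[[str], str],
    find_links: Callable[[str], Iterable[str]],
    is_of_interest: Callable[[str], bool],
    max_pages: int = 100,
) -> Dict[str, str]:
    """Breadth-first crawl that promotes interesting links to new seeds."""
    seeds = list(seed_urls)
    frontier = deque(seeds)
    seen = set(seeds)
    stored: Dict[str, str] = {}
    while frontier and len(stored) < max_pages:
        url = frontier.popleft()
        contents = fetch(url)
        stored[url] = contents  # obtain and store the page contents
        for link in find_links(contents):
            if link not in seen and is_of_interest(link):
                seen.add(link)
                frontier.append(link)  # stored as an additional seed URL
    return stored


# Toy site: each "page" is just a space-separated list of outgoing links.
pages = {"a": "b c", "b": "a", "c": ""}
stored = crawl(["a"], pages.__getitem__, str.split, lambda url: url != "c")
queries = build_query_strings([("Luigi's Pizza", "San Francisco")])
```

Keeping a `seen` set alongside the queue prevents pages that link to each other from being fetched repeatedly.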

Current Assignee: Intuit, Inc.
Original Assignee: Intuit, Inc.
Inventors: Bandaru, Nagaraju; Moyer, Eric D.; Radhakrishna, Shrisha
Primary Examiner(s): Ali, Mohammad
Assistant Examiner(s): Hocker, John
Application Number: US12/290,825
Publication Number: US 20090119268A1
Time in Patent Office: 1,267 Days
Field of Search: 707/705, 707/999.003, 707/708, 707/706
US Class Current: 707/705
CPC Class Codes:
G06F 16/951   Indexing; Web crawling tech...
G06F 40/258   Heading extraction; Automat...
G06Q 10/0631   Resource planning, allocati...
