Method and system for crawling, mapping and extracting information associated with a business using heuristic and semantic analysis
First Claim
1. A system for collecting information from a web page, comprising:
a file system; and
a processor operatively connected to the file system and having functionality to execute instructions for:
obtaining and storing contents of the web page;
evaluating the contents to identify a unique identifier within the contents;
transforming the contents to a normalized form by analyzing the contents to identify at least one selected from a group consisting of a street name, a street number, a street direction, a house number, a neighborhood, a city name, a state name, a zip code and a point of interest;
parsing the normalized form of the contents to identify at least one token, wherein the at least one token comprises a portion of a physical address and an associated telephone number;
semantically analyzing, using a plurality of heuristic rules, the at least one token to identify a plurality of possible business identifications;
assigning, based on at least the portion of the physical address and the associated telephone number, a plurality of confidence scores to the plurality of possible business identifications;
identifying a highest confidence score of the plurality of confidence scores;
identifying, in the plurality of possible business identifications, a business identification corresponding to the highest confidence score;
mapping the unique identifier to the business identification;
extracting, after mapping the unique identifier to the business identification, at least one element from the contents of the web page using an extraction template, the extraction template generated based on a structure of the web page, the at least one element comprising data related to a business identified by the business identification;
associating the at least one element related to the business with the business identification; and
publishing results of the association of the at least one element with the business identification.
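The normalizing, parsing, scoring and selection steps recited above can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the patent: the `KNOWN` directory, the abbreviation table, the two regular expressions, and the 60/40 heuristic weights. Confidence scores are integers out of 100.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Business:
    business_id: str
    street: str  # normalized street line, e.g. "123 n main st"
    phone: str   # ten digits, no punctuation

# Hypothetical directory of known businesses (not specified by the claim).
KNOWN = [
    Business("biz-1", "123 n main st", "4155550100"),
    Business("biz-2", "123 n main st", "4155550199"),
    Business("biz-3", "125 n main st", "4155550123"),
]

# Assumed normalization table and patterns, kept deliberately small.
ABBREV = {"street": "st", "avenue": "ave", "north": "n", "south": "s",
          "east": "e", "west": "w"}
STREET_RE = re.compile(r"\b\d+\s+(?:[nsew]\s+)?[a-z]+\s+(?:st|ave|blvd|rd)\b")
PHONE_RE = re.compile(r"\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})")

def normalize(text):
    """Lowercase, drop most punctuation, and abbreviate street words."""
    words = re.sub(r"[^\w\s()-]", " ", text.lower()).split()
    return " ".join(ABBREV.get(w, w) for w in words)

def parse_token(normalized):
    """Return the first street line and phone number found, if any."""
    street = STREET_RE.search(normalized)
    phone = PHONE_RE.search(normalized)
    return {
        "street": street.group(0) if street else None,
        "phone": "".join(phone.groups()) if phone else None,
    }

def score(token, biz):
    """Assumed heuristic weights: phone match 60, street match 40."""
    points = 0
    if token["phone"] == biz.phone:
        points += 60
    if token["street"] == biz.street:
        points += 40
    return points

def identify(text):
    """Score every candidate; return (business_id, score) of the best one."""
    token = parse_token(normalize(text))
    scored = [(score(token, biz), biz.business_id) for biz in KNOWN]
    candidates = [c for c in scored if c[0] > 0]
    if not candidates:
        return None
    best_score, best_id = max(candidates)
    return best_id, best_score

# Map a page's own identifier (hypothetical "listing-778") to the best match.
page = "Visit Luigi's at 123 North Main Street, call (415) 555-0100."
uid_to_business = {}
match = identify(page)
if match:
    uid_to_business["listing-778"] = match[0]
```

In this sketch two candidates share the street address, and the phone number is what separates them, which is one plausible reading of why the claim scores on both the address portion and the telephone number.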
2 Assignments
0 Petitions
Abstract
A method and system for crawling multiple websites containing one or more web pages having information relevant to a particular domain of interest, such as details about local restaurants, extracting content from such websites, such as hours, location and phone number as well as reviews, review dates and other business-specific information, and associating the extracted content with a specific business entity.
87 Citations
15 Claims
1. A system for collecting information from a web page (full text under First Claim above).
Dependent claims: 2, 3, 4, 5
6. A method for collecting information, comprising:
identifying, using a processor, a plurality of web pages that are likely to comprise information about a business;
crawling, using the processor, the plurality of web pages to collect and store contents of the plurality of web pages;
evaluating, using the processor, the contents by:
transforming the contents to a normalized form by analyzing the contents to identify at least one selected from a group consisting of a street name, a street number, a street direction, a house number, a neighborhood, a city name, a state name, a zip code and a point of interest;
parsing the normalized form of the contents to identify at least one token, wherein the at least one token comprises a portion of a physical address and an associated telephone number;
semantically analyzing, using a plurality of heuristic rules, the at least one token to identify a plurality of possible business identifications;
assigning, based on at least the portion of the physical address and the associated telephone number, a plurality of confidence scores to the plurality of possible business identifications;
identifying a highest confidence score of the plurality of confidence scores;
identifying, in the plurality of possible business identifications, a business identification corresponding to the highest confidence score; and
mapping the business identification to the business;
extracting, using the processor and after mapping the business identification to the business, the information about the business from the contents; and
publishing, using the processor, the information about the business.
Dependent claims: 12, 13, 14, 15
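Claim 6 ends by extracting and publishing information once the business has been identified, and claim 1 describes performing that extraction with a template based on the structure of the web page. A minimal Python sketch of template-driven extraction follows. The template format (field name mapped to a CSS class), the class names, and the `extract` helper are all hypothetical; the claims do not say how a template is represented or generated.

```python
from html.parser import HTMLParser

# Hypothetical template for one site: field name -> CSS class marking it.
TEMPLATE = {"hours": "biz-hours", "phone": "biz-phone", "review": "review-text"}

# Tags without a closing tag; skipped so nesting depth stays balanced.
VOID = {"br", "img", "hr", "input", "meta", "link"}

class TemplateExtractor(HTMLParser):
    def __init__(self, template):
        super().__init__()
        self.class_to_field = {cls: field for field, cls in template.items()}
        self.current = None  # field currently being captured, if any
        self.depth = 0       # nesting depth inside the captured element
        self.fields = {field: [] for field in template}

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.current is not None:
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in self.class_to_field:
                self.current = self.class_to_field[cls]
                self.depth = 1
                self.fields[self.current].append("")
                break

    def handle_endtag(self, tag):
        if tag in VOID or self.current is None:
            return
        self.depth -= 1
        if self.depth == 0:
            self.current = None

    def handle_data(self, data):
        # Accumulate text only while inside a templated element.
        if self.current is not None:
            self.fields[self.current][-1] += data

def extract(html, business_id, template=TEMPLATE):
    """Apply the template and tie every element to the business id."""
    parser = TemplateExtractor(template)
    parser.feed(html)
    parser.close()
    record = {"business_id": business_id}
    for field, values in parser.fields.items():
        record[field] = [v.strip() for v in values]
    return record
```

The returned record carries the business identification alongside each extracted element, so a publishing step could serialize it as is.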
7. A system for crawling websites, comprising:
a file system; and
a processor operatively connected to the file system and having functionality to execute instructions for:
identifying a plurality of websites comprising information about a plurality of businesses;
crawling a seed uniform resource locator (URL) located within a website of the plurality of websites;
obtaining and storing contents of the website at a first web page identified by the seed URL;
analyzing the contents to identify links from the website to other URLs and to identify at least one other URL of the other URLs that links to a second web page comprising an attribute of interest about a business of the plurality of businesses;
storing the at least one other URL for use as an additional seed URL;
extracting a business identification code from the contents, wherein the business identification code is a unique identifier used to organize information about the business on the website; and
analyzing the contents to associate the contents with the business by:
transforming the contents to a normalized form by analyzing the contents to identify at least one selected from a group consisting of a street name, a street number, a street direction, a house number, a neighborhood, a city name, a state name, a zip code and a point of interest;
parsing the normalized form of the contents to identify at least a portion of a physical address and an associated telephone number;
semantically analyzing, using a plurality of heuristic rules, the portion of the physical address and the associated telephone number to identify a plurality of possible business identifications;
assigning, based on at least the portion of the physical address and the associated telephone number, a plurality of confidence scores to the plurality of possible business identifications;
identifying a highest confidence score of the plurality of confidence scores;
identifying, in the plurality of possible business identifications, a business identification corresponding to the highest confidence score;
mapping the business identification code to the business identification;
extracting, after mapping the business identification code to the business identification, other information about the business from the contents and associating the other information with the business; and
publishing the other information about the business.
Dependent claims: 8, 9, 10, 11
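The seed-URL crawling steps of claim 7 can be sketched in Python as a breadth-first frontier. This is only an illustration under stated assumptions: the in-memory `SITE` dictionary stands in for real fetching and for the file system, `looks_interesting` is an invented heuristic for "a web page comprising an attribute of interest", and the `data-biz-id` attribute is a hypothetical place where a site might keep its business identification code.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory website so the sketch runs without network access.
SITE = {
    "http://example.test/":
        '<a href="/biz/101">Luigi</a> <a href="/about">About</a>',
    "http://example.test/biz/101":
        '<div data-biz-id="101">Luigi</div> <a href="/biz/101/reviews">Reviews</a>',
    "http://example.test/biz/101/reviews":
        '<div data-biz-id="101">Great crust.</div> <a href="/">Home</a>',
    "http://example.test/about": "<p>About us</p>",
}

# Assumed location of the site's own business identification code.
BIZ_CODE_RE = re.compile(r'data-biz-id="(\d+)"')

class LinkParser(HTMLParser):
    """Collect the href of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def looks_interesting(url):
    # Assumed heuristic: business pages live under /biz/ on this site.
    return "/biz/" in url

def crawl(seed, fetch=SITE.get, limit=50):
    frontier = deque([seed])  # URLs waiting to be crawled
    seen = {seed}
    stored = {}               # url -> contents (file system stand-in)
    codes = {}                # url -> business identification code
    while frontier and len(stored) < limit:
        url = frontier.popleft()
        contents = fetch(url)
        if contents is None:
            continue
        stored[url] = contents
        found = BIZ_CODE_RE.search(contents)
        if found:
            codes[url] = found.group(1)
        parser = LinkParser()
        parser.feed(contents)
        for href in parser.links:
            target = urljoin(url, href)
            # Promising links become additional seed URLs.
            if target not in seen and looks_interesting(target):
                seen.add(target)
                frontier.append(target)
    return stored, codes
```

Calling `crawl("http://example.test/")` on this toy site stores the home page and the two business pages, skips the about page, and records code `101` for both business pages.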
Specification