Method and system for crawling, mapping and extracting information associated with a business using heuristic and semantic analysis
Abstract
A method and system for crawling multiple websites containing one or more web pages having information relevant to a particular domain of interest, such as details about local restaurants; extracting content from such websites, such as hours, location and phone number, as well as reviews, review dates and other business-specific information; and associating the extracted content with a specific business entity.
17 Claims
1. A system for crawling, mapping, extracting and associating information on a web page with a known business, the system comprising:
means for identifying the website address of a web page to be crawled;
means for downloading and storing the contents of said web page;
means for evaluating the stored content and identifying a unique identification symbol within said stored content and associated with a business referenced on said web page;
means for transforming the stored content to a normalized form;
means for identifying one or more potential businesses that may be associated with said normalized content;
means for confidently selecting one business, from among said one or more potential businesses, to be associated with said normalized content;
means for mapping said unique identification symbol with said confidently selected business;
means for extracting one or more elements from said stored contents of said web page in accordance with an extraction template formed in accordance with the structure of said web page, said elements comprising data about the business referenced on said web page;
means for associating said extracted elements about the business referenced on said web page with the confidently selected one business; and
means for publishing the results of said association of said extracted elements with said confidently selected one business.
Dependent claims: 2, 3, 4, 5, 6, 7
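The normalization, candidate identification, confident selection and symbol mapping steps recited in claim 1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the `DIRECTORY` records, the field names, the 0.6/0.4 score weights, and the `threshold` and `margin` values. String similarity from Python's standard `difflib` stands in for whatever heuristic or semantic comparison an actual system would use.

```python
import re
from difflib import SequenceMatcher

# Hypothetical directory of known businesses (illustrative data only).
DIRECTORY = [
    {"id": "B1", "name": "Joe's Pizza", "phone": "(212) 555-0101"},
    {"id": "B2", "name": "Joe's Pizzeria & Grill", "phone": "(212) 555-0199"},
]


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def digits(phone):
    """Keep only the digits of a phone number."""
    return re.sub(r"\D", "", phone)


def score(page, business):
    """Assumed scoring: weighted name similarity plus exact phone match."""
    name_sim = SequenceMatcher(
        None, normalize(page["name"]), normalize(business["name"])
    ).ratio()
    phone_match = 1.0 if digits(page["phone"]) == digits(business["phone"]) else 0.0
    return 0.6 * name_sim + 0.4 * phone_match


def select_business(page, directory=DIRECTORY, threshold=0.8, margin=0.1):
    """Return the top candidate only when it clears the threshold and
    beats the runner-up by the margin; otherwise return None."""
    ranked = sorted(directory, key=lambda b: score(page, b), reverse=True)
    if not ranked:
        return None
    best = score(page, ranked[0])
    runner_up = score(page, ranked[1]) if len(ranked) > 1 else 0.0
    if best >= threshold and best - runner_up >= margin:
        return ranked[0]
    return None


# Map the page's identification symbol to the selected business.
symbol_to_business = {}
page = {"symbol": "biz-48213", "name": "JOE'S PIZZA", "phone": "212-555-0101"}
match = select_business(page)
if match is not None:
    symbol_to_business[page["symbol"]] = match["id"]
```

Returning `None` when no candidate is clearly ahead is one way to model "confidently selecting": an ambiguous page is left unmapped rather than attached to the wrong business.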
8. A method for collecting information about a local business from multiple web pages including:
identifying one or more web pages that are likely to contain information about said local business;
crawling said one or more web pages;
collecting and storing all of the content found on said one or more web pages;
evaluating said stored content to determine the business entity said stored content refers to;
matching said business entity to said local business;
extracting all the information about said local business from said stored content; and
publishing said extracted information, said publication including an indication of the local business to which said extracted information refers.
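The extracting and publishing steps of claim 8 can be sketched as follows. This is a hedged illustration, not the patented method: the `TEMPLATE` patterns, the HTML class names, the business identifier `B1` and the `example.com` URL are all hypothetical, and regular expressions are used only to keep the sketch short (a production extractor would more likely use a real HTML parser).

```python
import json
import re

# Hypothetical extraction template for one site layout:
# field name -> regex with a single capture group.
TEMPLATE = {
    "phone": r'<span class="phone">(.*?)</span>',
    "hours": r'<span class="hours">(.*?)</span>',
    "review": r'<p class="review">(.*?)</p>',
}


def extract(stored_html, template):
    """Apply every pattern and keep all matches, so repeated
    elements such as reviews are not lost."""
    return {
        field: re.findall(pattern, stored_html, flags=re.S)
        for field, pattern in template.items()
    }


def publish(business_id, source_url, elements):
    """Serialize the extracted elements together with an indication
    of the business they refer to."""
    return json.dumps(
        {"business_id": business_id, "source": source_url, "elements": elements},
        sort_keys=True,
    )


# Stored page content (illustrative only).
stored_html = (
    '<div><span class="phone">(212) 555-0101</span>'
    '<span class="hours">11am-10pm</span>'
    '<p class="review">Great slice.</p>'
    '<p class="review">Fast service.</p></div>'
)

record = publish(
    "B1", "https://example.com/biz/48213", extract(stored_html, TEMPLATE)
)
```

Keeping the template as data rather than code means a different site layout only needs a different dictionary, which is one plausible reading of a template "formed in accordance with the structure of said web page".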
9. An improved system for crawling websites to collect and extract information about a specific local business, said system comprising:
means for identifying a list of websites known to contain information about local businesses;
means for crawling a seed URL located within one of said websites known to contain information about local businesses;
means for downloading and storing all of the content found on said website at a web page identified by said seed URL;
means for analyzing said stored content to identify all other URLs the web page identified by said seed URL points to and to determine whether said other URLs point to a web page containing an attribute of interest about said specific local business;
means for storing said identified all other URLs whereby they can also be used as seed URLs;
means for extracting a business identification code from said stored content, said business identification code being a unique identifier used to organize all information about a specific local business on said website; and
means for analyzing said stored content in order to associate it with a specific local business, further comprising
means for extracting at least one of a business name, business address and business phone number from said stored content;
means for comparing said extracted business name, business address and business phone number with a directory of known local businesses and identifying the known local business that said stored content is associated with;
means for extracting all other information about said identified known local business from said stored content and associating said extracted information with said identified local business; and
means for publishing the results of said association.
Dependent claims: 10, 11, 15, 16, 17
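The seed-URL handling in claim 9 (finding the other URLs a stored page points to, keeping those likely to carry an attribute of interest, reusing them as seeds, and reading a business identification code) might look like the sketch below. The `BUSINESS_URL` pattern, the `/biz/<digits>` URL scheme and the `example.com` addresses are assumptions for illustration only; the claim also allows the code to come from the page content itself, whereas this sketch reads it from the URL for brevity.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the anchor tags of a stored page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


# Hypothetical URL scheme: /biz/<code> with optional attribute pages.
BUSINESS_URL = re.compile(r"/biz/(\d+)(?:/(reviews|hours))?$")


def expand_seed(seed_url, stored_html, frontier, seen):
    """Queue unseen links that look like business pages as new seeds."""
    collector = LinkCollector(seed_url)
    collector.feed(stored_html)
    for url in collector.links:
        if BUSINESS_URL.search(url) and url not in seen:
            seen.add(url)
            frontier.append(url)


def business_code(url):
    """Return the business identification code embedded in a URL, if any."""
    found = BUSINESS_URL.search(url)
    return found.group(1) if found else None


seed = "https://example.com/city/nyc"
stored_html = (
    '<a href="/biz/48213">Joe\'s Pizza</a>'
    '<a href="/biz/48213/reviews">Reviews</a>'
    '<a href="/about">About</a>'
    '<a href="/biz/48213">Joe\'s Pizza (again)</a>'
)
frontier = deque()
seen = {seed}
expand_seed(seed, stored_html, frontier, seen)
```

The `seen` set keeps a URL from being queued twice, and the shared code `48213` is what lets the main page and its reviews page be grouped under one business on that site.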
12. The improved system for crawling websites to collect and extract information about a specific local business, as claimed in claim 10, wherein said means for comparing said extracted business name, business address and business phone number with a directory of known local businesses further comprise the use of semantic analysis.
12-14. (canceled)
Specification