Method and system for crawling, mapping and extracting information associated with a business using heuristic and semantic analysis

  • US 8,166,013 B2
  • Filed: 11/04/2008
  • Issued: 04/24/2012
  • Est. Priority Date: 11/05/2007
  • Status: Active Grant
First Claim

1. A system for collecting information from a web page, comprising:

  • a file system; and

    a processor operatively connected to the file system and having functionality to execute instructions for:

    obtaining and storing contents of the web page;

    evaluating the contents to identify a unique identifier within the contents;

    transforming the contents to a normalized form by analyzing the contents to identify at least one selected from a group consisting of a street name, a street number, a street direction, a house number, a neighborhood, a city name, a state name, a zip code and a point of interest;

    parsing the normalized form of the contents to identify at least one token, wherein the at least one token comprises a portion of a physical address and an associated telephone number;

    semantically analyzing, using a plurality of heuristic rules, the at least one token to identify a plurality of possible business identifications;

    assigning, based on at least the portion of the physical address and the associated telephone number, a plurality of confidence scores to the plurality of possible business identifications;

    identifying a highest confidence score of the plurality of confidence scores;

    identifying, in the plurality of possible business identifications, a business identification corresponding to the highest confidence score;

    mapping the unique identifier to the business identification;

    extracting, after mapping the unique identifier to the business identification, at least one element from the contents of the web page using an extraction template, the extraction template generated based on a structure of the web page, the at least one element comprising data related to a business identified by the business identification;

    associating the at least one element related to the business with the business identification; and

    publishing results of the association of the at least one element with the business identification.
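
The normalizing, parsing, scoring, and mapping steps recited above can be illustrated with a minimal sketch. It is not the patented implementation. Everything named here is a hypothetical stand-in chosen for illustration: the in-memory `DIRECTORY`, the regular expressions, the integer weights (40 for an address match, 60 for a phone match), and the function names. The page's unique identifier is assumed to be already known and is passed in as `page_id`; template-based extraction and publishing are left out.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical business directory; a real system would query a data store.
DIRECTORY = [
    {"id": "biz-1", "street": "123 main st", "phone": "4155550100"},
    {"id": "biz-2", "street": "123 main st", "phone": "4155550199"},
    {"id": "biz-3", "street": "500 main st", "phone": "4155550142"},
]

# Illustrative patterns only; real address and phone formats vary far more.
PHONE_RE = re.compile(r"\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]*(\d{4})")
STREET_RE = re.compile(r"\b(\d+)\s+([a-z]+)\s+(street|st|avenue|ave|road|rd)\b")
SUFFIXES = {"street": "st", "avenue": "ave", "road": "rd"}


@dataclass
class Token:
    """A portion of a physical address plus an associated telephone number."""
    street: Optional[str]
    phone: Optional[str]


def normalize(contents: str) -> str:
    """Strip markup, collapse whitespace, and lowercase the page contents."""
    text = re.sub(r"<[^>]+>", " ", contents)
    return re.sub(r"\s+", " ", text).strip().lower()


def parse_token(normalized: str) -> Token:
    """Pull the first street address and phone number out of normalized text."""
    street = None
    m = STREET_RE.search(normalized)
    if m:
        number, name, suffix = m.groups()
        street = f"{number} {name} {SUFFIXES.get(suffix, suffix)}"
    p = PHONE_RE.search(normalized)
    phone = "".join(p.groups()) if p else None
    return Token(street, phone)


def score_candidates(token: Token) -> Dict[str, int]:
    """Assign a confidence score to every directory entry the token matches."""
    scores: Dict[str, int] = {}
    for biz in DIRECTORY:
        score = 0
        if token.street is not None and token.street == biz["street"]:
            score += 40  # assumed weight for an address match
        if token.phone is not None and token.phone == biz["phone"]:
            score += 60  # assumed weight for a phone match
        if score > 0:
            scores[biz["id"]] = score
    return scores


def map_page(page_id: str, contents: str) -> Optional[Tuple[str, str]]:
    """Map the page identifier to the highest-scoring business, if any."""
    scores = score_candidates(parse_token(normalize(contents)))
    if not scores:
        return None
    best = max(scores, key=lambda biz_id: scores[biz_id])
    return page_id, best


if __name__ == "__main__":
    html = "<p>Joe's Pizza, 123 Main Street. Call (415) 555-0100</p>"
    print(map_page("page-42", html))
```

In this sketch two directory entries share the street address, so the address alone yields two possible business identifications; the phone number is what separates them when the highest score is selected.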
