Generating structured information
First Claim
1. A system for generating structured data, comprising:
a processor for executing computer program modules; and
a computer-readable storage medium storing executable computer program modules comprising:
a data acquisition module for receiving an electronic document containing unstructured data describing facts about business hours of an enterprise;
a data extraction module for extracting the unstructured data describing facts about the business hours of the enterprise from the electronic document; and
a data parsing module for receiving the extracted unstructured data and creating structured representations of the facts about the business hours of the enterprise described by the unstructured data, wherein the data parsing module comprises:
a value normalization module for receiving a string describing facts about the business hours of the enterprise extracted from the electronic document and for:
parsing the string to classify symbols within the string, the parsing classifying symbols within the string as representing days of the week and classifying symbols within the string as representing times of the enterprise's business hours;
collapsing the symbols representing days of the week in the string to form a collapsed string, the collapsed string having a symbol representing a sequence of days and the symbols representing times of the enterprise's business hours;
interpreting the symbols within the collapsed string to determine business hours for the enterprise on the days in the sequence;
wherein the structured representations of the facts about the business hours of the enterprise comprise a vector describing the symbol representing the sequence of days using bits indicating days of the week on which the enterprise is open.
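The parsing, collapsing, and interpreting steps recited in the claim can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not the patented implementation: the token patterns, the function names (`classify`, `collapse_days`, `interpret`, `parse_hours`), the choice of Monday as bit 0 of the day vector, minutes-since-midnight as the time unit, and the supported input shape (a day or day range followed by an opening and closing time, such as "Mon-Fri 9am-5pm").

```python
import re

# Assumed bit layout for the day vector: Monday is bit 0, Sunday is bit 6.
DAY_NAMES = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
DAY_RE = re.compile(r"(mon|tue|wed|thu|fri|sat|sun)[a-z]*", re.I)
TIME_RE = re.compile(r"(\d{1,2})(?::(\d{2}))?\s*(am|pm)", re.I)
TOKEN_RE = re.compile(r"\d{1,2}(?::\d{2})?\s*(?:am|pm)|[A-Za-z]+|-", re.I)


def classify(text):
    """Classify each symbol in the string as a day, a time, or a range dash."""
    symbols = []
    for tok in TOKEN_RE.findall(text):
        day = DAY_RE.fullmatch(tok)
        time = TIME_RE.fullmatch(tok)
        if day:
            symbols.append(("DAY", DAY_NAMES.index(day.group(1).lower())))
        elif time:
            hour = int(time.group(1)) % 12 + (12 if time.group(3).lower() == "pm" else 0)
            symbols.append(("TIME", hour * 60 + int(time.group(2) or 0)))
        elif tok == "-":
            symbols.append(("RANGE", None))
        # Any other word (e.g. "to", "closed") is ignored in this sketch.
    return symbols


def collapse_days(symbols):
    """Collapse day symbols into one DAYS symbol that carries a 7-bit vector."""
    collapsed, i = [], 0
    while i < len(symbols):
        kind, value = symbols[i]
        if kind == "DAY":
            mask = 1 << value
            # DAY RANGE DAY covers every day from start to end, wrapping past Sunday.
            if (i + 2 < len(symbols) and symbols[i + 1][0] == "RANGE"
                    and symbols[i + 2][0] == "DAY"):
                end, d = symbols[i + 2][1], value
                while d != end:
                    d = (d + 1) % 7
                    mask |= 1 << d
                i += 2
            # Adjacent day symbols (e.g. "Sat Sun") merge into the same sequence.
            if collapsed and collapsed[-1][0] == "DAYS":
                collapsed[-1] = ("DAYS", collapsed[-1][1] | mask)
            else:
                collapsed.append(("DAYS", mask))
        elif kind == "TIME":
            collapsed.append(symbols[i])
        i += 1  # RANGE symbols between times are dropped
    return collapsed


def interpret(collapsed):
    """Pair each DAYS symbol with the opening and closing times that follow it."""
    results, days, times = [], None, []
    for kind, value in collapsed:
        if kind == "DAYS":
            days, times = value, []
        elif kind == "TIME" and days is not None:
            times.append(value)
            if len(times) == 2:
                results.append({"days": days, "open": times[0], "close": times[1]})
                times = []
    return results


def parse_hours(text):
    return interpret(collapse_days(classify(text)))
```

Under these assumptions, `parse_hours("Mon-Fri 9am-5pm")` returns a single entry whose `days` value is `0b0011111` (bits 0 through 4 set for Monday through Friday), with `open` at 540 and `close` at 1020 minutes after midnight.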
Abstract
Structured and/or unstructured data about enterprises are acquired from one or more sources such as commercial data providers, enterprise web sites, and/or directory web sites. Strings are extracted from the unstructured data. The strings contain key, value pairs describing facts about the enterprises. The extracted strings are parsed to normalize the keys and values and place them in a machine-understandable structured representation. Some keys and/or values cannot be normalized. The facts are clustered with the enterprise to which they pertain. Normalized facts from different sources are compared and confidence levels and/or weights are assigned to the facts. These confidence levels and weights are used to select the facts that are displayed on a page for the enterprise in a directory.
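The comparison step described in the abstract, in which normalized facts from different sources are weighted and the best-supported fact is selected for display, might look like the following sketch. The source names, the weight values, the fallback weight, and the sum-of-weights scoring rule are all hypothetical; the abstract does not say how confidence levels or weights are computed.

```python
from collections import defaultdict

# Hypothetical per-source trust weights; the abstract gives no actual values.
SOURCE_WEIGHTS = {
    "enterprise_site": 0.9,
    "data_provider": 0.7,
    "directory_site": 0.5,
}
DEFAULT_WEIGHT = 0.1  # assumed weight for a source not listed above


def select_facts(facts):
    """Pick one value per (enterprise, key) cluster.

    facts: iterable of (enterprise_id, key, normalized_value, source) tuples.
    Returns {(enterprise_id, key): (value, confidence)}.
    """
    # Cluster facts by enterprise and key, summing source weights per value.
    scores = defaultdict(lambda: defaultdict(float))
    for enterprise_id, key, value, source in facts:
        scores[(enterprise_id, key)][value] += SOURCE_WEIGHTS.get(source, DEFAULT_WEIGHT)

    selected = {}
    for cluster, candidates in scores.items():
        value, weight = max(candidates.items(), key=lambda item: item[1])
        # Confidence here is the winning value's share of the total weight.
        selected[cluster] = (value, weight / sum(candidates.values()))
    return selected
```

In this sketch two lower-weight sources that agree can outvote a single higher-weight source, which is one plausible reading of comparing facts across sources; other scoring rules would fit the abstract equally well.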
18 Claims
1. A system for generating structured data, comprising:
a processor for executing computer program modules; and
a computer-readable storage medium storing executable computer program modules comprising:
a data acquisition module for receiving an electronic document containing unstructured data describing facts about business hours of an enterprise;
a data extraction module for extracting the unstructured data describing facts about the business hours of the enterprise from the electronic document; and
a data parsing module for receiving the extracted unstructured data and creating structured representations of the facts about the business hours of the enterprise described by the unstructured data, wherein the data parsing module comprises:
a value normalization module for receiving a string describing facts about the business hours of the enterprise extracted from the electronic document and for:
parsing the string to classify symbols within the string, the parsing classifying symbols within the string as representing days of the week and classifying symbols within the string as representing times of the enterprise's business hours;
collapsing the symbols representing days of the week in the string to form a collapsed string, the collapsed string having a symbol representing a sequence of days and the symbols representing times of the enterprise's business hours;
interpreting the symbols within the collapsed string to determine business hours for the enterprise on the days in the sequence;
wherein the structured representations of the facts about the business hours of the enterprise comprise a vector describing the symbol representing the sequence of days using bits indicating days of the week on which the enterprise is open.
Dependent claims: 2, 3, 4, 5, 6
7. A computer-readable storage medium having computer-executable program modules for generating structured data tangibly embodied therein, comprising:
a data acquisition module for receiving an electronic document containing unstructured data describing facts about business hours of an enterprise;
a data extraction module for extracting the unstructured data describing facts about the business hours of the enterprise from the electronic document; and
a data parsing module for receiving the extracted unstructured data and creating structured representations of the facts about the business hours of the enterprise described by the unstructured data, wherein the data parsing module comprises:
a value normalization module for receiving a string describing facts about the business hours of the enterprise extracted from the electronic document and for:
parsing the string to classify symbols within the string, the parsing classifying symbols within the string as representing days of the week and classifying symbols within the string as representing times of the enterprise's business hours;
collapsing the symbols representing days of the week in the string to form a collapsed string, the collapsed string having a symbol representing a sequence of days and the symbols representing times of the enterprise's business hours, wherein the symbol representing the sequence of days is described in the structured representation by a vector having bits indicating days of the week on which the enterprise is open; and
interpreting the symbols within the collapsed string to determine business hours for the enterprise on the days in the sequence;
wherein the structured representations of the facts about the business hours of the enterprise comprise a vector describing the symbol representing the sequence of days using bits indicating days of the week on which the enterprise is open.
Dependent claims: 8, 9, 10, 11, 12
13. A method for generating structured data, comprising:
using a computer to perform steps comprising:
receiving an electronic document containing unstructured data describing facts about business hours of an enterprise;
extracting the unstructured data describing facts about the business hours of the enterprise from the electronic document; and
receiving the extracted unstructured data and creating structured representations of the facts about the business hours of the enterprise described by the unstructured data, wherein the receiving extracted unstructured data and creating comprises:
receiving a string describing facts about the business hours of the enterprise extracted from the electronic document;
parsing the string to classify symbols within the string, the parsing classifying symbols within the string as representing days of the week and classifying symbols within the string as representing times of the enterprise's business hours;
collapsing the symbols representing days of the week in the string to form a collapsed string, the collapsed string having a symbol representing a sequence of days and the symbols representing times of the enterprise's business hours, wherein the symbol representing the sequence of days is described in the structured representation by a vector having bits indicating days of the week on which the enterprise is open; and
interpreting the symbols within the collapsed string to determine business hours for the enterprise on the days in the sequence;
wherein the structured representations of the facts about the business hours of the enterprise comprise a vector describing the symbol representing the sequence of days using bits indicating days of the week on which the enterprise is open.
Dependent claims: 14, 15, 16, 17, 18
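The first two steps of the method, receiving an electronic document and extracting the unstructured business-hours data from it, could be sketched as below. This is only one possible reading: the assumption that facts appear as "key: value" lines, the alias table, the normalized key name `business_hours`, and the function name `extract_facts` are all hypothetical and are not taken from the claims.

```python
import re

# Assumed spellings that map to one normalized key; purely illustrative.
KEY_ALIASES = {
    "hours": "business_hours",
    "opening hours": "business_hours",
    "business hours": "business_hours",
}

# A "key: value" line, where the key consists of letters and spaces only.
LINE_RE = re.compile(r"^\s*([A-Za-z ]+?)\s*:\s*(.+?)\s*$")


def extract_facts(document_text):
    """Return (normalized_key, raw_value) pairs for lines with a known key."""
    facts = []
    for line in document_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        key = KEY_ALIASES.get(match.group(1).lower())
        if key is not None:
            # The value is still an unstructured string at this point; turning
            # it into a structured representation is the job of the parsing step.
            facts.append((key, match.group(2)))
    return facts
```

Lines whose key is not in the alias table are skipped rather than guessed at, which mirrors the abstract's remark that some keys cannot be normalized.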
Specification