NATURAL LANGUAGE PARSERS TO NORMALIZE ADDRESSES FOR GEOCODING
Abstract
The present invention provides a technique for building natural language parsers by implementing a country and/or jurisdiction specific set of training data that is automatically converted during a build phase to a respective predictive model, i.e., an automated country specific natural language parser. The predictive model can be used without the training data to quantify any input address. This model may be included as part of a larger Geographic Information System (GIS) dataset or as a standalone quantifier. The build phase may also be run on demand and the resultant predictive model kept in temporary storage for immediate use.
19 Claims
1. A method for normalizing an input address comprising the steps of:
- receiving an input address,
- parsing the input address into components,
- classifying each component according to one or more predetermined regular expressions and a lexicon of known tokens, thereby generating classified components, and
- executing a predictive model to associate each classified component with a unique address field.
Dependent claims: 2, 3, 4, 5, 6, 7

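The steps recited in claim 1 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the patent: the `LEXICON` entries, the `PATTERNS` list, the class and field names, and the `MODEL` table of probabilities. The sketch also assumes that "associating" a component with a field means choosing the single most probable field for that component's class.

```python
import re

# Hypothetical lexicon of known tokens (illustrative only).
LEXICON = {"st": "STREET_TYPE", "street": "STREET_TYPE",
           "ave": "STREET_TYPE", "apt": "UNIT_TYPE"}

# Hypothetical predetermined regular expressions, tried in order.
PATTERNS = [
    ("ZIP", re.compile(r"\d{5}(-\d{4})?")),
    ("NUMBER", re.compile(r"\d+")),
    ("WORD", re.compile(r"[A-Za-z]+")),
]

# Hypothetical predictive model: P(address field | component class).
MODEL = {
    "NUMBER": {"house_number": 0.8, "unit": 0.2},
    "ZIP": {"postal_code": 0.95, "house_number": 0.05},
    "STREET_TYPE": {"street_type": 0.9, "street_name": 0.1},
    "UNIT_TYPE": {"unit_type": 1.0},
    "WORD": {"street_name": 0.6, "city": 0.4},
}


def parse(address):
    """Split the input address into components on whitespace and commas."""
    return [t for t in re.split(r"[\s,]+", address.strip()) if t]


def classify(token):
    """Classify one component using the lexicon first, then the patterns."""
    key = token.lower().rstrip(".")
    if key in LEXICON:
        return LEXICON[key]
    for name, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return name
    return "UNKNOWN"


def normalize(address):
    """Return (component, class, field) triples for the input address."""
    result = []
    for token in parse(address):
        cls = classify(token)
        fields = MODEL.get(cls, {"unknown": 1.0})
        # Pick the single most probable field for this class.
        result.append((token, cls, max(fields, key=fields.get)))
    return result
```

Calling `normalize("123 Main St, 90210")` under these assumptions labels `123` as a house number, `Main` as a street name, `St` as a street type, and `90210` as a postal code. A per-class lookup table is the simplest stand-in for a predictive model; a model that also conditions on neighboring components would be needed to tell a city name from a street name.
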
8. A method of constructing a natural language parser comprising the steps of:
- loading a training file defining an acceptable format for one or more regular expressions and comprising exemplary address field and token pairs;
- parsing the training file into a number of tokens;
- classifying the tokens according to a lexicon of known tokens and the regular expressions; and
- generating a predictive model that defines a probability for each of one or more address fields that may be associated with a given token.
Dependent claims: 9, 10, 11, 12

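The build phase recited in claim 8 can be sketched in the same way. The training-file layout below is hypothetical, since the claim does not specify one: lines beginning with `regex` declare a class name and a pattern, and every other line holds `field=token` example pairs. The function names, the sample data, and the use of relative frequencies as probabilities are likewise assumptions.

```python
import re
from collections import Counter, defaultdict

# Hypothetical training data; regex declarations must precede the examples.
TRAINING_TEXT = r"""
regex NUMBER \d+
regex WORD [A-Za-z]+
house_number=12 street_name=Oak street_type=St
house_number=7 street_name=Elm street_type=Ave
unit=4 street_name=Pine street_type=St
"""

# Hypothetical lexicon of known tokens, supplied separately from the file.
LEXICON = {"st": "STREET_TYPE", "ave": "STREET_TYPE"}


def classify(token, lexicon, patterns):
    """Classify a token using the lexicon first, then the declared patterns."""
    key = token.lower()
    if key in lexicon:
        return lexicon[key]
    for name, pattern in patterns:
        if pattern.fullmatch(token):
            return name
    return "UNKNOWN"


def build_model(training_text, lexicon):
    """Build {class: {field: probability}} from the training text."""
    patterns = []
    counts = defaultdict(Counter)
    for line in training_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("regex "):
            # Declaration line: "regex <CLASS> <pattern>".
            _, name, pattern = line.split(maxsplit=2)
            patterns.append((name, re.compile(pattern)))
            continue
        # Example line: whitespace-separated "field=token" pairs.
        for pair in line.split():
            field, token = pair.split("=", 1)
            counts[classify(token, lexicon, patterns)][field] += 1
    # Convert co-occurrence counts into relative frequencies per class.
    model = {}
    for cls, field_counts in counts.items():
        total = sum(field_counts.values())
        model[cls] = {f: n / total for f, n in field_counts.items()}
    return model
```

With this sample data, `build_model(TRAINING_TEXT, LEXICON)` assigns the `NUMBER` class a probability of 2/3 for `house_number` and 1/3 for `unit`. The resulting dictionary holds no reference to the training text, so it can be stored or shipped on its own.
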
13. A computer readable medium encoded with computer readable program code, the program code comprising the instructions of:
- parsing an input address into components,
- classifying each component according to one or more predetermined regular expressions and a lexicon of known tokens, thereby generating classified components, and
- executing a predictive model to associate each classified component with a unique address field.
Dependent claims: 14, 15, 16, 17, 18, 19

Specification