Systems and methods for discovering synonymous elements using context over multiple similar addresses
First Claim
1. A method comprising:
- accessing a database having postal addresses stored therein on a computer readable storage medium;
in response to accessing the database:
clustering a plurality of the postal addresses based on similarity, and thereby forming at least one cluster of postal addresses; and
within an identified cluster of postal addresses, identifying one or more synonyms relative to one or more components of postal addresses, wherein the one or more synonyms comprise variants of the one or more components; and
with respect to one or more components of the postal addresses in the identified cluster of postal addresses, identifying a standardized identifier from among the one or more synonyms and applying the standardized identifier to postal addresses within the cluster of postal addresses.
Abstract
A clustering-based approach to data standardization is provided. Certain embodiments take as input a plurality of addresses, identify one or more features of the addresses, cluster the addresses based on the one or more features, utilize the cluster(s) to provide a data-based context useful in identifying one or more synonyms for elements contained in the address(es), and standardize the address(es) to an acceptable format, with one or more synonyms and/or other elements being added to or taken away from the input address(es) as part of the standardization process.
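The pipeline described in the abstract (cluster similar addresses, use each cluster as context to find synonymous elements, then standardize) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the patent: the function names, the token-level `difflib` similarity, the greedy clustering, the `0.6` threshold, the position-wise alignment, and the choice of the most frequent variant as the standardized identifier.

```python
from collections import Counter
from difflib import SequenceMatcher


def tokens(address: str) -> list[str]:
    """Lowercase an address and split it into tokens, dropping commas and periods."""
    return address.lower().replace(",", " ").replace(".", " ").split()


def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1] (hypothetical choice of measure)."""
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()


def cluster(addresses: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Greedy clustering: each address joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster."""
    clusters: list[list[str]] = []
    for addr in addresses:
        for group in clusters:
            if similarity(addr, group[0]) >= threshold:
                group.append(addr)
                break
        else:
            clusters.append([addr])
    return clusters


def find_synonyms(group: list[str]) -> dict[str, str]:
    """Within one cluster, treat differing alphabetic tokens at the same
    position as variants and map each to the most frequent one."""
    rows = [tokens(a) for a in group]
    modal_len = Counter(len(r) for r in rows).most_common(1)[0][0]
    aligned = [r for r in rows if len(r) == modal_len]
    mapping: dict[str, str] = {}
    for pos in range(modal_len):
        # Numeric tokens (house numbers, postal codes) are not synonym candidates.
        variants = Counter(r[pos] for r in aligned if r[pos].isalpha())
        if len(variants) > 1:
            standard = variants.most_common(1)[0][0]
            for variant in variants:
                mapping[variant] = standard
    return mapping


def standardize_addresses(addresses: list[str], threshold: float = 0.6) -> list[str]:
    """Cluster, discover synonyms per cluster, and rewrite each address.
    Output is grouped by cluster, which may differ from input order.
    Limitation: the mapping is applied to every matching token in the
    cluster, regardless of position."""
    result: list[str] = []
    for group in cluster(addresses, threshold):
        mapping = find_synonyms(group)
        for addr in group:
            result.append(" ".join(mapping.get(t, t) for t in tokens(addr)))
    return result
```

Given `"12 Main St, Springfield, IL 62701"`, `"12 Main Street, Springfield, IL 62701"`, and `"14 Main Street, Springfield, IL 62701"`, this sketch places all three in one cluster, finds `st` and `street` at the same position, and rewrites `st` to `street` because it is the more frequent variant. Aligning by position only works when addresses in a cluster have the same number of tokens; a fuller treatment would need a proper token alignment.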
20 Citations
20 Claims
1. A method comprising:
- accessing a database having postal addresses stored therein on a computer readable storage medium;
- in response to accessing the database:
- clustering a plurality of the postal addresses based on similarity, and thereby forming at least one cluster of postal addresses; and
- within an identified cluster of postal addresses, identifying one or more synonyms relative to one or more components of postal addresses, wherein the one or more synonyms comprise variants of the one or more components; and
- with respect to one or more components of the postal addresses in the identified cluster of postal addresses, identifying a standardized identifier from among the one or more synonyms and applying the standardized identifier to postal addresses within the cluster of postal addresses.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A computer program product comprising:
- a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising:
- computer readable program code configured to access a database having postal addresses stored therein;
- computer readable program code configured to, in response to accessing the database:
- cluster a plurality of the postal addresses based on similarity, and thereby forming at least one cluster of postal addresses; and
- within an identified cluster of postal addresses, identify one or more synonyms relative to one or more components of postal addresses, wherein the one or more synonyms comprise variants of the one or more components; and
- computer readable program code configured to, with respect to one or more components of the postal addresses in the identified cluster of postal addresses, identify a standardized identifier from among the one or more synonyms and apply the standardized identifier to postal addresses within the cluster of postal addresses.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. An apparatus comprising:
- one or more processors; and
- a memory operatively connected to the one or more processors;
- wherein, responsive to execution of computer readable program code accessible to the one or more processors, the one or more processors are configured to:
- access a database having a plurality of postal addresses stored therein;
- in response to accessing the database:
- cluster a plurality of the postal addresses based on similarity, and thereby forming at least one cluster of postal addresses; and
- within an identified cluster of postal addresses, identify one or more synonyms relative to one or more components of postal addresses, wherein the one or more synonyms comprise variants of the one or more components; and
- computer readable program code configured to, with respect to one or more components of the postal addresses in the identified cluster of postal addresses, identify a standardized identifier from among the one or more synonyms and apply the standardized identifier to postal addresses within the cluster of postal addresses.
Dependent claims: 20
Specification