Method and system for finding similar records in mixed free-text and structured data
Abstract
A technique for data mining where the available data contains both structured and unstructured (free-text) data. The present invention combines the information available from different types of data to provide a single similarity score indicating the degree of similarity between records. A data evaluation application selects two records from a database and compares corresponding fields from the two records. The application determines whether to apply a nominal matching process, an ordinal matching process, or a vector-space matching process depending on the type of data in each pair of corresponding fields. The application sums the matching scores for all the fields in the records to compute the similarity score.
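The per-field scoring described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `boolean_match`, `ordinal_match`, `vector_match`, `similarity`, and `SCHEMA`, the field weights, and the sample records are all hypothetical. The source does not give formulas, so two assumptions are made here: the ordinal score is taken as one minus the normalized rank distance within the ordered domain, and the vector-space test is taken as cosine similarity over raw term-frequency vectors.

```python
import math
from collections import Counter


def boolean_match(a, b):
    """Exact match test: 1.0 when the values are equal, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def ordinal_match(a, b, domain):
    """Closeness of two values in an ordered domain (assumed formula).

    Uses the size and ordering of the domain: adjacent values score
    higher than values at opposite ends.
    """
    if len(domain) < 2:
        return boolean_match(a, b)
    distance = abs(domain.index(a) - domain.index(b))
    return 1.0 - distance / (len(domain) - 1)


def vector_match(text_a, text_b):
    """Cosine similarity of term-frequency vectors (assumed test)."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty free-text field matches nothing
    return dot / (norm_a * norm_b)


def similarity(rec_a, rec_b, schema):
    """Weighted sum of per-field match scores.

    schema maps field name -> (kind, weight, ordered domain or None).
    """
    total = 0.0
    for field, (kind, weight, domain) in schema.items():
        a, b = rec_a[field], rec_b[field]
        if kind == "boolean":
            score = boolean_match(a, b)
        elif kind == "ordinal":
            score = ordinal_match(a, b, domain)
        elif kind == "vector":
            score = vector_match(a, b)
        else:
            raise ValueError(f"unknown matching process: {kind}")
        total += weight * score
    return total


# Hypothetical schema: one field per matching process, weights chosen arbitrarily.
SCHEMA = {
    "product": ("boolean", 0.3, None),
    "severity": ("ordinal", 0.2, ["low", "medium", "high"]),
    "notes": ("vector", 0.5, None),
}

rec_a = {"product": "pump", "severity": "low", "notes": "seal leaking oil"}
rec_b = {"product": "pump", "severity": "high", "notes": "seal leaking oil"}
score = similarity(rec_a, rec_b, SCHEMA)
```

In this sketch the two sample records agree on the Boolean and free-text fields and sit at opposite ends of the ordinal domain, so only the ordinal field lowers the combined score. With weights that sum to one, the result stays between 0 and 1.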
39 Citations
16 Claims
1. A method for determining whether records are similar in a database containing both structured and unstructured, free-text data, the method comprising the steps of:
accessing two of the records from the database for evaluation; and
evaluating a match between the two records as a weighted match between each of a plurality of available fields, such that a matching process is selected as appropriate from among a group of matching processes including strict Boolean, ordinal, and vector-based matching processes, wherein:
when a strict Boolean matching process is selected, applying a match function as an exact match test;
when an ordinal matching process is selected, applying a match function that makes use of information concerning the size and ordering of the data domain; and
when a vector-based matching process is selected, applying a match function that uses a vector space frequency test.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A data processing system for determining whether records are similar in a database containing both structured and unstructured, free-text data, the data processing system comprising:
a communications interface for communicating with the database; and
a processor coupled to the communications interface, the processor hosting and executing a data evaluation application that is configured to:
access two of the records from the database for evaluation; and
evaluate a match between the two records as a weighted match between each of a plurality of available fields, such that a matching process is selected as appropriate from among a group of matching processes including strict Boolean, ordinal, and vector-based matching processes, wherein:
when a strict Boolean matching process is selected, apply a match function as an exact match test;
when an ordinal matching process is selected, apply a match function that makes use of information concerning the size and ordering of the data domain; and
when a vector-based matching process is selected, apply a match function that uses a vector space frequency test.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
Specification