Method and system for finding similar records in mixed free-text and structured data
First Claim
1. A method for determining whether records are similar in a database containing both structured and unstructured, free-text data, the method comprising the steps of:
- accessing two of the records from the database for evaluation;
- evaluating a match between the two records as a weighted match between each of a plurality of available fields, such that a matching process is selected as appropriate from among a group of matching processes including strict Boolean, ordinal, and vector-based matching processes, wherein:
  - when a strict Boolean matching process is selected, applying a match function as an exact match test,
  - when an ordinal matching process is selected, applying a match function that makes use of information concerning the size and ordering of the data domain, and
  - when a vector-based matching process is selected, applying a match function that uses a vector space frequency test; and
- calculating a similarity score between the two records, as follows:
sim(record_i, record_j) = w_1 * match(a_{1i}, a_{1j}) + w_2 * match(a_{2i}, a_{2j}) + ... + w_n * match(a_{ni}, a_{nj})
wherein sim is a similarity function that determines the similarity score for the two records, record_i is a first record of the two records and is identified in the database by an iterator i, record_j is a second record of the two records and is identified in the database by an iterator j, iterator n identifies a field position for a given field a_{ni} in the record_i and a corresponding field position for a given field a_{nj} in the record_j, match indicates the match function, and a symbol w_n indicates a predefined weight for each result of each match function.
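The weighted sum in the claim can be illustrated with a minimal sketch. This is not the patented implementation: the claim does not spell out the individual match functions, so the ordinal match is assumed here to fall off linearly with rank distance, and the vector-space match is assumed to be cosine similarity over raw term-frequency vectors. The function names, field names, weights, and sample records are all hypothetical.

```python
import math
from collections import Counter


def boolean_match(x, y):
    """Strict Boolean match: 1.0 on exact equality, otherwise 0.0."""
    return 1.0 if x == y else 0.0


def ordinal_match(x, y, domain):
    """Ordinal match using the size and ordering of the data domain.

    Assumption: the score decreases linearly with the rank distance.
    """
    if len(domain) < 2:
        return boolean_match(x, y)
    distance = abs(domain.index(x) - domain.index(y))
    return 1.0 - distance / (len(domain) - 1)


def vector_match(x, y):
    """Vector-space match, assumed to be cosine similarity of term counts."""
    tf_x = Counter(x.lower().split())
    tf_y = Counter(y.lower().split())
    dot = sum(tf_x[t] * tf_y[t] for t in tf_x)
    norm_x = math.sqrt(sum(v * v for v in tf_x.values()))
    norm_y = math.sqrt(sum(v * v for v in tf_y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def sim(record_i, record_j, fields):
    """Weighted sum of per-field match scores.

    fields: list of (field name, weight, match function) triples.
    """
    return sum(w * match(record_i[name], record_j[name])
               for name, w, match in fields)


# Hypothetical schema: one nominal, one ordinal, one free-text field.
SEVERITY = ["low", "medium", "high"]
FIELDS = [
    ("model", 0.3, boolean_match),
    ("severity", 0.2, lambda x, y: ordinal_match(x, y, SEVERITY)),
    ("notes", 0.5, vector_match),
]

rec_a = {"model": "X100", "severity": "low", "notes": "engine stalls at idle"}
rec_b = {"model": "X100", "severity": "high", "notes": "engine stalls when cold"}
score = sim(rec_a, rec_b, FIELDS)
```

For these two sample records the score is approximately 0.55: the exact model match contributes 0.3, the severity values sit at opposite ends of the domain and contribute 0, and the notes share two of four terms each, giving a cosine of 0.5 weighted by 0.5.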
Abstract
A technique for data mining in which the available data contains both structured and unstructured (free-text) data. The present invention combines the information available from different types of data to provide a single similarity score indicating the degree of similarity between records. Thus, a data evaluation application selects two records from a database and compares corresponding fields from the two records. The application determines whether to apply a nominal matching process, an ordinal matching process, or a vector-space matching process, depending on the type of data in each pair of corresponding fields. The application sums the matching scores for all the fields in the records to compute the similarity score.
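The selection step described in the abstract might be sketched as follows. The abstract says only that the choice depends on the type of data in each pair of fields; the specific rules below (a field with a declared ordered domain is ordinal, multi-word strings are treated as free text, everything else is nominal) are assumptions made for illustration, and `select_process` and `ORDINAL_DOMAINS` are hypothetical names.

```python
# Hypothetical registry of fields that have a known, ordered value domain.
ORDINAL_DOMAINS = {"severity": ["low", "medium", "high"]}


def select_process(field, value_i, value_j, ordinal_domains=ORDINAL_DOMAINS):
    """Pick a matching process for one pair of corresponding field values.

    Assumed rules, checked in order:
      1. field has a declared ordered domain  -> "ordinal"
      2. both values are multi-word strings   -> "vector-space"
      3. anything else                        -> "nominal"
    """
    if field in ordinal_domains:
        return "ordinal"
    both_text = isinstance(value_i, str) and isinstance(value_j, str)
    if both_text and len(value_i.split()) > 1 and len(value_j.split()) > 1:
        return "vector-space"
    return "nominal"


rec_a = {"model": "X100", "severity": "low", "notes": "engine stalls at idle"}
rec_b = {"model": "X100", "severity": "high", "notes": "engine stalls when cold"}

# Map each field to the process that would be applied to it.
plan = {f: select_process(f, rec_a[f], rec_b[f]) for f in rec_a}
```

In practice a schema that declares each field's type explicitly would likely be more reliable than inferring it from the values, but the inference keeps the sketch short.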
36 Citations
14 Claims
1. A method for determining whether records are similar in a database containing both structured and unstructured, free-text data, the method comprising the steps of:
- accessing two of the records from the database for evaluation;
- evaluating a match between the two records as a weighted match between each of a plurality of available fields, such that a matching process is selected as appropriate from among a group of matching processes including strict Boolean, ordinal, and vector-based matching processes, wherein:
  - when a strict Boolean matching process is selected, applying a match function as an exact match test,
  - when an ordinal matching process is selected, applying a match function that makes use of information concerning the size and ordering of the data domain, and
  - when a vector-based matching process is selected, applying a match function that uses a vector space frequency test; and
- calculating a similarity score between the two records, as follows:
sim(record_i, record_j) = w_1 * match(a_{1i}, a_{1j}) + w_2 * match(a_{2i}, a_{2j}) + ... + w_n * match(a_{ni}, a_{nj})
wherein sim is a similarity function that determines the similarity score for the two records, record_i is a first record of the two records and is identified in the database by an iterator i, record_j is a second record of the two records and is identified in the database by an iterator j, iterator n identifies a field position for a given field a_{ni} in the record_i and a corresponding field position for a given field a_{nj} in the record_j, match indicates the match function, and a symbol w_n indicates a predefined weight for each result of each match function.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A data processing system for determining whether records are similar in a database containing both structured and unstructured, free-text data, the data processing system comprising:
- a communications interface for communicating with the database; and
- a processor coupled to the communications interface, the processor hosting and executing a data evaluation application that is configured to:
  (a) access two of the records from the database for evaluation,
  (b) evaluate a match between the two records as a weighted match between each of a plurality of available fields, such that a matching process is selected as appropriate from among a group of matching processes including strict Boolean, ordinal, and vector-based matching processes, wherein:
    - when a strict Boolean matching process is selected, apply a match function as an exact match test,
    - when an ordinal matching process is selected, apply a match function that makes use of information concerning the size and ordering of the data domain, and
    - when a vector-based matching process is selected, apply a match function that uses a vector space frequency test; and
  (c) calculate a similarity score between the two records, as follows:
sim(record_i, record_j) = w_1 * match(a_{1i}, a_{1j}) + w_2 * match(a_{2i}, a_{2j}) + ... + w_n * match(a_{ni}, a_{nj})
wherein sim is a similarity function that determines the similarity score for the two records, record_i is a first record of the two records and is identified in the database by an iterator i, record_j is a second record of the two records and is identified in the database by an iterator j, iterator n identifies a field position for a given field a_{ni} in the record_i and a corresponding field position for a given field a_{nj} in the record_j, match indicates the match function, and a symbol w_n indicates a predefined weight for each result of each match function.
Dependent claims: 9, 10, 11, 12, 13, 14
Specification