Method and system for finding similar records in mixed free-text and structured data

US 7,076,485 B2
Filed: 03/06/2002
Issued: 07/11/2006
Est. Priority Date: 03/07/2001
Status: Expired due to Term

Abstract

A technique for data mining where the available data contains both structured and unstructured (free-text) data. The present invention combines the information available from different types of data to provide a single similarity score indicating the degree of similarity between records. A data evaluation application selects two records from a database and compares corresponding fields from the two records. The application determines whether to apply a nominal matching process, an ordinal matching process, or a vector-space matching process depending on the type of data in each pair of corresponding fields. The application sums the matching scores for all the fields in the records to compute the similarity score.
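
The procedure the abstract describes can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the field-kind labels, the `SCHEMA` layout, the example records and weights, and the specific ordinal and text scoring formulas are all invented for the example. The claims weight each field's match score before summing, so the sketch does the same.

```python
from collections import Counter
from math import sqrt

def nominal_match(a, b):
    # Strict Boolean test: 1.0 on exact equality, otherwise 0.0.
    return 1.0 if a == b else 0.0

def ordinal_match(a, b, domain):
    # Assumed formula: 1 minus the rank distance, scaled by the domain size.
    if len(domain) < 2:
        return 1.0
    gap = abs(domain.index(a) - domain.index(b))
    return 1.0 - gap / (len(domain) - 1)

def text_match(a, b):
    # Assumed formula: cosine similarity of raw term-frequency vectors.
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ta[t] * tb[t] for t in ta)
    norm = sqrt(sum(c * c for c in ta.values())) * sqrt(sum(c * c for c in tb.values()))
    return dot / norm if norm else 0.0

def similarity(rec_i, rec_j, schema):
    # schema maps field name -> (kind, weight, ordered domain or None).
    score = 0.0
    for field, (kind, weight, domain) in schema.items():
        a, b = rec_i[field], rec_j[field]
        if kind == "nominal":
            m = nominal_match(a, b)
        elif kind == "ordinal":
            m = ordinal_match(a, b, domain)
        elif kind == "text":
            m = text_match(a, b)
        else:
            raise ValueError(f"unknown field kind: {kind}")
        score += weight * m
    return score

# Hypothetical schema and records, for illustration only.
SCHEMA = {
    "category": ("nominal", 0.3, None),
    "severity": ("ordinal", 0.2, ["low", "medium", "high"]),
    "notes": ("text", 0.5, None),
}
r1 = {"category": "network", "severity": "low", "notes": "router dropped packets"}
r2 = {"category": "network", "severity": "high", "notes": "router dropped packets"}
print(similarity(r1, r2, SCHEMA))
```

In this example the two records agree on the nominal and text fields and sit at opposite ends of the ordinal domain, so only the ordinal field's weight is lost from the total.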

36 Citations

14 Claims

1. A method for determining whether records are similar in a database containing both structured and unstructured, free-text data, the method comprising the steps of:
- accessing two of the records from the database for evaluation;
- evaluating a match between the two records as a weighted match between each of a plurality of available fields, such that a matching process is selected as appropriate from among a group of matching processes including strict Boolean, ordinal, and vector-based matching processes, wherein:
  - when a strict Boolean matching process is selected, applying a match function as an exact match test,
  - when an ordinal matching process is selected, applying a match function that makes use of information concerning the size and ordering of the data domain, and
  - when a vector-based matching process is selected, applying a match function that uses a vector space frequency test; and
- calculating a similarity score between the two records, as follows:

  sim(record_i, record_j) = w_1*match(a_1i, a_1j) + w_2*match(a_2i, a_2j) + ... + w_n*match(a_ni, a_nj)

  wherein:
  - sim is a similarity function that determines the similarity score for the two records,
  - record_i is a first record of the two records and is identified in the database by an iterator i,
  - record_j is a second record of the two records and is identified in the database by an iterator j,
  - iterator n identifies a field position for a given field a_ni in the record_i and a corresponding field position for a given field a_nj in the record_j,
  - match indicates the match function, and
  - a symbol w_n indicates a predefined weight for each result of each match function.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1 wherein the step of evaluating a match between the two records comprises applying the matching process to determine a match score for two corresponding fields of the plurality of available fields, the two corresponding fields selected from corresponding locations in each of the two records.
  - 3. The method of claim 1 wherein the step of evaluating a match between the two records comprises selecting the matching process based on a common data type shared by both of two fields of the plurality of available fields accessed in the two records.
  - 4. The method of claim 3 wherein when a Boolean matching process is selected, the data type of both of the two fields specifies nominal data.
  - 5. The method of claim 3 wherein when an ordinal matching process is selected, the data type of both of the two fields specifies data capable of being ordered.
  - 6. The method of claim 3 wherein, when a vector-based matching process is selected, the data type of both of the two fields specifies text data.
  - 7. The method of claim 1 wherein the database is a relational database, the records are tuples, and the fields are attributes.
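
To make the similarity formula of claim 1 concrete, here is a worked instance with three fields. The weights (0.5, 0.2, 0.3) and the match scores (1, 0.5, 0.8) are invented for illustration and do not come from the patent.

```latex
\begin{aligned}
\mathrm{sim}(\mathrm{record}_i, \mathrm{record}_j)
  &= w_1\,\mathrm{match}(a_{1i}, a_{1j})
   + w_2\,\mathrm{match}(a_{2i}, a_{2j})
   + w_3\,\mathrm{match}(a_{3i}, a_{3j}) \\
  &= 0.5 \cdot 1 + 0.2 \cdot 0.5 + 0.3 \cdot 0.8 \\
  &= 0.5 + 0.1 + 0.24 = 0.84
\end{aligned}
```

The weights here happen to sum to 1, which keeps the score in [0, 1] when every match score is in [0, 1]. The claim text itself does not state that constraint.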

8. A data processing system for determining whether records are similar in a database containing both structured and unstructured, free-text data, the data processing system comprising:
- a communications interface for communicating with the database; and
- a processor coupled to the communications interface, the processor hosting and executing a data evaluation application that is configured to:
  - (a) access two of the records from the database for evaluation,
  - (b) evaluate a match between the two records as a weighted match between each of a plurality of available fields, such that a matching process is selected as appropriate from among a group of matching processes including strict Boolean, ordinal, and vector-based matching processes, wherein:
    - when a strict Boolean matching process is selected, apply a match function as an exact match test,
    - when an ordinal matching process is selected, apply a match function that makes use of information concerning the size and ordering of the data domain, and
    - when a vector-based matching process is selected, apply a match function that uses a vector space frequency test; and
  - (c) calculate a similarity score between the two records, as follows:

    sim(record_i, record_j) = w_1*match(a_1i, a_1j) + w_2*match(a_2i, a_2j) + ... + w_n*match(a_ni, a_nj)

    wherein:
    - sim is a similarity function that determines the similarity score for the two records,
    - record_i is a first record of the two records and is identified in the database by an iterator i,
    - record_j is a second record of the two records and is identified in the database by an iterator j,
    - iterator n identifies a field position for a given field a_ni in the record_i and a corresponding field position for a given field a_nj in the record_j,
    - match indicates the match function, and
    - a symbol w_n indicates a predefined weight for each result of each match function.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The data processing system of claim 8 wherein the data evaluation application is configured to apply the matching process to determine a match score for two corresponding fields of the plurality of available fields, the two corresponding fields selected from corresponding locations in each of the two records.
  - 10. The data processing system of claim 8 wherein the data evaluation application is further configured to select the matching process based on a common data type shared by both of two fields of the plurality of available fields accessed in the two records.
  - 11. The data processing system of claim 10 wherein when the data evaluation application selects a Boolean matching process, the data type of both of the two fields specifies nominal data.
  - 12. The data processing system of claim 10 wherein when the data evaluation application selects an ordinal matching process, the data type of both of the two fields specifies data capable of being ordered.
  - 13. The data processing system of claim 10 wherein, when the data evaluation application selects a vector-based matching process, the data type of both of the two fields specifies text data.
  - 14. The data processing system of claim 8 wherein the database is a relational database, the records are tuples, and the fields are attributes.
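
Claims 9 and 14 describe records as tuples in a relational database whose fields are compared position by position. A rough sketch of that access pattern, using Python's standard `sqlite3` module, might look like the following. The `incidents` table, its columns, the sample rows, and the weights are hypothetical, and only the exact-match process is shown for brevity.

```python
import sqlite3

# In-memory database with a hypothetical table of two nominal attributes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (id INTEGER PRIMARY KEY, category TEXT, site TEXT)")
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?, ?)",
    [(1, "network", "boston"), (2, "network", "denver")],
)

def fetch_pair(conn, id_i, id_j):
    # Each row comes back as a tuple whose positions are the selected attributes.
    query = "SELECT category, site FROM incidents WHERE id = ?"
    row_i = conn.execute(query, (id_i,)).fetchone()
    row_j = conn.execute(query, (id_j,)).fetchone()
    return row_i, row_j

def weighted_exact_similarity(row_i, row_j, weights):
    # Pair attributes by position and apply an exact-match test to each pair.
    return sum(w * (1.0 if a == b else 0.0) for w, a, b in zip(weights, row_i, row_j))

row_i, row_j = fetch_pair(conn, 1, 2)
score = weighted_exact_similarity(row_i, row_j, weights=(0.6, 0.4))
print(row_i, row_j, score)
```

Selecting the same columns in the same order for both rows is what guarantees that position n in one tuple corresponds to position n in the other.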

Current Assignee: Mitre Corporation
Original Assignee: Mitre Corporation
Inventors: Bloedorn, Eric
Primary Examiner(s): Wassum, Luke S
Assistant Examiner(s): Black, Linh
Application Number: US 10/091,932
Publication Number: US 20020152208A1
Time in Patent Office: 1,588 Days
Field of Search: 707/1-104.1, 707/200-205, 370/229, 370/235, 370/238, 370/351, 370/389, 370/392, 704/1-4, 704/9, 704/25-26, 706/45, 709/200, 709/217-218, 709/223-224
US Class Current: 1/1
CPC Class Codes:
- G06F 16/2465   Query processing support fo...
- G06F 16/3346   using probabilistic model
- Y10S 707/99931   Database or file accessing
- Y10S 707/99936   Pattern matching access
- Y10S 707/99945   Object-oriented database st...
