Method and system for finding similar records in mixed free-text and structured data

US 20020152208A1
Filed: 03/06/2002
Published: 10/17/2002
Est. Priority Date: 03/07/2001
Status: Active Grant


Abstract

A technique for data mining where the available data contains both structured and unstructured (free-text) data. The present invention combines the information available from different types of data to provide a single similarity score indicating the degree of similarity between records. Thus, a data evaluation application selects two records from a database and compares corresponding fields from the two records. The application determines whether to apply a nominal matching process, an ordinal matching process, or a vector-space matching process depending on the type of data in each pair of corresponding fields. The application sums the matching scores for all the fields in the records to compute the similarity score.
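The type-driven selection described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the schema, the field names, the normalized-rank form of the ordinal match, and the use of cosine similarity over raw term counts as the vector-space match.

```python
import math
from collections import Counter


def match_nominal(a, b):
    """Strict Boolean test: 1.0 on exact equality, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def match_ordinal(a, b, domain):
    """Closeness of two values in an ordered domain (assumed form).

    Uses the size and ordering of the domain: adjacent values score
    higher than values at opposite ends.
    """
    if len(domain) < 2:
        return 1.0
    distance = abs(domain.index(a) - domain.index(b))
    return 1.0 - distance / (len(domain) - 1)


def match_text(a, b):
    """Cosine similarity of raw term-frequency vectors (assumed form)."""
    tf_a, tf_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical schema: field name -> (data type, ordered domain or None).
SCHEMA = {
    "state": ("nominal", None),
    "severity": ("ordinal", ["low", "medium", "high"]),
    "notes": ("text", None),
}


def similarity(rec_i, rec_j, schema=SCHEMA):
    """Sum the per-field match scores, choosing the matcher by data type."""
    total = 0.0
    for field, (kind, domain) in schema.items():
        a, b = rec_i[field], rec_j[field]
        if kind == "nominal":
            total += match_nominal(a, b)
        elif kind == "ordinal":
            total += match_ordinal(a, b, domain)
        elif kind == "text":
            total += match_text(a, b)
        else:
            raise ValueError(f"unknown data type: {kind}")
    return total


r1 = {"state": "VA", "severity": "low", "notes": "engine failure on takeoff"}
r2 = {"state": "VA", "severity": "high", "notes": "engine failure on takeoff"}
score = similarity(r1, r2)  # nominal 1.0 + ordinal 0.0 + text 1.0
```

In this sketch the two records agree on the nominal and text fields but sit at opposite ends of the ordinal domain, so only two of the three fields contribute to the score.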

16 Claims

1. A method for determining whether records are similar in a database containing both structured and unstructured, free-text data, the method comprising the steps of:
- accessing two of the records from the database for evaluation; and
  
  evaluating a match between the two records as a weighted match between each of a plurality of available fields, such that a matching process is selected as appropriate from among a group of matching processes including strict Boolean, ordinal, and vector-based matching processes, wherein:
  
  when a strict Boolean matching process is selected, applying a match function as an exact match test;
  
  when an ordinal matching process is selected, applying a match function that makes use of information concerning the size and ordering of the data domain; and
  
  when a vector-based matching process is selected, applying a match function that uses a vector space frequency test.
- - 2. The method of claim 1 wherein the step of evaluating a match between the two records comprises applying the matching process to determine a match score for two corresponding fields of the plurality of available fields, the two corresponding fields selected from corresponding locations in each of the two records.
  - 3. The method of claim 1 wherein the step of evaluating a match between the two records comprises selecting the matching process based on a common data type shared by both of two fields of the plurality of available fields accessed in the two records.
  - 4. The method of claim 3 wherein when a Boolean matching process is selected, the data type of both of the two fields specifies nominal data.
  - 5. The method of claim 3 wherein when an ordinal matching process is selected, the data type of both of the two fields specifies data capable of being ordered.
  - 6. The method of claim 3 wherein, when a vector-based matching process is selected, the data type of both of the two fields specifies text data.
  - 7. The method of claim 1 wherein the step of evaluating the match between the two records comprises calculating a similarity score between the two records, as follows:
    - $$\mathrm{sim}(\mathrm{record}_i, \mathrm{record}_j) = w_1 \cdot \mathrm{match}(a_{1i}, a_{1j}) + w_2 \cdot \mathrm{match}(a_{2i}, a_{2j}) + \cdots + w_n \cdot \mathrm{match}(a_{ni}, a_{nj})$$
      
      wherein sim is a similarity function that determines the similarity score for the two records;
      
      $\mathrm{record}_i$ is a first record of the two records and is identified in the database by an iterator $i$;
      
      $\mathrm{record}_j$ is a second record of the two records and is identified in the database by an iterator $j$;
      
      iterator $n$ identifies a field position for a given field $a_{ni}$ in the $\mathrm{record}_i$ and a corresponding field position for a given field $a_{nj}$ in the $\mathrm{record}_j$;
      
      match indicates the match function; and
      
      a symbol $w_n$ indicates a predefined weight for each result of each match function.
  - 8. The method of claim 1 wherein the database is a relational database, the records are tuples, and the fields are attributes.
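The weighted sum recited in claim 7 can be sketched as a small Python function. This is an illustrative reading only: the function signature, the example weights, the records, and the two sample match functions are hypothetical and do not come from the patent.

```python
from typing import Callable, Sequence

# A match function takes two field values and returns a score.
MatchFn = Callable[[object, object], float]


def sim(record_i: Sequence, record_j: Sequence,
        weights: Sequence[float], match_fns: Sequence[MatchFn]) -> float:
    """Return w_1*match(a_1i, a_1j) + ... + w_n*match(a_ni, a_nj)."""
    if not (len(record_i) == len(record_j) == len(weights) == len(match_fns)):
        raise ValueError("records, weights and match functions must align")
    return sum(
        w * match(a_i, a_j)
        for w, match, a_i, a_j in zip(weights, match_fns, record_i, record_j)
    )


def exact(a, b):
    """Strict Boolean match: exact equality test."""
    return 1.0 if a == b else 0.0


def closeness_1_to_5(a, b):
    """Assumed ordinal match on a 1..5 rating scale (domain size 5)."""
    return 1.0 - abs(a - b) / 4


# Hypothetical records with three fields: body style, rating, colour.
weights = [0.5, 0.3, 0.2]
match_fns = [exact, closeness_1_to_5, exact]
record_i = ("sedan", 4, "blue")
record_j = ("sedan", 2, "red")

# 0.5 * 1.0 + 0.3 * 0.5 + 0.2 * 0.0, approximately 0.65
score = sim(record_i, record_j, weights, match_fns)
```

Keeping the weights and match functions in position-aligned lists mirrors the claim's field-position index n; a mapping keyed by field name would work equally well.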

9. A data processing system for determining whether records are similar in a database containing both structured and unstructured, free-text data, the data processing system comprising:
- a communications interface for communicating with the database; and
  
  a processor coupled to the communications interface, the processor hosting and executing a data evaluation application that is configured to:
  
  access two of the records from the database for evaluation; and
  
  evaluate a match between the two records as a weighted match between each of a plurality of available fields, such that a matching process is selected as appropriate from among a group of matching processes including strict Boolean, ordinal, and vector-based matching processes, wherein:
  
  when a strict Boolean matching process is selected, apply a match function as an exact match test;
  
  when an ordinal matching process is selected, apply a match function that makes use of information concerning the size and ordering of the data domain; and
  
  when a vector-based matching process is selected, apply a match function that uses a vector space frequency test.
- - 10. The data processing system of claim 9 wherein the data evaluation application is configured to apply the matching process to determine a match score for two corresponding fields of the plurality of available fields, the two corresponding fields selected from corresponding locations in each of the two records.
  - 11. The data processing system of claim 9 wherein the data evaluation application is configured to select the matching process based on a common data type shared by both of two fields of the plurality of available fields accessed in the two records.
  - 12. The data processing system of claim 11 wherein when the data evaluation application selects a Boolean matching process, the data type of both of the two fields specifies nominal data.
  - 13. The data processing system of claim 11 wherein when the data evaluation application selects an ordinal matching process, the data type of both of the two fields specifies data capable of being ordered.
  - 14. The data processing system of claim 11 wherein, when the data evaluation application selects a vector-based matching process, the data type of both of the two fields specifies text data.
  - 15. The data processing system of claim 9 wherein the data evaluation application is configured to calculate a similarity score between the two records, as follows:
    - $$\mathrm{sim}(\mathrm{record}_i, \mathrm{record}_j) = w_1 \cdot \mathrm{match}(a_{1i}, a_{1j}) + w_2 \cdot \mathrm{match}(a_{2i}, a_{2j}) + \cdots + w_n \cdot \mathrm{match}(a_{ni}, a_{nj})$$
      
      wherein sim is a similarity function that determines the similarity score for the two records;
      
      $\mathrm{record}_i$ is a first record of the two records and is identified in the database by an iterator $i$;
      
      $\mathrm{record}_j$ is a second record of the two records and is identified in the database by an iterator $j$;
      
      iterator $n$ identifies a field position for a given field $a_{ni}$ in the $\mathrm{record}_i$ and a corresponding field position for a given field $a_{nj}$ in the $\mathrm{record}_j$;
      
      match indicates the match function; and
      
      a symbol $w_n$ indicates a predefined weight for each result of each match function.
  - 16. The data processing system of claim 9 wherein the database is a relational database, the records are tuples, and the fields are attributes.
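As a rough illustration of the relational setting in claims 9, 10 and 16, the following Python sketch fetches two tuples from an in-memory SQLite database and pairs up their corresponding attributes, ready to be handed to per-field match functions. The table name, column names, and sample rows are hypothetical, and SQLite stands in for whatever database the claimed system would communicate with.

```python
import sqlite3

# In-memory database standing in for the claimed relational database.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access to attributes by name
conn.execute(
    "CREATE TABLE reports ("
    "id INTEGER PRIMARY KEY, category TEXT, priority INTEGER, summary TEXT)"
)
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?, ?)",
    [
        (1, "hardware", 3, "disk failure"),
        (2, "hardware", 1, "fan noise"),
    ],
)


def fetch_pair(conn, id_i, id_j):
    """Access two records (tuples) from the database for evaluation."""
    cursor = conn.execute(
        "SELECT * FROM reports WHERE id IN (?, ?)", (id_i, id_j)
    )
    rows = {row["id"]: row for row in cursor}
    return rows[id_i], rows[id_j]


tuple_i, tuple_j = fetch_pair(conn, 1, 2)

# Pair up fields taken from corresponding locations in the two tuples,
# skipping the key column, which identifies a record rather than describing it.
attributes = [name for name in tuple_i.keys() if name != "id"]
pairs = [(name, tuple_i[name], tuple_j[name]) for name in attributes]
```

Each entry in `pairs` holds an attribute name and the two values to compare, which is the input a type-specific match function would need.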

Current Assignee: Mitre Corporation
Original Assignee: Mitre Corporation
Inventors: Bloedorn, Eric

Granted Patent: US 7,076,485 B2
US Class Current: 707/6
CPC Class Codes

G06F 16/2465   Query processing support fo...

G06F 16/3346   using probabilistic model

Y10S 707/99931   Database or file accessing

Y10S 707/99936   Pattern matching access

Y10S 707/99945   Object-oriented database st...
