Method and system for finding similar records in mixed free-text and structured data

US 7,440,946 B2
Filed: 01/13/2006
Issued: 10/21/2008
Est. Priority Date: 03/07/2001
Status: Expired due to Term

Abstract

A technique for data mining where the available data contains both structured and unstructured (free-text) data. The present invention combines the information available from the different types of data into a single similarity score indicating the degree of similarity between records. A data evaluation application selects two records from a database and compares corresponding fields from the two records. For each pair of corresponding fields, the application determines whether to apply a nominal matching process, an ordinal matching process, or a vector-space matching process, depending on the type of data in the fields. The application then sums the matching scores for all the fields in the records to compute the similarity score.

29 Citations

11 Claims

1. A method for determining whether records are similar in a database, the method comprising:
- (a) selecting two records in the database;
  
  (b) accessing two corresponding fields of the selected records;
  
  (c) determining a type of data in the accessed fields;
  
  (d) applying a match function to the accessed corresponding fields to generate a match score based on the type of data in the accessed fields, wherein, if the type of data in the fields is nominal, the match function applied is a Boolean match function, if the type of data in the fields is ordinal, the match function applied is an ordinal match function, and if the type of data in the fields is unstructured data, the match function applied is a vector-based match function;
  
  (e) repeating steps b through d for one or more additional corresponding fields of the selected records to generate one or more additional match scores; and
  
  (f) generating a similarity score that indicates a degree of similarity between the two records from the match scores, wherein the similarity score is generated as follows:
  
  $similarity\_score = w_1 * match(a_{1i}, a_{1j}) + w_2 * match(a_{2i}, a_{2j}) + \ldots + w_n * match(a_{ni}, a_{nj})$
  
  wherein, similarity_score is the similarity score, i identifies a first selected record in the database, j identifies a second selected record in the database, n identifies a field position for a given field a_ni in the first selected record and a corresponding field position for a given field a_nj in the second selected record, match indicates the match function used to generate the match scores, and w_n indicates a predefined weight for each match score.
- View Dependent Claims (2)
- - 2. A method as defined in claim 1 wherein the match score is generated by the Boolean match function as follows:
    - match_score = 1 if a_ni equals a_nj, else function_result = 0, wherein, match_score indicates the match score generated by the Boolean match function, i identifies a first selected record in the database, j identifies a second selected record in the database, and n identifies a field position for a given field a_ni in the first selected record and a given field a_nj in the second selected record.
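
Steps (b) through (f) of claim 1, together with the Boolean match of claim 2, can be sketched in Python as follows. This is only an illustrative sketch: the type labels, the field names, the weights, the `schema` list that supplies each field's data type for step (c), the use of an absolute difference and a domain size of 5 in the ordinal matcher, and the raw term-frequency weighting in the vector-based matcher are all assumptions that do not come from the patent text.

```python
from collections import Counter
from math import sqrt

# Hypothetical type labels; the claim names three data types but not an encoding.
NOMINAL, ORDINAL, TEXT = "nominal", "ordinal", "unstructured"

def boolean_match(a, b):
    """Claim 2: score 1 when the two field values are equal, otherwise 0."""
    return 1.0 if a == b else 0.0

def ordinal_match(a, b, domain_size=5):
    """Ordinal score; abs() and the default domain size are assumptions."""
    return 1.0 - abs(a - b) / domain_size

def cosine_match(a, b):
    """Vector-based score: cosine of raw term-frequency vectors (assumed weighting)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Step (d): the match function is chosen by the data type of the field.
MATCHERS = {NOMINAL: boolean_match, ORDINAL: ordinal_match, TEXT: cosine_match}

def similarity_score(rec_i, rec_j, schema):
    """Weighted sum of per-field match scores, steps (b) through (f)."""
    total = 0.0
    for field, data_type, weight in schema:
        match = MATCHERS[data_type]
        total += weight * match(rec_i[field], rec_j[field])
    return total

# Hypothetical schema: (field name, data type, predefined weight w_n).
schema = [("state", NOMINAL, 0.2), ("severity", ORDINAL, 0.3), ("notes", TEXT, 0.5)]
rec_i = {"state": "VA", "severity": 2, "notes": "engine fire on takeoff"}
rec_j = {"state": "VA", "severity": 4, "notes": "engine fire during landing"}

# 0.2 * 1.0 + 0.3 * 0.6 + 0.5 * 0.5
print(round(similarity_score(rec_i, rec_j, schema), 2))
```

With these made-up records the three match scores are 1.0, 0.6 and 0.5, so the weighted sum is 0.63. Keeping the matchers in a lookup table keyed by data type is one way to keep the type decision in a single place, not something the claim requires.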

3. A method for determining whether records are similar in a database, the method comprising:
- (a) selecting two records in the database;
  
  (b) accessing two corresponding fields of the selected records;
  
  (c) determining a type of data in the accessed fields;
  
  (d) applying a match function to the accessed corresponding fields to generate a match score based on the type of data in the accessed fields, wherein the match score is generated by the ordinal match function as follows:
  
  $match\_score = 1 - \frac{a_{ni} - a_{nj}}{|Domain\ a_n|}$
  
  wherein, match_score indicates the match score generated by the ordinal match function, i identifies a first selected record in the database, j identifies a second selected record in the database, n identifies a field position for a given field a_ni in the first selected record and a given field a_nj in the second selected record, and |Domain a_n| is the size of the data domain of the corresponding fields; and
  
  wherein, if the type of data in the fields is nominal, the match function applied is a Boolean match function, if the type of data in the fields is ordinal, the match function applied is an ordinal match function, and if the type of data in the fields is unstructured data, the match function applied is a vector-based match function.
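
The ordinal match function of claim 3 might be sketched as below. Two points are assumptions rather than claim language: the sketch takes the absolute value of the difference so that the score does not depend on the order of the two records, and it leaves the "size of the data domain" as a caller-supplied number because the claim does not say whether that means the count of possible values or their range. The `SEVERITY_LEVELS` list and the label-to-rank helper are hypothetical.

```python
# Hypothetical ordered domain; a label's position in the list is its ordinal value.
SEVERITY_LEVELS = ["low", "medium", "high", "critical"]

def ordinal_match(a_i, a_j, domain_size):
    """Return 1 - |a_i - a_j| / domain_size (abs() is an assumption)."""
    if domain_size <= 0:
        raise ValueError("domain_size must be positive")
    return 1.0 - abs(a_i - a_j) / domain_size

def ordinal_match_labels(label_i, label_j, levels):
    """Map ordered labels to ranks, then score them with ordinal_match."""
    return ordinal_match(levels.index(label_i), levels.index(label_j), len(levels))

# Numeric field with an assumed domain size of 5: 1 - 2/5
print(ordinal_match(2, 4, 5))
# Labelled field: ranks 0 and 2 in a domain of 4 values, 1 - 2/4
print(ordinal_match_labels("low", "high", SEVERITY_LEVELS))
```

Identical values score 1, and the score falls linearly as the values move apart within the domain.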

4. A method for determining whether records are similar in a database, the method comprising:
- (a) selecting two records in the database;
  
  (b) accessing two corresponding fields of the selected records;
  
  (c) determining a type of data in the accessed fields;
  
  (d) applying a match function to the accessed corresponding fields to generate a match score based on the type of data in the accessed fields, wherein the match score is generated by the vector-based match function as follows:
  
  $match\_score = \sum_{x=1}^{V} \frac{{weight}_{nix} * {weight}_{njx}}{\sqrt{({weight}_{nix})^2 * ({weight}_{njx})^2}}$
  
  wherein, match_score indicates the match score generated by the vector-based match function, V is a size of a vocabulary associated with the data in the corresponding fields, i identifies a first selected record in the database, j identifies a second selected record in the database, n identifies an accessed field in the selected records, x identifies a word within an accessed field, weight_nix is a weight of word x in field n of the first record, and weight_njx is a weight of word x in field n of the second record; and
  
  wherein, if the type of data in the fields is nominal, the match function applied is a Boolean match function, if the type of data in the fields is ordinal, the match function applied is an ordinal match function, and if the type of data in the fields is unstructured data, the match function applied is a vector-based match function.
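
A vector-based match over two free-text fields could look like the sketch below. It is an assumption-laden illustration, not the claimed formula verbatim: the claim prints the square root over a single product term, whereas this sketch uses the conventional vector-space cosine measure, in which the squared weights are summed over the whole vocabulary before the square root is taken. The claim also does not say how word weights are obtained, so the `term_weights` helper (raw term frequency over lowercase whitespace tokens) is hypothetical.

```python
from collections import Counter
from math import sqrt

def term_weights(text):
    """Hypothetical weighting: raw frequency of lowercase whitespace tokens."""
    return dict(Counter(text.lower().split()))

def vector_match(w_i, w_j):
    """Cosine of two word-weight vectors given as {word: weight} dicts."""
    vocabulary = set(w_i) | set(w_j)  # the V words seen in either field
    dot = sum(w_i.get(x, 0.0) * w_j.get(x, 0.0) for x in vocabulary)
    norm_i = sqrt(sum(v * v for v in w_i.values()))
    norm_j = sqrt(sum(v * v for v in w_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # an empty field shares nothing with any other field
    return dot / (norm_i * norm_j)

w_i = term_weights("engine fire on takeoff")
w_j = term_weights("engine fire during landing")

# Two shared words out of four in each field: 2 / (2 * 2)
print(vector_match(w_i, w_j))
```

The score is 0 for fields with no words in common and approaches 1 as the two weight vectors point in the same direction, which puts it on the same 0 to 1 scale as the Boolean and ordinal scores.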

5. A data processing system comprising:
- a database having a plurality of records;
  
  a processor capable of accessing records in the database, the processor configured to:
  
  (a) select two records in the database;
  
  (b) access two corresponding fields of the selected records;
  
  (c) determine a type of data in the accessed fields;
  
  (d) apply a match function to the accessed corresponding fields to generate a match score based on the type of data in the accessed fields, wherein, if the type of data in the fields is nominal, the match function applied is a Boolean match function, if the type of data in the fields is ordinal, the match function applied is an ordinal match function, and if the type of data in the fields is unstructured data, the match function applied is a vector-based match function;
  
  (e) repeat steps b through d for additional corresponding fields of the selected records to generate one or more additional match scores; and
  
  (f) generate a similarity score that indicates a degree of similarity between the two records from the match scores, wherein the similarity score is generated as follows:
  
  $similarity\_score = w_1 * match(a_{1i}, a_{1j}) + w_2 * match(a_{2i}, a_{2j}) + \ldots + w_n * match(a_{ni}, a_{nj})$
  
  wherein, similarity_score is the similarity score, i identifies a first selected record in the database, j identifies a second selected record in the database, n identifies a field position for a given field a_ni in the first selected record and a corresponding field position for a given field a_nj in the second selected record, match indicates the match function used to generate the match scores, and w_n indicates a predefined weight for each match score.
- View Dependent Claims (6, 9, 10, 11)
- - 6. A data processing system as defined in claim 5 wherein the match score is generated by the Boolean match function as follows:
    - match_score = 1 if a_ni equals a_nj, else function_result = 0, wherein, match_score indicates the match score generated by the Boolean function, i identifies a first selected record in the database, j identifies a second selected record in the database, and n identifies a field position for a given field a_ni in the first selected record and a given field a_nj in the second selected record.
  - 9. A data processing system as defined in claim 5 further comprising:
    - a communications interface configured to manage communications between the processor and the database.
  - 10. A data processing system as defined in claim 9 wherein the communications interface is a computer bus.
  - 11. A data processing system as defined in claim 9 wherein the communications interface is a network interface that provides access to the database over a data network.
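
The system claims place the same steps in a processor that selects two records from a database holding a plurality of records. One way to picture step (a) applied across such a database is to score every unordered pair, as in the sketch below. The in-memory list standing in for the database, the `score_all_pairs` helper, and the stand-in similarity function (a Boolean match on every field with equal weights of 1/n) are all hypothetical and chosen only to keep the example short.

```python
from itertools import combinations

def equal_weight_boolean(rec_i, rec_j):
    """Stand-in similarity: Boolean match on every field, each weighted 1/n."""
    fields = list(rec_i)
    return sum(1.0 for f in fields if rec_i[f] == rec_j[f]) / len(fields)

def score_all_pairs(records, similarity):
    """Return (score, i, j) for every unordered pair, most similar first."""
    scored = [
        (similarity(records[i], records[j]), i, j)
        for i, j in combinations(range(len(records)), 2)
    ]
    # sorted() is stable, so equally scored pairs keep their index order.
    return sorted(scored, key=lambda t: t[0], reverse=True)

# Hypothetical three-record database with two nominal fields.
records = [
    {"state": "VA", "type": "fire"},
    {"state": "VA", "type": "fire"},
    {"state": "MD", "type": "fire"},
]

for score, i, j in score_all_pairs(records, equal_weight_boolean):
    print(i, j, score)
```

Scoring all pairs grows quadratically with the number of records, so this form is only practical for small collections; any similarity function with the same two-record signature can be passed in.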

7. A data processing system comprising:
- a database having a plurality of records;
  
  a processor capable of accessing records in the database, the processor configured to:
  
  (a) select two records in the database;
  
  (b) access two corresponding fields of the selected records;
  
  (c) determine a type of data in the accessed fields;
  
  (d) apply a match function to the accessed corresponding fields to generate a match score based on the type of data in the accessed fields, wherein the match score is generated by the ordinal match function as follows:
  
  $match\_score = 1 - \frac{a_{ni} - a_{nj}}{|Domain\ a_n|}$
  
  wherein, match_score indicates the match score generated by the ordinal match function, i identifies a first selected record in the database, j identifies a second selected record in the database, n identifies a field position for a given field a_ni in the first selected record and a given field a_nj in the second selected record, and |Domain a_n| is the size of the data domain of the corresponding fields; and
  
  wherein, if the type of data in the fields is nominal, the match function applied is a Boolean match function, if the type of data in the fields is ordinal, the match function applied is an ordinal match function, and if the type of data in the fields is unstructured data, the match function applied is a vector-based match function.

8. A data processing system comprising:
- a database having a plurality of records;
  
  a processor capable of accessing records in the database, the processor configured to:
  
  (a) select two records in the database;
  
  (b) access two corresponding fields of the selected records;
  
  (c) determine a type of data in the accessed fields;
  
  (d) apply a match function to the accessed corresponding fields to generate a match score based on the type of data in the accessed fields, wherein the match score is generated by the vector-based match function as follows:
  
  $match\_score = \sum_{x=1}^{V} \frac{{weight}_{nix} * {weight}_{njx}}{\sqrt{({weight}_{nix})^2 * ({weight}_{njx})^2}}$
  
  wherein, match_score indicates the match score generated by the vector-based match function, V is a size of a vocabulary associated with the data in the corresponding fields, i identifies a first selected record in the database, j identifies a second selected record in the database, n identifies an accessed field in the selected records, x identifies a word within an accessed field, weight_nix is a weight of word x in field n of the first record, and weight_njx is a weight of word x in field n of the second record; and
  
  wherein, if the type of data in the fields is nominal, the match function applied is a Boolean match function, if the type of data in the fields is ordinal, the match function applied is an ordinal match function, and if the type of data in the fields is unstructured data, the match function applied is a vector-based match function.

Current Assignee: Mitre Corporation
Original Assignee: Mitre Corporation
Inventors: Bloedorn, Eric
Primary Examiner(s): Lee, Wilson
Assistant Examiner(s): Black, Linh
Application Number: US11/331,934
Publication Number: US 20060116995A1
Time in Patent Office: 1,012 Days
Field of Search: 707/1, 707/6, 707/104.1
US Class Current: 1/1
CPC Class Codes:
G06F 16/2465   Query processing support fo...
G06F 16/3346   using probabilistic model
Y10S 707/99931   Database or file accessing
Y10S 707/99936   Pattern matching access
Y10S 707/99945   Object-oriented database st...
