String comparison results for character strings using frequency data

US 9,747,273 B2
Filed: 08/19/2014
Issued: 08/29/2017
Est. Priority Date: 08/19/2014
Status: Active Grant

Abstract

A similarity between character strings is assessed by identifying first and second character strings as candidate similar character strings, determining a frequency of occurrence for at least one of the first and second character strings from a collection of character strings, and designating the first and second character strings as similar based on the determined frequency of occurrence.
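
The flow summarized above can be sketched in Python, filled in with the edit-distance score and the low and high frequency thresholds recited in claim 1. This is only an illustration: the function names, the 0-to-1 score normalization, the use of the more frequent string's count, and every threshold and adjustment value are assumptions, not details taken from the patent.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def initial_score(a: str, b: str) -> float:
    """Map edit distance onto a 0..1 similarity (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def assess(a: str, b: str, counts: Counter, *,
           similarity_threshold: float = 0.75,  # hypothetical value
           low_freq: int = 2,                   # hypothetical value
           high_freq: int = 50,                 # hypothetical value
           boost: float = 0.1,                  # hypothetical adjustment
           penalty: float = 0.1):               # hypothetical adjustment
    """Return (final_score, designated_similar) for one pair of strings."""
    score = initial_score(a, b)
    if score < similarity_threshold:
        return score, False  # not a candidate pair, no frequency lookup
    # Assumption: use the count of whichever string is more frequent.
    freq = max(counts[a], counts[b])
    if freq <= low_freq:
        score = min(1.0, score + boost)    # rare strings: raise the score
    elif freq > high_freq:
        score = max(0.0, score - penalty)  # common strings: lower the score
    return score, score >= similarity_threshold
```

With these assumed values, a pair of very common strings that differ by one character can fall below the threshold after the penalty, while an equally close pair of rare strings stays designated as similar.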

9 Claims

1. A system for assessing similarity between character strings, the system comprising:
- a data collection to store a collection of character strings; and
  
  a server to access the data collection, the server comprising a processor configured with logic to:
  
  calculate an initial similarity score for a first character string and a second character string based on an edit distance algorithm;
  
  identify the first character string and the second character string as candidate similar character strings from the data collection based on the calculated initial similarity score being greater than or equal to a similarity threshold value;
  
  determine, when the first character string and the second character string are identified as similar character strings, a frequency of occurrence for at least one of the first character string and the second character string from the collection of character strings, wherein the frequency of occurrence comprises a total number of times that at least one of the first character string and the second character string is present in the collection of character strings; and
  
  decrease an occurrence of false designations of character strings as being similar, the decreasing further comprising:
  
  adjusting the initial similarity score to a greater value as a final similarity score when the determined frequency of occurrence is no greater than a low frequency threshold value,
  
  adjusting the initial similarity score to a lower value as the final similarity score when the frequency of occurrence is greater than a high frequency threshold value, and
  
  designating the first character string and the second character string as similar based on the final similarity score being greater than or equal to the similarity threshold value.
- - 2. The system of claim 1, wherein the first character string and the second character string stored within the data collection include a name.
  - 3. The system of claim 1, wherein the collection of character strings is partitioned based on one or more attributes of the character strings, and the processor is configured to determine the frequency of occurrence by:
    - identifying a specific partition of the collection of character strings based on an attribute of the first character string and the second character string; and
      
      determining the frequency of occurrence of at least one of the first character string and the second character string from the specific partition of the collection of character strings.
  - 4. The system of claim 1, wherein the processor is further configured to:
    - normalize the determined frequency of occurrence of at least one of the first character string and the second character string.
  - 5. The system of claim 4, wherein the processor is configured to normalize the determined frequency of occurrence of each of the first character string and the second character string by:
    - identifying a first group of character strings and a second group of character strings within the collection of character strings, wherein the first group includes a greater number of character strings in relation to the second group; and
      
      modifying the determined frequency of occurrence of the second character string to a greater value.
  - 8. The system of claim 1, wherein the collection of character strings stores a collection of people names such that the first and second character strings represent names of people.
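
Claims 4 and 5 describe normalizing a frequency by raising the count that comes from the smaller of two groups. One possible reading, sketched below, scales that count by the ratio of the group sizes so that counts from groups of different sizes become comparable. The function name, its parameters, and the proportional scaling rule are assumptions made for illustration; the claims do not specify a formula.

```python
def normalize_frequency(raw_count: int, group_size: int, reference_size: int) -> float:
    """Scale a count taken from a smaller group up to a larger reference group.

    raw_count      -- occurrences of the string within its own group
    group_size     -- number of strings in that (smaller) group
    reference_size -- number of strings in the larger group
    Proportional scaling is an assumed rule, not one stated in the claims.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if group_size >= reference_size:
        return float(raw_count)  # already the larger group: leave unchanged
    return raw_count * reference_size / group_size
```

Under this reading, a string seen 5 times in a group of 1,000 strings would be treated like one seen 500 times in a group of 100,000.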

6. A computer program product for assessing similarity between character strings, the computer program product comprising:
- a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code configured to:
  
  calculate an initial similarity score for a first character string and a second character string based on an edit distance algorithm;
  
  identify the first character string and the second character string as candidate similar character strings from a collection of character strings based on the calculated initial similarity score being greater than or equal to a similarity threshold value;
  
  determine, when the first character string and the second character string are identified as similar character strings, a frequency of occurrence for at least one of the first character string and the second character string from the collection of character strings, wherein the frequency of occurrence comprises a total number of times that at least one of the first character string and the second character string is present in the collection of character strings;
  
  decrease an occurrence of false designations of character strings as being similar, the decreasing further comprising:
  
  adjusting the initial similarity score to a greater value as a final similarity score only when the determined frequency of occurrence is no greater than a low frequency threshold,
  
  adjusting the initial similarity score to a lower value as the final similarity score when the frequency of occurrence is greater than a high frequency threshold value, and
  
  designating the first character string and the second character string as similar based on the final similarity score being greater than or equal to the similarity threshold value.
- - 7. The computer program product of claim 6, wherein the collection of character strings is partitioned based on one or more attributes of the character strings, and the computer readable program code is further configured to determine the frequency of occurrence by:
    - identifying a specific partition of the collection of character strings based on an attribute of the first character string and the second character string; and
      
      determining the frequency of occurrence of at least one of the first character string and the second character string from the specific partition of the collection of character strings.
  - 9. The computer program product of claim 6, wherein the collection of character strings stores a collection of people names such that the first and second character strings represent names of people.
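
Claims 3 and 7 look up the frequency within a partition of the collection chosen by an attribute of the strings. A minimal sketch of that lookup follows, assuming each stored string carries a single attribute value (a region code here, which is a hypothetical example) and that partitions are kept as per-attribute counters. All names are illustrative.

```python
from collections import Counter, defaultdict


def build_partitioned_counts(records):
    """Group (string, attribute) records into one Counter per attribute value."""
    partitions = defaultdict(Counter)
    for string, attribute in records:
        partitions[attribute][string] += 1
    return partitions


def frequency_in_partition(partitions, string, attribute) -> int:
    """Count occurrences of `string` only within the partition for `attribute`."""
    # .get avoids creating an empty partition for an unseen attribute value.
    return partitions.get(attribute, Counter())[string]


# Hypothetical data: names tagged with a region code.
records = [("Maria", "ES"), ("Maria", "ES"), ("Maria", "PL"), ("Zbigniew", "PL")]
partitions = build_partitioned_counts(records)
```

The same string can then be common in one partition and rare in another, which changes which frequency threshold it falls under.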

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Huang, Shudong; Jonas, Jeffrey J.; Macy, Brian E.; Patman Maguire, Frankie E.; Williams, Charles K.
Primary Examiner(s): Riley, Marcus T
Application Number: US14/463,162
Publication Number: US 20160055141A1
Time in Patent Office: 1,106 Days
Field of Search: None
CPC Class Codes:
G06F 40/232 Orthographic correction, e....
G06F 40/295 Named entity recognition
