Methods, computer readable mediums and systems for linking related data from at least two data sources based upon a scoring algorithm

  • US 7,644,077 B2
  • Filed: 10/21/2004
  • Issued: 01/05/2010
  • Est. Priority Date: 10/21/2004
  • Status: Active Grant
First Claim

1. A method for linking related data from at least two data sources, said method comprising:

    formatting data items of a first data source, each of said data items of the first data source including a plurality of attributes, wherein each of said data items of the first data source is formatted according to the attributes included therewith;

    formatting data items of a second data source, each of said data items of the second data source including a plurality of attributes, wherein each of said data items of the second data source is formatted according to the attributes included therewith, and wherein a first attribute included in a first data item of the first data source comprises a first string and a corresponding first attribute of a first data item of the second data source comprises a second string, a second attribute included in the first data item of the first data source comprises a third string and a corresponding second attribute of the first data item of the second data source comprises a fourth string, the first attribute of a second data item of the second data source comprises a fifth string, and the second attribute of the second data item of the second data source comprises a sixth string;

    selecting one or more high-cardinality attributes from the plurality of attributes included in the data items of the first data source and the attributes included in the data items of the second data source;

    executing, by a computing device, a preliminary matching algorithm for the selected high-cardinality attributes to generate a preliminary score for each set of a group of data item sets, said each set comprising a formatted data item from the first data source and a formatted data item from the second data source;

    identifying sets which have unrelated data items of the first and second data sources based upon the generated preliminary scores;

    modifying the group of data item sets to exclude said sets identified as having unrelated data items;

    executing, by the computing device, a scoring algorithm for each data item set in the modified group of data item sets, wherein executing the scoring algorithm for the first attributes comprises performing a string comparison between the first string and the fifth string and executing the scoring algorithm for the second attributes comprises performing a string comparison between the third string and the sixth string, and wherein executing the scoring algorithm further comprises combining a score from the string comparison between the first string and the fifth string and from the string comparison between the third string and the sixth string to produce a total match score for the first data item of the first data source and the second data item of the second data source;

    identifying sets which have related data items of the first and second data sources based upon the total match scores;

    linking the first data item of the first data source with the first data item of the second data source when the total match score for the first data items of each of the first and second data sources is greater than the total match score for the first and second data items of the first and second data sources, respectively, and the total match score for the first data items is greater than a threshold matching criterion; and

    linking the first data item of the first data source with the second data item of the second data source when the total match score for the first and second data items of the first and second data sources, respectively, is greater than the total match score for first data items of each of the first and second data sources and the total match score for the first and second data items of the first and second data sources, respectively, is greater than a threshold matching criterion.
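
The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation, and every name in it is hypothetical: `format_item`, `high_cardinality_attrs`, `similarity`, `link`, and the parameters `min_ratio`, `prelim_cutoff`, and `threshold` are assumptions made for illustration. The claim does not specify a particular string comparison, so Python's standard `difflib.SequenceMatcher` stands in for it, and the per-attribute scores are combined by a plain sum.

```python
from difflib import SequenceMatcher
from itertools import product


def format_item(item):
    """Format an item: trim, collapse whitespace, and lowercase each attribute."""
    return {k: " ".join(str(v).split()).lower() for k, v in item.items()}


def high_cardinality_attrs(items_a, items_b, min_ratio=0.8):
    """Select shared attributes whose share of distinct values is high.

    min_ratio is an assumed cutoff; the claim does not define one.
    """
    items = items_a + items_b
    shared = set(items_a[0]) & set(items_b[0])
    return sorted(
        attr for attr in shared
        if len({it[attr] for it in items}) / len(items) >= min_ratio
    )


def similarity(s, t):
    """Stand-in string comparison returning a value in [0, 1]."""
    return SequenceMatcher(None, s, t).ratio()


def link(source_a, source_b, prelim_cutoff=0.3, threshold=1.2):
    """Return {index in source_a: index of its linked item in source_b}."""
    a = [format_item(x) for x in source_a]
    b = [format_item(x) for x in source_b]
    attrs = sorted(set(a[0]) & set(b[0]))
    # Fall back to all shared attributes if none qualifies as high-cardinality.
    key_attrs = high_cardinality_attrs(a, b) or attrs

    # Preliminary matching on the high-cardinality attributes only:
    # pairs scoring below the cutoff are treated as unrelated and excluded.
    candidates = [
        (i, j)
        for i, j in product(range(len(a)), range(len(b)))
        if max(similarity(a[i][k], b[j][k]) for k in key_attrs) >= prelim_cutoff
    ]

    # Full scoring on the remaining pairs: compare each attribute's strings
    # and combine the per-attribute scores into a total match score.
    best = {}
    for i, j in candidates:
        total = sum(similarity(a[i][k], b[j][k]) for k in attrs)
        # Link to the higher-scoring candidate, and only above the threshold.
        if total > threshold and total > best.get(i, (None, 0.0))[1]:
            best[i] = (j, total)
    return {i: j for i, (j, _) in best.items()}


# Made-up example data, not taken from the patent.
source_a = [{"name": "Acme Corporation", "city": "Boston"}]
source_b = [
    {"name": "ACME Corp.", "city": "Boston"},
    {"name": "Zenith Labs", "city": "Denver"},
]
print(link(source_a, source_b))  # {0: 0}
```

In this sketch the preliminary pass plays the role of the claim's exclusion step: it discards the "Zenith Labs" pair before any full scoring is done, so the more expensive attribute-by-attribute comparison runs only on plausible candidates. The final loop mirrors the two linking clauses by keeping, for each item of the first source, whichever candidate has the greater total match score, provided that score exceeds the threshold.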
