Methods, computer readable mediums and systems for linking related data from at least two data sources based upon a scoring algorithm
First Claim
1. A method for linking related data from at least two data sources, said method comprising:
formatting data items of a first data source, each of said data items of the first data source including a plurality of attributes, wherein each of said data items of the first data source is formatted according to the attributes included therewith;
formatting data items of a second data source, each of said data items of the second data source including a plurality of attributes, wherein each of said data items of the second data source is formatted according to the attributes included therewith, and wherein a first attribute included in a first data item of the first data source comprises a first string and a corresponding first attribute of a first data item of the second data source comprises a second string, a second attribute included in the first data item of the first data source comprises a third string and a corresponding second attribute of the first data item of the second data source comprises a fourth string, the first attribute of a second data item of the second data source comprises a fifth string, and the second attribute of the second data item of the second data source comprises a sixth string;
selecting one or more high-cardinality attributes from the plurality of attributes included in the data items of the first data source and the attributes included in the data items of the second data source;
executing, by a computing device, a preliminary matching algorithm for the selected high-cardinality attributes to generate a preliminary score for each set of a group of data item sets, said each set comprising a formatted data item from the first data source and a formatted data item from the second data source;
identifying sets which have unrelated data items of the first and second data sources based upon the generated preliminary scores;
modifying the group of data item sets to exclude said sets identified as having unrelated data items;
executing, by the computing device, a scoring algorithm for each data item set in the modified group of data item sets, wherein executing the scoring algorithm for the first attributes comprises performing a string comparison between the first string and the fifth string and executing the scoring algorithm for the second attributes comprises performing a string comparison between the third string and the sixth string, and wherein executing the scoring algorithm further comprises combining a score from the string comparison between the first string and the fifth string and from the string comparison between the third string and the sixth string to produce a total match score for the first data item of the first data source and the second data item of the second data source;
identifying sets which have related data items of the first and second data sources based upon the total match scores;
linking the first data item of the first data source with the first data item of the second data source when the total match score for the first data items of each of the first and second data sources is greater than the total match score for the first and second data items of the first and second data sources, respectively, and the total match score for the first data items is greater than a threshold matching criterion; and
linking the first data item of the first data source with the second data item of the second data source when the total match score for the first and second data items of the first and second data sources, respectively, is greater than the total match score for first data items of each of the first and second data sources and the total match score for the first and second data items of the first and second data sources, respectively, is greater than a threshold matching criterion.
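The pruning and linking steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `similarity`, `high_cardinality_attrs`, and `link`, the use of `difflib.SequenceMatcher` as the string comparison, the summation of per-attribute scores, and the cutoff values are all assumptions chosen for illustration. Cardinality is approximated here as the number of distinct values an attribute takes.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String comparison score in [0, 1] (difflib ratio; an assumed choice)."""
    return SequenceMatcher(None, a, b).ratio()


def high_cardinality_attrs(items, attrs, top_n=1):
    """Rank attributes by their number of distinct values; keep the top_n."""
    counts = {a: len({item[a] for item in items}) for a in attrs}
    return sorted(attrs, key=lambda a: counts[a], reverse=True)[:top_n]


def link(source_a, source_b, attrs, prelim_cutoff=0.5, threshold=1.2):
    """Return {index in source_a: index in source_b} for linked items."""
    key_attrs = high_cardinality_attrs(source_a + source_b, attrs)

    # Preliminary pass: score every set (pair) on the selected attributes
    # only, and exclude sets that look unrelated.
    candidates = [
        (i, j)
        for i, a in enumerate(source_a)
        for j, b in enumerate(source_b)
        if min(similarity(a[k], b[k]) for k in key_attrs) >= prelim_cutoff
    ]

    # Full pass: total match score = sum of per-attribute string scores.
    # An item is linked to its best-scoring candidate, and only when that
    # score exceeds the (hypothetical) threshold.
    best = {}
    for i, j in candidates:
        total = sum(similarity(source_a[i][k], source_b[j][k]) for k in attrs)
        if total > threshold and total > best.get(i, (None, 0.0))[1]:
            best[i] = (j, total)
    return {i: j for i, (j, _) in best.items()}


# Hypothetical, already-formatted data items from two sources.
albums_a = [
    {"title": "abbey road", "artist": "the beatles"},
    {"title": "kind of blue", "artist": "miles davis"},
]
albums_b = [
    {"title": "abbey road (remastered)", "artist": "the beatles"},
    {"title": "let it be", "artist": "the beatles"},
    {"title": "kind of blue", "artist": "miles davis"},
    {"title": "abbey road", "artist": "beatles"},
]
links = link(albums_a, albums_b, ["title", "artist"])
print(links)
```

In this sample data the title attribute has more distinct values than the artist attribute, so it drives the preliminary pass. Two candidates survive for the first item of `albums_a`, and the one with the higher total match score is linked.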
Abstract
A method for linking related data, such as metadata, from at least two data sources. The method includes formatting items of data of the data sources according to attributes. The method also executes a scoring algorithm for one or more of the attributes to generate a score for one or more sets of the formatted items of data, each set including an item of data from one data source and an item of data from another data source. Finally, the method identifies related items of data of the separate data sources based upon the generated scores to facilitate linking related data of the two data sources. The method may also provide a link between data items of the data sources.
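As a rough illustration of the formatting and scoring steps summarized in the abstract, the following Python sketch normalizes items on a chosen set of attributes and then scores every set of one item from each source. The normalization rules (lowercasing, stripping punctuation, collapsing whitespace), the averaging of per-attribute scores, and the names `format_item` and `score_sets` are assumptions made for this example; the abstract does not specify them.

```python
import re
from difflib import SequenceMatcher
from itertools import product


def format_item(item, attrs):
    """Keep only the chosen attributes, lowercased, with punctuation
    removed and runs of whitespace collapsed (assumed formatting rules)."""
    formatted = {}
    for attr in attrs:
        text = re.sub(r"[^\w\s]", "", str(item.get(attr, "")).lower())
        formatted[attr] = " ".join(text.split())
    return formatted


def score_sets(source_a, source_b, attrs):
    """Score every (item from A, item from B) set; higher means more alike."""
    a_items = [format_item(x, attrs) for x in source_a]
    b_items = [format_item(x, attrs) for x in source_b]
    scores = {}
    for (i, a), (j, b) in product(enumerate(a_items), enumerate(b_items)):
        per_attr = [SequenceMatcher(None, a[k], b[k]).ratio() for k in attrs]
        scores[(i, j)] = sum(per_attr) / len(attrs)
    return scores


# Hypothetical metadata items; the extra keys ("id", "sku") are ignored
# because they are not among the attributes used for formatting.
feed_a = [{"title": "Hey  Jude!", "artist": "The Beatles", "id": 17}]
feed_b = [
    {"title": "hey jude", "artist": "the beatles", "sku": "X-1"},
    {"title": "Yesterday", "artist": "The Beatles", "sku": "X-2"},
]
scores = score_sets(feed_a, feed_b, ["title", "artist"])
related = max(scores, key=scores.get)  # the set with the highest score
print(related, scores[related])
```

Formatting before scoring means that differences in case, punctuation, or spacing do not lower the score of items that are otherwise the same.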
34 Claims
28. One or more computer storage media having computer-executable components for linking related data from at least two sources of data, said components comprising:
an attribute component for formatting data items of a first data source and data items of a second data source, said data items from the first data source and said data items from the second data source each including a plurality of attributes, wherein said formatting includes formatting the data items of the first data source and the data items of the second data source according to attributes preselected from the plurality of attributes, wherein a first attribute included in a first data item of the first data source comprises a first string and a corresponding first attribute of a first data item of the second data source comprises a second string, a second attribute included in the first data item of the first data source comprises a third string and a corresponding second attribute of the first data item of the second data source comprises a fourth string, the first attribute of a second data item of the second data source comprises a fifth string, and the second attribute of the second data item of the second data source comprises a sixth string;
an engine component for:
selecting one or more high-cardinality attributes from the plurality of attributes included in the data items of the first data source and the attributes included in the data items of the second data source;
executing a preliminary matching algorithm for the selected high-cardinality attributes to generate a preliminary score for each set of a group of data item sets, said each set comprising a data item from the first data source formatted by the attribute component and a data item from the second data source formatted by the attribute component;
identifying sets which have unrelated data items of the first and second data sources based upon the generated preliminary scores;
modifying the group of data item sets to exclude said sets identified as having unrelated data items; and
executing a scoring algorithm for each data item set in the modified group of data item sets, wherein executing the scoring algorithm for the first attributes comprises performing a string comparison between the first string and the fifth string and executing the scoring algorithm for the second attributes comprises performing a string comparison between the third string and the sixth string, and wherein executing the scoring algorithm further comprises combining a score from the string comparison between the first string and the fifth string and from the string comparison between the third string and the sixth string to produce a total match score for the first data item of the first data source and the second data item of the second data source;
linking the first data item of the first data source with the first data item of the second data source when the total match score for the first data items of each of the first and second data sources is greater than the total match score for the first and second data items of the first and second data sources, respectively, and the total match score for the first data items is greater than a threshold matching criterion;
linking the first data item of the first data source with the second data item of the second data source when the total match score for the first and second data items of the first and second data sources, respectively, is greater than the total match score for first data items of each of the first and second data sources and the total match score for the first and second data items of the first and second data sources, respectively, is greater than a threshold matching criterion; and
a filter component for identifying sets which have related data items of the first and second data sources based upon the total match scores generated by the engine component.
30. A system for linking related data from at least two sources of data, said system comprising:
a first data feed comprising a stream of data items, said data items of the first data feed including a plurality of attributes;
a second data feed comprising a stream of data items, said data items of the second data feed including a plurality of attributes, wherein a first attribute included in a first data item of the first data feed comprises a first string and a corresponding first attribute of a first data item of the second data feed comprises a second string, a second attribute included in the first data item of the first data feed comprises a third string and a corresponding second attribute of the first data item of the second data feed comprises a fourth string, the first attribute of a second data item of the second data feed comprises a fifth string, and the second attribute of the second data item of the second data feed comprises a sixth string; and
a processor coupled to a memory, wherein the processor is configured to:
receive said first and second data feeds,
format data items of the first data feed according to one or more of the plurality of attributes included therewith,
format data items of the second data feed according to one or more of the plurality of attributes included therewith,
select one or more high-cardinality attributes from the plurality of attributes included in the data items of the first data source and the attributes included in the data items of the second data source;
execute a preliminary matching algorithm for the selected high-cardinality attributes to generate a preliminary score for each set of a group of data item sets, said each set comprising a formatted data item from the first data feed and a formatted data item from the second data feed;
identify sets which have unrelated data items of the first and second data feeds based upon the generated preliminary scores;
modify the group of data item sets to exclude said sets identified as having unrelated data items;
execute a scoring algorithm for each data item set in the modified group of data item sets, wherein executing the scoring algorithm for the first attributes comprises performing a string comparison between the first string and the fifth string and executing the scoring algorithm for the second attributes comprises performing a string comparison between the third string and the sixth string, and wherein executing the scoring algorithm further comprises combining a score from the string comparison between the first string and the fifth string and from the string comparison between the third string and the sixth string to produce a total match score for the first data item of the first data source and the second data item of the second data source;
link the first data item of the first data feed with the first data item of the second data feed when the total match score for the first data items of each of the first and second data feeds is greater than the total match score for the first and second data items of the first and second data feeds, respectively, and the total match score for the first data items is greater than a threshold matching criterion;
link the first data item of the first data feed with the second data item of the second data feed when the total match score for the first and second data items of the first and second data feeds, respectively, is greater than the total match score for first data items of each of the first and second data feeds and the total match score for the first and second data items of the first and second data feeds, respectively, is greater than a threshold matching criterion; and
identify sets which have related data items of the first and second data feeds based upon the total match scores.
33. A method for establishing a link between related metadata from at least two sources of metadata, said metadata including property data associated with a media file accessible by a client, comprising:
formatting data items of a first metadata source, each of said data items of the first metadata source including a plurality of attributes, wherein each of said data items of the first metadata source is formatted according to the attributes included therewith;
formatting data items of a second metadata source, each of said data items of the second metadata source including a plurality of attributes, wherein each of said data items of the second metadata source is formatted according to the attributes included therewith, wherein a first attribute included in a first data item of the first metadata source comprises a first string and a corresponding first attribute of a first data item of the second metadata source comprises a second string, a second attribute included in the first data item of the first metadata source comprises a third string and a corresponding second attribute of the first data item of the second metadata source comprises a fourth string, the first attribute of a second data item of the second metadata source comprises a fifth string, and the second attribute of the second data item of the second metadata source comprises a sixth string;
selecting one or more attributes from the plurality of attributes included in the data items of the first metadata source and the data items of the second metadata source;
executing, by a computing device, a preliminary matching algorithm for the one or more selected attributes to generate a preliminary score for each set of a group of data item sets, said each set comprising a formatted data item from the first metadata source and a formatted data item from the second metadata source;
identifying sets which have unrelated data items of the first and second metadata sources based upon the generated preliminary scores;
modifying the group of data item sets to exclude said sets identified as having unrelated data items;
executing, by the computing device, a scoring algorithm for each data item set in the modified group of data item sets, wherein executing the scoring algorithm for the first attributes comprises performing a string comparison between the first string and the fifth string and executing the scoring algorithm for the second attributes comprises performing a string comparison between the third string and the sixth string, and wherein executing the scoring algorithm further comprises combining a score from the string comparison between the first string and the fifth string and from the string comparison between the third string and the sixth string to produce a total match score for the first data item of the first metadata source and the second data item of the second metadata source;
identifying sets which have related data items of the first and second metadata sources based upon the total match scores;
linking the first data item of the first metadata source with the first data item of the second metadata source when the total match score for the first data items of each of the first and second metadata sources is greater than the total match score for the first and second data items of the first and second metadata sources, respectively, and the total match score for the first data items is greater than a threshold matching criterion;
linking the first data item of the first metadata source with the second data item of the second metadata source when the total match score for the first and second data items of the first and second metadata sources, respectively, is greater than the total match score for first data items of each of the first and second metadata sources and the total match score for the first and second data items of the first and second metadata sources, respectively, is greater than a threshold matching criterion;
establishing at least one link between data items of the first metadata source related to data items of the second metadata source identified as being related; and
generating a user interface displaying the established link.
Specification