Methods, computer readable mediums and systems for linking related data from at least two data sources based upon a scoring algorithm

US 7,644,077 B2
Filed: 10/21/2004
Issued: 01/05/2010
Est. Priority Date: 10/21/2004
Status: Active Grant

Abstract

A method for linking related data, such as metadata, from at least two data sources. The method includes formatting items of data of the data sources according to attributes. The method also executes a scoring algorithm for one or more of the attributes to generate a score for one or more sets of the formatted items of data, where each set includes an item of data from one data source and an item of data from another data source. Finally, the method identifies related items of data of the separate data sources based upon the generated scores to facilitate linking related data of the two data sources. The method may also provide a link between data items of the data sources.
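The three steps in the abstract (format, score, identify) can be illustrated with a minimal Python sketch. The attribute names `title` and `year`, the lower-casing and trimming rules, and the equality-count score are all assumptions made for illustration; the abstract does not specify any of them.

```python
from itertools import product

# Hypothetical attribute names; the abstract does not name any.
ATTRIBUTES = ("title", "year")

def format_item(raw):
    """Format a raw record into trimmed, lower-cased strings keyed by attribute."""
    return {a: str(raw.get(a, "")).strip().lower() for a in ATTRIBUTES}

def score_pair(a, b):
    """Count the attributes whose formatted values are non-empty and equal."""
    return sum(1 for k in ATTRIBUTES if a[k] and a[k] == b[k])

def related_pairs(source_a, source_b, min_score):
    """Return (index_a, index_b, score) for every pair scoring at least min_score."""
    fa = [format_item(r) for r in source_a]
    fb = [format_item(r) for r in source_b]
    out = []
    for (i, a), (j, b) in product(enumerate(fa), enumerate(fb)):
        s = score_pair(a, b)
        if s >= min_score:
            out.append((i, j, s))
    return out
```

Formatting both sources into the same attribute layout first is what lets a plain string comparison work across sources that store, say, the year as a number in one place and as text in another.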

84 Citations

34 Claims

1. A method for linking related data from at least two data sources, said method comprising:
- formatting data items of a first data source, each of said data items of the first data source including a plurality of attributes, wherein each of said data items of the first data source is formatted according to the attributes included therewith;
  
  formatting data items of a second data source, each of said data items of the second data source including a plurality of attributes, wherein each of said data items of the second data source is formatted according to the attributes included therewith, and wherein a first attribute included in a first data item of the first data source comprises a first string and a corresponding first attribute of a first data item of the second data source comprises a second string, a second attribute included in the first data item of the first data source comprises a third string and a corresponding second attribute of the first data item of the second data source comprises a fourth string, the first attribute of a second data item of the second data source comprises a fifth string, and the second attribute of the second data item of the second data source comprises a sixth string;
  
  selecting one or more high-cardinality attributes from the plurality of attributes included in the data items of the first data source and the attributes included in the data items of the second data source;
  
  executing, by a computing device, a preliminary matching algorithm for the selected high-cardinality attributes to generate a preliminary score for each set of a group of data item sets, said each set comprising a formatted data item from the first data source and a formatted data item from the second data source;
  
  identifying sets which have unrelated data items of the first and second data sources based upon the generated preliminary scores;
  
  modifying the group of data item sets to exclude said sets identified as having unrelated data items;
  
  executing, by the computing device, a scoring algorithm for each data item set in the modified group of data item sets, wherein executing the scoring algorithm for the first attributes comprises performing a string comparison between the first string and the fifth string and executing the scoring algorithm for the second attributes comprises performing a string comparison between the third string and the sixth string, and wherein executing the scoring algorithm further comprises combining a score from the string comparison between the first string and the fifth string and from the string comparison between the third string and the sixth string to produce a total match score for the first data item of the first data source and the second data item of the second data source;
  
  identifying sets which have related data items of the first and second data sources based upon the total match scores;
  
  linking the first data item of the first data source with the first data item of the second data source when the total match score for the first data items of each of the first and second data sources is greater than the total match score for the first and second data items of the first and second data sources, respectively, and the total match score for the first data items is greater than a threshold matching criterion; and
  
  linking the first data item of the first data source with the second data item of the second data source when the total match score for the first and second data items of the first and second data sources, respectively, is greater than the total match score for first data items of each of the first and second data sources and the total match score for the first and second data items of the first and second data sources, respectively, is greater than a threshold matching criterion.
- - 2. The method as set forth in claim 1 further comprising providing a link between data items of the first data source and data items of the second data source identified as being related.
  - 3. The method as set forth in claim 2 wherein said providing a link is in response to receiving an item selection from a user.
  - 4. The method as set forth in claim 2 wherein said link is accessible via at least one of a web browser, a media player, a handheld electronic device, or a personal computer.
  - 5. The method as set forth in claim 1 wherein said executing the scoring algorithm for said first attributes comprises performing a string comparison between the first string and the second string and scoring the comparison of the first attribute of the first data item of the first data source and the corresponding first attribute of the first data item of the second data source according to said scoring algorithm.
  - 6. The method as set forth in claim 5 wherein said executing the scoring algorithm for said second attributes comprises performing a string comparison between the third string and the fourth string and scoring the comparison of the second attribute of the first data item of the first data source and the corresponding second attribute of the first data item of the second data source according to said scoring algorithm.
  - 7. The method as set forth in claim 6 wherein said executing the scoring algorithm further comprises combining the score from the string comparison between the first string and the second string and from the string comparison between the third string and the fourth string to produce a total match score for said first data items.
  - 8. The method as set forth in claim 7 wherein said executing the scoring algorithm further comprises weighting the score for said first attribute of the first data items before said combining, and weighting the score for said second attribute of the first data items before said combining.
  - 9. The method as set forth in claim 7 wherein executing the scoring algorithm for said first attributes further comprises scoring the comparison of the first attribute of the first data item of the first data source and a corresponding first attribute of the second data item of the second data source according to said scoring algorithm.
  - 10. The method as set forth in claim 9 wherein executing the scoring algorithm for said second attributes further comprises scoring the comparison of the second attribute of the first data item of the first data source and a corresponding second attribute of the second data item of the second data source according to said scoring algorithm.
  - 11. The method as set forth in claim 5 wherein said scoring the comparison comprises, assigning a high score when the string comparison between the first string and the second string yields an exact match, assigning a neutral score less than said high score when at least one of said first string and said second string contains no value, assigning a low score less than said neutral score when the string comparison between the first string and the second string yields a partial match, and assigning a zero score when none of the high score, the neutral score, and the low score is assigned.
  - 12. The method as set forth in claim 1 wherein said executing the scoring algorithm further comprises weighting the score for said first attribute of the first and second data items of the first and second data sources, respectively, before said combining, and weighting the score for said second attribute of the first and second data items of the first and second data sources, respectively, before said combining.
  - 13. The method as set forth in claim 1 wherein said identifying sets which have related data items of the first and second data sources based upon the total match scores further comprises, linking the first data item of the first data source with the first data item of the second data source when the total match score for the first data items is greater than the total match score for the first data item of the first data source and any other data item of the second data source and the total match score for the first data items is greater than a threshold matching criterion.
  - 14. The method as set forth in claim 1 wherein said identifying sets which have related data items of the first and second data sources based upon the total match scores further comprises, linking a data item of the first data source with a data item of the second data source when a total match score for said data items is greater than any total match score for the data item of the first data source and any other data item of the second data source and the total match score for said data items is greater than a threshold matching criterion.
  - 15. The method as set forth in claim 1 wherein said executing comprises generating a score for each set of data items comprising an item of data from said first data source and an item of data from said second data source.
  - 16. The method as set forth in claim 15 wherein said executing comprises aggregating said sets of data comprising an item of data from said first data source and an item of data from said second data source for at least two of said attributes to generate a total match score for each of said sets.
  - 17. The method as set forth in claim 1 further comprising formatting items of data of a third data source according to said attributes;
    - wherein said executing further comprises executing a scoring algorithm for one or more of the attributes to generate a score for sets of data comprising an item of data from one of said data sources and an item of data from another of said data sources; and
      
      wherein said identifying sets which have related data items further comprises identifying sets which have related data items of the first, second, and third data sources based upon the generated scores.
  - 18. The method as set forth in claim 17 wherein said identifying sets which have related data items of the first and second data sources based upon the generated scores further comprises, linking a data item of the first data source with a data item of the second data source when a total match score for said data items is greater than any total match score for the data item of the first data source and any other data item of the second data source and the total match score for said data items of the first and second data sources is greater than a threshold matching criterion, and linking the data item of the first data source with a data item of the third data source when a total match score for said data items is greater than any total match score for the data item of the first data source and any other data item of the third data source and the total match score for said data items of the first and third data sources is greater than a threshold matching criterion.
  - 19. The method as set forth in claim 17 wherein said executing comprises generating a score for each set of data comprising an item of data from said first data source and an item of data from said second data source, and generating a score for each set of data comprising an item of data from said second data source and an item of data from said third data source.
  - 20. The method as set forth in claim 17 wherein said executing comprises generating a score for each set of data comprising an item of data from said first data source and an item of data from said second data source, and generating a score for each set of data comprising an item of data from said first data source and an item of data from said third data source.
  - 21. The method as set forth in claim 20 wherein said first data source comprises a canonical source.
  - 22. The method as set forth in claim 1 wherein said data sources comprise property data associated with media files.
  - 23. The method as set forth in claim 1 wherein the data is data relating to at least one of video files, audio files, movies, music, executable files, and document files.
  - 24. The method as set forth in claim 23 wherein when said data relates to movies said attributes are at least two of movie title, movie run time, Motion Picture Association of America (MPAA) rating, movie genre, releasing studio, cast listing, cast member, release date, release year, and director.
  - 25. The method as set forth in claim 1 wherein said formatting comprises parsing items of data into data strings having a pre-defined format.
  - 26. The method as set forth in claim 1 wherein said first and second data sources are at least one of a database file, an xml document, and a delimited text file.
  - 27. The method of claim 1, further comprising serially interlinking the data sources by identifying related items of data of the first data source and the second data source and identifying related items of data of the second data source and a third data source thereby linking the first data source with the third data source.
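The scoring and linking steps recited in claims 1, 11, 12 and 14 can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the numeric score levels, the substring test used for a "partial match", the use of a single attribute for the preliminary pass, and summation as the way of combining scores are all hypothetical choices. The claims fix only the ordering high > neutral > low > zero and the requirement that a linked pair beat both the threshold and every competing pair.

```python
# Hypothetical score levels; claim 11 fixes only their ordering.
HIGH, NEUTRAL, LOW, ZERO = 1.0, 0.5, 0.25, 0.0

def compare(s1, s2):
    """Score one attribute pair following the ordering in claim 11."""
    if not s1 or not s2:
        return NEUTRAL            # at least one string contains no value
    if s1 == s2:
        return HIGH               # exact match
    if s1 in s2 or s2 in s1:
        return LOW                # partial match (substring test is an assumption)
    return ZERO

def total_score(a, b, weights):
    """Weight each attribute score, then combine by summation (an assumption)."""
    return sum(w * compare(a.get(k, ""), b.get(k, "")) for k, w in weights.items())

def link(source_a, source_b, weights, prelim_attr, threshold):
    """Map each index in source_a to its uniquely best index in source_b, if any."""
    links = {}
    for i, a in enumerate(source_a):
        scored = []
        for j, b in enumerate(source_b):
            # Preliminary pass: exclude pairs that look unrelated on one
            # high-cardinality attribute before running the full scoring.
            if compare(a.get(prelim_attr, ""), b.get(prelim_attr, "")) == ZERO:
                continue
            scored.append((total_score(a, b, weights), j))
        if not scored:
            continue
        scored.sort(reverse=True)
        top, best_j = scored[0]
        runner_up = scored[1][0] if len(scored) > 1 else float("-inf")
        # Link only when the best score exceeds the threshold and is
        # strictly greater than the score of every other candidate.
        if top > threshold and top > runner_up:
            links[i] = best_j
    return links
```

The preliminary pass exists to shrink the set of pairs before the more expensive multi-attribute scoring runs; in this sketch it costs one string comparison per pair.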

28. One or more computer storage media having computer-executable components for linking related data from at least two sources of data, said components comprising:
- an attribute component for formatting data items of a first data source and data items of a second data source, said data items from the first data source and said data items from the second data source each including a plurality of attributes, wherein said formatting includes formatting the data items of the first data source and the data items of the second data source according to attributes preselected from the plurality of attributes, wherein a first attribute included in a first data item of the first data source comprises a first string and a corresponding first attribute of a first data item of the second data source comprises a second string, a second attribute included in the first data item of the first data source comprises a third string and a corresponding second attribute of the first data item of the second data source comprises a fourth string, the first attribute of a second data item of the second data source comprises a fifth string, and the second attribute of the second data item of the second data source comprises a sixth string;
  
  an engine component for:
  
  selecting one or more high-cardinality attributes from the plurality of attributes included in the data items of the first data source and the attributes included in the data items of the second data source;
  
  executing a preliminary matching algorithm for the selected high-cardinality attributes to generate a preliminary score for each set of a group of data item sets, said each set comprising a data item from the first data source formatted by the attribute component and a data item from the second data source formatted by the attribute component;
  
  identifying sets which have unrelated data items of the first and second data sources based upon the generated preliminary scores;
  
  modifying the group of data item sets to exclude said sets identified as having unrelated data items; and
  
  executing a scoring algorithm for each data item set in the modified group of data item sets, wherein executing the scoring algorithm for the first attributes comprises performing a string comparison between the first string and the fifth string and executing the scoring algorithm for the second attributes comprises performing a string comparison between the third string and the sixth string, and wherein executing the scoring algorithm further comprises combining a score from the string comparison between the first string and the fifth string and from the string comparison between the third string and the sixth string to produce a total match score for the first data item of the first data source and the second data item of the second data source;
  
  linking the first data item of the first data source with the first data item of the second data source when the total match score for the first data items of each of the first and second data sources is greater than the total match score for the first and second data items of the first and second data sources, respectively, and the total match score for the first data items is greater than a threshold matching criterion;
  
  linking the first data item of the first data source with the second data item of the second data source when the total match score for the first and second data items of the first and second data sources, respectively, is greater than the total match score for first data items of each of the first and second data sources and the total match score for the first and second data items of the first and second data sources, respectively, is greater than a threshold matching criterion; and
  
  a filter component for identifying sets which have related data items of the first and second data sources based upon the total match scores generated by the engine component.
- - 29. The one or more computer storage media set forth in claim 28 further comprising an aggregation component for generating the total match score for each set of data items by combining said scores for each of said attributes generated by the engine component for each of said sets.
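The claims recite "selecting one or more high-cardinality attributes" without defining how cardinality is measured. One plausible reading, sketched below in Python, counts the distinct non-empty values each attribute takes across both sources and keeps the highest-ranked attributes. The function names and the `top_n` parameter are hypothetical.

```python
def cardinality(items, attribute):
    """Number of distinct non-empty values an attribute takes across items."""
    return len({item.get(attribute) for item in items if item.get(attribute)})

def select_high_cardinality(source_a, source_b, attributes, top_n=1):
    """Pick the top_n attributes with the most distinct values over both sources."""
    combined = list(source_a) + list(source_b)
    ranked = sorted(attributes, key=lambda a: cardinality(combined, a), reverse=True)
    return ranked[:top_n]
```

An attribute with many distinct values, such as a title, discriminates between items far better than one with few, such as a rating, which is why it suits a cheap preliminary pass.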

30. A system for linking related data from at least two sources of data, said system comprising:
- a first data feed comprising a stream of data items, said data items of the first data feed including a plurality of attributes;
  
  a second data feed comprising a stream of data items, said data items of the second data feed including a plurality of attributes, wherein a first attribute included in a first data item of the first data feed comprises a first string and a corresponding first attribute of a first data item of the second data feed comprises a second string, a second attribute included in the first data item of the first data feed comprises a third string and a corresponding second attribute of the first data item of the second data feed comprises a fourth string, the first attribute of a second data item of the second data feed comprises a fifth string, and the second attribute of the second data item of the second data feed comprises a sixth string; and
  
  a processor coupled to a memory, wherein the processor configured to:
  
  receive said first and second data feeds, format data items of the first data feed according to one or more of the plurality of attributes included therewith, format data items of the second data feed according to one or more of the plurality of attributes included therewith, select one or more high-cardinality attributes from the plurality of attributes included in the data items of the first data source and the attributes included in the data items of the second data source;
  
  execute a preliminary matching algorithm for the selected high-cardinality attributes to generate a preliminary score for each set of a group of data item sets, said each set comprising a formatted data item from the first data feed and a formatted data item from the second data feed;
  
  identify sets which have unrelated data items of the first and second data feeds based upon the generated preliminary scores;
  
  modify the group of data item sets to exclude said sets identified as having unrelated data items;
  
  execute a scoring algorithm for each data item set in the modified group of data item sets, wherein executing the scoring algorithm for the first attributes comprises performing a string comparison between the first string and the fifth string and executing the scoring algorithm for the second attributes comprises performing a string comparison between the third string and the sixth string, and wherein executing the scoring algorithm further comprises combining a score from the string comparison between the first string and the fifth string and from the string comparison between the third string and the sixth string to produce a total match score for the first data item of the first data source and the second data item of the second data source;
  
  link the first data item of the first data feed with the first data item of the second data feed when the total match score for the first data items of each of the first and second data feeds is greater than the total match score for the first and second data items of the first and second data feeds, respectively, and the total match score for the first data items is greater than a threshold matching criterion;
  
  link the first data item of the first data feed with the second data item of the second data feed when the total match score for the first and second data items of the first and second data feeds, respectively, is greater than the total match score for first data items of each of the first and second data feeds and the total match score for the first and second data items of the first and second data feeds, respectively, is greater than a threshold matching criterion; and
  
  identify sets which have related data items of the first and second data feeds based upon the total match scores.
- - 31. The system as set forth in claim 30 wherein said system comprises a data service for providing a data link between a data item of the first data feed and a data item of the second data feed identified as being related.
  - 32. The system as set forth in claim 31 wherein said data service is a web service.

33. A method for establishing a link between related metadata from at least two sources of metadata, said metadata including property data associated with a media file accessible by a client, comprising:
- formatting data items of a first metadata source, each of said data items of the first metadata source including a plurality of attributes, wherein each of said data items of the first metadata source is formatted according to the attributes included therewith;
  
  formatting data items of a second metadata source, each of said data items of the second metadata source including a plurality of attributes, wherein each of said data items of the second metadata source is formatted according to the attributes included therewith, wherein a first attribute included in a first data item of the first metadata source comprises a first string and a corresponding first attribute of a first data item of the second metadata source comprises a second string, a second attribute included in the first data item of the first metadata source comprises a third string and a corresponding second attribute of the first data item of the second metadata source comprises a fourth string, the first attribute of a second data item of the second metadata source comprises a fifth string, and the second attribute of the second data item of the second metadata source comprises a sixth string;
  
  selecting one or more attributes from the plurality of attributes included in the data items of the first metadata source and the data items of the second metadata source;
  
  executing, by a computing device, a preliminary matching algorithm for the one or more selected attributes to generate a preliminary score for each set of a group of data item sets, said each set comprising a formatted data item from the first metadata source and a formatted data item from the second metadata source;
  
  identifying sets which have unrelated data items of the first and second metadata sources based upon the generated preliminary scores;
  
  modifying the group of data item sets to exclude said sets identified as having unrelated data items;
  
  executing, by the computing device, a scoring algorithm for each data item set in the modified group of data item sets, wherein executing the scoring algorithm for the first attributes comprises performing a string comparison between the first string and the fifth string and executing the scoring algorithm for the second attributes comprises performing a string comparison between the third string and the sixth string, and wherein executing the scoring algorithm further comprises combining a score from the string comparison between the first string and the fifth string and from the string comparison between the third string and the sixth string to produce a total match score for the first data item of the first metadata source and the second data item of the second metadata source;
  
  identifying sets which have related data items of the first and second metadata sources based upon the total match scores;
  
  linking the first data item of the first metadata source with the first data item of the second metadata source when the total match score for the first data items of each of the first and second metadata sources is greater than the total match score for the first and second data items of the first and second metadata sources, respectively, and the total match score for the first data items is greater than a threshold matching criterion;
  
  linking the first data item of the first metadata source with the second data item of the second metadata source when the total match score for the first and second data items of the first and second metadata sources, respectively, is greater than the total match score for first data items of each of the first and second metadata sources and the total match score for the first and second data items of the first and second metadata sources, respectively, is greater than a threshold matching criterion;
  
  establishing at least one link between data items of the first metadata source related to data items of the second metadata source identified as being related; and
  
  generating a user interface displaying the established link.
- - 34. The method as set forth in claim 33 further comprising determining that a media file associated with an item of data located in one of said metadata sources is accessed by said client and presenting a link associated with said item of data to said client.
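Claims 33 and 34 describe establishing links between related metadata items and presenting a link when a client accesses the associated media file. A minimal Python sketch of that lookup is shown here; the identifiers, the two-way link table, and the `file_to_item` mapping are assumptions made for illustration and do not come from the claims.

```python
from collections import defaultdict

def establish_links(matches):
    """Build a two-way lookup from (item_a, item_b) identifier pairs."""
    table = defaultdict(set)
    for a, b in matches:
        table[a].add(b)
        table[b].add(a)
    return table

def links_on_access(media_file, file_to_item, table):
    """Return the sorted linked item identifiers for the accessed media file."""
    item = file_to_item.get(media_file)
    if item is None:
        return []
    return sorted(table.get(item, ()))
```

A user interface would then render the returned identifiers as links; that rendering step is outside the scope of this sketch.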

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Deeds, Paul; Picker, Saar
Primary Examiner(s)
Truong; Cam Y
Assistant Examiner(s)
Chau; Dung K

Application Number
US10/970,602
Publication Number
US 20060089948A1
Time in Patent Office
1,902 Days
Field of Search
None
US Class Current
707/783
CPC Class Codes
H04H 60/72   using electronic programme ...
H04H 60/73   using meta-information
H04N 21/4622   Retrieving content or addit...
H04N 21/4722   for requesting additional d...
H04N 21/8133   specifically related to the...
H04N 21/84   Generation or processing of...
H04N 21/8586   by using a URL processing c...
H04N 7/17318   Direct or substantially dir...
