Large scale item representation matching
Abstract
A two-phase process quickly and accurately identifies representations of the same items within a collection of item representations. In the first phase, referred to as a “blocking phase,” frequency information indicating the frequency with which terms appear within the collection of item representations is used to quickly identify “candidate pairs” (i.e., pairs of item representations that have a relatively high probability of matching). The blocking phase results in a reduced subset of the data for further analysis during the second phase. In the second phase, referred to as a “matching phase,” the candidate pairs are analyzed using fuzzy matching functions to accurately identify “matching pairs” (i.e., representations of the same items).
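The two phases described above can be illustrated with a minimal Python sketch. It is not the patented implementation. The whitespace tokenizer, the log(N / df) form of the IDF score, the use of `difflib.SequenceMatcher` as the fuzzy matching function, the sample data, and all function names are assumptions made for illustration only.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample collection of item representations.
ITEMS = [
    "Canon PowerShot SD1000 digital camera",
    "canon powershot sd1000 camera silver",
    "Nikon Coolpix L10 digital camera",
    "Sony Cyber-shot W55 digital camera",
]


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return set(text.lower().split())


def idf_scores(items):
    """Assumed IDF form: log(N / number of items containing the term)."""
    doc_freq = {}
    for text in items:
        for term in tokenize(text):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    n = len(items)
    return {term: math.log(n / df) for term, df in doc_freq.items()}


def blocking_phase(items, threshold):
    """Return index pairs whose shared terms have an aggregate IDF >= threshold."""
    idf = idf_scores(items)
    tokens = [tokenize(text) for text in items]
    candidates = []
    for i, j in combinations(range(len(items)), 2):
        # Add the IDF scores of the terms the two representations share.
        aggregate = sum(idf[term] for term in tokens[i] & tokens[j])
        if aggregate >= threshold:
            candidates.append((i, j))
    return candidates


def matching_phase(items, candidates, min_ratio):
    """Keep candidate pairs whose string similarity ratio is >= min_ratio."""
    matches = []
    for i, j in candidates:
        ratio = SequenceMatcher(None, items[i].lower(), items[j].lower()).ratio()
        if ratio >= min_ratio:
            matches.append((i, j))
    return matches


candidate_pairs = blocking_phase(ITEMS, threshold=1.0)
matching_pairs = matching_phase(ITEMS, candidate_pairs, min_ratio=0.7)
```

In this sample, a term such as "camera" appears in every item and so contributes an IDF of zero, while rarer terms such as "sd1000" dominate the aggregate score. For brevity the sketch scores every pair directly, so it shows how pairs are scored rather than how the blocking phase avoids exhaustive comparison.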
14 Claims
1. A computerized method for matching item representations within a collection of item representations, the method comprising:
- determining candidate pairs of item representations based on frequency information indicative of the frequency at which terms appear in the collection of item representations, wherein the frequency information comprises an IDF score determined for each term based on the respective term's frequency of use within the collection of item representations, and wherein determining candidate pairs comprises determining an aggregate IDF score for pairs of item representations by adding the IDF scores for terms shared by each pair of item representations and comparing the aggregate IDF score for each pair of item representations against a threshold to determine if each pair of item representations qualifies as a candidate pair; and
- matching item representations by analyzing the candidate pairs using one or more fuzzy matching functions.
Dependent claims: 2, 3, 4, 5, 6, 7
8. One or more computer-readable storage media embodying computer-useable instructions for performing a method of matching item representations from a collection of item representations, the method comprising:
- extracting terms from the collection of item representations;
- determining frequency information indicative of the frequency with which the terms appear within the collection of item representations, wherein the frequency information comprises an IDF score calculated for each term;
- generating an inverted index mapping the terms to the item representations in which the terms appear, wherein the inverted index further includes the frequency information for the terms;
- determining one or more candidate pairs of item representations using the inverted index based on terms shared between item representations and frequency information associated with the terms by determining aggregate IDF scores for pairs of item representations based on terms shared between each pair of item representations and comparing the aggregate IDF scores against a threshold; and
- identifying one or more matching pairs of item representations by analyzing the candidate pairs using one or more fuzzy matching algorithms.
Dependent claims: 9, 10, 11, 12
13. A computerized system including one or more computer-readable media embodying software components for matching item representations from a collection of item representations, the software components comprising:
- a blocking component that identifies candidate pairs of item representations based on frequency information associated with terms shared between the candidate pairs, wherein the blocking component identifies candidate pairs of item representations by determining aggregate IDF scores based on the frequency information associated with terms shared between pairs of item representations and comparing the aggregated IDF scores against a threshold; and
- a matching component that identifies matching pairs of item representations by analyzing the candidate pairs using one or more fuzzy matching algorithms.
Dependent claims: 14
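The inverted-index step recited in claim 8 can be sketched as follows. This is a hedged illustration, not the claimed implementation. The index layout (term mapped to an IDF score and a posting list), the log(N / df) IDF form, the whitespace tokenizer, and the names `build_inverted_index` and `candidate_pairs_from_index` are all assumptions. The point of the sketch is that aggregate IDF scores are accumulated only for pairs that share at least one term, so pairs with nothing in common are never visited.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_inverted_index(items):
    """Map each term to (idf, sorted ids of the items that contain it)."""
    postings = defaultdict(set)
    for item_id, text in enumerate(items):
        for term in text.lower().split():  # assumed whitespace tokenizer
            postings[term].add(item_id)
    n = len(items)
    return {
        term: (math.log(n / len(ids)), sorted(ids))  # assumed IDF form
        for term, ids in postings.items()
    }


def candidate_pairs_from_index(index, threshold):
    """Accumulate aggregate IDF per pair by walking each term's posting list."""
    aggregate = defaultdict(float)
    for idf, ids in index.values():
        if idf <= 0.0:
            # A term present in every item adds nothing to any pair's score.
            continue
        for pair in combinations(ids, 2):
            aggregate[pair] += idf
    return sorted(pair for pair, score in aggregate.items() if score >= threshold)
```

A usage example under the same assumptions: calling `candidate_pairs_from_index(build_inverted_index(items), threshold)` on a list of item strings returns the sorted list of candidate index pairs, which would then be passed to whichever fuzzy matching algorithm is chosen for the matching step.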
Specification