Systems and methods of de-duplicating similar news feed items
Abstract
The technology disclosed relates to de-duplicating contextually similar news feed items. In particular, it relates to assembling a set of news feed items from a plurality of electronic sources and preprocessing the set to generate normalized news feed items that share common company-name mentions and token occurrences. The normalized news feed items are used to calculate one or more resemblance measures based on a sequence alignment score and/or a hyperlink score. The sequence alignment score determines contextual similarity between news feed item pairs, arranged as sequences, based on a number of matching elements in the news feed item sequences and a number of edit operations, such as insertion, deletion, and substitution, required to match the news feed item sequences. The hyperlink score determines contextual similarity between news feed item pairs by comparing the respective search results retrieved in response to supplying the news feed item pairs to a search engine.
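For illustration only, the sequence alignment score described in the abstract can be sketched as a global alignment over token sequences. The names `tokenize` and `alignment_score`, the whitespace tokenizer, and the fixed reward and penalty values are assumptions made for this sketch; the claims describe a term penalty matrix rather than constants.

```python
def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def alignment_score(a, b, match=2, substitution=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch style) over two token lists.

    Matching tokens add `match`; a substitution adds `substitution`;
    an insertion or deletion adds `gap`. All three values are assumed.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per token.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else substitution)
            deletion = score[i - 1][j] + gap
            insertion = score[i][j - 1] + gap
            score[i][j] = max(diagonal, deletion, insertion)
    return score[-1][-1]


first = tokenize("Acme buys Widget Corp")
second = tokenize("Acme acquires Widget Corp")
raw = alignment_score(first, second)
```

A higher value means the two items need fewer edit operations to line up; in this sketch, identical four-token items score 8 and each substituted token lowers that by 3.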
208 Citations
20 Claims
1. A method of efficient de-duplicating similar news feed items, the method including:
assembling a set of news feed items from a plurality of electronic sources;
preprocessing the set to qualify some news feed items to return based on common company-name mentions and common token occurrences;
pairwise determining a resemblance measure for the qualified news feed items based on sequence alignment between news feed item pairs to calculate raw scores and boosted scores, including:
matching tokens from the news feed item pairs and whenever the tokens match, causing a raw score for the resemblance measure to reflect a positive match;
whenever two tokens mismatch, causing the raw score for the resemblance measure to reflect the mismatch using a term penalty matrix that assigns a negative penalty for an edit operation including insertion, deletion, and substitution;
augmenting the raw score for the resemblance measure to produce a boosted score for the resemblance measure by rewarding an n-gram including bigrams of two contiguous matching tokens and trigrams of three contiguous matching tokens, and responsive to existence of one or more factors including original distance, normalized distance, maximum string length, minimum string length, and longest consecutive matches; and
advancing to subsequent token positions in sequence;
after evaluating entire sequences of tokens in the qualified news feed items, constructing a graph of news feed item pairs with the resemblance measure above a threshold and representing the resemblance measure as edges between nodes representing the news feed item pairs, thereby forming connected node pairs; and
determining similar news feed items by clustering the connected node pairs into strongly connected components; and
wherein using the resemblance measure results in non-duplication of data entities holding news item data obtained from multiple sources.
Dependent claims: 2, 3, 4, 5, 6
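As a rough illustration of the graph and clustering steps recited above, the sketch below keeps only pairs whose resemblance measure exceeds a threshold and groups the resulting nodes. It assumes each edge is symmetric, in which case strongly connected components coincide with ordinary connected components, and it uses a union-find structure rather than any specific algorithm named in the claims. The function name `cluster_similar`, the integer item ids, and the sample scores are hypothetical.

```python
def cluster_similar(num_items, pair_scores, threshold):
    """Group item ids 0..num_items-1 into clusters of mutually linked items.

    pair_scores maps (i, j) tuples to a resemblance measure; a pair becomes
    an edge only when its measure is strictly above `threshold`.
    """
    parent = list(range(num_items))

    def find(x):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in pair_scores.items():
        if score > threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    clusters = {}
    for item in range(num_items):
        clusters.setdefault(find(item), []).append(item)
    return sorted(clusters.values())


# Hypothetical scores for six items; item 5 has no qualifying pair.
scores = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.2, (3, 4): 0.95}
groups = cluster_similar(6, scores, threshold=0.5)
# groups == [[0, 1, 2], [3, 4], [5]]
```

Items 0 and 2 land in the same cluster even though they were never scored directly against each other, because both are linked to item 1.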
7. A method of efficient de-duplicating similar news feed items, the method including:
assembling a set of news feed items from a plurality of electronic sources;
preprocessing the set to qualify some news feed items to return based on common company-name mentions and common token occurrences;
pairwise determining a resemblance measure for the qualified news feed items based on results returned in response to supplying news feed item pairs as search criteria, including:
matching tokens from the news feed item pairs and whenever the tokens match, a count is allocated to the resemblance measure or the resemblance measure is boosted when news feed item pairs appear in either's returned results, a bigram of two contiguous matching tokens, or a trigram of three contiguous matching tokens is detected when matching the news feed item pairs;
whenever two tokens mismatch, causing the resemblance measure to reflect the mismatch by reducing the resemblance measure by a count for an edit operation including insertion, deletion, and substitution; and
advancing to subsequent token positions in sequence;
after evaluating entire sequences of tokens in the qualified news feed items, constructing a graph of news feed item pairs with the resemblance measure above a threshold and representing the resemblance measure as edges between nodes representing the news feed item pairs, thereby forming connected node pairs; and
determining similar news feed items by clustering the connected node pairs into strongly connected components; and
wherein using the resemblance measure results in non-duplication of data entities holding news item data obtained from multiple sources.
Dependent claims: 8, 9, 10, 11, 12, 13
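The search-result comparison in claim 7 might be approximated as follows. This is only a sketch under stated assumptions: the search results for each item are taken as already-retrieved lists of URLs (no search engine is called), the overlap is measured with a Jaccard ratio, and a fixed boost is added when either item appears in the other's results. The Jaccard ratio, the `cross_boost` value, the cap at 1.0, and the name `hyperlink_score` are all choices made for this example, not details from the patent.

```python
def hyperlink_score(url_a, results_a, url_b, results_b, cross_boost=0.25):
    """Score two news items by how much their search results overlap.

    url_a / url_b: the items' own URLs (hypothetical identifiers).
    results_a / results_b: URLs returned when each item was used as a query.
    """
    set_a, set_b = set(results_a), set(results_b)
    union = set_a | set_b
    # Jaccard overlap of the two result sets (assumed similarity measure).
    overlap = len(set_a & set_b) / len(union) if union else 0.0
    # Boost when either item shows up in the other's returned results.
    appears = url_b in set_a or url_a in set_b
    return min(1.0, overlap + (cross_boost if appears else 0.0))


results_for_a = ["u1", "u2", "u3"]
results_for_b = ["u2", "u3", "u4"]
plain = hyperlink_score("a", results_for_a, "b", results_for_b)
boosted = hyperlink_score("a", results_for_a, "u1", results_for_b)
```

Here `plain` reflects only the shared results, while `boosted` is higher because the second item's URL appears among the first item's results.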
14. A system of de-duplicating similar news feed items, the system including:
a processor and a computer readable storage medium storing computer instructions configured to cause the processor to:
assemble a set of news feed items from a plurality of electronic sources;
preprocess the set to qualify some news feed items to return based on common company-name mentions and common token occurrences;
pairwise determine a resemblance measure for the qualified news feed items based on sequence alignment to calculate raw scores and boosted scores, including:
matching tokens from news feed item pairs and whenever the tokens match, causing a raw score for the resemblance measure to reflect a match;
whenever two tokens mismatch, causing the raw score for the resemblance measure to reflect the mismatch using a term penalty matrix that assigns a negative penalty for an edit operation including insertion, deletion, and substitution;
augmenting the raw score for the resemblance measure to produce a boosted score for the resemblance measure by rewarding an n-gram including bigrams of two contiguous matching tokens and trigrams of three contiguous matching tokens, and responsive to existence of one or more factors including original distance, normalized distance, maximum string length, minimum string length, and longest consecutive matches; and
advancing to subsequent token positions in sequence;
after evaluating entire sequences of tokens in the qualified news feed items, construct a graph of news feed item pairs with the resemblance measure above a threshold and representing the resemblance measure as edges between nodes representing the news feed item pairs, thereby forming connected node pairs; and
determine similar news feed items by clustering the connected node pairs into strongly connected components; and
wherein using the resemblance measure results in non-duplication of data entities holding news item data obtained from multiple sources.
Dependent claims: 15, 16, 17, 18, 19, 20
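The boosting step recited in the claims, which rewards bigrams and trigrams of contiguous matching tokens, could look roughly like the sketch below. It finds runs of contiguous matches with Python's standard `difflib.SequenceMatcher`, which is a substitute for the claimed sequence alignment and not the same procedure. The reward values and the name `ngram_boost` are assumptions, and of the listed factors only the longest consecutive match is reported.

```python
from difflib import SequenceMatcher


def ngram_boost(tokens_a, tokens_b, raw_score,
                bigram_reward=0.5, trigram_reward=1.0):
    """Return (boosted_score, longest_run) for two token lists.

    A run of n contiguous matching tokens contains n - 1 bigrams and
    n - 2 trigrams; each one adds its (assumed) reward to raw_score.
    """
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    bigrams = trigrams = longest = 0
    # The final block returned by difflib is a size-0 sentinel; it adds nothing.
    for block in matcher.get_matching_blocks():
        run = block.size
        bigrams += max(0, run - 1)
        trigrams += max(0, run - 2)
        longest = max(longest, run)
    boosted = raw_score + bigram_reward * bigrams + trigram_reward * trigrams
    return boosted, longest


a = "acme corp buys widget maker".split()
b = "acme corp acquires widget maker".split()
boosted_score, longest_run = ngram_boost(a, b, raw_score=0.0)
```

In this example the two matching runs ("acme corp" and "widget maker") each contribute one bigram and no trigram, so the boost depends on how the matches cluster together and not only on how many tokens match.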
Specification