SYSTEMS AND METHODS OF DE-DUPLICATING SIMILAR NEWS FEED ITEMS
First Claim
1. A method of de-duplicating similar news feed items, the method including:
assembling a set of news feed items from a plurality of electronic sources;
preprocessing the set to qualify some of the news feed items to return based on common company-name mentions and common token occurrences;
pairwise determining a resemblance measure for the qualified news feed items based on sequence alignment between news feed item pairs;
constructing a graph of news feed item pairs with the resemblance measure above a threshold and representing the resemblance measure as edges between nodes representing the news feed item pairs, thereby forming connected node pairs; and
determining similar news feed items by clustering the connected node pairs into strongly connected components.
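The graph and clustering steps of the claim can be illustrated with a minimal sketch. It makes several assumptions not stated in the claim: each node stands for one news feed item, an edge joins two items whose resemblance exceeds the threshold, and the resemblance measure is symmetric, so the edges are undirected and the strongly connected components coincide with ordinary connected components (computed here with union-find). The names `cluster_similar` and `token_overlap` are hypothetical, and `token_overlap` is only a placeholder for the sequence-alignment measure.

```python
from itertools import combinations


def cluster_similar(items, resemblance, threshold):
    """Group item indices whose pairwise resemblance exceeds the threshold."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(range(len(items)), 2):
        if resemblance(items[a], items[b]) > threshold:
            # An edge joins the two nodes, so merge their components.
            parent[find(a)] = find(b)

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def token_overlap(x, y):
    # Placeholder resemblance (Jaccard over lowercase tokens); an assumption
    # standing in for the sequence-alignment measure named in the claim.
    a, b = set(x.lower().split()), set(y.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


if __name__ == "__main__":
    feed = [
        "Acme buys Widgets Inc",
        "Acme buys Widgets Inc today",
        "Globex opens new plant",
    ]
    print(cluster_similar(feed, token_overlap, threshold=0.5))
```

Each returned cluster is a list of item indices; keeping one representative per cluster would give a de-duplicated feed.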
Abstract
The technology disclosed relates to de-duplicating contextually similar news feed items. In particular, it relates to assembling a set of news feed items from a plurality of electronic sources and preprocessing the set to generate normalized news feed items that share common company-name mentions and token occurrences. The normalized news feed items are used to calculate one or more resemblance measures based on a sequence alignment score and/or a hyperlink score. The sequence alignment score determines contextual similarity between news feed item pairs, arranged as sequences, based on a number of matching elements in the news feed item sequences and a number of edit operations, such as insertion, deletion, and substitution, required to match the news feed item sequences. The hyperlink score determines contextual similarity between news feed item pairs by comparing the respective search results retrieved in response to supplying the news feed item pairs to a search engine.
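As a rough illustration of the sequence alignment score described above, the sketch below counts the insertions, deletions, and substitutions needed to turn one token sequence into the other (Levenshtein distance) and normalizes the count by the longer sequence. The abstract does not give the scoring formula, so the whitespace tokenization, the normalization, and the names `edit_distance` and `alignment_score` are all assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution, or free on a match
            ))
        prev = curr
    return prev[-1]


def alignment_score(item_a, item_b):
    """Score in [0, 1]; 1.0 means the token sequences are identical."""
    a, b = item_a.lower().split(), item_b.lower().split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty items are treated as identical (assumption)
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    print(alignment_score("Acme buys Widgets", "Acme acquires Widgets"))
```

A pair scoring above a chosen threshold would then be treated as similar; the threshold value is left open in the source.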
54 Citations
20 Claims
1. A method of de-duplicating similar news feed items, the method including:
assembling a set of news feed items from a plurality of electronic sources;
preprocessing the set to qualify some of the news feed items to return based on common company-name mentions and common token occurrences;
pairwise determining a resemblance measure for the qualified news feed items based on sequence alignment between news feed item pairs;
constructing a graph of news feed item pairs with the resemblance measure above a threshold and representing the resemblance measure as edges between nodes representing the news feed item pairs, thereby forming connected node pairs; and
determining similar news feed items by clustering the connected node pairs into strongly connected components.
Dependent claims: 2, 3, 4, 5, 6

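The preprocessing step recited in claim 1 might look like the following sketch, which keeps only those item pairs that mention a common company and share a minimum number of tokens. The substring-based company matching, the `min_shared_tokens` parameter, and the function name `qualify_pairs` are assumptions made for illustration; the claim does not specify how mentions or tokens are detected.

```python
from itertools import combinations


def qualify_pairs(items, company_names, min_shared_tokens=2):
    """Return index pairs that mention a common company and share enough tokens."""
    profiles = []
    for text in items:
        lowered = text.lower()
        # Naive mention detection: case-insensitive substring match (assumption).
        companies = {c for c in company_names if c.lower() in lowered}
        tokens = set(lowered.split())
        profiles.append((companies, tokens))

    qualified = []
    for i, j in combinations(range(len(items)), 2):
        (ci, ti), (cj, tj) = profiles[i], profiles[j]
        if ci & cj and len(ti & tj) >= min_shared_tokens:
            qualified.append((i, j))
    return qualified


if __name__ == "__main__":
    feed = [
        "Acme buys Widgets Inc",
        "Acme to acquire Widgets Inc",
        "Globex opens new plant",
    ]
    print(qualify_pairs(feed, ["Acme", "Globex"]))
```

Filtering this way means the more expensive pairwise resemblance measure only has to run on pairs that could plausibly be duplicates.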
7. A method of de-duplicating similar news feed items, the method including:
assembling a set of news feed items from a plurality of electronic sources;
preprocessing the set to qualify some of the news feed items to return based on common company-name mentions and common token occurrences;
pairwise determining a resemblance measure for the qualified news feed items based on results returned in response to supplying news feed item pairs as search criteria;
constructing a graph of news feed item pairs with the resemblance measure above a threshold and representing the resemblance measure as edges between nodes representing the news feed item pairs, thereby forming connected node pairs; and
determining similar news feed items by clustering the connected node pairs into strongly connected components.
Dependent claims: 8, 9, 10, 11, 12, 13

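The search-based resemblance measure of claim 7 could be approximated as the overlap between the result sets returned for each item. In the sketch below, `search` is a hypothetical callable that takes an item's text and returns a ranked list of result URLs; the Jaccard overlap, the `top_k` cutoff, and the stand-in `fake_search` table with made-up result identifiers are assumptions, not details from the claim.

```python
def hyperlink_score(item_a, item_b, search, top_k=10):
    """Jaccard overlap of the top_k result URLs returned for each item."""
    results_a = set(search(item_a)[:top_k])
    results_b = set(search(item_b)[:top_k])
    union = results_a | results_b
    if not union:
        return 0.0  # no results for either item: treat as dissimilar
    return len(results_a & results_b) / len(union)


# Stand-in for a real search engine: a fixed table of made-up result IDs.
FAKE_INDEX = {
    "Acme buys Widgets Inc": ["u1", "u2", "u3"],
    "Acme to acquire Widgets Inc": ["u2", "u3", "u4"],
    "Globex opens new plant": ["u9"],
}


def fake_search(query):
    return FAKE_INDEX.get(query, [])


if __name__ == "__main__":
    print(hyperlink_score("Acme buys Widgets Inc",
                          "Acme to acquire Widgets Inc",
                          fake_search))
```

Passing the search function in as a parameter keeps the scoring logic independent of any particular search engine.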
14. A system of de-duplicating similar news feed items, the system including:
a processor and a computer readable storage medium storing computer instructions configured to cause the processor to:
assemble a set of news feed items from a plurality of electronic sources;
preprocess the set to qualify some of the news feed items to return based on common company-name mentions and common token occurrences;
pairwise determine a resemblance measure for the qualified news feed items based on sequence alignment between news feed item pairs;
construct a graph of news feed item pairs with the resemblance measure above a threshold and representing the resemblance measure as edges between nodes representing the news feed item pairs, thereby forming connected node pairs; and
determine similar news feed items by clustering the connected node pairs into strongly connected components.
Dependent claims: 15, 16, 17, 18, 19, 20

Specification