Systems and methods of de-duplicating similar news feed items

  • US 9,984,166 B2
  • Filed: 10/10/2014
  • Issued: 05/29/2018
  • Est. Priority Date: 10/10/2014
  • Status: Active Grant
First Claim

1. A method of efficient de-duplicating similar news feed items, the method including:

    assembling a set of news feed items from a plurality of electronic sources;

    preprocessing the set to qualify some news feed items to return based on common company-name mentions and common token occurrences;

    pairwise determining a resemblance measure for the qualified news feed items based on sequence alignment between news feed item pairs to calculate raw scores and boosted scores, including:

    matching tokens from the news feed item pairs and whenever the tokens match, causing a raw score for the resemblance measure to reflect a positive match;

    whenever two tokens mismatch, causing the raw score for the resemblance measure to reflect the mismatch using a term penalty matrix that assigns a negative penalty for an edit operation including insertion, deletion, and substitution;

    augmenting the raw score for the resemblance measure to produce a boosted score for the resemblance measure by rewarding an n-gram including bigrams of two contiguous matching tokens and trigrams of three contiguous matching tokens, and responsive to existence of one or more factors including original distance, normalized distance, maximum string length, minimum string length, and longest consecutive matches; and

    advancing to subsequent token positions in sequence;

    after evaluating entire sequences of tokens in the qualified news feed items, constructing a graph of news feed item pairs with the resemblance measure above a threshold and representing the resemblance measure as edges between nodes representing the news feed item pairs, thereby forming connected node pairs; and

    determining similar news feed items by clustering the connected node pairs into strongly connected components; and

    wherein using the resemblance measure results in non-duplication of data entities holding news item data obtained from multiple sources.
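
The scoring and clustering steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. Every numeric value (`MATCH`, `GAP`, `SUBST`, the n-gram bonuses, the `0.8` threshold) and every function name is a hypothetical choice made for illustration, since the claim specifies none of them. The sketch also makes three simplifying assumptions: the raw score uses a standard global (Needleman-Wunsch style) alignment with a single flat penalty per edit operation rather than a full term penalty matrix, the boost counts bigrams and trigrams shared by the two items, and the only boosting factor applied is division by maximum string length. Because the graph here is undirected, its strongly connected components coincide with ordinary connected components, which the sketch finds with union-find.

```python
from itertools import combinations

# Illustrative values only; the claim does not specify any of these numbers.
MATCH, GAP, SUBST = 2.0, -1.0, -1.0
BIGRAM_BONUS, TRIGRAM_BONUS = 1.0, 2.0


def raw_score(a, b):
    """Global alignment score of two token lists (Needleman-Wunsch style)."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * GAP]
        for j in range(1, len(b) + 1):
            # Match rewards; substitution, insertion and deletion penalize.
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else SUBST)
            cur.append(max(diag, prev[j] + GAP, cur[j - 1] + GAP))
        prev = cur
    return prev[-1]


def ngrams(tokens, n):
    """Set of contiguous n-token runs in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def boosted_score(a, b):
    """Raw score plus bigram/trigram rewards, normalized by max length."""
    score = raw_score(a, b)
    score += BIGRAM_BONUS * len(ngrams(a, 2) & ngrams(b, 2))
    score += TRIGRAM_BONUS * len(ngrams(a, 3) & ngrams(b, 3))
    return score / (MATCH * max(len(a), len(b), 1))


def cluster(items, threshold=0.8):
    """Group item indices whose pairwise boosted score exceeds threshold."""
    tokens = [text.lower().split() for text in items]
    parent = list(range(len(items)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # An edge exists between two items when their score is above threshold.
    for i, j in combinations(range(len(items)), 2):
        if boosted_score(tokens[i], tokens[j]) > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    feed = [
        "Acme Corp buys widget maker",
        "Acme Corp buys widget maker today",
        "Stocks fall on rate fears",
    ]
    print(cluster(feed))
```

In this sketch the first two headlines land in one cluster and the third stays on its own, so a caller could keep a single representative item per cluster. The preprocessing step (qualifying items by common company-name mentions and common token occurrences) is omitted here; the pairwise loop simply compares every item with every other.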
