Systems and methods of de-duplicating similar news feed items

US 9,984,166 B2
Filed: 10/10/2014
Issued: 05/29/2018
Est. Priority Date: 10/10/2014
Status: Active Grant

Abstract

The technology disclosed relates to de-duplicating contextually similar news feed items. In particular, it relates to assembling a set of news feed items from a plurality of electronic sources and preprocessing the set to generate normalized news feed items that share common company-name mentions and token occurrences. The normalized news feed items are used to calculate one or more resemblance measures based on a sequence alignment score and/or a hyperlink score. The sequence alignment score determines contextual similarity between news feed item pairs, arranged as sequences, based on a number of matching elements in the news feed item sequences and a number of edit operations, such as insertion, deletion, and substitution, required to match the news feed item sequences. The hyperlink score determines contextual similarity between news feed item pairs by comparing the respective search results retrieved in response to supplying the news feed item pairs to a search engine.
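
The preprocessing step described in the abstract can be illustrated with a minimal sketch. It assumes lowercase alphanumeric tokenization, a caller-supplied set of single-token company names, and a minimum of three shared tokens; the names `normalize`, `qualify_pairs`, and `min_common_tokens` are hypothetical and do not come from the patent.

```python
import re
from itertools import combinations

def normalize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def qualify_pairs(items, companies, min_common_tokens=3):
    """Return index pairs of items that share a company mention and enough common tokens.

    `companies` is a hypothetical set of lowercase, single-token company names.
    `min_common_tokens` is an assumed cutoff, not a value taken from the patent.
    """
    token_sets = [set(normalize(text)) for text in items]
    pairs = []
    for i, j in combinations(range(len(items)), 2):
        common = token_sets[i] & token_sets[j]
        if common & companies and len(common) >= min_common_tokens:
            pairs.append((i, j))
    return pairs

items = [
    "Acme acquires WidgetCo for $2 billion",
    "Acme to acquire WidgetCo in $2 billion deal",
    "Globex opens new office in Berlin",
]
print(qualify_pairs(items, companies={"acme", "widgetco", "globex"}))  # prints [(0, 1)]
```

Only the qualified pairs would then go on to the more expensive pairwise resemblance calculation.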

208 Citations

20 Claims

1. A method of efficient de-duplicating similar news feed items, the method including:
- assembling a set of news feed items from a plurality of electronic sources;
- preprocessing the set to qualify some news feed items to return based on common company-name mentions and common token occurrences;
- pairwise determining a resemblance measure for the qualified news feed items based on sequence alignment between news feed item pairs to calculate raw scores and boosted scores, including:
  - matching tokens from the news feed item pairs and whenever the tokens match, causing a raw score for the resemblance measure to reflect a positive match;
  - whenever two tokens mismatch, causing the raw score for the resemblance measure to reflect the mismatch using a term penalty matrix that assigns a negative penalty for an edit operation including insertion, deletion, and substitution;
  - augmenting the raw score for the resemblance measure to produce a boosted score for the resemblance measure by rewarding an n-gram including bigrams of two contiguous matching tokens and trigrams of three contiguous matching tokens, and responsive to existence of one or more factors including original distance, normalized distance, maximum string length, minimum string length, and longest consecutive matches; and
  - advancing to subsequent token positions in sequence;
- after evaluating entire sequences of tokens in the qualified news feed items, constructing a graph of news feed item pairs with the resemblance measure above a threshold and representing the resemblance measure as edges between nodes representing the news feed item pairs, thereby forming connected node pairs; and
- determining similar news feed items by clustering the connected node pairs into strongly connected components; and
- wherein using the resemblance measure results in non-duplication of data entities holding news item data obtained from multiple sources.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method of claim 1, wherein the news feed items are published within a predetermined time window prior to a current time.
  - 3. The method of claim 1, further including determining representative news feed items for the similar news feed items by identifying cluster heads of respective strongly connected components, wherein the cluster heads have highest degree of connectivity in the respective strongly connected components.
  - 4. The method of claim 1, further including pairwise determining the resemblance measure for the news feed items based on results returned in response to supplying the news feed item pairs as search criteria.
  - 5. The method of claim 4, wherein the results returned include at least one of:
    - unified resource locators (URLs) of web pages;
    - content of the web pages; or
    - metadata about the web pages.
  - 6. The method of claim 1, wherein preprocessing the set further includes removing stop tokens from the news feed items.
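
The raw and boosted scores recited in claim 1 can be sketched as a token-level global alignment followed by an n-gram bonus. This is an illustrative sketch only: the penalty values, the match reward, and the bonus weights are assumptions, and the boost is simplified to counting bigrams and trigrams that occur in both token sequences rather than tracking contiguous matches during the alignment itself. The other boosting factors named in the claim (normalized distance, string lengths, longest consecutive matches) are not modeled.

```python
# Assumed values; the patent does not fix these numbers.
PENALTY = {"insertion": -1, "deletion": -1, "substitution": -1}  # term penalty matrix
MATCH_REWARD = 1
NGRAM_BONUS = {2: 1, 3: 2}  # bonus per shared bigram / trigram

def raw_score(a, b):
    """Global alignment score over two token lists, keeping only one DP row at a time."""
    prev = [j * PENALTY["insertion"] for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * PENALTY["deletion"]]
        for j in range(1, len(b) + 1):
            step = MATCH_REWARD if a[i - 1] == b[j - 1] else PENALTY["substitution"]
            curr.append(max(prev[j - 1] + step,                 # match or substitution
                            prev[j] + PENALTY["deletion"],      # drop a token of a
                            curr[j - 1] + PENALTY["insertion"]))  # add a token of b
        prev = curr
    return prev[-1]

def ngrams(tokens, n):
    """Set of contiguous n-token tuples in the sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def boosted_score(a, b):
    """Raw score plus a bonus for every bigram and trigram shared by both sequences."""
    boost = sum(bonus * len(ngrams(a, n) & ngrams(b, n))
                for n, bonus in NGRAM_BONUS.items())
    return raw_score(a, b) + boost

a = "acme acquires widgetco for two billion".split()
b = "acme acquires widgetco in two billion deal".split()
print(raw_score(a, b), boosted_score(a, b))  # prints 3 8
```

In the example, five tokens match, one is substituted, and one is inserted, giving a raw score of 3; three shared bigrams and one shared trigram add 5.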

7. A method of efficient de-duplicating similar news feed items, the method including:
- assembling a set of news feed items from a plurality of electronic sources;
- preprocessing the set to qualify some news feed items to return based on common company-name mentions and common token occurrences;
- pairwise determining a resemblance measure for the qualified news feed items based on results returned in response to supplying news feed item pairs as search criteria, including:
  - matching tokens from the news feed item pairs and whenever the tokens match, a count is allocated to the resemblance measure or the resemblance measure is boosted when news feed item pairs appear in either's returned results, a bigram of two contiguous matching tokens, or a trigram of three contiguous matching tokens is detected when matching the news feed item pairs;
  - whenever two tokens mismatch, causing the resemblance measure to reflect the mismatch by reducing the resemblance measure by a count for an edit operation including insertion, deletion, and substitution; and
  - advancing to subsequent token positions in sequence;
- after evaluating entire sequences of tokens in the qualified news feed items, constructing a graph of news feed item pairs with the resemblance measure above a threshold and representing the resemblance measure as edges between nodes representing the news feed item pairs, thereby forming connected node pairs; and
- determining similar news feed items by clustering the connected node pairs into strongly connected components; and
- wherein using the resemblance measure results in non-duplication of data entities holding news item data obtained from multiple sources.
- Dependent claims (8, 9, 10, 11, 12, 13):
  - 8. The method of claim 7, wherein the news feed items are published within a predetermined time window prior to a current time.
  - 9. The method of claim 7, further including determining representative news feed items for the similar news feed items by identifying cluster heads of respective strongly connected components, wherein the cluster heads have highest degree of connectivity in the respective strongly connected components.
  - 10. The method of claim 7, wherein the results returned include at least one of:
    - unified resource locators (URLs) of web pages;
    - content of the web pages; or
    - metadata about the web pages.
  - 11. The method of claim 7, wherein the results returned in response to supplying a first news feed item as the search criteria include a second news feed item, further including augmenting the resemblance measure for the first news feed item and the second news feed item as paired.
  - 12. The method of claim 7, further including pairwise determining the resemblance measure for the news feed items based on sequence alignment between the news feed item pairs.
  - 13. The method of claim 7, wherein preprocessing the set further includes removing stop word tokens from the news feed items.
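
A search-result-based resemblance measure of the kind recited in claims 7, 10, and 11 might look like the following sketch. It assumes the result URLs for each news feed item have already been retrieved (no search engine API is modeled), uses Jaccard overlap of the two URL sets as the comparison, and adds a fixed bonus when one item appears in the other's results. The Jaccard choice, the `cross_bonus` value, and the function name `hyperlink_score` are all assumptions made for illustration.

```python
def hyperlink_score(urls_a, urls_b, item_url_a=None, item_url_b=None, cross_bonus=0.25):
    """Compare two items by the search results each one returned.

    urls_a / urls_b: result URLs returned when each item was used as search criteria.
    item_url_a / item_url_b: optional URLs of the items themselves, used to detect
    whether one item shows up in the other's results.
    """
    set_a, set_b = set(urls_a), set(urls_b)
    union = set_a | set_b
    # Jaccard overlap of the two result sets (assumed comparison method).
    score = len(set_a & set_b) / len(union) if union else 0.0
    # Augment the measure when either item appears in the other's results.
    if item_url_b in set_a or item_url_a in set_b:
        score += cross_bonus
    return score

results_a = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]
results_b = ["https://example.com/2", "https://example.com/3", "https://example.com/4"]
print(hyperlink_score(results_a, results_b))  # prints 0.5
print(hyperlink_score(results_a, results_b, item_url_b="https://example.com/1"))  # prints 0.75
```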

14. A system of de-duplicating similar news feed items, the system including:
- a processor and a computer readable storage medium storing computer instructions configured to cause the processor to:
  - assemble a set of news feed items from a plurality of electronic sources;
  - preprocess the set to qualify some news feed items to return based on common company-name mentions and common token occurrences;
  - pairwise determine a resemblance measure for the qualified news feed items based on sequence alignment to calculate raw scores and boosted scores, including:
    - matching tokens from news feed item pairs and whenever the tokens match, causing a raw score for the resemblance measure to reflect a match;
    - whenever two tokens mismatch, causing the raw score for the resemblance measure to reflect the mismatch using a term penalty matrix that assigns a negative penalty for an edit operation including insertion, deletion, and substitution;
    - augmenting the raw score for the resemblance measure to produce a boosted score for the resemblance measure by rewarding an n-gram including bigrams of two contiguous matching tokens and trigrams of three contiguous matching tokens, and responsive to existence of one or more factors including original distance, normalized distance, maximum string length, minimum string length, and longest consecutive matches; and
    - advancing to subsequent token positions in sequence;
  - after evaluating entire sequences of tokens in the qualified news feed items, construct a graph of news feed item pairs with the resemblance measure above a threshold and representing the resemblance measure as edges between nodes representing the news feed item pairs, thereby forming connected node pairs; and
  - determine similar news feed items by clustering the connected node pairs into strongly connected components; and
- wherein using the resemblance measure results in non-duplication of data entities holding news item data obtained from multiple sources.
- Dependent claims (15, 16, 17, 18, 19, 20):
  - 15. The system of claim 14, wherein the news feed items are published within a predetermined time window prior to a current time.
  - 16. The system of claim 14, further configured to determine representative news feed items for the similar news feed items by identifying cluster heads of respective strongly connected components, wherein the cluster heads have highest degree of connectivity in the respective strongly connected components.
  - 17. The system of claim 14, further configured to pairwise determine the resemblance measure for the news feed items based on results returned in response to supplying the news feed item pairs as search criteria.
  - 18. The system of claim 17, wherein the results returned include at least one of:
    - unified resource locators of web pages;
    - content of the web pages; or
    - metadata about the web pages.
  - 19. The system of claim 17, wherein the results returned in response to supplying a first news feed item as the search criteria include a second news feed item, further including augmenting the resemblance measure for the first news feed item and the second news feed item as paired.
  - 20. The system of claim 14, wherein preprocessing the set further includes removing stop word tokens from the news feed items.
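
The graph construction, clustering, and cluster-head selection recited in the claims can be sketched as follows. The sketch treats the resemblance graph as undirected, in which case strongly connected components coincide with ordinary connected components. The function name `cluster`, the input format, and the tie-breaking rule for cluster heads (lowest identifier wins) are assumptions, not details from the patent.

```python
from collections import defaultdict

def cluster(scores, threshold):
    """Group items into connected components and pick a cluster head for each.

    scores: dict mapping (item_a, item_b) pairs to a resemblance measure.
    Returns a list of (head, sorted_members) tuples, one per component.
    """
    # Keep only pairs whose resemblance measure is above the threshold.
    adjacency = defaultdict(set)
    for (i, j), score in scores.items():
        if score > threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)

    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        # Cluster head: the member with the highest degree of connectivity.
        head = max(sorted(component), key=lambda n: len(adjacency[n]))
        clusters.append((head, sorted(component)))
    return clusters

scores = {("a", "b"): 8, ("b", "c"): 7, ("a", "c"): 2, ("d", "e"): 9, ("e", "f"): 1}
print(cluster(scores, threshold=5))
# prints [('b', ['a', 'b', 'c']), ('d', ['d', 'e'])]
```

Item `f` has no edge above the threshold, so it never enters the graph and would be kept as a unique item; each cluster head could serve as the representative news feed item for its group.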

Current Assignee: Salesforce.com, Inc.
Original Assignee: Salesforce.com, Inc.
Inventors: Even-Zohar, Yair; Tsur, Elad
Primary Examiner(s): Uddin, Mohammed R
Application Number: US14/512,215
Publication Number: US 20160103916A1
Time in Patent Office: 1,327 Days
Field of Search: 707738, 707727, 707748, 707749, 707750, 707999101
CPC Class Codes: G06F 16/9535 Search customisation based ...
