Computer implemented methods and apparatus for identifying similar labels using collaborative filtering

US 9,465,828 B2
Filed: 01/22/2014
Issued: 10/11/2016
Est. Priority Date: 01/22/2013
Status: Active Grant

Abstract

Disclosed are methods, apparatus, systems, and computer-readable storage media for identifying similar labels. In some implementations, one or more servers maintain a plurality of data entries in one or more database tables storing textual data, each data entry of a first portion of the data entries including: a text sequence, a label, and a text-to-label association score, and each data entry of a second portion of the data entries including: a first label, a second label, and a similarity score. The one or more servers analyze the data of the first portion of data entries to generate one or more pairs, each pair including information identifying a first label and a second label. The one or more servers calculate a similarity score for each of the one or more pairs and store the respective similarity scores in the second portion of the data entries.
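
The pipeline summarized in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the in-memory dictionaries standing in for the two portions of data entries, the sample labels and scores, and the names `association_scores`, `label_vectors`, `cosine_similarity`, and `build_similarity_entries` are all hypothetical. Cosine similarity is used because dependent claim 2 names it as one form of the collaborative filtering similarity score.

```python
import math
from collections import defaultdict
from itertools import combinations

# First portion of data entries (hypothetical in-memory stand-in):
# (text sequence, label) -> text-to-label association score.
association_scores = {
    ("refund", "#billing"): 4,
    ("invoice", "#billing"): 2,
    ("refund", "#payments"): 3,
    ("invoice", "#payments"): 1,
    ("login", "#auth"): 5,
}

def label_vectors(entries):
    """Group association scores into one sparse vector per label."""
    vectors = defaultdict(dict)
    for (sequence, label), score in entries.items():
        vectors[label][sequence] = score
    return vectors

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(seq, 0) for seq, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def build_similarity_entries(entries):
    """Second portion of data entries: (first label, second label) -> score."""
    vectors = label_vectors(entries)
    similarity = {}
    for a, b in combinations(sorted(vectors), 2):
        score = cosine_similarity(vectors[a], vectors[b])
        # Store both orderings so lookups by first label are direct.
        similarity[(a, b)] = score
        similarity[(b, a)] = score
    return similarity

similarity_entries = build_similarity_entries(association_scores)
```

In this sample data, `#billing` and `#payments` share the text sequences "refund" and "invoice" in similar proportions, so their score is close to 1, while `#auth` shares no text sequence with either and scores 0.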

200 Citations

19 Claims

1. A system for identifying similar labels, the system comprising:
- a database system implemented using a server system comprising one or more hardware processors, the database system configurable to cause:
  
  maintaining, through one or more databases, a plurality of data entries, each data entry of a first portion of the data entries identifying:
  
  a text sequence, a label, and a text-to-label association score indicating a number of times that the text sequence appears in one or more previous incoming texts associated with the label, and each data entry of a second portion of the data entries identifying:
  
  a first label, a second label, and a similarity score;
  
  generating a plurality of pairs based on the first portion of data entries, each pair comprising information identifying a first label and a second label;
  
  calculating a similarity score for each of the pairs comprising calculating a collaborative filtering similarity score for the first label and the second label identified by the pair using a first vector of text sequences associated with the first label and a second vector of text sequences associated with the second label, wherein a text sequence is associated with a label when the text sequence appears in a previous incoming text associated with the label; and
  
  updating the second portion of the data entries to identify the pairs and the respective similarity scores;
  
  processing a request for labels having similar associated text sequences;
  
  identifying, based on the pairs and the respective similarity scores, a set of pairs having the same first label; and
  
  selecting a pair of the identified set of pairs as having a higher respective similarity score than one or more other pairs of the identified set of pairs.
- Dependent claims (2 to 17):
  - 2. The system of claim 1, wherein the collaborative filtering similarity score comprises a cosine similarity score.
  - 3. The system of claim 1, the database system further configurable to cause:
    - determining that the similarity score for the selected pair is higher than a designated similarity threshold; and
      
      transmitting the first label and the second label of the selected pair to a computing device as a similar pair of labels.
  - 4. The system of claim 3, wherein the similar pair of labels is transmitted to the computing device with one or more other pairs of labels.
  - 5. The system of claim 1, wherein the similar pair of labels is selected in response to receiving from the computing device a request for similar pairs of labels.
  - 6. The system of claim 1, the database system further configurable to cause:
    - responsive to a request to associate a label with an incoming text, updating the first portion of data entries with one or more text sequences of the incoming text and the requested label.
  - 7. The system of claim 6, the database system further configurable to cause:
    - for each text sequence of the incoming text:
      
      identifying or creating a first data entry of the first portion of data entries that includes the text sequence and the requested label; and
      
      incrementing the text-to-label association score of the identified or created first data entry by an inflation factor.
  - 8. The system of claim 7, wherein the inflation factor has a value based on a measure of time.
  - 9. The system of claim 1, the database system further configurable to cause:
    - periodically identifying one or more incoming texts and associated labels, wherein each incoming text includes one or more text sequences; and
      
      for each incoming text and associated label, updating the first portion of data entries with the one or more text sequences of the incoming text and the associated label.
  - 10. The system of claim 9, wherein the incoming texts are text documents, wherein a label is a file name of a text document, and wherein the similarity scores identify text documents that are different versions of the same original text document.
  - 11. The system of claim 1, wherein an incoming text is one of:
    - a social media message, a text document, a customer relationship management (CRM) object including textual data, a feed item, and an article.
  - 12. The system of claim 1, wherein a text sequence includes one or more words.
  - 13. The system of claim 1, wherein a label is a user-generated topic assigned to a social media message or a CRM object having textual data.
  - 14. The system of claim 1, wherein a label is a hashtag.
  - 15. The system of claim 1, wherein the similarity score of a pair including a first label and a second label is based at least in part on a frequency of one or more text sequences appearing in a first one or more previous incoming texts associated with the first label and in a second one or more previous incoming texts associated with the second label.
  - 16. The system of claim 1, wherein the similarity score of a pair is normalized for a frequency at which the text sequences associated with the labels are used in incoming texts.
  - 17. The system of claim 1, wherein calculating the similarity score for each of the one or more pairs comprises using a collaborative filtering method to determine cosine similarity scores for the pairs, and wherein the second portion of the data entries is a collaborative filter table having the one or more pairs recorded therein based on the analysis.
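
Claims 3 and 6 through 8 describe two further steps: incrementing text-to-label association scores by an inflation factor when a label is applied to an incoming text, and returning the selected pair only when its score exceeds a designated threshold. A minimal sketch follows. The claims do not say how the inflation factor depends on time, so the exponential growth used here is an assumption, as are whitespace tokenization, the constant `HALF_LIFE_DAYS`, the sample scores, and the function names `inflation_factor`, `record_label`, and `most_similar`.

```python
from collections import defaultdict

HALF_LIFE_DAYS = 30.0  # assumed; the claims give no specific value

def inflation_factor(days_since_epoch):
    """Assumed time-based factor: newer texts contribute larger increments."""
    return 2.0 ** (days_since_epoch / HALF_LIFE_DAYS)

def record_label(entries, incoming_text, label, days_since_epoch):
    """Update the first portion of data entries for a newly labeled text."""
    factor = inflation_factor(days_since_epoch)
    # Assumption: one text sequence per whitespace-separated word.
    for sequence in incoming_text.lower().split():
        # defaultdict creates the entry if it does not exist yet.
        entries[(sequence, label)] += factor

def most_similar(similarity_entries, first_label, threshold):
    """Return the best (first, second, score) pair above threshold, or None."""
    candidates = [
        (pair, score)
        for pair, score in similarity_entries.items()
        if pair[0] == first_label
    ]
    if not candidates:
        return None
    (first, second), score = max(candidates, key=lambda item: item[1])
    return (first, second, score) if score > threshold else None

entries = defaultdict(float)
record_label(entries, "Refund my invoice", "#billing", days_since_epoch=0)
record_label(entries, "refund please", "#billing", days_since_epoch=30)

# Hypothetical similarity scores, as if already computed and stored.
similarity_entries = {
    ("#billing", "#payments"): 0.99,
    ("#billing", "#auth"): 0.10,
    ("#auth", "#billing"): 0.10,
}
best = most_similar(similarity_entries, "#billing", threshold=0.5)
```

Growing the increment over time, rather than decaying stored scores, lets older entries lose relative weight without being rewritten; this is one possible reading of claim 8, not something the claim text specifies.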

18. One or more computing devices for identifying similar labels to a user, the one or more computing devices comprising:
- one or more hardware processors configurable to cause:
  
  maintaining, by one or more servers, a plurality of data entries, each data entry of a first portion of the data entries identifying:
  
  a text sequence, a label, and a text-to-label association score indicating a number of times that the text sequence appears in one or more previous incoming texts associated with the label, and each data entry of a second portion of the data entries identifying:
  
  a first label, a second label, and a similarity score;
  
  generating a plurality of pairs based on the first portion of data entries, each pair comprising information identifying a first label and a second label;
  
  calculating a similarity score for each of the pairs comprising calculating a collaborative filtering similarity score for the first label and the second label identified by the pair using a first vector of text sequences associated with the first label and a second vector of text sequences associated with the second label, wherein a text sequence is associated with a label when the text sequence appears in a previous incoming text associated with the label; and
  
  updating the second portion of the data entries to identify the pairs and the respective similarity scores;
  
  processing a request for labels having similar associated text sequences;
  
  identifying, based on the pairs and the respective similarity scores, a set of pairs having the same first label; and
  
  selecting a pair of the identified set of pairs as having a higher respective similarity score than one or more other pairs of the identified set of pairs.

19. A non-transitory computer-readable storage medium storing instructions executable by a computing device for identifying similar labels to a user, the instructions being configurable to cause:
- maintaining, through one or more databases, a plurality of data entries, each data entry of a first portion of the data entries identifying:
  
  a text sequence, a label, and a text-to-label association score indicating a number of times that the text sequence appears in one or more previous incoming texts associated with the label, and each data entry of a second portion of the data entries identifying:
  
  a first label, a second label, and a similarity score;
  
  generating a plurality of pairs based on the first portion of data entries, each pair comprising information identifying a first label and a second label;
  
  calculating a similarity score for each of the pairs comprising calculating a collaborative filtering similarity score for the first label and the second label identified by the pair using a first vector of text sequences associated with the first label and a second vector of text sequences associated with the second label, wherein a text sequence is associated with a label when the text sequence appears in a previous incoming text associated with the label; and
  
  updating the second portion of the data entries to identify the pairs and the respective similarity scores;
  
  processing a request for labels having similar associated text sequences;
  
  identifying, based on the pairs and the respective similarity scores, a set of pairs having the same first label; and
  
  selecting a pair of the identified set of pairs as having a higher respective similarity score than one or more other pairs of the identified set of pairs.

Current Assignee: Salesforce.com, Inc.
Original Assignee: Salesforce.com, Inc.
Inventors: Meng, Xiao; Palmert, Joel
Primary Examiner(s): Arjomandi, Noosha
Application Number: US14/161,232
Publication Number: US 20140207777A1
Time in Patent Office: 993 Days
Field of Search: 707/737
US Class Current: 1/1
CPC Class Codes:
G06F 16/22 Indexing; Data structures t...
G06F 16/90335 Query processing
