Recommending content using discriminatively trained document similarity

US 8,027,977 B2
Filed: 06/20/2007
Issued: 09/27/2011
Est. Priority Date: 06/20/2007
Status: Active Grant

2 Assignments

0 Petitions

Abstract

A generalized discriminative training framework for reconciling the training and evaluation objectives for document similarity is provided. Prior information about document relations and non-relations is used to discriminatively train an ensemble of document similarity classification models. The result is a model set that can be used to compute similarity between seen documents in the training sets and new documents. The measure of similarity forms the basis for recommending documents to a user and for obtaining metadata, such as keywords and tags, for new documents that lack such information.

29 Citations

19 Claims

1. A method for training document similarity models, the method comprising:
- obtaining a set of training samples;
- obtaining prior information of document relations and non-relations for the set of training samples, wherein the prior information of document relations comprises information indicating that two or more documents in the set of training samples are considered related to each other, and wherein the prior information of document non-relations comprises information indicating that two or more documents in the set of training samples are not considered related to each other; and
- discriminatively training an ensemble of document similarity classification models using the set of training samples and using the prior information of document relations and non-relations using a processor of a computer, wherein the ensemble of document similarity classification models are discriminatively trained based at least in part on prior information of non-relation between a first document and a second document in the set of training samples such that a first classification model configured to determine document similarity with respect to the first document does not compete with a second classification model configured to determine document similarity with respect to the second document.

2. The method of claim 1 and further comprising:
- applying discriminative training framework to probabilistic latent semantic analysis to create a discriminatively trained probabilistic latent semantic analysis similarity measure.

3. The method of claim 2 and further comprising:
- applying discriminative training framework to latent semantic analysis to create a discriminatively trained latent semantic analysis similarity measure.

4. The method of claim 1 wherein training includes minimizing an expected number of errors for the ensemble of document similarity classification models for the set of training samples.

5. The method of claim 4 wherein minimizing an expected number of errors for the ensemble of document similarity classification models comprises:
- forming a set of individual class loss functions where each class is modeled by document model;
- obtaining the training set using a set of target document word vectors, a set of training document word vectors and a document similarity matrix;
- initially setting values of target document models of the ensemble; and
- while a stopping criteria has not been met, iterate where each training iteration includes for each document;
- computing a set of related documents;
- for each related document, assume a word vector for the document belongs to a related class; and
- for each document model in ensemble compute new model parameters; and
- update individual document models.
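
The iterative procedure recited in claim 5 can be illustrated with a small sketch. This is not the patented method: it assumes one linear model per target document, a dot-product score, a sigmoid-smoothed error count as the loss, and plain gradient descent. The names `train_ensemble`, `relation`, `lr`, `max_iters`, and `tol` are hypothetical and chosen only for this illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length word vectors."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_ensemble(targets, samples, relation, lr=0.5, max_iters=200, tol=1e-4):
    """Train one linear model per target document (illustrative assumption).

    targets:  word vectors of the target documents, one model each.
    samples:  word vectors of the training documents.
    relation: relation[i][j] is +1 if sample j is known to be related to
              target i, -1 if known to be unrelated, 0 if nothing is known.
    """
    # Initially set each model to a copy of its target document's word vector.
    models = [list(t) for t in targets]
    prev_loss = float("inf")
    for _ in range(max_iters):
        loss = 0.0
        grads = [[0.0] * len(m) for m in models]
        for j, x in enumerate(samples):
            for i, w in enumerate(models):
                y = relation[i][j]
                if y == 0:
                    continue  # no prior information: the pair adds no loss
                # Smoothed error: near 1 when misclassified, near 0 when correct.
                err = sigmoid(-y * dot(w, x))
                loss += err
                coeff = -y * err * (1.0 - err)  # derivative of err w.r.t. the score
                for k, xk in enumerate(x):
                    grads[i][k] += coeff * xk
        # Update every document model after the full pass.
        for i, w in enumerate(models):
            for k in range(len(w)):
                w[k] -= lr * grads[i][k]
        # Stopping criterion: the smoothed error count has stopped changing.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return models
```

Summing the smoothed errors approximates the expected number of errors named in claim 4. Pairs with no prior information are skipped, so only known relations and non-relations shape each model.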

6. A document recommendation system comprising:
- a set of positive documents determined to be of interest to a user;
- a set of negative documents determined to not be of interest to the user;
- a plurality of candidate documents; and
- a module configured to calculate similarity scores of each document in the set of positive documents relative to the plurality of candidate documents and to calculate similarity scores of each document in the set of negative documents relative to the plurality of candidate documents, and wherein the module receives a new document apart from the plurality of candidate documents, calculates a similarity score, using a processor, of the new document relative to each of the plurality of candidate documents using a measure of discriminatively trained similarity associated with each of the plurality of candidate documents, and outputs a reference to at least one of the plurality of candidate documents based on the calculated similarity scores.

7. The document recommendation system of claim 6, wherein the new document comprises at least one of an audio, video, and image data.

8. The document recommendation system of claim 6 wherein the module is configured to output a reference to at least one of the candidate documents that are considered similar based on the calculated similarity scores of the new document relative to each of the candidate documents.

9. The document recommendation system of claim 6 wherein the module is configured to receive an input from the user indicative of whether the user has an interest or not in the at least one referenced candidate document, and wherein each document of the at least one referenced candidate document is added to the set of positive documents if the user has an interest in the document, or added to the set of negative documents if the user has no interest in the document.

10. The document recommendation system of claim 9 wherein the module is configured to discard contents of the set of positive documents and the set of negative documents and add a currently rendered document to the set of positive documents.

11. The document recommendation system of claim 6 wherein the measure of similarity is based on Latent Semantic Analysis (LSA) or Probabilistic Latent Semantic Analysis (PLSA).

12. The document recommendation system of claim 6 wherein the module is configured to render a recommended document on a first display area of a monitor and render a list of further recommended documents on a second display area of the monitor.

13. The document recommendation system of claim 6 wherein the module is configured to select an advertisement from a set of advertisements based on a recommended document.
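
The positive-set and negative-set scoring of claim 6, together with the feedback loop of claim 9, might be sketched as follows. Everything here is an assumption made for illustration: plain cosine similarity stands in for the discriminatively trained measure, candidates are ranked by mean similarity to the positive set minus mean similarity to the negative set, and the names `recommend` and `record_feedback` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; a stand-in for a trained similarity measure."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mean_similarity(vectors, candidate, similarity):
    """Average similarity of a candidate to a set of documents (0 if empty)."""
    if not vectors:
        return 0.0
    return sum(similarity(v, candidate) for v in vectors) / len(vectors)


def recommend(candidates, positives, negatives, top_k=1, similarity=cosine):
    """Return the names of the top_k candidates.

    candidates: dict mapping a document name to its word vector.
    positives / negatives: lists of word vectors the user liked / disliked.
    """
    scored = []
    for name, vec in candidates.items():
        score = (mean_similarity(positives, vec, similarity)
                 - mean_similarity(negatives, vec, similarity))
        scored.append((score, name))
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


def record_feedback(vec, interested, positives, negatives):
    """Add a shown document to the positive or the negative set."""
    (positives if interested else negatives).append(vec)
```

Passing a different callable as `similarity` would let a trained per-document measure replace the cosine stand-in without changing the ranking logic.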

14. A system for obtaining metadata related to a document, the system comprising:
- a plurality of documents, each document having metadata associated therewith, the metadata comprising at least one of a keyword and tag associated with the document; and
- a module configured to receive a new document apart from the plurality of documents, generate metadata for the new document, and associate the generated metadata with the new document using a processor, wherein the metadata is generated for the new document based on the metadata associated with one or more of the plurality of documents and based on a similarity score of the new document relative to each of the plurality of documents using a measure of similarity based on a weighting factor associated with each document of the plurality of documents.

15. The document recommendation system of claim 14 wherein the metadata associated with the plurality of documents comprises keywords and wherein the module is configured to determine keywords for the new document based on decomposing at least one of the similarity scores.

16. The document recommendation system of claim 15 wherein the module is configured to rank the similarity scores.

17. The document recommendation system of claim 14 wherein the similarity scores are based on factor space and wherein module is configured to determine keywords for the new document based on decomposing at least one of the similarity scores with respect to the factor space and to terms in the factor space.

18. The document recommendation system of claim 14 wherein the metadata associated with the plurality of documents comprises tags and wherein the module is configured to determine tags for the new document based on inferring tags from the plurality of documents.

19. The document recommendation system of claim 18 wherein the module is configured to rank the similarity scores.
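
One way to picture the tag inference of claims 14 and 18 is a similarity-weighted vote over the tags of already-tagged documents. The sketch below is an assumption-laden illustration, not the claimed system: cosine similarity stands in for the similarity measure, the per-document `weight` is a hypothetical reading of the "weighting factor", and `infer_tags` is an invented name.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity; a stand-in for the system's similarity measure."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def infer_tags(new_vec, library, top_k=2):
    """Propose tags for a new document from tagged documents.

    library: list of (word_vector, tags, weight) tuples, where weight is a
             per-document weighting factor (hypothetical interpretation).
    """
    votes = defaultdict(float)
    for vec, tags, weight in library:
        score = weight * cosine(new_vec, vec)
        if score <= 0.0:
            continue  # dissimilar documents contribute no votes
        for tag in tags:
            votes[tag] += score
    # Rank tags by accumulated score; ties are broken alphabetically.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:top_k]]
```

Ranking the accumulated scores mirrors the ranking step of claims 16 and 19. The keyword decomposition over a factor space in claim 17 is not covered by this sketch.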

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Seide, Frank T. B.; Thambiratnam, Albert J. K.; Yu, Peng; Lu, Lie
Primary Examiner(s): Robinson; Greta
Assistant Examiner(s): Chang; Jeffrey
Application Number: US11/765,653
Publication Number: US 20080319973A1
Time in Patent Office: 1,560 Days
Field of Search: 707/2, 707/6, 707/736, 707/706, 707/748
US Class Current: 707/736
CPC Class Codes:
G06F 16/313 Selection or weighting of t...
G06F 16/38 Retrieval characterised by ...
