INFORMATION RETRIEVAL SYSTEM AND METHOD USING A BAYESIAN ALGORITHM BASED ON PROBABILISTIC SIMILARITY SCORES

US 20100223258A1
Filed: 12/01/2006
Published: 09/02/2010
Est. Priority Date: 12/01/2005
Status: Abandoned Application


Abstract

An algorithm is provided which uses a model-based concept of a cluster and scores items using a score representative of the probability that a given item has been generated from the same distribution as one or more query items. Each item is represented by a feature vector $x_i$ comprising a plurality of digitally represented features $x_{ij}$. The method includes: receiving an input identifying the query items; for each of the other items, computing a score which is a function of a conditional probability of the feature vectors $x_i$ of the query items being generated from the generating distribution $p(x_i \mid \theta)$, given that the respective other item is generated from the same generating distribution; and returning a score for each of the other items, a list of some or all of the other items sorted by their respective score, or a list of n other items which have the highest score.
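
One way to make the score in the abstract concrete is sketched below. This is an assumption for illustration, not a formula quoted from the application: $D_q$ denotes the feature vectors of the query items, $x$ the feature vector of the item being scored, and items are taken to be conditionally independent given the parameters $\theta$. The abstract only requires the score to be "a function of" the conditional probability, so other choices are possible.

```latex
\mathrm{score}(x) \;=\; \frac{p(x \mid D_q)}{p(x)} \;=\; \frac{p(D_q \mid x)}{p(D_q)},
\qquad
p(D_q \mid x) \;=\; \int p(D_q \mid \theta)\, p(\theta \mid x)\, d\theta .
```

Under this reading, a score above 1 means the item is more probable once the query items are known than it is a priori.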

27 Claims

1. A computer-implemented method of scoring similarity between one or more query items and one or more other items, each of the items being represented by a feature vector $x_i$ comprising a plurality of digitally represented features $x_{ij}$, the method including:
- a) receiving an input identifying the query items;
  
  b) for each of the other items, computing a score which is a function of a conditional probability of the feature vectors $x_i$ of the query items being generated from a generating distribution $p(x_i \mid \theta)$ defined by parameters $\theta$, given that the feature vector $x_i$ of the respective other item is generated from the generating distribution $p(x_i \mid \theta)$; and
  
  c) returning a score for each of the other items, a list of some or all of the other items sorted by their respective score, or a list of n other items which have the highest score.
  - 2. A method as claimed in claim 1 in which the function has the effect of averaging over all possible values of the parameters $\theta$, weighted by a probability distribution $p(\theta)$ over parameter values.
  - 3. A method as claimed in claim 2, in which the feature vectors $x_i$ are binary vectors, the generating distribution is a product of Bernoulli distributions, the product includes a Bernoulli distribution for each feature $x_{ij}$, and the probability distribution $p(\theta)$ over parameter values is a Beta distribution $p(\theta \mid \alpha, \beta)$ with parameters $\alpha$ and $\beta$.
  - 4. A method as claimed in claim 3 in which the function includes a product of a matrix $X$ containing the feature vectors $x_i$ of the other items and a vector $q$, the elements of which are given by $q_j = \log \tilde{\alpha}_j - \log \alpha_j - \log \tilde{\beta}_j + \log \beta_j$, whereby $\alpha_j$ and $\beta_j$ are parameters of the Beta distribution, $\tilde{\alpha}_j = \alpha_j + \sum_{k=1}^{N} x_{kj}$ and $\tilde{\beta}_j = \beta_j + N - \sum_{k=1}^{N} x_{kj}$, $N$ is the number of items in the query and the sums are over query items.
  - 6. A method as claimed in claim 4, including using sparse matrix multiplication methods for calculating the product of X and q.
  - 7. A method as claimed in claim 4 including pre-processing the items such that only those other items $x_i$ which have at least a predefined number of features $x_{ij}$ in common with the query items are scored.
  - 8. A method as claimed in claim 4, the function including adding $c = \sum_j \left[ \log(\alpha_j + \beta_j) - \log(\alpha_j + \beta_j + N) + \log \tilde{\beta}_j - \log \beta_j \right]$ to the score to make it comparable between queries.
  - 9. A method as claimed in claim 4 in which $\alpha_j = \mathrm{const} \cdot m_j$ and $\beta_j = \mathrm{const} \cdot (1 - m_j)$, whereby const is a constant and $m_j$ is the average of $x_{ij}$ over all or some of the items.
  - 10. A method as claimed in claim 1 in which receiving an input identifying the query items includes:
    - i) responsive to a user input of search criteria, searching a database to return one or more hits;
      
      ii) receiving a user selection of items among the hits;
      
      iii) using the selection to define the query items; and
      
      wherein the method includes returning a list of M other items which have the highest score.
  - 11. A method as claimed in claim 1 in which the items are images and receiving an input identifying the query items includes, responsive to a user input of search criteria, identifying one or more images associated with a searchable label which matches the search criteria and identifying the identified images as query items.
  - 12. A method as claimed in claim 1 in which the feature vectors are representative of one of the group of web pages, images, patient records, gene sequences, proteins, pharmaceutical molecules, movies, music pieces, goods, people, investment instruments, companies, patents and words.
  - 13. A method as claimed in claim 1 including presenting a completed set of items similar to the query items to a user.
  - 14. A method of cleaning up a data set of items labelled with a particular label including:
    - for each item of the data set calculating a clean-up score using a method as claimed in claim 1 wherein the query items are all items in the data set leaving out the item to be scored and the other item is the item to be scored; and
      
      removing items based on the respective clean-up scores, thereby cleaning up the data set.
  - 15. A method as claimed in claim 14 including removing a predetermined number of items having the lowest scores or all items with a score less than a threshold value.
  - 16. A method of annotating an item including calculating an annotation score for each of a set of labels using a method as claimed in claim 1 wherein the query items are items labelled with the label to be scored, the other item is the item to be annotated and the annotation score is the returned score for the other item;
    - selecting one or more labels to be applied to the item to be annotated based on the respective annotation scores.
  - 17. A method as claimed in claim 16 in which a predetermined number of items having the highest annotation score is selected or in which items having an annotation score greater than a threshold are selected.
  - 18. A method as claimed in claim 1 in which the feature vectors are derived from real-valued feature vectors by thresholding the values of the features such that the resulting feature vectors are sparse.
  - 19. A method as claimed in claim 1 in which the generating distribution is a member of the exponential family of distributions.
  - 20. A method as claimed in claim 19 in which the generating distribution is a Gaussian having a diagonal covariance matrix.
  - 21. A computer system arranged to implement a method as claimed in claim 1.
  - 22. A computer program product comprising computer code instructions adapted to implement a method as claimed in claim 1.
  - 23. A computer readable medium carrying a computer program product as claimed in claim 22.
  - 24. A data signal carrying a computer program product as claimed in claim 22.
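
Dependent claims 3, 4, 6, 8 and 9 together describe a Beta-Bernoulli score that is linear in an item's binary features. The following is a minimal Python sketch of that combination, not the applicants' implementation. The function names (`prior_from_means`, `beta_bernoulli_query`, `sparse_scores`), the toy data and the choice `const=2.0` are all hypothetical. Storing each item as a list of its active feature indices is a plain-Python stand-in for the sparse matrix multiplication of claim 6.

```python
import math

def prior_from_means(items, const=2.0):
    """Claim 9 style prior: alpha_j = const*m_j, beta_j = const*(1 - m_j).

    Assumes every feature mean m_j lies strictly between 0 and 1.
    """
    n = len(items)
    means = [sum(x[j] for x in items) / n for j in range(len(items[0]))]
    return [const * m for m in means], [const * (1 - m) for m in means]

def beta_bernoulli_query(query, alpha, beta):
    """Return the vector q (claim 4) and the constant c (claim 8)."""
    n = len(query)
    q, c = [], 0.0
    for j, (a, b) in enumerate(zip(alpha, beta)):
        s = sum(x[j] for x in query)      # sum over query items of x_kj
        a_t, b_t = a + s, b + n - s       # alpha-tilde_j, beta-tilde_j
        q.append(math.log(a_t) - math.log(a) - math.log(b_t) + math.log(b))
        c += (math.log(a + b) - math.log(a + b + n)
              + math.log(b_t) - math.log(b))
    return q, c

def sparse_scores(active, q, c):
    """Score items given as lists of active (non-zero) feature indices."""
    return [c + sum(q[j] for j in idx) for idx in active]

# Hypothetical toy data: five items with four binary features each.
items = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1], [1, 0, 0, 1]]
alpha, beta = prior_from_means(items)
query = [items[0]]
q, c = beta_bernoulli_query(query, alpha, beta)
active = [[j for j, v in enumerate(x) if v] for x in items]
scores = sparse_scores(active, q, c)
ranking = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Because only the non-zero features of each item contribute to the sum, the cost per item grows with the number of active features rather than with the full feature dimension.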

5. A computer implemented method of scoring the similarity between N query items and one or more other items, each of the items being represented by a feature vector $x_i$ comprising a plurality of binary features $x_{ij}$, the method including:
- a) receiving an input identifying the query items;
  
  b) defining a vector $q$ for the query, the elements of $q$ being defined by $q_j = \log \tilde{\alpha}_j - \log \alpha_j - \log \tilde{\beta}_j + \log \beta_j$, whereby $\alpha_j$ and $\beta_j$ are parameters, $\tilde{\alpha}_j = \alpha_j + \sum_{k=1}^{N} x_{kj}$, $\tilde{\beta}_j = \beta_j + N - \sum_{k=1}^{N} x_{kj}$, and the sum is over the query items;
  
  c) calculating a score as a function of a product of a matrix $X$ and $q$, whereby $X$ is a matrix containing all feature vectors $x_i$ of the other items; and
  
  d) returning a score for each of the other items, a list of some or all of the other items sorted by their respective score, or a list of n other items which have the highest score.
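
A short derivation shows why a matrix-vector product is enough. It assumes, as an illustration rather than as a statement of the application, that the score is the log ratio $\log p(x \mid D) / p(x)$, where $D$ is the set of $N$ query items and each feature has an independent Beta($\alpha_j$, $\beta_j$) prior. The predictive probabilities used are the standard Beta-Bernoulli ones.

```latex
\begin{aligned}
p(x_j = 1 \mid D) &= \frac{\tilde{\alpha}_j}{\alpha_j + \beta_j + N},
\qquad
p(x_j = 1) = \frac{\alpha_j}{\alpha_j + \beta_j}, \\[4pt]
\log \frac{p(x \mid D)}{p(x)}
&= \sum_j \left[ x_j \log \frac{\tilde{\alpha}_j\,(\alpha_j + \beta_j)}{\alpha_j\,(\alpha_j + \beta_j + N)}
 + (1 - x_j) \log \frac{\tilde{\beta}_j\,(\alpha_j + \beta_j)}{\beta_j\,(\alpha_j + \beta_j + N)} \right] \\[4pt]
&= \underbrace{\sum_j \left[ \log(\alpha_j + \beta_j) - \log(\alpha_j + \beta_j + N)
 + \log \tilde{\beta}_j - \log \beta_j \right]}_{c}
 \;+\; \sum_j q_j\, x_j .
\end{aligned}
```

Collecting the terms that multiply $x_j$ gives $q_j = \log \tilde{\alpha}_j - \log \alpha_j - \log \tilde{\beta}_j + \log \beta_j$, so stacking the feature vectors of the other items as rows of $X$ yields all log scores at once as $c + Xq$.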

25. A computer implemented method of searching a data base of images including:
- responsive to a user input of search criteria, searching a data base of labelled images to return one or more images having at least one label matching the query;
  
  receiving a user selection of images among the returned images;
  
  calculating a similarity score between the selected images and unlabelled images in the data base; and
  
  returning a set of unlabelled images based on their respective scores.
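
The workflow of claim 25 can be sketched as follows. This is a hedged illustration only: the toy `database`, the `search_images` and `log_score` names, the uniform Beta(1, 1) prior and the `select` callback standing in for the user's selection are all assumptions. The similarity score used is the Beta-Bernoulli log predictive ratio, which is one possible choice and is not mandated by the claim.

```python
import math

def log_score(query, x, a=1.0, b=1.0):
    """log p(x | query) / p(x) under an assumed Beta(a, b)-Bernoulli model."""
    n, total = len(query), 0.0
    for j, xj in enumerate(x):
        s = sum(item[j] for item in query)
        post = (a + s) / (a + b + n)   # p(x_j = 1 | query)
        prior = a / (a + b)            # p(x_j = 1)
        if xj:
            total += math.log(post / prior)
        else:
            total += math.log((1 - post) / (1 - prior))
    return total

# Hypothetical database; label None marks an unlabelled image.
database = [
    {"id": "img0", "label": "sunset", "features": [1, 1, 0, 0]},
    {"id": "img1", "label": "sunset", "features": [1, 1, 1, 0]},
    {"id": "img2", "label": "forest", "features": [0, 0, 1, 1]},
    {"id": "img3", "label": None, "features": [1, 1, 0, 1]},
    {"id": "img4", "label": None, "features": [0, 0, 1, 1]},
]

def search_images(criteria, select, top_m=1):
    # Step 1: find labelled images whose label matches the search criteria.
    hits = [im for im in database if im["label"] == criteria]
    # Step 2: the user picks some of the hits (modelled by a callback).
    query = [im["features"] for im in select(hits)]
    # Step 3: score every unlabelled image against the selection.
    unlabelled = [im for im in database if im["label"] is None]
    ranked = sorted(unlabelled,
                    key=lambda im: log_score(query, im["features"]),
                    reverse=True)
    # Step 4: return the best-scoring unlabelled images.
    return [im["id"] for im in ranked[:top_m]]

result = search_images("sunset", select=lambda hits: hits)
print(result)
```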

26. A computer implemented method of cleaning up a data set of items labelled with a particular label including:
- for each item of the data set calculating a clean-up score which is a measure of the similarity between all the items in the data set leaving out the item to be scored and the item to be scored; and
  
  removing items based on the respective clean-up scores, thereby cleaning up the data set.
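
The leave-one-out procedure of claim 26 might look like the sketch below. It is an illustration under stated assumptions, not the applicants' code: the `clean_up` and `log_score` names, the Beta(1, 1) prior, the toy data and the rule of dropping the `n_remove` lowest-scoring items (borrowed from claim 15) are all choices made here.

```python
import math

def log_score(query, x, a=1.0, b=1.0):
    """log p(x | query) / p(x) under an assumed Beta(a, b)-Bernoulli model."""
    n, total = len(query), 0.0
    for j, xj in enumerate(x):
        s = sum(item[j] for item in query)
        post = (a + s) / (a + b + n)   # p(x_j = 1 | query)
        prior = a / (a + b)            # p(x_j = 1)
        if xj:
            total += math.log(post / prior)
        else:
            total += math.log((1 - post) / (1 - prior))
    return total

def clean_up(items, n_remove=1):
    """Score each item against all the others and drop the worst ones."""
    scores = []
    for i, x in enumerate(items):
        rest = items[:i] + items[i + 1:]   # leave out the item being scored
        scores.append(log_score(rest, x))
    worst = sorted(range(len(items)), key=lambda i: scores[i])[:n_remove]
    kept = [x for i, x in enumerate(items) if i not in worst]
    return kept, scores

# Hypothetical data set sharing one label; the last item looks mislabelled.
labelled = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1]]
kept, scores = clean_up(labelled)
print(kept)
```

A threshold on the score could replace the fixed count, which is the alternative named in claim 15.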

27. A computer implemented method of annotating an item including:
- calculating an annotation score for each of a set of labels as a measure of similarity between items labelled with the label to be scored and the item to be annotated; and
  
  selecting one or more labels to be applied to the item to be annotated based on the respective annotation scores.
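
Claim 27 can be illustrated in the same way. The sketch below is a minimal example under assumptions: the `annotate` and `log_score` names, the Beta(1, 1) prior, the toy label sets and the `top_k` selection rule are hypothetical. The log predictive ratio used here already contains the query-dependent constant of claim 8, which is what makes scores for different labels comparable.

```python
import math

def log_score(query, x, a=1.0, b=1.0):
    """log p(x | query) / p(x) under an assumed Beta(a, b)-Bernoulli model."""
    n, total = len(query), 0.0
    for j, xj in enumerate(x):
        s = sum(item[j] for item in query)
        post = (a + s) / (a + b + n)   # p(x_j = 1 | query)
        prior = a / (a + b)            # p(x_j = 1)
        if xj:
            total += math.log(post / prior)
        else:
            total += math.log((1 - post) / (1 - prior))
    return total

def annotate(item, labelled, top_k=1):
    """Score the item against each label's examples and keep the best labels."""
    scores = {label: log_score(examples, item)
              for label, examples in labelled.items()}
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return best, scores

# Hypothetical labelled examples, keyed by label.
labelled = {
    "sunset": [[1, 1, 0, 0], [1, 1, 1, 0]],
    "forest": [[0, 0, 1, 1], [0, 1, 1, 1]],
}
labels, scores = annotate([1, 1, 0, 1], labelled)
print(labels)
```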

Current Assignee
UCL Business PLC (University of London)
Original Assignee
UCL Business PLC (University of London)
Inventors
Heller, Katherine Anne; Ghahramani, Zoubin

Application Number

US12/095,637
Publication Number

US 20100223258A1
US Class Current

707/723
CPC Class Codes

G06F 16/3346   using probabilistic model

G06F 16/583   using metadata automaticall...

G06F 16/951   Indexing; Web crawling tech...

G06F 18/24155   Bayesian classification
