Content identification

  • US 8,572,087 B1
  • Filed: 10/17/2007
  • Issued: 10/29/2013
  • Est. Priority Date: 10/17/2007
  • Status: Active Grant
First Claim

1. A method implemented by data processing apparatus, the method comprising:

    generating a plurality of distinct clusters from training content wherein each of the plurality of clusters represents similar features of content items in the training content;

    identifying one or more conjunctions of the clusters based on a respective probability of observing features of each cluster of the conjunction in a collection of content items;

    scoring each of a plurality of the identified conjunctions based at least partly on a conditional probability that the conjunction is associated with a label;

    selecting a conjunction of the scored conjunctions that has a highest score as a current conjunction;

    until a stopping condition is reached:

        generating one or more higher-order child conjunctions for the current conjunction wherein each of the child conjunctions is a conjunction of the conjoined clusters of the current conjunction with one or more respective additional clusters that are not included in the conjoined clusters;

        scoring each of the child conjunctions based at least partly on a conditional probability that the child conjunction is associated with the label;

        if a highest scoring child conjunction has a score that is less than the score of the current conjunction, the stopping condition is reached, otherwise designating the highest scoring child conjunction as the current conjunction; and

    generating a classifier for the label from the current conjunction.
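
The greedy search recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions made only for illustration: the names `conditional_probability`, `greedy_conjunction_search`, and `make_classifier` are hypothetical; each content item is modeled as a dict holding a set of cluster ids and a set of labels; seed conjunctions are cluster pairs observed together at least once (a stand-in for the claim's probability-based identification); the score is the raw conditional probability alone, although the claim says only "based at least partly"; and each child adds exactly one cluster, although the claim allows one or more.

```python
from itertools import combinations


def conditional_probability(conjunction, items, label):
    """Estimate P(label | item exhibits every cluster in the conjunction)."""
    matching = [it for it in items if conjunction <= it["clusters"]]
    if not matching:
        return 0.0
    hits = sum(1 for it in matching if label in it["labels"])
    return hits / len(matching)


def greedy_conjunction_search(clusters, items, label):
    """Grow the best-scoring conjunction until every child scores lower."""
    ordered = sorted(clusters)  # fixed order keeps tie-breaking deterministic
    # Assumption: seed with cluster pairs that co-occur in at least one item.
    seeds = [frozenset(pair) for pair in combinations(ordered, 2)]
    seeds = [s for s in seeds if any(s <= it["clusters"] for it in items)]
    if not seeds:
        return None
    current = max(seeds, key=lambda s: conditional_probability(s, items, label))
    current_score = conditional_probability(current, items, label)
    while True:
        # Children: the current conjunction plus one cluster it lacks.
        children = [current | {c} for c in ordered if c not in current]
        if not children:
            break
        best = max(children,
                   key=lambda s: conditional_probability(s, items, label))
        best_score = conditional_probability(best, items, label)
        if best_score < current_score:
            break  # stopping condition: the best child scores lower
        current, current_score = best, best_score
    return current, current_score


def make_classifier(conjunction):
    """Return a predicate that fires when an item exhibits every cluster."""
    return lambda item_clusters: conjunction <= set(item_clusters)


# Toy data, invented purely for illustration.
items = [
    {"clusters": {"a", "b", "c"}, "labels": {"sports"}},
    {"clusters": {"a", "b"}, "labels": set()},
    {"clusters": {"a", "b", "c"}, "labels": {"sports"}},
    {"clusters": {"b", "c"}, "labels": set()},
    {"clusters": {"a", "d"}, "labels": set()},
]
result = greedy_conjunction_search({"a", "b", "c", "d"}, items, "sports")
classifier = make_classifier(result[0])
```

On this toy data the search starts from the pair {a, c}, which has conditional probability 1.0, extends it to {a, b, c} because that child scores no lower, and stops when adding "d" drops the score to 0.0. A child with an equal score replaces the current conjunction, following the claim's wording that only a strictly lower score triggers the stopping condition.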
