Content identification
Abstract
Systems, computer program products, and methods can identify a training set of content and generate one or more clusters from it, where each of the one or more clusters represents similar features of the training set of content. The one or more clusters can be used to generate a classifier. New content is identified, and the classifier is used to associate at least one label with the new content.
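The flow summarized in the abstract (cluster the training content, derive a classifier from the clusters, label new content) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and not taken from the patent: content items are modeled as sets of feature strings, similarity is Jaccard similarity, clustering is a simple greedy pass, and the names `jaccard`, `build_clusters`, `classify`, and the `threshold` parameter are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_clusters(training, threshold=0.3):
    """Greedy clustering (an illustrative choice): each item joins the first
    cluster whose pooled features are similar enough, else starts a new one."""
    clusters = []  # each cluster: {"features": set, "labels": Counter}
    for features, label in training:
        for cluster in clusters:
            if jaccard(features, cluster["features"]) >= threshold:
                cluster["features"] |= features
                cluster["labels"][label] += 1
                break
        else:
            clusters.append({"features": set(features), "labels": Counter([label])})
    return clusters


def classify(clusters, features):
    """Label new content with the majority label of its most similar cluster."""
    best = max(clusters, key=lambda c: jaccard(features, c["features"]))
    return best["labels"].most_common(1)[0][0]


# Hypothetical training content: (feature set, label) pairs.
training = [
    ({"goal", "match", "league"}, "sports"),
    ({"goal", "striker", "match"}, "sports"),
    ({"stock", "market", "shares"}, "finance"),
    ({"market", "shares", "dividend"}, "finance"),
]
clusters = build_clusters(training)
new_label = classify(clusters, {"league", "striker", "injury"})
```

With this toy data the two sports items and the two finance items each pool into one cluster, and the new item is labeled by whichever cluster it overlaps most.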
33 Citations
24 Claims
1. A method implemented by data processing apparatus, the method comprising:
- generating a plurality of distinct clusters from training content wherein each of the plurality of clusters represents similar features of content items in the training content;
- identifying one or more conjunctions of the clusters based on a respective probability of observing features of each cluster of the conjunction in a collection of content items;
- scoring each of a plurality of the identified conjunctions based at least partly on a conditional probability that the conjunction is associated with a label;
- selecting a conjunction of the scored conjunctions that has a highest score as a current conjunction;
- until a stopping condition is reached:
  - generating one or more higher-order child conjunctions for the current conjunction wherein each of the child conjunctions is a conjunction of the conjoined clusters of the current conjunction with one or more respective additional clusters that are not included in the conjoined clusters;
  - scoring each of the child conjunctions based at least partly on a conditional probability that the child conjunction is associated with the label;
  - if a highest scoring child conjunction has a score that is less than the score of the current conjunction, the stopping condition is reached, otherwise designating the highest scoring child conjunction as the current conjunction; and
- generating a classifier for the label from the current conjunction.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
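The selecting and "until a stopping condition is reached" steps of claim 1 amount to a greedy search over conjunctions, which the following sketch illustrates. It rests on assumptions that are not in the claim: each content item is reduced to the set of cluster ids observed in it plus a single label, the score is a plain empirical estimate of P(label | conjunction), each child adds exactly one cluster, and the search also stops when no clusters are left to add. The names `score` and `grow_conjunction` and the data are hypothetical.

```python
from itertools import combinations


def score(conjunction, items, label):
    """Empirical P(label | every cluster of the conjunction is observed)."""
    matched = [item_label for clusters, item_label in items if conjunction <= clusters]
    if not matched:
        return 0.0  # assumption: an unobserved conjunction scores zero
    return sum(item_label == label for item_label in matched) / len(matched)


def grow_conjunction(seeds, all_clusters, items, label):
    """Greedy search: keep extending the current conjunction with one extra
    cluster until the best child scores lower than the current conjunction."""
    current = max(seeds, key=lambda c: score(c, items, label))
    current_score = score(current, items, label)
    while True:
        # Children conjoin the current clusters with one cluster not yet included.
        children = [current | {extra} for extra in sorted(all_clusters - current)]
        if not children:
            break  # assumption: nothing left to add also ends the search
        best_child = max(children, key=lambda c: score(c, items, label))
        best_child_score = score(best_child, items, label)
        if best_child_score < current_score:
            break  # stopping condition recited in the claim
        current, current_score = best_child, best_child_score
    return current, current_score


# Hypothetical collection: (cluster ids observed in the item, item label).
items = [
    ({"A", "B", "C"}, "sports"),
    ({"A", "B", "C"}, "sports"),
    ({"A", "B"}, "news"),
    ({"A", "C"}, "news"),
    ({"A", "C", "D"}, "news"),
    ({"B", "C"}, "news"),
    ({"B", "C"}, "news"),
    ({"C", "D"}, "news"),
]
all_clusters = {"A", "B", "C", "D"}
seeds = [frozenset(pair) for pair in combinations(sorted(all_clusters), 2)]
best, best_score = grow_conjunction(seeds, all_clusters, items, "sports")
```

On this toy data the pair A and B scores highest among the seeds, adding C raises the score, and adding D drops it, so the search ends with the three-cluster conjunction. Note that a child whose score merely ties the current conjunction is still adopted, because the claim stops only when the child's score is lower.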
9. A system comprising:
- a machine-readable storage device having instructions stored thereon;
- data processing apparatus configured to execute the instructions to perform operations comprising:
  - generating a plurality of distinct clusters from training content wherein each of the plurality of clusters represents similar features of content items in the training content;
  - identifying one or more conjunctions of the clusters based on a respective probability of observing features of each cluster of the conjunction in a collection of content items;
  - scoring each of a plurality of the identified conjunctions based at least partly on a conditional probability that the conjunction is associated with a label;
  - selecting a conjunction of the scored conjunctions that has a highest score as a current conjunction;
  - until a stopping condition is reached:
    - generating one or more higher-order child conjunctions for the current conjunction wherein each of the child conjunctions is a conjunction of the conjoined clusters of the current conjunction with one or more respective additional clusters that are not included in the conjoined clusters;
    - scoring each of the child conjunctions based at least partly on a conditional probability that the child conjunction is associated with the label;
    - if a highest scoring child conjunction has a score that is less than the score of the current conjunction, the stopping condition is reached, otherwise designating the highest scoring child conjunction as the current conjunction; and
  - generating a classifier for the label from the current conjunction.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
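The identifying step recited in the independent claims selects conjunctions "based on a respective probability of observing features of each cluster". The claims do not say how that probability is used, so the sketch below shows one possible reading only: estimate each cluster's observation probability as the fraction of items in the collection where it appears, keep clusters at or above a threshold, and form pairwise conjunctions from the survivors. The function names, the `min_prob` threshold, and the `order` parameter are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def observation_probabilities(collection):
    """Fraction of items in the collection in which each cluster is observed."""
    counts = Counter(cluster for clusters in collection for cluster in clusters)
    total = len(collection)
    return {cluster: count / total for cluster, count in counts.items()}


def candidate_conjunctions(collection, min_prob=0.3, order=2):
    """Conjoin only clusters whose observation probability reaches min_prob
    (an assumed selection rule, not stated in the claims)."""
    probs = observation_probabilities(collection)
    frequent = sorted(c for c, p in probs.items() if p >= min_prob)
    return [frozenset(combo) for combo in combinations(frequent, order)]


# Hypothetical collection: the set of cluster ids observed in each content item.
collection = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B"}, {"D"}]
probs = observation_probabilities(collection)
candidates = candidate_conjunctions(collection)
```

Here cluster D appears in only one of five items, so it falls below the assumed threshold and no candidate conjunction includes it.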
17. A machine-readable storage device having instructions stored thereon, the instructions operable to cause data processing apparatus to perform operations comprising:
- generating a plurality of distinct clusters from training content wherein each of the plurality of clusters represents similar features of content items in the training content;
- identifying one or more conjunctions of the clusters based on a respective probability of observing features of each cluster of the conjunction in a collection of content items;
- scoring each of a plurality of the identified conjunctions based at least partly on a conditional probability that the conjunction is associated with a label;
- selecting a conjunction of the scored conjunctions that has a highest score as a current conjunction;
- until a stopping condition is reached:
  - generating one or more higher-order child conjunctions for the current conjunction wherein each of the child conjunctions is a conjunction of the conjoined clusters of the current conjunction with one or more respective additional clusters that are not included in the conjoined clusters;
  - scoring each of the child conjunctions based at least partly on a conditional probability that the child conjunction is associated with the label;
  - if a highest scoring child conjunction has a score that is less than the score of the current conjunction, the stopping condition is reached, otherwise designating the highest scoring child conjunction as the current conjunction; and
- generating a classifier for the label from the current conjunction.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
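The final step, generating a classifier for the label from the current conjunction, is not spelled out in the claims. One simple interpretation, sketched here under stated assumptions, is a rule that assigns the label to a new content item when every cluster of the conjunction is observed in it. The assumptions: a cluster is a set of feature strings, a cluster counts as observed when the item contains at least one of its features, and the classifier returns `None` when the rule does not fire. The names `observed_clusters` and `make_classifier` are hypothetical.

```python
def observed_clusters(item_features, clusters):
    """Ids of clusters with at least one feature present in the item
    (an assumed definition of 'observing features of a cluster')."""
    return {cid for cid, feats in clusters.items() if feats & item_features}


def make_classifier(conjunction, clusters, label):
    """Build a rule-style classifier from a conjunction of cluster ids."""
    def classifier(item_features):
        seen = observed_clusters(item_features, clusters)
        # The label applies only if all conjoined clusters are observed.
        return label if conjunction <= seen else None
    return classifier


# Hypothetical clusters of similar features and a conjunction found by search.
clusters = {
    "A": {"goal", "striker"},
    "B": {"league", "match"},
    "C": {"stock", "shares"},
}
sports_classifier = make_classifier(frozenset({"A", "B"}), clusters, "sports")
hit = sports_classifier({"goal", "match", "weather"})
miss = sports_classifier({"goal", "stock"})
```

The first item touches both clusters A and B and receives the label, while the second touches A and C only, so the rule does not fire.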
Specification