Content identification

US 8,572,087 B1
Filed: 10/17/2007
Issued: 10/29/2013
Est. Priority Date: 10/17/2007
Status: Active Grant

2 Assignments

0 Petitions

Abstract

Systems, computer program products, and methods can identify a training set of content and generate one or more clusters from the training set of content, where each of the one or more clusters represents similar features of the training set of content. The one or more clusters can be used to generate a classifier. New content is identified and the classifier is used to associate at least one label with the new content.
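
The last step of the abstract, associating a label with new content, can be sketched as follows. This is a minimal illustration and not the patented implementation: it assumes each cluster is summarized by a centroid, that a feature is "observed" in the cluster whose centroid is nearest, and that a classifier is a set of clusters that must all be observed. The names `nearest_cluster`, `label_content`, `centroids`, and `classifiers` are hypothetical.

```python
import math

def nearest_cluster(feature, centroids):
    """Return the id of the centroid closest to the feature (Euclidean distance)."""
    return min(centroids, key=lambda cid: math.dist(feature, centroids[cid]))

def label_content(features, centroids, classifiers):
    """Return every label whose required clusters are all observed in the item.

    classifiers maps a label to the set of cluster ids it requires (assumption).
    """
    observed = {nearest_cluster(f, centroids) for f in features}
    return sorted(label for label, required in classifiers.items()
                  if required <= observed)

# Hypothetical centroids, standing in for clusters learned from training content.
centroids = {
    "fur": (0.0, 0.0),
    "whiskers": (1.0, 1.0),
    "wheel": (5.0, 5.0),
    "window": (5.0, 0.0),
}
classifiers = {"cat": {"fur", "whiskers"}, "car": {"wheel", "window"}}

# Two feature vectors extracted from a new content item.
new_item = [(0.1, -0.2), (0.9, 1.2)]
labels_for_item = label_content(new_item, centroids, classifiers)
print(labels_for_item)
```

Requiring every cluster of the set to be observed is one simple reading of "classifier"; a real system could weight clusters or use a probabilistic threshold instead.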

33 Citations

24 Claims

1. A method implemented by data processing apparatus, the method comprising:
- generating a plurality of distinct clusters from training content wherein each of the plurality of clusters represents similar features of content items in the training content;
  
  identifying one or more conjunctions of the clusters based on a respective probability of observing features of each cluster of the conjunction in a collection of content items;
  
  scoring each of a plurality of the identified conjunctions based at least partly on a conditional probability that the conjunction is associated with a label;
  
  selecting a conjunction of the scored conjunctions that has a highest score as a current conjunction;
  
  until a stopping condition is reached:
  
  generating one or more higher-order child conjunctions for the current conjunction wherein each of the child conjunctions is a conjunction of the conjoined clusters of the current conjunction with one or more respective additional clusters that are not included in the conjoined clusters;
  
  scoring each of the child conjunctions based at least partly on a conditional probability that the child conjunction is associated with the label;
  
  if a highest scoring child conjunction has a score that is less than the score of the current conjunction, the stopping condition is reached, otherwise designating the highest scoring child conjunction as the current conjunction; and
  
  generating a classifier for the label from the current conjunction.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1 wherein identifying the conjunctions of clusters further comprises generating a distributed data set that indicates probabilities of observing the features in the collection of content items.
  - 3. The method of claim 1 wherein the content items are images.
  - 4. The method of claim 1 wherein the content items are text.
  - 5. The method of claim 1 wherein a conjunction is a set of two or more clusters.
  - 6. The method of claim 1 wherein a cluster represents a collection of content item portions that have similar features.
  - 7. The method of claim 1 wherein the content items are images and wherein the images are examined at different scales and positions, and similar portions of the images are placed into respective clusters.
  - 8. The method of claim 1 wherein content items are parsed into regions and those regions that have similar patterns are grouped into respective clusters.
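
The loop in claim 1 (select the best conjunction, generate higher-order children, stop when the best child scores lower) can be sketched as a greedy search. This is a simplified sketch under stated assumptions, not the patented implementation: each content item is reduced to the set of cluster ids observed in it, the score is the plain empirical conditional probability of the label given the conjunction, and each child adds exactly one cluster. The function names and the toy data are hypothetical.

```python
def score(conjunction, items, labels, label):
    """Empirical P(label | every cluster in the conjunction is observed)."""
    matched = [lab for observed, lab in zip(items, labels) if conjunction <= observed]
    if not matched:
        return 0.0  # assumption: an unsupported conjunction scores zero
    return sum(lab == label for lab in matched) / len(matched)

def grow_conjunction(seeds, clusters, items, labels, label):
    """Greedily extend the best seed conjunction with additional clusters."""
    current = max(seeds, key=lambda c: score(c, items, labels, label))
    current_score = score(current, items, labels, label)
    while True:
        # Children conjoin the current clusters with one cluster not yet included.
        children = [current | {c} for c in clusters if c not in current]
        if not children:
            break  # assumption: also stop when no clusters are left to add
        best = max(children, key=lambda c: score(c, items, labels, label))
        best_score = score(best, items, labels, label)
        if best_score < current_score:
            break  # stopping condition: best child scores lower than current
        current, current_score = best, best_score
    return current, current_score

# Toy data: clusters observed in each content item, and each item's label.
items = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "b"},
         {"a", "c"}, {"b", "c"}, {"a", "d"}, {"a", "b", "c", "d"}]
labels = ["cat", "cat", "dog", "cat", "dog", "dog", "dog", "dog"]
seeds = [frozenset({"a", "b"}), frozenset({"a", "c"}), frozenset({"b", "c"})]

best, best_score = grow_conjunction(seeds, ["a", "b", "c", "d"], items, labels, "cat")
print(sorted(best), round(best_score, 3))
```

Because a child that ties the current score is accepted, the search keeps growing through ties and ends either on a strictly worse child or when every cluster has been added.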

9. A system comprising:
- a machine-readable storage device having instructions stored thereon;
  
  data processing apparatus configured to execute the instructions to perform operations comprising:
  
  generating a plurality of distinct clusters from training content wherein each of the plurality of clusters represents similar features of content items in the training content;
  
  identifying one or more conjunctions of the clusters based on a respective probability of observing features of each cluster of the conjunction in a collection of content items;
  
  scoring each of a plurality of the identified conjunctions based at least partly on a conditional probability that the conjunction is associated with a label;
  
  selecting a conjunction of the scored conjunctions that has a highest score as a current conjunction;
  
  until a stopping condition is reached:
  
  generating one or more higher-order child conjunctions for the current conjunction wherein each of the child conjunctions is a conjunction of the conjoined clusters of the current conjunction with one or more respective additional clusters that are not included in the conjoined clusters;
  
  scoring each of the child conjunctions based at least partly on a conditional probability that the child conjunction is associated with the label;
  
  if a highest scoring child conjunction has a score that is less than the score of the current conjunction, the stopping condition is reached, otherwise designating the highest scoring child conjunction as the current conjunction; and
  
  generating a classifier for the label from the current conjunction.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The system of claim 9 wherein identifying the conjunctions of clusters further comprises generating a distributed data set that indicates probabilities of observing the features in the collection of content items.
  - 11. The system of claim 9 wherein the content items are images.
  - 12. The system of claim 9 wherein the content items are text.
  - 13. The system of claim 9 wherein a conjunction is a set of two or more clusters.
  - 14. The system of claim 9 wherein a cluster represents a collection of content item portions that have similar features.
  - 15. The system of claim 9 wherein the content items are images and wherein the images are examined at different scales and positions, and similar portions of the images are placed into respective clusters.
  - 16. The system of claim 9 wherein content items are parsed into regions and those regions that have similar patterns are grouped into respective clusters.
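
The "identifying" step, and the data set of observation probabilities in claim 10, can be illustrated with a small in-memory sketch. It assumes each content item is a set of observed cluster ids, estimates each cluster's probability as the fraction of items in which it appears, and forms candidate conjunctions only from clusters at or above a threshold. The threshold rule, the `min_prob` parameter, and the function names are assumptions for illustration; a plain dictionary stands in for the distributed data set.

```python
from collections import Counter
from itertools import combinations

def observation_probabilities(items):
    """Fraction of items in which each cluster is observed at least once."""
    counts = Counter(cluster for observed in items for cluster in set(observed))
    return {cluster: n / len(items) for cluster, n in counts.items()}

def candidate_conjunctions(items, min_prob, size=2):
    """Conjoin clusters whose individual observation probability is >= min_prob."""
    probs = observation_probabilities(items)
    frequent = sorted(c for c, p in probs.items() if p >= min_prob)
    return [frozenset(combo) for combo in combinations(frequent, size)]

# Toy collection: clusters observed in each of five content items.
collection = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"d"}]

probs = observation_probabilities(collection)
candidates = candidate_conjunctions(collection, min_prob=0.4)
print(probs)
print([sorted(c) for c in candidates])
```

Filtering on per-cluster probability keeps the number of candidate conjunctions manageable before any label-dependent scoring is done.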

17. A machine-readable storage device having instructions stored thereon, the instructions operable to cause data processing apparatus to perform operations comprising:
- generating a plurality of distinct clusters from training content wherein each of the plurality of clusters represents similar features of content items in the training content;
  
  identifying one or more conjunctions of the clusters based on a respective probability of observing features of each cluster of the conjunction in a collection of content items;
  
  scoring each of a plurality of the identified conjunctions based at least partly on a conditional probability that the conjunction is associated with a label;
  
  selecting a conjunction of the scored conjunctions that has a highest score as a current conjunction;
  
  until a stopping condition is reached:
  
  generating one or more higher-order child conjunctions for the current conjunction wherein each of the child conjunctions is a conjunction of the conjoined clusters of the current conjunction with one or more respective additional clusters that are not included in the conjoined clusters;
  
  scoring each of the child conjunctions based at least partly on a conditional probability that the child conjunction is associated with the label;
  
  if a highest scoring child conjunction has a score that is less than the score of the current conjunction, the stopping condition is reached, otherwise designating the highest scoring child conjunction as the current conjunction; and
  
  generating a classifier for the label from the current conjunction.
- Dependent claims (18, 19, 20, 21, 22, 23, 24):
  - 18. The storage device of claim 17 wherein identifying the conjunctions of clusters further comprises generating a distributed data set that indicates probabilities of observing the features in the collection of content items.
  - 19. The storage device of claim 17 wherein the content items are images.
  - 20. The storage device of claim 17 wherein the content items are text.
  - 21. The storage device of claim 17 wherein a conjunction is a set of two or more clusters.
  - 22. The storage device of claim 17 wherein a cluster represents a collection of content item portions that have similar features.
  - 23. The storage device of claim 17 wherein the content items are images and wherein the images are examined at different scales and positions, and similar portions of the images are placed into respective clusters.
  - 24. The storage device of claim 17 wherein content items are parsed into regions and those regions that have similar patterns are grouped into respective clusters.
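
Claims 7, 15, and 23 describe examining images at different scales and positions and placing similar portions into clusters. A minimal sketch of that idea follows. It assumes square windows, a stride of half the window size, a grayscale image stored as a list of rows, and "similarity" reduced to a quantized mean intensity. All of these choices, and the function names, are hypothetical stand-ins for whatever features and clustering a real system would use.

```python
def sliding_windows(width, height, scales, stride_fraction=0.5):
    """Yield (x, y, size) square windows at each scale and position."""
    for size in scales:
        if size > width or size > height:
            continue  # window does not fit in the image
        stride = max(1, int(size * stride_fraction))
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                yield (x, y, size)

def patch_mean(image, x, y, size):
    """Mean pixel intensity of the square patch with top-left corner (x, y)."""
    total = sum(image[r][c] for r in range(y, y + size) for c in range(x, x + size))
    return total / (size * size)

def group_patches(image, scales, bucket=64):
    """Group windows by quantized mean intensity (a crude similarity proxy)."""
    height, width = len(image), len(image[0])
    groups = {}
    for x, y, size in sliding_windows(width, height, scales):
        key = int(patch_mean(image, x, y, size) // bucket)
        groups.setdefault(key, []).append((x, y, size))
    return groups

# Toy 4x4 image: dark left half, bright right half.
image = [[0, 0, 255, 255] for _ in range(4)]

windows = list(sliding_windows(8, 8, scales=(4, 8)))
groups = group_patches(image, scales=(2,))
print(len(windows), {key: len(patches) for key, patches in groups.items()})
```

In this toy image the dark, mixed, and bright patches land in separate groups, which is the behavior the claims describe at a much coarser level of detail.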

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Yagnik, Jay
Primary Examiner(s): Mamillapalli, Pavan
Application Number: US11/873,790
Time in Patent Office: 2,204 Days
Field of Search: 707/6, 707/737, 707/738, 707/771, 382/203
US Class Current: 707/738
CPC Class Codes:
- G06F 16/583   using metadata automaticall...
- G06F 18/231   Hierarchical techniques, i....
- G06V 10/7625   Hierarchical techniques, i....
- G06V 20/35   Categorising the entire sce...
