Clustering and classification of multimedia data

US 7,774,288 B2
Filed: 05/16/2006
Issued: 08/10/2010
Est. Priority Date: 05/16/2006
Status: Active Grant

Abstract

Records including category data are clustered by representing the data as a plurality of clusters and generating a hierarchy of clusters based on those clusters. Records including category data are classified into folders according to a predetermined entropic similarity condition.

44 Claims

1. A computerized method comprising:
- generating, with a cluster content computer, a hierarchy of clusters of category data, the hierarchy comprises a plurality of levels of different sets of clusters, wherein at least one higher set of clusters is derived from a lower set of clusters and the generating the hierarchy includes, calculating similarity values between clusters in the lower set of clusters, the similarity values are based on a probability distribution for each cluster and an entropic distance metric, the probability distribution for each cluster is a probability of an occurrence of an attribute in the category data occurring in that cluster, and the entropic distance metric is a distance metric of cluster pairs in the lower set of clusters, and identifying a cluster pair in the lower set of clusters that minimizes the loss of information; and
  
  representing records of multimedia content as the hierarchy of clusters of category data, wherein the category data is defined in a vector space comprising multiple attributes, and wherein the records comprise category data.
  - 2. The method of claim 1, wherein the deriving the at least one higher set of clusters comprising:
    - merging clusters from the lower set of clusters together according to the entropic distance metric.
  - 3. The method of claim 2, wherein successively merging clusters together comprises:
    - determining a cluster pair in the lower set of clusters that has an entropic similarity characteristic value that satisfies the predetermined entropic similarity condition; and
      
      merging the selected cluster pair in the lower set of clusters into a single cluster for the at least one higher set of clusters.
  - 4. The method of claim 3, further comprising:
    - representing the merged cluster pair in the at least one higher set of clusters; and
      
      mapping non-merged clusters into the at least one higher set of clusters.
  - 5. The method of claim 3, wherein selecting a cluster pair comprises selecting a cluster pair that has a minimum entropic divergence.
  - 6. The method of claim 3, wherein selecting a cluster pair comprises selecting a cluster pair that has a maximum entropic proximity.
  - 7. The method of claim 1, further comprising:
    - mapping each record onto a system ontology; and
      
      cleaning at least one record.
  - 8. The method of claim 7, wherein cleaning a record comprises at least one of removing terms from attributes of the record, splitting attributes of the record into a plurality of sub-attributes, and replacing terms in attributes of the record.
  - 9. The method of claim 1, further comprising:
    - generating a distance matrix representing possible combinations of clusters present within a current hierarchy layer.
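
Claims 1 through 9 do not fix a formula for the entropic distance metric. The following is a minimal sketch, assuming a weighted Jensen-Shannon divergence stands in for that metric (an assumption, not something the claims state). Each cluster is held as a vector of attribute counts, a distance matrix is built over all cluster pairs in the current layer, and the pair with the smallest divergence is merged to form the next layer. The names `build_hierarchy`, `pair_distance`, and `js_divergence` are hypothetical, and every record is assumed to have at least one non-zero attribute count.

```python
import math
from itertools import combinations

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q, wp=0.5, wq=0.5):
    """Weighted Jensen-Shannon divergence (assumed stand-in for the entropic metric)."""
    m = [wp * pi + wq * qi for pi, qi in zip(p, q)]
    return wp * kl(p, m) + wq * kl(q, m)

def pair_distance(a, b):
    """Distance between two clusters, each given as (member_ids, attribute_counts)."""
    na, nb = sum(a[1]), sum(b[1])
    pa = [c / na for c in a[1]]  # probability of each attribute occurring in cluster a
    pb = [c / nb for c in b[1]]
    w = na / (na + nb)           # weight clusters by their share of the counts
    return js_divergence(pa, pb, w, 1.0 - w)

def build_hierarchy(records):
    """records: equal-length attribute-count vectors, one per record.
    Returns a list of layers; each layer is a list of (member_ids, counts)."""
    layer = [((i,), list(counts)) for i, counts in enumerate(records)]
    layers = [layer]
    while len(layer) > 1:
        # Distance matrix over every cluster pair in the current layer.
        matrix = {(i, j): pair_distance(layer[i], layer[j])
                  for i, j in combinations(range(len(layer)), 2)}
        i, j = min(matrix, key=matrix.get)  # pair with the smallest divergence
        merged = (layer[i][0] + layer[j][0],
                  [x + y for x, y in zip(layer[i][1], layer[j][1])])
        # Non-merged clusters are carried into the higher layer unchanged.
        layer = [c for k, c in enumerate(layer) if k not in (i, j)] + [merged]
        layers.append(layer)
    return layers

layers = build_hierarchy([[2, 0, 0], [3, 0, 0], [0, 1, 1]])
```

In this toy input the first two records share the same attribute distribution, so they are merged first; the third record joins only at the top layer.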

10. A machine-readable storage medium having executable instructions to cause a machine to perform a method comprising:
- generating a hierarchy of clusters of category data, the hierarchy comprises a plurality of levels of different sets of clusters, wherein at least one higher set of clusters derived from a lower set of clusters and the generating the hierarchy includes, calculating similarity values between clusters in the lower set of clusters, the similarity values are based on a probability distribution for each cluster and an entropic distance metric, the probability distribution for each cluster is a probability of an occurrence of an attribute in the category data occurring in that cluster, and the entropic distance metric is a distance metric of cluster pairs in the lower set of clusters, and identifying a cluster pair in the lower set of clusters that minimizes the loss of information; and
  
  representing records of multimedia content as the hierarchy of clusters of category data, wherein the category data is defined in a vector space comprising multiple attributes, and wherein the records comprise category data.
  - 11. The machine-readable storage medium of claim 10, wherein the deriving the at least one higher set of clusters comprises:
    - merging clusters from the lower set of clusters together according to the entropic distance metric.
  - 12. The machine-readable storage medium of claim 11, wherein the method further comprises:
    - determining a cluster pair at the lower set of clusters that has an entropic similarity characteristic value that satisfies the predetermined entropic similarity condition; and
      
      merging the selected cluster pair at the lower set of clusters into a single cluster for the at least one higher set of clusters.
  - 13. The machine-readable storage medium of claim 12, wherein the method further comprises:
    - representing the merged cluster pair in the at least one higher set of clusters; and
      
      mapping non-merged clusters into the at least one higher set of clusters.
  - 14. The machine-readable storage medium of claim 12, wherein selecting a cluster pair comprises selecting a cluster pair that has a minimum entropic divergence.
  - 15. The machine-readable storage medium of claim 12, wherein selecting a cluster pair comprises selecting a cluster pair that has a maximum entropic proximity.
  - 16. The machine-readable storage medium of claim 10, wherein the method further comprises:
    - mapping each record onto a system ontology; and
      
      cleaning at least one record.
  - 17. The machine-readable storage medium of claim 16, wherein cleaning a record comprises at least one of removing terms from attributes of the record, splitting attributes of the record into a plurality of sub-attributes, and replacing terms in attributes of the record.
  - 18. The machine-readable storage medium of claim 10, wherein the method further comprises:
    - generating a distance matrix representing possible combinations of clusters present within a current hierarchy layer.
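
The three cleaning operations named in claims 8 and 17 (removing terms, splitting an attribute into sub-attributes, and replacing terms) can be illustrated with a short sketch. The record layout used here, a dictionary mapping attribute names to lists of terms, is an assumption made for illustration, and the function names are hypothetical; the claims do not describe a data structure.

```python
def remove_terms(record, attribute, unwanted):
    """Return a copy of the record with unwanted terms dropped from one attribute."""
    cleaned = dict(record)
    cleaned[attribute] = [t for t in record[attribute] if t not in unwanted]
    return cleaned

def replace_terms(record, attribute, replacements):
    """Return a copy with terms in one attribute swapped per the replacements map."""
    cleaned = dict(record)
    cleaned[attribute] = [replacements.get(t, t) for t in record[attribute]]
    return cleaned

def split_attribute(record, attribute, sub_attributes):
    """Return a copy where one attribute is split into several sub-attributes.
    sub_attributes maps each new attribute name to the set of terms it receives."""
    cleaned = {k: v for k, v in record.items() if k != attribute}
    for name, terms in sub_attributes.items():
        cleaned[name] = [t for t in record[attribute] if t in terms]
    return cleaned

# Hypothetical record: the attribute names and terms are illustrative only.
record = {"genre": ["the", "sci-fi", "action"],
          "people": ["A. Actor", "D. Director"]}

step1 = remove_terms(record, "genre", {"the"})
step2 = replace_terms(step1, "genre", {"sci-fi": "science fiction"})
step3 = split_attribute(step2, "people",
                        {"actor": {"A. Actor"}, "director": {"D. Director"}})
```

Each helper returns a new dictionary rather than mutating its input, so the original record stays available for comparison.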

19. A computerized system comprising:
- a processor coupled to a memory through a bus; and
  
  a process executed from the memory by the processor to cause the processor to:
  
  generate a hierarchy of clusters of category data, the hierarchy comprises a plurality of levels of different sets of clusters, wherein at least one higher set of clusters derived from a lower set of clusters and the generation of the hierarchy further causes the processor to calculate similarity values between clusters in the lower set of clusters, the similarity values are based on a probability distribution for each cluster and an entropic distance metric, the probability distribution for each cluster is a probability of an occurrence of an attribute in the category data occurring in that cluster, and the entropic distance metric is a distance metric of cluster pairs in the lower set of clusters, and to identify a cluster pair in the lower set of clusters that minimizes the loss of information, and represent records of multimedia content as the hierarchy of clusters of category data, wherein the category data is defined in a vector space comprising multiple attributes, and wherein the records comprise category data.
  - 20. The system of claim 19, wherein the deriving the at least one higher set of clusters comprises:
    - merging clusters from the lower set of clusters together according to the entropic distance metric.
  - 21. The system of claim 20, wherein the process further causes the processor to:
    - determine a cluster pair in the lower set of clusters that has an entropic similarity characteristic value that satisfies the predetermined entropic similarity condition; and
      
      merge the selected cluster pair in the lower set of clusters into a single cluster for the at least one higher set of clusters.
  - 22. The system of claim 21, wherein the process further causes the processor to:
    - represent the merged cluster pair in the at least one higher set of clusters; and
      
      map non-merged clusters into the at least one higher set of clusters.
  - 23. The system of claim 21, wherein selecting a cluster pair comprises selecting a cluster pair that has a minimum entropic divergence.
  - 24. The system of claim 21, wherein selecting a cluster pair comprises selecting a cluster pair that has a maximum entropic proximity.
  - 25. The system of claim 19, wherein the process further causes the processor to:
    - map each record onto a system ontology; and
      
      clean at least one record.
  - 26. The system of claim 25, wherein cleaning a record comprises at least one of removing terms from attributes of the record, splitting attributes of the record into a plurality of sub-attributes, and replacing terms in attributes of the record.
  - 27. The system of claim 19, wherein the process further causes the processor to:
    - generate a distance matrix representing possible combinations of clusters present within a current hierarchy layer.

28. A computerized method comprising:
- creating, with a classifying content computer, an internal representation for each of a plurality of folders of records, wherein each folder internal representation is based on a first probability distribution of category data, the category data defined in a vector space comprising multiple attributes, and each of the first probability distributions corresponding to a folder includes a probability of occurrence that each of the multiple attributes occurs in that folder;
  
  creating an internal representation for each of a plurality of records, wherein each record internal representation is based on a second probability distribution of category data and each of the second probability distributions corresponding to a record includes a probability of occurrence that each of the multiple attributes occurs in that record; and
  
  classifying the plurality of records into the plurality of folders according to a predetermined entropic similarity condition using the plurality of first and second probability distributions.
  - 29. The method of claim 28, wherein the plurality of folders is user-defined.
  - 30. The method of claim 29, further comprising:
    - creating a distance matrix listing representing possible record and folder combinations.
  - 31. The method of claim 30, wherein the record is classified in more than one folder.
  - 32. The method of claim 29, further comprising:
    - assigning labels to folders within the plurality of folders.
  - 33. The method of claim 28, further comprising:
    - creating a conditional likelihood matrix from the distance matrix, the conditional likelihood matrix representing a probability of occurrence of a folder relative to a given record.
  - 34. The method of claim 28, further comprising:
    - creating a binary assignment matrix, wherein every record is classified in a single folder.
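
The classification steps in claims 28 through 34 can be sketched as follows. This is a minimal illustration, assuming the Jensen-Shannon divergence serves as the entropic measure and that "satisfying the entropic similarity condition" means either taking the closest folder or falling under a distance threshold; neither choice is stated in the claims. All function names are hypothetical.

```python
import math

def normalize(counts):
    """Turn attribute counts into a probability distribution over attributes."""
    total = sum(counts)
    return [c / total for c in counts]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (assumed stand-in for the entropic measure)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def distance_matrix(record_dists, folder_dists):
    """Rows are records, columns are folders."""
    return [[js_divergence(r, f) for f in folder_dists] for r in record_dists]

def binary_assignment(dist):
    """Every record is placed in exactly one folder: the closest one."""
    rows = []
    for row in dist:
        best = min(range(len(row)), key=row.__getitem__)
        rows.append([1 if j == best else 0 for j in range(len(row))])
    return rows

def threshold_assignment(dist, max_distance):
    """A record may land in several folders, or none, under a distance threshold."""
    return [[1 if d <= max_distance else 0 for d in row] for row in dist]

# Two folders and two records over three attributes (illustrative counts).
folders = [normalize([4, 1, 0]), normalize([0, 1, 4])]
records = [normalize([3, 1, 0]), normalize([0, 0, 2])]

dist = distance_matrix(records, folders)
single = binary_assignment(dist)
multi = threshold_assignment(dist, 0.2)
```

`binary_assignment` corresponds to the single-folder case of claim 34, while `threshold_assignment` allows the multi-folder case of claim 31.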

35. A machine-readable storage medium having executable instructions to cause a processor to perform a method, the method comprising:
- creating an internal representation for each of a plurality of folders of records, wherein each folder internal representation is based on a first probability distribution of category data, the category data defined in a vector space comprising multiple attributes, and each of the first probability distributions corresponding to a folder includes a probability of occurrence that each of the multiple attributes occurs in that folder;
  
  creating an internal representation for each of a plurality of records, wherein each record internal representation is based on a second probability distribution of category data and each of the second probability distributions corresponding to a record includes a probability of occurrence that each of the multiple attributes occurs in that record; and
  
  classifying the plurality of records into the plurality of folders according to a predetermined entropic similarity condition using the plurality of the first and second probability distributions, and wherein the records comprise the category data.
  - 36. The machine-readable storage medium of claim 35, wherein the plurality of folders is user-defined.
  - 37. The machine-readable storage medium of claim 36, wherein the method further causes the processor to create a distance matrix listing representing possible record and folder combinations.
  - 38. The machine-readable storage medium of claim 37, wherein the record is classified in more than one folder.
  - 39. The machine-readable storage medium of claim 37, wherein the method further causes the processor to create a binary assignment matrix, wherein every record is classified in a single folder.
  - 40. The machine-readable storage medium of claim 35, wherein the method further causes the processor to assign labels to the folders.
  - 41. The machine-readable storage medium of claim 35, wherein the method further causes the processor to create a conditional likelihood matrix from the distance matrix listing probability of occurrence of folder, given a record.

42. A computer system comprising:
- a processor coupled to a memory through a bus; and
  
  a process executed from the memory by the processor to cause the processor to create an internal representation for each of a plurality of folders of records, wherein each folder internal representation is based on a probability distribution of category data, the category data defined in a vector space comprising multiple attributes, and each of the probability distributions corresponding to a folder include a probability of occurrence that each of the multiple attributes occurs in that folder, to create an internal representation for each of a plurality of records, wherein each record internal representation is based on a second probability distribution of category data and each of the second probability distributions corresponding to a record includes a probability of occurrence that each of the multiple attributes occurs in that record, and to classify the plurality of records into the plurality of folders according to a predetermined entropic similarity condition using the plurality of first and second probability distributions, wherein the records comprise category data.
  - 43. The computer system of claim 42, wherein the processor further causes the processor to receive a user-defined plurality of folders.
  - 44. The computer system of claim 42, wherein the processor further causes the processor to:
    - create a distance matrix representing possible record and folder combinations;
      
      assign labels to the folders; and
      
      compute a conditional likelihood matrix from the distance matrix listing a probability of occurrence of a folder relative to a given record.
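
Claims 33, 41, and 44 derive a conditional likelihood matrix from the distance matrix but do not say how. One plausible reading, shown here purely as an assumption, converts each row of distances into probabilities with an exponential weighting, so that closer folders receive higher likelihood. The `beta` parameter, the labeling rule (naming a folder after its most probable attributes), and both function names are hypothetical.

```python
import math

def conditional_likelihood(dist, beta=1.0):
    """Map a record-by-folder distance matrix to P(folder | record).
    Assumes P is proportional to exp(-beta * distance); each row sums to 1."""
    out = []
    for row in dist:
        weights = [math.exp(-beta * d) for d in row]
        total = sum(weights)
        out.append([w / total for w in weights])
    return out

def label_folders(folder_dists, attribute_names, top_k=1):
    """Label each folder with its top_k most probable attributes (assumed rule)."""
    labels = []
    for probs in folder_dists:
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        labels.append([attribute_names[i] for i in ranked[:top_k]])
    return labels

# Illustrative distances: record 0 is close to folder 0, record 1 is equidistant.
dist = [[0.0, 1.0], [0.5, 0.5]]
likelihood = conditional_likelihood(dist, beta=2.0)

labels = label_folders([[0.8, 0.2, 0.0], [0.0, 0.2, 0.8]],
                       ["drama", "news", "sports"])
```

A larger `beta` sharpens the distribution toward the nearest folder, while `beta` near zero spreads likelihood evenly across folders.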

Current Assignee: Sony Corporation (Sony Group Corp.), Sony Electronics Inc. (Sony Group Corp.)
Original Assignee: Sony Corporation (Sony Group Corp.), Sony Electronics Inc. (Sony Group Corp.)
Inventors: Ohwa, Tsunayuki, Plutowski, Mark, Purang, Khemdut, Acharya, Chiranjit, Usuki, Takashi
Primary Examiner(s): Trujillo; James
Assistant Examiner(s): Casanova; Jorge A
Application Number: US11/436,142
Publication Number: US 20070271287A1
Time in Patent Office: 1,547 Days
Field of Search: 706/20, 706/45, 707/6
US Class Current: 706/45
CPC Class Codes: G06F 16/285 Clustering or classification; G06F 18/231 Hierarchical techniques, i....
