Systems and methods for the distributed categorization of source data

US 10,157,217 B2
Filed: 05/27/2016
Issued: 12/18/2018
Est. Priority Date: 05/18/2012
Status: Active Grant

4 Assignments

0 Petitions

Abstract

Systems and methods for the crowdsourced clustering of data items in accordance with embodiments of the invention are disclosed. In one embodiment of the invention, a method for determining categories for a set of source data includes obtaining a set of source data, determining a plurality of subsets of the source data, where a subset of the source data includes a plurality of pieces of source data in the set of source data, generating a set of pairwise annotations for the pieces of source data in each subset of source data, clustering the set of source data into related subsets of source data based on the sets of pairwise labels for each subset of source data, and identifying a category for each related subset of source data based on the clusterings of source data and the source data metadata for the pieces of source data in the group of source data.
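The flow described in the abstract (subsets, pairwise annotations per subset, then clustering across subsets) can be illustrated with a minimal Python sketch. This is not the patented method: the abstract does not say how the clustering is computed, so the sketch swaps in a simple connected-components grouping with a union-find structure. The names `pairwise_annotations`, `cluster`, and `same_initial`, and the sample data, are hypothetical, and the first-letter rule merely stands in for a human annotator's similarity judgment.

```python
from itertools import combinations


def pairwise_annotations(subset, is_similar):
    """Return {(a, b): bool} for every unordered pair in one subset."""
    return {(a, b): is_similar(a, b) for a, b in combinations(subset, 2)}


def cluster(items, annotation_sets):
    """Group items linked by at least one 'similar' annotation.

    Assumption: connected components stand in for the clustering step,
    which the abstract leaves unspecified.
    """
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for annotations in annotation_sets:
        for (a, b), similar in annotations.items():
            if similar:
                parent[find(a)] = find(b)

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical data: two overlapping subsets that share the item "bat".
items = ["ant", "ape", "bat", "bee", "cow"]
subsets = [["ant", "ape", "bat"], ["bat", "bee", "cow"]]


def same_initial(a, b):
    # Stand-in for an annotator's judgment that two items are similar.
    return a[0] == b[0]


annotation_sets = [pairwise_annotations(s, same_initial) for s in subsets]
clusters = cluster(items, annotation_sets)
print(clusters)
```

Because each annotator only sees one subset, items that appear in more than one subset (here "bat") are what allow groupings from different subsets to be joined.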

56 Citations

18 Claims

1. A method for labeling a set of source data, comprising:
- obtaining a set of source data using a distributed data categorization server system comprising a processor and a memory connected to the processor;
- determining a plurality of subsets of the source data using the distributed data categorization server system, where a subset of the source data comprises a plurality of pieces of source data in the set of source data;
- obtaining sets of pairwise annotations for each subset of source data using the data categorization server system, where a pairwise annotation indicates when a first piece of source data in a subset of source data is similar to a second piece of source data in the subset of source data;
- identifying a category for each subset of source data based on the obtained pairwise annotations for the subset of source data using the distributed data categorization server system;
- locating pieces of source data located in at least two of the subsets of source data using the data categorization server system;
- generating source data metadata describing attributes for at least one of the located pieces of source data based on the categories assigned to each of the subsets of source data in which the located pieces of content are contained using the data categorization server system, where the source data metadata for a piece of source data describes attributes of the piece of source data; and
- generating a taxonomy based on the identified categories and the set of source data using the distributed data categorization server system, where the taxonomy comprises relationships between the identified categories and the pieces of source data in the set of source data.
  - 2. The method of claim 1, wherein a category in the taxonomy comprises one or more attributes of the pieces of source data associated with the category in the taxonomy.
  - 3. The method of claim 1, further comprising iteratively identifying sub-categories for at least one identified category based on the pieces of source data associated with the identified category using the distributed data categorization server system.
  - 4. The method of claim 3, wherein:
    - the at least one identified category is selected based on the attributes of the pieces of source data associated with the identified category; and
    - the identified sub-categories comprise at least one attribute from a piece of source data associated with the sub-category that is not present in the identified category.
  - 5. The method of claim 1, further comprising generating instruction data using the distributed data categorization server system, where the instruction data describes the attributes of the pieces of the source data that should be used in generating the set of pairwise annotations.
  - 6. The method of claim 5, wherein the instruction data is generated based on the attributes of the pieces of source data in the set of source data.
  - 7. The method of claim 1, wherein generating a set of pairwise annotations for the pieces of source data in each subset of source data using the distributed data categorization server system is based on data characterization device metadata, where the data characterization device metadata describes anticipated annotations based on the pieces of source data in the obtained subset of source data.
  - 8. The method of claim 1, wherein clustering the set of source data into related subsets of source data further comprises:
    - generating a model comprising a set of points representing the pieces of source data in a Euclidian space using the distributed data categorization server system; and
    - clustering the set of points within the Euclidian space based on the set of pairwise annotations using the distributed data categorization server system.
  - 9. The method of claim 8, further comprising estimating the number of clusters within the Euclidian space using the distributed data categorization server system.
  - 10. The method of claim 1, wherein determining a plurality of subsets further comprises:
    - determining a subset size using the distributed data categorization server system, where the subset size is a measure of the number of pieces of source data assigned to a subset; and
    - deterministically allocating the pieces of source data to the determined subsets using the distributed data categorization server system.
  - 11. The method of claim 10, further comprising allocating additional pieces of source data to the subsets using the distributed data categorization server system, where the additional pieces of source data are sampled without replacement from the set of source data not already assigned to the subset.
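
The subset construction recited in claims 10 and 11 (pick a subset size, allocate pieces deterministically, then top up each subset with pieces sampled without replacement from those not already in it) might look roughly like the following Python sketch. The claims do not fix a particular allocation rule, so consecutive chunking is an assumption here, as are the function name `allocate_subsets`, its parameters, and the fixed random seed.

```python
import random


def allocate_subsets(items, subset_size, extra_per_subset=0, seed=0):
    """Split items into subsets, then add randomly sampled extras to each.

    Assumptions (not taken from the claims): the deterministic step is
    plain consecutive chunking, and a seeded generator is used so the
    random step is reproducible.
    """
    if subset_size <= 0:
        raise ValueError("subset_size must be positive")

    # Deterministic allocation: consecutive chunks of subset_size items.
    subsets = [
        list(items[i:i + subset_size])
        for i in range(0, len(items), subset_size)
    ]

    rng = random.Random(seed)
    for subset in subsets:
        members = set(subset)
        # Only pieces not already assigned to this subset are candidates.
        candidates = [item for item in items if item not in members]
        k = min(extra_per_subset, len(candidates))
        # random.sample draws without replacement.
        subset.extend(rng.sample(candidates, k))
    return subsets


# Hypothetical usage: ten items, base subsets of four, two extras each.
example_items = list(range(10))
example_subsets = allocate_subsets(example_items, 4, extra_per_subset=2)
for s in example_subsets:
    print(s)
```

The extras make the subsets overlap, which is what lets a later step find pieces that appear in at least two subsets.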

12. A distributed data categorization server system, comprising:
- a processor; and
- a memory connected to the processor and storing a data categorization application;
- wherein the data categorization application directs the processor to:
- obtain a set of source data;
- determine a plurality of subsets of the source data, where a subset of the source data comprises a plurality of pieces of source data in the set of source data;
- obtain a set of pairwise annotation for each subset of source data, where a pairwise annotation indicates when a first piece of source data in a subset of source data is similar to a second piece of source data in the subset of source data;
- identify a category for each subset of source data based on the obtained pairwise annotations for the subset of source data;
- locate pieces of content contained in at least two of the subsets of source data;
- generate label data for each of the located pieces of content based on the categories assigned to each of the subsets of source data in which the located pieces of content are contained; and
- generate a taxonomy based on the identified categories and the set of source data, where the taxonomy comprises relationships between the identified categories and the pieces of source data in the set of source data.
  - 13. The system of claim 12, wherein a category in the taxonomy comprises one or more attributes of the pieces of source data associated with the category in the taxonomy.
  - 14. The system of claim 12, wherein the data categorization application further directs the processor to iteratively identify sub-categories for at least one identified category based on the pieces of source data associated with the identified category.
  - 15. The system of claim 14, wherein:
    - the at least one identified category is selected based on the attributes of the pieces of source data associated with the identified category; and
    - the identified sub-categories comprise at least one attribute from a piece of source data associated with the sub-category that is not present in the identified category.
  - 16. The system of claim 12, wherein the data categorization application further directs the processor to generate instruction data, where the instruction data describes the attributes of the pieces of the source data that should be used in generating the set of pairwise annotations.
  - 17. The system of claim 16, wherein the instruction data is generated based on the attributes of the pieces of source data in the set of source data.
  - 18. The system of claim 12, wherein generating a set of pairwise annotations for the pieces of source data in each subset of source data using the distributed data categorization server system is based on data characterization device metadata, where the data characterization device metadata describes anticipated annotations based on the pieces of source data in the obtained subset of source data.
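
The last three steps of claim 12 (locate pieces contained in at least two subsets, label them from the categories of those subsets, and relate categories to pieces in a taxonomy) can be sketched in Python as below. This is an illustrative reading only: the claim does not specify data structures, so the dictionary-based representation, the function names `label_shared_items` and `build_taxonomy`, and the sample categories are all assumptions.

```python
from collections import defaultdict


def label_shared_items(subsets, categories):
    """Return {item: [categories]} for items found in two or more subsets.

    categories[i] is the (hypothetical) category identified for subsets[i].
    """
    seen_in = defaultdict(list)
    for subset, category in zip(subsets, categories):
        for item in set(subset):  # count each subset at most once per item
            seen_in[item].append(category)
    return {
        item: sorted(set(cats))
        for item, cats in seen_in.items()
        if len(cats) >= 2
    }


def build_taxonomy(subsets, categories):
    """Return {category: [items]} relating each category to its pieces."""
    taxonomy = defaultdict(set)
    for subset, category in zip(subsets, categories):
        taxonomy[category].update(subset)
    return {category: sorted(members) for category, members in taxonomy.items()}


# Hypothetical subsets and the category assumed to be identified for each.
demo_subsets = [["sparrow", "eagle", "bat"], ["bat", "mouse"], ["eagle", "hawk"]]
demo_categories = ["flies", "mammal", "raptor"]

labels = label_shared_items(demo_subsets, demo_categories)
taxonomy = build_taxonomy(demo_subsets, demo_categories)
print(labels)
print(taxonomy)
```

In this toy data, "bat" and "eagle" each sit in two subsets, so they are the only pieces that receive combined labels.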

Current Assignee: California Institute of Technology
Original Assignee: California Institute of Technology
Inventors: Gomes, Ryan; Welinder, Peter; Krause, Andreas; Perona, Pietro
Primary Examiner(s): Nguyen, Kim T
Application Number: US15/166,598
Publication Number: US 20160275173A1
Time in Patent Office: 935 Days
Field of Search: 707/706, 707/737
CPC Class Codes:
- G06F 16/24573 using data annotations, e.g...
- G06F 16/285 Clustering or classification
