Systems and methods for the distributed categorization of source data
First Claim
1. A method for labeling a set of source data, comprising:
obtaining a set of source data using a distributed data categorization server system comprising a processor and a memory connected to the processor;
determining a plurality of subsets of the source data using the distributed data categorization server system, where a subset of the source data comprises a plurality of pieces of source data in the set of source data;
obtaining sets of pairwise annotations for each subset of source data using the data categorization server system, where a pairwise annotation indicates when a first piece of source data in a subset of source data is similar to a second piece of source data in the subset of source data;
identifying a category for each subset of source data based on the obtained pairwise annotations for the subset of source data using the distributed data categorization server system;
locating pieces of source data located in at least two of the subsets of source data using the data categorization server system;
generating source data metadata describing attributes for at least one of the located pieces of source data based on the categories assigned to each of the subsets of source data in which the located pieces of content are contained using the data categorization server system, where the source data metadata for a piece of source data describes attributes of the piece of source data; and
generating a taxonomy based on the identified categories and the set of source data using the distributed data categorization server system, where the taxonomy comprises relationships between the identified categories and the pieces of source data in the set of source data.
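The last three steps of the claim (locating pieces that appear in at least two subsets, generating metadata for them from the subsets' categories, and building a taxonomy) can be illustrated with a minimal sketch. The function name, the dictionary layout, and the example identifiers below are hypothetical and chosen only for illustration; the claim does not prescribe any particular data structure.

```python
from collections import defaultdict


def label_overlapping_pieces(subset_members, subset_category):
    """Illustrative only; names and structures are assumptions.

    subset_members:  subset id -> set of piece ids in that subset
    subset_category: subset id -> category identified for that subset
    """
    # Record which subsets contain each piece of source data.
    containing = defaultdict(list)
    for subset_id, members in subset_members.items():
        for piece in members:
            containing[piece].append(subset_id)

    # Metadata is generated only for pieces located in at least two
    # subsets; it lists the categories of every subset containing them.
    metadata = {
        piece: sorted({subset_category[s] for s in subsets})
        for piece, subsets in containing.items()
        if len(subsets) >= 2
    }

    # Taxonomy (simplified here to a flat mapping): category -> pieces.
    taxonomy = defaultdict(set)
    for subset_id, members in subset_members.items():
        taxonomy[subset_category[subset_id]].update(members)
    return metadata, {cat: sorted(p) for cat, p in taxonomy.items()}


# Hypothetical example: two overlapping subsets sharing one piece.
subsets = {"s1": {"img1", "img2"}, "s2": {"img2", "img3"}}
categories = {"s1": "bird", "s2": "outdoor"}
metadata, taxonomy = label_overlapping_pieces(subsets, categories)
```

In this example only `img2` lies in both subsets, so it is the only piece that receives metadata, and that metadata combines the categories of both subsets.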
Abstract
Systems and methods for the crowdsourced clustering of data items in accordance with embodiments of the invention are disclosed. In one embodiment of the invention, a method for determining categories for a set of source data includes obtaining a set of source data, determining a plurality of subsets of the source data, where a subset of the source data includes a plurality of pieces of source data in the set of source data, generating a set of pairwise annotations for the pieces of source data in each subset of source data, clustering the set of source data into related subsets of source data based on the sets of pairwise labels for each subset of source data, and identifying a category for each related subset of source data based on the clusterings of source data and the source data metadata for the pieces of source data in the group of source data.
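As a rough illustration of clustering from pairwise annotations, the sketch below merges every pair marked as similar using a union-find structure. This transitive merging is a simplifying assumption made for the example, not the clustering technique disclosed in the patent; the function name and the `(first, second, similar)` tuple format are likewise hypothetical.

```python
def cluster_from_pairwise(pieces, annotations):
    """Group pieces by merging every pair annotated as similar.

    annotations: iterable of (first_piece, second_piece, similar) tuples.
    Assumption: similarity is treated as transitive, which real
    crowdsourced annotations need not satisfy.
    """
    parent = {p: p for p in pieces}

    def find(p):
        # Follow parent links to the cluster root, halving the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for first, second, similar in annotations:
        if similar:
            parent[find(first)] = find(second)

    clusters = {}
    for p in pieces:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical annotations over four pieces of source data.
pieces = ["a", "b", "c", "d"]
annotations = [("a", "b", True), ("b", "c", True), ("c", "d", False)]
clusters = cluster_from_pairwise(pieces, annotations)
```

Here `a`, `b`, and `c` end up in one cluster because of the two positive annotations, while `d` stays on its own.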
18 Claims
1. A method for labeling a set of source data, comprising:
obtaining a set of source data using a distributed data categorization server system comprising a processor and a memory connected to the processor;
determining a plurality of subsets of the source data using the distributed data categorization server system, where a subset of the source data comprises a plurality of pieces of source data in the set of source data;
obtaining sets of pairwise annotations for each subset of source data using the data categorization server system, where a pairwise annotation indicates when a first piece of source data in a subset of source data is similar to a second piece of source data in the subset of source data;
identifying a category for each subset of source data based on the obtained pairwise annotations for the subset of source data using the distributed data categorization server system;
locating pieces of source data located in at least two of the subsets of source data using the data categorization server system;
generating source data metadata describing attributes for at least one of the located pieces of source data based on the categories assigned to each of the subsets of source data in which the located pieces of content are contained using the data categorization server system, where the source data metadata for a piece of source data describes attributes of the piece of source data; and
generating a taxonomy based on the identified categories and the set of source data using the distributed data categorization server system, where the taxonomy comprises relationships between the identified categories and the pieces of source data in the set of source data.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A distributed data categorization server system, comprising:
a processor; and
a memory connected to the processor and storing a data categorization application;
wherein the data categorization application directs the processor to:
obtain a set of source data;
determine a plurality of subsets of the source data, where a subset of the source data comprises a plurality of pieces of source data in the set of source data;
obtain a set of pairwise annotation for each subset of source data, where a pairwise annotation indicates when a first piece of source data in a subset of source data is similar to a second piece of source data in the subset of source data;
identify a category for each subset of source data based on the obtained pairwise annotations for the subset of source data;
locate pieces of content contained in at least two of the subsets of source data;
generate label data for each of the located pieces of content based on the categories assigned to each of the subsets of source data in which the located pieces of content are contained; and
generate a taxonomy based on the identified categories and the set of source data, where the taxonomy comprises relationships between the identified categories and the pieces of source data in the set of source data.
Dependent claims: 13, 14, 15, 16, 17, 18
Specification