Multi-label content recategorization
First Claim
1. A computing apparatus, comprising:
a hardware platform comprising a processor and a memory; and
one or more tangible, non-transitory computer-readable mediums having instructions to provide a two-phase classification engine to:
in a first phase, receive a clean multi-labeled dataset comprising a plurality of documents, each assigned to one or more categories from a set of fixed categories;
receive an unclean multi-labeled dataset, wherein at least some objects of the unclean multi-labeled dataset belong to overlapping classes, wherein the probability that a document belongs to the overlapping classes is approximately equal;
produce a recategorized and cleansed dataset from the unclean multi-labeled dataset, comprising predicting a number of labels $\hat{l}$ for a document $j$, and comparing $\hat{l}$ to an existing number of labels $l$; and
in a second phase, compute from the recategorized and cleansed dataset a probability difference between $l$ and $\hat{l}$ for $j$, and take $l$ to be correct if the difference is less than or equal to a threshold.
Abstract
In an example, there is disclosed a computing apparatus, including one or more logic elements, including at least one hardware logic element, comprising a classification engine to: receive a clean multi-labeled dataset comprising a plurality of documents, each assigned to one or more of a plurality of categories; receive an unclean multi-labeled dataset; and produce a recategorized and cleansed dataset from the unclean multi-labeled dataset, comprising predicting a number of labels $\hat{l}$ for a document $j$, and comparing $\hat{l}$ to an existing number of labels $l$. There is also disclosed a method of providing a classification engine.
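The two phases recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation; the claims do not say how the label count is predicted or how the probability difference is computed. The sketch assumes that some classifier trained on the clean dataset has already produced per-category probabilities for a document, that the predicted label count (l-hat) is the number of categories at or above a cutoff, and that the probability difference is the gap between the mean probability of the predicted labels and that of the existing labels. All function names, the `cutoff`, and the `threshold` values are hypothetical.

```python
from typing import Dict, List, Set


def predict_label_count(probs: Dict[str, float], cutoff: float = 0.5) -> int:
    """Assumed estimate of l-hat: categories scoring at or above the cutoff (at least 1)."""
    return max(1, sum(1 for p in probs.values() if p >= cutoff))


def top_labels(probs: Dict[str, float], k: int) -> List[str]:
    """The k categories with the highest probability."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


def mean_prob(probs: Dict[str, float], labels: Set[str]) -> float:
    """Mean classifier probability over a set of labels (0.0 for an empty set)."""
    if not labels:
        return 0.0
    return sum(probs.get(c, 0.0) for c in labels) / len(labels)


def recategorize(
    probs: Dict[str, float],
    existing: Set[str],
    threshold: float = 0.1,
    cutoff: float = 0.5,
) -> Set[str]:
    """Return the labels to keep for one document."""
    # Phase 1: predict the number of labels and compare it to the existing count.
    l = len(existing)
    l_hat = predict_label_count(probs, cutoff)
    if l_hat == l:
        return set(existing)  # assumption: matching counts need no further check

    # Phase 2: assumed probability difference between predicted and existing labels.
    predicted = set(top_labels(probs, l_hat))
    diff = abs(mean_prob(probs, predicted) - mean_prob(probs, existing))

    # Take the existing labels to be correct if the difference is within the threshold.
    return set(existing) if diff <= threshold else predicted


probs = {"news": 0.9, "sports": 0.85, "travel": 0.1}
print(recategorize(probs, {"news"}))
print(recategorize(probs, {"travel"}))
```

In the first call the existing label scores close to the predicted set, so it is kept; in the second the existing label scores far lower, so the predicted labels replace it.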
25 Claims
1. A computing apparatus, comprising:
a hardware platform comprising a processor and a memory; and
one or more tangible, non-transitory computer-readable mediums having instructions to provide a two-phase classification engine to:
in a first phase, receive a clean multi-labeled dataset comprising a plurality of documents, each assigned to one or more categories from a set of fixed categories;
receive an unclean multi-labeled dataset, wherein at least some objects of the unclean multi-labeled dataset belong to overlapping classes, wherein the probability that a document belongs to the overlapping classes is approximately equal;
produce a recategorized and cleansed dataset from the unclean multi-labeled dataset, comprising predicting a number of labels $\hat{l}$ for a document $j$, and comparing $\hat{l}$ to an existing number of labels $l$; and
in a second phase, compute from the recategorized and cleansed dataset a probability difference between $l$ and $\hat{l}$ for $j$, and take $l$ to be correct if the difference is less than or equal to a threshold.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. One or more tangible, non-transitory computer-readable mediums having stored thereon executable instructions for providing a two-phase classification engine to:
in a first phase, receive a clean multi-labeled dataset comprising a plurality of documents, each assigned to one or more categories from a set of fixed categories;
receive an unclean multi-labeled dataset, wherein at least some objects of the unclean multi-labeled dataset belong to overlapping classes, wherein the probability that a document belongs to the overlapping classes is approximately equal;
produce a recategorized and cleansed dataset from the unclean multi-labeled dataset, comprising predicting a number of labels $\hat{l}$ for a document $j$, and comparing $\hat{l}$ to an existing number of labels $l$; and
in a second phase, compute from the recategorized and cleansed dataset a probability difference between $l$ and $\hat{l}$ for $j$, and take $l$ to be correct if the difference is less than or equal to a threshold.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22
23. A computer-implemented method of providing two-phase multi-label content recategorization, comprising:
in a first phase, receiving a clean multi-labeled dataset comprising a plurality of documents, each assigned to one or more categories from a set of fixed categories;
receiving an unclean multi-labeled dataset, wherein at least some objects of the unclean multi-labeled dataset belong to overlapping classes, wherein the probability that a document belongs to the overlapping classes is approximately equal;
producing a recategorized and cleansed dataset from the unclean multi-labeled dataset, comprising predicting a number of labels $\hat{l}$ for a document $j$, and comparing $\hat{l}$ to an existing number of labels $l$; and
in a second phase, computing from the recategorized and cleansed dataset a probability difference between $l$ and $\hat{l}$ for $j$, and taking $l$ to be correct if the difference is less than or equal to a threshold.
Dependent claims: 24, 25
Specification