Method and apparatus for auditing training supersets
Abstract
The present invention includes methods and systems to audit and identify potential errors and/or omissions in a training set. Particular aspects of the present invention are described in the claims, specification and drawings.
34 Claims
1. A computer assisted method of auditing a superset of training data, the superset comprising examples of documents having one or more preexisting category assignments, the method including:
partitioning the superset into at least two disjoint sets, including a test set and a training set, wherein the test set includes one or more test documents and the training set includes examples of documents belonging to at least two categories;
automatically categorizing the test documents using the training set;
calculating a metric of confidence based on results of the categorizing step and comparing the automatic category assignments for the test documents to the preexisting category assignments; and
reporting the test documents and preexisting category assignments that are suspicious and the automatic category assignments that appear to be missing from the test documents, based on the metric of confidence.
Dependent claims: 2-21.
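The four steps of claim 1 (partition, categorize, score confidence, report) can be illustrated with a minimal Python sketch. It is not the patented implementation. The token-set document representation, the `jaccard` similarity, the `score_categories` categorizer, the fold-based partitioning, and the `low` and `high` thresholds are all assumptions chosen for illustration; the claim does not prescribe any of them. The sketch also does not check that each training set covers at least two categories, which the claim requires.

```python
from collections import defaultdict


def jaccard(a, b):
    """Token-set overlap in [0, 1]; an assumed stand-in similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_categories(doc, training_set):
    """Hypothetical categorizer: mean similarity to each category's examples."""
    sims = defaultdict(list)
    for other in training_set:
        s = jaccard(doc["tokens"], other["tokens"])
        for cat in other["cats"]:
            sims[cat].append(s)
    return {cat: sum(v) / len(v) for cat, v in sims.items()}


def audit(superset, n_folds=2, low=0.2, high=0.4):
    """Return (suspicious, missing) lists of (doc_id, category, confidence)."""
    suspicious, missing = [], []
    for fold in range(n_folds):
        # Disjoint partition: every n_folds-th document forms the test set,
        # and all remaining documents form the training set.
        test_set = superset[fold::n_folds]
        training_set = [d for i, d in enumerate(superset) if i % n_folds != fold]
        for doc in test_set:
            scores = score_categories(doc, training_set)
            # Preexisting assignments with low confidence are suspicious.
            for cat in sorted(doc["cats"]):
                conf = scores.get(cat, 0.0)
                if conf < low:
                    suspicious.append((doc["id"], cat, conf))
            # Confident automatic assignments that are absent appear missing.
            for cat in sorted(scores):
                if cat not in doc["cats"] and scores[cat] > high:
                    missing.append((doc["id"], cat, scores[cat]))
    return suspicious, missing


# Illustrative data: document "e" reads like sports but is labeled politics.
docs = [
    {"id": "a", "tokens": {"goal", "match", "team"}, "cats": {"sports"}},
    {"id": "b", "tokens": {"goal", "team", "league"}, "cats": {"sports"}},
    {"id": "c", "tokens": {"vote", "election", "party"}, "cats": {"politics"}},
    {"id": "d", "tokens": {"vote", "party", "senate"}, "cats": {"politics"}},
    {"id": "e", "tokens": {"goal", "match", "league"}, "cats": {"politics"}},
]
suspicious, missing = audit(docs)
print("suspicious:", suspicious)
print("missing:", missing)
```

Rotating the test fold lets every document in the superset be audited once while it is held out of the training set; a single fixed split would satisfy the claim language equally well.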
22. A computer assisted method of auditing a superset of training data, the superset comprising examples of documents having one or more preexisting category assignments, the method including:
determining k nearest neighbors of the documents in a test subset automatically partitioned from the superset;
automatically categorizing the documents based on the k nearest neighbors into a plurality of categories;
calculating a metric of confidence based on results of the categorizing step and comparing the automatic category assignments for the documents to the preexisting category assignments; and
reporting the documents in the test subset and preexisting category assignments that are suspicious and the automatic category assignments that appear to be missing from the documents in the test subset, based on the metric of confidence.
Dependent claims: 23-34.
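The k-nearest-neighbor categorization recited in claim 22 might be sketched as follows. This is one plausible reading, not the method disclosed in the specification: the Jaccard similarity over token sets, the similarity-weighted voting, and the names `jaccard` and `knn_category_scores` are assumptions introduced here. The resulting per-category score could serve as the metric of confidence that is compared against the preexisting assignments.

```python
def jaccard(a, b):
    """Token-set overlap in [0, 1]; an assumed similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_category_scores(doc, training_set, k=3):
    """Score each category by the similarity-weighted share of the
    k nearest neighbors that carry it (scores fall in [0, 1])."""
    # Rank training documents by similarity and keep the k closest.
    neighbors = sorted(
        ((jaccard(doc["tokens"], other["tokens"]), other) for other in training_set),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbors)
    scores = {}
    if total == 0:
        return scores  # no neighbor shares any token with the document
    for sim, other in neighbors:
        for cat in other["cats"]:
            scores[cat] = scores.get(cat, 0.0) + sim / total
    return scores


# Illustrative data: the two nearest neighbors are both sports documents.
training = [
    {"tokens": {"goal", "team"}, "cats": {"sports"}},
    {"tokens": {"goal", "match"}, "cats": {"sports"}},
    {"tokens": {"vote", "party"}, "cats": {"politics"}},
]
query = {"tokens": {"goal", "team", "match"}, "cats": {"politics"}}
scores = knn_category_scores(query, training, k=2)
print(scores)
```

Because a neighbor may carry several categories, the scores are computed per category and need not sum to 1, which suits documents with more than one preexisting assignment.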
Specification