Automatic labeling of unlabeled text data
Abstract
A method for automatically labeling unlabeled text data can be practiced without human intervention, although manual intervention is not precluded. The method can be used to extract relevant features of unlabeled text data for a keyword search. The method uses a document collection as a reference answer set. Members of the answer set are converted to vectors representing centroids of unknown groups of unlabeled text data. Unlabeled text data are clustered relative to the centroids by a nearest neighbor algorithm, and the ID of the relevant answer is assigned to all documents in the cluster. At this point in the process, a supervised machine learning algorithm is trained on the labeled data, and a classifier for assigning labels to new text data is output. Alternatively, a feature extraction algorithm may be run on the classes generated by the clustering step, and search features that index the unlabeled text data are output.
5 Claims
1. A method of automated labeling of unlabeled text data comprising the steps of:
establishing a document collection as a reference answer set;
converting members of the answer set to vectors representing centroids of unknown groups of unlabeled text data;
clustering unlabeled text data relative to said centroids by a nearest neighbor algorithm;
assigning an ID to each said centroid; and
labeling each of the unlabeled text data documents with said ID of the answer in the cluster to which the unlabeled text data document has been assigned by said clustering step.
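The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the term-frequency vectors, the cosine similarity measure, the function names (`to_vector`, `cosine`, `label_documents`) and the sample texts are all assumptions chosen for illustration, since the claim does not specify a vector representation or a distance measure.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Turn text into a sparse term-frequency vector (an assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def label_documents(answer_set, documents):
    """Label each document with the ID of its nearest answer centroid."""
    # Each answer becomes a centroid; its dictionary key serves as the centroid ID.
    centroids = {answer_id: to_vector(text) for answer_id, text in answer_set.items()}
    labels = []
    for doc in documents:
        vec = to_vector(doc)
        # Nearest neighbor: the centroid with the highest cosine similarity.
        labels.append(max(centroids, key=lambda cid: cosine(vec, centroids[cid])))
    return labels


# Hypothetical reference answer set and unlabeled documents.
answers = {
    "A1": "To reset your password, open account settings and choose reset password.",
    "A2": "Refunds are issued to the original payment method within five days.",
}
docs = [
    "I forgot my password, how do I reset it?",
    "When will my refund reach my payment card?",
]
print(label_documents(answers, docs))  # ['A1', 'A2']
```

Ties are resolved in favor of the first centroid in the answer set, which is an arbitrary choice of this sketch.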
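2. The method of automated labeling of unlabeled text data recited in claim 1, further comprising the steps of:
training a supervised machine learning algorithm on the newly labeled data; and
outputting a classifier for assigning labels to new text data.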
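As a sketch of the training step, the example below fits a small multinomial naive Bayes model to documents that already carry answer IDs. The claim names no particular learning algorithm, so naive Bayes is an assumption, as are the function names (`tokenize`, `train_naive_bayes`) and the sample data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_naive_bayes(labeled_docs):
    """Train on (text, label) pairs and return a classifier function."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = tokenize(text)
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    total_docs = sum(class_counts.values())

    def classify(text):
        """Return the most probable label for new text, using add-one smoothing."""
        best_label, best_score = None, -math.inf
        for label, n_docs in class_counts.items():
            total_terms = sum(term_counts[label].values())
            score = math.log(n_docs / total_docs)
            for token in tokenize(text):
                if token in vocabulary:  # skip words never seen in training
                    score += math.log(
                        (term_counts[label][token] + 1) / (total_terms + len(vocabulary))
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return classify


# Hypothetical output of the labeling step: (document, answer ID) pairs.
newly_labeled = [
    ("I forgot my password, how do I reset it?", "A1"),
    ("Password reset link does not work", "A1"),
    ("When will my refund reach my payment card?", "A2"),
    ("Refund still missing from my card", "A2"),
]
classifier = train_naive_bayes(newly_labeled)
print(classifier("please reset my password"))  # A1
```

The returned `classifier` function plays the role of the output classifier: it can be applied to text that was not part of the clustered collection.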
3. The method of automated labeling of unlabeled text data recited in claim 1, further comprising the steps of:
running a feature extraction algorithm on classes generated by the step of clustering; and
outputting search features indexing the unlabeled text data.
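One possible reading of claim 3 is sketched below: for each class produced by clustering, the most distinctive terms are selected as search features, and an inverted index maps each feature to the documents that contain it. The TF-IDF-style scoring, the `top_k` parameter, the function names and the sample clusters are assumptions; the claim does not name a feature extraction algorithm.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_search_features(clusters, top_k=1):
    """Select top_k distinctive terms per cluster and index documents by them."""
    # Term counts per cluster.
    counts = {
        cid: Counter(token for doc in docs for token in tokenize(doc))
        for cid, docs in clusters.items()
    }
    # Number of clusters in which each term occurs.
    cluster_freq = Counter(term for c in counts.values() for term in c)

    features = {}
    for cid, c in counts.items():
        # Score is high for terms frequent in this cluster and rare in others.
        scored = {t: n * math.log(len(clusters) / cluster_freq[t]) for t, n in c.items()}
        # Sort by score (descending), then alphabetically for a stable result.
        features[cid] = sorted(scored, key=lambda t: (-scored[t], t))[:top_k]

    # Inverted index: search feature -> documents of that cluster containing it.
    index = defaultdict(list)
    for cid, terms in features.items():
        for term in terms:
            for doc in clusters[cid]:
                if term in tokenize(doc):
                    index[term].append(doc)
    return features, dict(index)


# Hypothetical classes generated by the clustering step.
clusters = {
    "A1": ["I forgot my password", "Password reset link does not work"],
    "A2": ["Where is my refund", "Refund still missing from my card"],
}
features, index = extract_search_features(clusters)
print(features)  # {'A1': ['password'], 'A2': ['refund']}
```

Terms that occur in every cluster, such as "my" in the sample data, receive a score of zero and are therefore never preferred as search features.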
4. The method of automated labeling of unlabeled text data recited in claim 1, further comprising the steps of:
checking selected categorizations and recalculating centroids;
re-clustering data using the nearest neighbor algorithm;
iterating the steps of checking and re-categorizing until the process stabilizes or an iteration parameter is reached;
training a supervised machine learning algorithm on the newly labeled data; and
outputting a classifier for assigning labels to new text data.
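The iterative refinement in claim 4 resembles a k-means loop seeded with the answer centroids. The sketch below covers only the checking, recalculating and re-clustering steps. It assumes that "checking selected categorizations" can be modeled as a hypothetical `corrections` mapping from document index to a reviewed answer ID, and that `max_iterations` stands in for the iteration parameter; the vector representation and all names are likewise assumptions.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Sparse term-frequency vector (an assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {t: v / len(vectors) for t, v in total.items()}


def refine_labels(answer_set, documents, corrections=None, max_iterations=10):
    """Re-cluster until the labels stop changing or max_iterations is reached."""
    corrections = corrections or {}
    centroids = {cid: to_vector(text) for cid, text in answer_set.items()}
    doc_vectors = [to_vector(d) for d in documents]
    labels = None
    for _ in range(max_iterations):
        # Nearest-centroid assignment.
        new_labels = [
            max(centroids, key=lambda cid: cosine(vec, centroids[cid])) for vec in doc_vectors
        ]
        # Checked categorizations override the automatic assignment.
        for i, cid in corrections.items():
            new_labels[i] = cid
        if new_labels == labels:  # the process has stabilized
            break
        labels = new_labels
        # Recalculate each centroid as the mean of its current members.
        for cid in centroids:
            members = [v for v, lab in zip(doc_vectors, labels) if lab == cid]
            if members:
                centroids[cid] = mean_vector(members)
    return labels


answers = {
    "A1": "To reset your password, open account settings and choose reset password.",
    "A2": "Refunds are issued to the original payment method within five days.",
}
docs = [
    "I forgot my password, how do I reset it?",
    "When will my refund reach my payment card?",
    "My card was charged twice",
]
# The third document shares no words with either answer; a reviewer assigns it to A2.
print(refine_labels(answers, docs, corrections={2: "A2"}))  # ['A1', 'A2', 'A2']
```

The stabilized labels could then be passed to any supervised learner to carry out the final two steps of the claim.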
5. The method of automated labeling of unlabeled text data recited in claim 1, further comprising the step of augmenting and/or editing text from the document collection as the reference answer set with additional information before converting the reference set to vectors.
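A minimal sketch of the augmentation step in claim 5 follows. It assumes that the additional information takes the form of extra keywords appended to each answer before vectorization; the function names, the `extra_terms` mapping and the sample data are hypothetical.

```python
import re
from collections import Counter


def to_vector(text):
    """Sparse term-frequency vector (an assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def augment_answers(answer_set, extra_terms):
    """Append additional text to each answer before it is vectorized."""
    return {
        answer_id: (text + " " + " ".join(extra_terms.get(answer_id, []))).strip()
        for answer_id, text in answer_set.items()
    }


answers = {"A2": "Refunds are issued to the original payment method within five days."}
# Hypothetical additions: wording that users are expected to type.
extra = {"A2": ["refund", "money back", "card"]}

augmented = augment_answers(answers, extra)
vector = to_vector(augmented["A2"])
print(vector["refund"], vector["card"])  # 1 1
```

Before augmentation the answer vector contains "refunds" but not "refund", so a document that asks about a "refund" would share no term with it under this simple tokenization.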
Specification