Automatic labeling of unlabeled text data

US 6,697,998 B1
Filed: 06/12/2000
Issued: 02/24/2004
Est. Priority Date: 06/12/2000
Status: Expired due to Term

Abstract

A method of automatically labeling unlabeled text data can be practiced independently of human intervention, although manual intervention is not precluded. The method can be used to extract relevant features of unlabeled text data for a keyword search. The method uses a document collection as a reference answer set. Members of the answer set are converted to vectors representing centroids of unknown groups of unlabeled text data. Unlabeled text data are clustered relative to the centroids by a nearest neighbor algorithm, and the ID of the relevant answer is assigned to all documents in the cluster. At this point in the process, a supervised machine learning algorithm is trained on the labeled data, and a classifier for assigning labels to new text data is output. Alternatively, a feature extraction algorithm may be run on the classes generated by the clustering step, and search features that index the unlabeled text data are output.
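
The core labeling steps described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the bag-of-words term counts, the cosine similarity measure, the names `vectorize`, `cosine`, and `label_documents`, and the sample answers and documents are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed vector representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_documents(answer_set, documents):
    """Label each document with the ID of its nearest answer centroid."""
    # Each answer becomes one centroid; its dictionary key serves as the ID.
    centroids = {aid: vectorize(text) for aid, text in answer_set.items()}
    labels = []
    for doc in documents:
        vec = vectorize(doc)
        # Nearest neighbor among centroids = highest cosine similarity.
        labels.append(max(centroids, key=lambda cid: cosine(vec, centroids[cid])))
    return labels


# Hypothetical reference answer set and unlabeled documents.
answers = {
    "reset_password": "how to reset a forgotten account password",
    "billing": "questions about invoice billing and payment charges",
}
docs = [
    "i forgot my password and cannot log in to my account",
    "why was my card charged twice on the last invoice",
]
print(label_documents(answers, docs))  # ['reset_password', 'billing']
```

Cosine similarity is used here only because it is a common choice for text vectors; the source does not say which distance the nearest neighbor step relies on.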

5 Claims

1. A method of automated labeling of unlabeled text data comprising the steps of:
- establishing a document collection as a reference answer set;
- converting members of the answer set to vectors representing centroids of unknown groups of unlabeled text data;
- clustering unlabeled text data relative to said centroids by a nearest neighbor algorithm;
- assigning an ID to each said centroid; and
- labeling each of the unlabeled text data documents with said ID of the answer in the cluster to which the unlabeled text data document has been assigned by said clustering step.
2. The method of automated labeling of unlabeled text data recited in claim 1, further comprising the steps of:
3. The method of automated labeling of unlabeled text data recited in claim 1, further comprising the steps of:
- running a feature extraction algorithm on classes generated by the step of clustering; and
- outputting search features indexing the unlabeled text data.
4. The method of automated labeling of unlabeled text data recited in claim 1, further comprising the steps of:
- checking selected categorizations and recalculating centroids;
- re-clustering data using the nearest neighbor algorithm;
- iterating the steps of checking and re-categorizing until process stabilizes or an iteration parameter is reached;
- training a supervised machine learning algorithm on the newly labeled data; and
- outputting a classifier for assigning labels to new text data.
5. The method of automated labeling of unlabeled text data recited in claim 1, further comprising the step of augmenting and/or editing text from the document collection as the reference answer set with additional information before converting the reference set to vectors.
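
The feature extraction path of claim 3 might look like the following sketch. The claim does not name a feature extraction algorithm, so the scoring rule used here (a term's count inside a class minus its count in all other classes) is purely an assumption, as are the function name `extract_search_features`, the `top_k` parameter, and the sample data.

```python
from collections import Counter, defaultdict


def extract_search_features(documents, labels, top_k=2):
    """Return up to top_k distinctive terms per class as search features."""
    per_class = defaultdict(Counter)  # term counts inside each class
    total = Counter()                 # term counts over all documents
    for doc, label in zip(documents, labels):
        terms = doc.lower().split()
        per_class[label].update(terms)
        total.update(terms)

    features = {}
    for label, counts in per_class.items():
        # Assumed score: in-class count minus out-of-class count.
        scored = {t: 2 * c - total[t] for t, c in counts.items()}
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scored, key=lambda t: (-scored[t], t))
        features[label] = [t for t in ranked if scored[t] > 0][:top_k]
    return features


# Hypothetical documents with labels produced by a prior clustering step.
docs = [
    "reset my password",
    "password reset link expired",
    "invoice charged twice",
    "refund for invoice",
]
labels = ["reset_password", "reset_password", "billing", "billing"]
print(extract_search_features(docs, labels))
# {'reset_password': ['password', 'reset'], 'billing': ['invoice', 'charged']}
```

The resulting terms could then serve as index keys for a keyword search over the labeled documents.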

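The refinement loop and classifier training of claim 4 can be sketched as below. Several choices are assumptions rather than details from the source: the "checking" step is modeled as an optional `corrections` mapping from document index to a reviewed label, each centroid is recomputed as the answer's term counts plus those of its member documents, and a simple nearest-centroid (Rocchio-style) classifier stands in for the unspecified supervised learning algorithm. All function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed vector representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(vec, centroids):
    return max(centroids, key=lambda cid: cosine(vec, centroids[cid]))


def refine_labels(answer_set, documents, corrections=None, max_iter=10):
    """Re-cluster and recompute centroids until labels stop changing."""
    corrections = corrections or {}  # hypothetical: {doc index: checked label}
    vectors = [vectorize(d) for d in documents]
    centroids = {cid: vectorize(t) for cid, t in answer_set.items()}
    labels = None
    for _ in range(max_iter):  # max_iter plays the role of the iteration parameter
        new_labels = [nearest(v, centroids) for v in vectors]
        for idx, checked in corrections.items():
            new_labels[idx] = checked  # checked categorizations override
        if new_labels == labels:       # process has stabilized
            break
        labels = new_labels
        # Recompute centroids: answer text plus all member documents.
        centroids = {cid: vectorize(t) for cid, t in answer_set.items()}
        for vec, label in zip(vectors, labels):
            centroids[label].update(vec)
    return labels


def train_classifier(documents, labels):
    """Train a nearest-centroid classifier on the newly labeled documents."""
    class_centroids = {}
    for doc, label in zip(documents, labels):
        class_centroids.setdefault(label, Counter()).update(vectorize(doc))
    return lambda text: nearest(vectorize(text), class_centroids)


answers = {
    "reset_password": "how to reset a forgotten account password",
    "billing": "questions about invoice billing and payment charges",
}
docs = [
    "i forgot my password and cannot log in to my account",
    "cannot log in to the site after the update",
    "why was my card charged twice on the last invoice",
    "my card was declined for the payment",
]
labels = refine_labels(answers, docs)
classify = train_classifier(docs, labels)
print(labels)
print(classify("the site will not let me log in"))  # reset_password
```

Because cosine similarity ignores vector length, summing term counts gives the same nearest-centroid result as averaging them, which keeps the recomputation step short.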

Current Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee
International Business Machines Corporation
Inventors
Johnson, David E., Damerau, Frederick J., Buskirk, Martin C. Jr.
Primary Examiner(s)
Shah, Sanjiv

Application Number
US09/591,497
Time in Patent Office
1,352 Days
Field of Search
715/512, 715/530, 715/531, 706/45, 707/103, 345/630, 345/648
US Class Current
715/260
CPC Class Codes
G06F 16/355 Class or cluster creation o...
