Error-driven feature ideation in machine learning
First Claim
1. A method for textual classification, comprising:
receiving, by a processing unit, a training set of textual data;
classifying, by the processing unit, the training set of textual data to obtain a first plurality of classifications for the training set of textual data;
determining, by the processing unit, a plurality of errors based on differences between the first plurality of classifications and a first plurality of labels having been previously assigned to the training set of textual data;
determining, by the processing unit, a set of candidate features based on the determined plurality of errors to correct at least one error of the plurality of errors;
causing, by the processing unit, a display of one or more candidate features from the determined set of candidate features for selection as an applied feature;
receiving, by the processing unit, a selection of at least one candidate feature of the displayed one or more candidate features to be an applied feature; and
retraining a classifier, using the applied feature, to re-classify the training set of textual data.
Abstract
Disclosed herein are technologies directed to a feature ideator. The feature ideator can initiate a classifier that analyzes a training set of data in a classification process. The feature ideator can generate one or more suggested features relating to errors generated during the classification process. The feature ideator can generate an output to cause the errors to be rendered in a format that provides for an interaction with a user. A user can review the summary of the errors or the individual errors and select one or more features to increase the accuracy of the classifier.
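For orientation, the disclosed loop (classify, surface errors, suggest features, accept a user selection, retrain) can be sketched in a few lines of Python. The toy corpus, the keyword-voting classifier, and all names below are illustrative assumptions, not the patent's implementation:

```python
from collections import Counter

# Toy corpus of (text, label) pairs; purely illustrative.
TRAINING_SET = [
    ("great product works well", "pos"),
    ("terrible product broke fast", "neg"),
    ("works great love it", "pos"),
    ("broke after a day terrible", "neg"),
]

def classify(text, features):
    """Majority vote over applied feature terms found in the text.
    `features` maps a term to the label it votes for."""
    votes = Counter(features[t] for t in text.split() if t in features)
    if not votes:
        return "pos"  # arbitrary default when no applied feature fires
    return votes.most_common(1)[0][0]

def find_errors(dataset, features):
    """Errors are differences between predictions and assigned labels."""
    return [(text, label) for text, label in dataset
            if classify(text, features) != label]

def suggest_candidate_features(errors):
    """Rank terms appearing in misclassified examples as candidates."""
    counts = Counter(t for text, _ in errors for t in text.split())
    return [term for term, _ in counts.most_common()]

# One iteration: classify, surface errors, suggest, select, retrain.
features = {"great": "pos"}                 # initially applied features
errors = find_errors(TRAINING_SET, features)
candidates = suggest_candidate_features(errors)
features["terrible"] = "neg"                # simulated user selection
assert len(find_errors(TRAINING_SET, features)) < len(errors)
```

In this sketch, "retraining" is simply re-running the classifier with the enlarged feature set; the disclosure's interactive display of errors and candidates is reduced to the `candidates` list a user would review.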
20 Claims
1. A method for textual classification, comprising the steps set forth in the First Claim above. Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
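Claim 1 requires that candidate features be determined "to correct at least one error of the plurality of errors." One way to read that limitation is as a filter: a candidate is only surfaced if applying it would fix at least one current error. The check below is a minimal sketch of that reading; the two-example dataset and single-keyword classifier are our assumptions:

```python
def classify(text, features):
    # Predict the label voted for by any applied feature term, else "pos".
    for term, label in features.items():
        if term in text.split():
            return label
    return "pos"

TRAINING = [("slow and noisy", "neg"), ("fast and quiet", "pos")]

def errors_for(features):
    """Examples whose prediction differs from the previously assigned label."""
    return [(t, l) for t, l in TRAINING if classify(t, features) != l]

def corrects_at_least_one(candidate, features):
    """True if applying the candidate fixes >= 1 currently misclassified example."""
    term, label = candidate
    trial = {**features, term: label}
    return any(classify(t, trial) == l for t, l in errors_for(features))

applied = {}  # no features yet: "slow and noisy" is misclassified as "pos"
assert corrects_at_least_one(("noisy", "neg"), applied)       # fixes that error
assert not corrects_at_least_one(("quiet", "pos"), applied)   # fixes nothing
```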
10. A computer comprising:
a processor; and
a non-transitory, computer-readable storage medium in communication with the processor, the non-transitory, computer-readable storage medium comprising computer-executable instructions for textual classification that, when executed by the processor, cause the processor to:
initiate a classifier of a feature ideator to obtain a first plurality of classifications by classifying a training set of textual data;
initiate the classifier of the feature ideator to determine a plurality of errors in the training set of textual data based on differences between the first plurality of classifications and a first plurality of labels having been previously assigned to the training set of textual data;
initiate a candidate feature generator of the feature ideator to determine a set of candidate features based on the determined plurality of errors to correct at least one error of the plurality of errors;
cause a display of one or more candidate features from the determined set of candidate features for selection as an applied feature; and
initiate the feature ideator to receive a selection of the displayed one or more candidate features to be an applied feature and to retrain the classifier to re-classify the training set of textual data based on the applied feature.
Dependent claims: 11, 12, 13, 14, 15, 16.
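Claim 10 names two components inside the feature ideator: the classifier and a candidate feature generator. A structural sketch of that decomposition follows; the class names, the keyword classifier, and the toy data are our own reading of the claim, not the patent's code:

```python
class Classifier:
    """Tiny keyword classifier standing in for the claim's classifier."""
    def __init__(self, features=None):
        self.features = dict(features or {})
    def predict(self, text):
        for term, label in self.features.items():
            if term in text.split():
                return label
        return "pos"  # arbitrary default when no feature fires

class CandidateFeatureGenerator:
    """Proposes (term, label) candidates drawn from misclassified examples."""
    def generate(self, errors):
        seen = []
        for text, label in errors:
            for term in text.split():
                if (term, label) not in seen:
                    seen.append((term, label))
        return seen

class FeatureIdeator:
    """Wires the two components together, as in the claimed computer."""
    def __init__(self, training):
        self.training = training
        self.classifier = Classifier()
        self.generator = CandidateFeatureGenerator()
    def errors(self):
        return [(t, l) for t, l in self.training
                if self.classifier.predict(t) != l]
    def candidates(self):
        return self.generator.generate(self.errors())
    def apply(self, candidate):
        # Simulated user selection followed by "retraining" (here, just
        # re-running prediction with the enlarged feature set).
        term, label = candidate
        self.classifier.features[term] = label

ideator = FeatureIdeator([("buggy and slow", "neg"), ("fast and stable", "pos")])
first = ideator.candidates()[0]     # e.g. ("buggy", "neg")
ideator.apply(first)
assert ideator.errors() == []
```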
17. A non-transitory, computer-readable storage medium having computer-executable instructions for textual classification that, when executed by a computer, cause the computer to:
receive a training set of textual data;
classify the training set of textual data to obtain a first plurality of classifications for the training set of textual data;
determine a plurality of errors based on differences between the first plurality of classifications and a first plurality of labels having been previously assigned to the training set of textual data;
determine a plurality of candidate features based on the determined plurality of errors to correct at least one error of the plurality of errors;
render a feature ideation user interface comprising:
a featuring area comprising a create feature section for receiving an input to initiate a feature ideation process and an applied feature section for displaying currently applied features;
a feature candidate section for displaying the candidate features; and
a contrast term section for displaying contrast terms, the contrast terms comprising terms that are properly classified; and
retrain a classifier to re-classify the training set of textual data based on the contrast terms.
Dependent claims: 18, 19, 20.
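Claim 17's "contrast terms" are terms drawn from properly classified examples, displayed alongside the error-derived candidates for contrast. One plausible way to compute them is to keep terms that occur in correctly classified examples but cancel any that also occur in errors; that scoring is our assumption, not language from the claim:

```python
from collections import Counter

def contrast_terms(dataset, predict):
    """Terms from properly classified examples, minus terms seen in errors."""
    correct, wrong = Counter(), Counter()
    for text, label in dataset:
        bucket = correct if predict(text) == label else wrong
        bucket.update(text.split())
    # Counter subtraction keeps only strictly positive counts.
    return [t for t, _ in (correct - wrong).most_common()]

data = [("good sturdy hinge", "pos"), ("good looks but broke", "neg")]
predict = lambda text: "pos"          # toy classifier: always predicts "pos"
terms = contrast_terms(data, predict)
assert "sturdy" in terms and "broke" not in terms
assert "good" not in terms            # appears in an error too, so it cancels
```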
Specification