Evaluating text classifier parameters based on semantic features

US 10,078,688 B2
Filed: 05/18/2016
Issued: 09/18/2018
Est. Priority Date: 04/12/2016
Status: Active Grant

Abstract

Systems and methods for evaluating text classifier parameters based on semantic features. An example method comprises: performing a semantico-syntactic analysis of a natural language text of a corpus of natural language texts to produce a semantic structure representing a set of semantic classes; identifying a natural language text feature to be extracted using a set of values of a plurality of feature extraction parameters; partitioning the corpus of natural language texts into a training data set comprising a first plurality of natural language texts and a validation data set comprising a second plurality of natural language texts; determining, in view of the category of the training data set, the set of values of the feature extraction parameters; validating the set of values of the feature extraction parameters using the validation data set.
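The partition, determine, and validate steps named in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the toy word-count classifier, the single feature extraction parameter `min_len` (a hypothetical minimum token length), and the function names. The sketch only shows the selection criterion: keep the parameter value under which the most validation texts are classified correctly.

```python
import random
from collections import Counter, defaultdict

def partition(corpus, validation_fraction=0.25, seed=0):
    """Shuffle (text, category) pairs and split them into training and validation sets."""
    items = list(corpus)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - validation_fraction))
    return items[:cut], items[cut:]

def extract(text, min_len):
    """Hypothetical feature extractor: lowercase tokens of at least min_len characters."""
    return [t for t in text.lower().split() if len(t) >= min_len]

def train(training_set, min_len):
    """Toy model: per-category token counts."""
    counts = defaultdict(Counter)
    for text, category in training_set:
        counts[category].update(extract(text, min_len))
    return counts

def classify(model, text, min_len):
    """Score each category by summed token counts and return the best-scoring one."""
    tokens = extract(text, min_len)
    scores = {c: sum(cnt[t] for t in tokens) for c, cnt in model.items()}
    return max(scores, key=scores.get)

def select_min_len(training_set, validation_set, candidates):
    """Return (value, correct) for the candidate that classifies the most validation texts correctly."""
    best = None
    for min_len in candidates:
        model = train(training_set, min_len)
        correct = sum(classify(model, t, min_len) == c for t, c in validation_set)
        if best is None or correct > best[1]:
            best = (min_len, correct)
    return best
```

Ties are resolved in favor of the earlier candidate, which is a design choice of this sketch rather than anything the abstract specifies.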

16 Claims

1. A method, comprising:
- identifying a plurality of feature extraction parameters of a text classifier model, wherein the plurality of feature extraction parameters comprises a first attribute of a first semantic class and a second attribute of a second semantic class, wherein a value of the second attribute is produced by applying a pre-defined transformation to a value of the first attribute;
  
  partitioning a corpus of natural language texts into a training data set comprising a first plurality of natural language texts and a validation data set comprising a second plurality of natural language texts;
  
  determining, in view of the training data set, a set of values of the feature extraction parameters, which maximizes a number of natural language texts of the validation data set that are classified correctly by the text classifier model using the set of values of the feature extraction parameters;
  
  performing, by a processing device, a semantico-syntactic analysis of an input natural language text to produce a semantic structure representing a set of semantic classes;
  
  producing a plurality of values by applying, to the semantic structure representing the input natural language text, the text classifier model using the set of values of the feature extraction parameters, wherein each value of the plurality of values reflects a degree of association of the input natural language text with a particular category of natural language texts;
  
  associating the input natural language text with a category corresponding to an optimal value among the plurality of values; and
  
  utilizing the category to perform a natural language processing task.
- Dependent claims (2, 3, 4, 5):
  - 2. The method of claim 1, wherein partitioning the corpus of natural language texts comprises cross-validating the first plurality of natural language texts and the second plurality of natural language texts.
  - 3. The method of claim 1, further comprising:
    - validating the set of values of the feature extraction parameters to produce a quasi-optimal set of values relative to a third plurality of natural language texts.
  - 4. The method of claim 1, wherein the plurality of feature extraction parameters comprises a number of levels of the semantic structure to be analyzed by the text classifier model.
  - 5. The method of claim 1, wherein applying the pre-defined transformation comprises multiplying the value of the first attribute by a pre-defined multiplier.
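Claims 4 and 5 name two kinds of feature extraction parameters: a number of levels of the semantic structure to analyze, and a pre-defined multiplier that derives one attribute value from another. The sketch below is one hypothetical reading, not the patented implementation. It assumes the semantic structure is a tree of nodes labeled with semantic classes, and that the attribute is a weight that is multiplied by the multiplier at each deeper level. The `Node` type, `extract_features`, and the example class names are all invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical node of a semantic structure, labeled with a semantic class."""
    semantic_class: str
    children: list = field(default_factory=list)

def extract_features(root, max_levels, base_weight, multiplier):
    """Walk the top max_levels levels of the tree and return a weight per semantic class.

    The weight used at each level is the previous level's weight times `multiplier`
    (an assumed form of the "pre-defined transformation").
    """
    features = {}
    frontier = [root]
    weight = base_weight
    for _ in range(max_levels):
        next_frontier = []
        for node in frontier:
            features[node.semantic_class] = features.get(node.semantic_class, 0.0) + weight
            next_frontier.extend(node.children)
        frontier = next_frontier
        weight *= multiplier
    return features

# Example structure with invented semantic class names.
tree = Node("MOTION", [Node("VEHICLE", [Node("CAR")]), Node("PLACE")])
two_levels = extract_features(tree, max_levels=2, base_weight=1.0, multiplier=0.5)
```

Under this reading, `max_levels`, `base_weight`, and `multiplier` are the values a parameter search would tune against the validation data set.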

6. A method, comprising:
- identifying a plurality of hyper-parameters of a text classifier model, wherein the plurality of hyper-parameters include a number of nearest neighbors to be analyzed by the text classifier model;
  
  partitioning a corpus of natural language texts into a training data set comprising a first plurality of natural language texts and a validation data set comprising a second plurality of natural language texts;
  
  determining, in view of the training data set, a set of values of the hyper-parameters of the text classifier model, which maximizes a number of natural language texts of the validation data set that are classified correctly by the text classifier model using the set of values of the hyper-parameters;
  
  performing, by a processing device, a semantico-syntactic analysis of an input natural language text to produce a semantic structure representing a set of semantic classes;
  
  producing a plurality of values by applying, to the semantic structure representing the input natural language text, the text classifier model using the set of values of the hyper-parameters, wherein each value of the plurality of values reflects a degree of association of the input natural language text with a particular category of natural language texts;
  
  associating the input natural language text with a category corresponding to an optimal value among the plurality of values; and
  
  utilizing the category to perform a natural language processing task.
- Dependent claims (7, 8, 9, 10):
  - 7. The method of claim 6, wherein partitioning the corpus of natural language texts comprises cross-validating the first plurality of natural language texts and the second plurality of natural language texts.
  - 8. The method of claim 6, further comprising:
    - validating the set of values of the feature extraction parameters to produce a quasi-optimal set of values relative to a third plurality of natural language texts.
  - 9. The method of claim 6, wherein the plurality of hyper-parameters further comprises a regularization parameter of the text classifier model.
  - 10. The method of claim 6, wherein determining the set of values of the hyper-parameters of the text classifier model further comprises:
    - modifying, using a pre-defined transformation, values of one or more hyper-parameters of the text classifier model to produce a modified set of values of the hyper-parameters;
      
      evaluating the number of natural language texts of the validation data set that are classified correctly by the text classifier model using the modified set of values of the hyper-parameters;
      
      responsive to determining that the number of natural language texts falls below a threshold number, repeating the modifying operation.
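The loop in claim 10 (modify the hyper-parameter values, count correctly classified validation texts, repeat while the count is below a threshold) can be sketched for the nearest-neighbor hyper-parameter of claim 6. This is a minimal illustration under stated assumptions: texts are already reduced to numeric feature vectors, the transformation is "add 2 to k", the starting value is evaluated before the first modification, and `max_rounds` caps the loop. None of these specifics come from the claims, and all function names are hypothetical.

```python
from collections import Counter

def knn_classify(training_set, features, k):
    """Majority vote among the k training vectors nearest to `features` (squared Euclidean distance)."""
    nearest = sorted(
        training_set,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], features)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def count_correct(training_set, validation_set, k):
    """Number of validation items whose predicted category matches the labeled one."""
    return sum(knn_classify(training_set, x, k) == y for x, y in validation_set)

def tune_k(training_set, validation_set, k, threshold,
           transform=lambda k: k + 2, max_rounds=10):
    """Apply `transform` to k until the validation count reaches `threshold` or rounds run out."""
    correct = count_correct(training_set, validation_set, k)
    for _ in range(max_rounds):
        if correct >= threshold:
            break
        k = transform(k)  # assumed pre-defined transformation
        correct = count_correct(training_set, validation_set, k)
    return k, correct
```

The cap on rounds is a safeguard added in this sketch so the loop terminates even when no value of k reaches the threshold.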

11. A system, comprising:
- a memory;
  
  a processor, coupled to the memory, the processor configured to:
  
  identify a plurality of feature extraction parameters of a text classifier model, wherein the plurality of feature extraction parameters comprises a first attribute of a first semantic class and a second attribute of a second semantic class, wherein a value of the second attribute is produced by applying a pre-defined transformation to a value of the first attribute;
  
  partition a corpus of natural language texts into a training data set comprising a first plurality of natural language texts and a validation data set comprising a second plurality of natural language texts;
  
  determine, in view of the training data set, a set of values of the feature extraction parameters, which maximizes a number of natural language texts of the validation data set that are classified correctly by the text classifier model using the set of values of the feature extraction parameters;
  
  perform a semantico-syntactic analysis of an input natural language text to produce a semantic structure representing a set of semantic classes;
  
  produce a plurality of values by applying, to the semantic structure representing the input natural language text, the text classifier model using the set of values of the feature extraction parameters, wherein each value of the plurality of values reflects a degree of association of the input natural language text with a particular category of natural language text;
  
  associate the input natural language text with a category corresponding to an optimal value among the plurality of values; and
  
  utilize the category to perform a natural language processing task.
- Dependent claims (12, 13):
  - 12. The system of claim 11, wherein partitioning the corpus of natural language texts comprises cross-validating the first plurality of natural language texts and the second plurality of natural language texts.
  - 13. The system of claim 11, wherein the plurality of feature extraction parameters comprises a number of levels of the semantic structure to be analyzed by the text classifier model.

14. A computer-readable non-transitory storage medium comprising executable instructions that, when executed by a computer system, cause the computer system to:
- identify a plurality of hyper-parameters of a text classifier model, wherein the plurality of hyper-parameters include a number of nearest neighbors to be analyzed by the text classifier model;
  
  partition a corpus of natural language texts into a training data set comprising a first plurality of natural language texts and a validation data set comprising a second plurality of natural language texts;
  
  determine, in view of the training data set, a set of values of the hyper-parameters of the text classifier model, which maximizes a number of natural language texts of the validation data set that are classified correctly by the text classifier model using the set of values of the hyper-parameters;
  
  perform a semantico-syntactic analysis of an input natural language text to produce a semantic structure representing a set of semantic classes; and
  
  produce a plurality of values by applying, to the semantic structure representing the input natural language text, the text classifier model using the set of values of the hyper-parameters, wherein each value of the plurality of values reflects a degree of association of the input natural language text with a particular category of natural language texts;
  
  associate the input natural language text with a category corresponding to an optimal value among the plurality of values; and
  
  utilize the category to perform a natural language processing task.
- Dependent claims (15, 16):
  - 15. The computer-readable non-transitory storage medium of claim 14, wherein partitioning the corpus of natural language texts comprises cross-validating the first plurality of natural language texts and the second plurality of natural language texts.
  - 16. The computer-readable non-transitory storage medium of claim 14, wherein a hyper-parameter of the plurality of hyper-parameters represents a regularization parameter of the text classifier model.
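Claims 2, 7, 12, and 15 state that partitioning comprises cross-validating the two pluralities of texts, without fixing a scheme. One common scheme is k-fold cross-validation, sketched below as an assumption: the corpus is divided into `folds` interleaved subsets, and each subset serves once as the validation data set while the rest form the training data set. The function name and the interleaved assignment are illustrative choices, not details from the claims.

```python
def k_fold_partitions(corpus, folds):
    """Yield (training, validation) pairs; item j goes to validation in round j % folds."""
    items = list(corpus)
    for i in range(folds):
        validation = items[i::folds]
        training = [x for j, x in enumerate(items) if j % folds != i]
        yield training, validation

# Example: six placeholder texts split into three rounds.
splits = list(k_fold_partitions(range(6), folds=3))
```

A parameter search could then average the count of correctly classified texts over all rounds instead of relying on a single split.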

Current Assignee: ABBYY Development LLC
Original Assignee: ABBYY Production LLC (ABBYY Software)
Inventors: Kolotienko, Sergey; Anisimovich, Konstantin
Primary Examiner(s): Yang, Qian
Application Number: US15/157,722
Publication Number: US 20170293687A1
Time in Patent Office: 853 Days

CPC Class Codes
G06F 16/36   Creation of semantic tools,...
G06F 40/211   Syntactic parsing, e.g. bas...
G06F 40/226   Validation
G06F 40/268   Morphological analysis
G06F 40/30   Semantic analysis
G06F 40/40   Processing or translation o...
