Methods and systems for providing universal portability in machine learning
First Claim
1. A method for classifying a document in natural language processing using a natural language model stored in one or more data files, the method comprising:
accessing, by a processor in a natural language processing platform using the natural language model, one or more feature types from the one or more data files, the one or more feature types each defining a data structure configured to access a tokenized sequence of the document and generate linguistic features from content within the tokenized sequence;
performing, by the processor in the natural language processing platform, a tokenizing operation of the document, the tokenizing operation configured to generate one or more tokenized sequences from the content within the document;
generating, by the processor in the natural language processing platform, a plurality of features for the document from the one or more tokenized sequences, based on parameters defined by the one or more feature types and on parameters defined in task configuration data in the one or more data files, the task configuration data associated with a type of task analysis that the natural language model is configured to classify the document into;
accessing, by the processor in the natural language processing platform, a plurality of probabilities stored in the one or more data files, each probability among the plurality of probabilities associated with a feature among the plurality of features and defining a pre-computed likelihood that said feature predicts a presence or absence of a label that the document is to be classified into;
wherein:
the plurality of probabilities are pre-computed during a model training process configured to train the natural language model to classify documents according to at least said label and said task analysis;
the one or more data files are configured to store each probability in a logarithmic scale that is converted to said probability by the processor;
the one or more data files are configured to store a table of rows and columns, wherein a first column comprises the plurality of features, a second column comprises a first category of probabilities among the plurality of probabilities that describes a first likelihood that a feature in the first column belonging to the same row satisfies a first attribute of said label, and a third column comprises a second category of probabilities among the plurality of probabilities that describes a second likelihood that said feature in the first column belonging to the same row satisfies a second attribute of said label; and
the first attribute of said label represents a likelihood that said feature in the same row appears at a beginning of a span of the document, the second attribute of said label represents a likelihood that said feature in the same row appears inside said span of the document, and a fourth column comprises a third category of probabilities among the plurality of probabilities that represents a third likelihood that said feature in the same row appears outside said span of the document;
computing, by the processor in the natural language processing platform, a prediction score indicating how likely the document is to be classified into said label, based on the plurality of probabilities;
classifying, by the processor in the natural language processing platform, the document into said label based on comparing the prediction score to a threshold; and
training the natural language model at least based on the classified document.
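Stripped of claim language, the recited method is a scoring pipeline: tokenize the document, generate features, look up pre-computed log-scale likelihoods, combine them into a prediction score, and compare that score to a threshold. The sketch below is a minimal illustration under assumed simplifications (unigram feature types, a tiny hard-coded log-probability table, a geometric-mean score); every name and value is hypothetical and not taken from the specification.

```python
import math

# Hypothetical pre-computed model data, standing in for the claimed "one or
# more data files": each feature maps to a log-scale likelihood, pre-computed
# during training, that the feature predicts presence of the label.
LOG_PROBS = {"invoice": math.log(0.9), "refund": math.log(0.8),
             "weather": math.log(0.1)}
UNSEEN = math.log(0.5)   # neutral log-likelihood for features not in the table
THRESHOLD = 0.6          # illustrative decision threshold

def tokenize(document: str) -> list[str]:
    """Tokenizing operation: generate a tokenized sequence from the content."""
    return document.lower().split()

def generate_features(tokens: list[str]) -> list[str]:
    """Feature type: unigrams (a stand-in for the configured feature types)."""
    return tokens

def prediction_score(features: list[str]) -> float:
    """Combine the stored log-scale values into one score.

    Here the combination is a geometric mean of the per-feature likelihoods,
    i.e. exp of the average log-likelihood (converting back from log scale,
    as the claim requires of the processor).
    """
    logs = [LOG_PROBS.get(f, UNSEEN) for f in features]
    return math.exp(sum(logs) / len(logs))

def classify(document: str) -> bool:
    """Classify into the label by comparing the prediction score to a threshold."""
    score = prediction_score(generate_features(tokenize(document)))
    return score >= THRESHOLD

print(classify("Invoice refund requested"))  # strong features -> True
print(classify("Nice weather today"))        # weak features -> False
```

Because only a dictionary lookup and a few arithmetic operations are needed at inference time, a model of this shape runs comfortably on the small devices the abstract targets.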
Abstract
Systems, methods, and apparatuses are presented for storing a trained language model efficiently enough that it can be used on virtually any computing device to conduct natural language processing. Unlike natural language processing engines so computationally intensive that they can run only on high-performance machines, the organization of the natural language models according to the present disclosures allows natural language processing to be performed even on smaller devices, such as mobile devices.
17 Claims
1. A method for classifying a document in natural language processing using a natural language model stored in one or more data files (set out in full under "First Claim" above). - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
10. A natural language platform configured to classify a document in natural language processing using a natural language model stored in one or more data files, the natural language platform comprising:
a memory configured to store the one or more data files; and
a processor coupled to the memory and configured to:
access one or more feature types from the one or more data files, the one or more feature types each defining a data structure configured to access a tokenized sequence of the document and generate linguistic features from content within the tokenized sequence;
perform a tokenizing operation of the document, the tokenizing operation configured to generate one or more tokenized sequences from the content within the document;
generate a plurality of features for the document from the one or more tokenized sequences, based on parameters defined by the one or more feature types and on parameters defined in task configuration data in the one or more data files, the task configuration data associated with a type of task analysis that the natural language model is configured to classify the document into;
access a plurality of probabilities stored in the one or more data files, each probability among the plurality of probabilities associated with a feature among the plurality of features and defining a pre-computed likelihood that said feature predicts a presence or absence of a label that the document is to be classified into;
wherein:
the plurality of probabilities are pre-computed during a model training process configured to train the natural language model to classify documents according to at least said label and said task analysis;
the one or more data files are configured to store each probability in a logarithmic scale that is converted to said probability by the processor;
the one or more data files are configured to store a table of rows and columns, wherein a first column comprises the plurality of features, a second column comprises a first category of probabilities among the plurality of probabilities that describes a first likelihood that a feature in the first column belonging to the same row satisfies a first attribute of said label, and a third column comprises a second category of probabilities among the plurality of probabilities that describes a second likelihood that said feature in the first column belonging to the same row satisfies a second attribute of said label; and
the first attribute of said label represents a likelihood that said feature in the same row appears at a beginning of a span of the document, the second attribute of said label represents a likelihood that said feature in the same row appears inside said span of the document, and a fourth column comprises a third category of probabilities among the plurality of probabilities that represents a third likelihood that said feature in the same row appears outside said span of the document;
compute a prediction score indicating how likely the document is to be classified into said label, based on the plurality of probabilities;
classify the document into said label based on comparing the prediction score to a threshold; and
train the natural language model at least based on the classified document. - View Dependent Claims (11, 12, 13, 14, 15)
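The table of rows and columns recited in the wherein clauses maps naturally onto a flat file: one row per feature, with the begin/inside/outside columns holding log-scale values. Below is a hypothetical serialization and loader; the tab-separated format, column names, and all values are illustrative assumptions, not the patent's actual file layout. On load, each stored log-scale value is converted back into a probability, as the claim requires of the processor.

```python
import csv
import io
import math

# A hypothetical data-file serialization of the claimed table: rows are
# features, and the three probability columns give the log-scale likelihood
# that the feature appears at the Beginning of, Inside, or Outside a
# labeled span. Values are made up for illustration.
TABLE_TSV = """\
feature\tlog_p_begin\tlog_p_inside\tlog_p_outside
Mr.\t-0.11\t-3.00\t-2.30
Smith\t-2.30\t-0.22\t-1.90
the\t-2.99\t-2.10\t-0.15
"""

def load_table(text: str) -> dict[str, dict[str, float]]:
    """Load the table into memory, converting each stored log-scale value
    back into a probability via exp."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    table = {}
    for row in reader:
        feature = row.pop("feature")
        table[feature] = {k.removeprefix("log_"): math.exp(float(v))
                          for k, v in row.items()}
    return table

table = load_table(TABLE_TSV)
print(table["Mr."]["p_begin"])   # highest of the three for "Mr."
```

A plain-text table like this is trivially portable across platforms, which is consistent with the stated goal of running the model on small devices.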
-
-
16. A non-transitory computer-readable medium embodying instructions that, when executed by a processor, cause the processor to perform operations for classifying a document using a natural language model, the operations comprising:
accessing one or more feature types from a data file storing a natural language model, the one or more feature types each defining a data structure configured to access a tokenized sequence of the document and generate linguistic features from content within the tokenized sequence;
performing a tokenizing operation of the document, the tokenizing operation configured to generate one or more tokenized sequences from the content within the document;
generating a plurality of features for the document from the one or more tokenized sequences, based on parameters defined by the one or more feature types and on parameters defined in task configuration data in the data file, the task configuration data associated with a type of task analysis that the natural language model is configured to classify the document into;
accessing a plurality of probabilities stored in the data file, each probability among the plurality of probabilities associated with a feature among the plurality of features and defining a pre-computed likelihood that said feature predicts a presence or absence of a label that the document is to be classified into;
wherein:
the plurality of probabilities are pre-computed during a model training process configured to train the natural language model to classify documents according to at least said label and said task analysis;
the data file is configured to store each probability in a logarithmic scale that is converted to said probability by the processor;
the data file is configured to store a table of rows and columns, wherein a first column comprises the plurality of features, a second column comprises a first category of probabilities among the plurality of probabilities that describes a first likelihood that a feature in the first column belonging to the same row satisfies a first attribute of said label, and a third column comprises a second category of probabilities among the plurality of probabilities that describes a second likelihood that said feature in the first column belonging to the same row satisfies a second attribute of said label; and
the first attribute of said label represents a likelihood that said feature in the same row appears at a beginning of a span of the document, the second attribute of said label represents a likelihood that said feature in the same row appears inside said span of the document, and a fourth column comprises a third category of probabilities among the plurality of probabilities that represents a third likelihood that said feature in the same row appears outside said span of the document;
computing a prediction score indicating how likely the document is to be classified into said label, based on the plurality of probabilities;
classifying the document into said label based on comparing the prediction score to a threshold; and
training the natural language model at least based on the classified document. - View Dependent Claims (17)
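A practical reason for the log-scale storage recited in each independent claim can be shown directly: multiplying many small per-feature probabilities underflows IEEE-754 doubles, while summing their logarithms stays well within range. The values below are illustrative, not from the specification.

```python
import math

# 200 hypothetical features, each with a per-feature likelihood of 0.01.
probs = [0.01] * 200

# Naive combination in the probability domain: the running product passes
# below the smallest representable double and underflows to exactly 0.0.
direct_product = 1.0
for p in probs:
    direct_product *= p

# Log-domain combination, as the claimed data files store it: a sum of
# ordinary-magnitude negative numbers, with no underflow.
log_sum = sum(math.log(p) for p in probs)

print(direct_product)   # prints 0.0 (underflow)
print(log_sum)          # ~ -921.03, perfectly representable
```

The processor can defer the `exp` conversion back to a probability until after the per-feature values are combined, so the intermediate arithmetic never leaves the safe log domain.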
Specification