Methods and systems for providing universal portability in machine learning
First Claim
1. A method for classifying a document in natural language processing using a natural language model stored in one or more data files, the method comprising:
accessing, by a processor in a natural language processing platform using the natural language model, one or more feature types from the one or more data files, the one or more feature types each defining a data structure configured to access a tokenized sequence of the document and generate linguistic features from content within the tokenized sequence;
performing, by the processor in the natural language processing platform, a tokenizing operation of the document, the tokenizing operation configured to generate one or more tokenized sequences from the content within the document;
generating, by the processor in the natural language processing platform, a plurality of features for the document from the one or more tokenized sequences, based on parameters defined by the one or more feature types and on parameters defined in task configuration data in the one or more data files, the task configuration data associated with a type of task analysis that the natural language model is configured to classify the document into;
accessing, by the processor in the natural language processing platform, a plurality of probabilities stored in the one or more data files, each probability among the plurality of probabilities associated with a feature among the plurality of features and defining a pre-computed likelihood that said feature predicts a presence or absence of a label that the document is to be classified into;
wherein:
the plurality of probabilities are pre-computed during a model training process configured to train the natural language model to classify documents according to at least said label and said task analysis;
the one or more data files are configured to store each probability in a logarithmic scale that is converted to said probability by the processor;
the one or more data files are configured to store a table of rows and columns, wherein a first column comprises the plurality of features, a second column comprises a first category of probabilities among the plurality of probabilities that describes a first likelihood that a feature in the first column belonging to the same row satisfies a first attribute of said label, and a third column comprises a second category of probabilities among the plurality of probabilities that describes a second likelihood that said feature in the first column belonging to the same row satisfies a second attribute of said label; and
the first attribute of said label represents a likelihood that said feature in the same row appears at a beginning of a span of the document, the second attribute of said label represents a likelihood that said feature in the same row appears inside said span of the document, and a fourth column comprises a third category of probabilities among the plurality of probabilities that represents a third likelihood that said feature in the same row appears outside said span of the document;
computing, by the processor in the natural language processing platform, a prediction score indicating how likely the document is to be classified into said label, based on the plurality of probabilities;
classifying, by the processor in the natural language processing platform, the document into said label based on comparing the prediction score to a threshold; and
training the natural language model at least based on the classified document.
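Stripped of claim language, the recited method is a scoring pipeline: tokenize the document, generate features, look up pre-computed log-scale likelihoods, combine them into a prediction score, and compare that score to a threshold. The sketch below is a minimal illustration under assumed simplifications (unigram feature types, a tiny hard-coded log-probability table, a geometric-mean score); every name and value is hypothetical and not taken from the specification.

```python
import math

# Hypothetical pre-computed model data, standing in for the claimed "one or
# more data files": each feature maps to a log-scale likelihood, pre-computed
# during training, that the feature predicts presence of the label.
LOG_PROBS = {"invoice": math.log(0.9), "refund": math.log(0.8),
             "weather": math.log(0.1)}
UNSEEN = math.log(0.5)   # neutral log-likelihood for features not in the table
THRESHOLD = 0.6          # illustrative decision threshold

def tokenize(document: str) -> list[str]:
    """Tokenizing operation: generate a tokenized sequence from the content."""
    return document.lower().split()

def generate_features(tokens: list[str]) -> list[str]:
    """Feature type: unigrams (a stand-in for the configured feature types)."""
    return tokens

def prediction_score(features: list[str]) -> float:
    """Combine the stored log-scale values into one score.

    Here the combination is a geometric mean of the per-feature likelihoods,
    i.e. exp of the average log-likelihood (converting back from log scale,
    as the claim requires of the processor).
    """
    logs = [LOG_PROBS.get(f, UNSEEN) for f in features]
    return math.exp(sum(logs) / len(logs))

def classify(document: str) -> bool:
    """Classify into the label by comparing the prediction score to a threshold."""
    score = prediction_score(generate_features(tokenize(document)))
    return score >= THRESHOLD

print(classify("Invoice refund requested"))  # strong features -> True
print(classify("Nice weather today"))        # weak features -> False
```

Because only a dictionary lookup and a few arithmetic operations are needed at inference time, a model of this shape runs comfortably on the small devices the abstract targets.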
Abstract
Systems, methods, and apparatuses are presented for storing a trained language model efficiently enough that it can be used on virtually any computing device to conduct natural language processing. Unlike natural language processing engines so computationally intensive that they can run only on high-performance machines, the organization of the natural language models according to the present disclosures allows natural language processing to be performed even on smaller devices, such as mobile devices.
17 Claims
1. A method for classifying a document in natural language processing using a natural language model stored in one or more data files (set out in full under "First Claim" above). - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
10. A natural language platform configured to classify a document in natural language processing using a natural language model stored in one or more data files, the natural language platform comprising:
a memory configured to store the one or more data files; and
a processor coupled to the memory and configured to:
access one or more feature types from the one or more data files, the one or more feature types each defining a data structure configured to access a tokenized sequence of the document and generate linguistic features from content within the tokenized sequence;
perform a tokenizing operation of the document, the tokenizing operation configured to generate one or more tokenized sequences from the content within the document;
generate a plurality of features for the document from the one or more tokenized sequences, based on parameters defined by the one or more feature types and on parameters defined in task configuration data in the one or more data files, the task configuration data associated with a type of task analysis that the natural language model is configured to classify the document into;
access a plurality of probabilities stored in the one or more data files, each probability among the plurality of probabilities associated with a feature among the plurality of features and defining a pre-computed likelihood that said feature predicts a presence or absence of a label that the document is to be classified into;
wherein:
the plurality of probabilities are pre-computed during a model training process configured to train the natural language model to classify documents according to at least said label and said task analysis;
the one or more data files are configured to store each probability in a logarithmic scale that is converted to said probability by the processor;
the one or more data files are configured to store a table of rows and columns, wherein a first column comprises the plurality of features, a second column comprises a first category of probabilities among the plurality of probabilities that describes a first likelihood that a feature in the first column belonging to the same row satisfies a first attribute of said label, and a third column comprises a second category of probabilities among the plurality of probabilities that describes a second likelihood that said feature in the first column belonging to the same row satisfies a second attribute of said label; and
the first attribute of said label represents a likelihood that said feature in the same row appears at a beginning of a span of the document, the second attribute of said label represents a likelihood that said feature in the same row appears inside said span of the document, and a fourth column comprises a third category of probabilities among the plurality of probabilities that represents a third likelihood that said feature in the same row appears outside said span of the document;
compute a prediction score indicating how likely the document is to be classified into said label, based on the plurality of probabilities;
classify the document into said label based on comparing the prediction score to a threshold; and
train the natural language model at least based on the classified document. - View Dependent Claims (11, 12, 13, 14, 15)
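The table of rows and columns recited in the wherein clauses maps naturally onto a flat file: one row per feature, with the begin/inside/outside columns holding log-scale values. Below is a hypothetical serialization and loader; the tab-separated format, column names, and all values are illustrative assumptions, not the patent's actual file layout. On load, each stored log-scale value is converted back into a probability, as the claim requires of the processor.

```python
import csv
import io
import math

# A hypothetical data-file serialization of the claimed table: rows are
# features, and the three probability columns give the log-scale likelihood
# that the feature appears at the Beginning of, Inside, or Outside a
# labeled span. Values are made up for illustration.
TABLE_TSV = """\
feature\tlog_p_begin\tlog_p_inside\tlog_p_outside
Mr.\t-0.11\t-3.00\t-2.30
Smith\t-2.30\t-0.22\t-1.90
the\t-2.99\t-2.10\t-0.15
"""

def load_table(text: str) -> dict[str, dict[str, float]]:
    """Load the table into memory, converting each stored log-scale value
    back into a probability via exp."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    table = {}
    for row in reader:
        feature = row.pop("feature")
        table[feature] = {k.removeprefix("log_"): math.exp(float(v))
                          for k, v in row.items()}
    return table

table = load_table(TABLE_TSV)
print(table["Mr."]["p_begin"])   # highest of the three for "Mr."
```

A plain-text table like this is trivially portable across platforms, which is consistent with the stated goal of running the model on small devices.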
-
-
16. A non-transitory computer-readable medium embodying instructions that, when executed by a processor, cause the processor to perform operations for classifying a document using a natural language model, the operations comprising:
accessing one or more feature types from a data file storing a natural language model, the one or more feature types each defining a data structure configured to access a tokenized sequence of the document and generate linguistic features from content within the tokenized sequence;
performing a tokenizing operation of the document, the tokenizing operation configured to generate one or more tokenized sequences from the content within the document;
generating a plurality of features for the document from the one or more tokenized sequences, based on parameters defined by the one or more feature types and on parameters defined in task configuration data in the data file, the task configuration data associated with a type of task analysis that the natural language model is configured to classify the document into;
accessing a plurality of probabilities stored in the data file, each probability among the plurality of probabilities associated with a feature among the plurality of features and defining a pre-computed likelihood that said feature predicts a presence or absence of a label that the document is to be classified into;
wherein:
the plurality of probabilities are pre-computed during a model training process configured to train the natural language model to classify documents according to at least said label and said task analysis;
the data file is configured to store each probability in a logarithmic scale that is converted to said probability by the processor;
the data file is configured to store a table of rows and columns, wherein a first column comprises the plurality of features, a second column comprises a first category of probabilities among the plurality of probabilities that describes a first likelihood that a feature in the first column belonging to the same row satisfies a first attribute of said label, and a third column comprises a second category of probabilities among the plurality of probabilities that describes a second likelihood that said feature in the first column belonging to the same row satisfies a second attribute of said label; and
the first attribute of said label represents a likelihood that said feature in the same row appears at a beginning of a span of the document, the second attribute of said label represents a likelihood that said feature in the same row appears inside said span of the document, and a fourth column comprises a third category of probabilities among the plurality of probabilities that represents a third likelihood that said feature in the same row appears outside said span of the document;
computing a prediction score indicating how likely the document is to be classified into said label, based on the plurality of probabilities;
classifying the document into said label based on comparing the prediction score to a threshold; and
training the natural language model at least based on the classified document. - View Dependent Claims (17)
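A practical reason for the log-scale storage recited in each independent claim can be shown directly: multiplying many small per-feature probabilities underflows IEEE-754 doubles, while summing their logarithms stays well within range. The values below are illustrative, not from the specification.

```python
import math

# 200 hypothetical features, each with a per-feature likelihood of 0.01.
probs = [0.01] * 200

# Naive combination in the probability domain: the running product passes
# below the smallest representable double and underflows to exactly 0.0.
direct_product = 1.0
for p in probs:
    direct_product *= p

# Log-domain combination, as the claimed data files store it: a sum of
# ordinary-magnitude negative numbers, with no underflow.
log_sum = sum(math.log(p) for p in probs)

print(direct_product)   # prints 0.0 (underflow)
print(log_sum)          # ~ -921.03, perfectly representable
```

The processor can defer the `exp` conversion back to a probability until after the per-feature values are combined, so the intermediate arithmetic never leaves the safe log domain.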
Specification