Automatic extraction of named entities from texts
First Claim
1. A method comprising:
identifying, by a processor, a set of training texts;
extracting, by the processor, a respective set of features for each of the training texts;
training, by the processor, a classification model using the training texts and the extracted features;
extracting, by the processor, a token from a natural language text;
identifying, by the processor, a set of token attributes associated with the token based on a semantic-syntactic analysis of the natural language text, wherein the set of token attributes comprises at least one of a lexical attribute, a syntactic attribute, or a semantic attribute, and wherein the semantic-syntactic analysis of the natural language text comprises:
generating, by the processor, a lexical-morphological structure of a sentence of the natural language text;
identifying, by the processor, a syntactic tree using the lexical-morphological structure;
generating, by the processor, a language-independent semantic structure based on the syntactic tree; and
identifying, by the processor, the set of token attributes using the language-independent semantic structure;
determining, by the processor, a category for the token based on the trained classification model and the set of token attributes; and
generating, by the processor, a tagged representation of at least part of the natural language text, the tagged representation referencing the category for the token.
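The steps recited in claim 1 (train a classification model, derive token attributes, determine a category, emit a tagged representation) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `token_attributes`, `train`, `classify`, and `tag`, the `SEMANTIC_CLASS` lookup table, and the count-based scoring are hypothetical and are not taken from the patent. In particular, the lookup table only stands in for the claimed semantic-syntactic analysis, which would derive attributes from a language-independent semantic structure.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the claimed semantic-syntactic analysis: a real
# system would read attributes off a language-independent semantic structure;
# here a small lookup table plays that role.
SEMANTIC_CLASS = {"london": "CITY", "paris": "CITY", "alice": "HUMAN", "bob": "HUMAN"}


def token_attributes(tokens, i):
    """Return one lexical, one syntactic, and one semantic attribute for tokens[i]."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    return {
        "lex:capitalized=" + str(word[0].isupper()),
        "syn:prev=" + prev_word,
        "sem:class=" + SEMANTIC_CLASS.get(word.lower(), "NONE"),
    }


def train(training_texts):
    """Count how often each attribute co-occurs with each category."""
    model = defaultdict(Counter)
    for tokens, labels in training_texts:
        for i, label in enumerate(labels):
            model[label].update(token_attributes(tokens, i))
    return model


def classify(model, attrs):
    """Pick the category whose training counts best overlap the attributes."""
    return max(sorted(model), key=lambda cat: sum(model[cat][a] for a in attrs))


def tag(model, text):
    """Produce a tagged representation as (token, category) pairs."""
    tokens = text.split()
    return [(t, classify(model, token_attributes(tokens, i))) for i, t in enumerate(tokens)]


# Toy training texts with per-token categories ("O" marks a non-entity token).
TRAINING = [
    (["Alice", "lives", "in", "London"], ["PERSON", "O", "O", "LOCATION"]),
    (["Bob", "moved", "to", "Paris"], ["PERSON", "O", "O", "LOCATION"]),
]

model = train(TRAINING)
tagged = tag(model, "Alice moved to London")
print(tagged)
```

The count-based scorer was chosen only to keep the sketch dependency-free; the claim language does not restrict the classification model to any particular learning algorithm.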
Abstract
Disclosed are systems, computer-readable media, and methods for extracting named entities from an untagged corpus of texts. A set of attributes is generated for each token based at least on a deep semantic-syntactic analysis; the set includes lexical, syntactic, and semantic attributes. A subset of the attributes is selected for each token. Classifier attributes and categories are retrieved from a trained model, wherein the classifier attributes are related to one or more categories. The subset of attributes for each token is compared with the classifier attributes, and each token is classified into at least one of the categories based on the comparison. Tagged text is generated from the categorized tokens.
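The selection, comparison, and tagging steps summarized in the abstract can be sketched as follows. This is an illustrative assumption, not the disclosed implementation: the `CLASSIFIER_ATTRIBUTES` table, the prefix-based selection rule, the `min_overlap` threshold, and the inline `<CATEGORY>` markup are all hypothetical. The sketch also leaves tokens untagged when nothing matches well enough, which is a simplification of the abstract's wording.

```python
# Hypothetical "trained model": classifier attributes related to categories.
CLASSIFIER_ATTRIBUTES = {
    "PERSON": {"sem:HUMAN", "lex:capitalized"},
    "LOCATION": {"sem:CITY", "lex:capitalized"},
}

# Hypothetical selection rule: keep only lexical and semantic attributes.
SELECTED_PREFIXES = ("lex:", "sem:")


def select_subset(attributes):
    """Select the subset of a token's attributes used for comparison."""
    return {a for a in attributes if a.startswith(SELECTED_PREFIXES)}


def categorize(attributes, min_overlap=2):
    """Compare the subset with each category's classifier attributes.

    Returns the category with the largest overlap, or None when the best
    overlap is smaller than min_overlap.
    """
    subset = select_subset(attributes)
    best, best_score = None, 0
    for category in sorted(CLASSIFIER_ATTRIBUTES):
        score = len(subset & CLASSIFIER_ATTRIBUTES[category])
        if score > best_score:
            best, best_score = category, score
    return best if best_score >= min_overlap else None


def tagged_text(tokens_with_attributes):
    """Generate tagged text, wrapping each categorized token in inline markup."""
    parts = []
    for token, attributes in tokens_with_attributes:
        category = categorize(attributes)
        parts.append(f"<{category}>{token}</{category}>" if category else token)
    return " ".join(parts)


# Attributes are supplied by hand here; in the disclosed approach they would
# come from the deep semantic-syntactic analysis.
SENTENCE = [
    ("Alice", {"lex:capitalized", "sem:HUMAN", "syn:subject"}),
    ("visited", {"syn:predicate"}),
    ("Paris", {"lex:capitalized", "sem:CITY", "syn:object"}),
]

result = tagged_text(SENTENCE)
print(result)
```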
19 Claims
1. A method comprising:
identifying, by a processor, a set of training texts;
extracting, by the processor, a respective set of features for each of the training texts;
training, by the processor, a classification model using the training texts and the extracted features;
extracting, by the processor, a token from a natural language text;
identifying, by the processor, a set of token attributes associated with the token based on a semantic-syntactic analysis of the natural language text, wherein the set of token attributes comprises at least one of a lexical attribute, a syntactic attribute, or a semantic attribute, and wherein the semantic-syntactic analysis of the natural language text comprises:
generating, by the processor, a lexical-morphological structure of a sentence of the natural language text;
identifying, by the processor, a syntactic tree using the lexical-morphological structure;
generating, by the processor, a language-independent semantic structure based on the syntactic tree; and
identifying, by the processor, the set of token attributes using the language-independent semantic structure;
determining, by the processor, a category for the token based on the trained classification model and the set of token attributes; and
generating, by the processor, a tagged representation of at least part of the natural language text, the tagged representation referencing the category for the token.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system comprising:
a memory to store a natural language text; and
a processor, operatively coupled to the memory, to:
identify a set of training texts;
extract a respective set of features for each of the training texts;
train a classification model using the training texts and the extracted features;
extract a token from the natural language text;
identify a set of token attributes associated with the token based on a semantic-syntactic analysis of the natural language text, wherein the set of token attributes comprises at least one of a lexical attribute, a syntactic attribute, or a semantic attribute, and wherein to perform the semantic-syntactic analysis of the natural language text, the processor is to:
generate a lexical-morphological structure of a sentence of the natural language text;
identify a syntactic tree using the lexical-morphological structure;
generate a language-independent semantic structure based on the syntactic tree; and
identify the set of token attributes using the language-independent semantic structure;
determine a category for the token based on the trained classification model and the set of token attributes; and
generate a tagged representation of the natural language text, the tagged representation referencing the category for the token.
Dependent claims: 10, 11, 12, 13
14. A non-transitory computer readable medium having executable instructions stored thereon, the instructions causing a processor to:
identify a set of training texts;
extract a respective set of features for each of the training texts;
train a classification model using the training texts and the extracted features;
extract a token from a natural language text;
identify a set of token attributes associated with the token based on a semantic-syntactic analysis of the natural language text, wherein the set of token attributes comprises at least one of a lexical attribute, a syntactic attribute, or a semantic attribute, and wherein to perform the semantic-syntactic analysis of the natural language text, the processor is to:
generate a lexical-morphological structure of a sentence of the natural language text;
identify a syntactic tree using the lexical-morphological structure;
generate a language-independent semantic structure based on the syntactic tree; and
identify a set of token attributes using the language-independent semantic structure;
determine a category for the token based on the set of token attributes and the trained classification model; and
generate a tagged representation of the natural language text, the tagged representation referencing the category for the token.
Dependent claims: 15, 16, 17, 18, 19
Specification