Automatic extraction of named entities from texts

US 9,588,960 B2
Filed: 10/07/2014
Issued: 03/07/2017
Est. Priority Date: 01/15/2014
Status: Active Grant

5 Assignments

0 Petitions

Abstract

Disclosed are systems, computer-readable media, and methods for extracting named entities from an untagged corpus of texts. A set of attributes is generated for each token based at least on a deep semantic-syntactic analysis; the set includes lexical, syntactic, and semantic attributes. A subset of the attributes is selected for each token. Classifier attributes and categories are retrieved based on a trained model, wherein the classifier attributes are related to one or more categories. The subset of attributes for each token is compared with the classifier attributes, each token is classified into at least one of the categories based on the comparison, and tagged text is generated based on the categorized tokens.
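
The flow in the abstract (compare a token's attributes with classifier attributes, assign a category, emit tagged text) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the `MODEL` table, the attribute names, the overlap-count comparison, and the `token/CATEGORY` output format are hypothetical and are not taken from the patent, which relies on a trained classification model and a full semantic-syntactic analysis.

```python
# Hypothetical stand-in for a trained model: category -> classifier attributes.
MODEL = {
    "PERSON": {"capitalized", "semantic:HUMAN", "syntactic:subject"},
    "ORG": {"capitalized", "semantic:ORGANIZATION"},
}


def classify(token_attrs, model=MODEL, default="O"):
    """Pick the category whose classifier attributes overlap most with the token's."""
    best, best_overlap = default, 0
    for category, classifier_attrs in model.items():
        overlap = len(token_attrs & classifier_attrs)
        if overlap > best_overlap:
            best, best_overlap = category, overlap
    return best


def tag_text(tokens_with_attrs):
    """Build a tagged representation: 'token/CATEGORY' for entities, 'token' otherwise."""
    out = []
    for token, attrs in tokens_with_attrs:
        category = classify(attrs)
        out.append(token if category == "O" else f"{token}/{category}")
    return " ".join(out)


# Attribute sets are written by hand here; in the patent they come from the analysis.
sentence = [
    ("Alice", {"capitalized", "semantic:HUMAN", "syntactic:subject"}),
    ("joined", {"lowercase", "syntactic:predicate"}),
    ("Acme", {"capitalized", "semantic:ORGANIZATION"}),
]
tagged = tag_text(sentence)
print(tagged)
```

The overlap count is only a placeholder for the comparison step; any trained classifier that maps attribute sets to categories could take its place.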

60 Citations

19 Claims

1. A method comprising:
- identifying, by a processor, a set of training texts;
  
  extracting, by the processor, a respective set of features for each of the training texts;
  
  training, by the processor, a classification model using the training texts and the extracted features;
  
  extracting, by the processor, a token from a natural language text;
  
  identifying, by the processor, a set of token attributes associated with the token based on a semantic-syntactic analysis of the natural language text, wherein the set of token attributes comprises at least one of a lexical attribute, a syntactic attribute, or a semantic attribute, and wherein the semantic-syntactic analysis of the natural language text comprises:
  
  generating, by the processor, a lexical-morphological structure of a sentence of the natural language text;
  
  identifying, by the processor, a syntactic tree using the lexical-morphological structure;
  
  generating, by the processor, a language-independent semantic structure based on the syntactic tree; and
  
  identifying, by the processor, the set of token attributes using the language-independent semantic structure;
  
  determining, by the processor, a category for the token based on the trained classification model and the set of token attributes; and
  
  generating, by the processor, a tagged representation of at least part of the natural language text, the tagged representation referencing the category for the token.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, further comprising:
    - ranking the set of token attributes; and
      
      identifying a first subset of the set of token attributes based on the ranking.
  - 3. The method of claim 2, further comprising:
    - determining a first rating of the trained classification model using the first subset of token attributes;
      
      identifying a second subset of the set of token attributes, wherein the second subset comprises the first subset of token attributes and an additional token attribute;
      
      determining a second rating of the trained classification model using the second subset of token attributes; and
      
      selecting one of the first subset of token attributes or the second subset of token attributes based on the first rating and the second rating.
  - 4. The method of claim 3, wherein the first rating is based on at least one of:
    - a precision score, a recall score, or an F-score.
  - 5. The method of claim 2, further comprising:
    - determining a first rating of the trained classification model using the first subset of token attributes;
      
      identifying a second subset of the set of token attributes, wherein a number of token attributes in the second subset of token attributes is less than a number of token attributes in the first subset of token attributes;
      
      determining a second rating of the trained classification model using the second subset of token attributes; and
      
      selecting one of the first subset of token attributes or the second subset of token attributes based on the first rating and the second rating.
  - 6. The method of claim 5, wherein the first rating is based on at least one of:
    - a precision score, a recall score, or an F-score.
  - 7. The method of claim 1, further comprising combining a first token attribute and a second token attribute to form a third token attribute.
  - 8. The method of claim 1, wherein identifying the syntactic tree further comprises:
    - identifying a plurality of syntactic links in the natural language text using the lexical-morphological structure;
      
      identifying a plurality of syntactic trees based on the syntactic links;
      
      determining integral ratings of the plurality of syntactic trees; and
      
      identifying the syntactic tree based on the integral ratings.
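
Claims 3 through 6 describe rating the model on two candidate attribute subsets and keeping one of them, with the rating based on precision, recall, or F-score. The following Python sketch shows one way this could look. It is an illustration under stated assumptions: `PERSON_CUES`, `predict`, the `VALIDATION` data, and the rule "keep the second subset only if it rates strictly higher" are hypothetical and are not specified by the claims.

```python
def f_score(gold, predicted, positive="PERSON"):
    """Balanced F-score (F1) for one category, computed from precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical stand-in for a trained model: any visible cue means PERSON.
PERSON_CUES = {"capitalized", "semantic:HUMAN"}


def predict(token_attrs, subset):
    """Classify a token using only the attributes that belong to the subset."""
    visible = token_attrs & subset
    return "PERSON" if visible & PERSON_CUES else "O"


# Hypothetical held-out data: (token attributes, gold category).
VALIDATION = [
    ({"capitalized", "semantic:HUMAN"}, "PERSON"),
    ({"capitalized", "semantic:ORGANIZATION"}, "O"),
    ({"semantic:HUMAN"}, "PERSON"),
    ({"lowercase"}, "O"),
]


def rate(subset):
    """Rating of the model when it is restricted to the given attribute subset."""
    gold = [label for _attrs, label in VALIDATION]
    predicted = [predict(attrs, subset) for attrs, _label in VALIDATION]
    return f_score(gold, predicted)


def select_subset(first, second):
    """Keep the second subset only if its rating is strictly higher."""
    return second if rate(second) > rate(first) else first


first = frozenset({"capitalized"})
second = first | {"semantic:HUMAN"}  # claim 3: first subset plus one attribute
chosen = select_subset(first, second)
print(sorted(chosen), rate(first), rate(second))
```

The same `select_subset` comparison covers the variant in claim 5, where the second subset has fewer attributes than the first; only the way the second subset is constructed changes.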

9. A system comprising:
- a memory to store a natural language text; and
  
  a processor, operatively coupled to the memory, to:
  
  identify a set of training texts;
  
  extract a respective set of features for each of the training texts;
  
  train a classification model using the training texts and the extracted features;
  
  extract a token from the natural language text;
  
  identify a set of token attributes associated with the token based on a semantic-syntactic analysis of the natural language text, wherein the set of token attributes comprises at least one of a lexical attribute, a syntactic attribute, or a semantic attribute, and wherein to perform the semantic-syntactic analysis of the natural language text, the processor is to:
  
  generate a lexical-morphological structure of a sentence of the natural language text;
  
  identify a syntactic tree using the lexical-morphological structure;
  
  generate a language-independent semantic structure based on the syntactic tree; and
  
  identify the set of token attributes using the language-independent semantic structure;
  
  determine a category for the token based on the trained classification model and the set of token attributes; and
  
  generate a tagged representation of the natural language text, the tagged representation referencing the category for the token.
- Dependent claims (10, 11, 12, 13):
  - 10. The system of claim 9, wherein the processor is further to:
    - determine a first rating of the trained classification model using a first subset of the set of token attributes;
      
      identify a second subset of the set of token attributes, wherein the second subset comprises the first subset of token attributes and an additional token attribute;
      
      determine a second rating of the trained classification model using the second subset of token attributes; and
      
      select one of the first subset of token attributes or the second subset of token attributes based on the first rating and the second rating.
  - 11. The system of claim 9, wherein to identify the syntactic tree comprises:
    - identifying a plurality of syntactic links in the natural language text using the lexical-morphological structure;
      
      identifying a plurality of syntactic trees based on the syntactic links;
      
      determining integral ratings of the plurality of syntactic trees; and
      
      identifying the syntactic tree based on the integral ratings.
  - 12. The system of claim 10, wherein the first rating is based on at least one of:
    - a precision score, a recall score, or an F-score.
  - 13. The system of claim 9, wherein the processor is further to:
    - rank the set of token attributes; and
      
      identify a first subset of the set of token attributes based on the ranking.
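
Claim 11 (like claims 8 and 17) selects one syntactic tree from several candidates by comparing integral ratings. A minimal Python sketch of that selection step follows. The link ratings in `LINKS`, the definition of the integral rating as a sum of link ratings, and the one-head-per-dependent enumeration are assumptions made for illustration; the claims do not define how ratings are computed, and this sketch does not check candidates for cycles.

```python
from itertools import product

# Hypothetical syntactic links for "Alice saw Bob": (head, dependent, rating).
LINKS = [
    ("saw", "Alice", 9),
    ("saw", "Bob", 8),
    ("Alice", "Bob", 2),
]


def candidate_trees(links):
    """Enumerate candidates by choosing exactly one head link per dependent."""
    by_dependent = {}
    for link in links:
        by_dependent.setdefault(link[1], []).append(link)
    return [list(choice) for choice in product(*by_dependent.values())]


def integral_rating(tree):
    """Assumed definition: the sum of the ratings of the tree's links."""
    return sum(rating for _head, _dependent, rating in tree)


def best_tree(links):
    """Return the candidate tree with the highest integral rating."""
    return max(candidate_trees(links), key=integral_rating)


trees = candidate_trees(LINKS)
best = best_tree(LINKS)
print(len(trees), best, integral_rating(best))
```

Here "Bob" has two possible heads, so two candidate trees are produced, and the one attaching "Bob" to "saw" is selected because its total rating is higher.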

14. A non-transitory computer readable medium having executable instructions stored thereon, the instructions causing a processor to:
- identify a set of training texts;
  
  extract a respective set of features for each of the training texts;
  
  train a classification model using the training texts and the extracted features;
  
  extract a token from a natural language text;
  
  identify a set of token attributes associated with the token based on a semantic-syntactic analysis of the natural language text, wherein the set of token attributes comprises at least one of a lexical attribute, a syntactic attribute, or a semantic attribute, and wherein to perform the semantic-syntactic analysis of the natural language text, the processor is to:
  
  generate a lexical-morphological structure of a sentence of the natural language text;
  
  identify a syntactic tree using the lexical-morphological structure;
  
  generate a language-independent semantic structure based on the syntactic tree; and
  
  identify a set of token attributes using the language-independent semantic structure;
  
  determine a category for the token based on the set of token attributes and the trained classification model; and
  
  generate a tagged representation of the natural language text, the tagged representation referencing the category for the token.
- Dependent claims (15, 16, 17, 18, 19):
  - 15. The non-transitory computer-readable medium of claim 14, further comprising executable instructions causing the processor to:
    - determine a first rating of the trained classification model using a first subset of the set of token attributes;
      
      identify a second subset of the set of token attributes, wherein the second subset comprises the first subset of token attributes and an additional token attribute;
      
      determine a second rating of the trained classification model using the second subset of token attributes; and
      
      select one of the first subset of token attributes or the second subset of token attributes based on the first rating and the second rating.
  - 16. The non-transitory computer-readable medium of claim 14, further comprising executable instructions causing the processor to:
    - identify a plurality of syntactic links in the natural language text using the lexical-morphological structure;
      
      identify a plurality of syntactic trees based on the syntactic links;
      
      determine integral ratings of the plurality of syntactic trees; and
      
      identify the syntactic tree based on the integral ratings.
  - 17. The non-transitory computer-readable medium of claim 14, wherein to identify the syntactic tree comprises:
    - identifying a plurality of syntactic links in the natural language text using the lexical-morphological structure;
      
      identifying a plurality of syntactic trees based on the syntactic links;
      
      determining integral ratings of the plurality of syntactic trees; and
      
      identifying the syntactic tree based on the integral ratings.
  - 18. The non-transitory computer-readable medium of claim 15, wherein the first rating is based on at least one of:
    - a precision score, a recall score, or an F-score.
  - 19. The non-transitory computer-readable medium of claim 14, further comprising executable instructions causing the processor to:
    - rank the set of token attributes; and
      
      identify a first subset of the set of token attributes based on the ranking.
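
Claims 2, 13, and 19 rank the token attributes and take a first subset based on the ranking. The Python sketch below shows one possible ranking. The scoring rule (the absolute difference between how often an attribute appears on entity tokens and on other tokens), the `SAMPLES` data, and the top-k cutoff are assumptions for illustration only; the claims do not name a ranking criterion.

```python
def rank_attributes(samples):
    """Rank attributes by how well their presence separates entities from other tokens.

    samples: list of (attribute set, is_entity) pairs. Both classes are assumed
    to be non-empty. Ties are broken alphabetically so the order is deterministic.
    """
    entity = [attrs for attrs, is_entity in samples if is_entity]
    other = [attrs for attrs, is_entity in samples if not is_entity]
    all_attrs = set().union(*(attrs for attrs, _ in samples))

    def score(attr):
        p_entity = sum(attr in attrs for attrs in entity) / len(entity)
        p_other = sum(attr in attrs for attrs in other) / len(other)
        return abs(p_entity - p_other)

    return sorted(all_attrs, key=lambda attr: (-score(attr), attr))


# Hypothetical labelled tokens: (token attributes, whether the token is a named entity).
SAMPLES = [
    ({"capitalized", "semantic:BEING"}, True),
    ({"capitalized", "semantic:BEING"}, True),
    ({"capitalized"}, False),
    ({"lowercase"}, False),
]

ranked = rank_attributes(SAMPLES)
first_subset = ranked[:2]  # the first subset: the k best-ranked attributes, k = 2 here
print(ranked, first_subset)
```

A subset chosen this way could then serve as the starting point for the rating-based comparison of subsets described in the dependent claims.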

Current Assignee: ABBYY Development LLC
Original Assignee: ABBYY InfoPoisk LLC
Inventors: Nekhay, Ilya
Primary Examiner(s): Yang, Qian
Application Number: US14/508,419
Publication Number: US 20150199333A1
Time in Patent Office: 882 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 40/211   Syntactic parsing, e.g. bas...
G06F 40/268   Morphological analysis
G06F 40/295   Named entity recognition
G06F 40/30   Semantic analysis