Method and system for automating training of named entity recognition in natural language processing

US 10,558,754 B2
Filed: 03/29/2017
Issued: 02/11/2020
Est. Priority Date: 09/15/2016
Status: Active Grant

Abstract

A method and system for automating training of named entity recognition in natural language processing to build configurable entity definitions include receiving input documents or entities through an administration module and defining a domain for each entity. Further, one or more entities corresponding to the domain-specific entity in the received documents are determined, and a training file is generated to one of pick a right parser, extract content, and label the entity ambiguity. One or more user actions are collected and maintained in a repository through a knowledge engine. Still further, one or more labelled ambiguous words are predicted and the knowledge engine is updated. Data may be fetched through a training pipeline execution engine, and each entity may be associated with one or more documents based on the fetched data from the document store to build configurable entity definitions.
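
The association step in the last sentence of the abstract can be pictured with a minimal sketch. This is an illustration only, not the patented implementation: the `document_store` dictionary, the `associate_entities` function, and the case-insensitive substring match are all assumptions, since the abstract does not say how the association is computed.

```python
from typing import Dict, List


def associate_entities(
    entities: List[str],
    document_store: Dict[str, str],
) -> Dict[str, List[str]]:
    """Map each entity to the sorted ids of stored documents that mention it.

    Assumption: a document "mentions" an entity when the entity name occurs
    in the document text as a case-insensitive substring.
    """
    associations: Dict[str, List[str]] = {}
    for entity in entities:
        needle = entity.lower()
        associations[entity] = sorted(
            doc_id
            for doc_id, text in document_store.items()
            if needle in text.lower()
        )
    return associations


# Hypothetical document store keyed by document id.
store = {
    "d1": "Infosys filed a patent.",
    "d2": "The patent covers entity recognition.",
    "d3": "Nothing relevant here.",
}
links = associate_entities(["Infosys", "patent"], store)
```

A real system would likely use the trained model rather than string matching; the substring test only keeps the sketch self-contained.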

54 Citations

17 Claims

1. A method to automate training named entity recognition in natural language processing to build configurable entity definitions, the method comprising:
- receiving at least one input document or one or more entities through an administration module;
- defining a domain for each of the received entities or the at least one input document through the administration module;
- determining one or more entities corresponding to a domain specific entity in the at least one input document;
- generating a training file;
- via the training file, picking a right parser;
- via the training file, extracting content from the input document;
- via the training file, labeling entity ambiguity, whereby a single training file is used to pick the right parser, extract content from the input document, and label entity ambiguity;
- collecting and maintaining, through a knowledge engine, at least one user action in a knowledge repository, wherein the collecting comprises resolution of the entity ambiguity and comprises:
  - displaying a plurality of confirmation blocks containing excerpts appearing in the input document, wherein the excerpts contain an unclassified, ambiguous named entity associated with the entity ambiguity and surrounding text as the surrounding text appears in the input document, wherein the unclassified, ambiguous named entity is ambiguous because its domain overlaps with more than one domain,
  - displaying a proposed specific domain for the excerpts, wherein a single proposed specific domain is displayed for more than one of the excerpts,
  - within the confirmation blocks, displaying user interface elements for confirmation or rejection of a given excerpt out of the excerpts as belonging to the single proposed specific domain,
  - receiving activation of one user interface element of the user interface elements, thereby resolving the entity ambiguity by indicating that text in the given excerpt out of the excerpts does or does not belong to the single proposed specific domain, and
  - updating the knowledge engine with the resolved entity ambiguity;
- predicting, through the knowledge engine, one or more labelled ambiguous entities;
- fetching, through a training pipeline execution engine, data stored on a document store; and
- associating, through the training pipeline execution engine, each entity with one or more documents based on the fetched data from the document store to build configurable entity definitions;
- wherein the act of generating the training file comprises:
  - extracting text from the input document;
  - determining a definition of the extracted text to be ambiguous or unambiguous; and
  - based on whether the definition of the extracted text is determined to be ambiguous or unambiguous, switching between (a) and (b):
    - (a) adding the extracted text to the training file when the definition is determined to be unambiguous, and
    - (b) prompting a user to resolve ambiguity, and adding the resolution of the ambiguity to the training file when the definition is determined to be ambiguous.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, wherein the entities are defined in the knowledge repository; and wherein a trained model is generated through the training file.
  - 3. The method of claim 1, wherein the knowledge repository comprises an ontology knowledge base.
  - 4. The method of claim 1, wherein the entities are ranked through a relevance score.
  - 5. The method of claim 1, wherein the user action is collected through a user interface associated with at least one of a web and mobile application.
  - 6. The method of claim 1, wherein the knowledge engine is updated at pre-defined intervals.
  - 7. The method of claim 1, wherein updating the knowledge engine with the resolved entity ambiguity comprises updating the knowledge engine with information that permits resolution of later-encountered ambiguities.
  - 8. The method of claim 1, wherein the fetched data from the document store comprises one or more entity definition documents, the respective entity definition documents defining a usage of an entity in different contexts.
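
The switch between (a) and (b) at the end of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the `domain_lexicon` mapping, the `resolve` callback that stands in for prompting a user, and the record layout of the training file are hypothetical names chosen for illustration, and "ambiguous" is assumed here to mean that a term maps to more than one domain.

```python
from typing import Callable, Dict, List, Set


def build_training_file(
    terms: List[str],
    domain_lexicon: Dict[str, Set[str]],
    resolve: Callable[[str, Set[str]], str],
) -> List[dict]:
    """Return training records, calling `resolve` only for ambiguous terms."""
    records: List[dict] = []
    for term in terms:
        domains = domain_lexicon.get(term, set())
        if not domains:
            # Unknown term: out of scope for this sketch, so it is skipped.
            continue
        if len(domains) == 1:
            # (a) unambiguous: add the extracted text directly.
            (domain,) = domains
            records.append({"text": term, "domain": domain, "source": "auto"})
        else:
            # (b) ambiguous: ask the user, then store the resolution.
            chosen = resolve(term, domains)
            if chosen not in domains:
                raise ValueError(f"{chosen!r} is not a candidate domain for {term!r}")
            records.append({"text": term, "domain": chosen, "source": "user"})
    return records


# Hypothetical lexicon; the lambda plays the role of the user's answer.
lexicon = {"Apple": {"company", "fruit"}, "Infosys": {"company"}}
training_file = build_training_file(
    ["Infosys", "Apple", "xyz"], lexicon, lambda term, domains: "fruit"
)
```

Passing the prompt in as a callback keeps the branch logic testable without a user interface; the `source` field is an added convenience for telling automatic entries from user resolutions.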

9. A system to automate training named entity recognition in natural language processing to build configurable entity definitions comprising:
- a cluster computing network with one or more communicatively coupled nodes;
- a distributed messaging system;
- a distributed data warehouse;
- a data processing engine;
- an analytical engine;
- at least one processor; and
- at least one memory unit operatively coupled to the at least one processor communicatively coupled over the cluster computing network and having instructions stored thereon that, when executed by the at least one processor, cause the at least one processor to:
  - receive at least one input document or entity through an administration module associated with the distributed messaging system;
  - define a domain for each of the received entity or the at least one input document through the administration module in a document set;
  - determine one or more entities, through the data processing engine, corresponding to a domain specific entity in the document set;
  - generate, through the analytical engine, a training file;
  - via the training file, pick a right parser;
  - via the training file, extract content from the input document;
  - via the training file, label entity ambiguity, whereby a single training file is used to pick the right parser, extract content from the input document, and label entity ambiguity;
  - collect and maintain, through a knowledge engine associated with the distributed data warehouse, at least one user action in a knowledge repository, wherein the collecting comprises resolution of the entity ambiguity and comprises:
    - displaying a plurality of confirmation blocks containing excerpts appearing in the input document, wherein the excerpts contain an unclassified, ambiguous named entity associated with the entity ambiguity and surrounding text as the surrounding text appears in the input document, and wherein the unclassified, ambiguous named entity is ambiguous because its domain overlaps with more than one domain;
    - displaying a proposed specific domain for the excerpts, wherein a single proposed specific domain is displayed for more than one of the excerpts,
    - within the confirmation blocks, displaying user interface elements for confirmation or rejection of a given excerpt out of the excerpts as belonging to the single proposed specific domain,
    - receiving activation of one user interface element of the user interface elements, thereby resolving the entity ambiguity by indicating that text in the given excerpt out of the excerpts does or does not belong to the single proposed specific domain, and
    - updating the knowledge engine, through the cluster computing network, with the resolved entity ambiguity;
  - predict, through the knowledge engine and the analytical engine, one or more labelled ambiguous entities;
  - fetch, through a training pipeline execution engine, data stored on a document store; and
  - associate, through the training pipeline execution engine, each entity with one or more documents based on a trained model generated through the training file and the fetched data from the document store to build configurable entity definitions;
- wherein the analytical engine is configured to generate the training file by:
  - extracting text from the input document;
  - determining a definition of the extracted text to be ambiguous or unambiguous; and
  - based on whether the definition of the extracted text is determined to be ambiguous or unambiguous, switching between (a) and (b):
    - (a) adding the extracted text to the training file when the definition is determined to be unambiguous, and
    - (b) prompting a user to resolve ambiguity, and adding the resolution of the ambiguity to the training file when the definition is determined to be ambiguous.
- Dependent claims (10, 11, 12, 13, 14, 15):
  - 10. The system of claim 9, wherein the entities are defined in a knowledge repository.
  - 11. The system of claim 10, wherein the knowledge repository is an ontology knowledge base.
  - 12. The system of claim 9, wherein the entities are ranked through a relevance score.
  - 13. The system of claim 9, wherein the user action is collected through a user interface associated with at least one of a web and mobile application.
  - 14. The system of claim 9, wherein the knowledge engine is updated at pre-defined intervals.
  - 15. The system of claim 9, wherein the cluster computing network is associated with a varying number of nodes.
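
The confirmation blocks recited in claims 1 and 9 can be sketched as a small data structure. This is an illustration under stated assumptions, not the claimed user interface: the `ConfirmationBlock` class, the `make_blocks` and `activate` functions, and the fixed character `window` used for the surrounding text are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConfirmationBlock:
    entity: str
    excerpt: str          # entity plus surrounding text, verbatim from the document
    proposed_domain: str  # the single domain proposed for every block
    confirmed: Optional[bool] = None  # None until the user confirms or rejects


def make_blocks(
    document: str, entity: str, proposed_domain: str, window: int = 30
) -> List[ConfirmationBlock]:
    """Build one block per occurrence of `entity`, all sharing one proposed domain."""
    blocks = []
    for match in re.finditer(re.escape(entity), document):
        start = max(0, match.start() - window)
        end = min(len(document), match.end() + window)
        blocks.append(ConfirmationBlock(entity, document[start:end], proposed_domain))
    return blocks


def activate(block: ConfirmationBlock, confirm: bool) -> ConfirmationBlock:
    """Record the user's choice: True confirms the domain, False rejects it."""
    block.confirmed = confirm
    return block


text = "Apple shares rose. She ate an Apple pie."
blocks = make_blocks(text, "Apple", "company", window=10)
activate(blocks[0], True)   # first excerpt belongs to the proposed domain
activate(blocks[1], False)  # second excerpt does not
```

Each resolved block could then be handed to the knowledge engine; how the engine stores it is outside this sketch.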

16. One or more machine-readable media comprising machine-executable instructions causing a machine to perform a method automating training named entity recognition in natural language processing to process configurable named entity definitions defining respective defined named entities, the method comprising:
- receiving one or more input documents and the named entity definitions, wherein the named entity definitions specify defined named entities as named people, named places, or named things, and the input documents comprise unclassified occurrences of the defined named entities;
- receiving specific domain definitions of specific domains for classifying the defined named entities;
- in the input documents, identifying a named entity as a named entity ambiguity, wherein the identifying comprises determining that the named entity ambiguity belongs to any one of a plurality of candidate domains out of the specific domains;
- generating a training file to pick a right parser, extract content and label the named entity ambiguity, whereby a single training file is used to pick the right parser, extract content, and label entity ambiguity;
- collecting a resolution of the named entity ambiguity for the named entity ambiguity through a user action via a user interface, wherein the resolution indicates a particular one of the candidate domains, and wherein the collecting comprises:
  - displaying a plurality of confirmation blocks containing excerpts appearing in the input documents, wherein the excerpts contain an unclassified, ambiguous named entity associated with the named entity ambiguity and surrounding text as the surrounding text appears in an input document, and wherein the unclassified, ambiguous named entity is ambiguous because its domain overlaps with more than one domain;
  - displaying a proposed specific domain out of the specific domains for the excerpts, wherein a single proposed specific domain is displayed for more than one of the excerpts,
  - within the confirmation blocks, displaying user interface elements for confirmation or rejection of a given excerpt out of the excerpts as belonging to the single proposed specific domain, and
  - receiving activation of one user interface element of the user interface elements, thereby resolving the named entity ambiguity by indicating that text in the given excerpt out of the excerpts does or does not belong to the single proposed specific domain;
- sending the collected resolution of the named entity ambiguity, the named entity ambiguity, and context for the named entity ambiguity to a knowledge engine for incorporation into the configurable named entity definitions;
- storing the collected resolution of the named entity ambiguity and incorporating the collected resolution of the named entity into the configurable named entity definitions; and
- after incorporating the collected resolution into the configurable named entity definitions, predicting, through the knowledge engine, a correct classification of an unclassified named entity in the input documents;
- wherein the act of generating the training file comprises:
  - extracting text from the input document;
  - determining a definition of the extracted text to be ambiguous or unambiguous; and
  - based on whether the definition of the extracted text is determined to be ambiguous or unambiguous, switching between (a) and (b):
    - (a) adding the extracted text to the training file when the definition is determined to be unambiguous, and
    - (b) prompting a user to resolve ambiguity, and adding the resolution of the ambiguity to the training file when the definition is determined to be ambiguous.
- Dependent claims (17):
  - 17. The one or more machine-readable media of claim 16, wherein defining specific domains for the defined named entities is based on a received input of one or more domains associated with the configurable named entity definitions or a domain definition document associated with an administration module.
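
Claim 16 has resolutions sent to a knowledge engine together with their context and later used to predict a classification. The claims do not name a prediction technique, so the sketch below swaps in a deliberately simple one: counting overlapping context words. The `KnowledgeEngine` class and its `add_resolution` and `predict` methods are hypothetical names for illustration only.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional


class KnowledgeEngine:
    """Toy store of resolved ambiguities, keyed by entity and domain."""

    def __init__(self) -> None:
        # entity -> domain -> counts of context words from stored resolutions
        self._evidence: Dict[str, Dict[str, Counter]] = defaultdict(
            lambda: defaultdict(Counter)
        )

    def add_resolution(self, entity: str, context: str, domain: str) -> None:
        """Store a user's resolution together with its surrounding text."""
        self._evidence[entity][domain].update(context.lower().split())

    def predict(self, entity: str, context: str) -> Optional[str]:
        """Return the domain whose stored contexts overlap most, or None."""
        domains = self._evidence.get(entity)
        if not domains:
            return None
        words = context.lower().split()
        scores = {d: sum(counts[w] for w in words) for d, counts in domains.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None


engine = KnowledgeEngine()
engine.add_resolution("Apple", "Apple shares rose on earnings", "company")
engine.add_resolution("Apple", "ate an Apple pie", "fruit")
guess = engine.predict("Apple", "earnings lifted Apple shares")
```

Returning `None` when nothing overlaps leaves room for the case where the entity should go back to a user for resolution instead of being classified automatically.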

Current Assignee: Infosys Limited
Original Assignee: Infosys Limited
Inventors: Razack, Abdul; Dasgupta, Sudipto; Rao, Mayoor; Kuriakose, John
Primary Examiner(s): Wozniak, James S
Application Number: US15/473,424
Publication Number: US 20180075013A1
Time in Patent Office: 1,049 Days
Field of Search: 704/1, 704/9, 707/737, 707/738, 707/755, 707/771

CPC Class Codes

G06F 40/295   Named entity recognition

G06F 40/30   Semantic analysis

G06N 20/00   Machine learning

G06N 5/02   Knowledge representation; S...

G06N 5/022   Knowledge engineering; Know...
