System and method for entity extraction from semi-structured text documents

US 10,489,439 B2
Filed: 04/14/2016
Issued: 11/26/2019
Est. Priority Date: 04/14/2016
Status: Active Grant

Abstract

A method for extracting entities from a text document includes, for at least a section of a text document, providing a first set of entities extracted from the at least a section and clustering at least a subset of the extracted entities in the first set into clusters, based on locations of the entities in the document. Complete ones of the clusters of entities are identified. Patterns for extracting new entities are learned based on the complete clusters. New entities are extracted from incomplete clusters based on the learned patterns.
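
To make the "learn from complete clusters, then apply to incomplete clusters" idea concrete, here is a minimal Python sketch. It is a deliberately simplified stand-in and not the patented method: instead of a trained sequence model it only counts the entity-class sequences seen in complete clusters and reports which classes an incomplete cluster appears to lack. The function names, the `(text, label)` tuple representation, and the class names are all assumptions made for illustration.

```python
from collections import Counter


def learn_class_patterns(complete_clusters):
    """Count the entity-class sequences observed in complete clusters.

    Each cluster is assumed to be a list of (text, label) tuples.
    """
    return Counter(
        tuple(label for _, label in cluster) for cluster in complete_clusters
    )


def missing_classes(incomplete_cluster, patterns):
    """List classes in the most frequent pattern that the cluster lacks."""
    if not patterns:
        return []
    best_pattern, _ = patterns.most_common(1)[0]
    present = {label for _, label in incomplete_cluster}
    return [label for label in best_pattern if label not in present]


if __name__ == "__main__":
    complete = [
        [("2014", "DATE"), ("Engineer", "TITLE"), ("Xerox", "ORG")],
        [("2012", "DATE"), ("Analyst", "TITLE"), ("Acme", "ORG")],
    ]
    patterns = learn_class_patterns(complete)
    print(missing_classes([("2010", "DATE"), ("Initech", "ORG")], patterns))
```

In this toy form the output only says which class to look for near an incomplete cluster; an actual extractor would still have to locate the matching text span.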

20 Claims

1. An automated method for extracting entities from a text document comprising:
- for at least a section of a text document, extracting a first set of entities in predefined classes of entity from the at least a section, the extraction of the first set of entities comprising at least one of a rule-based extraction method and a probabilistic extraction method;
  
  identifying a location of each of the extracted entities in the at least a section of the document;
  
  clustering at least a subset of the extracted entities in the first set into clusters, based on the identified locations of the entities in the document;
  
  identifying complete clusters of entities and incomplete clusters of entities from the clusters, based on correlations observed between sequences of entities in the clusters and a number of the classes of entity within each entity cluster;
  
  learning patterns for extracting new entities based on the complete clusters; and
  
  extracting new entities from the incomplete clusters based on the learned patterns, wherein the extracting of the first set of entities, identifying complete clusters, learning patterns, and extracting new entities are performed with a processor device.
- Dependent claims (2 to 17):
  - 2. The method of claim 1, wherein the text document is a resume.
  - 3. The method of claim 1, further comprising segmenting the document into sections and performing the extracting of entities, identifying complete clusters, learning patterns and extracting new entities for one of the sections.
  - 4. The method of claim 1, wherein the extraction of the first set of entities includes accessing a lexicon of entities to identify text sequences in the section which each match a respective entity in the lexicon.
  - 5. The method of claim 1, wherein the extracted entities are each labeled with an entity class from the predefined classes of entity classes.
  - 6. The method of claim 1, wherein the clustering includes ordering the extracted entities based on their locations in the document, initializing a cluster with a first of the ordered entities, adding a next entity to the first cluster if the distance to a representative location in the cluster is less than a threshold distance, and recomputing the cluster representative location, otherwise if the distance is greater than the threshold, initializing a next cluster.
  - 7. The method of claim 1, wherein the identifying complete clusters of entities from the clusters includes identifying clusters which include at least a threshold number of entity classes.
  - 8. The method of claim 7, wherein the threshold number is determined by identifying correlations between entities of the different classes occurring in a set of documents or document sections.
  - 9. The method of claim 7, wherein a threshold number of entity classes is defined for each section for a cluster in that section to be considered complete.
  - 10. The method of claim 1, wherein the learning patterns comprises training a CRF model based on the complete clusters and the extracting new entities based on the learned patterns comprises predicting new entities in incomplete clusters based on the trained CRF model.
  - 11. The method of claim 1, wherein the learning patterns comprises, for each of the clusters in the set of complete clusters, for a window of text which includes the cluster, chunking the text using a set of rules to generate a sequence of chunks and extracting features of the chunks in the sequence, the features being used to learn the patterns.
  - 12. The method of claim 1, further comprising, after extracting new entities based on the learned patterns, identifying new complete clusters which include the new entities and repeating the learning of patterns and extracting the new entities with the new complete clusters.
  - 13. The method of claim 1, further comprising, after extracting new entities based on the learned patterns, if incomplete clusters remain, applying at least one of:
    - a) a back-off model trained on information extracted from other documents, and
    - b) pseudo-relevance feedback,
      
      to identify additional new entities.
  - 14. The method of claim 13, wherein the back-off model is a CRF model.
  - 15. The method of claim 1, further comprising outputting the extracted new entities or information based thereon.
  - 16. A computer program product comprising a non-transitory storage medium storing instructions, which when executed on a computer, causes the computer to perform the method of claim 1.
  - 17. A system comprising memory which stores instructions for performing the method of claim 1 and a processor in communication with the memory for executing the instructions.
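
The location-based clustering recited in claim 6 can be sketched roughly as follows. This is an illustrative reading, not the patentee's implementation: locations are assumed to be integer offsets, the representative location is assumed to be the mean position of the cluster members, and the `Entity` and `Cluster` types, the labels, and the threshold value are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    text: str
    label: str     # entity class; names such as "ORG" are illustrative
    position: int  # location in the section, assumed to be an offset


@dataclass
class Cluster:
    entities: list = field(default_factory=list)

    @property
    def representative(self) -> float:
        # Representative location, assumed here to be the mean position.
        return sum(e.position for e in self.entities) / len(self.entities)


def cluster_by_location(entities, threshold):
    """Greedy one-pass clustering over entities ordered by location."""
    clusters = []
    for entity in sorted(entities, key=lambda e: e.position):
        current = clusters[-1] if clusters else None
        if (current is not None
                and abs(entity.position - current.representative) < threshold):
            # Close enough: join the current cluster. The representative is
            # recomputed on the next access because it is a property.
            current.entities.append(entity)
        else:
            # First entity, or too far from the current cluster: start anew.
            clusters.append(Cluster([entity]))
    return clusters


if __name__ == "__main__":
    found = [
        Entity("Xerox", "ORG", 3),
        Entity("2014", "DATE", 0),
        Entity("Engineer", "TITLE", 5),
        Entity("Acme", "ORG", 40),
        Entity("2010", "DATE", 42),
    ]
    clusters = cluster_by_location(found, threshold=10)
    print([len(c.entities) for c in clusters])  # -> [3, 2]
```

Because the entities are sorted first, only the most recent cluster needs to be checked, which keeps the pass linear after sorting.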

18. A system for extracting entities from text documents comprising:
- a first entity extraction component for extracting a first set of entities from at least a section of a text document, each of the extracted entities being in one of a predefined set of entity classes;
  
  a second entity extraction component for extraction of new entities from the at least the section of the text document, the second entity extraction component comprising:
  
    a clustering component for clustering at least a subset of the extracted entities in the first set into clusters, based on locations of the entities in the document and their entity classes,
  
    a cluster completeness component for identifying complete clusters of entities and incomplete clusters of entities from the clusters, based on correlations observed between sequences of entity classes in the clusters, and
  
    a pattern recognition component for learning patterns of the entity classes for extracting new entities based on the complete clusters and extracting new entities from the incomplete clusters based on the learned patterns; and
  
  a processor for implementing the first and second entity extraction components, clustering component, cluster completeness component, and pattern recognition component.
- Dependent claims (19):
  - 19. The system of claim 18, further comprising at least one of:
    - a segmentation component for segmenting the document into sections;
      
      a chunking component which, for each of the clusters in the set of complete clusters, for a window of text which includes the cluster, chunks the text using a set of rules to generate a sequence of chunks and extracts features of the chunks in the sequence, the features being used to learn the patterns; and
      
      an output component which outputs the extracted new entities or information based thereon.
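
One simple way to picture the cluster completeness component is the class-count test of claim 7: a cluster counts as complete when it contains at least a threshold number of distinct entity classes. The sketch below assumes clusters are lists of `(text, label)` tuples and that the threshold is supplied by the caller; how that threshold is derived from observed correlations (claim 8) is not modeled here.

```python
def split_by_completeness(clusters, min_classes):
    """Partition clusters into (complete, incomplete).

    A cluster is treated as complete when it holds at least `min_classes`
    distinct entity classes. `min_classes` is a hypothetical parameter.
    """
    complete, incomplete = [], []
    for cluster in clusters:
        distinct_classes = {label for _, label in cluster}
        if len(distinct_classes) >= min_classes:
            complete.append(cluster)
        else:
            incomplete.append(cluster)
    return complete, incomplete


if __name__ == "__main__":
    clusters = [
        [("2014", "DATE"), ("Engineer", "TITLE"), ("Xerox", "ORG")],
        [("2010", "DATE"), ("Acme", "ORG")],
    ]
    complete, incomplete = split_by_completeness(clusters, min_classes=3)
    print(len(complete), len(incomplete))  # -> 1 1
```

Claim 9 suggests the threshold may differ per document section, which in this sketch would simply mean calling the function with a different `min_classes` for each section.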

20. A method for extracting entities from a resume comprising:
- segmenting the resume into sections;
  
  extracting a first set of entities and respective entity class labels from the section with at least one of grammar rules, a probabilistic model, and a lexicon;
  
  clustering at least a subset of the extracted entities in the first set into clusters, based on locations of the entities in the resume;
  
  identifying complete clusters of entities and incomplete clusters of entities from the clusters, based on correlations observed between sequences of entity class labels in the clusters;
  
  learning patterns for extracting new entities based on the class labels of the entities in the complete clusters;
  
  extracting new entities from the incomplete clusters, based on the learned patterns; and
  
  outputting information based on the extracted new entities in the resume, wherein the segmenting, extracting the first set of entities, clustering, identifying complete clusters, learning patterns, and extracting new entities are performed with a processor device.
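
The first-pass extraction step in claim 20 (a lexicon and grammar rules, with locations recorded for the later clustering) might look something like the following sketch. The lexicon contents, the single year rule, and the use of character offsets as locations are assumptions chosen for illustration; the probabilistic option named in the claim is not shown.

```python
import re

# Hypothetical lexicon mapping surface strings to entity classes.
LEXICON = {"Xerox": "ORG", "Software Engineer": "TITLE"}

# Hypothetical rule: a four-digit year is treated as a DATE entity.
YEAR_RULE = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_first_pass(section):
    """Return (start_offset, text, label) tuples sorted by location."""
    found = []
    # Lexicon lookup: every literal occurrence of a lexicon entry.
    for phrase, label in LEXICON.items():
        for match in re.finditer(re.escape(phrase), section):
            found.append((match.start(), match.group(), label))
    # Rule-based lookup: anything matching the year pattern.
    for match in YEAR_RULE.finditer(section):
        found.append((match.start(), match.group(), "DATE"))
    return sorted(found)


if __name__ == "__main__":
    for start, text, label in extract_first_pass("2014 Software Engineer, Xerox"):
        print(start, text, label)
```

Keeping the start offset with each entity is what allows the subsequent clustering to work purely from locations.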

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Calapodescu, Ioan; Guerin, Nicolas; Jacques, Fanchon
Primary Examiner(s): Channavajjala, Srirama
Application Number: US15/098,856
Publication Number: US 20170300565A1
Time in Patent Office: 1,321 Days
CPC Class Codes

G06F 16/278   Data partitioning, e.g. hor...

G06F 16/30   of unstructured textual dat...

G06F 16/3325   Reformulation based on resu...

G06F 16/3344   using natural language anal...

G06F 16/35   Clustering; Classification

G06F 16/353   into predefined classes

G06F 16/93   Document management systems

G06N 20/00   Machine learning

G06N 7/01   Probabilistic graphical mod...
