HIGH-PRECISION LIMITED SUPERVISION RELATIONSHIP EXTRACTOR

US 20160098645A1
Filed: 10/02/2014
Published: 04/07/2016
Est. Priority Date: 10/02/2014
Status: Abandoned Application

4 Assignments

0 Petitions

Abstract

Automatic relationship extraction is provided. A machine learning approach using statistical entity-type prediction and relationship prediction models built from large unlabeled datasets is interactively combined with minimal human intervention and a light pattern-based approach to extract relationships from unstructured, semi-structured, and structured documents. Training data is collected from a collection of unlabeled documents by matching ground truths for a known entity from existing fact databases with text in the documents describing the known entity, and corresponding models are built for one or more relationship types. For a modeled relationship type, text chunks of interest are found in a document. A machine learning classifier predicts the probability that one of the text chunks is the entity being sought. The combined machine learning and light pattern-based approach provides both improved recall and high precision through filtering and allows constraining and normalization of the extracted relationships.
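
The flow the abstract describes (find text chunks of interest, score each with a classifier, keep the high-probability ones) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `year_of_birth` relationship, the year regex standing in for object-type detection, and the hand-set `WEIGHTS` and `BIAS` standing in for a model that would actually be learned from automatically labeled data.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Mention:
    text: str      # surface form of the candidate object
    start: int     # character offset in the document
    context: str   # lowercased text immediately before the mention


# Assumption: the object type is "year", detected with a simple regex.
YEAR = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def locate_mentions(document, window=30):
    """Find text chunks whose surface form matches the object type."""
    mentions = []
    for m in YEAR.finditer(document):
        lo = max(0, m.start() - window)
        mentions.append(Mention(m.group(), m.start(), document[lo:m.start()].lower()))
    return mentions


# Hypothetical hand-set weights; a real model would be trained, not written by hand.
WEIGHTS = {"born": 3.0, "died": -3.0}
BIAS = -1.0


def predict_probability(mention):
    """Logistic score over cue-word features found in the mention's context."""
    score = BIAS + sum(w for cue, w in WEIGHTS.items() if cue in mention.context)
    return 1.0 / (1.0 + math.exp(-score))


def extract(subject, relation, document, threshold=0.5):
    """Keep every mention whose probability meets the relationship's threshold."""
    return [
        (subject, relation, m.text)
        for m in locate_mentions(document)
        if predict_probability(m) >= threshold
    ]


doc = "Ada Lovelace was born in 1815 in London. She died in 1852."
print(extract("Ada Lovelace", "year_of_birth", doc))
# [('Ada Lovelace', 'year_of_birth', '1815')]
```

Both years are located as candidate objects, but only the one preceded by the "born" cue clears the threshold, which is the precision-through-filtering behavior the abstract refers to.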

20 Claims

1. A method for automatically extracting relationships from unstructured text, the method comprising:
- selecting a relationship type describing a relationship between a subject having an entity type and an object having an object type;
  
  locating mentions of the object type in a selected document;
  
  for each mention located in the selected document, predicting a probability that the mention satisfies the relationship type using a statistical model built using automatically labeled training data; and
  
  extracting one or more relationships satisfying the relationship type from the selected document.
- Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
  - 2. The method of claim 1 further comprising the acts of:
    - aggregating the extracted relationships; and
      
      applying a pattern-based model to the aggregated relationships.
  - 3. The method of claim 1 further comprising the acts of:
    - computing one or more features for each mention; and
      
      supplying the features as inputs to the statistical prediction.
  - 4. The method of claim 1 further comprising the act of determining whether each mention satisfies the relationship type based on a comparison of the probability to a threshold associated with the relationship type.
  - 5. The method of claim 1 further comprising the act of varying the selection threshold based on a feature of the mention.
  - 6. The method of claim 1 further comprising the act of determining whether each mention satisfies the relationship type based on a comparison of the probability to a threshold associated with the relationship type.
  - 7. The method of claim 1 further comprising the acts of:
    - taking snapshots of documents from a document collection; and
      
      selecting the document for processing from the snapshots.
  - 8. The method of claim 1 further comprising the act of training a statistical model with a large quantity of training data automatically labeled using existing facts from a knowledge graph.
  - 9. The method of claim 8 wherein the act of training a statistical model with a large quantity of training data automatically labeled using existing facts from a knowledge graph further comprises the act of collecting a large quantity of training data automatically labeled using existing facts from a knowledge graph.
  - 10. The method of claim 9 wherein the act of collecting a large quantity of training data automatically labeled using existing facts from a knowledge graph further comprises the acts of:
    - selecting existing facts from a knowledge graph, each existing fact specifying a fact subject having an entity type, a fact object having an object type, and a fact predicate participating in a fact relationship;
      
      locating documents describing the subject of each existing fact;
      
      detecting mentions having an object type that matches the object type of the fact object; and
      
      automatically labeling training data as positive or negative based on a comparison of each mention with the fact object.
  - 11. The method of claim 10 wherein the act of automatically labeling training data as positive or negative based on a comparison of each mention with the fact object further comprises the acts of:
    - comparing the fact object to each mention;
      
      using mentions that do not match the fact object to provide negative training data; and
      
      using mentions that do match the fact object to provide positive training data.
  - 12. The method of claim 1 wherein the act of training a statistical model with a large quantity of training data automatically labeled using existing facts from a knowledge graph further comprises the acts of:
    - building a statistical model using a portion of the automatically labeled training data;
      
      generating predicted classifications by applying the statistical model to the remaining portion of the automatically labeled training data;
      
      displaying a small number of predicted classifications for annotation by a user;
      
      receiving annotations for the small number of predicted classifications from the user;
      
      updating the automatically labeled training data according to the annotations received from the user; and
      
      retraining the statistical model using the updated training data.
  - 13. The method of claim 1 further comprising the act of tuning a selection threshold for the statistical model based on input from a user.
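
The automatic labeling recited in claims 10 and 11 can be sketched as follows. This is an illustration under stated assumptions, not the applicants' implementation: `FACTS` and `DOCUMENTS` are hypothetical stand-ins for a knowledge graph and a document collection, a dictionary lookup stands in for "locating documents describing the subject", and a four-digit regex stands in for object-type detection.

```python
import re

# Hypothetical knowledge-graph facts: (fact subject, fact predicate, fact object).
FACTS = [
    ("Ada Lovelace", "year_of_birth", "1815"),
    ("Alan Turing", "year_of_birth", "1912"),
]

# Hypothetical unlabeled documents, keyed by the subject each one describes.
DOCUMENTS = {
    "Ada Lovelace": "Ada Lovelace was born in 1815 and died in 1852.",
    "Alan Turing": "Alan Turing (1912-1954) published his famous paper in 1936.",
}

# Assumption: the object type is "year", detected as any four-digit token.
YEAR = re.compile(r"\b\d{4}\b")


def label_training_data(facts, documents):
    """Label each object-type mention by comparing it with the fact object."""
    examples = []
    for subject, predicate, fact_object in facts:
        text = documents.get(subject)  # locate a document describing the subject
        if text is None:
            continue
        for m in YEAR.finditer(text):  # detect mentions of the object type
            examples.append({
                "subject": subject,
                "predicate": predicate,
                "mention": m.group(),
                "context": text[max(0, m.start() - 20):m.end() + 20],
                # A match with the fact object is positive; anything else is negative.
                "label": 1 if m.group() == fact_object else 0,
            })
    return examples


examples = label_training_data(FACTS, DOCUMENTS)
print([(e["mention"], e["label"]) for e in examples])
# [('1815', 1), ('1852', 0), ('1912', 1), ('1954', 0), ('1936', 0)]
```

No human labels are involved: every non-matching mention in a document about the subject becomes a negative example, so even two facts yield five training examples here.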

14. A relationship extractor implemented using a computer, the relationship extractor comprising:
- a natural language processor operable to identify mentions of a subject of a selected subject type or objects of a selected object type specified in a selected relationship type appearing in a document describing the subject;
  
  a classifier operable to predict a probability that each object identified by the natural language processor satisfies the selected relationship type with the subject using a statistical model built from a large set of automatically labeled training data; and
  
  a post processor operable to aggregate objects associated with the selected relationship type, apply a pattern-based model to the aggregated objects, select one or more objects from the aggregated objects meeting selected criteria as participants in relationships of the selected relationship type with the subject, and produce a final set of one or more relationships of the selected relationship type.
- Dependent claims: 15, 16, 17, 18, 19
  - 15. The relationship extractor of claim 14 further comprising:
    - a fact extractor operable to retrieve known facts for the selected relationship type from an existing knowledge graph;
      
      the natural language processor further operable to extract training data from documents containing the known facts until a large set of training data for the selected relationship type is collected; and
      
      a training classifier operable to build an initial model for the relationship type from at least a portion of the large set of training data.
  - 16. The relationship extractor of claim 15 further comprising an interactive validation system operable to display a small subset of predictions made using the initial model to a user, receive input from the user indicating whether each prediction in the subset is correct or incorrect, and train a statistical model based on the input from the user.
  - 17. The relationship extractor of claim 14 further comprising a page type classifier operable to determine a page type for a document and select the document for processing if the page type matches the subject type of the selected relationship type.
  - 18. The relationship extractor of claim 17 wherein the page type classifier is further operable to select the document for processing if the page type matches one of the page types from a set of page types associated with the relationship type.
  - 19. The relationship extractor of claim 14 wherein the natural language processor is further operable to extract one or more features corresponding to a mention for use in a feature vector supplied as an input to the classifier or the training classifier.
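
The post processor of claim 14 (aggregate, apply a pattern-based model, select by criteria) might look roughly like the sketch below. The claims do not say how scores are combined or what the patterns are, so the choices here are assumptions: values are normalized by trimming punctuation, duplicate values are aggregated by maximum probability, `YEAR_PATTERN` is a hypothetical pattern for a plausible year, and the selection criteria are a threshold plus a cap on the number of results.

```python
import re
from collections import defaultdict

# Hypothetical pattern-based model: accept only plausible four-digit years.
YEAR_PATTERN = re.compile(r"^(1[0-9]{3}|20[0-9]{2})$")


def normalize(value):
    """Trim whitespace and trailing punctuation so equal values aggregate together."""
    return value.strip().rstrip(".,;")


def post_process(subject, relation, scored_objects, threshold=0.6, max_results=1):
    # Aggregate: group classifier probabilities by normalized object value.
    scores = defaultdict(list)
    for value, probability in scored_objects:
        scores[normalize(value)].append(probability)

    # Pattern-based filter: drop values the pattern rejects, keep the best score.
    kept = {v: max(ps) for v, ps in scores.items() if YEAR_PATTERN.match(v)}

    # Select: rank by score, apply the threshold, cap the number of results.
    ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
    return [(subject, relation, v) for v, p in ranked if p >= threshold][:max_results]


scored = [("1815", 0.91), ("1815.", 0.72), ("1852", 0.40), ("18155", 0.95)]
print(post_process("Ada Lovelace", "year_of_birth", scored))
# [('Ada Lovelace', 'year_of_birth', '1815')]
```

The malformed value `18155` has the highest classifier score but is rejected by the pattern, which shows how a light pattern-based step can guard precision after the statistical prediction.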

20. A computer readable medium containing computer executable instructions which, when executed by a computer, perform a method of extracting facts from free and semi-structured text using distant supervision, the method comprising:
- collecting known facts from an existing knowledge graph corresponding to a relationship type describing a relationship between a subject having an entity type and an object having an object type;
  
  automatically labeling training data extracted from documents corresponding to the known facts;
  
  training a statistical model with a large quantity of automatically labeled training data;
  
  displaying a small number of classification predictions generated using the automatically labeled training data for annotation by a user;
  
  retraining the statistical model based on the annotations received from the user;
  
  locating mentions of the object type in a selected document;
  
  predicting a probability that each mention satisfies the relationship type using the statistical model; and
  
  extracting one or more relationships satisfying the relationship type from the selected document.
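
The train, annotate, and retrain steps of claim 20 (and claim 12) can be sketched with a tiny pure-Python logistic regression. Several pieces are assumptions for illustration only: the two binary features (a "born" cue and a "died" cue), the rule for choosing which predictions to show (those that contradict the automatic label, most confident first), and the `ask_user` callback that simulates a human confirming or rejecting a prediction.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.5):
    """Fit logistic regression by stochastic gradient descent."""
    w, b = [0.0] * len(examples[0][0]), 0.0
    for _ in range(epochs):
        for x, y in examples:
            g = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(model, x):
    w, b = model
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


def review_and_retrain(train_part, heldout_part, ask_user, k=2):
    """Train on one portion, review a few held-out predictions, then retrain."""
    model = train(train_part)
    # Assumed selection rule: predictions that contradict the automatic label.
    flagged = []
    for i, (x, y) in enumerate(heldout_part):
        p = predict(model, x)
        if (p >= 0.5) != bool(y):
            flagged.append((abs(p - 0.5), i, int(p >= 0.5)))
    flagged.sort(reverse=True)

    updated = list(heldout_part)
    for _, i, predicted in flagged[:k]:  # only a small number reach the user
        x, _ = updated[i]
        if ask_user(x, predicted):       # user marks the prediction as correct
            updated[i] = (x, predicted)  # overwrite the automatic label
    return train(train_part + updated), updated


# Features: [has "born" cue, has "died" cue]; labels come from automatic labeling.
train_part = [([1, 0], 1), ([0, 1], 0), ([1, 0], 1), ([0, 1], 0), ([0, 0], 0)]
heldout_part = [([1, 0], 0), ([0, 1], 0), ([0, 0], 0)]  # first label is noisy


def simulated_user(x, predicted):
    """Hypothetical annotator: the true label equals the "born" cue."""
    return predicted == x[0]


model, updated = review_and_retrain(train_part, heldout_part, simulated_user)
print(updated[0])
```

Only the one contradicted prediction is shown to the simulated user, whose confirmation flips the noisy automatic label before the model is retrained on all of the data.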

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Sharma, Ashish; Zhang, Jianwen; Alonichau, Siarhei; Yoo, Woonyeon; Wang, Yujing

Application Number: US14/504,507
Publication Number: US 20160098645A1
US Class Current: 1/1
CPC Class Codes

G06F 16/313   Selection or weighting of t...

G06F 16/36   Creation of semantic tools,...

G06F 40/289   Phrasal analysis, e.g. fini...

G06N 20/00   Machine learning

G06N 7/01   Probabilistic graphical mod...
