Learning syntactic patterns for automatic discovery of causal relations from text

US 8,244,730 B2
Filed: 05/29/2007
Issued: 08/14/2012
Est. Priority Date: 05/30/2006
Status: Expired due to Fees

Abstract

The present invention provides a method for extracting relationships between words in textual data. Initially, training relationship data, such as word triplets describing a cause-effect relationship, is received and used to collect additional textual data including the training relationship data. Distributed data collection is used to receive the training data and collect the additional textual data, allowing a broad range of data to be acquired from multiple sources. Syntactic patterns are extracted from the additional textual data and a distributed data source is scanned to extract additional relationship data describing one or more causal relationships using the extracted syntactic patterns. The extracted additional relationship data is then stored, and can be validated by a supervised learning algorithm before storage and used to train a classifier for automatic validation of additional relationship data.
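The first stage described in the abstract, using seed word triplets to collect text that mentions them, might be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the `Triplet` type, the `tokenize` and `collect_sentences` helpers, and the sample sentences are all hypothetical, and matching is plain lowercase word containment.

```python
import re
from typing import List, NamedTuple, Tuple


class Triplet(NamedTuple):
    """Hypothetical seed: an object, an action, and the action's effect."""
    obj: str
    action: str
    effect: str


def tokenize(sentence: str) -> List[str]:
    # Lowercase alphabetic tokens; a real system would use a proper tokenizer.
    return re.findall(r"[a-z]+", sentence.lower())


def collect_sentences(seeds: List[Triplet],
                      corpus: List[str]) -> List[Tuple[Triplet, str]]:
    """Return (seed, sentence) pairs where the sentence mentions all three words."""
    hits = []
    for sentence in corpus:
        tokens = set(tokenize(sentence))
        for seed in seeds:
            if {seed.obj, seed.action, seed.effect} <= tokens:
                hits.append((seed, sentence))
    return hits


# Made-up seed and corpus, for illustration only.
seeds = [Triplet(obj="floor", action="mop", effect="clean")]
corpus = [
    "If you mop the floor, the floor becomes clean.",
    "The floor was dirty before lunch.",
]
hits = collect_sentences(seeds, corpus)
```

Only the first sentence contains all three seed words, so it is the only one kept for pattern extraction.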

19 Claims

1. A computer-based method for extracting relationships from textual data, comprising the steps of:
- receiving, from a first distributed data source, training data comprising three or more words describing relationships between an action and an object;
- collecting textual data including the received training data from the first distributed data source;
- generating a dependency tree describing relationships between words of the textual data from a syntactic pattern extracted from the collected textual data;
- inserting satellite links and word order data into the dependency tree, the satellite links identifying links within the text data in addition to a basic lexical path and the word order data describing the order of words in the syntactic pattern;
- obtaining additional text data by scanning a second distributed data source, wherein the additional text data is not used in generating the dependency tree;
- extracting target causal relationships between one or more actions and one or more objects in the additional text data obtained from the second distributed data source by comparing the additional text data to the dependency tree and using the word order data;
- determining the validity of the target relationships;
- training a classifier to automatically determine the validity of target relationships in addition to the target relationships previously determined to be valid based at least in part on the target relationships previously determined to be valid; and
- storing the target relationships determined to be valid in a computer storage media.
- Dependent Claims (2, 3, 4, 5, 6, 7)
  - 2. The method of claim 1, wherein the training data comprises one or more triplets of words describing an object, an action and an effect of the action on the object.
  - 3. The method of claim 1, wherein the step of generating the dependency tree comprises preprocessing the textual data to resolve pronouns.
  - 4. The method of claim 1, wherein the validity of the target relationships is determined using a supervised learning algorithm to examine the features of the target relationships.
  - 5. The method of claim 4, wherein the supervised learning algorithm applies a set of predetermined or manually provided rules describing the validity of causal relationships in lexical patterns to classify the target relationships as valid or invalid.
  - 6. The method of claim 1, wherein the validity of the target relationships is determined by presenting the target relationships to a user for manual specification of validity.
  - 7. The method of claim 1, wherein the classifier is trained based at least in part on the target relationships determined to be valid using sparse binary logistic regression.
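
One way to picture the dependency-tree steps of claim 1 is the sketch below. It is an interpretation, not the patent's implementation: the `Token` structure, the helper names, the hand-annotated parses, and the dependency labels (`dobj`, `det`, `prep`, and so on) are all assumptions, and no parser is called. A pattern here is the label sequence on the basic lexical path between an action and an object, plus the labels of satellite links hanging off that path, plus their relative word order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    index: int  # position in the sentence, which doubles as word order data
    word: str
    head: int   # index of the head token; -1 marks the root
    rel: str    # dependency label on the link to the head


def chain_to_root(tokens: List[Token], i: int) -> List[int]:
    """Indices from token i up to the root, inclusive."""
    chain = [i]
    while tokens[chain[-1]].head != -1:
        chain.append(tokens[chain[-1]].head)
    return chain


def lexical_path(tokens: List[Token], a: int, b: int) -> List[int]:
    """Basic lexical path from a to b through their lowest common ancestor."""
    up, down = chain_to_root(tokens, a), chain_to_root(tokens, b)
    common = next(i for i in up if i in down)
    return up[: up.index(common) + 1] + down[: down.index(common)][::-1]


def extract_pattern(tokens: List[Token], a: int, b: int) -> Tuple:
    """Pattern = path labels + satellite link labels + relative word order."""
    path = lexical_path(tokens, a, b)
    # Labels of the links on the path (the top node of the path adds none).
    path_rels = tuple(tokens[i].rel for i in path if tokens[i].head in path)
    # Satellite links: tokens attached to the path but not on it.
    sat_rels = tuple(sorted(t.rel for t in tokens
                            if t.head in path and t.index not in path))
    order = "action<object" if a < b else "object<action"
    return (path_rels, sat_rels, order)


def find_matches(pattern: Tuple, tokens: List[Token]) -> List[Tuple[str, str]]:
    """Return (action, object) word pairs whose pattern equals the learned one."""
    return [(tokens[a].word, tokens[b].word)
            for a in range(len(tokens)) for b in range(len(tokens))
            if a != b and extract_pattern(tokens, a, b) == pattern]


def sent(rows: List[Tuple[str, int, str]]) -> List[Token]:
    return [Token(i, w, h, r) for i, (w, h, r) in enumerate(rows)]


# Hand-annotated training sentence: "wipe the table with a cloth".
train = sent([("wipe", -1, "root"), ("the", 2, "det"), ("table", 0, "dobj"),
              ("with", 0, "prep"), ("a", 5, "det"), ("cloth", 3, "pobj")])
pattern = extract_pattern(train, 0, 2)  # action "wipe", object "table"

# Text from a second source, not used to build the pattern.
new = sent([("scrub", -1, "root"), ("the", 2, "det"), ("pan", 0, "dobj"),
            ("with", 0, "prep"), ("a", 5, "det"), ("sponge", 3, "pobj")])
passive = sent([("the", 1, "det"), ("pan", 3, "nsubjpass"),
                ("was", 3, "auxpass"), ("scrubbed", -1, "root")])

matches = find_matches(pattern, new)
no_matches = find_matches(pattern, passive)
```

In this toy setup the pattern learned from the "wipe ... table" sentence picks out the pair ("scrub", "pan") in the new sentence, while the passive sentence yields nothing because its path label differs. Exact equality of patterns is a deliberate simplification; a practical matcher would likely tolerate partial agreement on satellite links.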

8. A system for extracting relationships from textual data, the system comprising:
- an input device for receiving training data and first textual data from a first distributed data source;
- a data store, adapted to communicate with the input device, the data store for storing representations of the received training data;
- an extraction module, adapted to communicate with the data store and the input device for extracting a syntactic pattern from the first textual data using the received training data, generating a dependency tree describing relationships between words of the textual data from the syntactic pattern and inserting satellite links and word order data into the dependency tree, the satellite links identifying links within the text data in addition to a basic lexical path and the word order data describing the order of words in the syntactic pattern;
- a communication module adapted to communicate with the extraction module and a second distributed data source, the communication module for retrieving additional textual data from the second distributed data source and extracting causal relationships between one or more actions and one or more objects from the additional text data obtained from the second distributed data source by comparing the additional text data to the dependency tree and using the word order data, wherein the additional text data is not used in generating the dependency tree; and
- a classifier adapted to communicate with the extraction module, the communication module and the data store for classifying relationships described by the dependency tree and for determining whether the target causal relationships extracted from the second distributed text data source are valid or invalid, wherein the classifier is trained to automatically determine the validity of target relationships in addition to the target relationships previously determined to be valid based at least in part on the target relationships previously determined to be valid.
- Dependent Claims (9, 10, 11, 12, 13)
  - 9. The system of claim 8, wherein the extraction module further preprocesses the text data to identify pronouns in the textual data.
  - 10. The system of claim 8, wherein the target relationships are classified as valid or invalid using a supervised learning algorithm to examine the features of the target relationships.
  - 11. The system of claim 10, wherein the supervised learning algorithm applies a set of predetermined or manually provided rules describing the validity of causal relationships in lexical patterns to classify the target relationships as valid or invalid.
  - 12. The system of claim 8, wherein the classifier classifies the target relationships by presenting the target relationships to a user for manual specification of validity.
  - 13. The system of claim 8, wherein the classifier is trained based at least in part on the target relationships classified as valid using sparse binary logistic regression.

14. A computer program product, comprising a non-transitory computer readable medium storing computer executable code for extracting relationships from textual data, the computer executable code performing the steps of:
- receiving, from a first distributed data source, training data comprising three or more words describing relationships between an action and an object;
- collecting textual data including the received training data from the first distributed data source;
- generating a dependency tree describing relationships between words of the textual data from a syntactic pattern extracted from the collected textual data;
- inserting satellite links and word order data into the dependency tree, the satellite links identifying links within the text data in addition to a basic lexical path and the word order data describing the order of words in the syntactic pattern;
- obtaining additional text data by scanning a second distributed data source, wherein the additional text data is not used in generating the dependency tree;
- extracting a target causal relationship between one or more actions and one or more objects in the additional text data obtained from the second distributed data source by comparing the additional text data to the dependency tree and using the word order data;
- determining the validity of the target relationship; and
- responsive to a determination of the validity of the target relationship;
- training a classifier to automatically determine the validity of target relationships in addition to the target relationship previously determined to be valid based on the target relationship; and
- storing the target relationship in the non-transitory computer-readable storage medium.
- Dependent Claims (15, 16, 17, 18, 19)
  - 15. The computer program product of claim 14, wherein the step of generating the dependency tree comprises the step of:
    - preprocessing the textual data to resolve pronouns.
  - 16. The computer program product of claim 14, wherein the validity of the target relationship is determined using a supervised learning algorithm to examine the features of the target relationship.
  - 17. The computer program product of claim 16, wherein the supervised learning algorithm applies a set of predetermined or manually provided rules describing the validity of causal relationships in lexical patterns to classify the target relationship as valid or invalid.
  - 18. The computer program product of claim 14, wherein the validity of the target relationships is determined by presenting the target relationships to a user for manual specification of validity.
  - 19. The computer program product of claim 14, wherein the classifier is trained based at least in part on the target relationship using sparse binary logistic regression.
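
Claims 7, 13 and 19 name sparse binary logistic regression for training the validity classifier, but the claims do not say how it is fitted. The sketch below assumes one common route to sparsity, an L1 penalty solved by proximal gradient descent (soft-thresholding). The function names, hyperparameters, and the two toy features are hypothetical and chosen only to show an uninformative feature being driven to a zero weight.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w: float, t: float) -> float:
    """Proximal step for the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def train_sparse_logreg(X: List[List[float]], y: List[int],
                        l1: float = 0.05, lr: float = 0.5,
                        epochs: int = 500) -> List[float]:
    """L1-regularised logistic regression via proximal gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        # Gradient step on the log-loss, then shrinkage for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad)]
    return w


def predict(w: List[float], x: List[float]) -> bool:
    """True if the candidate relationship is classified as valid."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x))) >= 0.5


# Toy data: feature 0 tracks validity, feature 1 is balanced noise.
# Labels stand in for relationships already judged valid (1) or invalid (0).
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [1, 1, 0, 0]
w = train_sparse_logreg(X, y)
```

On this toy data the weight on the noise feature stays at zero while the informative feature receives a positive weight, which is the sparsity behaviour the technique is named for. In practice the features would be derived from the extracted relationships, and a library solver would normally replace this hand-written loop.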

Current Assignee: Honda Motor Co., Ltd. (Honda Motor Company)
Original Assignee: Honda Motor Co., Ltd. (Honda Motor Company)
Inventors: Gupta, Rakesh
Primary Examiner(s): Alam, Shahid
Assistant Examiner(s): CONYERS, DAWAUNE A
Application Number: US11/754,966
Publication Number: US 20070282814A1
Time in Patent Office: 1,904 Days
Field of Search: 704/1, 707/737, 707/739, 707/742
US Class Current: 707/737
CPC Class Codes: G06F 40/20 Natural language analysis s...
