Learning syntactic patterns for automatic discovery of causal relations from text
First Claim
1. A computer-based method for extracting relationships from textual data, comprising the steps of:
receiving, from a first distributed data source, training data comprising three or more words describing relationships between an action and an object;
collecting textual data including the received training data from the first distributed data source;
generating a dependency tree describing relationships between words of the textual data from a syntactic pattern extracted from the collected textual data;
inserting satellite links and word order data into the dependency tree, the satellite links identifying links within the text data in addition to a basic lexical path and the word order data describing the order of words in the syntactic pattern;
obtaining additional text data by scanning a second distributed data source, wherein the additional text data is not used in generating the dependency tree;
extracting target causal relationships between one or more actions and one or more objects in the additional text data obtained from the second distributed data source by comparing the additional text data to the dependency tree and using the word order data;
determining the validity of the target relationships;
training a classifier to automatically determine the validity of target relationships in addition to the target relationships previously determined to be valid based at least in part on the target relationships previously determined to be valid; and
storing the target relationships determined to be valid in a computer storage media.
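The dependency-tree steps of the claim can be pictured with a small sketch. This is an illustration under stated assumptions, not the patented implementation: the parse is written by hand rather than produced by a parser, the names `Edge`, `Pattern` and `build_pattern` are hypothetical, the labels `nsubj`, `dobj` and `amod` are illustrative, and the "basic lexical path" is simplified to the edges that directly join two seed words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    head: int   # token index of the governing word
    dep: int    # token index of the dependent word
    label: str  # dependency label (illustrative, e.g. "nsubj")


@dataclass
class Pattern:
    path: list[Edge]        # edges joining the seed words (simplified lexical path)
    satellites: list[Edge]  # extra edges hanging off exactly one seed word
    order: list[str]        # seed roles in left-to-right surface order


def build_pattern(edges: list[Edge], slots: dict[str, int]) -> Pattern:
    """Build a pattern from a parsed sentence.

    `slots` maps a role name ("cause", "action", "effect") to the token
    index of the seed word filling that role (a hypothetical convention).
    """
    core = set(slots.values())
    # Simplified lexical path: both ends of the edge are seed words.
    path = [e for e in edges if e.head in core and e.dep in core]
    # Satellite links: exactly one end of the edge is a seed word.
    satellites = [e for e in edges if (e.head in core) != (e.dep in core)]
    # Word order data: roles sorted by their position in the sentence.
    order = [role for role, _ in sorted(slots.items(), key=lambda kv: kv[1])]
    return Pattern(path, satellites, order)


# Hand-written parse of "heavy rain causes severe flooding" (assumed, not parser output).
tokens = ["heavy", "rain", "causes", "severe", "flooding"]
edges = [
    Edge(1, 0, "amod"),   # rain -> heavy
    Edge(2, 1, "nsubj"),  # causes -> rain
    Edge(2, 4, "dobj"),   # causes -> flooding
    Edge(4, 3, "amod"),   # flooding -> severe
]
pattern = build_pattern(edges, {"cause": 1, "action": 2, "effect": 4})
```

Here the two modifier edges end up as satellite links, and the recorded order distinguishes "rain causes flooding" from a sentence in which the same words appear in a different sequence.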
Abstract
The present invention provides a method for extracting relationships between words in textual data. Initially, training relationship data, such as word triplets describing a cause-effect relationship, is received and used to collect additional textual data including the training relationship data. Distributed data collection is used to receive the training data and collect the additional textual data, allowing a broad range of data to be acquired from multiple sources. Syntactic patterns are extracted from the additional textual data and a distributed data source is scanned to extract additional relationship data describing one or more causal relationships using the extracted syntactic patterns. The extracted additional relationship data is then stored, and can be validated by a supervised learning algorithm before storage and used to train a classifier for automatic validation of additional relationship data.
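The first stage described in the abstract, using seed triplets to collect further text, might look roughly like the following. It is a minimal sketch under assumptions: the corpus is an in-memory list rather than a distributed source, matching is plain word overlap, and `Triplet`, `tokenize` and `collect_sentences` are hypothetical names.

```python
import re
from typing import NamedTuple


class Triplet(NamedTuple):
    """Hypothetical seed relation: three words describing a cause-effect link."""
    subject: str
    action: str
    obj: str


def tokenize(sentence: str) -> list[str]:
    # Lowercase alphabetic tokens only; a deliberately crude tokenizer.
    return re.findall(r"[a-z]+", sentence.lower())


def collect_sentences(seeds: list[Triplet], corpus: list[str]) -> list[tuple[Triplet, str]]:
    """Return (seed, sentence) pairs where all three seed words occur."""
    hits = []
    for sentence in corpus:
        words = set(tokenize(sentence))
        for seed in seeds:
            if {seed.subject, seed.action, seed.obj} <= words:
                hits.append((seed, sentence))
    return hits


seeds = [Triplet("rain", "causes", "flooding")]
corpus = [
    "Heavy rain often causes flooding in low areas.",
    "The rain stopped before noon.",
]
hits = collect_sentences(seeds, corpus)
```

Only the first sentence contains all three seed words, so it is the only one kept for pattern extraction.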
19 Claims
1. A computer-based method for extracting relationships from textual data, comprising the steps of:
receiving, from a first distributed data source, training data comprising three or more words describing relationships between an action and an object;
collecting textual data including the received training data from the first distributed data source;
generating a dependency tree describing relationships between words of the textual data from a syntactic pattern extracted from the collected textual data;
inserting satellite links and word order data into the dependency tree, the satellite links identifying links within the text data in addition to a basic lexical path and the word order data describing the order of words in the syntactic pattern;
obtaining additional text data by scanning a second distributed data source, wherein the additional text data is not used in generating the dependency tree;
extracting target causal relationships between one or more actions and one or more objects in the additional text data obtained from the second distributed data source by comparing the additional text data to the dependency tree and using the word order data;
determining the validity of the target relationships;
training a classifier to automatically determine the validity of target relationships in addition to the target relationships previously determined to be valid based at least in part on the target relationships previously determined to be valid; and
storing the target relationships determined to be valid in a computer storage media.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system for extracting relationships from textual data, the system comprising:
an input device for receiving training data and first textual data from a first distributed data source;
a data store, adapted to communicate with the input device, the data store for storing representations of the received training data;
an extraction module, adapted to communicate with the data store and the input device for extracting a syntactic pattern from the first textual data using the received training data, generating a dependency tree describing relationships between words of the textual data from the syntactic pattern and inserting satellite links and word order data into the dependency tree, the satellite links identifying links within the text data in addition to a basic lexical path and the word order data describing the order of words in the syntactic pattern;
a communication module adapted to communicate with the extraction module and a second distributed data source, the communication module for retrieving additional textual data from the second distributed data source and extracting causal relationships between one or more actions and one or more objects from the additional text data obtained from the second distributed data source by comparing the additional text data to the dependency tree and using the word order data, wherein the additional text data is not used in generating the dependency tree; and
a classifier adapted to communicate with the extraction module, the communication module and the data store for classifying relationships described by the dependency tree and for determining whether the target causal relationships extracted from the second distributed text data source are valid or invalid, wherein the classifier is trained to automatically determine the validity of target relationships in addition to the target relationships previously determined to be valid based at least in part on the target relationships previously determined to be valid.
Dependent claims: 9, 10, 11, 12, 13
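The extraction recited here, comparing newly retrieved text to a learned pattern and using word order, can be sketched as follows. This is an assumption-laden illustration rather than the claimed system: the pattern is reduced to two dependency labels, an allowed set of action words and a required role order, the parses are hand-written tuples of `(head_index, dep_index, label)`, and `extract` is a hypothetical function name.

```python
def extract(pattern: dict, tokens: list[str], edges: list[tuple[int, int, str]]):
    """Return (cause, action, effect) word tuples that match the pattern."""
    results = []
    for head, dep, label in edges:
        if label != pattern["cause_label"]:
            continue
        # Look for a second edge from the same head carrying the effect label.
        for head2, dep2, label2 in edges:
            if head2 != head or label2 != pattern["effect_label"]:
                continue
            positions = {"cause": dep, "action": head, "effect": dep2}
            # Word order check: roles sorted by surface position must match.
            order = sorted(positions, key=positions.get)
            if order == pattern["order"] and tokens[head] in pattern["actions"]:
                results.append((tokens[dep], tokens[head], tokens[dep2]))
    return results


# Hypothetical pattern learned from the first data source.
pattern = {
    "cause_label": "nsubj",
    "effect_label": "dobj",
    "order": ["cause", "action", "effect"],
    "actions": {"causes", "triggers"},
}

# Hand-written parses standing in for text from the second data source.
parse = [(1, 0, "nsubj"), (1, 2, "dobj")]
found = extract(pattern, ["smoking", "causes", "cancer"], parse)
skipped = extract(pattern, ["john", "eats", "apples"], parse)
```

The second sentence has the same tree shape but its verb is not among the pattern's action words, so nothing is extracted from it.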
14. A computer program product, comprising a non-transitory computer readable medium storing computer executable code for extracting relationships from textual data, the computer executable code performing the steps of:
receiving, from a first distributed data source, training data comprising three or more words describing relationships between an action and an object;
collecting textual data including the received training data from the first distributed data source;
generating a dependency tree describing relationships between words of the textual data from a syntactic pattern extracted from the collected textual data;
inserting satellite links and word order data into the dependency tree, the satellite links identifying links within the text data in addition to a basic lexical path and the word order data describing the order of words in the syntactic pattern;
obtaining additional text data by scanning a second distributed data source, wherein the additional text data is not used in generating the dependency tree;
extracting a target causal relationship between one or more actions and one or more objects in the additional text data obtained from the second distributed data source by comparing the additional text data to the dependency tree and using the word order data;
determining the validity of the target relationship; and
responsive to a determination of the validity of the target relationship;
training a classifier to automatically determine the validity of target relationships in addition to the target relationship previously determined to be valid based on the target relationship; and
storing the target relationship in the non-transitory computer-readable storage medium.
Dependent claims: 15, 16, 17, 18, 19
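The final steps, training a classifier on relationships already judged valid or invalid and then using it to screen new ones, can be sketched with a tiny perceptron. The claims do not name a learning algorithm, so the perceptron, the feature scheme and the names `features`, `train` and `is_valid` are all assumptions made for illustration, as are the example labels.

```python
from collections import defaultdict


def features(rel: tuple[str, str, str]) -> list[str]:
    # Hypothetical features: one indicator per slot of the relationship.
    cause, action, effect = rel
    return [f"cause={cause}", f"action={action}", f"effect={effect}"]


def train(labeled, epochs: int = 10) -> dict:
    """Perceptron training on (relationship, is_valid) pairs."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for rel, valid in labeled:
            score = sum(weights[f] for f in features(rel))
            if (score > 0) != valid:
                # Misclassified: move the weights toward the correct label.
                step = 1.0 if valid else -1.0
                for f in features(rel):
                    weights[f] += step
    return weights


def is_valid(weights: dict, rel: tuple[str, str, str]) -> bool:
    return sum(weights.get(f, 0.0) for f in features(rel)) > 0


# Relationships whose validity was determined earlier (labels are made up).
labeled = [
    (("rain", "causes", "flooding"), True),
    (("smoking", "causes", "cancer"), True),
    (("monday", "follows", "sunday"), False),
    (("page", "follows", "cover"), False),
]
weights = train(labeled)

# Screen new candidates and keep only those predicted valid for storage.
candidates = [("virus", "causes", "fever"), ("chapter", "follows", "preface")]
stored = [rel for rel in candidates if is_valid(weights, rel)]
```

In this toy run the only signal the classifier picks up is the action word, which is enough to accept the unseen "virus causes fever" and reject "chapter follows preface".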
Specification