Applying a structured language model to information extraction
Abstract
One feature of the present invention uses the parsing capabilities of a structured language model in the information extraction process. During training, the structured language model is first initialized with syntactically annotated training data. The model is then trained by generating parses on semantically annotated training data enforcing annotated constituent boundaries. The syntactic labels in the parse trees generated by the parser are then replaced with joint syntactic and semantic labels. The model is then trained by generating parses on the semantically annotated training data enforcing the semantic tags or labels found in the training data. The trained model can then be used to extract information from test data using the parses generated by the model.
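The boundary-enforcing training step can be illustrated with a minimal sketch. It assumes each parse is reduced to a set of unlabeled constituent spans, written as (start, end) word indices with an exclusive end; that representation and the names `matches_boundaries` and `filter_parses` are illustrative assumptions, not taken from the patent. A candidate syntactic parse is kept only if every constituent of the semantic parse also appears among its own constituents.

```python
from typing import List, Set, Tuple

# Hypothetical representation: an unlabeled constituent is a (start, end)
# span over word indices, with `end` exclusive.
Span = Tuple[int, int]


def matches_boundaries(syntactic_spans: Set[Span], semantic_spans: Set[Span]) -> bool:
    """True if every semantic constituent is also a syntactic constituent."""
    return semantic_spans <= syntactic_spans


def filter_parses(candidates: List[Set[Span]], semantic_spans: Set[Span]) -> List[Set[Span]]:
    """Discard candidate parses that violate the semantic constituent boundaries."""
    return [p for p in candidates if matches_boundaries(p, semantic_spans)]


# Toy sentence: "schedule meeting with Bob Smith" (word indices 0..4).
# Semantic parse: the whole sentence is the frame, "Bob Smith" is a slot.
semantic = {(0, 5), (3, 5)}

# parse_a brackets "Bob Smith" as a constituent; parse_b brackets
# "with Bob" instead, so it lacks the slot span (3, 5).
parse_a = {(0, 5), (1, 5), (2, 5), (3, 5)}
parse_b = {(0, 5), (0, 2), (2, 5), (2, 4)}

kept = filter_parses([parse_a, parse_b], semantic)
```

In this toy case only `parse_a` survives. In the described method the surviving parses, not the discarded ones, are the basis for the later relabeling and retraining steps.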
19 Claims
1. A method of training an information extraction system to extract information from a natural language input, comprising:
initializing a structured language model with syntactically annotated training data, the annotated training data including a parse tree for a sentence having syntactic labels comprising a frame label indicating an overall action being referred to by the sentence and slot labels identifying attributes of the action;
training the structured language model by generating parses with the initialized structured language model using annotated training data that has semantic constituent labels with semantic constituent boundaries identified, wherein the structured language model is trained as a match constrained parser which generates a set of syntactic parses for a given word string that all match the constituent boundaries specified by the semantic parse, by determining whether unlabeled constituents that define the semantic parse are included in a set of constituents that define the syntactic parse, wherein any parses that do not match the constituent boundaries are discarded;
replacing the syntactic labels in the parse tree with joint syntactic and semantic labels based on the generated parses excluding the discarded parses; and
retraining the structured language model in which the structured language model generates parses that are constrained to identically match the semantic constituent labels of the joint syntactic and semantic labels and constrained to match all of the semantic constituent boundaries.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. An information extraction system comprising:
a semantic schema comprising a multilevel template with a root level and a leaf level, the root level having a frame label identifying an action to be performed, the leaf level having slots identifying attributes of the action; and
a computer processor that is a component of a computing device that receives a natural language input and generates a candidate parse by parsing the natural language input with a structured language model that generates hypothesis parses of a portion of the natural language input by applying the multilevel template and accepting only those hypothesis parses that completely match the structure of the multilevel template, wherein the structured language model is trained by generating a set of syntactic parses for a given word string that all match semantic constituent boundaries, by determining whether unlabeled constituents that define a semantic parse is included in a set of constituents that define the syntactic parse, wherein parses that do not match the semantic constituent boundaries are discarded, and replacing syntactic labels in a parse tree with joint syntactic and semantic labels based on the generated parses excluding the discarded parses, and retraining the structured language model by generating parses that are constrained to identically match the semantic constituent labels of the joint syntactic and semantic labels and constrained to match all of the semantic constituent boundaries.
Dependent claims: 10, 11, 12, 13, 14
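The two-level template in this claim can be pictured with a small sketch. The claim does not spell out what "completely match the structure" means, so the check below is one assumed reading: the hypothesis must carry the template's frame label at the root, and every slot it fills must be a slot the template defines at the leaf level. The classes `Template` and `Hypothesis`, and the frame and slot names, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Template:
    """Hypothetical two-level schema: one frame label (root), a set of slots (leaves)."""
    frame: str
    slots: FrozenSet[str]


@dataclass(frozen=True)
class Hypothesis:
    """Semantic part of a hypothesis parse: a frame label and the slot values it fills."""
    frame: str
    slot_values: Dict[str, str]


def matches_template(hyp: Hypothesis, template: Template) -> bool:
    # Assumed reading of "completely match": same root frame label, and no
    # slot label outside the template's leaf level.
    return hyp.frame == template.frame and set(hyp.slot_values) <= template.slots


template = Template(frame="ScheduleMeeting", slots=frozenset({"attendee", "time"}))

good = Hypothesis("ScheduleMeeting", {"attendee": "Bob Smith", "time": "noon"})
wrong_slot = Hypothesis("ScheduleMeeting", {"location": "room 4"})
wrong_frame = Hypothesis("CancelMeeting", {"attendee": "Bob Smith"})

accepted = [h for h in (good, wrong_slot, wrong_frame) if matches_template(h, template)]
```

Under this reading only `good` is accepted. A stricter interpretation, for example one requiring every template slot to be filled, would change the subset test to an equality test.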
15. A method of utilizing an information extraction system to extract information from a natural language input, comprising:
receiving a natural language input;
utilizing a computer processor, that is a component of a computer, to generate a set of parses from the natural language input based on a template indicative of information to be extracted from the natural language input, wherein the set of parses are generated using a structured language model trained by generating a set of syntactic parses for a given word string that all match semantic constituent boundaries, by determining whether unlabeled constituents that define a semantic parse is included in a set of constituents that define the syntactic parse, wherein parses that do not match the semantic constituent boundaries are discarded, and replacing syntactic labels in a parse tree with joint syntactic and semantic labels based on the generated parses excluding the discarded parses, and retraining the structured language model by generating parses that are constrained to identically match the semantic constituent labels of the joint syntactic and semantic labels and constrained to match all of the semantic constituent boundaries;
calculating a probability for each of the generated parses, comprising summing the probability over all parses having a common semantic parse;
ranking the set of parses based on the probability; and
outputting one or more of the parses based on the ranking.
Dependent claims: 16, 17, 18, 19
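The probability and ranking steps of this claim can be sketched as follows. The sketch assumes each generated parse arrives as a pair of its semantic parse (here just a string key) and its probability; the function name `rank_semantic_parses`, the `top_n` parameter, and the example values are illustrative assumptions. Probabilities of parses that share a semantic parse are summed, and the semantic parses are then ranked by the summed value.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rank_semantic_parses(
    scored_parses: Iterable[Tuple[str, float]], top_n: int = 1
) -> List[Tuple[str, float]]:
    """Sum probability over parses with a common semantic parse, then rank."""
    totals: Dict[str, float] = defaultdict(float)
    for semantic_parse, probability in scored_parses:
        totals[semantic_parse] += probability
    # Highest summed probability first.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Two syntactically different parses share the "Schedule" semantic parse;
# individually each scores below the single "Cancel" parse, but their sum wins.
scored = [
    ("Schedule(attendee=Bob)", 0.25),
    ("Schedule(attendee=Bob)", 0.125),
    ("Cancel(attendee=Bob)", 0.3125),
]

best = rank_semantic_parses(scored, top_n=1)
```

The example is chosen so that the summation changes the outcome: ranking individual parses would put "Cancel" first, while ranking by summed probability puts "Schedule" first.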
Specification