Applying a structured language model to information extraction

US 8,706,491 B2
Filed: 08/24/2010
Issued: 04/22/2014
Est. Priority Date: 05/20/2002
Status: Expired due to Fees

Abstract

One feature of the present invention uses the parsing capabilities of a structured language model in the information extraction process. During training, the structured language model is first initialized with syntactically annotated training data. The model is then trained by generating parses on semantically annotated training data enforcing annotated constituent boundaries. The syntactic labels in the parse trees generated by the parser are then replaced with joint syntactic and semantic labels. The model is then trained by generating parses on the semantically annotated training data enforcing the semantic tags or labels found in the training data. The trained model can then be used to extract information from test data using the parses generated by the model.
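
The boundary-enforcing training pass described above can be illustrated with a minimal sketch. It assumes each candidate parse is reduced to a set of unlabeled word spans, and it keeps a candidate only when every semantic constituent span also appears among the syntactic constituents. The names `matches_boundaries` and `filter_parses`, the span encoding, and the example sentence are hypothetical and are not taken from the patent.

```python
from typing import Iterable, List, Set, Tuple

# A constituent is an unlabeled (left, right) span of word indices, inclusive.
Span = Tuple[int, int]


def matches_boundaries(syntactic_spans: Set[Span], semantic_spans: Set[Span]) -> bool:
    """True when every semantic constituent is also a syntactic constituent."""
    return semantic_spans <= syntactic_spans


def filter_parses(candidates: Iterable[Set[Span]], semantic_spans: Set[Span]) -> List[Set[Span]]:
    """Discard candidate parses that cross or miss a semantic boundary."""
    return [p for p in candidates if matches_boundaries(p, semantic_spans)]


# Hypothetical input: "schedule a meeting with Bob at noon" (word indices 0..6).
# Assumed semantic annotation: whole-sentence frame, "a meeting", "at noon".
semantic = {(0, 6), (1, 2), (5, 6)}

candidates = [
    {(0, 6), (1, 2), (1, 4), (3, 4), (5, 6)},  # respects every semantic span
    {(0, 6), (0, 1), (2, 4), (5, 6)},          # splits "a meeting", so it is discarded
]

kept = filter_parses(candidates, semantic)
```

Only the first candidate survives, because the second one places "a" and "meeting" in different constituents.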

19 Claims

1. A method of training an information extraction system to extract information from a natural language input, comprising:
- initializing a structured language model with syntactically annotated training data, the annotated training data including a parse tree for a sentence having syntactic labels comprising a frame label indicating an overall action being referred to by the sentence and slot labels identifying attributes of the action;
  
  training the structured language model by generating parses with the initialized structured language model using annotated training data that has semantic constituent labels with semantic constituent boundaries identified, wherein the structured language model is trained as a match constrained parser which generates a set of syntactic parses for a given word string that all match the constituent boundaries specified by the semantic parse, by determining whether unlabeled constituents that define the semantic parse are included in a set of constituents that define the syntactic parse, wherein any parses that do not match the constituent boundaries are discarded;
  
  replacing the syntactic labels in the parse tree with joint syntactic and semantic labels based on the generated parses excluding the discarded parses; and
  
  retraining the structured language model in which the structured language model generates parses that are constrained to identically match the semantic constituent labels of the joint syntactic and semantic labels and constrained to match all of the semantic constituent boundaries.
- - 2. The method of claim 1 wherein constraining the parses to match the semantic constituent boundaries and labels includes constraining the parses to a constraint, C, defined as a span of an input sentence together with a set of allowable non-terminal tags for the span, where:
    - C=[l,r,Q]
      
      in which l is a left boundary of the constraint, r is a right boundary of the constraint, and Q is the set of allowable non-terminal tags for the constraint, wherein a given generated parse P is only accepted if the set of semantic constituent boundaries and labels is identical to the semantic constituents that define the generated parse P.
  - 3. The method of claim 1 wherein initializing comprises at least one of:
    - initializing the structured language model with syntactically annotated training data parsed from in-domain sentences; and
      
      initializing the structured language model with syntactically annotated training data parsed from out-of-domain sentences.
  - 4. The method of claim 1 wherein the syntactically annotated training data includes parts-of-speech labels;
    - and wherein a portion of the parses are discarded based on a violation of a structured language model schema.
  - 5. The method of claim 1 wherein generating parses comprises:
    - generating syntactic parses with syntactic labels, wherein the syntactic parses are constrained to match the semantic constituent boundaries; and
      
      generating semantic parses with semantic labels, wherein the semantic parses are constrained to match the semantic constituent labels in the annotated training data.
  - 6. The method of claim 1 wherein generating parses comprises:
    - generating the parses as binary parse trees having root levels and slot levels;
      
      discarding a portion of the binary parse trees that violates a semantic language model hypothesis.
  - 7. The method of claim 1 wherein generating parses comprises:
    - generating the parses in a left-to-right fashion;
      
      associating headwords with nodes in the parses; and
      
      utilizing the headwords as historical contexts to predict words in the parses.
  - 8. The method of claim 1 wherein generating parses comprises:
    - generating parses in a bottom-up fashion;
      
      labeling nodes in the parses with part of speech tags;
      
      prepending the parses with beginning markers; and
      
      appending the parses with ending markers.
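
The labeled constraint C=[l,r,Q] of claim 2 can be sketched as follows. This is one possible reading under stated assumptions: a parse is modeled as a mapping from spans to a (syntactic tag, semantic tag) pair, with `None` marking purely syntactic nodes, and Q is simplified to a set of allowed semantic tags. The `Constraint` class, the `accept` function, and all tag names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple

Span = Tuple[int, int]
# Assumed encoding of a joint label: (syntactic tag, semantic tag or None).
Parse = Dict[Span, Tuple[str, Optional[str]]]


@dataclass(frozen=True)
class Constraint:
    l: int              # left boundary of the span
    r: int              # right boundary of the span
    Q: FrozenSet[str]   # allowable tags for the span


def accept(parse: Parse, constraints: List[Constraint]) -> bool:
    """Accept a parse only if its semantic constituents equal the constraint set."""
    semantic = {span: sem for span, (_syn, sem) in parse.items() if sem is not None}
    required = {(c.l, c.r): c.Q for c in constraints}
    if semantic.keys() != required.keys():
        return False  # a semantic constituent is missing or an extra one was proposed
    return all(semantic[span] in allowed for span, allowed in required.items())


constraints = [
    Constraint(0, 6, frozenset({"ScheduleMeeting"})),
    Constraint(1, 2, frozenset({"Subject"})),
    Constraint(5, 6, frozenset({"Time"})),
]

good = {(0, 6): ("S", "ScheduleMeeting"), (1, 2): ("NP", "Subject"),
        (3, 4): ("PP", None), (5, 6): ("PP", "Time")}
wrong_label = {**good, (5, 6): ("PP", "Place")}
extra_slot = {**good, (3, 4): ("PP", "Attendee")}
```

Here `accept(good, constraints)` holds, while the other two parses are rejected: one carries a tag outside Q, and the other proposes a semantic constituent that the annotation does not contain.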

9. An information extraction system comprising:
- a semantic schema comprising a multilevel template with a root level and a leaf level, the root level having a frame label identifying an action to be performed, the leaf level having slots identifying attributes of the action; and
  
  a computer processor that is a component of a computing device that receives a natural language input and generates a candidate parse by parsing the natural language input with a structured language model that generates hypothesis parses of a portion of the natural language input by applying the multilevel template and accepting only those hypothesis parses that completely match the structure of the multilevel template, wherein the structured language model is trained by generating a set of syntactic parses for a given word string that all match semantic constituent boundaries, by determining whether unlabeled constituents that define a semantic parse is included in a set of constituents that define the syntactic parse, wherein parses that do not match the semantic constituent boundaries are discarded, and replacing syntactic labels in a parse tree with joint syntactic and semantic labels based on the generated parses excluding the discarded parses, and retraining the structured language model by generating parses that are constrained to identically match the semantic constituent labels of the joint syntactic and semantic labels and constrained to match all of the semantic constituent boundaries.
- - 10. The system of claim 9, wherein the computer processor adds part-of-speech tags to words in the parse, and maps the parse to the multilevel template based at least in part on the part-of-speech tags.
  - 11. The system of claim 9, wherein the structured language model is trained on annotated training data that includes a part of speech tag at each node, and wherein the computer processor replaces syntactic labels with joint syntactic and semantic labels.
  - 12. The system of claim 9, wherein the computer processor utilizes headwords in the structured language model as historical contexts to predict a next word in a multilevel parse tree.
  - 13. The system of claim 9, wherein each node of annotated training data is associated with a word from a natural language input, and wherein each node is also associated with a k-prefix that identifies previous words in the natural language input.
  - 14. The system of claim 9, wherein the parse comprises a two-level semantic parse, and wherein the parse is generated based at least in part on a deleted interpolation probability.
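
Claim 14 refers to a deleted interpolation probability without giving a formula. As a rough illustration of the general technique only, the sketch below mixes relative-frequency estimates of decreasing context length, falling back to a uniform distribution. It assumes plain word contexts rather than the headword contexts of the structured language model, and it uses a single fixed weight `lam`, whereas deleted interpolation normally estimates the weights on held-out data. All function and variable names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def build_counts(events: Sequence[str], max_order: int = 3) -> Tuple[Counter, Counter]:
    """Count n-grams up to max_order and how often each context was seen."""
    ngrams: Counter = Counter()
    contexts: Counter = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(events) - n + 1):
            gram = tuple(events[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts


def interpolated_prob(word: str, context: Sequence[str], ngrams: Counter,
                      contexts: Counter, vocab_size: int, lam: float = 0.7) -> float:
    """Recursively interpolate each order with the next lower order."""
    def estimate(ctx: Optional[Tuple[str, ...]]) -> float:
        if ctx is None:
            return 1.0 / vocab_size          # uniform floor below the unigram
        lower = estimate(ctx[1:] if ctx else None)
        total = contexts[ctx]
        if total == 0:
            return lower                     # unseen context: use the lower order only
        relative_freq = ngrams[ctx + (word,)] / total
        return lam * relative_freq + (1.0 - lam) * lower
    return estimate(tuple(context))


tokens = "schedule a meeting with bob schedule a call with alice".split()
vocab = sorted(set(tokens))
ngrams, contexts = build_counts(tokens)

p_meeting = interpolated_prob("meeting", ("schedule", "a"), ngrams, contexts, len(vocab))
p_bob = interpolated_prob("bob", ("schedule", "a"), ngrams, contexts, len(vocab))
total_mass = sum(interpolated_prob(w, ("schedule", "a"), ngrams, contexts, len(vocab))
                 for w in vocab)
```

Because every level is a convex combination of normalized distributions, the probabilities over the vocabulary sum to one for any context.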

15. A method of utilizing an information extraction system to extract information from a natural language input, comprising:
- receiving a natural language input;
  
  utilizing a computer processor, that is a component of a computer, to generate a set of parses from the natural language input based on a template indicative of information to be extracted from the natural language input, wherein the set of parses are generated using a structured language model trained by generating a set of syntactic parses for a given word string that all match semantic constituent boundaries, by determining whether unlabeled constituents that define a semantic parse is included in a set of constituents that define the syntactic parse, wherein parses that do not match the semantic constituent boundaries are discarded, and replacing syntactic labels in a parse tree with joint syntactic and semantic labels based on the generated parses excluding the discarded parses, and retraining the structured language model by generating parses that are constrained to identically match the semantic constituent labels of the joint syntactic and semantic labels and constrained to match all of the semantic constituent boundaries;
  
  calculating a probability for each of the generated parses, comprising summing the probability over all parses having a common semantic parse;
  
  ranking the set of parses based on the probability; and
  
  outputting one or more of the parses based on the ranking.
- - 16. The method of claim 15, wherein the natural language input comprises an action, and wherein the method further comprises initializing a word-predictor, a tagger, and a parser with the parsed natural language input.
  - 17. The method of claim 16, wherein the action is associated with scheduling a meeting.
  - 18. The method of claim 15, wherein extracting the information comprises:
    - filling in slots associated with the natural language input; and
      
      proposing a set of n syntactic binary parses for each sentence in the natural language input.
  - 19. The method of claim 15, wherein parsing the natural language input comprises:
    - associating headwords with nodes in a syntactic parse; and
      
      enforcing the syntactic parse to match a label constraint.
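
The scoring steps of claim 15 (summing probability over all parses that share a semantic parse, then ranking) can be sketched as below. The sketch assumes each generated parse has already been reduced to a hashable semantic parse plus a probability. The function name `rank_semantic_parses`, the string encoding of semantic parses, and the probabilities are hypothetical illustration values.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def rank_semantic_parses(scored_parses: Iterable[Tuple[Hashable, float]],
                         top_n: int = 1) -> List[Tuple[Hashable, float]]:
    """Sum probabilities of parses with a common semantic parse, then rank."""
    totals: Dict[Hashable, float] = defaultdict(float)
    for semantic_parse, prob in scored_parses:
        totals[semantic_parse] += prob
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Three syntactic parses; the first two map to the same semantic parse.
scored = [
    ("ScheduleMeeting(Time=noon)", 0.20),
    ("ScheduleMeeting(Time=noon)", 0.15),
    ("ScheduleMeeting(Subject=noon)", 0.30),
]

ranked = rank_semantic_parses(scored, top_n=2)
best_semantic_parse, best_probability = ranked[0]
```

Summing before ranking matters in this example: the single most probable parse carries the `Subject=noon` reading, but the two parses that agree on `Time=noon` together outweigh it.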

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Chelba, Ciprian; Mahajan, Milind
Primary Examiner(s): Shah, Paras D
Application Number: US12/862,001
Publication Number: US 20100318348A1
Time in Patent Office: 1,337 Days
Field of Search: 704/9.231, 704/236, 704/257, 704/247, 704/252
US Class Current: 704/257

CPC Class Codes
- G06F 40/00   Handling natural language d...
- G06F 40/205   Parsing
- G06F 40/211   Syntactic parsing, e.g. bas...
- G06F 40/237   Lexical tools
- G06F 40/30   Semantic analysis
- G06F 40/40   Processing or translation o...
- G06F 40/56   Natural language generation
- G10L 15/00   Speech recognition G10L17/0...
- G10L 15/04   Segmentation; Word boundary...
- G10L 15/05   Word boundary detection
- G10L 15/18   using natural language mode...
- G10L 15/1822   Parsing for meaning underst...
- G10L 15/22   Procedures used during a sp...
