Method and system for bootstrapping statistical processing into a rule-based natural language parser

US 5,752,052 A
Filed: 06/24/1994
Issued: 05/12/1998
Est. Priority Date: 06/24/1994
Status: Expired due to Term

Abstract

A method and system for bootstrapping statistical processing into a rule-based natural language parser is provided. In a preferred embodiment, a statistical bootstrapping software facility optimizes the operation of a robust natural language parser that uses a set of lexicon entries to determine possible parts of speech of words from an input string and a set of rules to combine words from the input string into syntactic structures. The facility first operates the parser in a statistics compilation mode, in which, for each of many sample input strings, the parser attempts to apply all applicable rules and lexicon entries. While the parser is operating in the statistics compilation mode, the facility compiles statistics indicating the likelihood of success of each rule and lexicon entry, based on the success of each rule and lexicon entry when applied in the statistics compilation mode. After a sufficient body of likelihood-of-success statistics has been compiled, the facility operates the parser in an efficient parsing mode, in which the facility uses the compiled statistics to optimize the operation of the parser. In order to parse an input string in the efficient parsing mode, the facility causes the parser to apply applicable rules and lexicon entries in descending order of the likelihood of their success as indicated by the statistics compiled in the statistics compilation mode.
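
The efficient parsing mode described in the abstract can be illustrated with a small agenda-driven chart parser. The following is only a sketch under stated assumptions, not the patented implementation: the toy `LEXICON`, the binary `RULES`, every likelihood value, and the function name `efficient_parse` are hypothetical, and lexicon and rule likelihoods are assumed to be directly comparable already. Candidate edges are drawn from a single priority queue in descending order of likelihood, and parsing stops as soon as an `S` edge spans the whole input.

```python
import heapq
from itertools import count

# Hypothetical toy data: word -> [(part of speech, likelihood)].
LEXICON = {
    "the": [("Det", 0.9)],
    "saw": [("N", 0.3), ("V", 0.7)],
    "dog": [("N", 0.8)],
    "cat": [("N", 0.8)],
}
# Hypothetical binary rules: (result label, (left label, right label), likelihood).
RULES = [
    ("NP", ("Det", "N"), 0.8),
    ("VP", ("V", "NP"), 0.6),
    ("S", ("NP", "VP"), 0.9),
]


def efficient_parse(words):
    """Return (parsed, steps), where steps counts edges added to the chart."""
    n = len(words)
    chart = set()                 # edges already built: (label, start, end)
    agenda, tie = [], count()     # max-priority queue via negated likelihood
    for i, word in enumerate(words):
        for pos, p in LEXICON[word]:
            heapq.heappush(agenda, (-p, next(tie), (pos, i, i + 1)))
    steps = 0
    while agenda:
        _, _, edge = heapq.heappop(agenda)   # most likely candidate first
        if edge in chart:
            continue
        chart.add(edge)
        steps += 1
        if edge == ("S", 0, n):
            return True, steps               # parse complete: stop early
        label, start, end = edge
        for lhs, (left, right), p in RULES:
            for other_label, o_start, o_end in list(chart):
                # New edge is the left child, an adjacent chart edge the right.
                if label == left and other_label == right and o_start == end:
                    heapq.heappush(agenda, (-p, next(tie), (lhs, start, o_end)))
                # New edge is the right child, an adjacent chart edge the left.
                if label == right and other_label == left and o_end == start:
                    heapq.heappush(agenda, (-p, next(tie), (lhs, o_start, end)))
    return False, steps


print(efficient_parse("the dog saw the cat".split()))
```

In this toy run the low-likelihood noun reading of "saw" is still waiting in the queue when the sentence-spanning edge is found, so it is never applied. That is the kind of saving the efficient parsing mode aims for.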

31 Claims

1. A method in a computer system for bootstrapping statistical processing into a rule-based parser for parsing input strings of natural language text using a set of conditioned rules, the method comprising the steps of:
- (a) operating the parser such that the parser attempts to apply a subset of every applicable rule of the parser to each input string;
  
  (b) compiling statistics indicating the likelihood of success of each rule of the parser, based on the success of each rule when applied in step (a); and
  
  (c) operating the parser such that the parser applies at least one of the rules of the parser in descending order of the likelihood of success indicated by the statistics compiled in step (b).
- - 2. The method of claim 1 wherein step (a) includes the step of operating the parser such that the parser attempts to apply every applicable rule.
  - 3. The method of claim 1 wherein step (c) includes ceasing to apply rules when the parse is complete.
  - 4. The method of claim 1 wherein step (c) includes ceasing to apply rules when the parse is unlikely to complete.
  - 5. The method of claim 1 wherein step (b) compiles statistics indicating separate likelihoods of success for each rule of the parser corresponding to different conditions under which the rule has been applied in step (a), and wherein step (c) operates the parser to apply the rules of the parser in descending order of, in the case of each rule, the likelihood of success corresponding to the condition most similar to the condition in which the rule is to be applied.
  - 6. The method of claim 1 wherein step (b) includes the step of storing the number of times each rule succeeded when applied in step (a).
  - 7. The method of claim 6 wherein step (b) further includes the step of storing the number of times the parser attempted to apply each rule in step (a).
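
Steps (b) and (c) of claim 1, together with the counters of claims 6 and 7, can be sketched as follows. This is a minimal illustration under assumptions: the `RuleStats` class, the helper names, the rule names, and the observation data are all hypothetical, and a rule that was never attempted is arbitrarily given a likelihood of 0.0.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    """Per-rule counters: successes (claim 6) and attempts (claim 7)."""
    succeeded: int = 0
    attempted: int = 0

    def likelihood(self) -> float:
        # Assumption: an unattempted rule is treated as likelihood 0.0.
        return self.succeeded / self.attempted if self.attempted else 0.0


def compile_stats(observations):
    """Step (b): tally (rule, succeeded) pairs recorded during step (a)."""
    stats = {}
    for rule, ok in observations:
        entry = stats.setdefault(rule, RuleStats())
        entry.attempted += 1
        entry.succeeded += int(ok)
    return stats


def rule_order(stats, applicable):
    """Step (c): applicable rules in descending order of likelihood."""
    return sorted(
        applicable,
        key=lambda rule: stats.get(rule, RuleStats()).likelihood(),
        reverse=True,
    )


# Hypothetical outcomes recorded while running the parser in step (a).
observations = [
    ("NP->Det N", True), ("NP->Det N", True), ("NP->Det N", False),
    ("VP->V NP", True), ("VP->V NP", False),
    ("S->NP VP", True),
]
stats = compile_stats(observations)
order = rule_order(stats, ["VP->V NP", "NP->Det N", "S->NP VP"])
print(order)
```

Keeping both counters, rather than a precomputed ratio, lets later samples update the estimate incrementally.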

8. A method in a computer system for bootstrapping statistical processing into a rule-based natural language parser to efficiently parse a principal input string of natural language text using a plurality of sample input strings of natural language text representative of strings to be parsed by the natural language parser, the natural language parser forming one or more parse results from an input string comprised of words by applying rules from a set of conditioned rules that each combine words or groups of words that have already been combined, certain subsets of the set of rules being applicable when parsing particular input strings, comprising the steps of:
- for each sample input string:
  
  exhaustively parsing the sample input string by applying each applicable rule of the set of rules to form one or more parse results, and
  
  if a single parse result was formed by exhaustively parsing the sample input string, updating for each rule that combined words or groups of words that have already been combined in the parse result an indication of the number of times that the rule combined words or groups of words that had already been combined; and
  
  efficiently parsing the principal input string by applying applicable rules from the set of rules in the decreasing order of their likelihood of success as indicated by the updated indications of the number of times that each rule combined words or groups of words that had already been combined.
- - 9. The method of claim 8, further including the steps of:
    - initializing indications of the number of times that each rule has been applied; and
      
      for each sample input string, if a single parse result was formed by exhaustively parsing the sample string, for each applied rule, updating the indication of the number of times that the rule has been applied; and
      
      for each applicable rule, determining the likelihood of success of the rule by dividing the indicated number of times that the rule combined words or groups of words that had already been combined by the indicated number of times that the rule has been applied.
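
The bookkeeping in claims 8 and 9 can be sketched as below: counts are updated only for samples whose exhaustive parse yields exactly one result, and a rule's likelihood is the number of times it appears in such a result divided by the number of times it was applied. The data layout (each sample as a pair of applied rules and parse results) and the names `update_counts` and `likelihood` are assumptions made for illustration.

```python
from collections import Counter


def update_counts(samples, in_parse, applied):
    """Update counters from exhaustive parses of the sample strings.

    Each sample is (applied_rules, parses): the rules applied while parsing,
    and a list of parse results, each given as the rules it contains.
    """
    for applied_rules, parses in samples:
        if len(parses) != 1:
            continue  # zero or several parse results: sample is skipped
        applied.update(applied_rules)   # times each rule was applied
        in_parse.update(parses[0])      # times each rule is in the parse


def likelihood(rule, in_parse, applied):
    """Successes divided by applications; 0.0 if never applied (assumption)."""
    return in_parse[rule] / applied[rule] if applied[rule] else 0.0


in_parse, applied = Counter(), Counter()
samples = [
    (["A", "B", "C"], [["A", "C"]]),   # single parse: counted
    (["A", "B"], [["A"], ["B"]]),      # two parses: ignored
    (["A", "B"], [["B"]]),             # single parse: counted
]
update_counts(samples, in_parse, applied)
print({rule: likelihood(rule, in_parse, applied) for rule in "ABC"})
```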

10. A method in a computer system for bootstrapping statistical processing into a rule-based natural language parser to efficiently parse a principal input string using a plurality of sample input strings representative of strings to be parsed by the natural language parser, the natural language parser for forming one or more parse results from an input string comprised of words by applying rules from a set of conditioned rules that each combine words or already combined groups of words, certain subsets of the set of rules being applicable when parsing particular input strings, comprising the steps of:
- for each rule, initializing a plurality of indications of the number of times that the rule has succeeded, each of the plurality of indications corresponding to a characteristic of the sample input string under which the rule has succeeded;
  
  for each sample input string:
  
  exhaustively parsing the sample input string by applying each applicable rule of the set of rules to produce one or more parse results, and
  
  if a single parse result was formed by exhaustively parsing the sample input string, updating for each rule that combined words or already combined groups of words in the parse result an indication of the number of times that the rule succeeded that corresponds to a characteristic of the sample input string; and
  
  efficiently parsing the principal input string by applying applicable rules to the principal input string from the set of rules in the decreasing order of their likelihood of success as indicated by updated indications of the number of times that each rule succeeded that corresponds to a characteristic of the principal input string.
- - 11. The method of claim 10 wherein the initializing step initializes, for each rule, indications of the number of times that the rule has succeeded that correspond to different numbers of words combined by the rule;
    - and wherein the updating step updates an indication of the number of times that the rule has succeeded that corresponds to the number of words of the sample input string combined by the rule; and
      
      wherein the step of efficiently parsing applies rules from the set of rules in the decreasing order of their likelihood of success as indicated by updated indications of the number of times that each rule succeeded that correspond to the number of words of the principal input string that the rule would combine if applied.
  - 12. The method of claim 10 wherein the initializing step initializes, for each rule, indications of the number of times that the rule has succeeded that correspond to different numbers of words between the words combined by the rule and the end of the input string;
    - and wherein the updating step updates an indication of the number of times that the rule has succeeded that corresponds to the number of words of the sample input string between the words combined by the rule and the end of the sample input string; and
      
      wherein the efficient parsing step applies rules from the set of rules in the decreasing order of their likelihood of success as indicated by updated indications of the number of times that each rule succeeded that correspond to the number of words in the principal input string between the words the rule would combine if applied and the end of the principal input string.
  - 13. The method of claim 10 wherein the initializing step initializes, for each rule, indications of the number of times that the rule has succeeded that correspond to different minimum numbers of groups of words combined by earlier-applied rules between the words combined by the rule and the end of the input string;
    - and wherein the updating step updates an indication of the number of times that the rule has succeeded that corresponds to the minimum number of groups of words combined by earlier-applied rules of the sample input string between the words combined by the rule and the end of the sample input string; and
      
      wherein the efficient parsing step applies rules from the set of rules in the decreasing order of their likelihood of success as indicated by updated indications of the number of times that each rule succeeded that correspond to the minimum number of groups of words combined by earlier-applied rules in the principal input string between the words that the rule would combine if applied and the end of the principal input string.
  - 14. The method of claim 10 wherein the initializing step initializes, for each rule, indications of the number of times that the rule has succeeded that correspond to the identity of at least one subordinate rule that combined a group of words that the rule further combines with other words or groups of words;
    - and wherein the updating step updates an indication of the number of times that the rule has succeeded that correspond to the identity of a subordinate rule that combined a group of words of the sample input string that the rule further combines with other words or groups of words of the sample input string; and
      
      wherein the efficient parsing step applies rules from the set of rules in the decreasing order of their likelihood of success as indicated by updated indications of the number of times that each rule succeeded that correspond to the identity of a subordinate rule that combined a group of words of the principal input string that the rule further combines with other words or groups of words of the principal input string.
  - 15. The method of claim 10 wherein the initializing step initializes, for each rule, indications of the number of times that the rule has succeeded that correspond to different linguistic features of one or more words combined by the rule;
    - and wherein the updating step updates an indication of the number of times that the rule has succeeded that corresponds to a feature of a word of the sample input string combined by the rule; and
      
      wherein the efficient parsing step applies rules from the set of rules in the decreasing order of their likelihood of success as indicated by updated indications of the number of times that each rule succeeded that corresponds to a feature of a word of the principal input string combined by the rule.
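
Claims 10 through 15 keep several success counts per rule, one for each value of some characteristic. A minimal sketch of that idea, using the number of words combined (the characteristic of claim 11) as the key, might look like this. The rule names, the counts, and the helpers `record_success` and `order_candidates` are hypothetical, and a rule with no count for a given characteristic is assumed to rank as zero.

```python
from collections import defaultdict

# (rule, number of words combined) -> times the rule succeeded.
success_counts = defaultdict(int)


def record_success(rule, words_combined):
    """Updating step: bump the count matching this characteristic."""
    success_counts[(rule, words_combined)] += 1


def order_candidates(candidates):
    """Efficient parsing step: rank (rule, words_it_would_combine) pairs
    by the success count recorded for that same characteristic."""
    return sorted(
        candidates,
        key=lambda cand: success_counts.get(cand, 0),
        reverse=True,
    )


# Hypothetical counts gathered from exhaustively parsed samples.
for _ in range(3):
    record_success("PP-attach", 2)
record_success("PP-attach", 5)
for _ in range(2):
    record_success("Coord", 5)

ranked = order_candidates([("PP-attach", 5), ("Coord", 5), ("PP-attach", 2)])
print(ranked)
```

The same rule can therefore rank high for short spans and low for long ones. The other characteristics in claims 12 through 15 would only change what is used as the second element of the key.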

16. A method in a computer system for compiling data useful to expedite the parsing of natural language text from a particular genre by a natural language parser applying a set of rules, the method comprising the steps of:
- (a) exhaustively parsing sample input strings representative of the genre by attempting to apply every rule in the set of rules;
  
  (b) compiling statistics indicating the frequency with which rules in the set of rules contribute to a successful parse of the sample input strings in step (a); and
  
  (c) based on the compiled statistics, storing the relative probabilities that each rule in the set of rules will contribute to a successful expedited parse.

17. A method in a computer system for efficiently parsing input strings using a parser that utilizes a set of lexicon entries and a set of rules, each lexicon entry of the set of lexicon entries and each rule of the set of rules either succeeding or failing each time it is applied, certain subsets both of the set of rules and the set of lexicon entries being applicable when parsing particular input strings, the method comprising the steps of:
- (a) applying all applicable lexicon entries in the set of lexicon entries and all applicable rules in the set of rules to parse each of a first set of input strings;
  
  (b) assembling statistics indicating the relative level of success of each lexicon entry in the set of lexicon entries and of each rule in the set of rules when applied in step (a); and
  
  (c) applying lexicon entries in the set of lexicon entries and rules in the set of rules in the decreasing order of the relative levels of success of the rules and lexicon entries indicated by the statistics assembled in step (b) to parse each of a second set of input strings.
- - 18. The method of claim 17, further including the step of normalizing the assembled statistics indicating the relative level of success of each lexicon entry against the assembled statistics indicating the relative level of success of each rule, such that the statistics indicating the relative level of success of each lexicon entry are directly comparable to the statistics indicating the relative level of success of each rule.
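
Claim 18 does not say how the normalization is carried out. One possible approach, shown here purely as an assumption, rescales the lexicon statistics so that their mean equals the mean of the rule statistics, after which both kinds of entry can be ranked in a single list. The function name and all values are hypothetical.

```python
def normalize_lexicon(lexicon_scores, rule_scores):
    """Rescale lexicon scores so their mean matches the rule scores' mean.

    Assumption: mean matching is only one of many possible normalizations.
    """
    rule_mean = sum(rule_scores.values()) / len(rule_scores)
    lex_mean = sum(lexicon_scores.values()) / len(lexicon_scores)
    scale = rule_mean / lex_mean
    return {entry: score * scale for entry, score in lexicon_scores.items()}


# Hypothetical raw statistics on different scales.
rule_scores = {"NP->Det N": 0.75, "VP->V NP": 0.25}
lexicon_scores = {"saw/noun": 0.0625, "saw/verb": 0.4375}

normalized = normalize_lexicon(lexicon_scores, rule_scores)
combined = {**rule_scores, **normalized}
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```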

19. A method in a computer system for accurately parsing a principal input string of natural language text using a set of sample input strings of natural language text and a set of rules each applicable to a subset of all possible input strings, having conditions, and specifying the generation of a syntactic characterization of at least a portion of an input string, each of the input strings having one or more lexical characterizations, the method comprising the steps of:
- for each sample input string in the set of sample input strings:
  
  for each rule applicable to the sample input string:
  
  determining whether the conditions of the rule are satisfied, and
  
  if the conditions of the rule are satisfied, generating a syntactic characterization of at least a portion of the sample input string as specified by the rule to represent the combination of lexical characterizations and/or existing syntactic characterizations, and
  
  if exactly one target syntactic characterization of the entire sample input string is generated, updating success indicators for the rules for which syntactic characterizations are generated whose combination is represented directly or indirectly by the one target syntactic characterization; and
  
  for a principal input string, until a target syntactic characterization of the entire input string is generated:
  
  identifying the applicable rule most likely to produce a syntactic characterization whose combination is represented directly or indirectly by one target syntactic characterization of the entire principal input string, based on the updated success statistics,
  
  determining whether the conditions of the identified rule are satisfied, and
  
  if the conditions of the identified rule are satisfied, generating a syntactic characterization of at least a portion of the sample input string as specified by the rule to represent the combination of existing lexical characterizations and/or syntactic characterizations.

20. A method in a computer system for reiteratively enhancing a first set of statistics used by a rule-based parser for parsing input strings of natural language text using a set of conditioned rules, the first set of statistics indicating the likelihood of success of each rule of the parser, the method comprising the steps of:
- (a) operating the parser such that the parser applies at least one of the rules of the parser in descending order of the likelihood of success indicated by the first set of statistics;
  
  (b) compiling a second set of statistics indicating the likelihood of success of each rule of the parser, based on the success of each rule when applied in step (a); and
  
  (c) operating the parser such that the parser applies at least one of the rules of the parser in descending order of the likelihood of success indicated by the second set of statistics compiled in step (b).
- - 21. The method of claim 20, further including the steps of:
    - (d) compiling a second set of statistics indicating the likelihood of success of each rule of the parser, based on the success of each rule when applied in step (c); and
      
      (e) operating the parser such that the parser applies at least one of the rules of the parser in descending order of the likelihood of success indicated by the second set of statistics compiled in step (d).
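
The reiterative refinement of claims 20 and 21 can be sketched as a loop that parses with the current statistics, recompiles them from what happened, and parses again. The sketch below rests on several assumptions that are not in the claims: a toy "parser" that tries rules in order and stops at the first success, a hypothetical `try_rule` callback, and the choice to keep the previous estimate for any rule that was not attempted in a pass.

```python
from collections import Counter


def run_pass(corpus, order, try_rule):
    """Parse every string with rules in the given order; return new stats."""
    attempts, successes = Counter(), Counter()
    for text in corpus:
        for rule in order:
            attempts[rule] += 1
            if try_rule(rule, text):
                successes[rule] += 1
                break  # toy notion of "parse complete"
    return {rule: successes[rule] / attempts[rule] for rule in attempts}


def refine(stats, corpus, try_rule, rounds=2):
    """Repeat: order rules by current stats, parse, then recompile stats."""
    for _ in range(rounds):
        order = sorted(stats, key=stats.get, reverse=True)
        new_stats = run_pass(corpus, order, try_rule)
        # Assumption: rules not attempted in this pass keep their old value.
        stats = {**stats, **new_stats}
    return stats


def try_rule(rule, text):
    """Hypothetical rule test: rule 'A' fires on 'a', rule 'B' on 'b'."""
    return rule.lower() in text


initial = {"A": 0.9, "B": 0.5}   # deliberately poor first set of statistics
refined = refine(initial, ["ab", "b", "b", "b"], try_rule)
print(refined)
```

After the first pass the initially favored rule "A" drops below "B", so the second pass tries "B" first and no longer wastes attempts on "A".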

22. A computer-based apparatus for parsing natural language input strings using a successively refined set of statistics indicating the likelihood of success of each of a group of conditioned rules used by the apparatus, each rule either succeeding or failing each time it is applied, certain subsets of the set of rules being applicable during the parsing of particular input strings, comprising:
- a parser for applying the rules;
  
  a statistics memory for storing the set of statistics indicating the relative likelihood of success of each rule in the group of rules;
  
  a parser controller for causing the parser to apply rules in the set of rules in the decreasing order of the relative likelihoods of success of the rules indicated by the statistics stored in the rule success statistics memory to parse each of a plurality of input strings; and
  
  a statistics refining subsystem for replacing the set of statistics stored in the statistics memory with statistics reflecting the level of success of the rules applied most recently by the parser.

23. A computer-based apparatus for efficiently parsing natural language input strings containing words, the apparatus comprising:
- a parser for applying a set of conditioned rules, each rule of the set of rules either succeeding or failing each time it is applied by the parser, certain subsets of the set of rules being applicable during the parsing of particular input strings;
  
  an exhaustive mode parser controller for directing the parser to apply all applicable rules in the set of rules to parse each of a first set of input strings;
  
  a rule success statistics memory for storing statistics indicating the relative level of success of each rule in the set of rules when applied under the direction of the exhaustive mode parser controller; and
  
  an efficient mode parser controller for directing the parser to apply rules in the set of rules in the decreasing order of the relative levels of success of the rules indicated by the statistics stored in the rule success statistics memory to parse each of a second set of input strings.
- - 24. The computer-based apparatus of claim 23, further including a parse terminator for terminating the parse of an input string from the second set under the control of the efficient mode parser controller when the parse is unlikely to complete, based upon the number of rules below a threshold relative level of success, as indicated by the statistics stored in the rule success statistics memory, that have been applied by the efficient mode parser controller.
  - 25. The computer-based apparatus of claim 24 wherein the parse terminator terminates a parse when the number of rules below the threshold level of success that have been applied exceeds the product of a threshold number of rules and the number of words in the input string.
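
The termination test of claims 24 and 25 reduces to a simple comparison: the parse is abandoned once the number of applied rules whose likelihood falls below a threshold exceeds a threshold number of rules multiplied by the number of words in the input. A minimal sketch follows; the class name, parameter names, and threshold values are hypothetical.

```python
class ParseTerminator:
    """Decides when an efficient-mode parse is unlikely to complete."""

    def __init__(self, likelihood_threshold, rules_per_word, num_words):
        self.likelihood_threshold = likelihood_threshold
        self.limit = rules_per_word * num_words   # product from claim 25
        self.low_count = 0

    def record(self, likelihood):
        """Note one applied rule; only low-likelihood rules are counted."""
        if likelihood < self.likelihood_threshold:
            self.low_count += 1

    def should_stop(self):
        """True once the low-likelihood count exceeds the limit."""
        return self.low_count > self.limit


# Hypothetical settings: 3-word input, at most 2 low-likelihood rules per word.
terminator = ParseTerminator(likelihood_threshold=0.2, rules_per_word=2, num_words=3)
terminator.record(0.9)            # high likelihood: not counted
for _ in range(6):
    terminator.record(0.1)        # six low-likelihood rules: at the limit
at_limit = terminator.should_stop()
terminator.record(0.05)           # seventh low-likelihood rule
past_limit = terminator.should_stop()
print(at_limit, past_limit)
```

Scaling the limit by input length gives longer sentences proportionally more room before the parse is given up.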

26. A computer-based apparatus for efficiently parsing natural language text input strings, the apparatus comprising:
- a parser for applying a set of lexicon entries and a set of conditioned rules, each lexicon entry of the set of lexicon entries and each rule of the set of rules either succeeding or failing each time it is applied by the parser, certain subsets both of the set of rules and the set of lexicon entries being applicable when parsing particular input strings;
  
  an exhaustive mode parser controller for directing the parser to apply all applicable lexicon entries in the set of lexicon entries and all applicable rules in the set of rules to parse each of a first set of input strings;
  
  a success statistics memory for storing statistics indicating the relative level of success of each lexicon entry in the set of lexicon entries and of each rule in the set of rules when applied under the direction of the exhaustive mode parser controller; and
  
  an efficient mode parser controller for directing the parser to apply lexicon entries in the set of lexicon entries and rules in the set of rules in the decreasing order of the relative levels of success of the lexicon entries and rules indicated by the statistics stored in the success statistics memory to parse each of a second set of input strings.
- - 27. The computer-based apparatus of claim 26, further including a statistics normalizer for normalizing the statistics stored in the success statistics memory indicating the relative level of success of each lexicon entry against the statistics stored in the success statistics memory indicating the relative level of success of each rule, such that the statistics indicating the relative level of success of each lexicon entry are directly comparable to the statistics indicating the relative level of success of each rule.

28. A computer-based apparatus for compiling data useful to expedite the parsing of natural language text from a particular genre by a natural language parser applying a set of conditioned rules, comprising:
- an exhaustive parser for exhaustively parsing sample input strings representative of the genre by attempting to apply every rule in the set of rules;
  
  a statistics compilation subsystem for compiling statistics indicating the frequency with which rules in the set of rules contribute to a successful parse of the sample input strings by the exhaustive parser; and
  
  a rule success probability memory for storing for use during an optimized parse, based on the statistics compiled by the statistics compilation subsystem, the relative probabilities that each rule in the set of rules will contribute to a successful expedited parse if applied.

29. A computer-based apparatus for efficiently parsing a plurality of principal natural language input strings using a plurality of sample natural language input strings, comprising:
- a natural language parser for forming one or more parse trees from an input string comprised of words by applying rules from a set of conditioned rules that each combine words or already combined groups of words, certain subsets of the set of rules being applicable when parsing particular input strings;
  
  an exhaustive mode parser controller for directing the parser to apply all applicable rules in the set of rules to parse each of the sample input strings;
  
  a rule success indicator memory for storing, for each rule a plurality of indications of the number of times that the rule has succeeded, each of the plurality of indications corresponding to a characteristic of the sample input string under which the rule has succeeded when applied to parse sample input strings under the direction of the exhaustive mode parser controller; and
  
  an efficient mode parser controller for directing the parser to parse each principal input string by applying rules in the set of rules in the decreasing order of the relative levels of success of the rules indicated by an updated indication of the number of times that each rule succeeded that corresponds to a characteristic of the principal input string.

30. A computer-readable medium whose contents cause a computer system to bootstrap statistical processing into a rule-based parser for parsing input strings of natural language text using a set of conditioned rules by performing the steps of:
- (a) operating the parser such that the parser attempts to apply a subset of every applicable rule of the parser to each input string;
  
  (b) compiling statistics indicating the likelihood of success of each rule of the parser, based on the success of each rule when applied in step (a); and
  
  (c) operating the parser such that the parser applies at least one of the rules of the parser in descending order of the likelihood of success indicated by the statistics compiled in step (b).

31. A computer-readable medium whose contents cause a computer system to bootstrap statistical processing into a rule-based natural language parser to efficiently parse a principal input string of natural language text using a plurality of sample input strings of natural language text representative of strings to be parsed by the natural language parser, the natural language parser forming one or more parse results from an input string comprised of words by applying rules from a set of conditioned rules that each combine words or groups of words that have already been combined, certain subsets of the set of rules being applicable when parsing particular input strings, by performing the steps of:
- for each sample input string:
  
  exhaustively parsing the sample input string by applying each applicable rule of the set of rules to form one or more parse results, and
  
  if a single parse result was formed by exhaustively parsing the sample input string, updating for each rule that combined words or groups of words that have already been combined in the parse result an indication of the number of times that the rule combined words or groups of words that had already been combined; and
  
  efficiently parsing the principal input string by applying applicable rules from the set of rules in the decreasing order of their likelihood of success as indicated by the updated indications of the number of times that each rule combined words or groups of words that had already been combined.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Heidorn, George E.; Richardson, Stephen Darrow
Primary Examiner(s): Hayes, Gail O.
Assistant Examiner(s): Kyle, Charles
Application Number: US08/265,845
Time in Patent Office: 1,418 Days
Field of Search: 364/419.08, 364/419.01, 364/419.02
US Class Current: 704/9
CPC Class Codes:
G10L 15/1815   Semantic context, e.g. disa...
G10L 15/193   Formal grammars, e.g. finit...
G10L 15/197   Probabilistic grammars, e.g...
