Method and system for bootstrapping statistical processing into a rule-based natural language parser
First Claim
1. A method in a computer system for bootstrapping statistical processing into a rule-based parser for parsing input strings of natural language text using a set of conditioned rules, the method comprising the steps of:
- (a) operating the parser such that the parser attempts to apply a subset of every applicable rule of the parser to each input string;
(b) compiling statistics indicating the likelihood of success of each rule of the parser, based on the success of each rule when applied in step (a); and
(c) operating the parser such that the parser applies at least one of the rules of the parser in descending order of the likelihood of success indicated by the statistics compiled in step (b).
2 Assignments
0 Petitions
Accused Products
Abstract
A method and system for bootstrapping statistical processing into a rule-based natural language parser is provided. In a preferred embodiment, a statistical bootstrapping software facility optimizes the operation of a robust natural language parser that uses a set of lexicon entries to determine possible parts of speech of words from an input string and a set of rules to combine words from the input string into syntactic structures. The facility first operates the parser in a statistics compilation mode, in which, for each of many sample input strings, the parser attempts to apply all applicable rules and lexicon entries. While the parser is operating in the statistics compilation mode, the facility compiles statistics indicating the likelihood of success of each rule and lexicon entry, based on the success of each rule and lexicon entry when applied in the statistics compilation mode. After a sufficient body of likelihood of success statistics have been compiled, the facility operates the parser in an efficient parsing mode, in which the facility uses the compiled statistics to optimize the operation of the parser. In order to parse an input string in the efficient parsing mode, the facility causes the parser to apply applicable rules and lexicon entries in the descending order of the likelihood of their success as indicated by the statistics compiled in the statistics compilation mode.
-
Citations
31 Claims
-
1. A method in a computer system for bootstrapping statistical processing into a rule-based parser for parsing input strings of natural language text using a set of conditioned rules, the method comprising the steps of:
-
(a) operating the parser such that the parser attempts to apply a subset of every applicable rule of the parser to each input string; (b) compiling statistics indicating the likelihood of success of each rule of the parser, based on the success of each rule when applied in step (a); and (c) operating the parser such that the parser applies at least one of the rules of the parser in descending order of the likelihood of success indicated by the statistics compiled in step (b). - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A method in a computer system for bootstrapping statistical processing into a rule-based natural language parser to efficiently parse a principal input string of natural language text using a plurality of sample input strings of natural language text representative of strings to be parsed by the natural language parser, the natural language parser forming one or more parse results from an input string comprised of words by applying rules from a set of conditioned rules that each combine words or groups of words that have already been combined, certain subsets of the set of rules being applicable when parsing particular input strings, comprising the steps of:
-
for each sample input string; exhaustively parsing the sample input string by applying each applicable rule of the set of rules to form one or more parse results, and if a single parse result was formed by exhaustively parsing the sample input string, updating for each rule that combined words or groups of words that have already been combined in the parse result an indication of the number of times that the rule combined words or groups of words that had already been combined; and efficiently parsing the principal input string by applying applicable rules from the set of rules in the decreasing order of their likelihood of success as indicated by the updated indications of the number of times that each rule combined words or groups of words that had already been combined. - View Dependent Claims (9)
-
-
10. A method in a computer system for bootstrapping statistical processing into a rule-based natural language parser to efficiently parse a principal input string using a plurality of sample input strings representative of strings to be parsed by the natural language parser, the natural language parser for forming one or more parse results from an input string comprised of words by applying rules from a set of conditioned rules that each combine words or already combined groups of words, certain subsets of the set of rules being applicable when parsing particular input strings, comprising the steps of:
-
for each rule, initializing a plurality of indications of the number of times that the rule has succeeded, each of the plurality of indications corresponding to a characteristic of the sample input string under which the rule has succeeded; for each sample input string; exhaustively parsing the sample input string by applying each applicable rule of the set of rules to produce one or more parse results, and if a single parse result was formed by exhaustively parsing the sample input string, updating for each rule that combined words or already combined groups of words in the parse result an indication of the number of times that the rule succeeded that corresponds to a characteristic of the sample input string; and efficiently parsing the principal input string by applying applicable rules to the principal input string from the set of rules in the decreasing order of their likelihood of success as indicated by updated indications of the number of times that each rule succeeded that corresponds to a characteristic of the principal input string. - View Dependent Claims (11, 12, 13, 14, 15)
-
-
16. A method in a computer system for compiling data useful to expedite the parsing of natural language text from a particular genre by a natural language parser applying a set of rules, the method comprising the steps of:
-
(a) exhaustively parsing sample input strings representative of the genre by attempting to apply every rule in the set of rules; (b) compiling statistics indicating the frequency with which rules in the set of rules contribute to a successful parse of the sample input strings in step (a); and (c) based on the compiled statistics, storing the relative probabilities that each rule in the set of rules will contribute to a successful expedited parse.
-
-
17. A method in a computer system for efficiently parsing input strings using a parser that utilizes a set of lexicon entries and a set of rules, each lexicon entry of the set of lexicon entries and each rule of the set of rules either succeeding or failing each time it is applied, certain subsets both of the set of rules and the set of lexicon entries being applicable when parsing particular input strings, the method comprising the steps of:
-
(a) applying all applicable lexicon entries in the set of lexicon entries and all applicable rules in the set of rules to parse each of a first set of input strings; (b) assembling statistics indicating the relative level of success of each lexicon entry in the set of lexicon entries and of each rule in the set of rules when applied in step (a); and (c) applying lexicon entries in the set of lexicon entries and rules in the set of rules in the decreasing order of the relative levels of success of the rules and lexicon entries indicated by the statistics assembled in step (b) to parse each of a second set of input strings. - View Dependent Claims (18)
-
-
19. A method in a computer system for accurately parsing a principal input string of natural language text using a set of sample input strings of natural language text and a set of rules each applicable to a subset of all possible input strings, having conditions, and specifying the generation of a syntactic characterization of at least a portion of an input string, each of the input strings having one or more lexical characterizations, the method comprising the steps of:
-
for each sample input string in the set of sample input strings; for each rule applicable to the sample input string; determining whether the conditions of the rule are satisfied, and if the conditions of the rule are satisfied, generating a syntactic characterization of at least a portion of the sample input string as specified by the rule to represent the combination of lexical characterizations and/or existing syntactic characterizations, and if exactly one target syntactic characterization of the entire sample input string is generated, updating success indicators for the rules for which syntactic characterizations are generated whose combination is represented directly or indirectly by the one target syntactic characterization; and for a principal input string, until a target syntactic characterization of the entire input string is generated; identifying the applicable rule most likely to produce a syntactic characterization whose combination is represented directly or indirectly by one target syntactic characterization of the entire principal input string, based on the updated success statistics, determining whether the conditions of the identified rule are satisfied, and if the conditions of the identified rule are satisfied, generating a syntactic characterization of at least a portion of the sample input string as specified by the rule to represent the combination of existing lexical characterizations and/or syntactic characterizations.
-
-
20. A method in a computer system for reiteratively enhancing a first set of statistics used by a rule-based parser for parsing input strings of natural language text using a set of conditioned rules, the first set of statistics indicating the likelihood of success of each rule of the parser, the method comprising the steps of:
-
(a) operating the parser such that the parser applies at least one of the rules of the parser in descending order of the likelihood of success indicated by the first set of statistics; (b) compiling a second set of statistics indicating the likelihood of success of each rule of the parser, based on the success of each rule when applied in step (a); and (c) operating the parser such that the parser applies at least one of the rules of the parser in descending order of the likelihood of success indicated by the second set of statistics compiled in step (b). - View Dependent Claims (21)
-
-
22. A computer-based apparatus for parsing natural language input strings using a successively refined set of statistics indicating the likelihood of success of each of a group of conditioned rules used by the apparatus, each rule either succeeding or failing each time it is applied, certain subsets of the set of rules being applicable during the parsing of particular input strings, comprising:
-
a parser for applying the rules; a statistics memory for storing the set of statistics indicating the relative likelihood of success of each rule in the group of rules; a parser controller for causing the parser to apply rules in the set of rules in the decreasing order of the relative likelihoods of success of the rules indicated by the statistics stored in the rule success statistics memory to parser each of a plurality of input strings; and a statistics refining subsystem for replacing the set of statistics stored in the statistics memory with statistics reflecting the level of success of the rules applied most recently by the parser.
-
-
23. A computer-based apparatus for efficiently parsing natural language input strings containing words, the apparatus comprising:
-
a parser for applying a set of conditioned rules, each rule of the set of rules either succeeding or failing each time it is applied by the parser, certain subsets of the set of rules being applicable during the parsing of particular input strings; an exhaustive mode parser controller for directing the parser to apply all applicable rules in the set of rules to parse each of a first set of input strings; a rule success statistics memory for storing statistics indicating the relative level of success of each rule in the set of rules when applied under the direction of the exhaustive mode parser controller; and an efficient mode parser controller for directing the parser to apply rules in the set of rules in the decreasing order of the relative levels of success of the rules indicated by the statistics stored in the rule success statistics memory to parse each of a second set of input strings. - View Dependent Claims (24, 25)
-
-
26. A computer-based apparatus for efficiently parsing natural language text input strings, the apparatus comprising:
-
a parser for applying a set of lexicon entries and a set of conditioned rules, each lexicon entry of the set of lexicon entries and each rule of the set of rules either succeeding or failing each time it is applied by the parser, certain subsets both of the set of rules and the set of lexicon entries being applicable when parsing particular input strings; an exhaustive mode parser controller for directing the parser to apply all applicable lexicon entries in the set of lexicon entries and all applicable rules in the set of rules to parse each of a first set of input strings; a success statistics memory for storing statistics indicating the relative level of success of each lexicon entry in the set of lexicon entries and of each rule in the set of rules when applied under the direction of the exhaustive mode parser controller; and an efficient mode parser controller for directing the parser to apply lexicon entries in the set of lexicon entries and rules in the set of rules in the decreasing order of the relative levels of success of the lexicon entries and rules indicated by the statistics stored in the success statistics memory to parse each of a second set of input strings. - View Dependent Claims (27)
-
-
28. A computer-based apparatus for compiling data useful to expedite the parsing of natural language text from a particular genre by a natural language parser applying a set of conditioned rules, comprising:
-
an exhaustive parser for exhaustively parsing sample input strings representative of the genre by attempting to apply every rule in the set of rules; a statistics compilation subsystem for compiling statistics indicating the frequency with which rules in the set of rules contribute to a successful parse of the sample input strings by the exhaustive parser; and a rule success probability memory for storing for use during an optimized parse, based on the statistics compiled by the statistics compilation subsystem, the relative probabilities that each rule in the set of rules will contribute to a successful expedited parse if applied.
-
-
29. A computer-based apparatus for efficiently parsing a plurality of principal natural language input strings using a plurality of sample natural language input strings, comprising:
-
a natural language parser for forming one or more parse trees from an input string comprised of words by applying rules from a set of conditioned rules that each combine words or already combined groups of words, certain subsets of the set of rules being applicable when parsing particular input strings; an exhaustive mode parser controller for directing the parser to apply all applicable rules in the set of rules to parse each of the sample input strings; a rule success indicator memory for storing, for each rule a plurality of indications of the number of times that the rule has succeeded, each of the plurality of indications corresponding to a characteristic of the sample input string under which the rule has succeeded when applied to parse sample input strings under the direction of the exhaustive mode parser controller; and an efficient mode parser controller for directing the parser to parse each principal input string by applying rules in the set of rules in the decreasing order of the relative levels of success of the rules indicated by an updated indication of the number of times that each rule succeeded that corresponds to a characteristic of the principal input string.
-
-
30. A computer-readable medium whose contents cause a computer system to bootstrap statistical processing into a rule-based parser for parsing input strings of natural language text using a set of conditioned rules by performing the steps of:
-
(a) operating the parser such that the parser attempts to apply a subset of every applicable rule of the parser to each input string; (b) compiling statistics indicating the likelihood of success of each rule of the parser, based on the success of each rule when applied in step (a); and (c) operating the parser such that the parser applies at least one of the rules of the parser in descending order of the likelihood of success indicated by the statistics compiled in step (b).
-
-
31. A computer-readable medium whose contents cause a computer system to bootstrap statistical processing into a rule-based natural language parser to efficiently parse a principal input string of natural language text using a plurality of sample input strings of natural language text representative of strings to be parsed by the natural language parser, the natural language parser forming one or more parse results from an input string comprised of words by applying rules from a set of conditioned rules that each combine words or groups of words that have already been combined, certain subsets of the set of rules being applicable when parsing particular input strings, by performing the steps of:
-
for each sample input string; exhaustively parsing the sample input string by applying each applicable rule of the set of rules to form one or more parse results, and if a single parse result was formed by exhaustively parsing the sample input string, updating for each rule that combined words or groups of words that have already been combined in the parse result an indication of the number of times that the rule combined words or groups of words that had already been combined; and efficiently parsing the principal input string by applying applicable rules from the set of rules in the decreasing order of their likelihood of success as indicated by the updated indications of the number of times that each rule combined words or groups of words that had already been combined.
-
Specification