Grammar compression
First Claim
Patent Images
1. A method for compressing a grammar, the method comprising:
- receiving a grammar to be compressed by using a computer, the grammar comprising a set of rules, each rule comprising a set of token classes, wherein a token class is a logical grouping of tokens, and a token is a string of one or more characters;
parsing the grammar to identify the set of rules within the grammar and the set of token classes within each rule;
eliminating, from the grammar, all but one of any duplicate rules identified from parsing the grammar, wherein duplicate rules include rules having the same token classes in the same order;
identifying, from the set of token classes within each remaining rule, a set of unimportant token classes separate from a set of important token classes, where the set of unimportant token classes are eligible for compression;
analyzing the set of unimportant token classes to identify two or more token classes within the set of unimportant token classes that are similar;
merging the two or more token classes within the set of unimportant token classes identified from the currently received grammar to generate a merged token class by removing duplicate tokens and combining remaining tokens from the two or more token classes; and
substituting the merged token class in the grammar for the two or more token classes that were merged to generate the merged token class to generate a compressed grammar.
2 Assignments
0 Petitions
Accused Products
Abstract
Compression of extensive, rule-based grammars used to facilitate search queries is provided herein. Rule-based grammars include a list of rules that each comprise a sequence of token classes. Each token class is a logical grouping of tokens, and each token is a string of characters. A grammar is parsed to identify rules and token classes. Unimportant token classes are identified and sets of unimportant token classes are merged to generated merged token classes. A compressed grammar is generated by substituting the merged token classes into the grammar for corresponding unimportant token classes used to generate the merged token classes.
-
Citations
19 Claims
-
1. A method for compressing a grammar, the method comprising:
-
receiving a grammar to be compressed by using a computer, the grammar comprising a set of rules, each rule comprising a set of token classes, wherein a token class is a logical grouping of tokens, and a token is a string of one or more characters; parsing the grammar to identify the set of rules within the grammar and the set of token classes within each rule; eliminating, from the grammar, all but one of any duplicate rules identified from parsing the grammar, wherein duplicate rules include rules having the same token classes in the same order; identifying, from the set of token classes within each remaining rule, a set of unimportant token classes separate from a set of important token classes, where the set of unimportant token classes are eligible for compression; analyzing the set of unimportant token classes to identify two or more token classes within the set of unimportant token classes that are similar; merging the two or more token classes within the set of unimportant token classes identified from the currently received grammar to generate a merged token class by removing duplicate tokens and combining remaining tokens from the two or more token classes; and substituting the merged token class in the grammar for the two or more token classes that were merged to generate the merged token class to generate a compressed grammar. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. One or more computer-storage media devices embodying computer-useable instructions that, when employed by a computing device, cause the computing device to perform a method comprising:
-
receiving a grammar usable by a search engine device to route search queries to corresponding domains of information to find and return information for the search queries, the grammar comprising a plurality of rules, each rule comprising a sequence of token classes, wherein each token class is a logical grouping of tokens, and a token is a string of one or more characters; parsing the grammar to identify the plurality of rules and token classes; eliminating, from the grammar, all but one of any duplicate rules identified from parsing the grammar, wherein duplicate rules include rules having the same token classes in the same order; identifying, from the token classes, two or more unimportant token classes that are eligible for compression and at least one important token class that is not eligible for compression; breaking at least one rule into a plurality of sub-rules based on important token classes and removing sub-rules containing only important token classes, wherein each sub-rule includes a portion of the token classes from the at least one rule; analyzing the plurality of sub-rules to identify at least one set of sub-rules as compression candidates, wherein the at least one set of sub-rules contains unimportant token classes; analyzing the unimportant token classes in the at least one set of sub-rules to identify two or more unimportant token classes for compression; merging the two or more unimportant token classes in the at least one set of sub-rules identified for compression from the currently received grammar to generate a merged token class by removing duplicate tokens and combining the remaining tokens from the two or more unimportant token classes; and generating a compressed grammar by substituting the merged token class in the grammar for the two or more unimportant token classes that were merged to generate the merged token class.
-
-
10. One or more computer-storage media devices embodying computer-useable instructions that, when employed by a computing device, cause the computing device to perform a method comprising:
-
receiving a grammar usable by a search engine device to route search queries to corresponding domains of information to find and return information for the search queries, the grammar comprising a plurality of rules, each rule comprising a sequence of token classes used to describe search queries, each token class comprising a logical grouping of tokens, each token comprising a string of one or more characters; parsing the grammar to identify the plurality of rules and token classes; eliminating, from the grammar, any duplicate rules identified from parsing the grammar; assigning a score to each rule indicative of an importance of each rule to the grammar, wherein the score for each rule is based at least in part on the frequency with which each rule corresponds with search queries contained in query logs; identifying one or more rules as important rules based on the one or more rules having a high score indicative of a high importance to the grammar; removing the one or more important rules from consideration for compression; identifying, from the token classes, two or more unimportant token classes that are eligible for compression and at least one important token class that is not eligible for compression; breaking at least one rule into a plurality of sub-rules based on important token classes, wherein each sub-rule includes a portion of the token classes from the at least one rule and each sub-rule begins and ends with an important token class and wherein a beginning token class and ending token class in each rule is treated as an important token class for purposes of breaking each rule into the plurality of sub-rules; identifying one or more sub-rules containing only important token classes; removing the one or more sub-rules containing only important token classes from consideration for compression; eliminating, from the grammar, any duplicate sub-rules identified; analyzing the plurality of sub-rules to identify at least one set of sub-rules as compression candidates; analyzing the unimportant token classes in the at least one set of sub-rules to identify two or more unimportant token classes for compression; merging the two or more unimportant token classes from the at least one set of sub-rules to generate a merged token class; substituting the merged token class in the grammar for the two or more unimportant token classes that were merged to generate the merged token class; and eliminating any duplicate sub-rules and any duplicate rules after substituting the merged token classes in the grammar to generate a compressed grammar. - View Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18, 19)
-
Specification