GRAMMAR COMPRESSION

US 20090327256A1
Filed: 06/26/2008
Published: 12/31/2009
Est. Priority Date: 06/26/2008
Status: Active Grant

Abstract

Compression of extensive, rule-based grammars used to facilitate search queries is provided herein. Rule-based grammars include a list of rules that each comprise a sequence of token classes. Each token class is a logical grouping of tokens, and each token is a string of characters. A grammar is parsed to identify rules and token classes. Unimportant token classes are identified and sets of unimportant token classes are merged to generate merged token classes. A compressed grammar is generated by substituting the merged token classes into the grammar for corresponding unimportant token classes used to generate the merged token classes.
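
The data model and the merge-then-substitute step described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class names (`business`, `in`, `city`, `town`), their tokens, and the helper `compress` are hypothetical, and the set of classes to merge is supplied by hand rather than derived. Dropping rules that become identical after substitution follows claim 9 rather than the abstract.

```python
# Hypothetical toy grammar: each token class maps to a set of tokens,
# and each rule is a sequence (tuple) of token-class names.
token_classes = {
    "business": {"pizza", "plumber"},
    "in": {"in", "near"},
    "city": {"seattle", "boston"},
    "town": {"boston", "salem"},
}
rules = [
    ("business", "in", "city"),
    ("business", "in", "town"),
]

def compress(rules, token_classes, to_merge, merged_name):
    """Merge the classes in `to_merge` and substitute the merged class into the rules."""
    # The merged class holds every token from the classes being merged.
    merged_tokens = set().union(*(token_classes[name] for name in to_merge))
    new_classes = {n: t for n, t in token_classes.items() if n not in to_merge}
    new_classes[merged_name] = merged_tokens
    # Substitute the merged class, then drop rules that became identical.
    new_rules = []
    for rule in rules:
        rewritten = tuple(merged_name if c in to_merge else c for c in rule)
        if rewritten not in new_rules:
            new_rules.append(rewritten)
    return new_rules, new_classes

compressed_rules, compressed_classes = compress(
    rules, token_classes, to_merge={"city", "town"}, merged_name="city_or_town"
)
```

In this toy case the two rules collapse into one, and the merged class accepts any token that either original class accepted, so the compressed grammar is smaller but slightly more permissive.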

20 Claims

1. A method for compressing a grammar, the method comprising:
- receiving a grammar, the grammar comprising a plurality of rules and the rules comprising a plurality of token classes;
  
  parsing the grammar to identify the plurality of rules within the grammar and the plurality of token classes within the plurality of rules;
  
  identifying, from the plurality of token classes, two or more unimportant token classes that are eligible for compression;
  
  analyzing the two or more unimportant classes to identify at least one subset of two or more unimportant token classes as a candidate subset for compression;
  
  merging the two or more unimportant token classes from the candidate subset to generate a merged token class; and
  
  substituting the merged token class in the grammar for the two or more unimportant token classes from the candidate subset to generate a compressed grammar.
- Dependent claims (2, 3, 4, 5, 6, 7)
  - 2. The method of claim 1, wherein the grammar comprises at least one of:
    - a manually-generated grammar; and
      
      an automatically-generated grammar.
  - 3. The method of claim 1, wherein the method compresses multiple grammars to generate the compressed grammar, and wherein receiving a grammar comprises receiving multiple grammars.
  - 4. The method of claim 1, wherein a token class is identified as being important or unimportant based on user input.
  - 5. The method of claim 1, wherein a token class is automatically or algorithmically identified as being important or unimportant.
  - 6. The method of claim 1, wherein analyzing the two or more unimportant classes to identify at least one subset of two or more unimportant token classes as a candidate subset for compression comprises employing a similarity function to identify similar unimportant token classes.
  - 7. The method of claim 1, wherein merging the two or more unimportant token classes from the candidate subset to generate a merged token class comprises generating a duplicate-free union of tokens included in each of the two or more unimportant token classes from the candidate subset.
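
Claims 6 and 7 can be illustrated with a short Python sketch. The claims do not name a particular similarity function, so Jaccard similarity over token sets is an assumption here, as are the threshold of 0.5, the helper names (`jaccard`, `candidate_subsets`, `merge`), and the sample classes.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two token sets (an assumed choice of similarity function)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidate_subsets(unimportant, threshold=0.5):
    """Return pairs of unimportant token-class names whose similarity meets the threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(unimportant), 2)
        if jaccard(unimportant[x], unimportant[y]) >= threshold
    ]

def merge(unimportant, names):
    """Duplicate-free union of the tokens in the named classes, in first-seen order."""
    merged, seen = [], set()
    for name in names:
        for token in unimportant[name]:
            if token not in seen:
                seen.add(token)
                merged.append(token)
    return merged

# Hypothetical unimportant token classes and their tokens.
unimportant = {
    "city": ["seattle", "boston", "salem"],
    "town": ["boston", "salem", "concord"],
    "color": ["red", "blue"],
}
pairs = candidate_subsets({name: set(tokens) for name, tokens in unimportant.items()})
merged = merge(unimportant, pairs[0])
```

Here `city` and `town` share two of four distinct tokens, so they form the only candidate pair, and the merged class lists each shared token once.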

8. One or more computer-storage media embodying computer-useable instructions that, when employed by a computing device, cause the computing device to perform a method comprising:
- receiving a grammar usable by a search engine to route search queries to corresponding domains of information to find and return information for the search queries, the grammar comprising a plurality of rules, each rule comprising a sequence of token classes;
  
  parsing the grammar to identify the plurality of rules and token classes;
  
  identifying, from the token classes, two or more unimportant token classes that are eligible for compression and at least one important token class that is not eligible for compression;
  
  breaking at least one rule into a plurality of sub-rules based on important token classes, wherein each sub-rule includes a portion of the token classes from the at least one rule;
  
  analyzing the plurality of sub-rules to identify at least one set of sub-rules as compression candidates;
  
  analyzing the unimportant token classes in the at least one set of sub-rules to identify two or more unimportant token classes for compression;
  
  merging the two or more unimportant token classes from the at least one set of sub-rules to generate a merged token class; and
  
  generating a compressed grammar by substituting the merged token class in the grammar for the two or more unimportant token classes that were merged to generate the merged token class.
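
The sub-rule step in claim 8 can be sketched in Python as below. Claim 8 does not say exactly where a rule is cut, so this sketch borrows the convention recited in claim 9: every important token class is a boundary, and the first and last classes of a rule are treated as boundaries too. The function name `split_into_sub_rules` and the sample rule are hypothetical.

```python
def split_into_sub_rules(rule, important):
    """Split a rule into sub-rules that each begin and end at a boundary class.

    A boundary is any important token class, plus the first and last class of
    the rule (the convention recited in claim 9; claim 8 does not fix one).
    """
    last = len(rule) - 1
    boundaries = [i for i, c in enumerate(rule) if c in important or i in (0, last)]
    # Adjacent boundaries delimit one sub-rule; the shared class appears in both.
    return [tuple(rule[a:b + 1]) for a, b in zip(boundaries, boundaries[1:])]

# Hypothetical rule in which only "business" is marked important.
rule = ("adj", "business", "prep", "det", "city")
sub_rules = split_into_sub_rules(rule, important={"business"})
```

With this input the rule yields two sub-rules, `("adj", "business")` and `("business", "prep", "det", "city")`, each of which contains a portion of the original rule's token classes.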

9. One or more computer-storage media embodying computer-useable instructions that, when employed by a computing device, cause the computing device to perform a method comprising:
- receiving a grammar usable by a search engine to route search queries to corresponding domains of information to find and return information for the search queries, the grammar comprising a plurality of rules, each rule comprising a sequence of token classes used to describe search queries, each token class comprising a logical grouping of tokens, each token comprising a string of one or more characters;
  
  parsing the grammar to identify the plurality of rules and token classes;
  
  eliminating, from the grammar, any duplicate rules identified from parsing the grammar;
  
  assigning a score to each rule indicative of an importance of each rule to the grammar, wherein the score for each rule is based at least in part on the frequency with which each rule corresponds with search queries contained in query logs;
  
  identifying one or more rules as important rules based on the one or more rules having a high score indicative of a high importance to the grammar;
  
  removing the one or more important rules from consideration for compression;
  
  identifying, from the token classes, two or more unimportant token classes that are eligible for compression and at least one important token class that is not eligible for compression;
  
  breaking at least one rule into a plurality of sub-rules based on important token classes, wherein each sub-rule begins and ends with an important token class and wherein a beginning token class and ending token class in each rule is treated as an important token class for purposes of breaking each rule into the plurality of sub-rules;
  
  identifying one or more sub-rules containing only important token classes;
  
  removing the one or more sub-rules containing only important token classes from consideration for compression;
  
  eliminating, from the grammar, any duplicate sub-rules identified;
  
  analyzing the plurality of sub-rules to identify at least one set of sub-rules as compression candidates;
  
  analyzing the unimportant token classes in the at least one set of sub-rules to identify two or more unimportant token classes for compression;
  
  merging the two or more unimportant token classes from the at least one set of sub-rules to generate a merged token class;
  
  substituting the merged token class in the grammar for the two or more unimportant token classes that were merged to generate the merged token class; and
  
  eliminating any duplicate sub-rules and any duplicate rules after substituting the merged token classes in the grammar to generate a compressed grammar.
- Dependent claims (10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
  - 10. The one or more computer-storage media of claim 9, wherein the grammar comprises at least one of:
    - a manually-generated grammar; and
      
      an automatically-generated grammar.
  - 11. The one or more computer-storage media of claim 9, wherein the method compresses multiple grammars to generate the compressed grammar, and wherein receiving a grammar comprises receiving the multiple grammars.
  - 12. The one or more computer-storage media of claim 9, wherein a token class is identified as unimportant or important based on at least one of the following:
    - user input identifying the token class as being important or unimportant;
      
      a frequency with which the token class appears in the grammar;
      
      scores of rules in which the token class appears;
      
      underlying data information or additional corpus; and
      
      an application in which the grammar is to be used.
  - 13. The one or more computer-storage media of claim 9, wherein analyzing the plurality of sub-rules to identify the at least one set of sub-rules as compression candidates comprises identifying a set of two or more sub-rules that begin with the same token class as the other sub-rules in the set.
  - 14. The one or more computer-storage media of claim 9, wherein analyzing the plurality of sub-rules to identify the at least one set of sub-rules as compression candidates comprises identifying a set of two or more sub-rules that begin with the same token class as the other sub-rules in the set and end with the same token class as the other sub-rules in the set.
  - 15. The one or more computer-storage media of claim 9, wherein analyzing the plurality of sub-rules to identify the at least one set of sub-rules as compression candidates comprises identifying at least one sub-rule as an important sub-rule and removing the important sub-rule from consideration for compression.
  - 16. The one or more computer-storage media of claim 15, wherein at least one sub-rule is identified as an important sub-rule based on at least one of the following:
    - user input identifying the sub-rule as being important;
      
      a frequency with which the sub-rule appears in the grammar;
      
      underlying data information or additional corpus; and
      
      a frequency with which the sub-rule corresponds with search queries in query logs.
  - 17. The one or more computer-storage media of claim 9, wherein analyzing the unimportant token classes in the at least one set of sub-rules to identify two or more unimportant token classes for compression comprises employing a similarity function to identify similar unimportant token classes.
  - 18. The one or more computer-storage media of claim 9, wherein merging the two or more unimportant token classes from the at least one set of sub-rules to generate a merged token class comprises generating a duplicate-free union of tokens included in each of the two or more unimportant token classes from the at least one set of sub-rules.
  - 19. The one or more computer-storage media of claim 9, wherein substituting the merged token class in the grammar for the two or more unimportant token classes that were merged to generate the merged token class comprises substituting the merged token class for all instances within the grammar of the two or more unimportant token classes that were merged to generate the merged token class.
  - 20. The one or more computer-storage media of claim 9, wherein substituting the merged token class in the grammar for the two or more unimportant token classes that were merged to generate the merged token class comprises substituting the merged token class only for instances within the at least one set of sub-rules of the two or more unimportant token classes that were merged to generate the merged token class.
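
The candidate-selection steps of claims 9 and 14 can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: the helper names (`group_candidates`, `mergeable_classes`), the sample sub-rules, and the choice to pass the unimportant classes as a set are all assumptions.

```python
from collections import defaultdict

def group_candidates(sub_rules, unimportant):
    """Group sub-rules that share both their first and last token class (claim 14)."""
    groups = defaultdict(list)
    for sub_rule in dict.fromkeys(sub_rules):  # drops duplicates, keeps order
        # Sub-rules with no unimportant class offer nothing to compress.
        if not any(c in unimportant for c in sub_rule):
            continue
        groups[(sub_rule[0], sub_rule[-1])].append(sub_rule)
    # Only groups with two or more sub-rules are compression candidates.
    return {key: group for key, group in groups.items() if len(group) >= 2}

def mergeable_classes(group, unimportant):
    """Unimportant token classes that appear anywhere in a candidate group."""
    return sorted({c for sub_rule in group for c in sub_rule if c in unimportant})

# Hypothetical sub-rules; "prep", "near" and "adj" are assumed unimportant.
sub_rules = [
    ("business", "prep", "city"),
    ("business", "near", "city"),
    ("business", "prep", "city"),  # duplicate sub-rule
    ("business", "city"),          # only important classes
    ("adj", "business"),
]
unimportant = {"prep", "near", "adj"}
groups = group_candidates(sub_rules, unimportant)
to_merge = mergeable_classes(groups[("business", "city")], unimportant)
```

Whether the resulting merged class then replaces `prep` and `near` everywhere in the grammar (claim 19) or only inside the candidate group (claim 20) is a separate choice; the narrower scope keeps unrelated rules as specific as they were before compression.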

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
NTOULAS, ALEXANDROS; LIU, WEI; PAPARIZOS, STELIOS; NAIR, AJAY; ANDERSON, CHRISTOPHER WALTER; VEMURI, NAGA SRINIVAS

Granted Patent

US 8,027,957 B2
Field of Search
US Class Current

1/1
CPC Class Codes

G06F 16/334 Query execution G06F16/335 ...

G06F 40/211 Syntactic parsing, e.g. bas...
