Grammar compression

US 8,447,736 B2
Filed: 08/30/2011
Issued: 05/21/2013
Est. Priority Date: 06/26/2008
Status: Expired due to Fees

Abstract

Compression of extensive, rule-based grammars used to facilitate search queries is provided herein. Rule-based grammars include a list of rules that each comprise a sequence of token classes. Each token class is a logical grouping of tokens, and each token is a string of characters. A grammar is parsed to identify rules and token classes. Unimportant token classes are identified, and sets of unimportant token classes are merged to generate merged token classes. A compressed grammar is generated by substituting the merged token classes into the grammar for the corresponding unimportant token classes used to generate the merged token classes.
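
The vocabulary of the abstract (grammar, rule, token class, token) can be pictured with a small data model. The following is only an illustrative sketch: the `Grammar` class, its field names, and the example class names and tokens are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Grammar:
    # Token class name -> logical grouping of tokens (each token is a string).
    token_classes: Dict[str, Set[str]] = field(default_factory=dict)
    # Each rule is an ordered sequence of token class names.
    rules: List[Tuple[str, ...]] = field(default_factory=list)

    def size(self) -> Tuple[int, int]:
        """Return (number of rules, number of token classes)."""
        return len(self.rules), len(self.token_classes)


# Hypothetical example data, chosen only to illustrate the structure.
grammar = Grammar(
    token_classes={
        "CITY": {"seattle", "boston"},
        "HOTEL_WORD": {"hotel", "hotels", "inn"},
        "LODGING_WORD": {"hotel", "inn", "motel"},
    },
    rules=[
        ("HOTEL_WORD", "CITY"),
        ("LODGING_WORD", "CITY"),
    ],
)

print(grammar.size())
```

In this picture, compression would aim to reduce the number of token classes (and, indirectly, the number of distinct rules) while leaving classes marked as important untouched.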

19 Claims

1. A method for compressing a grammar, the method comprising:
- receiving a grammar to be compressed by using a computer, the grammar comprising a set of rules, each rule comprising a set of token classes, wherein a token class is a logical grouping of tokens, and a token is a string of one or more characters;
  
  parsing the grammar to identify the set of rules within the grammar and the set of token classes within each rule;
  
  eliminating, from the grammar, all but one of any duplicate rules identified from parsing the grammar, wherein duplicate rules include rules having the same token classes in the same order;
  
  identifying, from the set of token classes within each remaining rule, a set of unimportant token classes separate from a set of important token classes, where the set of unimportant token classes are eligible for compression;
  
  analyzing the set of unimportant token classes to identify two or more token classes within the set of unimportant token classes that are similar;
  
  merging the two or more token classes within the set of unimportant token classes identified from the currently received grammar to generate a merged token class by removing duplicate tokens and combining remaining tokens from the two or more token classes; and
  
  substituting the merged token class in the grammar for the two or more token classes that were merged to generate the merged token class to generate a compressed grammar.
- Dependent claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The method of claim 1, wherein the grammar comprises a manually-generated grammar.
  - 3. The method of claim 1, wherein the grammar comprises an automatically-generated grammar.
  - 4. The method of claim 1, wherein the method compresses multiple grammars to generate the compressed grammar, and wherein receiving a grammar comprises receiving multiple grammars.
  - 5. The method of claim 1, wherein the token class is identified as being important or unimportant based on user input.
  - 6. The method of claim 1, wherein the token class is automatically or algorithmically identified as being important or unimportant.
  - 7. The method of claim 1, wherein analyzing the set of unimportant classes comprises employing a similarity function to identify similar unimportant token classes.
  - 8. The method of claim 1, wherein merging the two or more unimportant token classes from the candidate subset to generate a merged token class comprises generating a duplicate-free union of tokens included in each of the two or more unimportant token classes from the candidate subset.
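
The steps recited in claim 1, together with the similarity function of claim 7 and the duplicate-free union of claim 8, can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the claims do not name a particular similarity function, so Jaccard overlap and the `threshold` parameter are assumptions, as are all function names, the merged-class naming scheme, and the example data. For brevity the sketch merges only the first similar pair it finds.

```python
from itertools import combinations


def dedupe_rules(rules):
    """Keep the first of any rules with the same token classes in the same order."""
    seen, kept = set(), []
    for rule in rules:
        key = tuple(rule)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


def jaccard(a, b):
    """Assumed similarity function: shared tokens divided by all tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compress(token_classes, rules, unimportant, threshold=0.5):
    """Merge the first pair of similar unimportant classes and substitute it."""
    rules = dedupe_rules(rules)
    # Only unimportant classes that actually occur in a remaining rule are candidates.
    used = sorted({c for r in rules for c in r if c in unimportant})
    for a, b in combinations(used, 2):
        if jaccard(token_classes[a], token_classes[b]) >= threshold:
            merged_name = f"{a}+{b}"  # hypothetical naming scheme
            classes = {k: v for k, v in token_classes.items() if k not in (a, b)}
            # Set union removes duplicate tokens and combines the remaining ones.
            classes[merged_name] = token_classes[a] | token_classes[b]
            new_rules = [
                tuple(merged_name if c in (a, b) else c for c in r) for r in rules
            ]
            return classes, new_rules
    return dict(token_classes), rules


# Hypothetical example data.
token_classes = {
    "CITY": {"seattle", "boston"},
    "HOTEL_WORD": {"hotel", "hotels", "inn"},
    "LODGING_WORD": {"hotel", "inn", "motel"},
}
rules = [
    ("HOTEL_WORD", "CITY"),
    ("HOTEL_WORD", "CITY"),  # duplicate rule, eliminated before merging
    ("LODGING_WORD", "CITY"),
]
classes, new_rules = compress(
    token_classes, rules, unimportant={"HOTEL_WORD", "LODGING_WORD"}
)
```

After substitution the two remaining rules become identical. Claim 1 does not recite a second de-duplication pass, so the sketch leaves both in place; claim 10 recites that extra step explicitly.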

9. One or more computer-storage media devices embodying computer-useable instructions that, when employed by a computing device, cause the computing device to perform a method comprising:
- receiving a grammar usable by a search engine device to route search queries to corresponding domains of information to find and return information for the search queries, the grammar comprising a plurality of rules, each rule comprising a sequence of token classes, wherein each token class is a logical grouping of tokens, and a token is a string of one or more characters;
  
  parsing the grammar to identify the plurality of rules and token classes;
  
  eliminating, from the grammar, all but one of any duplicate rules identified from parsing the grammar, wherein duplicate rules include rules having the same token classes in the same order;
  
  identifying, from the token classes, two or more unimportant token classes that are eligible for compression and at least one important token class that is not eligible for compression;
  
  breaking at least one rule into a plurality of sub-rules based on important token classes and removing sub-rules containing only important token classes, wherein each sub-rule includes a portion of the token classes from the at least one rule;
  
  analyzing the plurality of sub-rules to identify at least one set of sub-rules as compression candidates, wherein the at least one set of sub-rules contains unimportant token classes;
  
  analyzing the unimportant token classes in the at least one set of sub-rules to identify two or more unimportant token classes for compression;
  
  merging the two or more unimportant token classes in the at least one set of sub-rules identified for compression from the currently received grammar to generate a merged token class by removing duplicate tokens and combining the remaining tokens from the two or more unimportant token classes; and
  
  generating a compressed grammar by substituting the merged token class in the grammar for the two or more unimportant token classes that were merged to generate the merged token class.
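
The sub-rule step of claim 9 can be sketched as below. Claim 9 does not say where a rule is cut, so this sketch borrows the boundary convention recited in claim 10 (every sub-rule begins and ends with an important token class, and the first and last classes of a rule are treated as important for splitting). That choice, the function names, and the example rule are assumptions made for illustration.

```python
def split_into_sub_rules(rule, important):
    """Split a rule at important classes; adjacent sub-rules share a boundary class."""
    last = len(rule) - 1
    # The first and last positions always count as boundaries (assumed convention).
    boundaries = [
        i for i, c in enumerate(rule) if c in important or i == 0 or i == last
    ]
    return [tuple(rule[s:e + 1]) for s, e in zip(boundaries, boundaries[1:])]


def drop_all_important(sub_rules, important):
    """Remove sub-rules that contain only important token classes."""
    return [s for s in sub_rules if any(c not in important for c in s)]


# Hypothetical example rule and importance labels.
rule = ("CITY", "HOTEL", "FILLER", "BRAND")
important = {"CITY", "HOTEL", "BRAND"}

sub_rules = split_into_sub_rules(rule, important)
candidates = drop_all_important(sub_rules, important)
```

Here the rule splits into `("CITY", "HOTEL")` and `("HOTEL", "FILLER", "BRAND")`. The first contains only important classes, so only the second remains a candidate for compression.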

10. One or more computer-storage media devices embodying computer-useable instructions that, when employed by a computing device, cause the computing device to perform a method comprising:
- receiving a grammar usable by a search engine device to route search queries to corresponding domains of information to find and return information for the search queries, the grammar comprising a plurality of rules, each rule comprising a sequence of token classes used to describe search queries, each token class comprising a logical grouping of tokens, each token comprising a string of one or more characters;
  
  parsing the grammar to identify the plurality of rules and token classes;
  
  eliminating, from the grammar, any duplicate rules identified from parsing the grammar;
  
  assigning a score to each rule indicative of an importance of each rule to the grammar, wherein the score for each rule is based at least in part on the frequency with which each rule corresponds with search queries contained in query logs;
  
  identifying one or more rules as important rules based on the one or more rules having a high score indicative of a high importance to the grammar;
  
  removing the one or more important rules from consideration for compression;
  
  identifying, from the token classes, two or more unimportant token classes that are eligible for compression and at least one important token class that is not eligible for compression;
  
  breaking at least one rule into a plurality of sub-rules based on important token classes, wherein each sub-rule includes a portion of the token classes from the at least one rule and each sub-rule begins and ends with an important token class and wherein a beginning token class and ending token class in each rule is treated as an important token class for purposes of breaking each rule into the plurality of sub-rules;
  
  identifying one or more sub-rules containing only important token classes;
  
  removing the one or more sub-rules containing only important token classes from consideration for compression;
  
  eliminating, from the grammar, any duplicate sub-rules identified;
  
  analyzing the plurality of sub-rules to identify at least one set of sub-rules as compression candidates;
  
  analyzing the unimportant token classes in the at least one set of sub-rules to identify two or more unimportant token classes for compression;
  
  merging the two or more unimportant token classes from the at least one set of sub-rules to generate a merged token class;
  
  substituting the merged token class in the grammar for the two or more unimportant token classes that were merged to generate the merged token class; and
  
  eliminating any duplicate sub-rules and any duplicate rules after substituting the merged token classes in the grammar to generate a compressed grammar.
- Dependent claims (11, 12, 13, 14, 15, 16, 17, 18, 19)
  - 11. The one or more computer-storage media devices of claim 10, wherein the grammar comprises at least one of:
    - a manually-generated grammar; and
      
      an automatically-generated grammar.
  - 12. The one or more computer-storage media devices of claim 10, wherein the method compresses multiple grammars to generate the compressed grammar, and wherein receiving a grammar comprises receiving the multiple grammars.
  - 13. The one or more computer-storage media devices of claim 10, wherein a token class is identified as unimportant or important based on at least one of the following:
    - user input identifying the token class as being important or unimportant;
      
      a frequency with which the token class appears in the grammar;
      
      scores of rules in which the token class appears;
      
      underlying data information or additional corpus; and
      
      an application to which the grammar is to be used.
  - 14. The one or more computer-storage media devices of claim 10, wherein analyzing the plurality of sub-rules to identify the at least one set of sub-rules as compression candidates comprises identifying a set of two or more sub-rules that begin with the same token class as the other sub-rules in the set.
  - 15. The one or more computer-storage media devices of claim 10, wherein analyzing the plurality of sub-rules to identify the at least one set of sub-rules as compression candidates comprises identifying a set of two or more sub-rules that begin with the same token class as the other sub-rules in the set and end with the same token class as the other sub-rules in the set.
  - 16. The one or more computer-storage media devices of claim 10, wherein analyzing the plurality of sub-rules to identify the at least one set of sub-rules as compression candidates comprises identifying at least one sub-rule as an important sub-rule and removing the important sub-rule from consideration for compression.
  - 17. The one or more computer-storage media devices of claim 16, wherein at least one sub-rule is identified as an important sub-rule based on at least one of the following:
    - user input identifying the sub-rule as being important;
      
      a frequency with which the sub-rule appears in the grammar;
      
      underlying data information or additional corpus; and
      
      a frequency with which the sub-rule corresponds with search queries in query logs.
  - 18. The one or more computer-storage media devices of claim 10, wherein analyzing the unimportant token classes in the at least one set of sub-rules to identify two or more unimportant token classes for compression comprises employing a similarity function to identify similar unimportant token classes.
  - 19. The one or more computer-storage media devices of claim 10, wherein merging the two or more unimportant token classes from the at least one set of sub-rules to generate a merged token class comprises generating a duplicate-free union of tokens included in each of the two or more unimportant token classes from the at least one set of sub-rules.
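
Two steps specific to claim 10 and its dependents lend themselves to a short sketch: scoring rules by how often they correspond to logged queries, and grouping sub-rules that begin and end with the same token class (claim 15). The claims do not define how a rule "corresponds with" a query, so the whitespace tokenization and position-by-position matching below are assumptions, as are the function names, the score normalization, and the example data.

```python
from collections import Counter, defaultdict


def matches(rule, query, token_classes):
    """Assumed matching: word i of the query must belong to class i of the rule."""
    words = query.lower().split()
    return len(words) == len(rule) and all(
        w in token_classes[c] for w, c in zip(words, rule)
    )


def score_rules(rules, query_log, token_classes):
    """Score each rule by the fraction of logged queries it matches."""
    counts = Counter()
    for query in query_log:
        for rule in rules:
            if matches(rule, query, token_classes):
                counts[rule] += 1
    total = len(query_log) or 1
    return {rule: counts[rule] / total for rule in rules}


def group_candidates(sub_rules):
    """Group sub-rules sharing the same first and last token class."""
    groups = defaultdict(list)
    for sub in sub_rules:
        groups[(sub[0], sub[-1])].append(sub)
    # Only groups with two or more sub-rules are compression candidates.
    return {key: subs for key, subs in groups.items() if len(subs) >= 2}


# Hypothetical example data.
token_classes = {
    "CITY": {"seattle", "boston"},
    "HOTEL": {"hotel", "inn"},
    "CHEAP": {"cheap", "budget"},
}
rules = [("HOTEL", "CITY"), ("CHEAP", "HOTEL", "CITY")]
query_log = ["hotel seattle", "inn boston", "cheap hotel boston", "weather seattle"]

scores = score_rules(rules, query_log, token_classes)
important_rules = [r for r in rules if scores[r] >= 0.5]  # assumed cut-off

groups = group_candidates([
    ("CITY", "FILLER_A", "HOTEL"),
    ("CITY", "FILLER_B", "HOTEL"),
    ("HOTEL", "SUFFIX", "BRAND"),
])
```

Rules whose score clears the cut-off would be set aside as important and excluded from compression. Within each remaining group, the interior unimportant classes (here `FILLER_A` and `FILLER_B`) would be the ones compared for similarity and possibly merged.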

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Paparizos, Stelios; Anderson, Christopher Walter; Liu, Wei; Nair, Ajay; Ntoulas, Alexandros; Vemuri, Naga Srinivas
Primary Examiner(s)
Saeed, Usmaan
Assistant Examiner(s)
Vo, Cecile H

Application Number

US13/221,227
Publication Number

US 20110313993A1
Time in Patent Office

630 Days
Field of Search

707/665, 707/693, 707/706, 341/50, 341/51, 717/142, 717/143, 375/240, 715/242
US Class Current

707/665
CPC Class Codes

G06F 16/334 Query execution G06F16/335 ...

G06F 40/211 Syntactic parsing, e.g. bas...
