Method and system for generating grammar rules

US 8,793,261 B2
Filed: 10/17/2001
Issued: 07/29/2014
Est. Priority Date: 10/17/2000
Status: Expired due to Fees

Abstract

An information retrieval system including a natural language parser (3) for parsing documents of a document space (1) to identify key terms of each document based on linguistic structure, and for parsing a search query to determine the search term, a feature extractor (4) for determining an importance score for terms of the document space based on distribution of the terms in the document space, an index term generator (5) for generating index terms using the key terms identified by the parser and the extractor and having an importance score above a threshold level, and a query clarifier (16) for selecting from the index terms, on the basis of the search term, index terms for selecting a document from the document space. A speech recognition engine (12) generates the query, and a bi-gram language module (6) generates grammar rules for the speech recognition engine using the index terms.
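
The index term generator described above keeps only key terms whose importance score exceeds a threshold. A minimal sketch of that filtering step follows; the function name `select_index_terms`, the example terms, their scores, and the threshold value are all hypothetical and do not come from the patent.

```python
def select_index_terms(scored_terms, threshold):
    """Keep key terms whose importance score is above the threshold."""
    return sorted(term for term, score in scored_terms.items() if score > threshold)


# Hypothetical key terms with made-up importance scores.
scored = {"call forwarding": 0.82, "the service": 0.05, "voicemail": 0.61}
index_terms = select_index_terms(scored, threshold=0.5)
print(index_terms)  # ['call forwarding', 'voicemail']
```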

15 Claims

1. An information retrieval method for use with documents, including:
- parsing a plurality of documents stored in a digital document database on computer-accessible storage media to identify key terms of each document based on sentence structure;
  
  extracting a plurality of n-grams from each document, wherein one or more of the n-grams include spaces and partial words;
  
  extracting a frequency of each n-gram in each document;
  
  extracting a frequency of each n-gram in the plurality of documents;
  
  assigning a novelty score to each of the n-grams in each corresponding document, said novelty score representing and being based on the extracted frequency of the n-gram in the document and the extracted frequency of the n-gram in the plurality of documents;
  
  determining which of the extracted n-grams are in each identified key term;
  
  assigning a weight to each key term based on the novelty scores assigned to the extracted n-grams in the key term;
  
  generating domain-specific grammar rules for a speech recognition engine, said grammar rules including said key terms in association with respective probabilities based on the weights of the key terms, wherein the key terms define phrases that are likely to be spoken from the plurality of documents, and the grammar rules define which of the phrases are likely to follow others of the phrases with the likelihoods defined by the probabilities;
  
  determining an importance score for each of said key terms in each document based on how many of the documents include the key term and the frequency of the key term in the document;
  
  parsing a search query to determine at least one search term wherein said search query is spoken and converted into text data representing the at least one search term by said speech recognition engine;
  
  matching said at least one search term against the key terms of the documents to select a subset of the key terms and determine matching documents corresponding to the subset of the key terms;
  
  generating a document fitness value for each matching document based on a subset of the importance scores corresponding to the subset of the key terms;
  
  ranking said matching documents according to their fitness values; and
  
  presenting said matching documents according to said ranking.
- Dependent claims (2, 3, 4, 5, 9, 10, 11, 12, 13, 14, 15):
  - 2. An information retrieval method as claimed in claim 1, wherein said matching includes selecting one of said key terms that include said search term, and said generating the document fitness value includes generating a relevant fitness value for each selected key term on the basis of respective importance scores of words in a longest-common substring of the search term and the selected key term.
  - 3. An information retrieval method as claimed in claim 2, wherein said matching includes using the selected key terms having a predetermined characteristic to determine said matching documents.
  - 4. An information retrieval method as claimed in claim 3, wherein said predetermined characteristic is having said relevant fitness value above a predetermined threshold.
  - 5. A method as claimed in claim 1, wherein said parsing is executed by a natural language parser.
  - 9. A method as claimed in claim 1, wherein each novelty score is determined on the basis of:
    - p_ij, which is the probability of the occurrence of n-gram i in document j determined from the extracted frequency of the n-gram i in the document j;
      
      q_ij, which is the probability of occurrence of the n-gram i elsewhere in said documents determined from the extracted frequency of the n-gram i in the plurality of documents and the extracted frequency of the n-gram i in the document j;
      
      t_ij, which is the probability of occurrence of the n-gram i in said documents determined from the extracted frequency of the n-gram i in the plurality of documents;
      
      S_j, which is the total count of n-grams in the document j; and
      
      S, which is Σ_j S_j, if p_ij ≥ q_ij.
  - 10. A method as claimed in claim 9, wherein each novelty score is determined to be zero if p_ij < q_ij.
  - 11. A method as claimed in claim 10, wherein each novelty score is determined on the basis of the following:
  - 12. A method as claimed in claim 1, wherein the plurality of n-grams extracted from each document are of the same length n.
  - 13. A method as claimed in claim 1, wherein a natural language parser executes said parsing, and said key terms are linguistically important terms of each document.
  - 14. A method as claimed in claim 13, wherein said parser generates key-centered phrase structure frames for sentences of each document, and generates at least one frame relation graph that is parsed to determine the frames representative of the sentences of each document, said frames including said key terms.
  - 15. A method as claimed in claim 1, wherein generating the grammar rules comprises generating a list of phrases including said key terms and said respective weights, and inputting said list as a bi-gram array with said weights representing said probabilities, to generate said grammar rules for said speech recognition engine.
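
The n-gram quantities named in claims 1, 9 and 10 can be illustrated with a short sketch. It assumes that the n-grams are fixed-length character windows (so they may contain spaces and partial words), and it reads p_ij, q_ij and t_ij as simple relative frequencies. The function names and the sample documents are hypothetical, and the novelty formula itself is not reproduced because it does not appear in the claim text above.

```python
from collections import Counter


def char_ngrams(text, n):
    """Slide an n-character window over the text; windows may span spaces and cut words."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_probabilities(docs, n):
    """Return, per document, a dict mapping each n-gram to (p_ij, q_ij, t_ij).

    Assumed reading of the claim definitions:
      p_ij = count of n-gram i in document j / S_j
      q_ij = count of n-gram i in the other documents / (S - S_j)
      t_ij = count of n-gram i in all documents / S
    """
    per_doc = [Counter(char_ngrams(d, n)) for d in docs]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    s_j = [sum(counts.values()) for counts in per_doc]  # n-grams per document
    s = sum(s_j)                                        # n-grams overall

    stats = []
    for j, counts in enumerate(per_doc):
        rest = s - s_j[j]
        doc_stats = {}
        for gram, f_ij in counts.items():
            p = f_ij / s_j[j]
            q = (totals[gram] - f_ij) / rest if rest else 0.0
            t = totals[gram] / s
            doc_stats[gram] = (p, q, t)
        stats.append(doc_stats)
    return stats


print(char_ngrams("call me", 5))  # ['call ', 'all m', 'll me']
stats = ngram_probabilities(["call me", "call us"], 5)
```

Under claim 10, an n-gram whose p_ij is below q_ij would receive a novelty score of zero, so only entries with p_ij ≥ q_ij would need to be scored further.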

6. A machine-readable medium having stored thereon instructions for information retrieval comprising machine-executable code which when executed by at least one machine, causes the machine to:
- parse a plurality of documents stored in a digital document database on computer-accessible storage media to identify key terms of each document based on sentence structure;
  
  extract a plurality of n-grams from each document, wherein one or more of the n-grams include spaces and partial words;
  
  extract a frequency of each n-gram in each document;
  
  extract a frequency of each n-gram in the plurality of documents;
  
  assign a novelty score to each of the n-grams in each corresponding document, said novelty score representing and being based on the extracted frequency of the n-gram in the document and the extracted frequency of the n-gram in the plurality of documents;
  
  determine which of the extracted n-grams are in each identified key term;
  
  assign a weight to each key term based on the novelty scores assigned to the extracted n-grams in the key term;
  
  generate domain-specific grammar rules for a speech recognition engine, said grammar rules including said key terms in association with respective probabilities based on the weights of the key terms, wherein the key terms define phrases that are likely to be spoken from the plurality of documents, and the grammar rules define which of the phrases are likely to follow others of the phrases with the likelihoods defined by the probabilities;
  
  determine an importance score for each said key term in each document based on how many of the documents include the key term and the frequency of the key term in the document;
  
  parse a search query to determine at least one search term wherein said search query is spoken and converted into text data representing the at least one search term by said speech recognition engine;
  
  match said at least one search term against the key terms of the documents to select a subset of the key terms and determine matching documents corresponding to the subset of the key terms;
  
  generate a document fitness value for each matching document based on a subset of the importance scores corresponding to the subset of the key terms;
  
  rank said matching documents according to their fitness values; and
  
  present said matching documents according to said ranking.
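
The scoring, matching and ranking steps recited above can be sketched as follows. The claims only say that the importance score depends on how many documents include the key term and on its frequency in the document, so the TF-IDF-style weighting used here is an assumption, as are the substring match, the summed fitness value, the function names and the sample data.

```python
import math
from collections import Counter


def importance_scores(doc_terms):
    """doc_terms holds one list of key terms per document.

    Assumed weighting: term frequency times log(1 + N / document frequency).
    """
    n_docs = len(doc_terms)
    df = Counter(term for terms in doc_terms for term in set(terms))
    scores = []
    for terms in doc_terms:
        tf = Counter(terms)
        scores.append({t: tf[t] * math.log(1 + n_docs / df[t]) for t in tf})
    return scores


def rank_documents(search_term, scores):
    """Rank documents by the summed importance of key terms containing the search term."""
    ranked = []
    for j, doc_scores in enumerate(scores):
        matched = [s for term, s in doc_scores.items() if search_term in term]
        if matched:
            ranked.append((j, sum(matched)))  # (document index, fitness value)
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical key terms for three documents.
doc_terms = [
    ["call forwarding", "call forwarding", "voicemail"],
    ["voicemail"],
    ["call waiting"],
]
ranking = rank_documents("call", importance_scores(doc_terms))
print([j for j, _ in ranking])  # [0, 2]
```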

7. An information retrieval system, comprising:
- an extraction system including a computer system having data processing logic configured to provide:
  
  a parser for parsing a plurality of documents stored in a digital document database on computer-accessible storage media to identify key terms of each document based on sentence structure;
  
  a feature extractor for:
  
  extracting a plurality of n-grams from each document, wherein one or more of the n-grams include spaces and partial words;
  
  extracting a frequency of each n-gram in each document;
  
  extracting a frequency of each n-gram in the plurality of documents;
  
  assigning a novelty score to each of the n-grams in corresponding documents, said novelty score representing and being based on the extracted frequency of the n-gram in the document and the extracted frequency of the n-gram in the plurality of documents, determining which of the extracted n-grams are in each identified key term, and assigning a weight to each key term based on the novelty scores assigned to the extracted n-grams in the key term; and
  
  a grammar generator for generating domain-specific grammar rules for a speech recognition engine, said grammar rules including said key terms in association with respective probabilities based on the weights of the key terms; and
  
  a computer system having data processing logic configured to provide:
  
  the speech recognition engine for generating a digital search query based on the grammar rules;
  
  a natural language parser for parsing said digital search query to determine at least one search term;
  
  a feature extractor for determining an importance score for each of said key terms in each document based on how many of the documents include the key term and the frequency of the key term in the document;
  
  a query clarifier for selecting from said key terms, on the basis of said at least one search term, a subset of the key terms for selecting at least one document corresponding to at least one of the subset of key terms from said document database; and
  
  a document retrieval module for generating a document fitness value for each selected document based on a subset of the importance scores corresponding to the subset of the key terms, and for ranking said selected documents according to their fitness values; and
  
  an interface for presenting said selected documents according to said ranking.
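
The grammar generator associates key terms with probabilities derived from their weights, and claim 15 describes feeding the weighted phrases in as a bi-gram array. One possible reading is sketched below: each weighted phrase contributes its weight to every adjacent word pair, and the totals are normalised per preceding word. The word-level tokenisation, the `<s>` and `</s>` boundary markers, the function name and the sample weights are assumptions for illustration, not the patent's stated procedure.

```python
from collections import defaultdict


def bigram_rules(weighted_phrases):
    """Turn {phrase: weight} into {previous word: {next word: probability}}."""
    totals = defaultdict(lambda: defaultdict(float))
    for phrase, weight in weighted_phrases.items():
        tokens = ["<s>"] + phrase.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            totals[prev][nxt] += weight

    rules = {}
    for prev, followers in totals.items():
        norm = sum(followers.values())
        if norm > 0:  # skip contexts whose phrases all carry zero weight
            rules[prev] = {nxt: w / norm for nxt, w in followers.items()}
    return rules


# Hypothetical key terms with made-up weights.
rules = bigram_rules({"call forwarding": 3.0, "call waiting": 1.0})
print(rules["call"])  # {'forwarding': 0.75, 'waiting': 0.25}
```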

8. A method of generating index terms for documents, the method comprising:
- using a computer system having data processing logic configured to parse documents stored in a digital document database on computer-accessible storage media to identify key terms of each document based on sentence structure;
  
  using a feature extractor of the computer system to determine an importance score for said key terms based on a distribution of language independent n-grams over said documents and to assign a novelty score to each n-gram of each document based on the probability of the occurrence of the n-gram in the document and the probability of occurrence elsewhere in said documents;
  
  retaining said key terms having an importance score above a predetermined threshold as said index terms; and
  
  generating grammar rules, for a speech recognition engine, using said index terms, wherein the novelty score is determined on the basis of

Current Assignee: Telstra Corporation Limited (Telstra Group Ltd.)
Original Assignee: Telstra Corporation Limited (Telstra Group Ltd.)
Inventors: Jiang, Jason; Starkie, Bradford Craig; Raskutti, Bhavani Laxman
Primary Examiner(s): Mofiz, Apu
Assistant Examiner(s): Daye, Chelcie L
Application Number: US10/399,587
Publication Number: US 20040044952A1
Time in Patent Office: 4,668 Days
Field of Search: 707/3, 707/5, 707/6, 707/100, 707/102, 707/750, 704/10, 704/200, 704/231, 704/251, 704/9
US Class Current: 707/750
CPC Class Codes:
G06F 16/313   Selection or weighting of t...
G06F 16/3334   Selection or weighting of t...
G06F 16/3344   using natural language anal...
G06F 16/93   Document management systems
G06F 40/205   Parsing
G06F 40/40   Processing or translation o...
G10L 15/00   Speech recognition G10L17/0...
