Method and system for generating grammar rules
First Claim
1. An information retrieval method for use with documents, including:
parsing a plurality of documents stored in a digital document database on computer-accessible storage media to identify key terms of each document based on sentence structure;
extracting a plurality of n-grams from each document, wherein one or more of the n-grams include spaces and partial words;
extracting a frequency of each n-gram in each document;
extracting a frequency of each n-gram in the plurality of documents;
assigning a novelty score to each of the n-grams in each corresponding document, said novelty score representing and being based on the extracted frequency of the n-gram in the document and the extracted frequency of the n-gram in the plurality of documents;
determining which of the extracted n-grams are in each identified key term;
assigning a weight to each key term based on the novelty scores assigned to the extracted n-grams in the key term;
generating the domain-specific grammar rules for a speech recognition engine, said grammar rules including said key terms in association with respective probabilities based on the weights of the key terms, wherein the key terms define phrases that are likely to be spoken from the plurality of documents, and the grammar rules define which of the phrases are likely to follow others of the phrases with the likelihoods defined by the probabilities;
determining an importance score for each said key term in each document based on how many of the documents include the key term and the frequency of the key term in the document;
parsing a search query to determine at least one search term wherein said search query is spoken and converted into text data representing the at least one search term by said speech recognition engine;
matching said at least one search term against the key terms of the documents to select a subset of the key terms and determine matching documents corresponding to the subset of the key terms;
generating a document fitness value for each matching document based on a subset of the importance scores corresponding to the subset of the key terms;
ranking said matching documents according to their fitness values; and
presenting said matching documents according to said ranking.
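The n-gram steps of the claim can be illustrated with a minimal Python sketch. The claim requires only that the novelty score be based on the n-gram's frequency in the document and its frequency in the plurality of documents; the specific formula used here (document probability times the log of the ratio of document probability to collection probability) is an assumption chosen for illustration, as are the function names, the trigram length and the averaging used for the key-term weight.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return every overlapping character n-gram, so spaces and
    partial words are included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def novelty_scores(docs, n=3):
    """Return one {ngram: score} dict per document.

    ASSUMPTION: score = p_doc * log(p_doc / p_all). The claim does not
    state a formula; this is only one plausible choice.
    """
    per_doc = [Counter(char_ngrams(d, n)) for d in docs]
    overall = Counter()
    for counts in per_doc:
        overall.update(counts)
    total_all = sum(overall.values())

    scores = []
    for counts in per_doc:
        total_doc = sum(counts.values())
        doc_scores = {}
        for gram, count in counts.items():
            p_doc = count / total_doc          # frequency in this document
            p_all = overall[gram] / total_all  # frequency in all documents
            doc_scores[gram] = p_doc * math.log(p_doc / p_all)
        scores.append(doc_scores)
    return scores


def key_term_weight(term, doc_scores, n=3):
    """Average the novelty of the n-grams contained in a key term
    (averaging is an assumption; the claim says only 'based on')."""
    grams = char_ngrams(term, n)
    if not grams:
        return 0.0
    return sum(doc_scores.get(g, 0.0) for g in grams) / len(grams)
```

For example, `char_ngrams("speech recognition")` yields trigrams such as `"h r"`, which spans a space and two partial words. A key term whose n-grams are common in one document but rare across the collection receives a higher weight in that document under this assumed formula.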
2 Assignments
0 Petitions
Abstract
An information retrieval system including a natural language parser (3) for parsing documents of a document space (1) to identify key terms of each document based on linguistic structure, and for parsing a search query to determine the search term, a feature extractor (4) for determining an importance score for terms of the document space based on distribution of the terms in the document space, an index term generator (5) for generating index terms using the key terms identified by the parser and the extractor and having an importance score above a threshold level, and a query clarifier (16) for selecting from the index terms, on the basis of the search term, index terms for selecting a document from the document space. A speech recognition engine (12) generates the query, and a bi-gram language module (6) generates grammar rules for the speech recognition engine using the index terms.
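The index term generator described in the abstract keeps only key terms whose importance score exceeds a threshold. A minimal sketch of that filter follows; the function name, the dictionary input and the example scores are hypothetical and are not taken from the patent.

```python
def select_index_terms(importance, threshold):
    """Keep key terms whose importance score is above the threshold.

    importance: {key_term: importance_score} (hypothetical input shape).
    Returns the retained terms in alphabetical order.
    """
    return sorted(term for term, score in importance.items() if score > threshold)


# Made-up scores, for illustration only.
example = {"annual report": 0.9, "the": 0.05, "budget forecast": 0.6}
index_terms = select_index_terms(example, threshold=0.5)
```

The comparison is strict, so a term scoring exactly at the threshold is dropped; whether the boundary is inclusive is not stated in the abstract.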
19 Citations
15 Claims
1. An information retrieval method for use with documents, as set out in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 9, 10, 11, 12, 13, 14, 15.
6. A machine-readable medium having stored thereon instructions for information retrieval comprising machine-executable code which, when executed by at least one machine, causes the machine to:
parse a plurality of documents stored in a digital document database on computer-accessible storage media to identify key terms of each document based on sentence structure;
extract a plurality of n-grams from each document, wherein one or more of the n-grams include spaces and partial words;
extract a frequency of each n-gram in each document;
extract a frequency of each n-gram in the plurality of documents;
assign a novelty score to each of the n-grams in each corresponding document, said novelty score representing and being based on the extracted frequency of the n-gram in the document and the extracted frequency of the n-gram in the plurality of documents;
determine which of the extracted n-grams are in each identified key term;
assign a weight to each key term based on the novelty scores assigned to the extracted n-grams in the key term;
generate the domain-specific grammar rules for a speech recognition engine, said grammar rules including said key terms in association with respective probabilities based on the weights of the key terms, wherein the key terms define phrases that are likely to be spoken from the plurality of documents, and the grammar rules define which of the phrases are likely to follow others of the phrases with the likelihoods defined by the probabilities;
determine an importance score for each said key term in each document based on how many of the documents include the key term and the frequency of the key term in the document;
parse a search query to determine at least one search term wherein said search query is spoken and converted into text data representing the at least one search term by said speech recognition engine;
match said at least one search term against the key terms of the documents to select a subset of the key terms and determine matching documents corresponding to the subset of the key terms;
generate a document fitness value for each matching document based on a subset of the importance scores corresponding to the subset of the key terms;
rank said matching documents according to their fitness values; and
present said matching documents according to said ranking.
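The retrieval steps (importance score, matching, fitness value, ranking) can be sketched as follows. The claim says the importance score depends on how many documents include the key term and on the term's frequency in the document; the smoothed TF-IDF product used here is an assumed stand-in for that score, and summing the matched scores to get the fitness value is likewise an assumption. All function names are hypothetical.

```python
import math


def importance_scores(doc_terms):
    """doc_terms: one list of key terms per document.

    Returns one {term: score} dict per document. ASSUMPTION: a smoothed
    TF-IDF product stands in for the claimed importance score.
    """
    n_docs = len(doc_terms)
    doc_freq = {}
    for terms in doc_terms:
        for term in set(terms):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    scores = []
    for terms in doc_terms:
        doc_scores = {}
        for term in set(terms):
            tf = terms.count(term) / len(terms)  # frequency in the document
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores


def rank_documents(search_terms, scores):
    """Return (doc_index, fitness) pairs for matching documents, best first.

    A document matches if it contains at least one search term as a key
    term; its fitness is the sum of the matched importance scores.
    """
    ranked = []
    for idx, doc_scores in enumerate(scores):
        matched = [t for t in search_terms if t in doc_scores]
        if matched:
            ranked.append((idx, sum(doc_scores[t] for t in matched)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Exact string matching between search terms and key terms is used only to keep the sketch short; the claim leaves the matching method open.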
7. An information retrieval system, comprising:
an extraction system including a computer system having data processing logic configured to provide:
a parser for parsing a plurality of documents stored in a digital document database on computer-accessible storage media to identify key terms of each document based on sentence structure;
a feature extractor for:
extracting a plurality of n-grams from each document, wherein one or more of the n-grams include spaces and partial words;
extracting a frequency of each n-gram in each document;
extracting a frequency of each n-gram in the plurality of documents;
assigning a novelty score to each of the n-grams in corresponding documents, said novelty score representing and being based on the extracted frequency of the n-gram in the document and the extracted frequency of the n-gram in the plurality of documents,
determining which of the extracted n-grams are in each identified key term, and
assigning a weight to each key term based on the novelty scores assigned to the extracted n-grams in the key term; and
a grammar generator for generating domain-specific grammar rules for a speech recognition engine, said grammar rules including said key terms in association with respective probabilities based on the weights of the key terms; and
a computer system having data processing logic configured to provide:
the speech recognition engine for generating a digital search query based on the grammar rules;
a natural language parser for parsing said digital search query to determine at least one search term;
a feature extractor for determining an importance score for each said key term in each document based on how many of the documents include the key term and the frequency of the key term in the document;
a query clarifier for selecting from said key terms, on the basis of said at least one search term, a subset of the key terms for selecting at least one document corresponding to at least one of the subset of key terms from said document database; and
a document retrieval module for generating a document fitness value for each selected document based on a subset of the importance scores corresponding to the subset of the key terms, and for ranking said selected documents according to their fitness values; and
an interface for presenting said selected documents according to said ranking.
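The grammar generator associates key terms with probabilities based on their weights. One way this might look is sketched below: observed transitions between key terms are accumulated, each contributing the weight of the following term, and the totals are normalised per preceding term. The accumulation scheme, the default weight of 1.0 for unweighted terms, the requirement that weights be positive, and the function name are all assumptions made for illustration.

```python
from collections import defaultdict


def grammar_rules(term_sequences, weights):
    """Build {previous_term: {next_term: probability}} rules.

    term_sequences: key terms in document order, one list per document.
    weights: {key_term: positive weight}; terms without a weight
    default to 1.0 (assumption).
    """
    mass = defaultdict(lambda: defaultdict(float))
    for seq in term_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            # Each observed transition contributes the follower's weight.
            mass[prev][nxt] += weights.get(nxt, 1.0)

    rules = {}
    for prev, followers in mass.items():
        total = sum(followers.values())
        if total > 0:  # skip rules that cannot be normalised
            rules[prev] = {nxt: m / total for nxt, m in followers.items()}
    return rules
```

With sequences `[["find", "annual report", "budget"], ["find", "budget"]]` and weights `{"annual report": 3.0, "budget": 1.0}`, the rule for `"find"` gives `"annual report"` probability 0.75 and `"budget"` probability 0.25.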
8. A method of generating index terms for documents, the method comprising:
using a computer system having data processing logic configured to parse documents stored in a digital document database on computer-accessible storage media to identify key terms of each document based on sentence structure;
using a feature extractor of the computer system to determine an importance score for said key terms based on a distribution of language independent n-grams over said documents and to assign a novelty score to each n-gram of each document based on the probability of the occurrence of the n-gram in the document and the probability of occurrence elsewhere in said documents;
retaining said key terms having an importance score above a predetermined threshold as said index terms; and
generating grammar rules, for a speech recognition engine, using said index terms, wherein the novelty score is determined on the basis of
Specification