Template bootstrapping for domain-adaptable natural language generation
First Claim
1. A computer implemented method comprising:
a) receiving by a computer comprising a processor and a memory a set of original templates and storing the set of original templates in the memory;
b) accessing by a computer a set of databases comprising a large corpus of documents and searching by a search engine the set of databases based on the set of original templates;
c) identifying by the search engine a set of candidate sentences from a set of documents in the corpus by using a similarity measure to determine a similarity score, wherein the similarity measure comprises extracting a first set of tokens from at least one template from the set of original templates and extracting a second set of tokens from at least one candidate sentence from the set of candidate sentences, the first set of tokens and the second set of tokens each comprising a set of token-level 1 to token-level n grams, and further comprises comparing the extracted first set of tokens with the extracted second set of tokens by determining a first value representing an intersection of the extracted first and second sets of tokens, and dividing that first value by a second value derived by applying a minimum function to the extracted first and second sets of tokens to determine the similarity score;
d) automatically eliminating candidate sentences from the set of candidate sentences based upon a similarity score threshold to arrive at a reduced set of candidate sentences determined to be syntactically similar to the at least one template; and
e) processing the reduced set of candidate sentences to generate a set of natural language generation templates that, when processed by a computer and combined with a set of determined words or phrases, generate natural language text.
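The similarity measure in step c) and the threshold filter in step d) can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes that the "minimum function" means the smaller of the two n-gram set sizes (which makes the score an overlap coefficient), that tokens are produced by lowercasing and splitting on whitespace, and that a candidate is kept when its score is at or above the threshold. The function names, the default `n_max`, and the default `threshold` are all hypothetical.

```python
def ngrams(tokens, n_max):
    """Return the set of all token-level 1-grams through n_max-grams."""
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


def similarity_score(template, sentence, n_max=2):
    """Size of the n-gram intersection divided by the smaller set size.

    Assumption: whitespace tokenization and min(|A|, |B|) as the divisor.
    """
    a = ngrams(template.lower().split(), n_max)
    b = ngrams(sentence.lower().split(), n_max)
    if not a or not b:
        return 0.0  # avoid dividing by zero for empty input
    return len(a & b) / min(len(a), len(b))


def filter_candidates(template, candidates, threshold=0.5, n_max=2):
    """Keep candidates scoring at or above the (hypothetical) threshold."""
    return [s for s in candidates
            if similarity_score(template, s, n_max) >= threshold]
```

Dividing by the smaller set rather than by the union (as a Jaccard score would) means a short template fully contained in a longer sentence still scores 1.0, so sentence length alone does not penalize a candidate.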
Abstract
The present invention relates to a system and method for bootstrapping templates for use in natural language sentence generation. More specifically, the present invention relates to identifying a set of candidate sentences from a large corpus based on a set of original templates by using a similarity measure. The set of candidate sentences is then processed or cleaned to generate a set of templates for use in natural language sentence generation.
20 Claims
11. A system for bootstrapping a set of templates for generating natural language sentences, the system comprising:
a) at least one database comprising a corpus of documents;
b) a computer comprising a processor and a memory, the memory containing a set of executable code executable by the processor;
c) a search controller configured to receive a set of original templates and generate a query based on the set of original templates;
d) a search engine adapted to receive the query from the search controller and search the corpus of documents using the query based on the set of original templates to identify a set of candidate sentences from the corpus of documents;
e) a template analyzer adapted to:
i) select a set of similar sentences from the identified set of candidate sentences by using a similarity measure to determine a similarity score for each selected sentence, wherein the similarity measure comprises extracting a first set of tokens from at least one template from the set of original templates and extracting a second set of tokens from at least one candidate sentence from the set of candidate sentences, the first set of tokens and the second set of tokens each comprising a set of token-level 1 to token-level n grams, and further comprises comparing the extracted first set of tokens with the extracted second set of tokens by determining a first value representing an intersection of the extracted first and second sets of tokens, and dividing that first value by a second value derived by applying a minimum function to the extracted first and second sets of tokens to determine the similarity score;
ii) automatically eliminate candidate sentences from the set of candidate sentences based upon a similarity score threshold to arrive at a reduced set of candidate sentences determined to be syntactically similar to the at least one template; and
iii) generate a set of natural language generation templates based at least in part on the similarity scores that, when processed by a computer and combined with a set of determined words or phrases, generate natural language text.
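How the search controller, search engine, and template analyzer of claim 11 might hand data to one another can be sketched as below. This is an illustrative arrangement under stated assumptions, not the claimed system. The corpus is a plain in-memory list of strings, template slots are written in a hypothetical `[slot]` notation, the query is simply the set of literal template words, and `unigram_overlap` is a simplified stand-in scorer that uses 1-grams only. Every function name here is hypothetical.

```python
import re


def build_query(templates):
    """Search controller: collect the literal (non-slot) words of the templates."""
    terms = set()
    for template in templates:
        for tok in template.lower().split():
            if not (tok.startswith("[") and tok.endswith("]")):
                terms.add(tok)
    return terms


def search(corpus, query):
    """Search engine: return sentences sharing at least one term with the query."""
    hits = []
    for doc in corpus:
        # Naive sentence split on whitespace that follows ., ! or ?
        for sent in re.split(r"(?<=[.!?])\s+", doc):
            if query & set(sent.lower().rstrip(".!?").split()):
                hits.append(sent)
    return hits


def unigram_overlap(template, sentence):
    """Stand-in scorer: shared unigrams divided by the smaller set size."""
    a = set(template.lower().split())
    b = set(sentence.lower().rstrip(".!?").split())
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


def analyze(templates, candidates, score, threshold):
    """Template analyzer: keep candidates whose best score meets the threshold.

    Assumes `templates` is non-empty.
    """
    return [c for c in candidates
            if max(score(t, c) for t in templates) >= threshold]


templates = ["[company] shares rose today"]
corpus = ["Acme shares rose today. It rained.", "Shares of nothing."]
candidates = search(corpus, build_query(templates))
reduced = analyze(templates, candidates, unigram_overlap, threshold=0.5)
```

Passing the scorer in as a parameter keeps the analyzer independent of any one similarity measure, so a 1-to-n-gram scorer can replace the unigram stand-in without changing the surrounding flow.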
Specification