Template bootstrapping for domain-adaptable natural language generation

US 10,095,692 B2
Filed: 05/29/2015
Issued: 10/09/2018
Est. Priority Date: 11/29/2012
Status: Active Grant

5 Assignments

0 Petitions

Abstract

The present invention relates to a system and method for bootstrapping templates for use in natural language sentence generation. More specifically, the present invention relates to identifying a set of candidate sentences from a large corpus based on a set of original templates by using a similarity measure. The set of candidate sentences are then processed or cleaned to generate a set of templates for use in natural language sentence generation.
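
To make the end product concrete, the following is a minimal sketch of what a natural language generation template might look like once it has been bootstrapped, and how it could be combined with determined words or phrases to produce text. The slot names (`company`, `metric`, `value`, `period`) and the `generate` helper are hypothetical illustrations, not taken from the patent; the sketch uses Python's standard `string.Template`.

```python
from string import Template

# Hypothetical templates. The slot names are illustrative assumptions,
# not slots defined by the patent.
templates = [
    Template("$company reported $metric of $value for $period."),
    Template("Shares of $company rose $change in $period."),
]


def generate(template, slots):
    """Fill a template with a set of determined words or phrases."""
    # substitute() raises KeyError if a slot has no value supplied.
    return template.substitute(slots)


sentence = generate(
    templates[0],
    {
        "company": "Acme Corp",
        "metric": "revenue",
        "value": "2.1 billion dollars",
        "period": "the third quarter",
    },
)
print(sentence)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a missing slot fails loudly instead of leaking a raw placeholder into the generated sentence.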

107 Citations

20 Claims

1. A computer implemented method comprising:
- a) receiving by a computer comprising a processor and a memory a set of original templates and storing the set of original templates in the memory;
  
  b) accessing by a computer a set of databases comprising a large corpus of documents and searching by a search engine the set of databases based on the set of original templates;
  
  c) identifying by the search engine a set of candidate sentences from a set of documents in the corpus by using a similarity measure to determine a similarity score, wherein the similarity measure comprises extracting a first set of tokens from at least one template from the set of original templates and extracting a second set of tokens from at least one candidate sentence from the set of candidate sentences, the first set of tokens and the second set of tokens each comprising a set of token-level 1 to token-level n grams, and further comprises comparing the extracted first set of tokens with the extracted second set of tokens by determining a first value representing an intersection of the extracted first and second sets of tokens, and dividing that first value by a second value derived by applying a minimum function to the extracted first and second sets of tokens to determine the similarity score;
  
  d) automatically eliminating candidate sentences from the set of candidate sentences based upon a similarity score threshold to arrive at a reduced set of candidate sentences determined to be syntactically similar to the at least one template; and
  
  e) processing the reduced set of candidate sentences to generate a set of natural language generation templates that, when processed by a computer and combined with a set of determined words or phrases, generate natural language text.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. The method of claim 1 further comprising sorting the set of candidate sentences based on the similarity score.
  - 3. The method of claim 1 further comprising identifying all sentences in the corpus by splitting each sentence from each other sentence for every document in the corpus.
  - 4. The method of claim 1 further comprising wherein the similarity measure comprises the formula: [formula not reproduced in this text]
  - 5. The method of claim 1 wherein the identifying further comprises identifying a set of syntactically similar sentences that are not identical to any template in the set of original templates and that comprise a set of semantic characteristics similar to the set of original templates.
  - 6. The method of claim 1 further comprising determining if the similarity score for a sentence and a template from the set of original templates is higher than a determined threshold and placing the sentence in the set of candidate sentences.
  - 7. The method of claim 1 wherein the identifying further comprises identifying a set of candidate sentences that relate to a topic similar to a topic associated with the set of original templates.
  - 8. The method of claim 1 further comprising wherein the set of original templates are manually generated for a domain.
  - 9. The method of claim 1 further comprising wherein the large corpus of documents is a news corpus.
  - 10. The method of claim 1 further comprising generating by a computer a set of natural language sentences based on the set of natural language templates.
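
The similarity measure and thresholding recited in claim 1, steps c) and d), together with the sorting of claim 2, can be sketched as follows. This is one reading of the claim language, under stated assumptions: the "minimum function" is taken to mean the smaller of the two n-gram set sizes, so that score = |A ∩ B| / min(|A|, |B|), where A and B are the sets of 1-gram through n-gram token sequences of the template and the candidate sentence. Whitespace tokenization, lowercasing, the default `n_max=2`, the default threshold of 0.5, and all function names are illustrative choices, not values given in the claims.

```python
def ngrams(tokens, n_max):
    """Return the set of all token-level 1-grams through n_max-grams."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def similarity(template, sentence, n_max=2):
    """Overlap score: |A & B| / min(|A|, |B|) over n-gram sets (assumed reading)."""
    a = ngrams(template.lower().split(), n_max)
    b = ngrams(sentence.lower().split(), n_max)
    if not a or not b:
        return 0.0  # avoid dividing by zero on empty input
    return len(a & b) / min(len(a), len(b))


def filter_candidates(template, candidates, threshold=0.5, n_max=2):
    """Drop candidates at or below the threshold; sort the rest by score."""
    scored = [(similarity(template, s, n_max), s) for s in candidates]
    kept = [(score, s) for score, s in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


template = "the company reported strong earnings"
candidates = [
    "it rained all day",
    "the firm reported strong earnings today",
]
for score, sent in filter_candidates(template, candidates):
    print(f"{score:.3f}  {sent}")
```

Dividing by the smaller set size, rather than by the union as in a Jaccard score, keeps a short template from being penalized when it is matched against a much longer candidate sentence.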

11. A system for bootstrapping a set of templates for generating natural language sentences, the system comprising:
- a) at least one database comprising a corpus of documents;
  
  b) a computer comprising a processor and a memory, the memory containing a set of executable code executable by the processor;
  
  c) a search controller configured to receive a set of original templates and generate a query based on the set of original templates;
  
  d) a search engine adapted to receive the query from the search controller and search the corpus of documents using the query based on the set of original templates to identify a set of candidate sentences from the corpus of documents;
  
  e) a template analyzer adapted to:
  
  i) select a set of similar sentences from the identified set of candidate sentences by using a similarity measure to determine a similarity score for each selected sentence, wherein the similarity measure comprises extracting a first set of tokens from at least one template from the set of original templates and extracting a second set of tokens from at least one candidate sentence from the set of candidate sentences, the first set of tokens and the second set of tokens each comprising a set of token-level 1 to token-level n grams, and further comprises comparing the extracted first set of tokens with the extracted second set of tokens by determining a first value representing an intersection of the extracted first and second sets of tokens, and dividing that first value by a second value derived by applying a minimum function to the extracted first and second sets of tokens to determine the similarity score;
  
  ii) automatically eliminate candidate sentences from the set of candidate sentences based upon a similarity score threshold to arrive at a reduced set of candidate sentences determined to be syntactically similar to the at least one template; and
  
  iii) generate a set of natural language generation templates based at least in part on the similarity scores that, when processed by a computer and combined with a set of determined words or phrases, generate natural language text.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19, 20):
  - 12. The system of claim 11 wherein the template analyzer is further adapted to sort the set of candidate sentences based on the similarity score.
  - 13. The system of claim 11 wherein the template analyzer is further adapted to identify all sentences in the corpus by splitting each sentence from each other sentence for every document in the corpus.
  - 14. The system of claim 11 wherein the similarity measure comprises the formula: [formula not reproduced in this text]
  - 15. The system of claim 11 wherein the search engine is further adapted to identify a set of syntactically similar sentences that are not identical to any template in the set of original templates and that comprise a set of semantic characteristics similar to the set of original templates.
  - 16. The system of claim 11 wherein the template analyzer is further adapted to determine if the similarity score for a sentence and a template from the set of original templates is higher than a determined threshold and to place the sentence in the set of candidate sentences.
  - 17. The system of claim 11 wherein the template analyzer is adapted to identify a set of candidate sentences that relate to a topic similar to a topic associated with the set of original templates.
  - 18. The system of claim 11 further comprising wherein the set of original templates are manually generated for a domain.
  - 19. The system of claim 11 further comprising wherein the large corpus of documents is a news corpus.
  - 20. The system of claim 11 wherein the template analyzer is further adapted to generate a set of natural language sentences based on the set of natural language templates.
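
The retrieval side of the system in claim 11, elements c) and d), along with the sentence splitting of claim 13, might be arranged roughly as below. This is a toy sketch under stated assumptions: the class names mirror the claim's component names, but the keyword-set query, the stopword list, the regex-based sentence splitter, and the in-memory corpus are all hypothetical stand-ins for a real search engine and database. The template analyzer of element e) is intentionally left out.

```python
import re

# Illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "for", "on"}


def split_sentences(document):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document)
    return [p.strip() for p in parts if p.strip()]


class SearchController:
    """Receives original templates and generates a query from them."""

    def build_query(self, templates):
        terms = set()
        for template in templates:
            for token in re.findall(r"[a-z]+", template.lower()):
                if token not in STOPWORDS:
                    terms.add(token)
        return terms


class SearchEngine:
    """Searches a corpus of documents, sentence by sentence."""

    def __init__(self, corpus):
        # Split every document in the corpus into individual sentences.
        self.sentences = [s for doc in corpus for s in split_sentences(doc)]

    def search(self, query):
        hits = []
        for sentence in self.sentences:
            words = set(re.findall(r"[a-z]+", sentence.lower()))
            if words & query:  # at least one query term appears
                hits.append(sentence)
        return hits


corpus = [
    "Shares of Acme rose on Monday. The weather was mild.",
    "Analysts expect growth.",
]
query = SearchController().build_query(["Shares of the company rose sharply."])
candidate_sentences = SearchEngine(corpus).search(query)
print(candidate_sentences)
```

The candidate sentences returned here would then be handed to a template analyzer for similarity scoring and thresholding.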

Current Assignee: Thomson Reuters Enterprise Centre GmbH (The Woodbridge Co. Ltd.)
Original Assignee: Thomson Reuters Global Resources
Inventors: Song, Dezhao; Howald, Blake; Schilder, Frank
Primary Examiner(s): Thomas-Homescu, Anne L
Application Number: US14/726,119
Publication Number: US 20150261745A1
Time in Patent Office: 1,229 Days
CPC Class Codes: G06F 40/56 Natural language generation
