Method and apparatus for generation of text documents

US 20060155530A1
Filed: 12/14/2005
Published: 07/13/2006
Est. Priority Date: 12/14/2004
Status: Abandoned Application

First Claim

Patent Images

1. A method for the generation of text documents, comprising the steps of:

(a) collecting a set of text documents as training documents and selecting a language model including model parameters (21);

(b) training of the language model by using the training documents and the model parameters (22);

(c) generating new documents (24) by using said probabilities and by using additional words beyond the words contained in the training documents, the new documents comprising the same distribution of their length as the training documents; and

(d) determine if the deviations of the word frequency as a function of the word rank (Zipf'"'"'s law) and the growths of the vocabulary as a function of the number of terms (Heap'"'"'s law) are below user defined thresholds (42, 66) and accepting only new documents which fulfil this condition.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method for the generation of large volumes of text documents comprises the steps of collecting a set of unstructured text documents as training documents and choosing a language model (21). New documents are generated by using the language model and its parameters and by using additional words beyond the words contained in the training documents (25). A n-gram model or a probabilistic deterministic context-free grammar (PCFG) model may be used as language model. For the generation of structured documents a language model for modelling the text is combined with a probabilistic deterministic finite automata (PDFA) for modelling the structure of the documents. The combined model is used to generate new documents from the scratch or by using the results of an analysis of a set of training documents. Since the models reflecting various essential features of a natural structured document collection, these features are adopted into the generated document collection (26) which is suited to evaluate the performance and scalability of natural language processing (NLP) algorithms.

21 Citations

View as Search Results

30 Claims

1. A method for the generation of text documents, comprising the steps of:
- (a) collecting a set of text documents as training documents and selecting a language model including model parameters (21);
  
  (b) training of the language model by using the training documents and the model parameters (22);
  
  (c) generating new documents (24) by using said probabilities and by using additional words beyond the words contained in the training documents, the new documents comprising the same distribution of their length as the training documents; and
  
  (d) determine if the deviations of the word frequency as a function of the word rank (Zipf'"'"'s law) and the growths of the vocabulary as a function of the number of terms (Heap'"'"'s law) are below user defined thresholds (42, 66) and accepting only new documents which fulfil this condition.
- View Dependent Claims (2, 3, 4, 5)
- - 2. The method of claim 1 wherein step (a) comprises the step of selecting a larger set of text documents as training documents (44, 67) if step (d) indicates that the quality of the generated documents is not sufficient.
  - 3. The method of claim 1 wherein n-gram probabilities are used as language model.
  - 4. The method of claim 1 wherein a probabilistic deterministic context-free grammar (PCFG) is used as language model.
  - 5. The method of claim 1 wherein step (c) comprises the step of choosing new words by replacing word of the vocabulary of the training documents with new words where the replacement takes place with a probability that increases with decreasing the frequency rank of the words to be replaced.

6. A method for the modelling, analysis and generation of text documents, comprising the steps of:
- (a) collecting a set of text documents as training documents;
  
  (b) computing the n-gram probabilities of the words contained in the training documents (40);
  
  (c) generating new documents by using said probabilities (41) and by using additional words which are not contained in the training documents, the new documents comprising the same distribution of their length as the training documents; and
  
  (d) determine if the deviations of the word frequency as a function of the word rank (Zipf'"'"'s law) and the growths of the vocabulary as a function of the number of terms (Heap'"'"'s law) are below user defined thresholds (42).
- View Dependent Claims (7, 8, 9, 10)
- - 7. The method of claim 6 wherein step (a) comprises the step of increasing the training set if the quality of the new documents are not sufficient (44).
  - 8. The method of claim 6 wherein step (d) comprises the step of the modifying the user defined thresholds if the new documents are not acceptable.
  - 9. The method of claim 6 wherein step (c) comprises the step of adding new words by replacing words from the selected set of training documents with new words.
  - 10. The method of claim 6 wherein pre-computed model data are used to generate new documents (50, 51), the pre-computed model data contain the terms and the probabilities of the n-gram model.

11. A method for the modelling, analysis and generation of text documents comprising the steps of:
- (a) collecting a set of text documents as training documents;
  
  (b) selecting a probabilistic deterministic context-free grammar (PCFG) model having a finite set of of nonterminal symbols, a finite set of terminal symbols is disjoint from the set of nonterminal symbols, a finite set R of production rules and an objective function (60);
  
  (c) applying a modification to the grammar model for changing the terminal and nonterminal symbols of the training documents and to the structure elements of the training documents (61);
  
  (d) computing of the objective function for the training documents by using various approximations (62), and hold the modification if the objective function has increased (63);
  
  (e) repeating step (c) until the modification result in an increase of the objective function to a user defined threshold (64);
  
  (f) generating new documents by using the modified grammar model and by using additional words beyond the words contained in the training documents, the new documents comprising the same distribution of their length as the training documents (65); and
  
  (g) determine if the deviations of the word frequency as a function of the word rank (Zipf'"'"'s law) and the growths of the vocabulary as a function of the number of terms (Heap'"'"'s law) are below user defined thresholds (66).
- View Dependent Claims (12, 13, 14, 15)
- - 12. The method of claim 11 wherein step (a) comprises the step of selecting a larger set of text documents as training documents if step (f) indicates that the quality of the generated documents is not sufficient (67).
  - 13. The method of claim 11 wherein wherein step (d) comprises the step of modifying the user defined thresholds if the new document are not acceptable.
  - 14. The method of claim 11 wherein step (c) comprises the step of adding new words by replacing words from the selected set of training documents with new words.
  - 15. The method of claim 11 wherein a probabilistic deterministic context-free grammar (PCFG) is directly used for the generating new documents (70, 71).

16. A method for the generation of structured text documents comprising the steps of:
- (a) collecting a set of structured text documents as training documents;
  
  (b) selecting a language model for the unstructured text parts and training of the language model by using training documents and the model parameters (22);
  
  (c) describing the document structure of the training documents by using a selected markup language (80);
  
  (d) obtaining a probabilistic deterministic finite automata (PDFA) having a single state (80);
  
  (e) adding additional states to the probabilistic deterministic finite automata (PDFA) to match the states occurring in the training documents (81);
  
  (f) calculating the probabilities of the transitions between the states using the appropriate transition frequencies which are occurring in the training documents (82);
  
  (g) training of the language model for each text part identified by the selected markup language (83);
  
  (h) generating the document structure of new documents (84) by applying the probabilistic deterministic finite automata (PDFA);
  
  (i) generating the text parts of the new documents (84) by using said computed probabilities and by using additional words beyond the words contained in the training documents, the new documents comprising the same distribution of their length as the training documents; and
  
  (j) determine if the deviations of the word frequency as a function of the word rank (Zipf'"'"'s law) and the growths of the vocabulary as a function of the number of terms (Heap'"'"'s law) are below user defined thresholds and accepting only new documents which fulfil this condition (42, 66).
- View Dependent Claims (17, 18, 19, 20, 29, 30)
- - 17. The method of claim 16 wherein step (a) comprises the step of selecting a larger set of structured text documents as training documents if step (j) indicates that the quality of the generated documents is not sufficient (44, 67).
  - 18. The method of claim 16 wherein step (i) comprises the step of choosing new words by replacing word of the vocabulary of the training documents with new words where the replacement takes place with a probability that is increasing with decreasing the frequency rank of the words to be replaced.
  - 19. The method of claim 16 wherein n-gram probabilities are used as language model.
  - 20. The method of claim 16 wherein step (b) selects a probabilistic deterministic context-free grammar (PCFG) having a finite set of nonterminal symbols, a finite set of terminal symbols that is disjoint from the set of nonterminal symbols, a finite set R of production rules and an objective function;
    - and comprising the steps of (k) applying a modification to the grammar which changes the terminal and nonterminal symbols and to the structure elements of the training documents (61);
      
      (l) computing an objective function for the training documents (62) by using various approximations, hold the modification if the objective function has increased (63); and
      
      (m) repeating step (k) until the modification results in an increase of the objective function to a threshold defined by the user.
  - 29. A computer program comprising program code means for performing the steps of any one of the claims 16-21 when said program is run on a computer system.
  - 30. A computer program product comprising program code means stored on a computer readable medium for performing the steps of any one of the claims 16-21 when said program is run on a computer system.

21. A method for the generation of structured text documents comprising the steps of obtaining a deterministic finite automata (90) from a description of the structure of the text documents to be generated;
- creating a probabilistic deterministic finite automata (91) by associating the same probability to all transition functions of the deterministic finite automata; and
  
  generating new documents (92) by applying said probabilistic deterministic finite automata (PDFA) to firstly generate the structure of the new documents and secondly to generate a n-gram model or a probabilistic deterministic context-free grammar (PCFG) model to be used for generating the text parts of the new documents.
- View Dependent Claims (23, 24)
- - 23. The apparatus of claim 21 wherein n-gram probabilities are used as language model.
  - 24. The apparatus of claim 21 wherein a probabilistic deterministic context-free grammar (PCFG) is used as language model.

22. An apparatus for the generation of text documents, using a collection of text documents as training documents, comprising (a) means (41, 65) for generating new documents by using a language model and its model parameters and by using additional words beyond the words contained in the training documents, the new documents comprising the same distribution of their length as the training documents;
- and (b) means (42, 66) for determination if the deviations of the word frequency as a function of the word rank (Zipf'"'"'s law) and the growths of the vocabulary as a function of the number of terms (Heap'"'"'s law) are below predefined thresholds and for accepting only new documents which fulfil this condition.

25. An apparatus for the generation of structured text documents, using a set of structured text documents as training documents, comprising (a) means (80) for describing the document structure of the training documents by using a selected markup language;
- (b) a probabilistic deterministic finite automata (PDFA) having a single state (80) and means (81) for adding additional states to the probabilistic deterministic finite automata (PDFA) to match the states occurring in the training documents;
  
  (c) means (82) for calculating the probabilities of the transitions between the states using the appropriate transition frequencies which are occurring in the training documents;
  
  (d) means (83) for training a language model for each text part identified by the selected markup language;
  
  (e) means (84) for generating the document structure of new documents by using the probabilistic deterministic finite automata (PDFA);
  
  (f) means (84) for generating the text parts of the new documents by using said language model and its model parameters and by using additional words beyond the words contained in the training documents, the new documents comprising the same distribution of their length as the training documents; and
  
  (g) means for determination if the deviations of the word frequency as a function of the word rank (Zipf'"'"'s law and the growths of the vocabulary as a function of the number of terms (Heap'"'"'s law) are below user defined thresholds and for accepting only new documents which fulfil this condition.
- View Dependent Claims (26, 27)
- - 26. The apparatus of claim 25 wherein n-gram probabilities are used as language model.
  - 27. The apparatus of claim 25 wherein a probabilistic deterministic context-free grammar (PCFG) is used as language model.

28. An apparatus for the generation of structured text documents comprising means (90) for obtaining a deterministic finite automata from a description of the structure of the text documents to be generated;
- means (91) for creating a probabilistic deterministic finite automata by associating the same probability to all transition functions of the deterministic finite automata; and
  
  means (92) for generating new documents by applying said probabilistic deterministic finite automata (PDFA) to firstly generate the structure of the new documents and secondly to generate a n-gram model or a probabilistic deterministic context-free grammar (PCFG) to be used for generating the text parts of the new documents.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Codron, Matthieu, Altevogt, Peter, Seiffert, Roland

Application Number

US11/304,337
Publication Number

US 20060155530A1
Time in Patent Office

Days
Field of Search
US Class Current

704/9
CPC Class Codes

G06F 40/216 using statistical methods

Method and apparatus for generation of text documents

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

21 Citations

30 Claims

Specification

Use Cases

Quick Links

Others

Method and apparatus for generation of text documents

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

21 Citations

30 Claims

Specification

Subscription Required

Use Cases

Quick Links

Others