Method and apparatus for generation of text documents
Abstract
A method for the generation of large volumes of text documents comprises the steps of collecting a set of unstructured text documents as training documents and choosing a language model (21). New documents are generated by using the language model and its parameters and by using additional words beyond the words contained in the training documents (25). An n-gram model or a probabilistic context-free grammar (PCFG) model may be used as the language model. For the generation of structured documents, a language model for modelling the text is combined with a probabilistic deterministic finite automaton (PDFA) for modelling the structure of the documents. The combined model is used to generate new documents from scratch or by using the results of an analysis of a set of training documents. Since the models reflect various essential features of a natural structured document collection, these features are adopted into the generated document collection (26), which is well suited for evaluating the performance and scalability of natural language processing (NLP) algorithms.
30 Claims
1. A method for the generation of text documents, comprising the steps of:
(a) collecting a set of text documents as training documents and selecting a language model including model parameters (21);
(b) training the language model by using the training documents and the model parameters (22);
(c) generating new documents (24) by using the probabilities of the trained language model and by using additional words beyond the words contained in the training documents, the new documents having the same length distribution as the training documents; and
(d) determining whether the deviations of the word frequency as a function of word rank (Zipf's law) and of the vocabulary growth as a function of the number of terms (Heaps' law) are below user-defined thresholds (42, 66), and accepting only new documents which fulfil this condition.
Dependent claims: 2, 3, 4, 5
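The acceptance test of step (d) can be sketched in Python as follows. This is not part of the patent text: the particular deviation measures, the Heaps' law constants `k` and `beta`, and the threshold values are illustrative assumptions, since the claim leaves them user-defined.

```python
from collections import Counter

def zipf_deviation(tokens):
    """Mean relative deviation of the observed rank-frequency pairs
    from an ideal Zipf distribution f(rank) = f(1) / rank."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    top = freqs[0]
    devs = [abs(f - top / rank) / (top / rank)
            for rank, f in enumerate(freqs, start=1)]
    return sum(devs) / len(devs)

def heaps_deviation(tokens, k=10.0, beta=0.5):
    """Relative deviation of the observed vocabulary size from the
    Heaps' law prediction V(n) = k * n**beta (k and beta assumed)."""
    n, v = len(tokens), len(set(tokens))
    predicted = k * n ** beta
    return abs(v - predicted) / predicted

def accept(tokens, zipf_threshold=0.5, heaps_threshold=0.5):
    """Accept a generated document only if both deviations fall
    below the user-defined thresholds (claim 1, step d)."""
    return (zipf_deviation(tokens) < zipf_threshold
            and heaps_deviation(tokens) < heaps_threshold)
```

A generated document that already follows both laws passes unchanged; documents with a degenerate vocabulary or a skewed rank-frequency curve are rejected and regenerated.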
6. A method for the modelling, analysis and generation of text documents, comprising the steps of:
(a) collecting a set of text documents as training documents;
(b) computing the n-gram probabilities of the words contained in the training documents (40);
(c) generating new documents by using said probabilities (41) and by using additional words which are not contained in the training documents, the new documents having the same length distribution as the training documents; and
(d) determining whether the deviations of the word frequency as a function of word rank (Zipf's law) and of the vocabulary growth as a function of the number of terms (Heaps' law) are below user-defined thresholds (42).
Dependent claims: 7, 8, 9, 10
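Steps (b) and (c) of claim 6 can be sketched with a bigram (n = 2) instance of the n-gram model. The sketch is illustrative: the claim does not fix n, the smoothing, or how the out-of-vocabulary words are injected, so the `p_extra` mechanism below is an assumption.

```python
import random
from collections import Counter, defaultdict

def train_bigrams(docs):
    """Estimate P(next word | current word) from tokenized
    training documents (claim 6, step b)."""
    counts = defaultdict(Counter)
    for doc in docs:
        for cur, nxt in zip(doc, doc[1:]):
            counts[cur][nxt] += 1
    return {cur: {w: c / sum(nxt.values()) for w, c in nxt.items()}
            for cur, nxt in counts.items()}

def generate(model, start, length, extra_words=(), p_extra=0.05, rng=random):
    """Sample a new document of a given length; with probability
    p_extra emit a word beyond the training vocabulary (step c)."""
    out = [start]
    for _ in range(length - 1):
        if extra_words and rng.random() < p_extra:
            out.append(rng.choice(list(extra_words)))
            continue
        dist = model.get(out[-1])
        if not dist:  # unseen context: fall back to a random known word
            out.append(rng.choice(list(model)))
            continue
        words, probs = zip(*dist.items())
        out.append(rng.choices(words, weights=probs)[0])
    return out
```

The target length for each generated document would be drawn from the empirical length distribution of the training documents, satisfying the "same distribution of their length" limitation.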
11. A method for the modelling, analysis and generation of text documents, comprising the steps of:
(a) collecting a set of text documents as training documents;
(b) selecting a probabilistic context-free grammar (PCFG) model having a finite set of nonterminal symbols, a finite set of terminal symbols that is disjoint from the set of nonterminal symbols, a finite set R of production rules, and an objective function (60);
(c) applying a modification to the grammar model for changing the terminal and nonterminal symbols of the training documents and the structure elements of the training documents (61);
(d) computing the objective function for the training documents by using various approximations (62), and holding the modification if the objective function has increased (63);
(e) repeating step (c) until the modifications no longer increase the objective function by more than a user-defined threshold (64);
(f) generating new documents by using the modified grammar model and by using additional words beyond the words contained in the training documents, the new documents having the same length distribution as the training documents (65); and
(g) determining whether the deviations of the word frequency as a function of word rank (Zipf's law) and of the vocabulary growth as a function of the number of terms (Heaps' law) are below user-defined thresholds (66).
Dependent claims: 12, 13, 14, 15
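Steps (c) through (e) of claim 11 amount to a stochastic hill climb over grammar modifications: propose a change, keep it only if the objective function increases, and stop once the gain falls below the user-defined threshold. The generic driver below is a sketch under that reading; the grammar representation, the modification operator, and the objective function are placeholders supplied by the caller, not details from the patent.

```python
import random

def hill_climb(model, modify, objective, threshold, max_iters=1000, rng=random):
    """Sketch of claim 11, steps (c)-(e): repeatedly apply a random
    modification, hold it only if the objective increases (step d),
    and stop once the increase drops below the threshold (step e)."""
    best = objective(model)
    for _ in range(max_iters):
        candidate = modify(model, rng)
        score = objective(candidate)
        if score > best:
            if score - best < threshold:  # improvement too small: stop
                return candidate
            model, best = candidate, score
    return model
```

For the PCFG of claim 11, `model` would be the grammar, `modify` a change to its terminal/nonterminal symbols or production rules, and `objective` the approximated objective function of step (d).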
16. A method for the generation of structured text documents, comprising the steps of:
(a) collecting a set of structured text documents as training documents;
(b) selecting a language model for the unstructured text parts and training the language model by using the training documents and the model parameters (22);
(c) describing the document structure of the training documents by using a selected markup language (80);
(d) obtaining a probabilistic deterministic finite automaton (PDFA) having a single state (80);
(e) adding additional states to the probabilistic deterministic finite automaton (PDFA) to match the states occurring in the training documents (81);
(f) calculating the probabilities of the transitions between the states using the corresponding transition frequencies occurring in the training documents (82);
(g) training the language model for each text part identified by the selected markup language (83);
(h) generating the document structure of new documents (84) by applying the probabilistic deterministic finite automaton (PDFA);
(i) generating the text parts of the new documents (84) by using said computed probabilities and by using additional words beyond the words contained in the training documents, the new documents having the same length distribution as the training documents; and
(j) determining whether the deviations of the word frequency as a function of word rank (Zipf's law) and of the vocabulary growth as a function of the number of terms (Heaps' law) are below user-defined thresholds, and accepting only new documents which fulfil this condition (42, 66).
Dependent claims: 17, 18, 19, 20, 29, 30
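Steps (d) through (f) and (h) of claim 16 can be sketched by treating each markup tag as a PDFA state: start from the single initial state, add one state per tag seen in the training documents, and set each transition probability to its relative frequency. The tag-sequence representation and the `START`/`END` sentinel states are modelling assumptions, not details from the patent.

```python
import random
from collections import Counter, defaultdict

def build_pdfa(tag_sequences, start="START", end="END"):
    """Estimate a PDFA over markup-tag sequences: states are added as
    tags are encountered (steps d-e) and transition probabilities are
    the observed relative frequencies (step f)."""
    counts = defaultdict(Counter)
    for seq in tag_sequences:
        for cur, nxt in zip([start] + list(seq), list(seq) + [end]):
            counts[cur][nxt] += 1
    return {state: {nxt: c / sum(outs.values()) for nxt, c in outs.items()}
            for state, outs in counts.items()}

def generate_structure(pdfa, start="START", end="END", rng=random):
    """Walk the PDFA to emit the tag sequence of a new document (step h)."""
    seq, state = [], start
    while True:
        nxts, probs = zip(*pdfa[state].items())
        state = rng.choices(nxts, weights=probs)[0]
        if state == end:
            return seq
        seq.append(state)
```

Each tag emitted by `generate_structure` then names a text part whose content is filled in by the language model trained in step (g).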
21. A method for the generation of structured text documents, comprising the steps of:
obtaining a deterministic finite automaton (90) from a description of the structure of the text documents to be generated;
creating a probabilistic deterministic finite automaton (91) by associating the same probability with all transition functions of the deterministic finite automaton; and
generating new documents (92) by applying said probabilistic deterministic finite automaton (PDFA) firstly to generate the structure of the new documents and secondly to generate an n-gram model or a probabilistic context-free grammar (PCFG) model to be used for generating the text parts of the new documents.
Dependent claims: 23, 24
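The second step of claim 21 has a direct sketch: given a DFA, assign every outgoing transition of each state the same probability. The `{state: {symbol: next_state}}` dictionary encoding of the DFA is an assumption for illustration.

```python
def uniform_pdfa(dfa):
    """Sketch of claim 21: turn a DFA, encoded as
    {state: {symbol: next_state}}, into a PDFA by giving every
    outgoing transition of a state the same probability."""
    return {state: {sym: (nxt, 1.0 / len(trans)) for sym, nxt in trans.items()}
            for state, trans in dfa.items()}
```

This path needs no training corpus: the structure description alone yields the PDFA, which is useful when generating documents from scratch.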
22. An apparatus for the generation of text documents, using a collection of text documents as training documents, comprising:
(a) means (41, 65) for generating new documents by using a language model and its model parameters and by using additional words beyond the words contained in the training documents, the new documents having the same length distribution as the training documents; and
(b) means (42, 66) for determining whether the deviations of the word frequency as a function of word rank (Zipf's law) and of the vocabulary growth as a function of the number of terms (Heaps' law) are below predefined thresholds, and for accepting only new documents which fulfil this condition.
25. An apparatus for the generation of structured text documents, using a set of structured text documents as training documents, comprising:
(a) means (80) for describing the document structure of the training documents by using a selected markup language;
(b) a probabilistic deterministic finite automaton (PDFA) having a single state (80) and means (81) for adding additional states to the probabilistic deterministic finite automaton (PDFA) to match the states occurring in the training documents;
(c) means (82) for calculating the probabilities of the transitions between the states using the corresponding transition frequencies occurring in the training documents;
(d) means (83) for training a language model for each text part identified by the selected markup language;
(e) means (84) for generating the document structure of new documents by using the probabilistic deterministic finite automaton (PDFA);
(f) means (84) for generating the text parts of the new documents by using said language model and its model parameters and by using additional words beyond the words contained in the training documents, the new documents having the same length distribution as the training documents; and
(g) means for determining whether the deviations of the word frequency as a function of word rank (Zipf's law) and of the vocabulary growth as a function of the number of terms (Heaps' law) are below user-defined thresholds, and for accepting only new documents which fulfil this condition.
Dependent claims: 26, 27
28. An apparatus for the generation of structured text documents, comprising:
means (90) for obtaining a deterministic finite automaton from a description of the structure of the text documents to be generated;
means (91) for creating a probabilistic deterministic finite automaton by associating the same probability with all transition functions of the deterministic finite automaton; and
means (92) for generating new documents by applying said probabilistic deterministic finite automaton (PDFA) firstly to generate the structure of the new documents and secondly to generate an n-gram model or a probabilistic context-free grammar (PCFG) to be used for generating the text parts of the new documents.
Specification