Method for generating descriptors for the classification of texts
First Claim
1. A method of generating descriptors for natural language texts, using a plurality of training texts having a plurality of words, comprising the steps of:
extracting words from a text during a training phase on the basis of the training texts;
predetermining a minimum structure of said descriptors;
breaking down words in the text into shorter word segments, wherein each shorter word segment within a longer word segment must meet said minimum structure for said breaking down to be permitted; and
matching said word segments that remain in the text against each other to generate a list of descriptors.
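The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the concrete minimum structure (at least three letters, one of them a vowel), the restriction to two-part splits, and all function names are assumptions chosen for illustration.

```python
import re

MIN_LEN = 3          # assumed minimum descriptor length
VOWELS = "aeiouyäöü"  # assumed vowel set for the minimum structure


def meets_minimum_structure(segment):
    """Assumed minimum structure: only letters, at least MIN_LEN long, one vowel."""
    return (len(segment) >= MIN_LEN and segment.isalpha()
            and any(c in VOWELS for c in segment))


def extract_words(texts):
    """Training phase: collect the lower-cased word forms of the training texts."""
    words = set()
    for text in texts:
        words.update(re.findall(r"[^\W\d_]+", text.lower()))
    return words


def break_down(word, vocabulary):
    """Split a word in two if both segments occur as word forms in the
    vocabulary and meet the minimum structure; otherwise keep it whole."""
    for cut in range(MIN_LEN, len(word) - MIN_LEN + 1):
        head, tail = word[:cut], word[cut:]
        if (head in vocabulary and tail in vocabulary
                and meets_minimum_structure(head)
                and meets_minimum_structure(tail)):
            return [head, tail]
    return [word]


def generate_descriptors(texts):
    """Match the word forms against each other and return the descriptor list."""
    vocabulary = {w for w in extract_words(texts) if meets_minimum_structure(w)}
    descriptors = set()
    for word in vocabulary:
        descriptors.update(break_down(word, vocabulary))
    return sorted(descriptors)


training_texts = ["The bookshop sells a book.", "Every shop has a shelf."]
print(generate_descriptors(training_texts))
```

In this sketch "bookshop" is replaced by "book" and "shop" because both segments occur elsewhere in the training texts, while "a" is discarded for failing the assumed minimum structure. No stop word filtering is applied, so "the" and "has" remain in the list.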
Abstract
The proposed method for generating descriptors for the classification of texts breaks down more complex word forms by matching them against the entirety of word forms occurring in a collection of training texts. Neither the breakdown, which is preferably continued cyclically, nor the accompanying compilation of stop word, prefix, and suffix lists requires a morphological or linguistic knowledge base. Simple morphological knowledge is supplied only by prescribing minimum requirements for the form of descriptors and text sections. The method is flexible and can be easily adapted to new applications. It is also very error-tolerant and thus particularly suited to classifying digitized texts produced from written texts by character recognition or from spoken texts by speech recognition.
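One possible reading of the cyclically continued breakdown is a fixed-point loop: word forms are split against the other known forms until no further split is permitted. The Python sketch below is an illustration under assumptions, not the procedure of the specification. The minimum requirement (at least three letters), the head-only matching, and the function names are hypothetical, and the compilation of stop word, prefix, and suffix lists is not modeled.

```python
def meets_minimum_structure(segment, min_len=3):
    """Assumed minimum requirement: only letters, at least min_len long."""
    return len(segment) >= min_len and segment.isalpha()


def split_once(word, known):
    """Split off a leading segment that is a known word form, provided both
    the segment and the remainder meet the minimum requirement."""
    for cut in range(1, len(word)):
        head, tail = word[:cut], word[cut:]
        if (head in known and meets_minimum_structure(head)
                and meets_minimum_structure(tail)):
            return head, tail
    return None


def cyclic_breakdown(word_forms):
    """Repeat the breakdown until no word form can be split any further."""
    segments = set(word_forms)
    changed = True
    while changed:
        changed = False
        # Try the longest forms first; restart after every successful split
        # so that newly created segments take part in the next cycle.
        for word in sorted(segments, key=len, reverse=True):
            parts = split_once(word, segments - {word})
            if parts:
                segments.remove(word)
                segments.update(parts)
                changed = True
                break
    return sorted(segments)


print(cyclic_breakdown(["book", "bookshop", "bookshopkeeper"]))
```

Each split replaces a word form with strictly shorter segments, so the loop terminates. In the example, "bookshopkeeper" is reduced over several cycles to "book", "shop", and "keeper", whereas a pair such as "book" and "books" is left untouched because the remainder "s" fails the assumed minimum requirement.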
8 Claims
Claim 1, reproduced above under First Claim, is the independent claim; claims 2 to 8 are dependent claims.
Specification