Cross-language text classification
Abstract
Methods are described for performing classification (categorization) of text documents written in various languages. Language-independent semantic structures are constructed before classifying documents. These structures reflect lexical, morphological, syntactic, and semantic properties of documents. The methods suggested are able to perform cross-language text classification which is based on document properties reflecting their meaning. The methods are applicable to genre classification, topic detection, news analysis, authorship analysis, etc.
21 Claims
1. A method of performing text classification based on language-independent text features, the method comprising:
performing, by a processor, a first syntactic and semantic analysis of a training natural language text to produce a first plurality of language-independent semantic structures representing a plurality of sentences of the training natural language text;
producing, based on the first plurality of language-independent semantic structures, a text classifier model;
performing a second syntactic and semantic analysis of an input natural language text to produce a second plurality of language-independent semantic structures representing a plurality of sentences of the input natural language text;
extracting, using the second plurality of language-independent semantic structures, a set of features, wherein at least one feature references a semantic class of a language-independent semantic hierarchy comprising a plurality of semantic classes, in which the semantic class exhibits one or more properties inherited from its parent semantic class;
applying the text classifier model to the set of features to produce a classification spectrum comprising a plurality of weight values, wherein each weight value reflects a degree of association of the input natural language text with a particular category of natural language texts; and
associating the input natural language text with one or more categories using the classification spectrum.
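The last two steps of claim 1 (applying the model to obtain a classification spectrum, then associating the text with categories) can be illustrated with a minimal sketch. The claim does not specify a model type, so everything here is an assumption made for illustration: the linear scoring table, the non-negative weights, the normalization to a sum of 1, the threshold value, and the names `classification_spectrum` and `associate`.

```python
from typing import Dict, List

# Hypothetical model layout: category -> {feature -> weight}.
# The claim leaves the model type open; a linear table with
# non-negative weights is assumed here purely for illustration.
Model = Dict[str, Dict[str, float]]


def classification_spectrum(model: Model, features: Dict[str, float]) -> Dict[str, float]:
    """Return one weight per category, normalized so the weights sum to 1."""
    raw = {
        category: sum(weights.get(f, 0.0) * value for f, value in features.items())
        for category, weights in model.items()
    }
    total = sum(raw.values())
    if total == 0:
        # No feature matched any category: every weight is zero.
        return {category: 0.0 for category in raw}
    return {category: score / total for category, score in raw.items()}


def associate(spectrum: Dict[str, float], threshold: float = 0.3) -> List[str]:
    """Select every category whose weight meets the (assumed) threshold."""
    return sorted(c for c, w in spectrum.items() if w >= threshold)


# Toy model and features; the "class:..." keys stand for semantic classes.
model: Model = {
    "sports": {"class:COMPETITION": 2.0, "class:ATHLETE": 1.0},
    "finance": {"class:MONEY": 2.0, "class:COMPETITION": 1.0},
}
features = {"class:COMPETITION": 1.0, "class:ATHLETE": 1.0}

spectrum = classification_spectrum(model, features)
categories = associate(spectrum)
print(spectrum, categories)
```

Because the spectrum keeps a weight for every category, a text can be associated with several categories at once when more than one weight clears the threshold.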
8. A non-transitory computer readable storage medium comprising executable instructions for causing a computing system to perform operations comprising:
performing a first syntactic and semantic analysis of a training natural language text to produce a first plurality of language-independent semantic structures representing a plurality of sentences of the training natural language text;
producing, based on the first plurality of language-independent semantic structures, a text classifier model;
performing a second syntactic and semantic analysis of an input natural language text to produce a second plurality of language-independent semantic structures representing a plurality of sentences of the input natural language text;
extracting, using the second plurality of language-independent semantic structures, a set of features, wherein at least one feature references a semantic class of a language-independent semantic hierarchy comprising a plurality of semantic classes, in which the semantic class exhibits one or more properties inherited from its parent semantic class;
applying the text classifier model to the set of features to produce a classification spectrum comprising a plurality of weight values, wherein each weight value references a degree of association of the input natural language text with a particular category of natural language texts; and
associating the input natural language text with one or more categories using the classification spectrum.
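The feature-extraction step refers to a semantic hierarchy in which a class inherits properties from its parent. A rough sketch of such a hierarchy follows. The class names, the properties, and the tiny per-language lexicon are all hypothetical; the lexicon lookup is only a stand-in for the syntactic and semantic analysis recited in the claims, which is far beyond a few lines of code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class SemanticClass:
    name: str
    parent: Optional["SemanticClass"] = None
    own_properties: Set[str] = field(default_factory=set)

    def properties(self) -> Set[str]:
        """Own properties plus everything inherited from ancestors."""
        inherited = self.parent.properties() if self.parent else set()
        return inherited | self.own_properties

    def ancestors(self) -> List[str]:
        """Class names from this class up to the root of the hierarchy."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


# Hypothetical three-level hierarchy.
ENTITY = SemanticClass("ENTITY", None, {"countable"})
ANIMAL = SemanticClass("ANIMAL", ENTITY, {"animate"})
DOG = SemanticClass("DOG", ANIMAL, {"domestic"})

# Toy lexicon keyed by (language, word). ASSUMPTION: this replaces the
# real analysis that builds language-independent semantic structures.
LEXICON: Dict[Tuple[str, str], SemanticClass] = {
    ("en", "dog"): DOG,
    ("de", "hund"): DOG,
    ("fr", "chien"): DOG,
}


def extract_features(tokens: List[str], lang: str) -> Dict[str, int]:
    """Count each matched semantic class together with all its ancestors."""
    features: Dict[str, int] = {}
    for token in tokens:
        cls = LEXICON.get((lang, token.lower()))
        if cls is None:
            continue  # word not covered by the toy lexicon
        for name in cls.ancestors():
            key = f"class:{name}"
            features[key] = features.get(key, 0) + 1
    return features


english = extract_features(["The", "dog", "barks"], "en")
german = extract_features(["Der", "Hund", "bellt"], "de")
print(english == german, sorted(DOG.properties()))
```

In this sketch the English and German sentences yield identical feature sets because both words resolve to the same semantic class, which is the sense in which the features are language-independent.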
15. A computer system adapted to perform text classification based on language-independent text features, the computer system comprising:
a feature extractor adapted to perform operations comprising:
performing a first syntactic and semantic analysis of a training natural language text to produce a first plurality of language-independent semantic structures representing a plurality of sentences of the training natural language text;
producing, based on the first plurality of language-independent semantic structures, a text classifier model;
performing a second syntactic and semantic analysis of an input natural language text to produce a second plurality of language-independent semantic structures representing a plurality of sentences of the input natural language text;
extracting, using the second plurality of language-independent semantic structures, a set of features, wherein at least one feature references a semantic class of a language-independent semantic hierarchy comprising a plurality of semantic classes, in which the semantic class exhibits one or more properties inherited from its parent semantic class; and
a text classifier adapted to perform operations comprising:
applying the text classifier model to the set of features to generate a classification spectrum comprising a plurality of weight values, wherein each weight value references a degree of association of the input natural language text with a particular category of natural language texts; and
associating the input natural language text with one or more categories using the classification spectrum.
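Claim 15 splits the work between two components: a feature extractor that also produces the model, and a text classifier that applies it. The sketch below mirrors that split under several assumptions: each text is assumed to arrive already reduced to a list of semantic class names, the model is a simple per-category feature count, and the overlap-based scoring and the 0.5 threshold are illustrative choices, not anything stated in the claim. The class and method names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


class FeatureExtractor:
    """Stand-in for the claimed feature extractor.

    ASSUMPTION: syntactic and semantic analysis has already happened, so a
    'text' is just an iterable of semantic class names.
    """

    def extract(self, semantic_classes: Iterable[str]) -> Counter:
        return Counter(f"class:{name}" for name in semantic_classes)

    def train(self, labeled: List[Tuple[List[str], str]]) -> Dict[str, Counter]:
        """Produce a model: per-category feature counts over training texts."""
        model: Dict[str, Counter] = defaultdict(Counter)
        for semantic_classes, category in labeled:
            model[category].update(self.extract(semantic_classes))
        return dict(model)


class TextClassifier:
    """Stand-in for the claimed text classifier."""

    def __init__(self, model: Dict[str, Counter]) -> None:
        self.model = model

    def spectrum(self, features: Counter) -> Dict[str, float]:
        """Weight per category: normalized overlap with the category counts."""
        raw = {
            category: sum(min(counts[f], n) for f, n in features.items())
            for category, counts in self.model.items()
        }
        total = sum(raw.values()) or 1  # avoid division by zero
        return {category: score / total for category, score in raw.items()}

    def categories(self, features: Counter, threshold: float = 0.5) -> List[str]:
        spectrum = self.spectrum(features)
        return [c for c, w in sorted(spectrum.items()) if w >= threshold]


extractor = FeatureExtractor()
model = extractor.train([
    (["ATHLETE", "COMPETITION"], "sports"),
    (["MONEY", "BANK"], "finance"),
])
classifier = TextClassifier(model)
features = extractor.extract(["ATHLETE", "COMPETITION", "MONEY"])
result = classifier.categories(features)
print(classifier.spectrum(features), result)
```

Keeping model production inside the extractor follows the grouping of operations in the claim; a different implementation could just as well place training in a separate component.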
Specification