CROSS-LANGUAGE TEXT CLASSIFICATION
First Claim
1. A method for a computer to analyze, across languages, a text written in one or more natural languages, the method comprising:
performing an analysis of a sentence of the text, wherein the analysis includes performing steps including;
performing lexical-morphological analysis of the sentence of the text;
performing a syntactical analysis of the sentence of the text; and
performing a semantic analysis of the sentence of the text;
generating a set of features, where at least one feature is based on the results of the said analysis; and
performing a text classification based on said set of features, wherein the text classification includes assigning the text to one or more categories.
Abstract
Methods are described for performing classification (categorization) of text documents written in various languages. Language-independent semantic structures are constructed before classifying documents. These structures reflect lexical, morphological, syntactic, and semantic properties of documents. The methods suggested are able to perform cross-language text classification which is based on document properties reflecting their meaning. The methods are applicable to genre classification, topic detection, news analysis, authorship analysis, etc.
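The flow described in the abstract (analyze each sentence, build language-independent features, then classify) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the toy lexicon, the semantic class names, the verb-to-noun linking used as a stand-in for syntactic parsing, and the cosine-similarity centroid classifier with its 0.5 threshold. It is meant only to show why features built from shared semantic classes let a classifier trained on one language categorize text in another.

```python
import math
import re
from collections import Counter

# Hypothetical toy lexicon: (language, word form) -> (lemma, part of speech, semantic class).
# The semantic class names are invented here and are shared across languages.
LEXICON = {
    ("en", "bank"): ("bank", "NOUN", "FINANCIAL_ORG"),
    ("en", "raised"): ("raise", "VERB", "INCREASE"),
    ("en", "rates"): ("rate", "NOUN", "INTEREST_RATE"),
    ("en", "team"): ("team", "NOUN", "SPORTS_TEAM"),
    ("en", "won"): ("win", "VERB", "WIN"),
    ("en", "match"): ("match", "NOUN", "SPORTS_EVENT"),
    ("es", "banco"): ("banco", "NOUN", "FINANCIAL_ORG"),
    ("es", "subió"): ("subir", "VERB", "INCREASE"),
    ("es", "tasas"): ("tasa", "NOUN", "INTEREST_RATE"),
    ("es", "equipo"): ("equipo", "NOUN", "SPORTS_TEAM"),
    ("es", "ganó"): ("ganar", "VERB", "WIN"),
    ("es", "partido"): ("partido", "NOUN", "SPORTS_EVENT"),
}


def lexical_morphological_analysis(sentence, lang):
    """Tokenize and look up each known word form; unknown words are skipped."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [LEXICON[(lang, t)] for t in tokens if (lang, t) in LEXICON]


def syntactic_analysis(analyses):
    """Toy stand-in for a parser: link every verb to every noun in the sentence."""
    verbs = [a for a in analyses if a[1] == "VERB"]
    nouns = [a for a in analyses if a[1] == "NOUN"]
    return [(v, n) for v in verbs for n in nouns]


def semantic_analysis(analyses, links):
    """Replace language-specific lemmas with language-independent semantic classes."""
    classes = [sem for _, _, sem in analyses]
    relations = [(v[2], n[2]) for v, n in links]
    return classes, relations


def extract_features(sentence, lang):
    """Run the three analyses and count the resulting language-independent features."""
    analyses = lexical_morphological_analysis(sentence, lang)
    links = syntactic_analysis(analyses)
    classes, relations = semantic_analysis(analyses, links)
    features = Counter("class:" + c for c in classes)
    features.update("rel:" + v + ">" + n for v, n in relations)
    return features


def cosine(a, b):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


def train(examples):
    """Sum the feature counts of the labeled sentences into one centroid per category."""
    centroids = {}
    for sentence, lang, category in examples:
        centroids.setdefault(category, Counter()).update(extract_features(sentence, lang))
    return centroids


def classify(sentence, lang, centroids, threshold=0.5):
    """Assign the sentence to every category whose centroid is similar enough."""
    feats = extract_features(sentence, lang)
    return sorted(c for c, cen in centroids.items() if cosine(feats, cen) >= threshold)


# Train on English only, then classify Spanish sentences.
centroids = train([
    ("The bank raised rates", "en", "finance"),
    ("The team won the match", "en", "sports"),
])
print(classify("El banco subió las tasas", "es", centroids))   # ['finance']
print(classify("El equipo ganó el partido", "es", centroids))  # ['sports']
```

Because the features are semantic classes and class-to-class relations rather than surface words, the English and Spanish sentences in this toy example yield identical feature sets, which is what allows the cross-language assignment. A sentence with no known words produces no features and is assigned to no category.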
40 Citations
37 Claims
1. A method for a computer to analyze, across languages, a text written in one or more natural languages, the method comprising:
performing an analysis of a sentence of the text, wherein the analysis includes performing steps including;
performing lexical-morphological analysis of the sentence of the text;
performing a syntactical analysis of the sentence of the text; and
performing a semantic analysis of the sentence of the text;
generating a set of features, where at least one feature is based on the results of the said analysis; and
performing a text classification based on said set of features, wherein the text classification includes assigning the text to one or more categories.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A non-transitory computer readable medium comprising instructions for causing a computing system to carry out steps comprising:
performing a feature extraction on the text, wherein the feature extraction includes performing steps including;
defining a lexical-morphological feature of a sentence of the text;
defining a syntactical feature of the sentence of the text;
defining a semantic feature of the sentence of the text; and
generating a set of features for the sentence of the text related to the lexical-morphological feature, the syntactical feature and the semantic feature;
performing a text classification based on said set of features, wherein the text classification includes assigning the text to one or more categories.
Dependent claims: 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
26. A computer system adapted to assign to a language-independent category a source sentence in a source language, the computer system comprising:
a feature extractor adapted to perform steps including;
defining a lexical-morphological feature of the sentence of the text;
defining a syntactical feature of the sentence of the text;
defining a semantic feature of the sentence of the text; and
generating a set of features for the sentence of the text related to the lexical-morphological feature, the syntactical feature and the semantical feature; and
a text classifier adapted to perform steps including;
classify text based on said set of features, wherein the text classification includes assigning the text to one or more categories.
Dependent claims: 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37
Specification