Cross-language text clustering
First Claim
1. A method for a computing device to analyze, across languages, a set of texts in one or more natural languages, the method comprising for each text:
- electronically analyzing the text, wherein electronically analyzing the text comprises:
  - performing a syntactic analysis of at least one sentence of the text, the syntactic analysis comprising a rough syntactic analysis to generate a graph of generalized constituents representing all possible variants of parsing the at least one sentence of the text syntactically, the syntactic analysis further comprising a precise syntactic analysis to generate at least one syntactic tree from the graph of generalized constituents, and selecting a preferred one of the at least one syntactic tree; and
  - creating a language-independent semantic structure (LISS) by performing a semantic analysis of the preferred one of the at least one syntactic tree, wherein the LISS comprises an acyclic graph where each word in the sentence is represented by a corresponding one of a plurality of semantic classes, and wherein each of the semantic classes is a universal language-independent semantic notion of a respective word;
- generating a set of features for the text based at least in part on the LISS;
- creating at least one index for the text, wherein each value in the index relates to a corresponding one of the set of features and comprises a list of at least one of numbers or addresses of occurrences of the corresponding feature in the text; and
- performing text clustering based on said set of features, wherein performing the text clustering comprises assigning the text to one or more clusters.
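The last three steps of the claim (features, index, clustering) can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the syntactic and semantic analysis has already run and reduced each text to a sequence of semantic class labels. The labels (`ANIMAL`, `TO_EAT`, ...), the text names, the helper functions, the cosine measure, and the greedy threshold clustering are all hypothetical choices made for illustration; the claim does not prescribe any of them. The sketch also places each text in exactly one cluster, whereas the claim allows one or more.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical stand-in for the LISS step: each text is already reduced to a
# sequence of language-independent semantic class labels (invented names).
TEXTS = {
    "en_1": ["ANIMAL", "TO_EAT", "FOOD"],
    "ru_1": ["ANIMAL", "TO_EAT", "FOOD", "TIME"],
    "en_2": ["MONEY", "TO_BUY", "VEHICLE"],
}


def build_index(classes):
    """Map each feature (semantic class) to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, semantic_class in enumerate(classes):
        index[semantic_class].append(position)
    return dict(index)


def feature_vector(index):
    """Turn an index into a feature -> occurrence-count vector."""
    return {feature: len(positions) for feature, positions in index.items()}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(texts, threshold=0.5):
    """Greedy single-pass clustering: a text joins the first cluster whose
    first member (the seed) is at least `threshold` similar, else starts a new one."""
    vectors = {name: feature_vector(build_index(c)) for name, c in texts.items()}
    clusters = []
    for name, vec in vectors.items():
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    print(build_index(TEXTS["ru_1"]))
    print(cluster(TEXTS))
```

Because the features are semantic classes and not surface words, the hypothetical English and Russian texts about the same event share a vector space and can land in the same cluster, which is the point of building the LISS before clustering.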
Abstract
Methods are described for clustering or classifying texts in different languages. Language-independent semantic structures (LISS) are constructed before clustering is performed. These structures reflect lexical, morphological, syntactic, and semantic properties of the texts. The methods perform cross-language text clustering based on the meaning derived from the texts. Applications include genre classification, topic detection, news analysis, authorship analysis, internet searches, and creating corpora for other tasks.
23 Claims
Specification