Cross-language text clustering

US 9,495,358 B2
Filed: 10/10/2012
Issued: 11/15/2016
Est. Priority Date: 10/10/2006
Status: Expired due to Fees

Abstract

Methods are described for performing clustering or classification of texts of different languages. Language-independent semantic structures (LISS) are constructed before clustering is performed. These structures reflect lexical, morphological, syntactic, and semantic properties of texts. The methods suggested are able to perform cross-language text clustering which is based on the meaning derived from texts. The methods are applicable to genre classification, topic detection, news analysis, authorship analysis, internet searches, and creating corpora for other tasks, etc.
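The idea of clustering by meaning rather than by surface words can be illustrated with a minimal sketch. It is not the patented method: the syntactic and semantic analysis is replaced here by a tiny hypothetical lexicon (`TOY_LEXICON`) that maps words of two languages to invented semantic class names, and the clustering step is a simple greedy pass over cosine similarities. All names, including the threshold value, are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical stand-in for the semantic analysis: maps surface words in
# two languages to made-up language-independent semantic class names.
TOY_LEXICON = {
    "dog": "DOG", "собака": "DOG",
    "barks": "TO_BARK", "лает": "TO_BARK",
    "bank": "BANK", "банк": "BANK",
    "loan": "LOAN", "кредит": "LOAN",
}

def semantic_features(text):
    """Count semantic classes for the words the toy lexicon knows."""
    words = text.lower().split()
    return Counter(TOY_LEXICON[w] for w in words if w in TOY_LEXICON)

def cosine(a, b):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(texts, threshold=0.5):
    """Greedy single pass: a text joins the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_features, member_texts)
    for text in texts:
        feats = semantic_features(text)
        for seed, members in clusters:
            if cosine(seed, feats) >= threshold:
                members.append(text)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append((feats, [text]))
    return [members for _, members in clusters]
```

Because the features are semantic classes and not words, an English text and a Russian text with the same meaning produce the same feature vector in this toy setup and end up in the same cluster.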

23 Claims

1. A method for a computing device to analyze, across languages, a set of texts in one or more natural languages, the method comprising for each text:
- electronically analyzing the text, wherein electronically analyzing the text comprises;
  - performing a syntactic analysis of at least one sentence of the text, the syntactic analysis comprising a rough syntactic analysis to generate a graph of generalized constituents representing all possible variants of parsing the at least one sentence of the text syntactically, the syntactic analysis further comprising a precise syntactic analysis to generate at least one syntactic tree from the graph of generalized constituents, and selecting a preferred one of the at least one syntactic tree; and
  - creating a language-independent semantic structure (LISS) by performing a semantic analysis of the preferred one of the at least one syntactic tree, wherein the LISS comprises an acyclic graph where each word in the sentence is represented by a corresponding one of a plurality of semantic classes, and wherein each of the semantic classes is a universal language-independent semantic notion of a respective word;
- generating a set of features for the text based at least in part on the LISS;
- creating at least one index for the text, wherein each value in the index relates to a corresponding one of the set of features and comprises a list of at least one of numbers or addresses of occurrences of the corresponding feature in the text; and
- performing text clustering based on said set of features, wherein performing the text clustering comprises assigning the text to one or more clusters.
- Dependent claims (2 to 23):
  - 2. The method of claim 1, wherein said analyzing comprises resolving lexical ambiguities.
  - 3. The method of claim 1, wherein said analyzing comprises resolving anaphoras.
  - 4. The method of claim 1, wherein said set of features comprises lexical features.
  - 5. The method of claim 1, wherein said set of features comprises syntactic features.
  - 6. The method of claim 1, wherein said set of features comprises grammatical features.
  - 7. The method of claim 1, wherein said set of features comprises semantic features.
  - 8. The method of claim 1, wherein the at least one index is for morphological, syntactic, lexical and semantic features, the at least one index being represented as a table.
  - 9. The method of claim 1, wherein said clustering uses a similarity measure, wherein said similarity measure is based on a result of said semantic analysis.
  - 10. The method of claim 9, wherein said similarity measure depends on distances between semantic classes in a semantic hierarchy.
  - 11. The method of claim 10, wherein said similarity measure depends on a frequency of words related to a common ancestor of said semantic classes in said semantic hierarchy.
  - 12. The method of claim 1, wherein analyzing the sentence of the text further comprises generating a statistic for at least one grammatical feature of the sentence of the text.
  - 13. The method of claim 1, wherein analyzing the sentence of the text further comprises generating a statistic for at least one lexical feature of the sentence of the text.
  - 14. The method of claim 1, wherein analyzing the sentence of the text further comprises generating a statistic for at least one syntactic feature of the sentence of the text.
  - 15. The method of claim 1, wherein analyzing the sentence of the text further comprises generating a statistic for at least one semantic feature of the sentence of the text.
  - 16. The method of claim 1, wherein analyzing the sentence of the text further comprises generating a statistic for at least one language independent semantic structure (LISS) of the sentence of the text.
  - 17. The method of claim 1, wherein analyzing the sentence of the text further comprises generating a statistic for at least one semantic class of a semantic hierarchy related to the sentence of the text.
  - 18. The method of claim 1, wherein the set of features for each text comprises generating a statistic of at least one extracted feature.
  - 19. The method of claim 1, wherein the method further comprises making one or more of the clusters accessible to another computing device.
  - 20. The method of claim 19, wherein clusters are located across a plurality of computing devices.
  - 21. The method of claim 1, wherein members of the set of texts are located on a plurality of computing devices, wherein the plurality of computing devices are accessible through one or more network protocols.
  - 22. The method of claim 17, wherein clusters comprise texts of different languages.
  - 23. The method of claim 17, wherein clusters comprise texts of mixed languages.
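Two of the claimed elements lend themselves to a short sketch: the index whose values list the occurrences of each feature (claim 1), and a similarity measure that depends on distances between semantic classes in a semantic hierarchy (claim 10). The code below is a hedged illustration only. The hierarchy `PARENT`, the class names, and the `1 / (1 + distance)` formula are assumptions chosen for the example, not details taken from the patent, and the frequency weighting of claim 11 is not modeled.

```python
from collections import defaultdict

# Hypothetical semantic hierarchy given as child -> parent links.
PARENT = {
    "DOG": "ANIMAL", "CAT": "ANIMAL",
    "ANIMAL": "ENTITY", "CAR": "VEHICLE", "VEHICLE": "ENTITY",
}

def build_index(features):
    """Map each feature to the list of positions where it occurs in the text."""
    index = defaultdict(list)
    for position, feature in enumerate(features):
        index[feature].append(position)
    return dict(index)

def ancestors(cls):
    """Return the chain [cls, parent, grandparent, ...] up to the root."""
    chain = [cls]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def hierarchy_distance(a, b):
    """Count edges on the path between two classes via their nearest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("classes share no common ancestor")

def class_similarity(a, b):
    """Turn distance into a similarity in (0, 1]; identical classes score 1.0."""
    return 1.0 / (1.0 + hierarchy_distance(a, b))
```

In this toy hierarchy, `DOG` and `CAT` are two edges apart through `ANIMAL`, while `DOG` and `CAR` only meet at `ENTITY` and are four edges apart, so the first pair scores higher. A text-level similarity could aggregate such pairwise scores over the features found through the index, but that aggregation is left open here.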

Current Assignee: ABBYY Production LLC (ABBYY Software)
Original Assignee: ABBYY InfoPoisk LLC
Inventors: Zuev, Konstantin; Danielyan, Tatiana
Primary Examiner(s): Neway, Samuel G
Application Number: US13/648,527
Publication Number: US 20130041652A1
Time in Patent Office: 1,497 Days
Field of Search: 704/9
US Class Current: 1/1
CPC Class Codes:
- G06F 40/211   Syntactic parsing, e.g. bas...
- G06F 40/268   Morphological analysis
- G06F 40/284   Lexical analysis, e.g. toke...
- G06F 40/30   Semantic analysis
- G06F 40/55   Rule-based translation
