Method and system for clustering using generalized sentence patterns
Abstract
A method and system for clustering documents based on generalized sentence patterns of the topics of the documents is provided. A generalized sentence patterns (“GSP”) system identifies a “sentence” that describes the topic of a document. To cluster documents, the GSP system generates a “generalized sentence” form of the sentence that describes the topic of each document. The generalized sentence is an abstraction of the words of the sentence. The GSP system identifies clusters of documents based on the patterns of their generalized sentences. The GSP system clusters documents when the generalized sentence representations of their topics have a similar pattern.
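As a rough illustration of the idea in the abstract, the following Python sketch generalizes short topic sentences and groups them by shared patterns. It rests on several assumptions that do not come from the source: the `TOY_POS` lexicon is a hypothetical stand-in for a real part-of-speech tagger, a generalized sentence is modeled as the set of a sentence's words together with their parts of speech, and a pattern is taken to be any pair of such items that occurs in at least `min_support` sentences. The GSP system's actual pattern definition and similarity measure may differ.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
TOY_POS = {
    "meeting": "NOUN", "report": "NOUN", "budget": "NOUN",
    "schedule": "VERB", "review": "VERB", "send": "VERB",
    "weekly": "ADJ", "annual": "ADJ",
}


def generalize(sentence):
    """Return a generalized sentence: each known word plus its part of speech.

    Assumption: the generalized form is an unordered set of items.
    """
    items = set()
    for word in sentence.lower().split():
        pos = TOY_POS.get(word)
        if pos is not None:
            items.add(word)
            items.add(pos)
    return frozenset(items)


def frequent_patterns(generalized, size=2, min_support=2):
    """Keep item combinations of the given size seen in >= min_support sentences."""
    counts = Counter()
    for g in generalized:
        for combo in combinations(sorted(g), size):
            counts[frozenset(combo)] += 1
    return {p for p, c in counts.items() if c >= min_support}


def group_by_pattern(sentences, patterns):
    """Map each pattern to the sentences whose generalized form contains it."""
    groups = {p: [] for p in patterns}
    for s in sentences:
        g = generalize(s)
        for p in patterns:
            if p <= g:  # the pattern is a subset of the generalized sentence
                groups[p].append(s)
    return groups


# Made-up topic sentences (for example, message subjects).
subjects = [
    "schedule weekly meeting",
    "review weekly meeting",
    "send annual report",
    "review annual budget",
]
gen = [generalize(s) for s in subjects]
patterns = frequent_patterns(gen)
groups = group_by_pattern(subjects, patterns)
```

In this toy data the pair {weekly, meeting} occurs in two subjects, so those two subjects land in the same group, while {send, report} occurs only once and is not kept as a pattern.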
13 Claims
1. A method in a computer system with a processor and memory for identifying clusters of documents, the method comprising:
providing sentences having words, each sentence representing a topic of a document;
for each sentence representing the topic of a document, identifying a generalized sentence for the sentence, the generalized sentence representing a generalization of words of the sentence, a generalization including a part of speech of a word;
identifying by the processor generalized sentence patterns for the identified generalized sentences, each generalized sentence pattern representing a pattern of generalizations of the generalized sentences;
grouping the identified generalized sentence patterns into groups of generalized sentence patterns based on similarity of the generalized sentence patterns;
selecting identified generalized sentence patterns to guide the identification of clusters wherein the groups of generalized sentence patterns are used to guide the identification of clusters; and
applying a cluster identification algorithm to identify clusters of documents using the selected generalized sentence patterns to guide the identification such that documents whose generalized sentences are similar to the same generalized sentence pattern are identified as being in the same cluster wherein similarity of generalized sentence patterns is defined as;
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method in a computer system with a processor and memory for identifying clusters of documents, the method comprising:
identifying by the processor generalized sentence patterns for sentences, each sentence representing a document;
grouping the identified generalized sentence patterns into groups of generalized sentence patterns based on similarity of the generalized sentence patterns;
selecting identified generalized sentence patterns to guide the identification of clusters wherein the groups of generalized sentence patterns are used to guide the identification of clusters; and
applying a cluster identification algorithm to identify clusters using the selected generalized sentence patterns to guide the identification wherein the cluster identification algorithm is a constraint-based k-means algorithm and wherein similarity of generalized sentence patterns is defined as;
Dependent claims: 9, 10
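One way to picture a pattern-guided, constraint-based k-means step is sketched below in Python. This is a seeded k-means variant chosen for illustration, not necessarily the algorithm claimed. It assumes that documents matching a selected pattern group arrive with a fixed cluster label (`seed_labels`, a hypothetical input), that every cluster has at least one such seeded document, and that only the remaining documents are reassigned to the nearest centroid. The document vectors are made-up two-dimensional points.

```python
def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def constrained_kmeans(vectors, seed_labels, iterations=10):
    """Seeded k-means: documents listed in seed_labels never change cluster.

    seed_labels maps a document index to a cluster id in 0..k-1; each
    cluster id is assumed to have at least one seeded document.
    """
    k = max(seed_labels.values()) + 1
    # Initialize each centroid from the documents seeded into that cluster.
    centroids = [
        mean([vectors[i] for i, c in seed_labels.items() if c == cluster])
        for cluster in range(k)
    ]
    labels = [None] * len(vectors)
    for _ in range(iterations):
        new_labels = []
        for i, v in enumerate(vectors):
            if i in seed_labels:
                # Constraint: keep the pattern-derived assignment.
                new_labels.append(seed_labels[i])
            else:
                new_labels.append(
                    min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                )
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        centroids = [
            mean([v for v, lab in zip(vectors, labels) if lab == c])
            for c in range(k)
        ]
    return labels, centroids


# Made-up document vectors; the first four matched a pattern group.
vectors = [[0, 0], [0, 1], [5, 5], [6, 5], [1, 0], [5, 6]]
seed_labels = {0: 0, 1: 0, 2: 1, 3: 1}
labels, centroids = constrained_kmeans(vectors, seed_labels)
```

The two unseeded documents join the cluster whose seeded documents lie closest, while the seeded documents stay where the patterns placed them.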
11. A method in a computer system with a processor and memory for identifying clusters of documents, the method comprising:
identifying by the processor generalized sentence patterns for sentences, each sentence representing a document;
grouping the identified generalized sentence patterns into groups of generalized sentence patterns based on similarity of the generalized sentence patterns;
selecting identified generalized sentence patterns to guide the identification of clusters wherein the groups of generalized sentence patterns are used to guide the identification of clusters; and
applying a cluster identification algorithm to identify clusters using the selected generalized sentence patterns to guide the identification wherein the cluster identification algorithm is a conditional expectation maximization algorithm and wherein similarity of generalized sentence patterns is defined as;
Dependent claims: 12, 13
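The claim names a conditional expectation maximization algorithm without describing it at this point, so the Python sketch below is only one possible reading and should not be taken as the claimed method. It runs ordinary expectation maximization over unit-variance spherical Gaussians with equal priors, and clamps the responsibilities of documents that matched a selected pattern group to their seeded cluster. The names `em_with_clamped_seeds` and `seed_labels`, the Gaussian model, and the data are all assumptions made for illustration.

```python
import math


def em_with_clamped_seeds(vectors, seed_labels, k, iterations=20):
    """Soft clustering in which seeded documents keep a one-hot responsibility.

    Assumes unit-variance spherical Gaussians, equal cluster priors, and at
    least one seeded document per cluster id in 0..k-1.
    """
    dim = len(vectors[0])
    # Start each mean at the average of the documents seeded into that cluster.
    means = []
    for c in range(k):
        members = [vectors[i] for i, s in seed_labels.items() if s == c]
        means.append([sum(col) / len(members) for col in zip(*members)])

    resp = []
    for _ in range(iterations):
        # E-step: responsibilities, clamped for seeded documents.
        resp = []
        for i, v in enumerate(vectors):
            if i in seed_labels:
                resp.append([1.0 if c == seed_labels[i] else 0.0 for c in range(k)])
                continue
            log_w = [
                -0.5 * sum((x - m) ** 2 for x, m in zip(v, means[c]))
                for c in range(k)
            ]
            top = max(log_w)  # subtract the maximum for numerical stability
            w = [math.exp(lw - top) for lw in log_w]
            total = sum(w)
            resp.append([x / total for x in w])
        # M-step: responsibility-weighted means.
        for c in range(k):
            weight = sum(r[c] for r in resp)
            means[c] = [
                sum(r[c] * v[d] for r, v in zip(resp, vectors)) / weight
                for d in range(dim)
            ]
    return resp, means


# Made-up document vectors; the first four matched a pattern group.
vectors = [[0, 0], [0, 1], [5, 5], [6, 5], [1, 0], [5, 6]]
seed_labels = {0: 0, 1: 0, 2: 1, 3: 1}
resp, means = em_with_clamped_seeds(vectors, seed_labels, k=2)
```

Unlike the hard assignments of k-means, each unseeded document here receives a probability for every cluster, and the pattern-matched documents anchor the cluster means throughout.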
Specification