Data clustering
Abstract
A system, method and computer program product perform data analysis and clustering. A plurality of data objects are received, each represented by a vector of features and associated with a point in time. The plurality of data objects is divided into first time slices to form a plurality of consecutive sets of data objects. Each set of data objects is sub-divided into one or more second time slices so as to form one or more subsets of data objects. The data objects in each set and subset of data objects are processed to derive clusters of data objects according to similarity of features. The clusters of data objects from different sets and subsets of data objects are used to detect changes in the relevance of cluster features over time.
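The two-level time slicing described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the patent: the function names `slice_by_interval` and `two_level_slices`, the representation of a data object as a `(timestamp, words)` pair, the sample data, and the use of a single second interval for every set (the claims allow respective second intervals per set).

```python
from collections import defaultdict
from datetime import datetime, timedelta


def slice_by_interval(docs, start, interval):
    """Group (timestamp, words) pairs into consecutive fixed-width slices."""
    slices = defaultdict(list)
    for ts, words in docs:
        index = (ts - start) // interval  # timedelta // timedelta yields an int
        slices[index].append((ts, words))
    return dict(sorted(slices.items()))


def two_level_slices(docs, start, first_interval, second_interval):
    """Form consecutive sets (first slices), each sub-divided into subsets."""
    result = {}
    for i, first in slice_by_interval(docs, start, first_interval).items():
        slice_start = start + i * first_interval
        result[i] = {
            "set": first,
            "subsets": slice_by_interval(first, slice_start, second_interval),
        }
    return result


# Hypothetical sample data: four documents over roughly five weeks.
docs = [
    (datetime(2020, 1, 1), ["engine", "noise"]),
    (datetime(2020, 1, 9), ["engine", "stall"]),
    (datetime(2020, 1, 20), ["brake", "noise"]),
    (datetime(2020, 2, 3), ["brake", "recall"]),
]
slices = two_level_slices(
    docs, datetime(2020, 1, 1), timedelta(days=28), timedelta(days=7)
)
```

With these inputs the first three documents fall into the first 28-day set, spread over three weekly subsets, and the fourth document opens the second set.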
22 Claims
1. A computer implemented method, comprising:
receiving a plurality of documents, each of the plurality of documents represented by a vector of words and associated with a point in time, wherein the received plurality of documents is received and processed in chronological order;
dividing the received plurality of documents into first time slices using a first time interval to form a plurality of consecutive sets of documents;
sub-dividing each of the plurality of consecutive sets of documents into second time slices using respective second time intervals to form one or more subsets of documents;
identifying a plurality of topics in each of the plurality of consecutive sets of documents and the one or more subsets of documents, each of the plurality of topics represented by a set of most relevant topic keywords;
clustering each of the plurality of consecutive sets of documents and the one or more subsets of documents in accordance with each of the identified plurality of topics;
comparing each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, wherein comparing each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, comprises;
identifying each of the plurality of topics from each of the plurality of consecutive sets of documents and the one or more subsets of documents of the overlapping time slices to detect patterns of changes in the set of most relevant topic keywords over time;
identifying a topic drift based on the detected patterns of changes in the set of most relevant topic keywords over time; and
identifying a topic convergence based on the detected patterns of changes in the set of most relevant topic keywords over time;
redefining each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the identified topic convergence;
redefining each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the identified topic drift;
outputting the redefined clustered plurality of consecutive sets of documents and the one or more subsets of documents; and
defining a template based on the outputted redefined clustered plurality of consecutive sets of documents and the one or more subsets of documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
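As a rough illustration of the comparing step in claim 1, the sketch below derives a set of most frequent words for each time slice and compares consecutive slices. The claim does not say how relevance or change is measured; raw word frequency, Jaccard overlap, the threshold of 0.5, and all function names (`top_keywords`, `keyword_changes`, `has_drift`) are assumptions chosen only for this example.

```python
from collections import Counter


def top_keywords(docs, k=2):
    """Return the k most frequent words in a slice (ties broken alphabetically)."""
    counts = Counter(word for words in docs for word in words)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return set(ranked[:k])


def jaccard(a, b):
    """Overlap of two keyword sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def keyword_changes(slices, k=2):
    """For each pair of consecutive slices: (overlap, added words, dropped words)."""
    keys = [top_keywords(s, k) for s in slices]
    return [(jaccard(p, c), c - p, p - c) for p, c in zip(keys, keys[1:])]


def has_drift(changes, threshold=0.5):
    """Flag drift when any consecutive overlap falls below the assumed threshold."""
    return any(overlap < threshold for overlap, _, _ in changes)


# Hypothetical slices, each a list of documents given as word lists.
time_slices = [
    [["engine", "noise"], ["engine", "noise"], ["engine", "stall"]],
    [["engine", "stall"], ["engine", "stall"], ["noise", "stall"]],
    [["stall", "recall"], ["stall", "recall"], ["recall", "engine"]],
]
changes = keyword_changes(time_slices)
drift = has_drift(changes)
```

Here the keyword set moves from {engine, noise} to {engine, stall} to {recall, stall}, so each step keeps one keyword and replaces the other, which the sketch reports as drift.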

10. A computer system, comprising:
one or more computer processors, one or more computer-readable storage media, and program instructions stored on one or more of the computer-readable storage media for execution by at least one of the one or more processors, the program instructions, when executed by the at least one of the one or more processors, causing the computer system to perform a method comprising;
receiving a plurality of documents, each of the plurality of documents represented by a vector of words and associated with a point in time, wherein the received plurality of documents is received and processed in chronological order;
dividing the received plurality of documents into first time slices using a first time interval to form a plurality of consecutive sets of documents;
sub-dividing each of the plurality of consecutive sets of documents into second time slices using respective second time intervals to form one or more subsets of documents;
identifying a plurality of topics in each of the plurality of consecutive sets of documents and the one or more subsets of documents, each of the plurality of topics represented by a set of most relevant topic keywords;
clustering each of the plurality of consecutive sets of documents and the one or more subsets of documents in accordance with each of the identified plurality of topics;
comparing each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, wherein comparing each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, comprises;
identifying each of the plurality of topics from each of the plurality of consecutive sets of documents and the one or more subsets of documents of the overlapping time slices to detect patterns of changes in the set of most relevant topic keywords over time;
identifying a topic drift based on the detected patterns of changes in the set of most relevant topic keywords over time; and
identifying a topic convergence based on the detected patterns of changes in the set of most relevant topic keywords over time;
redefining each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the identified topic convergence;
redefining each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the identified topic drift;
outputting the redefined clustered plurality of consecutive sets of documents and the one or more subsets of documents; and
defining a template based on the outputted redefined clustered plurality of consecutive sets of documents and the one or more subsets of documents.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18, 19
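The claims also recite identifying a topic convergence and redefining clusters on that basis, without saying how either is done. One possible reading, offered purely as a sketch, treats two topics as converging when the overlap of their keyword sets rises from below a threshold in an earlier slice to at or above it in a later slice, and then merges their clusters. The stable topic identifiers, the Jaccard measure, the 0.5 threshold, and the names `converging_pairs` and `merge_clusters` are all hypothetical.

```python
def jaccard(a, b):
    """Overlap of two keyword sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def converging_pairs(earlier, later, threshold=0.5):
    """Topic id pairs whose keyword overlap rose from below to at/above threshold."""
    ids = sorted(earlier.keys() & later.keys())
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            before = jaccard(earlier[a], earlier[b])
            after = jaccard(later[a], later[b])
            if before < threshold <= after:
                pairs.append((a, b))
    return pairs


def merge_clusters(clusters, pairs):
    """Fold the second cluster of each converging pair into the first."""
    merged = {tid: list(members) for tid, members in clusters.items()}
    for keep, drop in pairs:
        if keep in merged and drop in merged:
            merged[keep].extend(merged.pop(drop))
    return merged


# Hypothetical keyword sets for three topics in two time slices.
earlier = {
    "A": {"engine", "noise", "stall"},
    "B": {"brake", "recall", "pedal"},
    "C": {"paint", "rust", "chip"},
}
later = {
    "A": {"engine", "stall", "recall"},
    "B": {"engine", "recall", "stall"},
    "C": {"paint", "rust", "chip"},
}
clusters = {"A": ["d1", "d2"], "B": ["d3"], "C": ["d4"]}

pairs = converging_pairs(earlier, later)
merged = merge_clusters(clusters, pairs)
```

In this toy data topics A and B share no keywords at first and share all of them later, so their clusters are merged while topic C is left untouched.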

20. A computer program product for controlling access to a secure resource, the computer program product comprising:
one or more computer-readable storage devices and program instructions stored on at least one of the one or more tangible storage devices, the program instructions comprising;
program instructions to receive a plurality of documents, each of the plurality of documents represented by a vector of words and associated with a point in time, wherein the received plurality of documents is received and processed in chronological order;
program instructions to divide the received plurality of documents into first time slices using a first time interval to form a plurality of consecutive sets of documents;
program instructions to sub-divide each of the plurality of consecutive sets of documents into second time slices using respective second time intervals to form one or more subsets of documents;
program instructions to identify a plurality of topics in each of the plurality of consecutive sets of documents and the one or more subsets of documents, each of the plurality of topics represented by a set of most relevant topic keywords;
program instructions to cluster each of the plurality of consecutive sets of documents and the one or more subsets of documents in accordance with each of the identified plurality of topics;
program instructions to compare each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, wherein comparing each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, comprises;
program instructions to identify each of the plurality of topics from each of the plurality of consecutive sets of documents and the one or more subsets of documents of the overlapping time slices to detect patterns of changes in the set of most relevant topic keywords over time;
program instructions to identify a topic drift based on the detected patterns of changes in the set of most relevant topic keywords over time; and
program instructions to identify a topic convergence based on the detected patterns of changes in the set of most relevant topic keywords over time;
program instructions to redefine each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the identified topic convergence;
program instructions to redefine each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the identified topic drift;
program instructions to output the redefined clustered plurality of consecutive sets of documents and the one or more subsets of documents; and
program instructions to define a template based on the outputted redefined clustered plurality of consecutive sets of documents and the one or more subsets of documents.

21. A computer implemented method, comprising:
receiving a plurality of data objects, each of the plurality of data objects represented by a vector of words and associated with a point in time, wherein the received plurality of data objects is received and processed in chronological order;
dividing the received plurality of data objects into first time slices using a first time interval to form a plurality of consecutive sets of data objects;
sub-dividing each of the plurality of consecutive sets of data objects into second time slices using respective second time intervals to form one or more subsets of data objects;
processing the plurality of data objects in each of the plurality of consecutive sets of data objects and the one or more subsets of data objects to derive clusters of the data objects according to similarity of features, wherein each of the derived clusters is represented by a most relevant set of cluster features;
identifying a plurality of cluster features in each of the plurality of consecutive sets of data objects and the one or more subsets of cluster features, each of the plurality of topics represented by a set of most relevant cluster features, wherein identifying a plurality of cluster features in each of the plurality of consecutive sets of data objects and the one or more subsets of cluster features, comprises;
identifying each of the plurality of topics from each of the plurality of consecutive sets of data objects and the one or more subsets of cluster features to detect patterns of changes in a set of most relevant topic keywords over time;
identifying a topic drift based on the detected patterns of changes in the set of most relevant topic keywords over time; and
identifying a topic convergence based on the detected patterns of changes in the set of most relevant topic keywords over time;
redefining the derived clusters of data objects to form homogenous clusters based on the identified topic convergence;
redefining the derived clusters of data objects to form homogenous clusters based on the identified topic drift;
outputting the redefined clusters of data objects; and
defining a template based on the outputted redefined clusters of data objects.
Dependent claims: 22
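Claim 21 derives clusters "according to similarity of features" and represents each cluster by its most relevant features, but names no clustering algorithm. The sketch below swaps in a simple greedy, single-pass scheme with cosine similarity over word counts, only to make the step concrete. The algorithm choice, the 0.4 threshold, the use of summed counts as a centroid, and the names `cosine`, `cluster` and `cluster_features` are assumptions, not the patented method.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.4):
    """Assign each document to the first cluster whose centroid is similar enough."""
    clusters = []  # each entry: {"centroid": Counter, "members": [doc indices]}
    for i, words in enumerate(docs):
        vec = Counter(words)
        for c in clusters:
            if cosine(vec, c["centroid"]) >= threshold:
                c["centroid"] += vec  # summed counts act as an unnormalised centroid
                c["members"].append(i)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters


def cluster_features(c, k=2):
    """The k highest-count words of a cluster (ties broken alphabetically)."""
    ranked = sorted(c["centroid"], key=lambda w: (-c["centroid"][w], w))
    return ranked[:k]


# Hypothetical data objects, each given as a list of words.
docs = [
    ["engine", "noise"],
    ["engine", "stall"],
    ["brake", "recall"],
    ["brake", "pedal", "recall"],
]
clusters = cluster(docs)
features = [cluster_features(c) for c in clusters]
```

A single-pass scheme like this depends on document order, which is compatible with the chronological processing the claim requires but would give different clusters for a different ordering.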
Specification