Data clustering

US 10,452,702 B2
Filed: 05/18/2017
Issued: 10/22/2019
Est. Priority Date: 05/18/2017
Status: Active Grant

Abstract

A system, method and computer program product performs data analysis and clustering. A plurality of data objects are received, each represented by a vector of features and associated with a point in time. The plurality of data objects is divided into first time slices to form a plurality of consecutive sets of data objects. Each set of data objects is sub-divided into one or more second time slices so as to form one or more subsets of data objects. The data objects in each set and subset of data objects are processed to derive clusters of data objects according to similarity of features. The clusters of data objects from different sets and subsets of data objects are used to detect changes in the relevance of cluster features over time.
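The two-level slicing described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `slice_by_interval` and `sub_slice`, the numeric timestamps, and the fixed interval lengths are all hypothetical choices made for the example.

```python
from collections import defaultdict

def slice_by_interval(objects, interval, origin=0):
    """Group (timestamp, payload) pairs into consecutive slices of length `interval`.

    Returns a chronologically sorted list of (slice_start, items).
    Empty slices are omitted (an assumption of this sketch).
    """
    buckets = defaultdict(list)
    for ts, payload in sorted(objects, key=lambda o: o[0]):
        index = int((ts - origin) // interval)
        buckets[index].append((ts, payload))
    return [(origin + i * interval, buckets[i]) for i in sorted(buckets)]

def sub_slice(first_slices, second_interval):
    """Sub-divide every first-level slice into second-level slices."""
    return [
        (start, slice_by_interval(items, second_interval, origin=start))
        for start, items in first_slices
    ]

# Hypothetical data: timestamps in hours, payloads are document ids.
docs = [(1, "a"), (5, "b"), (13, "c"), (26, "d"), (30, "e")]
first = slice_by_interval(docs, interval=24)     # two 24-hour sets
second = sub_slice(first, second_interval=12)    # 12-hour subsets inside each set
```

Here the first set (hours 0 to 24) holds documents a, b and c and is sub-divided into two subsets, while the second set (hours 24 to 48) holds d and e in a single subset.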

22 Claims

1. A computer implemented method, comprising:
- receiving a plurality of documents, each of the plurality of documents represented by a vector of words and associated with a point in time, wherein the received plurality of documents is received and processed in chronological order;
  
  dividing the received plurality of documents into first time slices using a first time interval to form a plurality of consecutive sets of documents;
  
  sub-dividing each of the plurality of consecutive sets of documents into second time slices using respective second time intervals to form one or more subsets of documents;
  
  identifying a plurality of topics in each of the plurality of consecutive sets of documents and the one or more subsets of documents, each of the plurality of topics represented by a set of most relevant topic keywords;
  
  clustering each of the plurality of consecutive sets of documents and the one or more subsets of documents in accordance with each of the identified plurality of topics;
  
  comparing each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, wherein comparing each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, comprises:
  
  identifying each of the plurality of topics from each of the plurality of consecutive sets of documents and the one or more subsets of documents of the overlapping time slices to detect patterns of changes in the set of most relevant topic keywords over time;
  
  identifying a topic drift based on the detected patterns of changes in the set of most relevant topic keywords over time; and
  
  identifying a topic convergence based on the detected patterns of changes in the set of most relevant topic keywords over time;
  
  redefining each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the identified topic convergence;
  
  redefining each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the identified topic drift;
  
  outputting the redefined clustered plurality of consecutive sets of documents and the one or more subsets of documents; and
  
  defining a template based on the outputted redefined clustered plurality of consecutive sets of documents and the one or more subsets of documents.
  - 2. The method of claim 1, wherein sub-dividing each of the plurality of consecutive sets of documents into the second time slices, comprises:
    - sub-dividing each of the plurality of consecutive sets of documents into two or more consecutive overlapping time slices.
  - 3. The method of claim 2, wherein a start time of each of the second time slices is later than a start time of each of the first time slices by an offset time period, wherein the offset time period increases for each of the consecutive overlapping time slices.
  - 4. The method of claim 1, wherein identifying each of the plurality of topics from each of the plurality of consecutive sets of documents and the one or more subsets of documents of the overlapping time slices to detect patterns of changes in the set of most relevant topic keywords over time comprises:
    - detecting an increasing relevance of one or more topic keywords represented in multiple different topics over time, indicative of topic convergence.
  - 5. The method of claim 4, wherein redefining each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the detected patterns, comprises:
    - re-associating the clustered plurality of consecutive sets of documents and the one or more subsets of documents in accordance with the identified plurality of topics to a single cluster for a common identified topic, based on the detected topic convergence.
  - 6. The method of claim 1, wherein identifying each of the plurality of topics from each of the plurality of consecutive sets of documents and the one or more subsets of documents of the overlapping time slices to detect patterns of changes in the set of most relevant topic keywords over time comprises:
    - detecting a decreasing relevance of one or more topic keywords represented in a particular topic over time, indicative of topic drift.
  - 7. The method of claim 6, wherein redefining each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the detected patterns, comprises:
    - re-associating the clustered plurality of consecutive sets of documents and the one or more subsets of documents for the particular topic to one or more new clusters in accordance with the detected topic drift.
  - 8. The method of claim 1, wherein comparing each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, comprises:
    - determining delta differences between relevance scores of each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents of each of the consecutive overlapping time slices, and using the delta differences to detect patterns of changes selected from the group consisting of:
      
      increasing relevance of one or more topic keywords represented in multiple different topics over time, indicative of topic convergence; and
      
      decreasing relevance of one or more topic keywords represented in a particular topic over time, indicative of topic drift.
  - 9. The method of claim 1, wherein identifying a plurality of topics in each of the plurality of consecutive sets of documents and the one or more subsets of documents, each of the plurality of topics represented by a set of most relevant topic keywords, comprises:
    - performing topic analysis on each of the plurality of consecutive sets of documents and the one or more subsets of documents, the topic analysis using Latent Dirichlet Allocation for maximum likelihood fit to identify a predefined number of the plurality of topics, wherein each of the plurality of topics is represented by the set of most relevant topic keywords and each of the plurality of topics has an associated likelihood score indicative of relevance of the keyword.
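The delta-difference comparison of claims 4, 6 and 8 can be illustrated with a short sketch. Everything concrete here is an assumption for illustration only: relevance scores are held as one keyword-to-score dictionary per time slice, a keyword absent from a slice is scored 0.0, and "increasing" or "decreasing" is taken to mean a strict change in every consecutive step. The function names are hypothetical.

```python
def keyword_deltas(scores_by_slice):
    """Return, per keyword, the score differences between consecutive slices.

    `scores_by_slice` is a chronological list of dicts mapping keyword -> relevance.
    Missing keywords count as 0.0 (an assumption of this sketch).
    """
    keywords = set().union(*scores_by_slice)
    deltas = {}
    for kw in keywords:
        series = [s.get(kw, 0.0) for s in scores_by_slice]
        deltas[kw] = [b - a for a, b in zip(series, series[1:])]
    return deltas

def drifting_keywords(topic_scores, threshold=0.0):
    """Keywords of one topic whose relevance falls at every step (possible drift)."""
    deltas = keyword_deltas(topic_scores)
    return sorted(kw for kw, d in deltas.items()
                  if d and all(x < -threshold for x in d))

def converging_keywords(topics, threshold=0.0):
    """Keywords whose relevance rises at every step in two or more topics."""
    rising_in = {}
    for name, topic_scores in topics.items():
        for kw, d in keyword_deltas(topic_scores).items():
            if d and all(x > threshold for x in d):
                rising_in.setdefault(kw, []).append(name)
    return {kw: sorted(names) for kw, names in rising_in.items() if len(names) >= 2}

# Hypothetical relevance scores for two topics over three consecutive slices.
topics = {
    "topic_a": [{"server": 0.5, "outage": 0.1},
                {"server": 0.4, "outage": 0.2},
                {"server": 0.2, "outage": 0.4}],
    "topic_b": [{"deploy": 0.6, "outage": 0.1},
                {"deploy": 0.6, "outage": 0.3},
                {"deploy": 0.5, "outage": 0.5}],
}
drift = drifting_keywords(topics["topic_a"])   # "server" loses relevance in topic_a
convergence = converging_keywords(topics)      # "outage" gains relevance in both topics
```

In this toy data, "server" is flagged as drifting out of topic_a, and "outage" is flagged as a convergence signal because it rises in both topics. A real system would likely need smoothing or a non-zero threshold to tolerate noise.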

10. A computer system, comprising:
- one or more computer processors, one or more computer-readable storage media, and program instructions stored on one or more of the computer-readable storage media for execution by at least one of the one or more processors, the program instructions, when executed by the at least one of the one or more processors, causing the computer system to perform a method comprising:
  
  receiving a plurality of documents, each of the plurality of documents represented by a vector of words and associated with a point in time, wherein the received plurality of documents is received and processed in chronological order;
  
  dividing the received plurality of documents into first time slices using a first time interval to form a plurality of consecutive sets of documents;
  
  sub-dividing each of the plurality of consecutive sets of documents into second time slices using respective second time intervals to form one or more subsets of documents;
  
  identifying a plurality of topics in each of the plurality of consecutive sets of documents and the one or more subsets of documents, each of the plurality of topics represented by a set of most relevant topic keywords;
  
  clustering each of the plurality of consecutive sets of documents and the one or more subsets of documents in accordance with each of the identified plurality of topics;
  
  comparing each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, wherein comparing each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, comprises:
  
  identifying each of the plurality of topics from each of the plurality of consecutive sets of documents and the one or more subsets of documents of the overlapping time slices to detect patterns of changes in the set of most relevant topic keywords over time;
  
  identifying a topic drift based on the detected patterns of changes in the set of most relevant topic keywords over time; and
  
  identifying a topic convergence based on the detected patterns of changes in the set of most relevant topic keywords over time;
  
  redefining each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the identified topic convergence;
  
  redefining each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the identified topic drift;
  
  outputting the redefined clustered plurality of consecutive sets of documents and the one or more subsets of documents; and
  
  defining a template based on the outputted redefined clustered plurality of consecutive sets of documents and the one or more subsets of documents.
  - 11. The computer system of claim 10, wherein sub-dividing each of the plurality of consecutive sets of documents into the second time slices, comprises:
    - sub-dividing each of the plurality of consecutive sets of documents into two or more consecutive overlapping time slices.
  - 12. The computer system of claim 11, wherein a start time of each of the second time slices is later than a start time of each of the first time slices by an offset time period, wherein the offset time period increases for each of the consecutive overlapping time slices.
  - 13. The computer system of claim 10, wherein identifying each of the plurality of topics from each of the plurality of consecutive sets of documents and the one or more subsets of documents of the overlapping time slices to detect patterns of changes in the set of most relevant topic keywords over time comprises:
    - detecting an increasing relevance of one or more topic keywords represented in multiple different topics over time, indicative of topic convergence.
  - 14. The computer system of claim 13, wherein redefining each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the detected patterns, comprises:
    - re-associating the clustered plurality of consecutive sets of documents and the one or more subsets of documents in accordance with the identified plurality of topics to a single cluster for a common identified topic, based on the detected topic convergence.
  - 15. The computer system of claim 10, wherein identifying each of the plurality of topics from each of the plurality of consecutive sets of documents and the one or more subsets of documents of the overlapping time slices to detect patterns of changes in the set of most relevant topic keywords over time comprises:
    - detecting a decreasing relevance of one or more topic keywords represented in a particular topic over time, indicative of topic drift.
  - 16. The computer system of claim 15, wherein redefining each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the detected patterns, comprises:
    - re-associating the clustered plurality of consecutive sets of documents and the one or more subsets of documents for the particular topic to one or more new clusters in accordance with the detected topic drift.
  - 17. The computer system of claim 10, wherein comparing each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, comprises:
    - determining delta differences between relevance scores of each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents of each of the consecutive overlapping time slices, and using the delta differences to detect patterns of changes selected from the group consisting of:
      
      increasing relevance of one or more topic keywords represented in multiple different topics over time, indicative of topic convergence; and
      
      decreasing relevance of one or more topic keywords represented in a particular topic over time, indicative of topic drift.
  - 18. The computer system of claim 10, wherein identifying a plurality of topics in each of the plurality of consecutive sets of documents and the one or more subsets of documents, each of the plurality of topics represented by a set of most relevant topic keywords, comprises:
    - performing topic analysis on each of the plurality of consecutive sets of documents and the one or more subsets of documents, the topic analysis using Latent Dirichlet Allocation for maximum likelihood fit to identify a predefined number of the plurality of topics, wherein each of the plurality of topics is represented by the set of most relevant topic keywords and each of the plurality of topics has an associated likelihood score indicative of relevance of the keyword.
  - 19. The computer system of claim 10, further comprising:
    - a chat messaging subsystem configured for receiving messages comprising text-based documents from one or more users, the chat messaging system configured for sending the plurality of documents to the one or more computer processors of the computer system.
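The overlapping second time slices of claims 11 and 12 (and the parallel claims 2 and 3) might be generated along the following lines. This is only a sketch: the claims state that each second slice starts later than the first slice by an offset that increases from one slice to the next, but the fixed `offset_step`, the shared end time, and the function names are assumptions introduced here.

```python
def overlapping_sub_slices(first_start, first_length, offset_step, count):
    """Return up to `count` (start, end) windows inside one first-level slice.

    Window k starts at first_start + (k + 1) * offset_step, so the offset from the
    start of the first slice grows with each window. Every window ends where the
    first slice ends, which makes consecutive windows overlap (an assumption).
    """
    end = first_start + first_length
    windows = []
    for k in range(count):
        start = first_start + (k + 1) * offset_step
        if start >= end:
            break  # the offset has run past the first slice
        windows.append((start, end))
    return windows

def members(objects, window):
    """Select payloads of (timestamp, payload) pairs that fall inside a window."""
    start, end = window
    return [payload for ts, payload in objects if start <= ts < end]

# Hypothetical 24-hour first slice with sub-slices offset by 6, 12 and 18 hours.
windows = overlapping_sub_slices(first_start=0, first_length=24, offset_step=6, count=3)
docs = [(1, "a"), (7, "b"), (13, "c"), (20, "d")]
subsets = [members(docs, w) for w in windows]
```

Because each window drops the oldest documents of the previous one, comparing topics across the resulting subsets shows how keyword relevance shifts as earlier material is excluded.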

20. A computer program product for controlling access to a secure resource, the computer program product comprising:
- one or more computer-readable storage devices and program instructions stored on at least one of the one or more tangible storage devices, the program instructions comprising:
  
  program instructions to receive a plurality of documents, each of the plurality of documents represented by a vector of words and associated with a point in time, wherein the received plurality of documents is received and processed in chronological order;
  
  program instructions to divide the received plurality of documents into first time slices using a first time interval to form a plurality of consecutive sets of documents;
  
  program instructions to sub-divide each of the plurality of consecutive sets of documents into second time slices using respective second time intervals to form one or more subsets of documents;
  
  program instructions to identify a plurality of topics in each of the plurality of consecutive sets of documents and the one or more subsets of documents, each of the plurality of topics represented by a set of most relevant topic keywords;
  
  program instructions to cluster each of the plurality of consecutive sets of documents and the one or more subsets of documents in accordance with each of the identified plurality of topics;
  
  program instructions to compare each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, wherein comparing each of the identified plurality of topics with respect to each of the plurality of consecutive sets of documents and the one or more subsets of documents to detect patterns of changes in the set of most relevant topic keywords over time, comprises:
  
  program instructions to identify each of the plurality of topics from each of the plurality of consecutive sets of documents and the one or more subsets of documents of the overlapping time slices to detect patterns of changes in the set of most relevant topic keywords over time;
  
  program instructions to identify a topic drift based on the detected patterns of changes in the set of most relevant topic keywords over time; and
  
  program instructions to identify a topic convergence based on the detected patterns of changes in the set of most relevant topic keywords over time;
  
  program instructions to redefine each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the identified topic convergence;
  
  program instructions to redefine each of the clustered plurality of consecutive sets of documents and the one or more subsets of documents to form homogenous clusters based on the identified topic drift;
  
  program instructions to output the redefined clustered plurality of consecutive sets of documents and the one or more subsets of documents; and
  
  program instructions to define a template based on the outputted redefined clustered plurality of consecutive sets of documents and the one or more subsets of documents.
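One way to picture the redefinition step for topic convergence is as a merge of the clusters whose topics were flagged as converging. The sketch below is an assumption-laden illustration rather than the claimed method: clusters are plain lists of document ids keyed by topic name, the converging groups are supplied by some earlier detection step, and `merge_converged` is a hypothetical name.

```python
def merge_converged(clusters, converged):
    """Merge clusters whose topics were flagged as converging.

    `clusters` maps topic name -> list of document ids.
    `converged` is a list of sets of topic names to be treated as one topic.
    Returns a new mapping; a merged cluster is keyed by its sorted topic
    names joined with "+". The input mapping is left unchanged.
    """
    result = {name: list(docs) for name, docs in clusters.items()}
    for group in converged:
        present = sorted(name for name in group if name in result)
        if len(present) < 2:
            continue  # nothing to merge for this group
        merged = []
        for name in present:
            for doc in result.pop(name):
                if doc not in merged:  # keep each document once
                    merged.append(doc)
        result["+".join(present)] = merged
    return result

# Hypothetical clusters; topics "a" and "b" have been flagged as converging.
clusters = {"a": [1, 2], "b": [2, 3], "c": [4]}
redefined = merge_converged(clusters, [{"a", "b"}])
```

Handling topic drift would be the mirror image: documents of a drifting topic are moved out into one or more new clusters instead of being merged.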

21. A computer implemented method, comprising:
- receiving a plurality of data objects, each of the plurality of data objects represented by a vector of words and associated with a point in time, wherein the received plurality of data objects is received and processed in chronological order;
  
  dividing the received plurality of data objects into first time slices using a first time interval to form a plurality of consecutive sets of data objects;
  
  sub-dividing each of the plurality of consecutive sets of data objects into second time slices using respective second time intervals to form one or more subsets of data objects;
  
  processing the plurality of data objects in each of the plurality of consecutive sets of data objects and the one or more subsets of data objects to derive clusters of the data objects according to similarity of features, wherein each of the derived clusters is represented by a most relevant set of cluster features;
  
  identifying a plurality of cluster features in each of the plurality of consecutive sets of data objects and the one or more subsets of cluster features, each of the plurality of topics represented by a set of most relevant cluster features, wherein identifying a plurality of cluster features in each of the plurality of consecutive sets of data objects and the one or more subsets of cluster features, comprises:
  
  identifying each of the plurality of topics from each of the plurality of consecutive sets of data objects and the one or more subsets of cluster features to detect patterns of changes in a set of most relevant topic keywords over time;
  
  identifying a topic drift based on the detected patterns of changes in the set of most relevant topic keywords over time; and
  
  identifying a topic convergence based on the detected patterns of changes in the set of most relevant topic keywords over time;
  
  redefining the derived clusters of data objects to form homogenous clusters based on the identified topic convergence;
  
  redefining the derived clusters of data objects to form homogenous clusters based on the identified topic drift;
  
  outputting the redefined clusters of data objects; and
  
  defining a template based on the outputted redefined clusters of data objects.
  - 22. The computer implemented method of claim 21, wherein the data objects comprise text-based documents and the features comprise words within the text-based documents.
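For the text case of claim 22, where data objects are text-based documents and features are words, a cluster's "most relevant set of cluster features" could be approximated as shown below. This is a hedged sketch: raw word frequency stands in for relevance, the tokenizer is deliberately simple, and `word_vector` and `cluster_features` are hypothetical names not taken from the patent.

```python
from collections import Counter
import re

def word_vector(text):
    """Represent a text document as a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cluster_features(cluster_docs, k=3):
    """Return the k most frequent words across a cluster.

    Frequency is used as a stand-in for relevance; ties are broken alphabetically.
    """
    total = Counter()
    for text in cluster_docs:
        total.update(word_vector(text))
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

# Hypothetical cluster of three short messages.
cluster = ["server outage reported", "outage on the mail server", "server restarted"]
top = cluster_features(cluster, k=2)  # ['server', 'outage']
```

A production system would more plausibly weight words by a topic model score or TF-IDF and filter stop words; the frequency count is used here only to keep the example self-contained.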

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Dunne, Jonathan; Penrose, Andrew
Primary Examiner(s): Leroux, Etienne P
Assistant Examiner(s): Agharahimi, Farhad
Application Number: US15/598,375
Publication Number: US 20180336207A1
Time in Patent Office: 887 Days
CPC Class Codes:
G06F 16/35   Clustering; Classification
G06F 16/355   Class or cluster creation o...
G06F 16/93   Document management systems
