Latent semantic clustering

US 20060242140A1
Filed: 05/11/2006
Published: 10/26/2006
Est. Priority Date: 04/26/2005
Status: Active Grant

4 Assignments

0 Petitions

Abstract

An embodiment of the present invention provides a computer-based method for automatically identifying clusters of conceptually-related documents in a collection of documents, including the following steps: generating a document-representation of each document in an abstract mathematical space; identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster is associated with an exemplary document and a plurality of other documents; and identifying a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster. Variants of the method enable creating a hierarchy of clusters and conducting incremental updates of preexisting hierarchical structures.

94 Citations

29 Claims

1. A computer-based method for automatically identifying clusters of conceptually-related documents in a collection of documents, comprising:
- (a) generating a document-representation of each document in an abstract mathematical space;
  
  (b) identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster is associated with an exemplary document and a plurality of other documents; and
  
  (c) identifying a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9)
- - 2. The method of claim 1, wherein step (b) comprises:
    - (b1) identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations; and
      
      (b2) generating a cluster-representation of each document cluster in the plurality of document clusters, wherein each cluster-representation is associated with an exemplary document and a plurality of other documents.
  - 3. The method of claim 1, wherein step (c) comprises:
    - (c1) identifying a non-intersecting document cluster from among the plurality of document clusters if (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster is above a predefined similarity threshold and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster is above a predefined dissimilarity threshold.
  - 4. The method of claim 3, further comprising:
    - (d) iteratively adjusting the predefined similarity threshold from a maximum similarity level to a minimum similarity level via a predefined similarity increment;
      
      (e) iteratively adjusting the predefined dissimilarity threshold from a minimum dissimilarity level to a maximum dissimilarity level via a predefined dissimilarity increment; and
      
      (f) repeating step (c1) for each similarity level and each dissimilarity level.
  - 5. The method of claim 1, wherein generating a document-representation of each document in an abstract mathematical space comprises:
    - generating a vector representation of each document in a Latent Semantic Indexing (LSI) space.
  - 6. The method of claim 5, wherein identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between pairs of the document-representations comprises:
    - identifying a plurality of document clusters in the collection of documents based on a cosine similarity between pairs of the document-representations.
  - 7. The method of claim 1, further comprising:
    - (d) aggregating the non-intersecting document cluster with a second non-intersecting document cluster if a conceptual similarity between the cluster-representation of the non-intersecting document cluster and a cluster-representation of the second non-intersecting document cluster is above an aggregation-threshold.
  - 8. The method of claim 1, further comprising:
    - (d) identifying a document in the collection of documents that has not been included in a non-intersecting document cluster; and
      
      (e) aggregating the un-clustered document with a non-intersecting document cluster that is similar to the un-clustered document above a similarity threshold.
  - 9. The method of claim 1, further comprising:
    - (d) identifying sub-clusters of documents within a first document cluster based on the conceptual similarity between pairs of the document-representations included within the first document cluster.
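
Claims 1, 3, 4 and 6 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the document vectors are already available (for example from an LSI projection, per claim 5), uses the centroid of a cluster as its cluster-representation, and treats "a plurality of other documents" as at least two. The function names `candidate_clusters`, `non_intersecting` and `sweep`, and the sample vectors, are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean, assumed here as the cluster-representation."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def candidate_clusters(docs, sim_threshold):
    """Steps (b) and (c)(i): one candidate cluster around each exemplary document."""
    clusters, seen = {}, set()
    for exemplar, ex_vec in docs.items():
        members = frozenset(
            d for d, v in docs.items() if cosine(ex_vec, v) > sim_threshold
        )
        # Keep clusters with the exemplar plus at least two other documents,
        # and skip exemplars that reproduce an already recorded member set.
        if len(members) >= 3 and members not in seen:
            seen.add(members)
            clusters[exemplar] = sorted(members)
    return clusters

def non_intersecting(docs, sim_threshold, dissim_threshold):
    """Step (c)(ii): keep clusters that are dissimilar from every other cluster."""
    cands = candidate_clusters(docs, sim_threshold)
    reps = {ex: centroid([docs[d] for d in ms]) for ex, ms in cands.items()}
    return {
        ex: ms
        for ex, ms in cands.items()
        if all(
            1.0 - cosine(reps[ex], reps[other]) > dissim_threshold
            for other in cands
            if other != ex
        )
    }

def sweep(docs, sim_levels, dissim_levels):
    """Claim 4: repeat the selection for each similarity and dissimilarity level."""
    return {
        (s, d): non_intersecting(docs, s, d)
        for s in sim_levels
        for d in dissim_levels
    }

# Hypothetical 2-D document vectors: three near the x-axis, three near the y-axis.
docs = {
    "a1": (1.0, 0.0), "a2": (0.99, 0.05), "a3": (0.98, 0.1),
    "b1": (0.0, 1.0), "b2": (0.05, 0.99), "b3": (0.1, 0.98),
}
result = non_intersecting(docs, sim_threshold=0.9, dissim_threshold=0.5)
levels = sweep(docs, sim_levels=[0.95, 0.9], dissim_levels=[0.5, 0.95])
print(result)
```

In this sketch the caller passes the threshold levels as explicit lists (similarity in descending order, dissimilarity in ascending order) rather than a start value and an increment, which avoids floating-point drift from repeated addition.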

10. A computer program product for automatically identifying clusters of conceptually-related documents in a collection of documents, comprising:
- a computer usable medium having computer readable program code embodied in said medium for causing an application program to execute on an operating system of a computer, said computer readable program code comprising:
  
  computer readable first program code that causes the computer to generate a document-representation of each document in an abstract mathematical space;
  
  computer readable second program code that causes the computer to identify a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster includes an exemplary document and a plurality of other documents; and
  
  computer readable third program code that causes the computer to identify a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster.
- Dependent claims (11, 12, 13, 14, 15, 16, 17, 18)
- - 11. The computer program product of claim 10, wherein the computer readable second program code comprises:
    - code that causes the computer to identify a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations; and
      
      code that causes the computer to generate a cluster-representation of each document cluster in the plurality of document clusters, wherein each cluster-representation is associated with an exemplary document and a plurality of other documents.
  - 12. The computer program product of claim 10, wherein the computer readable third program code comprises:
    - code that causes the computer to identify a non-intersecting document cluster from among the plurality of document clusters if (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster is above a predefined similarity threshold and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster is above a predefined dissimilarity threshold.
  - 13. The computer program product of claim 12, further comprising:
    - computer readable fourth program code that causes the computer to iteratively adjust the predefined similarity threshold from a maximum similarity level to a minimum similarity level via a predefined similarity increment;
      
      computer readable fifth program code that causes the computer to iteratively adjust the predefined dissimilarity threshold from a minimum dissimilarity level to a maximum dissimilarity level via a predefined dissimilarity increment; and
      
      computer readable sixth program code that causes the computer to repeat the computer readable third program code for each similarity level and each dissimilarity level.
  - 14. The computer program product of claim 10, wherein the computer readable first program code that causes the computer to generate a document-representation of each document in an abstract mathematical space comprises:
    - code that causes the computer to generate a vector representation of each document in a Latent Semantic Indexing (LSI) space.
  - 15. The computer program product of claim 14, wherein the computer readable second program code that causes the computer to identify a plurality of document clusters in the collection of documents based on a conceptual similarity between pairs of the document-representations comprises:
    - code that causes the computer to identify a plurality of document clusters in the collection of documents based on a cosine similarity between pairs of the document-representations.
  - 16. The computer program product of claim 10, further comprising:
    - computer readable fourth program code that causes the computer to aggregate the non-intersecting document cluster with a second non-intersecting document cluster if a conceptual similarity between the cluster-representation of the non-intersecting document cluster and a cluster-representation of the second non-intersecting document cluster is above an aggregation-threshold.
  - 17. The computer program product of claim 10, further comprising:
    - computer readable fourth program code that causes the computer to identify a document in the collection of documents that has not been included in a non-intersecting document cluster; and
      
      computer readable fifth program code that causes the computer to aggregate the un-clustered document with a non-intersecting document cluster that is similar to the un-clustered document above a similarity threshold.
  - 18. The computer program product of claim 10, further comprising:
    - computer readable fourth program code that causes the computer to identify sub-clusters of documents within a first document cluster based on the conceptual similarity between pairs of the document-representations included within the first document cluster.
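
The aggregation steps of claims 16 and 17 (which mirror claims 7 and 8) might look like the following Python sketch. It rests on assumptions the claims do not state: centroids serve as cluster-representations, cosine similarity is the conceptual similarity measure, and merging is greedy. The names `aggregate` and `assign_unclustered` and the sample data are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(cluster, docs):
    """Mean vector of a cluster's members, assumed as its cluster-representation."""
    vectors = [docs[d] for d in cluster]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def aggregate(clusters, docs, aggregation_threshold):
    """Claim 16: merge clusters whose representations exceed the threshold."""
    merged = [list(c) for c in clusters]
    merged_any = True
    while merged_any:
        merged_any = False
        for i, j in combinations(range(len(merged)), 2):
            sim = cosine(centroid(merged[i], docs), centroid(merged[j], docs))
            if sim > aggregation_threshold:
                merged[i].extend(merged.pop(j))
                merged_any = True
                break  # indices changed, so restart the pairwise scan
    return merged

def assign_unclustered(clusters, docs, sim_threshold):
    """Claim 17: attach each un-clustered document to its most similar cluster."""
    result = [list(c) for c in clusters]
    clustered = {d for c in clusters for d in c}
    reps = [centroid(c, docs) for c in clusters]  # fixed during assignment
    for doc, vec in docs.items():
        if doc in clustered:
            continue
        sims = [cosine(vec, rep) for rep in reps]
        if sims and max(sims) > sim_threshold:
            result[sims.index(max(sims))].append(doc)
    return result

# Hypothetical data: two clusters near the x-axis, one near the y-axis,
# plus two documents ("x" and "z") that were left un-clustered.
docs = {
    "a1": (1.0, 0.0), "a2": (0.99, 0.05), "a3": (0.98, 0.1), "a4": (0.97, 0.12),
    "b1": (0.0, 1.0), "b2": (0.05, 0.99),
    "x": (0.9, 0.2), "z": (0.7, 0.7),
}
clusters = [["a1", "a2"], ["a3", "a4"], ["b1", "b2"]]
merged = aggregate(clusters, docs, aggregation_threshold=0.95)
final = assign_unclustered(merged, docs, sim_threshold=0.9)
print(merged)
print(final)
```

Document "z" lies midway between the two merged clusters and stays un-clustered, since neither similarity exceeds the threshold.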

19. A computer-based method for automatically identifying clusters of conceptually-related documents in a collection of documents, comprising:
- (a) generating a document-representation of each document in an abstract mathematical space;
  
  (b) identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster includes a plurality of documents;
  
  (c) computing an intra-cluster conceptual similarity for each document cluster based on the document-representations of the plurality of documents included in each cluster;
  
  (d) computing inter-cluster conceptual dissimilarities between pairs of document clusters in the plurality of document clusters; and
  
  (e) identifying a non-intersecting document cluster from among the plurality of document clusters based on (i) the intra-cluster conceptual similarities and (ii) the inter-cluster conceptual dissimilarities.
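
Claim 19 leaves open how the intra-cluster similarity and inter-cluster dissimilarity are computed. The Python sketch below shows one possible reading, under stated assumptions: intra-cluster similarity is the mean pairwise cosine similarity of a cluster's documents, inter-cluster dissimilarity is one minus the cosine similarity of the cluster centroids, and a cluster is selected when both exceed caller-supplied thresholds. All function names, thresholds and vectors are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(cluster, docs):
    """Mean vector of a cluster's members."""
    vectors = [docs[d] for d in cluster]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def intra_similarity(cluster, docs):
    """Step (c): mean pairwise cosine similarity (needs at least two documents)."""
    pairs = list(combinations(cluster, 2))
    return sum(cosine(docs[a], docs[b]) for a, b in pairs) / len(pairs)

def inter_dissimilarity(c1, c2, docs):
    """Step (d): one minus the cosine similarity of the two centroids."""
    return 1.0 - cosine(centroid(c1, docs), centroid(c2, docs))

def select_clusters(clusters, docs, min_intra, min_inter):
    """Step (e): keep cohesive clusters that are separated from all others."""
    keep = []
    for name, members in clusters.items():
        cohesive = intra_similarity(members, docs) > min_intra
        separated = all(
            inter_dissimilarity(members, other, docs) > min_inter
            for other_name, other in clusters.items()
            if other_name != name
        )
        if cohesive and separated:
            keep.append(name)
    return keep

# Hypothetical 3-D vectors: A and B are tight clusters, L is a loose one.
docs = {
    "a1": (1.0, 0.0, 0.0), "a2": (0.99, 0.05, 0.0), "a3": (0.98, 0.1, 0.0),
    "b1": (0.0, 1.0, 0.0), "b2": (0.05, 0.99, 0.0), "b3": (0.1, 0.98, 0.0),
    "c1": (0.0, 0.0, 1.0), "c2": (0.5, 0.0, 0.87), "c3": (0.0, 0.5, 0.87),
}
clusters = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2", "b3"],
    "L": ["c1", "c2", "c3"],
}
selected = select_clusters(clusters, docs, min_intra=0.9, min_inter=0.5)
print(selected)
```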

20. A computer-based method for automatically organizing documents in a collection of documents into clusters of documents, comprising:
- (a) generating a representation of each document in an abstract mathematical space;
  
  (b) measuring a similarity between the representation of each document in the collection of documents and the representation of at least one other document in the collection of documents;
  
  (c) labeling each document in the collection of documents with a first mapping or a second mapping based on the similarity measurements; and
  
  (d) organizing the documents into clusters based on the mappings.
- Dependent claims (21, 22, 23, 24)
- - 21. The computer-based method of claim 20, wherein:
    - step (c) comprises labeling a first document with the first mapping and a second document with the second mapping, if the similarity between the representation of the first document and the representation of the second document exceeds a threshold, and if the first document and the second document are not already labeled with a mapping; and
      
      step (d) comprises creating a cluster of conceptually-related documents that includes the first document and the second document.
  - 22. The computer-based method of claim 20, wherein:
    - step (c) comprises labeling a first document with the second mapping, if the similarity between the representation of the first document and the representation of a second document in an existing cluster of conceptually-related documents exceeds a threshold, and if the similarity between the representation of the first document and the representation of the second document is greater than the similarity between the representation of the first document and the representation of any other document in the collection of documents; and
      
      step (d) comprises adding the first document to the existing cluster of conceptually-related documents.
  - 23. The computer-based method of claim 20, wherein:
    - step (c) comprises labeling a document with the second mapping if the similarity between the representation of the document and the representation of each other document in the collection of documents does not exceed a threshold; and
      
      step (d) comprises adding the document to a cluster of conceptually-unrelated documents.
  - 24. The computer-based method of claim 20, wherein the collection of documents has a preexisting cluster structure of nodes and each node has a preexisting representation, and wherein prior to step (a) the method further comprises:
    - transforming the preexisting representation of each node into a representation in the abstract mathematical space; and
      
      labeling each node with the first mapping.
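
The labeling scheme of claims 20 to 23 can be sketched in Python as follows. This is an illustrative reading, not the claimed implementation: the claims do not name the two mappings, so the constants `FIRST` and `SECOND` are placeholders, cosine similarity stands in for the similarity measure, and documents are processed in a single pass in input order. The function name `organize` and the sample vectors are hypothetical.

```python
import math

FIRST, SECOND = "first", "second"  # placeholder names for the two mappings

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def organize(docs, threshold):
    """Label every document and group the documents into clusters."""
    labels, cluster_of, clusters, unrelated = {}, {}, [], []
    for doc, vec in docs.items():
        if doc in labels:
            continue
        # Step (b): find the most similar other document.
        others = [o for o in docs if o != doc]
        best = max(others, key=lambda o: cosine(vec, docs[o]), default=None)
        if best is None or cosine(vec, docs[best]) <= threshold:
            # Claim 23: nothing is similar enough, so the document joins
            # the cluster of conceptually-unrelated documents.
            labels[doc] = SECOND
            unrelated.append(doc)
        elif best in cluster_of:
            # Claim 22: the nearest document already sits in a cluster.
            labels[doc] = SECOND
            cluster_of[doc] = cluster_of[best]
            clusters[cluster_of[best]].append(doc)
        else:
            # Claim 21: neither document is labeled yet, so start a cluster.
            # (The nearest document cannot be in `unrelated` here, because
            # cosine similarity is symmetric.)
            labels[doc], labels[best] = FIRST, SECOND
            cluster_of[doc] = cluster_of[best] = len(clusters)
            clusters.append([doc, best])
    return labels, clusters, unrelated

# Hypothetical 2-D document vectors.
docs = {
    "a1": (1.0, 0.0), "a2": (0.99, 0.05), "a3": (0.98, 0.1),
    "b1": (0.0, 1.0), "b2": (0.05, 0.99),
    "z": (-0.7, -0.7),
}
labels, clusters, unrelated = organize(docs, threshold=0.9)
print(clusters, unrelated)
```

With this reading, each cluster contains exactly one document carrying the first mapping, and every other document, including the unrelated ones, carries the second mapping.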

25. A computer program product for automatically organizing documents in a collection of documents into clusters of documents, comprising:
- a computer usable medium having computer readable program code embodied in said medium for causing an application program to execute on an operating system of a computer, said computer readable program code comprising:
  
  computer readable first program code that causes the computer to generate a representation of each document in an abstract mathematical space;
  
  computer readable second program code that causes the computer to measure a similarity between the representation of each document in the collection of documents and the representation of at least one other document in the collection of documents;
  
  computer readable third program code that causes the computer to label each document in the collection of documents with a first mapping or a second mapping based on the similarity measurements; and
  
  computer readable fourth program code that causes the computer to organize the documents into clusters based on the mappings.
- Dependent claims (26, 27, 28, 29)
- - 26. The computer program product of claim 25, wherein:
    - the computer readable third program code comprises code that causes the computer to label a first document with the first mapping and a second document with the second mapping, if the similarity between the representation of the first document and the representation of the second document exceeds a threshold, and if the first document and the second document are not already labeled with a mapping; and
      
      the computer readable fourth program code comprises code that causes the computer to create a cluster of conceptually-related documents that includes the first document and the second document.
  - 27. The computer program product of claim 25, wherein:
    - the computer readable third program code comprises code that causes the computer to label a first document with the second mapping, if the similarity between the representation of the first document and the representation of a second document in an existing cluster of conceptually-related documents exceeds a threshold, and if the similarity between the representation of the first document and the representation of the second document is greater than the similarity between the representation of the first document and the representation of any other document in the collection of documents; and
      
      the computer readable fourth program code comprises code that causes the computer to add the first document to the existing cluster of conceptually-related documents.
  - 28. The computer program product of claim 25, wherein:
    - the computer readable third program code comprises code that causes the computer to label a document with the second mapping if the similarity between the representation of the document and the representation of each other document in the collection of documents does not exceed a threshold; and
      
      the computer readable fourth program code comprises code that causes the computer to add the document to a cluster of conceptually-unrelated documents.
  - 29. The computer program product of claim 25, wherein the collection of documents has a preexisting cluster structure of nodes and each node has a preexisting representation, and wherein prior to the computer readable first program code the computer program product further comprises:
    - code that causes the computer to transform the preexisting representation of each node into a representation in the abstract mathematical space; and
      
      code that causes the computer to label each node with the first mapping.
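
The incremental update of claims 24 and 29 can be pictured with the rough Python sketch below, which rests on several assumptions. The caller supplies a `transform` function that maps a node's preexisting representation into the abstract space (the claims do not say how this is done), each transformed node is labeled with the first mapping, and new documents are attached to the most similar node when the cosine similarity exceeds a threshold. The names `incremental_update`, `FIRST`, `SECOND` and the sample data are hypothetical.

```python
import math

FIRST, SECOND = "first", "second"  # placeholder names for the two mappings

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def incremental_update(nodes, new_docs, transform, threshold):
    """Add new documents to a preexisting cluster structure.

    nodes:     node id -> preexisting representation
    new_docs:  document id -> vector already in the abstract space
    transform: maps a preexisting representation into the abstract space
    """
    # Transform every node and label it with the first mapping.
    node_vecs = {n: transform(rep) for n, rep in nodes.items()}
    labels = {n: FIRST for n in nodes}
    assignment = {n: [] for n in nodes}
    unrelated = []
    for doc, vec in new_docs.items():
        labels[doc] = SECOND
        best = max(node_vecs, key=lambda n: cosine(vec, node_vecs[n]), default=None)
        if best is not None and cosine(vec, node_vecs[best]) > threshold:
            assignment[best].append(doc)
        else:
            unrelated.append(doc)
    return labels, assignment, unrelated

# Hypothetical example: node representations are 3-D and the stand-in
# transform simply keeps the first two coordinates.
nodes = {"sports": (2.0, 0.0, 5.0), "law": (0.0, 3.0, 1.0)}
new_docs = {"d1": (0.9, 0.1), "d2": (0.1, 0.8), "d3": (0.7, 0.7)}
labels, assignment, unrelated = incremental_update(
    nodes, new_docs, transform=lambda rep: rep[:2], threshold=0.8
)
print(assignment, unrelated)
```

In an LSI setting the transform would more plausibly be a projection of the node's term vector into the reduced space; the coordinate-dropping lambda above only keeps the example short.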

Current Assignee: Relativity ODA LLC
Original Assignee: Content Analyst Company, LLC (Relativity ODA LLC)
Inventors: Wnek, Janusz
Granted Patent: US 7,844,566 B2
Field of Search
US Class Current: 1/1
CPC Class Codes:
- G06F 16/355 Class or cluster creation o...
- G06F 18/232 Non-hierarchical techniques
