Latent semantic clustering

US 7,844,566 B2
Filed: 05/11/2006
Issued: 11/30/2010
Est. Priority Date: 04/26/2005
Status: Active Grant

4 Assignments

0 Petitions

Abstract

An embodiment of the present invention provides a computer-based method for automatically identifying clusters of conceptually-related documents in a collection of documents, including the following steps: generating a document-representation of each document in an abstract mathematical space; identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster is associated with an exemplary document and a plurality of other documents; and identifying a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster. Variants of the method enable creating a hierarchy of clusters and conducting incremental updates of preexisting hierarchical structures.
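
To make the second step of the abstract concrete, the following is a minimal sketch of one way candidate clusters could be formed around exemplary documents. It is an illustration only, not the patented implementation: the function names `unit_rows` and `candidate_clusters`, the `min_similarity` parameter, the use of cosine similarity, and the reading of "a plurality of other documents" as "at least two" are all assumptions. Document vectors are assumed to be nonzero.

```python
import numpy as np

def unit_rows(X):
    """Scale each row to unit length so that dot products are cosine similarities."""
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def candidate_clusters(doc_vectors, min_similarity):
    """Treat every document in turn as an exemplar (hypothetical reading of step b).

    Returns a dict mapping exemplar index -> list of other document indices
    whose cosine similarity to the exemplar exceeds min_similarity.
    Exemplars with fewer than two such documents are dropped.
    """
    U = unit_rows(doc_vectors)
    sims = U @ U.T  # pairwise cosine similarities
    clusters = {}
    for exemplar in range(len(U)):
        members = [j for j in range(len(U))
                   if j != exemplar and sims[exemplar, j] > min_similarity]
        if len(members) >= 2:  # assumed meaning of "a plurality of other documents"
            clusters[exemplar] = members
    return clusters
```

The candidate clusters produced this way may overlap heavily; the third step of the abstract is what selects non-intersecting clusters from among them.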

127 Citations

8 Claims

1. A computer-based method for automatically identifying clusters of conceptually-related documents in a collection of documents, comprising:
- (a) generating a document-representation of each document in an abstract mathematical space;
  
  (b) identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster is associated with an exemplary document and a plurality of other documents; and
  
  (c) identifying a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster, wherein step (c) comprises,
  
  (c1) identifying a non-intersecting document cluster from among the plurality of document clusters if (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster is above a predefined similarity threshold and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster is above a predefined dissimilarity threshold; and
  
  (d) iteratively adjusting the predefined similarity threshold from a maximum similarity level to a minimum similarity level via a predefined similarity increment;
  
  (e) iteratively adjusting the predefined dissimilarity threshold from a minimum dissimilarity level to a maximum dissimilarity level via a predefined dissimilarity increment; and
  
  (f) repeating step (c1) for each similarity level and each dissimilarity level.
  - 2. The method of claim 1, wherein step (b) comprises:
    - (b1) identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations; and
      
      (b2) generating a cluster-representation of each document cluster in the plurality of document clusters, wherein each cluster-representation is associated with an exemplary document and a plurality of other documents.
  - 3. The method of claim 1, wherein generating a document-representation of each document in an abstract mathematical space comprises:
    - generating a vector representation of each document in a Latent Semantic Indexing (LSI) space.
  - 4. The method of claim 3, wherein identifying a plurality of document clusters in the collection of documents based on a conceptual similarity between pairs of the document-representations comprises:
    - identifying a plurality of document clusters in the collection of documents based on a cosine similarity between pairs of the document-representations.
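
Claims 3 and 4 name a Latent Semantic Indexing (LSI) space and cosine similarity. A minimal sketch of both follows, assuming a small term-by-document count matrix and a truncated singular value decomposition computed with NumPy. The function names, the choice of `k`, and the convention of scaling document coordinates by the singular values are assumptions made for illustration; the claims do not specify them.

```python
import numpy as np

def lsi_document_vectors(term_doc, k):
    """Project documents into a k-dimensional LSI space.

    term_doc: matrix with one row per term and one column per document.
    Returns an array with one row per document and k columns.
    """
    A = np.asarray(term_doc, dtype=float)
    # Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Document coordinates: top-k right singular vectors, scaled by the
    # singular values (one common convention, assumed here).
    return Vt[:k].T * s[:k]

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the (nonzero) rows of X."""
    X = np.asarray(X, dtype=float)
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    return U @ U.T

# Toy data: documents 0 and 1 share terms 0 and 1; documents 2 and 3 share terms 2 and 3.
term_doc = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 4],
    [0, 0, 4, 2],
]
sims = cosine_similarity_matrix(lsi_document_vectors(term_doc, k=2))
```

In this toy example, documents that share vocabulary end up pointing in the same direction in the reduced space, while documents with disjoint vocabulary end up orthogonal.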

5. A computer program product for automatically identifying clusters of conceptually-related documents in a collection of documents, comprising:
- a computer usable medium having computer readable program code embodied in said medium for causing an application program to execute on an operating system of a computer, said computer readable program code comprising;
  
  computer readable first program code that causes the computer to generate a document-representation of each document in an abstract mathematical space;
  
  computer readable second program code that causes the computer to identify a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations, wherein each document cluster includes an exemplary document and a plurality of other documents; and
  
  computer readable third program code that causes the computer to identify a non-intersecting document cluster from among the plurality of document clusters based on (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster, wherein the computer readable third program code comprises,
  
  code that causes the computer to identify a non-intersecting document cluster from among the plurality of document clusters if (i) a conceptual similarity between the document-representation of the exemplary document and the document-representation of each document in the non-intersecting cluster is above a predefined similarity threshold and (ii) a conceptual dissimilarity between a cluster-representation of the non-intersecting document cluster and a cluster-representation of each other document cluster is above a predefined dissimilarity threshold; and
  
  computer readable fourth program code that causes the computer to iteratively adjust the predefined similarity threshold from a maximum similarity level to a minimum similarity level via a predefined similarity increment;
  
  computer readable fifth program code that causes the computer to iteratively adjust the predefined dissimilarity threshold from a minimum dissimilarity level to a maximum dissimilarity level via a predefined dissimilarity increment; and
  
  computer readable sixth program code that causes the computer to repeat the third computer readable program code means for each similarity level and each dissimilarity level.
  - 6. The computer program product of claim 5, wherein the computer readable second program code comprises:
    - code that causes the computer to identify a plurality of document clusters in the collection of documents based on a conceptual similarity between respective pairs of the document-representations; and
      
      code that causes the computer to generate a cluster-representation of each document cluster in the plurality of document clusters, wherein each cluster-representation is associated with an exemplary document and a plurality of other documents.
  - 7. The computer program product of claim 5, wherein the computer readable first program code that causes the computer to generate a document-representation of each document in an abstract mathematical space comprises:
    - code that causes the computer to generate a vector representation of each document in a Latent Semantic Indexing (LSI) space.
  - 8. The computer program product of claim 7, wherein the computer readable second program code that causes the computer to identify a plurality of document clusters in the collection of documents based on a conceptual similarity between pairs of the document-representations comprises:
    - code that causes the computer to identify a plurality of document clusters in the collection of documents based on a cosine similarity between pairs of the document-representations.
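
Steps (c1) and (d) through (f) of claim 1, mirrored by the third through sixth program code of claim 5, describe a two-threshold test that is repeated while the similarity threshold is lowered and the dissimilarity threshold is raised. The sketch below shows one possible reading, not the patented implementation. The assumptions are: dissimilarity is taken as one minus cosine similarity, a cluster-representation is the mean of the cluster's document vectors, the two adjustments are combined as nested loops, and every name (`cosine`, `centroid`, `is_non_intersecting`, `sweep`) is hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def centroid(docs, cluster):
    """Assumed cluster-representation: mean vector of the exemplar and members."""
    exemplar, members = cluster
    return np.mean([docs[i] for i in [exemplar, *members]], axis=0)

def is_non_intersecting(name, clusters, docs, sim_threshold, dissim_threshold):
    """Step (c1): apply both threshold conditions to one cluster."""
    exemplar, members = clusters[name]
    # (i) every member must be more similar to the exemplar than the threshold.
    if any(cosine(docs[exemplar], docs[m]) <= sim_threshold for m in members):
        return False
    # (ii) the cluster must be more dissimilar from every other cluster than the threshold.
    own = centroid(docs, clusters[name])
    return all(
        1.0 - cosine(own, centroid(docs, other)) > dissim_threshold
        for other_name, other in clusters.items()
        if other_name != name
    )

def sweep(clusters, docs, sim_max, sim_min, sim_step, dis_min, dis_max, dis_step):
    """Steps (d)-(f): repeat (c1) for each similarity and dissimilarity level."""
    # Integer step counts avoid accumulating floating-point drift.
    n_sim = int(round((sim_max - sim_min) / sim_step))
    n_dis = int(round((dis_max - dis_min) / dis_step))
    results = []
    for i in range(n_sim + 1):
        sim_threshold = sim_max - i * sim_step       # step (d): high to low
        for j in range(n_dis + 1):
            dissim_threshold = dis_min + j * dis_step  # step (e): low to high
            found = [name for name in clusters
                     if is_non_intersecting(name, clusters, docs,
                                            sim_threshold, dissim_threshold)]
            results.append((round(sim_threshold, 6), round(dissim_threshold, 6), found))
    return results
```

Here `clusters` maps a cluster name to a pair of an exemplar index and a list of member indices into `docs`. Starting with the strictest similarity threshold means the tightest clusters are accepted first, and looser ones only qualify at later levels.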

Current Assignee: Relativity ODA LLC
Original Assignee: Content Analyst Company, LLC (Relativity ODA LLC)
Inventors: Wnek, Janusz
Primary Examiner(s): STARKS, WILBERT L
Application Number: US11/431,664
Publication Number: US 20060242140A1
Time in Patent Office: 1,664 Days
Field of Search: 382/187, 706/10, 706/45, 706/55
US Class Current: 706/55
CPC Class Codes:
G06F 16/355 Class or cluster creation o...
G06F 18/232 Non-hierarchical techniques
