Concept decomposition using clustering

US 6,560,597 B1
Filed: 03/21/2000
Issued: 05/06/2003
Est. Priority Date: 03/21/2000
Status: Expired due to Fees

First Claim

Patent Images

1. A method of operating a computer system to represent text documents stored in a database collection, comprising:

representing the text documents in a vector representation format in which there are n documents and d words;

normalizing the document vectors;

determining an initial partitioning of the normalized document vectors comprising a set of k disjoint clusters and determining k cluster vectors, wherein a cluster vector comprises a mean vector of all the normalized document vectors in a partition;

computing a set of K concept vectors based on the initial set of cluster vectors, wherein the concept vectors define a subspace of the document vector space and wherein the subspace spans a part of the document vector space; and

projecting each document vector onto the subspace defined by the concept vectors, thereby defining a set of document concept decomposition vectors that represent the document vector space, with a reduced dimensionality.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A system and method operates with a document collection in which documents are represented as normalized document vectors. The document vector space is partitioned into a set of disjoint clusters and a concept vector is computed for each partition, the concept vector comprising the mean vector of all the documents in each partition. Documents are then reassigned to the cluster having their closest concept vector, and a new set of concept vectors for the new partitioning is computed. From an initial partitioning, the concept vectors are iteratively calculated to a stopping threshold value, leaving a concept vector subspace of the document vectors. The documents can then be projected onto the concept vector subspace to be represented as a linear combination of the concept vectors, thereby reducing the dimensionality of the document space. A search query can be received for the content of text documents and a search can then be performed on the projected document vectors to identify text documents that correspond to the search query.

254 Citations

27 Claims

1. A method of operating a computer system to represent text documents stored in a database collection, comprising:
- representing the text documents in a vector representation format in which there are n documents and d words;
  
  normalizing the document vectors;
  
  determining an initial partitioning of the normalized document vectors comprising a set of k disjoint clusters and determining k cluster vectors, wherein a cluster vector comprises a mean vector of all the normalized document vectors in a partition;
  
  computing a set of K concept vectors based on the initial set of cluster vectors, wherein the concept vectors define a subspace of the document vector space and wherein the subspace spans a part of the document vector space; and
  
  projecting each document vector onto the subspace defined by the concept vectors, thereby defining a set of document concept decomposition vectors that represent the document vector space, with a reduced dimensionality.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
- - 2. A method as defined in claim 1, wherein computing a set of k concept vectors comprises:
3. A method as defined in claim 2, wherein projecting onto the subspace comprises representing each document vector as a linear combination of the concept vectors.
4. A method as defined in claim 2, wherein the predetermined stopping threshold comprises a metric describing the amount of information thrown away by the projection operation.
5. A method as defined in claim 1, wherein the initial set of k disjoint cluster vectors comprises a set of vectors uniformly distributed in the normalized document vector space.
6. A method as defined in claim 1, wherein the initial set of k disjoint cluster vectors comprises a set of vectors randomly distributed in the normalized document vector space.
7. A method as defined in claim 1, further including:
- receiving a search query relating to content of the text documents;
  
  performing a search on the projected document vectors for the received query; and
  
  identifying text documents that correspond to the search query.

8. A computer system that processes text documents to produce a document database representation, the system comprising:
- a Text Data Repository that contains the text documents; and
  
  a Document Processor that processes the text documents and is adapted to represent the text documents in a vector representation format in which there are n documents and d words, normalize the document vectors, to determine an initial partitioning of the normalized document vectors comprising a set of k disjoint clusters and determining k cluster vectors, wherein a cluster vector comprises a mean vector of all the normalized document vectors in a partition, then to compute a set of k concept vectors based on the initial set of cluster vectors, wherein the concept vectors define a subspace that spans the document vector space, and then to project each document vector onto the subspace defined by the concept vectors, thereby defining a set of document concept decomposition vectors that represent the document vector space, with a reduced dimensionality.
- View Dependent Claims (9, 10, 11, 12, 13, 14)
- - 9. A system as defined in claim 8, wherein the Document Processor computes a set of k concept vectors by:
10. A system as defined in claim 9, wherein the Document Processor projects onto the subspace by representing each document vector as a linear combination of the concept vectors.
11. A system as defined in claim 9, wherein the predetermined stopping threshold comprises a metric describing the amount of information thrown away by the projection operation.
12. A system as defined in claim 8, wherein the initial set of k disjoint cluster vectors comprises a set of vectors uniformly distributed in the normalized document vector space.
13. A system as defined in claim 8, wherein the initial set of k disjoint cluster vectors comprises a set of vectors randomly distributed in the normalized document vector space.
14. A system as defined in claim 8, further including a search engine that is adapted to receive a search query relating to content of the text documents, perform a search on the projected document vectors for the received query, and identify text documents that correspond to the search query.

15. A program product having a signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method of representing text documents stored in a database collection, the performed method comprising:
- representing the text documents in a vector representation format in which there are n documents and d words;
  
  normalizing the document vectors;
  
  determining an initial partitioning of the normalized document vectors comprising a set of k disjoint clusters and determining k cluster vectors, wherein a cluster vector comprises a mean vector of all the normalized document vectors in a partition;
  
  computing a set of K concept vectors based on the initial set of cluster vectors, wherein the concept vectors define a subspace of the document vector space and wherein the subspace spans a part of the document vector space; and
  
  projecting each document vector onto the subspace defined by the concept vectors, thereby defining a set of document concept decomposition vectors that represent the document vector space, with a reduced dimensionality.
- View Dependent Claims (16, 17, 18, 19, 20, 21)
- - 16. A program product as defined in claim 15, wherein computing a set of k concept vectors comprises:
17. A program product as defined in claim 16, wherein projecting onto the subspace comprises representing each document vector as a linear combination of the concept vectors.
18. A program product as defined in claim 16, wherein the predetermined stopping threshold comprises a metric describing the amount of information thrown away by the projection operation.
19. A program product as defined in claim 15, wherein the initial set of k disjoint cluster vectors comprises a set of vectors uniformly distributed in the normalized document vector space.
20. A program product as defined in claim 15, wherein the initial set of k disjoint cluster vectors comprises a set of vectors randomly distributed in the normalized document vector space.
21. A program product as defined in claim 15, wherein the performed method further includes:
- receiving a search query relating to content of the text documents;
  
  performing a search on the projected document vectors for the received query; and
  
  identifying text documents that correspond to the search query.

22. A computer system that processes text documents to produce a document database representation, the system comprising:
- a Text Data Repository that contains the text documents; and
  
  a Document Processor that processes the text documents and is adapted to represent the text documents in a vector representation format in which there are n documents and d words, normalize the document vectors, to determine an initial partitioning of the normalized document vectors comprising a set of k disjoint clusters and determining k cluster vectors, wherein a cluster vector comprises a mean vector of all the normalized document vectors in a partition, then to compute a set of k concept vectors based on the initial set of cluster vectors;
  
  wherein the Document Processor computes a set of k concept vectors by;
  
  determining the mean vector of the normalized document vectors in each disjoint cluster;
  
  determining the closest concept vector to each normalized document vector in the document vector space;
  
  assigning each of the normalized document vectors to a partition defined by the determined closest concept vector; and
  
  repeating the computing, determining, and assigning operations up to a predetermined stopping threshold, thereby defining a vector subspace that spans a part of the document vector space;
  
  and wherein the Document Processor projects each document vector onto the subspace defined by the concept vectors, thereby defining a set of document concept decomposition vectors that represent the document vector space, with a reduced dimensionality.
- View Dependent Claims (23, 24, 25, 26, 27)
- - 23. A system as defined in claim 22, wherein the Document Processor projects onto the subspace by representing each document vector as a linear combination of the concept vectors.
  - 24. A system as defined in claim 22, wherein the predetermined stopping threshold comprises a metric describing the amount of information thrown away by the projection operation.
  - 25. A system as defined in claim 22, wherein the initial set of k disjoint cluster vectors comprises a set of vectors uniformly distributed in the normalized document vector space.
  - 26. A system as defined in claim 22, wherein the initial set of k disjoint cluster vectors comprises a set of vectors randomly distributed in the normalized document vector space.
  - 27. A system as defined in claim 22, further including a search engine that is adapted to receive a search query relating to content of the text documents, perform a search on the projected document vectors for the received query, and identify text documents that correspond to the search query.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Dhillon, Inderjit Singh, Modha, Dharmendra Shantilal
Primary Examiner(s)
Mizrahi, Diane D.

Application Number

US09/528,941
Time in Patent Office

1,141 Days
Field of Search

707/4, 707/5, 707/3, 707/102, 707/6, 709/201
US Class Current

1/1
CPC Class Codes

G06F 16/355   Class or cluster creation o...

Y10S 707/99934   Query formulation, input pr...

Y10S 707/99936   Pattern matching access

Y10S 707/99943   Generating database or data...

Concept decomposition using clustering

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

254 Citations

27 Claims

Specification

Use Cases

Quick Links

Others

Concept decomposition using clustering

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

254 Citations

27 Claims

Specification

Subscription Required

Use Cases

Quick Links

Others