Concept decomposition using clustering
First Claim
1. A method of operating a computer system to represent text documents stored in a database collection, comprising:
- representing the text documents in a vector representation format in which there are n documents and d words;
normalizing the document vectors;
determining an initial partitioning of the normalized document vectors comprising a set of k disjoint clusters and determining k cluster vectors, wherein a cluster vector comprises a mean vector of all the normalized document vectors in a partition;
computing a set of K concept vectors based on the initial set of cluster vectors, wherein the concept vectors define a subspace of the document vector space and wherein the subspace spans a part of the document vector space; and
projecting each document vector onto the subspace defined by the concept vectors, thereby defining a set of document concept decomposition vectors that represent the document vector space, with a reduced dimensionality.
1 Assignment
0 Petitions
Accused Products
Abstract
A system and method operates with a document collection in which documents are represented as normalized document vectors. The document vector space is partitioned into a set of disjoint clusters and a concept vector is computed for each partition, the concept vector comprising the mean vector of all the documents in each partition. Documents are then reassigned to the cluster having their closest concept vector, and a new set of concept vectors for the new partitioning is computed. From an initial partitioning, the concept vectors are iteratively calculated to a stopping threshold value, leaving a concept vector subspace of the document vectors. The documents can then be projected onto the concept vector subspace to be represented as a linear combination of the concept vectors, thereby reducing the dimensionality of the document space. A search query can be received for the content of text documents and a search can then be performed on the projected document vectors to identify text documents that correspond to the search query.
254 Citations
27 Claims
-
1. A method of operating a computer system to represent text documents stored in a database collection, comprising:
-
representing the text documents in a vector representation format in which there are n documents and d words;
normalizing the document vectors;
determining an initial partitioning of the normalized document vectors comprising a set of k disjoint clusters and determining k cluster vectors, wherein a cluster vector comprises a mean vector of all the normalized document vectors in a partition;
computing a set of K concept vectors based on the initial set of cluster vectors, wherein the concept vectors define a subspace of the document vector space and wherein the subspace spans a part of the document vector space; and
projecting each document vector onto the subspace defined by the concept vectors, thereby defining a set of document concept decomposition vectors that represent the document vector space, with a reduced dimensionality. - View Dependent Claims (2, 3, 4, 5, 6, 7)
determining the mean vector of the normalized document vectors in each disjoint cluster;
determining the closest concept vector to each normalized document vector in the document vector space;
assigning each of the normalized document vectors to a partition defined by the determined closest concept vector; and
repeating the computing, determining, and assigning operations up to a predetermined stopping threshold, thereby defining a vector subspace that spans the document vector space.
-
-
3. A method as defined in claim 2, wherein projecting onto the subspace comprises representing each document vector as a linear combination of the concept vectors.
-
4. A method as defined in claim 2, wherein the predetermined stopping threshold comprises a metric describing the amount of information thrown away by the projection operation.
-
5. A method as defined in claim 1, wherein the initial set of k disjoint cluster vectors comprises a set of vectors uniformly distributed in the normalized document vector space.
-
6. A method as defined in claim 1, wherein the initial set of k disjoint cluster vectors comprises a set of vectors randomly distributed in the normalized document vector space.
-
7. A method as defined in claim 1, further including:
-
receiving a search query relating to content of the text documents;
performing a search on the projected document vectors for the received query; and
identifying text documents that correspond to the search query.
-
-
8. A computer system that processes text documents to produce a document database representation, the system comprising:
-
a Text Data Repository that contains the text documents; and
a Document Processor that processes the text documents and is adapted to represent the text documents in a vector representation format in which there are n documents and d words, normalize the document vectors, to determine an initial partitioning of the normalized document vectors comprising a set of k disjoint clusters and determining k cluster vectors, wherein a cluster vector comprises a mean vector of all the normalized document vectors in a partition, then to compute a set of k concept vectors based on the initial set of cluster vectors, wherein the concept vectors define a subspace that spans the document vector space, and then to project each document vector onto the subspace defined by the concept vectors, thereby defining a set of document concept decomposition vectors that represent the document vector space, with a reduced dimensionality. - View Dependent Claims (9, 10, 11, 12, 13, 14)
determining the mean vector of the normalized document vectors in each disjoint cluster;
determining the closest concept vector to each normalized document vector in the document vector space;
assigning each of the normalized document vectors to a partition defined by the determined closest concept vector; and
repeating the computing, determining, and assigning operations up to a predetermined stopping threshold, thereby defining a vector subspace that spans a part of the document vector space.
-
-
10. A system as defined in claim 9, wherein the Document Processor projects onto the subspace by representing each document vector as a linear combination of the concept vectors.
-
11. A system as defined in claim 9, wherein the predetermined stopping threshold comprises a metric describing the amount of information thrown away by the projection operation.
-
12. A system as defined in claim 8, wherein the initial set of k disjoint cluster vectors comprises a set of vectors uniformly distributed in the normalized document vector space.
-
13. A system as defined in claim 8, wherein the initial set of k disjoint cluster vectors comprises a set of vectors randomly distributed in the normalized document vector space.
-
14. A system as defined in claim 8, further including a search engine that is adapted to receive a search query relating to content of the text documents, perform a search on the projected document vectors for the received query, and identify text documents that correspond to the search query.
-
15. A program product having a signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method of representing text documents stored in a database collection, the performed method comprising:
-
representing the text documents in a vector representation format in which there are n documents and d words;
normalizing the document vectors;
determining an initial partitioning of the normalized document vectors comprising a set of k disjoint clusters and determining k cluster vectors, wherein a cluster vector comprises a mean vector of all the normalized document vectors in a partition;
computing a set of K concept vectors based on the initial set of cluster vectors, wherein the concept vectors define a subspace of the document vector space and wherein the subspace spans a part of the document vector space; and
projecting each document vector onto the subspace defined by the concept vectors, thereby defining a set of document concept decomposition vectors that represent the document vector space, with a reduced dimensionality. - View Dependent Claims (16, 17, 18, 19, 20, 21)
determining the mean vector of the normalized document vectors in each disjoint cluster;
determining the closest concept vector to each normalized document vector in the document vector space;
assigning each of the normalized document vectors to a partition defined by the determined closest concept vector; and
repeating the computing, determining, and assigning operations up to a predetermined stopping threshold, thereby defining a vector subspace that spans a part of the document vector space.
-
-
17. A program product as defined in claim 16, wherein projecting onto the subspace comprises representing each document vector as a linear combination of the concept vectors.
-
18. A program product as defined in claim 16, wherein the predetermined stopping threshold comprises a metric describing the amount of information thrown away by the projection operation.
-
19. A program product as defined in claim 15, wherein the initial set of k disjoint cluster vectors comprises a set of vectors uniformly distributed in the normalized document vector space.
-
20. A program product as defined in claim 15, wherein the initial set of k disjoint cluster vectors comprises a set of vectors randomly distributed in the normalized document vector space.
-
21. A program product as defined in claim 15, wherein the performed method further includes:
-
receiving a search query relating to content of the text documents;
performing a search on the projected document vectors for the received query; and
identifying text documents that correspond to the search query.
-
-
22. A computer system that processes text documents to produce a document database representation, the system comprising:
-
a Text Data Repository that contains the text documents; and
a Document Processor that processes the text documents and is adapted to represent the text documents in a vector representation format in which there are n documents and d words, normalize the document vectors, to determine an initial partitioning of the normalized document vectors comprising a set of k disjoint clusters and determining k cluster vectors, wherein a cluster vector comprises a mean vector of all the normalized document vectors in a partition, then to compute a set of k concept vectors based on the initial set of cluster vectors;
wherein the Document Processor computes a set of k concept vectors by;
determining the mean vector of the normalized document vectors in each disjoint cluster;
determining the closest concept vector to each normalized document vector in the document vector space;
assigning each of the normalized document vectors to a partition defined by the determined closest concept vector; and
repeating the computing, determining, and assigning operations up to a predetermined stopping threshold, thereby defining a vector subspace that spans a part of the document vector space;
and wherein the Document Processor projects each document vector onto the subspace defined by the concept vectors, thereby defining a set of document concept decomposition vectors that represent the document vector space, with a reduced dimensionality. - View Dependent Claims (23, 24, 25, 26, 27)
-
Specification