Method and system for text mining using multidimensional subspaces
First Claim
1. A method of representing a document collection, wherein the document collection comprises a plurality of documents, with each document comprising a plurality of terms, using a:
- subspace projection based on a distribution of the frequency of occurrences of each of the terms in each of the documents, the method comprising;
(a) constructing a term frequency matrix, wherein each entry of the term frequency matrix is the frequency of occurrence of one of the terms in one of the documents;
(b) determining a statistical transformation policy;
(c) statistically transforming the entries of the term frequency matrix according to the statistical transformation policy;
(d) determining a projection type;
(e) determining a lower dimensional subspace; and
(f) generating an original term subspace by projecting the projection type into the lower dimensional subspace.
1 Assignment
0 Petitions
Accused Products
Abstract
A text mining program is provided that allows a user to perform text mining operations, such as: information retrieval, term and document visualization, term and document clustering, term and document classification, summarization of individual documents and groups of documents, and document cross-referencing. This is accomplished by representing the text of a document collection using subspace transformations. This subspace transformation representation is performed by: constructing a term frequency matrix of the term frequencies for each of the documents, transforming the term frequencies for statistical purposes, and projecting the documents or the terms into a lower dimensional subspace. As the document collection is updated, the subspace is dynamically updated to reflect the new document collection.
-
Citations
33 Claims
-
1. A method of representing a document collection, wherein the document collection comprises a plurality of documents, with each document comprising a plurality of terms, using a:
- subspace projection based on a distribution of the frequency of occurrences of each of the terms in each of the documents, the method comprising;
(a) constructing a term frequency matrix, wherein each entry of the term frequency matrix is the frequency of occurrence of one of the terms in one of the documents;
(b) determining a statistical transformation policy;
(c) statistically transforming the entries of the term frequency matrix according to the statistical transformation policy;
(d) determining a projection type;
(e) determining a lower dimensional subspace; and
(f) generating an original term subspace by projecting the projection type into the lower dimensional subspace. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
(a) determining a document length, wherein the document length is the total number of term occurrences in the document;
(b) for each entry, determining a relative frequency of each term by dividing the raw frequency of each term by the document length;
(c) performing an operation on the relative term frequencies to make high frequencies and low frequency terms more comparable;
(d) determining a row average; and
(e) centering data by subtracting the row average from each term frequency.
- subspace projection based on a distribution of the frequency of occurrences of each of the terms in each of the documents, the method comprising;
-
5. The method of claim 1, wherein said constructing of the term frequency matrix comprises defining the term frequency matrix, wherein each row represents a term occurring in one or more of the documents in the document collection, each column represents one of the documents in the document collection, and each entry in the term frequency matrix is the number of occurrences of the term represented by the row of the entry in the document represented by the column of the entry.
-
6. The method of claim 5, wherein said defining of the term frequency matrix comprises generating a term list.
-
7. The method of claim 6, wherein said generating a term list comprises:
-
(a) determining a policy for tokenizing terms;
(b) tokenizing a plurality of terms according to the tokenizing policy;
(c) determining a stopwords policy; and
(d) removing stopwords from the terms represented by rows in the term frequency matrix according to the stopwords policy.
-
-
8. The method of claim 7, further comprising:
-
(a) determining a low frequency words policy; and
(b) removing a plurality of low frequency words according to the low frequency words policy.
-
-
9. The method of claim 8, further comprising:
-
(a) determining an acronym expansion policy; and
(b) performing acronym expansion according to the acronym expansion policy.
-
-
10. The method of claim 9, further comprising:
-
(a) determining an abbreviation expansion policy; and
(b) performing abbreviation expansion according to the abbreviation expansion policy.
-
-
11. The method of claim 10, further comprising:
-
(a) determining a policy for other term normalizations; and
(b) performing other term normalizations according to the other term normalization policy.
-
-
12. The method of claim 11, further comprising:
-
(a) determining a stemming policy; and
(b) performing stemming according to the stemming policy.
-
-
13. The method of claim 12, wherein said projecting the projection type into the lower dimensional subspace comprises:
-
(a) determining a number of dimensions to use;
(b) performing a two-sided orthogonal decomposition of the term frequency matrix according to the determined dimension, the decomposition identifying significant features using three matrices;
a term basis matrix, a weight matrix and a document basis matrix, wherein the term basis matrix and the document basis matrix both have orthonormal columns; and
(c) projecting the projection type into the lower dimensional subspace using the results of the decomposition.
-
-
14. The method of claim 13, wherein the two-sided orthogonal decomposition is a truncated URV decomposition.
-
15. The method of claim 13, wherein the document collection is a dynamically changing document collection, and further comprising updating the original term subspace as the document collection changes.
-
16. The method of claim 15, wherein said updating the original term subspace comprises:
-
(a) identifying a plurality of new documents;
(b) identifying a plurality of new terms in the new documents;
(c) constructing a new term frequency matrix that represents the new documents;
(d) statistically transforming the entries of the new term frequency matrix according to the statistical transformation policy;
(e) projecting the new term frequency matrix on the original term subspace, the term basis matrix;
(f) computing a residual;
(g) augmenting the existing term subspace with the residual;
(h) expanding the original term subspace, wherein expanding the original term subspace comprises expanding the document basis matrix by adding a small identity matrix; and
(i) re-identifying significant features in the subspace.
-
-
17. The method of claim 16, wherein said identifying a plurality of new terms comprises:
-
(a) tokenizing a plurality of terms according to the tokenizing policy;
(b) removing stopwords according to the stopwords policy;
(c) removing low frequency according to the low frequency words policy;
(d) performing acronym expansion according to the acronym expansion policy;
(e) performing abbreviation expansion according to the abbreviation expansion policy;
(f) performing other term normalization according to the other term normalization policy; and
(g) performing stemming according to the stemming policy.
-
-
18. The method of claim 16, wherein said constructing the new term frequency matrix comprises defining the new term frequency matrix, wherein each row represents a term occurring in one or more of the documents, each column represents one of the documents, and each entry in the block consists of the number of occurrences of the term represented by the row of the entry in the document represented by the column of the entry.
-
19. The method of claim 12, further comprising performing a text mining operation, wherein performing the text mining operation is at least one of:
-
(a) providing verbal semantics for at least one dimension of the subspace;
(b) performing document summarization on an individual document;
(c) performing document summarization for a group of documents;
(d) performing information visualization;
(e) performing information retrieval;
(f) performing document cross-referencing;
(g) performing clustering on a clustering projection type; and
(h) performing classification on a classification projection type.
-
-
20. The method of claim 19, wherein providing verbal semantics for the dimension comprises:
-
(a) identifying a column in the term basis matrix that represents a desired dimension;
(b) identifying a plurality of terms in the column with the largest absolute values for positive entries;
(c) identifying a plurality of terms in the column with the largest absolute values for negative entries; and
(d) returning the identified positive entries and the identified negative entries as a contrast set describing the meaning of position along the dimension.
-
-
21. The method of claim 19, wherein performing document summarization on the individual document comprises:
-
(a) projecting the document into a subspace;
(b) projecting the document projection back into the term subspace;
(c) identifying a plurality of terms with the largest entries; and
(d) returning the identified terms as a document summarization.
-
-
22. The method of claim 19, wherein performing document summarization for the group of documents comprises:
-
(a) projecting the group of document into a subspace;
(b) finding a centroid of the group of documents in the subspace;
(c) projecting the centroid back into the term subspace;
(d) identifying a plurality of terms with the largest entries; and
(e) returning the identified terms as a summarization of the group of documents.
-
-
23. The method of claim 19, wherein performing information visualization comprises:
-
(a) in response to a user request for dimensions of visualization, determining a number of dimensions;
(b) computing the requested dimensions;
(c) determining whether the user has requested fewer than the determined number of dimensions;
(d) if the user has requested fewer than the determined number of dimensions, using a set of default dimensions;
(e) determining if the dimensions are orthogonalized;
(f) if the dimensions are not orthogonal, taking an appropriate action based on a user preference, wherein the user preference comprises one of;
(i) orthogonalizing the dimensions; and
(ii) providing an indication that the dimensions are not orthogonal;
(g) projecting the documents onto the dimensions;
(h) generating legends for the dimensions by providing verbal semantics for the dimensions; and
(i) displaying a plurality of indicators, wherein each indicator represents a document, on a plurality of labeled axes corresponding to the dimensions.
-
-
24. The method of claim 19, wherein performing information retrieval comprises:
-
(a) in response to a user query, constructing a term frequency query vector;
(b) statistically transforming the term frequencies according to the statistical transformation policy;
(c) projecting the term frequency vector into a subspace;
(d) determining a similarity measurement between the user query and the each document in the document collection; and
(e) returning a plurality of ranked matching documents based on the determined similarity measurements.
-
-
25. The method of claim 24, wherein said constructing of the term frequency query vector comprises:
-
(a) tokenizing the terms according to the tokenizing policy;
(b) removing stopwords according to the stopwords policy;
(c) removing low frequency words according to the low frequency words policy;
(d) performing acronym expansion according to the acronym expansion policy;
(e) performing abbreviation expansion according to the abbreviation expansion policy;
(f) performing other term normalizations according to the other term normalization policy;
(g) performing stemming according to the stemming policy;
(h) identifying the terms that are in the original term frequency matrix representing the original document collection; and
(i) counting the frequency of the identified terms in the query.
-
-
26. The method of claim 19, wherein performing document cross-referencing comprises:
-
(a) segmenting one or more of the documents into a plurality of suitably sized text units;
(b) constructing a subspace representation for the text units from one or more subsets of the document;
(c) for the subspace representations of each of the text units in one or more documents, determining a similarity measurement between the text unit and the subspace representations of the text units in one or more target documents; and
(d) returning a plurality of ranked matching text units based on the determined similarity measurement.
-
-
27. The method of claim 19, wherein performing clustering on the clustering projection type comprises:
-
(a) using a clustering algorithm;
(b) if the clustering projection type is the documents, clustering the documents according to the distance of their projections in the term subspace and (c) if the clustering projection type is the terms, clustering the terms according to the distance of their projections in the document subspace.
-
-
28. The method of claim 19, wherein performing classification on the classification projection type comprises:
-
(a) using a classification algorithm that accepts real valued features;
(b) if the classification projection type is the documents, using the coordinates of the projections of the documents in the term subspace as the features; and
(c) if the classification projection type is the terms using the coordinates of the terms of the documents in the document subspace as the features.
-
-
29. A method of representing a document collection, wherein the document collection comprises a plurality of documents, with each document comprising a plurality of terms, using a subspace projection based on a distribution of the frequency of occurrences of each of the terms in each of the documents, the method comprising:
-
(a) constructing a term frequency matrix, wherein each entry of the term frequency matrix is the frequency of occurrence of one of the terms in one of the documents;
(b) determining a statistical transformation policy;
(c) statistically transforming the entries of the term frequency matrix according to the statistical transformation policy;
(d) determining a projection type;
(e) determining a lower dimensional subspace;
(f) generating an original term subspace by projecting the projection type into the lower dimensional subspace;
(g) if the document collection is a dynamically changing document collection, updating the original term subspace as the document collection changes; and
(h) performing at least one text mining operation.
-
-
30. A computer readable medium having computer executable instructions for performing the method of:
-
(a) constructing a term frequency matrix, wherein each entry of the term frequency matrix is the frequency of occurrence of one of the terms in one of the documents;
(b) determining a statistical transformation policy;
(c) statistically transforming the term frequencies according to the statistical transformation policy;
(d) determining a projection type;
(e) determining a lower dimensional subspace; and
(f) generating an original term subspace by projecting the projection type into the lower dimensional subspace.
-
-
31. A computer readable medium having computer executable instructions for performing the method of:
-
(a) constructing a term frequency matrix, wherein each entry of the term frequency matrix is the frequency of occurrence of one of the terms in one of the documents;
(b) determining a statistical transformation policy;
(c) statistically transforming the entries of the term frequency matrix according to the statistical transformation policy;
(d) determining a projection type;
(e) determining a lower dimensional subspace;
(f) generating an original term subspace by projecting the projection type into the lower dimensional subspace;
(g) if the document collection is a dynamically changing document collection, updating the original term subspace as the document collection changes; and
(h) performing at least one text mining operation.
-
-
32. A system for representing a document collection using a subspace projection, comprising:
-
(a) a processing unit; and
(b) a storage medium coupled to the processing unit, a storage medium storing program code implemented by the processing unit for;
(i) constructing a term frequency matrix, wherein each entry of the term frequency matrix is the frequency of occurrence of one of the terms in one of the documents;
(ii) determining a statistical transformation policy;
(iii) statistically transforming the term frequencies according to the statistical transformation policy;
(iv) determining a projection type;
(v) determining a lower dimensional subspace; and
(vi) generating an original term subspace by projecting the projection type into the lower dimensional subspace.
-
-
33. A system for representing a document collection using a subspace projection, comprising:
-
(a) a processing unit; and
(b) a storage medium coupled to the processing unit, a storage medium storing program code implemented by the processing unit for;
(i) constructing a term frequency matrix, wherein each entry of the term frequency matrix is the frequency of occurrence of one of the terms in one of the documents;
(ii) determining a statistical transformation policy;
(iii) statistically transforming the entries of the term frequency matrix according to the statistical transformation policy;
(iv) determining a projection type;
(v) determining a lower dimensional subspace;
(vi) generating an original term subspace by projecting the projection type into the lower dimensional subspace;
(vii) if the document collection is a dynamically changing document collection, updating the original term subspace as the document collection changes; and
(viii) performing at least one text mining operation.
-
Specification