Methods, apparatus and computer program products for information retrieval and document classification utilizing a multidimensional subspace
First Claim
1. A method of retrieving information from a text data collection that comprises a plurality of documents with each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective document, and wherein the method comprises:
- receiving a query;
projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation Ak relating to a term identified by the query;
weighting at least those portions of a subspace representation Ak relating to a term identified by the query following the projection into the lower dimensional subspace;
scoring the plurality of documents with respect to the query based at least partially upon the weighted portion of the subspace representation Ak; and
identifying respective documents based upon relative scores of the documents with respect to the query.
1 Assignment
0 Petitions
Accused Products
Abstract
Methods, apparatus and computer program products are provided for retrieving information from a text data collection and for classifying a document into none, one or more of a plurality of predefined classes. In each aspect, a representation of at least a portion of the original matrix is projected into a lower dimensional subspace and those portions of the subspace representation that relate to the term(s) of the query are weighted following the projection into the lower dimensional subspace. In order to retrieve the documents that are most relevant with respect to a query, the documents are then scored with documents having better scores being of generally greater relevance. Alternatively, in order to classify a document, the relationship of the document to the classes of documents is scored with the document then being classified in those classes, if any, that have the best scores.
459 Citations
47 Claims
-
1. A method of retrieving information from a text data collection that comprises a plurality of documents with each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective document, and wherein the method comprises:
-
receiving a query;
projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation Ak relating to a term identified by the query;
weighting at least those portions of a subspace representation Ak relating to a term identified by the query following the projection into the lower dimensional subspace;
scoring the plurality of documents with respect to the query based at least partially upon the weighted portion of the subspace representation Ak; and
identifying respective documents based upon relative scores of the documents with respect to the query. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A method of classifying a document with respect to a plurality of predefined classes defined by a term-by-class matrix with each predefined class including at least one term, wherein the method comprises:
-
receiving a representation of the document to be classified;
projecting a representation of at least a portion of the term-by-class matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation Ak relating to a term included within the representation of the document to be classified;
weighting at least those portions of the subspace representation Ak relating to a term included within the representation of the document to be classified following the projection into the lower dimensional subspace;
scoring the relationship of the document to each predefined class based at least, partially upon the weighted portion of the subspace representation Ak;
determining if the document is to be classified into any of the plurality of predefined classes based upon the scores of the relationship of the document to each predefined class; and
classifying the document into at least one of the plurality of predefined classes if so determined based upon the scores of the relationship of the document to each predefined class. - View Dependent Claims (8, 9, 10, 11, 12)
-
-
13. A method retrieving information from a text data collection that comprises a plurality of documents with each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective document, and wherein the method comprises:
-
receiving a query;
determining if the query is to be treated as a pseudo-document or as a set of terms;
processing the query depending upon the treatment of the query as a pseudo-document or as a set of terms;
scoring the plurality of documents with respect to the query based upon said processing of the query; and
identifying respective documents based upon relative scores of the documents with respect to the query. - View Dependent Claims (14, 15)
projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation Ak corresponding to a term identified by the query; and
weighting at least those portions of a subspace representation Ak corresponding to a term identified by the query following the projection into the lower dimensional subspace, and wherein said scoring comprises scoring the plurality of documents with respect to the query based at least partially upon the weighted portion of the subspace representation Ak.
-
-
15. A method according to claim 13 wherein the processing of the query in instances in which the query is treated as a pseudo-document comprises:
-
projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace;
projecting a query vector representative of the query into the lower dimensional subspace; and
comparing the projection of the query vector and the representation of at least a portion of the term-by-document matrix, and wherein said scoring comprises scoring the plurality of documents with respect to the query based at least partially upon the comparison of the projection of the query vector and the representation of at least a portion of the term-by-document matrix.
-
-
16. A computer program product for retrieving information from a text data collection that comprises a plurality of documents with each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective document, wherein the computer program product comprises a computer-readable storage medium having computer-readable program code means embodied in said medium, and wherein said computer-readable program code means comprises:
-
first computer-readable program code means for receiving a query;
second computer-readable program code means for projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation Ak relating to a term identified by the query;
third computer-readable program code means for weighting at least those portions of a subspace representation Ak relating to a term identified by the query following the projection into the lower dimensional subspace; and
fourth computer-readable program code means for scoring the plurality of documents with respect to the query based at least partially upon the weighted portion of the subspace representation Ak. - View Dependent Claims (17, 18, 19, 20, 21, 22)
-
-
23. A computer program product for classifying a document with respect to a plurality of predefined classes defined by a term-by-class matrix with each predefined class including at least one term, wherein the computer program product comprises a computer-readable storage medium having computer-readable program code means embodied in said medium, and wherein said computer-readable program code means comprises:
-
first computer-readable program code means for receiving a representation of the document to be classified;
second computer-readable program code means for projecting a representation of at least a portion of the term-by-class matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation Ak relating to a term included within the representation of the document to be classified;
third computer-readable program code means for weighting at least those portions of the subspace representation Ak relating to a term included within the representation of the document to be classified following the projection into the lower dimensional subspace;
fourth computer-readable program code means for scoring the relationship of the document to each predefined class based at least partially upon the weighted portion of the subspace representation Ak; and
fifth computer-readable program code means for determining if the document is to be classified into any of the plurality of predefined classes based upon the scores of the relationship of the document to each predefined class. - View Dependent Claims (24, 25, 26, 27, 28)
-
-
29. A computer program product for retrieving information from a text data collection that comprises a plurality of documents with each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective document, wherein the computer program product comprises a computer-readable storage medium having computer-readable program code means embodied in said medium, and wherein said computer-readable program code means comprises:
-
first computer-readable program code means for receiving a query;
second computer-readable program code means for determining if the query is to be treated as a pseudo-document or as a set of terms;
third computer-readable program code means for processing the query depending upon the treatment of the query as a pseudo-document or as a set of terms; and
fourth computer-readable program code means for scoring the plurality of documents with respect to the query based upon said processing of the query. - View Dependent Claims (30, 31)
fifth computer-readable program code means, operable in instances in which the query is treated as a set of terms, for projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation Ak corresponding to a term identified by the query; and
sixth computer-readable program code means, also operable in instances in which the query is treated as a set of terms, for weighting at least those portions of a subspace representation Ak corresponding to a term identified by the query following the projection into the lower dimensional subspace, and wherein said fourth computer-readable program code means scores the plurality of documents with respect to the query based at least partially upon the weighted portion of the subspace representation Ak in instances in which the query is treated as a set of terms.
-
-
31. A computer program product according to claim 29 wherein said third computer-readable program code means comprises:
-
fifth computer-readable program code means, operable in instances in which the query is treated as a pseudo-document, for projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace;
sixth computer-readable program code means, also operable in instances in which the query is treated as a pseudo-document, for projecting a query vector representative of the query into the lower dimensional subspace; and
seventh computer-readable program code means, further operable in instances in which the query is treated as a pseudo-document, for comparing the projection of the query vector and the representation of at least a portion of the term-by-document matrix, and wherein said fourth computer-readable program code means scores the plurality of documents with respect to the query based at least partially upon the comparison of the projection of the query vector and the representation of at least a portion of the term-by-document matrix in instances in which the query is treated as a pseudo-document.
-
-
32. An apparatus for retrieving information from a text data collection that comprises a plurality of documents with each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective document, and wherein the apparatus comprises:
-
means for receiving a query;
means for projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation Ak relating to a term identified by the query;
means for weighting at least those portions of the subspace representation Ak relating to a term identified by the query following the projection into the lower dimensional subspace; and
means for scoring the plurality of documents with respect to the query based at least partially upon the weighted portion of the subspace representation Ak. - View Dependent Claims (33, 34, 35, 36, 37, 38)
-
-
39. An apparatus for classifying a document with respect to a plurality of predefined classes defined by a term-by-class matrix with each predefined class including at least one term, wherein the apparatus comprises:
-
means for receiving a representation of the document to be classified;
means for projecting a representation of at least a portion of the term-by-class matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation Ak relating to a term included within the representation of the document to be classified;
means for weighting at least those portions of the subspace representation Ak relating to a term included within the representation of the document to be classified following the projection into the lower dimensional subspace;
means for scoring the relationship of the document to each predefined class based at least partially upon the weighted portion of the subspace representation Ak; and
means for determining if the document is to be classified into any of the plurality of predefined classes based upon the scores of the relationship of the document to each predefined class. - View Dependent Claims (40, 41, 42, 43, 44)
-
-
45. An apparatus for retrieving information from a text data collection that comprises a plurality of documents with each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective document, and wherein the apparatus comprises:
-
means for receiving a query;
means for determining if the query is to be treated as a pseudo-document or as a set of terms;
means for processing the query depending upon the treatment of the query as a pseudo-document or as a set of terms; and
means for scoring the plurality of documents with respect to the query based upon said processing of the query. - View Dependent Claims (46, 47)
means, operable in instances in which the query is treated as a set of terms, for projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation Ak corresponding to a term identified by the query; and
means, also operable in instances in which the query is treated as a set of terms, for weighting at least those portions of a subspace representation Ak corresponding to a term identified by the query following the projection into the lower dimensional subspace, and wherein means for scoring scores the plurality of documents with respect to the query based at least partially upon the weighted portion of the subspace representation Ak in instances in which the query is treated as a set of terms.
-
-
47. An apparatus according to claim 45 wherein said means for processing comprises:
-
means, operable in instances in which the query is treated as a pseudo-document, for projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace;
means, also operable in instances in which the query is treated as a pseudo-document, for projecting a query vector representative of the query into the lower dimensional subspace; and
means, further operable in instances in which the query is treated as a pseudo-document, for comparing the projection of the query vector and the representation of at least a portion of the term-by-document matrix, and wherein said means for scoring scores the plurality of documents with respect to the query based at least partially upon the comparison of the projection of the query vector and the representation of at least a portion of the term-by-document matrix in instances in which the query is treated as a pseudo-document.
-
Specification