Methods, apparatus and computer program products for information retrieval and document classification utilizing a multidimensional subspace

US 6,701,305 B1
Filed: 10/20/2000
Issued: 03/02/2004
Est. Priority Date: 06/09/1999
Status: Expired due to Term

Abstract

Methods, apparatus and computer program products are provided for retrieving information from a text data collection and for classifying a document into none, one or more of a plurality of predefined classes. In each aspect, a representation of at least a portion of the original matrix is projected into a lower dimensional subspace and those portions of the subspace representation that relate to the term(s) of the query are weighted following the projection into the lower dimensional subspace. In order to retrieve the documents that are most relevant with respect to a query, the documents are then scored with documents having better scores being of generally greater relevance. Alternatively, in order to classify a document, the relationship of the document to the classes of documents is scored with the document then being classified in those classes, if any, that have the best scores.

459 Citations

47 Claims

1. A method of retrieving information from a text data collection that comprises a plurality of documents with each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective document, and wherein the method comprises:
- receiving a query;
  
  projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation A_k relating to a term identified by the query;
  
  weighting at least those portions of a subspace representation A_k relating to a term identified by the query following the projection into the lower dimensional subspace;
  
  scoring the plurality of documents with respect to the query based at least partially upon the weighted portion of the subspace representation A_k; and
  
  identifying respective documents based upon relative scores of the documents with respect to the query.
- Dependent claims (2, 3, 4, 5, 6)
  - 2. A method according to claim 1 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said weighting comprises determining an inverse infinity norm of the term.
  - 3. A method according to claim 1 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said weighting comprises determining an inverse 1-norm of the term.
  - 4. A method according to claim 1 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said weighting comprises determining an inverse 2-norm of the term.
  - 5. A method according to claim 1 further comprising weighting the term-by-document matrix on a document-by-document basis prior to the projection into the lower dimensional subspace.
  - 6. A method according to claim 1 wherein the projection into the lower dimensional subspace comprises obtaining an orthogonal decomposition of the representation of the term-by-document matrix into a k-dimensional subspace.
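
The steps recited in claims 1 to 6 can be illustrated with a minimal sketch. It assumes a truncated singular value decomposition as the orthogonal decomposition and unit-length column scaling as the document-by-document weighting; the claims name neither choice. The function names (`weight_documents`, `project`, `inverse_row_norms`, `score_documents`) and the sample matrix are hypothetical and are not taken from the patent.

```python
import numpy as np

def weight_documents(A):
    """Scale each document column to unit 2-norm before projection.
    Assumed form of the document-by-document weighting of claim 5."""
    norms = np.linalg.norm(A, axis=0)
    return A / np.where(norms > 0, norms, 1.0)

def project(A, k):
    """Rank-k subspace representation A_k of A.
    Assumption: truncated SVD stands in for the orthogonal decomposition."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def inverse_row_norms(A_k, rows, norm_ord=2):
    """Inverse norm of each selected term row; norm_ord is 1, 2 or np.inf."""
    norms = np.linalg.norm(A_k[rows, :], ord=norm_ord, axis=1)
    # An all-zero row gets weight 1; it contributes nothing to the score anyway.
    return 1.0 / np.where(norms > 0, norms, 1.0)

def score_documents(A, query_terms, k, norm_ord=2):
    """Score every document as the weighted sum of the query-term rows of A_k."""
    A_k = project(A, k)
    rows = np.asarray(query_terms, dtype=int)
    weights = inverse_row_norms(A_k, rows, norm_ord)
    return weights @ A_k[rows, :]

# Hypothetical 4-term by 3-document frequency matrix.
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 3.0],
              [1.0, 1.0, 0.0]])
scores = score_documents(weight_documents(A), query_terms=[0, 3], k=2, norm_ord=np.inf)
ranking = np.argsort(-scores)  # document indices, best score first
```

Weighting the rows after the projection, rather than before it, keeps a term with large entries in A_k from dominating the score merely because of its scale.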

7. A method of classifying a document with respect to a plurality of predefined classes defined by a term-by-class matrix with each predefined class including at least one term, wherein the method comprises:
- receiving a representation of the document to be classified;
  
  projecting a representation of at least a portion of the term-by-class matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation A_k relating to a term included within the representation of the document to be classified;
  
  weighting at least those portions of the subspace representation A_k relating to a term included within the representation of the document to be classified following the projection into the lower dimensional subspace;
  
  scoring the relationship of the document to each predefined class based at least partially upon the weighted portion of the subspace representation A_k;
  
  determining if the document is to be classified into any of the plurality of predefined classes based upon the scores of the relationship of the document to each predefined class; and
  
  classifying the document into at least one of the plurality of predefined classes if so determined based upon the scores of the relationship of the document to each predefined class.
- Dependent claims (8, 9, 10, 11, 12)
  - 8. A method according to claim 7 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said weighting comprises determining an inverse infinity norm of the term.
  - 9. A method according to claim 7 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said weighting comprises determining an inverse 1-norm of the term.
  - 10. A method according to claim 7 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said weighting comprises determining an inverse 2-norm of the term.
  - 11. A method according to claim 7 further comprising weighting the term-by-class matrix on a class-by-class basis prior to the projection into the lower dimensional subspace.
  - 12. A method according to claim 7 wherein the projection into the lower dimensional subspace comprises obtaining an orthogonal decomposition of the representation of the term-by-class matrix into a k-dimensional subspace.
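
The classification steps of claims 7 to 12 follow the same pattern with a term-by-class matrix in place of the term-by-document matrix. The sketch below assumes a truncated SVD for the projection and a fixed score threshold for deciding class membership; the claims only require that the decision be based on the scores, so the `threshold` parameter, the function name `classify` and the sample matrix are hypothetical.

```python
import numpy as np

def classify(C, doc_terms, k, threshold, norm_ord=2):
    """Return (class indices, scores) for a document given as a list of term indices.

    C is a term-by-class matrix. The document may land in no class, one class,
    or several, depending on which scores reach the assumed threshold.
    """
    # Projection into a k-dimensional subspace (assumption: truncated SVD).
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    C_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

    # Weight only the rows for terms present in the document, after projection.
    rows = np.asarray(doc_terms, dtype=int)
    norms = np.linalg.norm(C_k[rows, :], ord=norm_ord, axis=1)
    weights = 1.0 / np.where(norms > 0, norms, 1.0)

    # One score per predefined class.
    scores = weights @ C_k[rows, :]
    classes = [j for j, sc in enumerate(scores) if sc >= threshold]
    return classes, scores

# Hypothetical 3-term by 2-class matrix.
C = np.array([[3.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
classes, scores = classify(C, doc_terms=[2], k=2, threshold=0.5, norm_ord=np.inf)
```

Raising the threshold makes the "no class" outcome more likely, which is one way to realize the "none, one or more" behavior described in the abstract.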

13. A method of retrieving information from a text data collection that comprises a plurality of documents with each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective document, and wherein the method comprises:
- receiving a query;
  
  determining if the query is to be treated as a pseudo-document or as a set of terms;
  
  processing the query depending upon the treatment of the query as a pseudo-document or as a set of terms;
  
  scoring the plurality of documents with respect to the query based upon said processing of the query; and
  
  identifying respective documents based upon relative scores of the documents with respect to the query.
- Dependent claims (14, 15)
  - 14. A method according to claim 13 wherein the processing of the query in instances in which the query is treated as a set of terms comprises:
15. A method according to claim 13 wherein the processing of the query in instances in which the query is treated as a pseudo-document comprises:
- projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace;
  
  projecting a query vector representative of the query into the lower dimensional subspace; and
  
  comparing the projection of the query vector and the representation of at least a portion of the term-by-document matrix, and wherein said scoring comprises scoring the plurality of documents with respect to the query based at least partially upon the comparison of the projection of the query vector and the representation of at least a portion of the term-by-document matrix.
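
The two query treatments of claims 13 to 15 can be sketched as a single dispatch function. Several parts are assumptions: the claims do not say how the treatment is chosen, so the term-count rule (`max_terms_as_set`) is hypothetical; the set-of-terms branch reuses the weighted-row scoring recited in claim 1; the pseudo-document branch compares projections by cosine similarity; and a truncated SVD stands in for the projection. The function name `score_query` and the sample data are illustrative only.

```python
import numpy as np

def score_query(A, query_counts, k, max_terms_as_set=3):
    """Score documents for a query given as a term-frequency vector.

    Returns (treatment, scores). The rule for choosing the treatment
    (number of distinct query terms) is an assumption for illustration.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    q = np.asarray(query_counts, dtype=float)
    rows = np.flatnonzero(q)

    if len(rows) <= max_terms_as_set:
        # Set of terms: weight the query-term rows of A_k by inverse 2-norm.
        A_k_rows = (U_k[rows, :] * s_k) @ Vt_k
        norms = np.linalg.norm(A_k_rows, axis=1)
        weights = 1.0 / np.where(norms > 0, norms, 1.0)
        return "terms", weights @ A_k_rows

    # Pseudo-document: project the query vector and compare with the documents.
    q_hat = U_k.T @ q              # query coordinates in the subspace
    D_hat = s_k[:, None] * Vt_k    # document coordinates, one column per document
    denom = np.linalg.norm(q_hat) * np.linalg.norm(D_hat, axis=0)
    return "pseudo-document", (q_hat @ D_hat) / np.where(denom > 0, denom, 1.0)

# Hypothetical 4-term by 3-document frequency matrix.
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 3.0],
              [1.0, 1.0, 0.0]])
treatment, scores = score_query(A, query_counts=[1, 0, 0, 0], k=3)
```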

16. A computer program product for retrieving information from a text data collection that comprises a plurality of documents with each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective document, wherein the computer program product comprises a computer-readable storage medium having computer-readable program code means embodied in said medium, and wherein said computer-readable program code means comprises:
- first computer-readable program code means for receiving a query;
  
  second computer-readable program code means for projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation A_k relating to a term identified by the query;
  
  third computer-readable program code means for weighting at least those portions of a subspace representation A_k relating to a term identified by the query following the projection into the lower dimensional subspace; and
  
  fourth computer-readable program code means for scoring the plurality of documents with respect to the query based at least partially upon the weighted portion of the subspace representation A_k.
- Dependent claims (17, 18, 19, 20, 21, 22)
  - 17. A computer program product according to claim 16 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said third computer-readable program code means determines an inverse infinity norm of the term.
  - 18. A computer program product according to claim 16 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said third computer-readable program code means determines an inverse 1-norm of the term.
  - 19. A computer program product according to claim 16 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said third computer-readable program code means determines an inverse 2-norm of the term.
  - 20. A computer program product according to claim 16 further comprising fifth computer-readable program code means for weighting the term-by-document matrix on a document-by-document basis prior to the projection into the lower dimensional subspace.
  - 21. A computer program product according to claim 16 wherein said second computer-readable program code means obtains an orthogonal decomposition of the representation of the term-by-document matrix into a k-dimensional subspace.
  - 22. A computer program product according to claim 16 further comprising sixth computer-readable program code means for identifying respective documents based upon relative scores of the documents with respect to the query.

23. A computer program product for classifying a document with respect to a plurality of predefined classes defined by a term-by-class matrix with each predefined class including at least one term, wherein the computer program product comprises a computer-readable storage medium having computer-readable program code means embodied in said medium, and wherein said computer-readable program code means comprises:
- first computer-readable program code means for receiving a representation of the document to be classified;
  
  second computer-readable program code means for projecting a representation of at least a portion of the term-by-class matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation A_k relating to a term included within the representation of the document to be classified;
  
  third computer-readable program code means for weighting at least those portions of the subspace representation A_k relating to a term included within the representation of the document to be classified following the projection into the lower dimensional subspace;
  
  fourth computer-readable program code means for scoring the relationship of the document to each predefined class based at least partially upon the weighted portion of the subspace representation A_k; and
  
  fifth computer-readable program code means for determining if the document is to be classified into any of the plurality of predefined classes based upon the scores of the relationship of the document to each predefined class.
- Dependent claims (24, 25, 26, 27, 28)
  - 24. A computer program product according to claim 23 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said third computer-readable program code means determines an inverse infinity norm of the term.
  - 25. A computer program product according to claim 23 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said third computer-readable program code means determines an inverse 1-norm of the term.
  - 26. A computer program product according to claim 23 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said third computer-readable program code means determines an inverse 2-norm of the term.
  - 27. A computer program product according to claim 23 further comprising sixth computer-readable program code means for weighting the term-by-class matrix on a class-by-class basis prior to the projection into the lower dimensional subspace.
  - 28. A computer program product according to claim 23 wherein said second computer-readable program code means obtains an orthogonal decomposition of the representation of the term-by-class matrix into a k-dimensional subspace.

29. A computer program product for retrieving information from a text data collection that comprises a plurality of documents with each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective document, wherein the computer program product comprises a computer-readable storage medium having computer-readable program code means embodied in said medium, and wherein said computer-readable program code means comprises:
- first computer-readable program code means for receiving a query;
  
  second computer-readable program code means for determining if the query is to be treated as a pseudo-document or as a set of terms;
  
  third computer-readable program code means for processing the query depending upon the treatment of the query as a pseudo-document or as a set of terms; and
  
  fourth computer-readable program code means for scoring the plurality of documents with respect to the query based upon said processing of the query.
- Dependent claims (30, 31)
  - 30. A computer program product according to claim 29 wherein said third computer-readable program code means comprises:
31. A computer program product according to claim 29 wherein said third computer-readable program code means comprises:
- fifth computer-readable program code means, operable in instances in which the query is treated as a pseudo-document, for projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace;
  
  sixth computer-readable program code means, also operable in instances in which the query is treated as a pseudo-document, for projecting a query vector representative of the query into the lower dimensional subspace; and
  
  seventh computer-readable program code means, further operable in instances in which the query is treated as a pseudo-document, for comparing the projection of the query vector and the representation of at least a portion of the term-by-document matrix, and wherein said fourth computer-readable program code means scores the plurality of documents with respect to the query based at least partially upon the comparison of the projection of the query vector and the representation of at least a portion of the term-by-document matrix in instances in which the query is treated as a pseudo-document.

32. An apparatus for retrieving information from a text data collection that comprises a plurality of documents with each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective document, and wherein the apparatus comprises:
- means for receiving a query;
  
  means for projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation A_k relating to a term identified by the query;
  
  means for weighting at least those portions of the subspace representation A_k relating to a term identified by the query following the projection into the lower dimensional subspace; and
  
  means for scoring the plurality of documents with respect to the query based at least partially upon the weighted portion of the subspace representation A_k.
- Dependent claims (33, 34, 35, 36, 37, 38)
  - 33. An apparatus according to claim 32 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said means for weighting comprises means for determining an inverse infinity norm of the term.
  - 34. An apparatus according to claim 32 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said means for weighting comprises means for determining an inverse 1-norm of the term.
  - 35. An apparatus according to claim 32 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said means for weighting comprises means for determining an inverse 2-norm of the term.
  - 36. An apparatus according to claim 32 further comprising means for weighting the term-by-document matrix on a document-by-document basis prior to the projection into the lower dimensional subspace.
  - 37. An apparatus according to claim 32 wherein said means for projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace comprises means for obtaining an orthogonal decomposition of the representation of the term-by-document matrix into a k-dimensional subspace.
  - 38. An apparatus according to claim 32 further comprising means for identifying respective documents based upon relative scores of the documents with respect to the query.

39. An apparatus for classifying a document with respect to a plurality of predefined classes defined by a term-by-class matrix with each predefined class including at least one term, wherein the apparatus comprises:
- means for receiving a representation of the document to be classified;
  
  means for projecting a representation of at least a portion of the term-by-class matrix into a lower dimensional subspace to thereby create at least those portions of a subspace representation A_k relating to a term included within the representation of the document to be classified;
  
  means for weighting at least those portions of the subspace representation A_k relating to a term included within the representation of the document to be classified following the projection into the lower dimensional subspace;
  
  means for scoring the relationship of the document to each predefined class based at least partially upon the weighted portion of the subspace representation A_k; and
  
  means for determining if the document is to be classified into any of the plurality of predefined classes based upon the scores of the relationship of the document to each predefined class.
- Dependent claims (40, 41, 42, 43, 44)
  - 40. An apparatus according to claim 39 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said means for weighting comprises means for determining an inverse infinity norm of the term.
  - 41. An apparatus according to claim 39 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said means for weighting comprises means for determining an inverse 1-norm of the term.
  - 42. An apparatus according to claim 39 wherein the subspace representation A_k includes a plurality of rows corresponding to respective terms, and wherein said means for weighting comprises means for determining an inverse 2-norm of the term.
  - 43. An apparatus according to claim 39 further comprising means for weighting the term-by-class matrix on a class-by-class basis prior to the projection into the lower dimensional subspace.
  - 44. An apparatus according to claim 39 wherein said means for projecting a representation of at least a portion of the term-by-class matrix into a lower dimensional subspace comprises means for obtaining an orthogonal decomposition of the representation of the term-by-class matrix into a k-dimensional subspace.

45. An apparatus for retrieving information from a text data collection that comprises a plurality of documents with each document comprised of a plurality of terms, wherein the text data collection is represented by a term-by-document matrix having a plurality of entries with each entry being the frequency of occurrence of a term in a respective document, and wherein the apparatus comprises:
- means for receiving a query;
  
  means for determining if the query is to be treated as a pseudo-document or as a set of terms;
  
  means for processing the query depending upon the treatment of the query as a pseudo-document or as a set of terms; and
  
  means for scoring the plurality of documents with respect to the query based upon said processing of the query.
- Dependent claims (46, 47)
  - 46. An apparatus according to claim 45 wherein said means for processing comprises:
47. An apparatus according to claim 45 wherein said means for processing comprises:
- means, operable in instances in which the query is treated as a pseudo-document, for projecting a representation of at least a portion of the term-by-document matrix into a lower dimensional subspace;
  
  means, also operable in instances in which the query is treated as a pseudo-document, for projecting a query vector representative of the query into the lower dimensional subspace; and
  
  means, further operable in instances in which the query is treated as a pseudo-document, for comparing the projection of the query vector and the representation of at least a portion of the term-by-document matrix, and wherein said means for scoring scores the plurality of documents with respect to the query based at least partially upon the comparison of the projection of the query vector and the representation of at least a portion of the term-by-document matrix in instances in which the query is treated as a pseudo-document.

Current Assignee: The Boeing Co.
Original Assignee: The Boeing Co.
Inventors: Wu, Yuan-Jye; Kao, Anne Shu-Wan; Holt, Fredrick Baden; Pierce, Daniel John; Poteet, Stephen Robert
Primary Examiner(s): Starks, Jr., Wilbert L.
Assistant Examiner(s): Hirl, Joseph P
Application Number: US09/693,114
Time in Patent Office: 1,229 Days
Field of Search: 706/45, 706/12, 706/46
US Class Current: 706/45
CPC Class Codes:
- G06F 16/31: Indexing; Data structures t...
- G06F 16/313: Selection or weighting of t...
- G06F 16/3334: Selection or weighting of t...
- G06F 16/334: Query execution G06F16/335 ...
- G06F 16/3347: using vector based model
- G06F 16/353: into predefined classes
