System and method for identifying query-relevant keywords in documents with latent semantic analysis
Abstract
A system and method for identifying query-related keywords in documents found in a search using latent semantic analysis. The documents are represented as a document term matrix M containing one or more document term-weight vectors d, which may be term-frequency (tf) vectors or term-frequency inverse-document-frequency (tf-idf) vectors. This matrix is subjected to a truncated singular value decomposition. The resulting transform matrix U can be used to project a query term-weight vector q into the reduced N-dimensional space, followed by its expansion back into the full vector space using the inverse of U.
To perform a search, the similarity of qexpanded to each candidate document vector in this space is measured. Exemplary similarity functions are the dot product and cosine similarity. The keywords selected are the terms with the highest values in qexpanded that also appear in at least one document. Matching keywords from the query may be highlighted in the search results.
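The pipeline summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the patented implementation. The toy documents, the function names, the terms-as-rows layout of M, and the choice of plain term-frequency weights are all assumptions made for illustration. NumPy returns the left singular vectors as columns, so the transform matrix U of the abstract is taken here to be the transpose of the truncated left singular matrix.

```python
import numpy as np

def term_frequency_matrix(docs, vocab):
    """Build a term-by-document tf matrix (rows: terms, columns: documents)."""
    index = {term: i for i, term in enumerate(vocab)}
    M = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.split():
            if word in index:
                M[index[word], j] += 1.0
    return M

def expand_query(M, q, n_dims):
    """Project q into an n_dims-dimensional latent space, then expand it back."""
    left, _, _ = np.linalg.svd(M, full_matrices=False)
    U = left[:, :n_dims].T      # assumed reading of U: shape (n_dims, n_terms)
    q_reduced = U @ q           # projection into the reduced space
    return U.T @ q_reduced      # expansion back into the full term space

def select_keywords(q_expanded, M, vocab, top_n):
    """Highest-weighted terms of q_expanded that occur in at least one document."""
    in_some_doc = np.any(M != 0, axis=1)
    order = np.argsort(-q_expanded)
    return [vocab[i] for i in order if in_some_doc[i]][:top_n]

def cosine_scores(q_expanded, M):
    """Cosine similarity between q_expanded and every document column of M."""
    norms = np.linalg.norm(M, axis=0) * np.linalg.norm(q_expanded)
    return (q_expanded @ M) / np.where(norms == 0, 1.0, norms)

# Hypothetical toy collection: three documents about cars, two about fruit.
docs = [
    "car engine repair",
    "automobile engine service",
    "car automobile dealer",
    "banana fruit salad",
    "fruit banana smoothie",
]
vocab = sorted({word for doc in docs for word in doc.split()})
M = term_frequency_matrix(docs, vocab)

q = np.zeros(len(vocab))
q[vocab.index("car")] = 1.0     # the query contains only the term "car"

q_expanded = expand_query(M, q, n_dims=2)
keywords = select_keywords(q_expanded, M, vocab, top_n=3)
scores = cosine_scores(q_expanded, M)
```

In this toy collection, "automobile" and "engine" receive the same expanded weight as "car" even though the query contains neither, while the fruit-related terms receive essentially zero weight. This is the sense in which the query need not comprise the keywords.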
24 Claims
1. A method for identifying a set of query-relevant keywords comprised in one or more documents obtained from a query, wherein the query need not comprise the keywords, comprising:
a) creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
b) creating an expanded query term-weight vector qexpanded from the query term-weight vector q and the term-weight matrix M; and
c) using the expanded query term-weight vector qexpanded and the document term-weight vectors d, locating a set of keywords identified as terms related to the query and also comprised in at least one of the documents, wherein the query need not comprise the keywords.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
18. A method for identifying a set of query-relevant keywords comprised in one or more documents obtained from a query, wherein the query need not comprise the keywords, comprising:
a) creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
b) projecting a query term-weight vector q corresponding to the query terms into latent semantic space, creating a reduced query term-weight vector qreduced;
c) expanding the query term-weight vector back out of latent semantic space, creating the expanded query term-weight vector qexpanded; and
d) using the expanded query term-weight vector qexpanded and the document term-weight vectors d, locating a set of keywords identified as terms related to the query and also comprised in at least one of the documents, wherein the query need not comprise the keywords.
19. A method for identifying a set of query-relevant keywords comprised in one or more documents obtained from a query, wherein the query need not comprise the keywords, comprising:
a) creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
b) approximating the term-weight matrix M as a product of three matrices U S V′, wherein S is a square diagonal matrix, U is a unitary transform matrix, and V is a unitary matrix;
c) expanding the query term-weight vector back out of latent semantic space, creating the expanded query term-weight vector qexpanded; and
d) using the expanded query term-weight vector qexpanded and the document term-weight vectors d, locating a set of keywords identified as terms related to the query and also comprised in at least one of the documents, wherein the query need not comprise the keywords.
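The dimensions involved in the decomposition and in the projection and expansion steps can be written out as follows. This is a sketch under stated assumptions rather than the notation of the specification: T (number of terms), D (number of documents), N (number of retained dimensions), the terms-as-rows layout of M, and the auxiliary symbol W are introduced here for illustration. W denotes the T × N matrix of leading left singular vectors, and the transform matrix U is read as its transpose, which is the reading under which the product U′Uq is dimensionally consistent.

```latex
\begin{aligned}
M &\approx W\, S_N\, V_N', &\quad W &\in \mathbb{R}^{T \times N},\; S_N \in \mathbb{R}^{N \times N},\; V_N \in \mathbb{R}^{D \times N},\\
U &= W', &\quad U &\in \mathbb{R}^{N \times T},\\
q_{\mathrm{reduced}} &= U q, &\quad q_{\mathrm{reduced}} &\in \mathbb{R}^{N},\\
q_{\mathrm{expanded}} &= U' q_{\mathrm{reduced}} = U' U q, &\quad q_{\mathrm{expanded}} &\in \mathbb{R}^{T}.
\end{aligned}
```

Because the columns of W are orthonormal, U U′ is the N × N identity, whereas U′U is only a projection onto an N-dimensional subspace of the term space when N < T.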
20. In a computer, a method for identifying a set of query-relevant keywords comprised in one or more documents obtained from a query, wherein the query need not comprise the keywords, comprising:
a) receiving by the user a term-weight matrix M created by the computer and comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
b) receiving by the user an expanded query term-weight vector qexpanded created by the computer from the query term-weight vector q and the term-weight matrix M; and
c) receiving by the user a set of keywords identified by the computer, using the expanded query term-weight vector qexpanded and the document term-weight vectors d, as terms related to the query and also comprised in at least one of the documents, wherein the query need not comprise the keywords.
21. In a computer, a method for identifying a set of query-relevant keywords comprised in one or more documents, wherein the query need not comprise the keywords, comprising:
a) receiving by the user a term-weight matrix M comprising information on the frequency in the one or more documents of terms in the selected query and approximated by the computer as a product of three matrices U S V′, wherein S is a square diagonal matrix, U is a unitary transform matrix, and V is a unitary matrix;
b) receiving by the user an expanded query term-weight vector qexpanded created by the computer using a query term-weight vector q according to the equation qexpanded = U′Uq; and
c) receiving by the user a set of keywords identified by the computer, using the expanded query term-weight vector qexpanded and the document term-weight vectors d, as terms related to the query and also comprised in at least one of the documents, wherein the query need not comprise the keywords.
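A short numerical check can make the role of truncation in qexpanded = U′Uq concrete. The random matrix, its size, and the helper name `expand` below are hypothetical, and U is again assumed to be the N × T transpose of NumPy's truncated left singular matrix. Under that assumption the expansion returns q unchanged when no dimensions are dropped, and acts as a projection when N is smaller than the number of terms.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((6, 4))                 # hypothetical 6-term, 4-document weight matrix
q = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# full_matrices=True yields a 6 x 6 orthogonal matrix of left singular vectors.
left, _, _ = np.linalg.svd(M, full_matrices=True)

def expand(q, n_dims):
    """Compute U' U q with U taken as the first n_dims left singular vectors, transposed."""
    U = left[:, :n_dims].T             # shape (n_dims, 6)
    return U.T @ (U @ q)

full = expand(q, 6)                    # no truncation: U' U is the identity
truncated = expand(q, 2)               # truncation: weight spreads onto other terms
again = expand(truncated, 2)           # projecting a second time changes nothing
```

Only the truncated case assigns nonzero weight to terms absent from the query, which is why the decomposition is truncated before the query is expanded.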
22. A system for identifying a set of query-relevant keywords comprised in one or more selected documents, wherein the query need not comprise the keywords, comprising:
a) one or more processors capable of creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
capable of creating an expanded query term-weight vector qexpanded from the query term-weight vector q and the term-weight matrix M; and
capable of, using the expanded query term-weight vector qexpanded and the document term-weight vectors d, locating a set of keywords identified as terms related to the query and also comprised in at least one of the documents, wherein the query need not comprise the keywords; and
b) a machine readable medium including operations stored thereon that when processed by the one or more processors cause a system to perform the steps of creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
creating an expanded query term-weight vector qexpanded from the query term-weight vector q and the term-weight matrix M; and
using the expanded query term-weight vector qexpanded and the document term-weight vectors d, locating a set of keywords identified as terms related to the query and also comprised in at least one of the documents, wherein the query need not comprise the keywords.
23. A program of instructions executable by a computer to perform a method for identifying a set of query-relevant keywords comprised in one or more selected documents, wherein the query need not comprise the keywords, comprising the steps of:
a) creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
b) creating an expanded query term-weight vector qexpanded from the query term-weight vector q and the term-weight matrix M; and
c) using the expanded query term-weight vector qexpanded and the document term-weight vectors d, locating a set of keywords identified as terms related to the query and also comprised in at least one of the documents, wherein the query need not comprise the keywords.
24. A system or apparatus for identifying a set of query-relevant keywords comprised in one or more selected documents, wherein the query need not comprise the keywords, comprising:
a) means for creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in the one or more documents of terms in the selected query;
b) means for creating an expanded query term-weight vector qexpanded from the query term-weight vector q and the term-weight matrix M; and
c) means for locating, using the expanded query term-weight vector qexpanded and the term-weight matrix M, a set of keywords identified as terms related to the query and also comprised in at least one of the documents, wherein the query need not comprise the keywords.
Specification