System and method for identifying query-relevant keywords in documents with latent semantic analysis
Abstract
A system and method for identifying query-related keywords in documents found in a search using latent semantic analysis. The documents are represented as a document term matrix M containing one or more document term-weight vectors d, which may be term-frequency (tf) vectors or term-frequency inverse-document-frequency (tf-idf) vectors. This matrix is subjected to a truncated singular value decomposition. The resulting transform matrix U can be used to project a query term-weight vector q into the reduced N-dimensional space, followed by its expansion back into the full vector space using the inverse of U.
To perform a search, the similarity of qexpanded is measured relative to each candidate document vector in this space. Exemplary similarity functions are dot product and cosine similarity. The keywords selected are the terms with the highest values in qexpanded that also occur in at least one document. Matching keywords from the query may be highlighted in the search results.
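The pipeline described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: the toy corpus, the helper names `tf_vector`, `expand_query`, and `cosine`, and the choice of two latent dimensions are all assumptions made for the example. Plain term frequency is used as the term weight, and the transpose of the truncated U stands in for "the inverse of U", since the truncated matrix has orthonormal columns.

```python
import numpy as np

# Hypothetical toy corpus (illustrative only, not from the patent).
docs = [
    "car engine repair",
    "automobile engine parts",
    "car automobile dealer",
    "fruit apple banana",
]
vocab = sorted({w for d in docs for w in d.split()})
index = {t: i for i, t in enumerate(vocab)}


def tf_vector(text):
    """Term-frequency vector over the fixed vocabulary."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v


# Term-weight matrix M: one row per term, one column per document.
M = np.column_stack([tf_vector(d) for d in docs])


def expand_query(M, q, n_dims):
    """Project q into the reduced space, then expand it back."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    U_k = U[:, :n_dims]      # truncated transform matrix
    q_reduced = U_k.T @ q    # projection into latent space
    return U_k @ q_reduced   # expansion back to term space


def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


q = tf_vector("car")
q_expanded = expand_query(M, q, n_dims=2)
scores = [cosine(q_expanded, M[:, j]) for j in range(M.shape[1])]
```

In this toy corpus the query contains only "car", yet `q_expanded` also assigns positive weight to "automobile" and "engine" because those terms co-occur with "car" across the first three documents. The unrelated fourth document receives a similarity of essentially zero.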
20 Claims
1. A method for identifying a set of query-relevant keywords comprised in one or more documents obtained from a query, wherein the query need not comprise the keywords, comprising:
a) creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
b) creating an expanded query term-weight vector qexpanded from the query term-weight vector q and the term-weight matrix M, wherein the expanded query term-weight vector qexpanded identifies terms related to the query, wherein the query need not comprise the terms within the expanded query term-weight vector qexpanded;
c) creating a keyword vector using the document term weight vectors d and the expanded query term-weight vector qexpanded, wherein the keyword vector identifies terms from the expanded query term-weight vector qexpanded that do not occur in at least one of the one or more documents, each term included in the keyword vector having a ranking value;
d) generating an ordered list of document terms corresponding to the terms included in the keyword vector that occur in at least one of the one or more documents;
e) selecting keywords from the ordered list of document terms having the highest ranking value; and
f) highlighting the keywords in the one or more documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
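Steps c) through f) of claim 1 can be illustrated with a short Python sketch. It is one possible reading, not the claimed method itself: it follows steps d) to f) and the abstract in keeping only expanded-query terms that occur in at least one document, uses the corresponding entry of qexpanded as the ranking value, and marks matches with square brackets. The function names `select_keywords` and `highlight`, the `top_k` parameter, and the bracket markers are hypothetical.

```python
import re
import numpy as np


def select_keywords(q_expanded, M, vocab, top_k=3):
    """Rank expanded-query terms that occur in at least one document."""
    # A term occurs in the collection if its row of M has a nonzero entry.
    occurs = (M > 0).any(axis=1)
    # Keyword vector: ranking value from q_expanded, zero for absent terms.
    keyword_vec = np.where(occurs, q_expanded, 0.0)
    # Ordered list of document terms, highest ranking value first.
    order = np.argsort(-keyword_vec, kind="stable")
    ranked = [vocab[i] for i in order if keyword_vec[i] > 0]
    return ranked[:top_k]


def highlight(text, keywords, left="[", right="]"):
    """Wrap whole-word, case-insensitive keyword matches in markers."""
    if not keywords:
        return text
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: left + m.group(0) + right, text)


# Hypothetical data: four terms, two documents; "truck" occurs in neither.
vocab = ["automobile", "car", "engine", "truck"]
M = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
q_expanded = np.array([0.4, 0.5, 0.3, 0.9])

keywords = select_keywords(q_expanded, M, vocab, top_k=2)
marked = highlight("The car has an automobile engine", keywords)
```

Although "truck" has the largest value in `q_expanded`, it is dropped because no document contains it, so the selected keywords are "car" and "automobile".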
16. A method for identifying a set of query-relevant keywords comprised in one or more documents obtained from a query, wherein the query need not comprise the keywords, comprising:
a) creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
b) projecting a query term-weight vector q corresponding to the query terms into latent semantic space, creating a reduced query term-weight vector qreduced;
c) expanding the query term-weight vector back out of latent semantic space, creating the expanded query term-weight vector qexpanded, wherein the expanded query term-weight vector qexpanded identifies terms related to the query, wherein the query need not comprise the terms within the expanded query term-weight vector qexpanded;
d) creating a keyword vector using the document term weight vectors d and the expanded query term-weight vector qexpanded, wherein the keyword vector identifies terms from the expanded query term-weight vector qexpanded that do not occur in at least one of the one or more documents, each term included in the keyword vector has a ranking value;
e) generating an ordered list of document terms corresponding to the terms included in the keyword vector that occur in at least one of the one or more documents;
f) selecting keywords from the ordered list of document terms having the highest ranking value; and
g) highlighting the keywords in the one or more documents.
17. A method for identifying a set of query-relevant keywords comprised in one or more documents obtained from a query, wherein the query need not comprise the keywords, comprising:
a) creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
b) approximating the term-weight matrix M as a product of three matrices U S V′, wherein S is a square diagonal matrix, U is a unitary transform matrix, and V is a unitary matrix;
c) expanding the query term-weight vector back out of latent semantic space, creating the expanded query term-weight vector qexpanded, wherein the expanded query term-weight vector qexpanded identifies terms related to the query, wherein the query need not comprise the terms within the expanded query term-weight vector qexpanded; and
d) creating a keyword vector using the document term weight vectors d and the expanded query term-weight vector qexpanded, wherein the keyword vector identifies terms from the expanded query term-weight vector qexpanded that do not occur in at least one of the one or more documents, each term included in the keyword vector has a ranking value;
e) generating an ordered list of document terms corresponding to the terms included in the keyword vector that occur in at least one of the one or more documents;
f) selecting keywords from the ordered list of document terms having the highest ranking value; and
g) highlighting the keywords in the one or more documents.
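The factorization in step b) of claim 17 can be written out with explicit dimensions. The notation below is an assumption added for illustration: t is the number of terms, n the number of documents, and N the number of retained latent dimensions (the abstract's "reduced N-dimensional space"); the subscript N marks truncated factors.

```latex
\[
  M \;\approx\; U_N \, S_N \, V_N^{\prime},
  \qquad
  M \in \mathbb{R}^{t \times n},\quad
  U_N \in \mathbb{R}^{t \times N},\quad
  S_N = \operatorname{diag}(\sigma_1, \dots, \sigma_N),\quad
  V_N \in \mathbb{R}^{n \times N},
\]
\[
  \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_N \ge 0,
  \qquad
  U_N^{\prime} U_N = I_N,
  \qquad
  V_N^{\prime} V_N = I_N .
\]
```

When N is smaller than the rank of M, the truncated factors U_N and V_N have orthonormal columns but are not square, so they are unitary only in the sense that their columns are orthonormal.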
18. In a computer, a method for identifying a set of query-relevant keywords comprised in one or more documents obtained from a query, wherein the query need not comprise the keywords, comprising:
a) receiving by the user a term-weight matrix M created by the computer and comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
b) receiving by the user an expanded query term-weight vector qexpanded created by the computer from the query term-weight vector q and the term-weight matrix M, wherein the expanded query term-weight vector qexpanded identifies terms related to the query, wherein the query need not comprise the terms within the expanded query term-weight vector qexpanded;
c) receiving by the user a keyword vector created by the computer using the document term weight vectors d and the expanded query term-weight vector qexpanded, wherein the keyword vector identifies terms from the expanded query term-weight vector qexpanded that do not occur in at least one of the one or more documents, each term included in the keyword vector has a ranking value;
d) receiving by the user an ordered list of document terms corresponding to the terms included in the keyword vector that occur in at least one of the one or more documents, the ordered list of document terms being created by the computer;
e) receiving by the user keywords from the ordered list of document terms having the highest ranking value, the keywords having the highest rank selected by the computer; and
f) receiving by the user highlighted keywords in the one or more documents.
19. In a computer, a method for identifying a set of query-relevant keywords comprised in one or more documents, wherein the query need not comprise the keywords, comprising:
a) receiving by the user a term-weight matrix M comprising information on the frequency in the one or more documents of terms in the selected query and approximated by the computer as a product of three matrices U S V′, wherein S is a square diagonal matrix, U is a unitary transform matrix, and V is a unitary matrix;
b) receiving by the user an expanded query term-weight vector qexpanded created by the computer using a query term-weight vector q according to the equation qexpanded = U*(U′*q), wherein the expanded query term-weight vector qexpanded identifies terms related to the query, wherein the query need not comprise the terms within the expanded query term-weight vector qexpanded;
c) receiving by the user a keyword vector created by the computer using the document term weight vectors d and the expanded query term-weight vector qexpanded, wherein the keyword vector identifies terms from the expanded query term-weight vector qexpanded that do not occur in at least one of the one or more documents, each term included in the keyword vector has a ranking value;
d) receiving by the user an ordered list of document terms corresponding to the terms included in the keyword vector that occur in at least one of the one or more documents, the ordered list of document terms being created by the computer;
e) receiving by the user keywords from the ordered list of document terms having the highest ranking value, the keywords having the highest rank selected by the computer; and
f) receiving by the user highlighted keywords in the one or more documents.
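The equation in step b) of claim 19 can be unpacked as a projection followed by an expansion. The derivation below is a worked illustration under one assumption: U denotes the truncated transform matrix with orthonormal columns, so that U′U = I. The symbol P is introduced here only for convenience.

```latex
\[
  q_{\mathrm{reduced}} = U^{\prime} q,
  \qquad
  q_{\mathrm{expanded}} = U \, q_{\mathrm{reduced}} = U \,(U^{\prime} q) = P\,q,
  \qquad
  P := U U^{\prime} .
\]
\[
  P^{2} = U \,(U^{\prime} U)\, U^{\prime} = U U^{\prime} = P,
  \qquad
  P^{\prime} = (U U^{\prime})^{\prime} = U U^{\prime} = P .
\]
```

Because P is symmetric and idempotent, qexpanded is the orthogonal projection of q onto the column space of U. If U were a full square unitary matrix, P would be the identity and qexpanded would equal q, so the expansion introduces related terms only when the decomposition is truncated.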
20. A system for identifying a set of query-relevant keywords comprised in one or more selected documents, wherein the query need not comprise the keywords, comprising a processor;
a) means for creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in the one or more documents of terms in the selected query;
b) means for creating an expanded query term-weight vector qexpanded from the query term-weight vector q and the term-weight matrix M, wherein the expanded query term-weight vector qexpanded identifies terms related to the query, wherein the query need not comprise the terms within the expanded query term-weight vector qexpanded;
c) means for creating a keyword vector using the document term weight vectors d and the expanded query term-weight vector qexpanded, wherein the keyword vector identifies terms from the expanded query term-weight vector qexpanded that do not occur in at least one of the one or more documents, each term included in the keyword vector having a ranking value;
d) means for generating an ordered list of document terms corresponding to the terms included in the keyword vector that occur in at least one of the one or more documents;
e) means for selecting keywords from the ordered list of document terms having the highest ranking value; and
f) means for highlighting the keywords in the one or more documents.
Specification