System and method for identifying query-relevant keywords in documents with latent semantic analysis

US 7,440,947 B2
Filed: 11/12/2004
Issued: 10/21/2008
Est. Priority Date: 11/12/2004
Status: Expired due to Fees

Abstract

A system and method for identifying query-related keywords in documents found in a search using latent semantic analysis. The documents are represented as a document term matrix M containing one or more document term-weight vectors d, which may be term-frequency (tf) vectors or term-frequency inverse-document-frequency (tf-idf) vectors. This matrix is subjected to a truncated singular value decomposition. The resulting transform matrix U can be used to project a query term-weight vector q into the reduced N-dimensional space, followed by its expansion back into the full vector space using the inverse of U.

To perform a search, the similarity of q_expanded is measured relative to each candidate document vector in this space. Exemplary similarity functions are dot product and cosine similarity. Keywords are selected with the highest values in q_expanded that are also comprised in at least one document. Matching keywords from the query may be highlighted in the search results.
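The projection and expansion described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: it assumes a terms-by-documents layout for M, treats q as a column vector, and uses the form q_expanded=U*(U′*q) with U truncated to its first k columns. The function name `expand_query`, the toy vocabulary, and the choice k=2 are all assumptions made for the example.

```python
import numpy as np

def expand_query(M, q, k):
    """Project q into a k-dimensional latent space and expand it back.

    M: term-weight matrix, one row per term and one column per document.
    q: query term-weight vector with one entry per term.
    k: number of singular vectors kept (the truncation level).
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U_k = U[:, :k]            # truncated transform matrix
    q_reduced = U_k.T @ q     # projection into the latent space
    return U_k @ q_reduced    # expansion back into the full term space

# Hypothetical toy data: rows follow the order of `terms`, columns are documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
M = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])
q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # the query contains only "car"
q_expanded = expand_query(M, q, k=2)
```

In this toy case the expanded vector gives weight to "automobile" and "engine", which co-occur with "car" in the collection, while "flower" and "petal" stay at essentially zero. That is the sense in which the query need not contain the keywords.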

20 Claims

1. A method for identifying a set of query-relevant keywords comprised in one or more documents obtained from a query, wherein the query need not comprise the keywords, comprising:
- a) creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
  
  b) creating an expanded query term-weight vector q_expanded from the query term-weight vector q and the term-weight matrix M, wherein the expanded query term-weight vector q_expanded identifies terms related to the query, wherein the query need not comprise the terms within the expanded query term-weight vector q_expanded;
  
  c) creating a keyword vector using the document term weight vectors d and the expanded query term-weight vector q_expanded, wherein the keyword vector identifies terms from the expanded query term-weight vector q_expanded that do not occur in at least one of the one or more documents, each term included in the keyword vector having a ranking value;
  
  d) generating an ordered list of document terms corresponding to the terms included in the keyword vector that occur in at least one of the one or more documents;
  
  e) selecting keywords from the ordered list of document terms having the highest ranking value; and
  
  f) highlighting the keywords in the one or more documents.
- - 2. The method of claim 1, wherein the document term-weight vectors d are term-frequency vectors.
  - 3. The method of claim 1, wherein the document term-weight vectors d are term-frequency inverse-document frequency (tf-idf) vectors, whose nth component is computed as a function of a) the term-frequency in document d of the nth term and b) the fraction of the documents that comprise the nth term.
  - 4. The method of claim 3, wherein the function is computed according to the equation tf-idf=(1.0+log(tf))*log(1/df), where tf is a term-frequency for the nth term in document d, and df is the fraction of the documents that comprises the nth term.
  - 5. The method of claim 1, wherein the step of creating a keyword vector using the document term weight vectors d and the expanded query term-weight vector q_expanded, comprises:
    - c1) computing for each document a term-by-term product of the expanded query term-weight vector q_expanded and the corresponding document term-weight vector d; and
      
      c2) assigning the ranking values based on the computed products.
  - 6. The method of claim 1, wherein creating the expanded query term-weight vector q_expanded comprises the following steps:
    - b1) projecting a query term-weight vector q corresponding to the query terms into latent semantic space, creating a reduced query term-weight vector q_reduced; and
      
      b2) expanding the query term-weight vector back out of latent semantic space, creating the expanded query term-weight vector q_expanded.
  - 7. The method of claim 6, wherein the step of projecting comprises:
    - b1(1) approximating the term-weight matrix M as a product of three matrices U S V′, wherein S is a square diagonal matrix, U is a unitary transform matrix, and V is a unitary matrix.
  - 8. The method of claim 7, wherein the step of projecting further comprises:
    - b1(2) creating a reduced query term-weight vector q_reduced=q*U′.
  - 9. The method of claim 7, wherein the step of expanding the query term-weight vector q comprises:
    - b2(1) creating the expanded query term-weight vector q_expanded according to the equation q_expanded=q_reduced*U=U*(U′*q).
  - 10. The method of claim 6, wherein the step of expanding the query term-weight vector q comprises:
    - b2(1) computing each of n terms of q_expanded by computing a similarity function between a reduced query term-weight vector q_reduced and row n of the unitary transform matrix U.
  - 11. The method of claim 10, wherein the similarity function is a cosine similarity function.
  - 12. The method of claim 1, wherein creating an expanded query term-weight vector q_expanded comprises:
    - b1) approximating the term-weight matrix M as a product of three matrices U S V′, wherein S is a square diagonal matrix, U is a unitary transform matrix, and V is a unitary matrix.
  - 13. The method of claim 12, wherein creating an expanded query term-weight vector q_expanded further comprises:
    - b2) creating the expanded query term-weight vector q_expanded using a query term-weight vector q according to the equation q_expanded=U*(U′*q).
  - 14. The method of claim 1, wherein the document is a text document.
  - 15. The method of claim 1, wherein the document is text associated with a video segment.
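The weighting recited in claim 4, tf-idf=(1.0+log(tf))*log(1/df), can be illustrated as follows. This is a sketch under stated assumptions: the claim does not fix the logarithm base (natural log is used here), it does not say how terms with tf=0 are handled (they are given weight 0 here), and the helper names `tf_idf` and `term_weight_matrix` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(tf, df):
    """Weight from claim 4: (1.0 + log(tf)) * log(1/df).

    tf: count of the term in one document.
    df: fraction of documents that contain the term (0 < df <= 1).
    """
    if tf <= 0:
        return 0.0  # assumption: a term absent from the document gets zero weight
    return (1.0 + math.log(tf)) * math.log(1.0 / df)

def term_weight_matrix(docs):
    """Build a terms-by-documents tf-idf matrix from tokenized documents."""
    terms = sorted({t for doc in docs for t in doc})
    counts = [Counter(doc) for doc in docs]
    n_docs = len(docs)
    matrix = []
    for term in terms:
        df = sum(1 for c in counts if term in c) / n_docs
        matrix.append([tf_idf(c[term], df) for c in counts])
    return terms, matrix

# Hypothetical two-document collection.
docs = [["car", "car", "engine"], ["car", "flower"]]
terms, M = term_weight_matrix(docs)
```

Note that a term occurring in every document has df=1 and therefore weight 0 under this formula, so "car" contributes nothing in the example above.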

16. A method for identifying a set of query-relevant keywords comprised in one or more documents obtained from a query, wherein the query need not comprise the keywords, comprising:
- a) creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
  
  b) projecting a query term-weight vector q corresponding to the query terms into latent semantic space, creating a reduced query term-weight vector q_reduced;
  
  c) expanding the query term-weight vector back out of latent semantic space, creating the expanded query term-weight vector q_expanded, wherein the expanded query term-weight vector q_expanded identifies terms related to the query, wherein the query need not comprise the terms within the expanded query term-weight vector q_expanded;
  
  d) creating a keyword vector using the document term weight vectors d and the expanded query term-weight vector q_expanded, wherein the keyword vector identifies terms from the expanded query term-weight vector q_expanded that do not occur in at least one of the one or more documents, each term included in the keyword vector has a ranking value;
  
  e) generating an ordered list of document terms corresponding to the terms included in the keyword vector that occur in at least one of the one or more documents;
  
  f) selecting keywords from the ordered list of document terms having the highest ranking value; and
  
  g) highlighting the keywords in the one or more documents.
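One possible reading of the project-then-expand steps uses the similarity-based variant recited in claims 10 and 11, where entry n of q_expanded is the cosine similarity between q_reduced and row n of U. The sketch below assumes a truncated U with k columns, a terms-by-documents matrix, and zero similarity for rows with zero norm; the function name `expand_by_cosine` and the toy data are illustrative only.

```python
import numpy as np

def expand_by_cosine(U_k, q, eps=1e-12):
    """Entry n of the result is cos(q_reduced, row n of U_k)."""
    q_reduced = U_k.T @ q                                   # projection step
    denom = np.linalg.norm(U_k, axis=1) * np.linalg.norm(q_reduced)
    q_expanded = np.zeros(U_k.shape[0])
    ok = denom > eps                                        # skip zero-norm rows
    q_expanded[ok] = (U_k[ok] @ q_reduced) / denom[ok]
    return q_expanded

# Hypothetical terms-by-documents matrix; rows: car, automobile, engine, flower, petal.
M = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U_k = U[:, :2]                                              # keep two latent dimensions
q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])                     # query: "car" only
sims = expand_by_cosine(U_k, q)
```

Compared with the product U*(U′*q), the cosine form discards the magnitude of each term's row in U, so every term that points in the same latent direction as the query receives the same score.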

17. A method for identifying a set of query-relevant keywords comprised in one or more documents obtained from a query, wherein the query need not comprise the keywords, comprising:
- a) creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
  
  b) approximating the term-weight matrix M as a product of three matrices U S V′, wherein S is a square diagonal matrix, U is a unitary transform matrix, and V is a unitary matrix;
  
  c) expanding the query term-weight vector back out of latent semantic space, creating the expanded query term-weight vector q_expanded, wherein the expanded query term-weight vector q_expanded identifies terms related to the query, wherein the query need not comprise the terms within the expanded query term-weight vector q_expanded; and
  
  d) creating a keyword vector using the document term weight vectors d and the expanded query term-weight vector q_expanded, wherein the keyword vector identifies terms from the expanded query term-weight vector q_expanded that do not occur in at least one of the one or more documents, each term included in the keyword vector has a ranking value;
  
  e) generating an ordered list of document terms corresponding to the terms included in the keyword vector that occur in at least one of the one or more documents;
  
  f) selecting keywords from the ordered list of document terms having the highest ranking value; and
  
  g) highlighting the keywords in the one or more documents.
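The keyword-vector, ordering, and selection steps can be sketched with the term-by-term product described in claim 5. The code below is an assumption-laden illustration: it ranks by the elementwise product of q_expanded and one document vector d, restricts candidates to terms with nonzero weight in that document, and keeps the top few. The function name `select_keywords`, the cutoff `top_n`, and the sample vectors are hypothetical.

```python
import numpy as np

def select_keywords(terms, d, q_expanded, top_n=3):
    """Rank the terms of one document by q_expanded[n] * d[n]."""
    scores = q_expanded * d                       # term-by-term product (ranking values)
    present = np.flatnonzero(d > 0)               # terms that occur in the document
    order = present[np.argsort(-scores[present], kind="stable")]
    return [(terms[i], float(scores[i])) for i in order[:top_n] if scores[i] > 0]

terms = ["car", "automobile", "engine", "flower", "petal"]
# Assumed expanded vector for a query containing only "car".
q_expanded = np.array([1 / 6, 1 / 3, 1 / 6, 0.0, 0.0])
# Term-frequency vector of a document that mentions "automobile" and "engine".
d = np.array([0.0, 1.0, 1.0, 0.0, 0.0])
keywords = select_keywords(terms, d, q_expanded)
```

Here the document never mentions the query term "car", yet "automobile" and "engine" are still returned as its query-relevant keywords, in that order.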

18. In a computer, a method for identifying a set of query-relevant keywords comprised in one or more documents obtained from a query, wherein the query need not comprise the keywords, comprising:
- a) receiving by the user a term-weight matrix M created by the computer and comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in one of the one or more documents obtained from the terms in the query;
  
  b) receiving by the user an expanded query term-weight vector q_expanded created by the computer from the query term-weight vector q and the term-weight matrix M, wherein the expanded query term-weight vector q_expanded identifies terms related to the query, wherein the query need not comprise the terms within the expanded query term-weight vector q_expanded;
  
  c) receiving by the user a keyword vector created by the computer using the document term weight vectors d and the expanded query term-weight vector q_expanded, wherein the keyword vector identifies terms from the expanded query term-weight vector q_expanded that do not occur in at least one of the one or more documents, each term included in the keyword vector has a ranking value;
  
  d) receiving by the user an ordered list of document terms corresponding to the terms included in the keyword vector that occur in at least one of the one or more documents, the ordered list of document terms being created by the computer;
  
  e) receiving by the user keywords from the ordered list of document terms having the highest ranking value, the keywords having the highest rank selected by the computer; and
  
  f) receiving by the user highlighted keywords in the one or more documents.

19. In a computer, a method for identifying a set of query-relevant keywords comprised in one or more documents, wherein the query need not comprise the keywords, comprising:
- a) receiving by the user a term-weight matrix M comprising information on the frequency in the one or more documents of terms in the selected query and approximated by the computer as a product of three matrices U S V′, wherein S is a square diagonal matrix, U is a unitary transform matrix, and V is a unitary matrix;
  
  b) receiving by the user an expanded query term-weight vector q_expanded created by the computer using a query term-weight vector q according to the equation q_expanded=U*(U′*q), wherein the expanded query term-weight vector q_expanded identifies terms related to the query, wherein the query need not comprise the terms within the expanded query term-weight vector q_expanded;
  
  c) receiving by the user a keyword vector created by the computer using the document term weight vectors d and the expanded query term-weight vector q_expanded, wherein the keyword vector identifies terms from the expanded query term-weight vector q_expanded that do not occur in at least one of the one or more documents, each term included in the keyword vector has a ranking value;
  
  d) receiving by the user an ordered list of document terms corresponding to the terms included in the keyword vector that occur in at least one of the one or more documents, the ordered list of document terms being created by the computer;
  
  e) receiving by the user keywords from the ordered list of document terms having the highest ranking value, the keywords having the highest rank selected by the computer; and
  
  f) receiving by the user highlighted keywords in the one or more documents.

20. A system for identifying a set of query-relevant keywords comprised in one or more selected documents, wherein the query need not comprise the keywords, comprising a processor;
- a) means for creating a term-weight matrix M comprising one or more document term-weight vectors d, wherein each document term-weight vector d comprises information on the frequency in the one or more documents of terms in the selected query;
  
  b) means for creating an expanded query term-weight vector q_expanded from the query term-weight vector q and the term-weight matrix M, wherein the expanded query term-weight vector q_expanded identifies terms related to the query, wherein the query need not comprise the terms within the expanded query term-weight vector q_expanded;
  
  c) means for creating a keyword vector using the document term weight vectors d and the expanded query term-weight vector q_expanded, wherein the keyword vector identifies terms from the expanded query term-weight vector q_expanded that do not occur in at least one of the one or more documents, each term included in the keyword vector having a ranking value;
  
  d) means for generating an ordered list of document terms corresponding to the terms included in the keyword vector that occur in at least one of the one or more documents;
  
  e) means for selecting keywords from the ordered list of document terms having the highest ranking value; and
  
  f) means for highlighting the keywords in the one or more documents.
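The final highlighting step is not tied to any particular rendering in the claims. As one hypothetical realization, the sketch below wraps whole-word, case-insensitive matches of the selected keywords in marker strings; the function name `highlight` and the `[[` and `]]` markers are assumptions chosen for the example.

```python
import re

def highlight(text, keywords, left="[[", right="]]"):
    """Wrap each whole-word occurrence of a keyword in marker strings."""
    if not keywords:
        return text
    # Longer keywords first so that a longer term wins over a shorter prefix.
    alternatives = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"{left}{m.group(0)}{right}", text)

marked = highlight("The automobile engine stalled.", ["automobile", "engine"])
```

A display layer could substitute HTML tags or terminal escape codes for the markers without changing the matching logic.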

Current Assignee: Fuji Xerox Company Limited (Fujifilm Holdings Corporation)
Original Assignee: Fuji Xerox Company Limited (Fujifilm Holdings Corporation)
Inventors: Girgensohn, Andreas; Cooper, Matthew; Wilcox, Lynn D.; Adcock, John E.
Primary Examiner(s): Wong; Leslie
Assistant Examiner(s): Chau; Dung K

Application Number: US10/987,377
Publication Number: US 20060106767A1
Time in Patent Office: 1,439 Days
Field of Search: 707/3, 707/5, 707/6, 707/7, 707/101, 707/104.1
US Class Current: 1/1
CPC Class Codes:
G06F 16/31   Indexing; Data structures t...
G06F 40/247   Thesauruses; Synonyms
G06F 40/30   Semantic analysis
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99936   Pattern matching access
Y10S 707/99937   Sorting
