Multi-concept latent semantic analysis queries

US 9,026,535 B2
Filed: 01/02/2013
Issued: 05/05/2015
Est. Priority Date: 12/14/2011
Status: Active Grant

Abstract

A method includes accessing text, identifying a plurality of terms from the text, determining a plurality of term vectors associated with the identified plurality of terms, and clustering the determined plurality of term vectors into a plurality of clusters, the plurality of clusters comprising a first and a second cluster, the first and second clusters each comprising two or more of the determined term vectors. The method further includes creating a first pseudo-document according to the first cluster, creating a second pseudo-document according to the second cluster, identifying a first set of terms associated with the first cluster using latent semantic analysis (LSA) of the first pseudo-document, identifying a second set of terms associated with the second cluster using LSA of the second pseudo-document, and combining the first and second sets of terms into a list of output terms.
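As a rough illustration of the abstract's final steps, the Python sketch below builds one pseudo-document per cluster and ranks the remaining vocabulary against it. It assumes that a pseudo-document is the plain sum of the cluster's term vectors and that relatedness is measured by cosine similarity in the LSA space; the abstract states neither. The toy 2-D vectors and the names `pseudo_document`, `cosine`, and `related_terms` are hypothetical.

```python
import math

def pseudo_document(cluster, term_vectors):
    """Sum the vectors of the cluster's terms (an assumed construction)."""
    dims = len(next(iter(term_vectors.values())))
    doc = [0.0] * dims
    for term in cluster:
        for i, x in enumerate(term_vectors[term]):
            doc[i] += x
    return doc

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def related_terms(doc, term_vectors, exclude, k):
    """Return the k terms closest to the pseudo-document, skipping `exclude`."""
    scored = [(cosine(doc, vec), term)
              for term, vec in term_vectors.items() if term not in exclude]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]

# Hypothetical term vectors standing in for rows of a reduced LSA term matrix.
TERM_VECTORS = {
    "engine": [0.9, 0.1],
    "wheel": [0.8, 0.2],
    "brake": [0.85, 0.15],
    "flour": [0.1, 0.9],
    "oven": [0.2, 0.8],
    "yeast": [0.15, 0.85],
}

# Two clusters, each holding two term vectors, one per concept.
clusters = [["engine", "wheel"], ["flour", "oven"]]

output_terms = []
for cluster in clusters:
    doc = pseudo_document(cluster, TERM_VECTORS)
    output_terms += related_terms(doc, TERM_VECTORS, exclude=set(cluster), k=1)

print(output_terms)  # ['brake', 'yeast']
```

Each cluster contributes its own nearest term, so the combined list reflects both concepts instead of only the dominant one.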

24 Claims

1. A computer system, comprising:
- one or more memory units; and
  
  one or more processing units operable to:
  
  access text;
  
  identify a plurality of terms from the text;
  
  determine a plurality of term vectors associated with the identified plurality of terms;
  
  calculate a weight of each of the determined plurality of term vectors;
  
  cluster the determined plurality of term vectors into a plurality of clusters, the plurality of clusters comprising a first cluster related to a first concept of the text and a second cluster related to a second concept of the text, the first concept being distinct from the second concept, the first and second clusters each comprising two or more of the determined term vectors, the clustering comprising grouping two or more of the determined term vectors together based on the determined weights of the two or more term vectors and a distance between the two or more term vectors;
  
  identify, using latent semantic analysis (LSA), a first set of terms associated with the first cluster;
  
  identify, using LSA, a second set of terms associated with the second cluster;
  
  determine a first weight associated with the first cluster and a second weight associated with the second cluster, wherein the first weight is based at least on the weights of the term vectors of the first cluster, and wherein the second weight is based at least on the weights of the term vectors of the second cluster;
  
  determine a first percentage of a list of output terms that should come from the first cluster and a second percentage of the list of output terms that should come from the second cluster, the first percentage based on a ratio of the first weight to a sum of the first and second weights, the second percentage based on a ratio of the second weight to the sum of the first and second weights;
  
  select one or more terms from the first set of terms according to the determined first percentage;
  
  select one or more terms from the second set of terms according to the determined second percentage;
  
  combine the selected terms from the first and second sets of terms into the list of output terms, the list of output terms having the first and second concepts of the text; and
  
  store the list of output terms in the one or more memory units.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 24):
- - 2. The system of claim 1, wherein clustering the determined plurality of term vectors into the plurality of clusters comprises using agglomerative clustering.
  - 3. The system of claim 1, wherein combining the selected terms from the first and second sets of terms into the list of output terms comprises using log-entropy mixing.
  - 4. The system of claim 1, the one or more processing units further operable to determine one or more important term vectors based on the calculated weight of each of the determined plurality of term vectors, wherein clustering the determined plurality of term vectors into a plurality of clusters further comprises ensuring that the determined one or more important term vectors are included in the plurality of clusters.
  - 5. The system of claim 1, wherein calculating the weight of each of the determined plurality of term vectors comprises utilizing log-entropy weighting.
  - 6. The system of claim 5, wherein utilizing log-entropy weighting comprises:
    - calculating a mean log-entropy weight of each of the determined plurality of term vectors;
      
      calculating a standard deviation for the calculated log-entropy weights of the determined plurality of term vectors; and
      
      comparing the calculated mean log-entropy weight with the calculated standard deviation.
  - 7. The system of claim 6, wherein:
    - comparing the calculated mean log-entropy weight with the calculated standard deviation comprises:
      
      determining, for each particular term vector of the determined plurality of term vectors, whether the calculated mean log-entropy weight of the particular term vector is greater than a first predetermined amount more than the calculated standard deviation; and
      
      determining, for each particular term vector of the determined plurality of term vectors, whether the calculated mean log-entropy weight of the particular term vector is greater than a second predetermined amount less than the calculated standard deviation; and
      
      the one or more processing units are further operable to:
      
      identify each particular term vector as an important term vector when the calculated mean log-entropy weight of the particular term vector is greater than the first predetermined amount more than the calculated standard deviation;
      
      identify each particular term vector as an unimportant term vector when the calculated mean log-entropy weight of the particular term vector is greater than the second predetermined amount less than the calculated standard deviation; and
      
      prevent important term vectors from being clustered with unimportant term vectors.
  - 8. The system of claim 1, the one or more processing units further operable to:
    - identify, based on the calculated weights of each of the determined plurality of term vectors, one or more important term vectors and one or more unimportant term vectors from the determined plurality of term vectors; and
      
      prevent important term vectors from being clustered with unimportant term vectors.
  - 24. The system of claim 1, wherein the clustering of the determined plurality of term vectors into the plurality of clusters occurs prior to using LSA to identify the first set of terms associated with the first cluster and prior to using LSA to identify a second set of terms associated with the second cluster.
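The percentage step of claim 1 can be sketched in a few lines of Python. Each cluster's share of the output list is its weight divided by the sum of the cluster weights, and that share decides how many terms are taken from the cluster's LSA-derived term set. The function name `allocate_output_terms`, the rounding rule, and the sample terms and weights are assumptions made for illustration; the claim does not specify how fractional counts are handled.

```python
def allocate_output_terms(cluster_terms, cluster_weights, list_size):
    """Pick terms from each cluster in proportion to its weight.

    cluster_terms: one ranked term list per cluster (best term first).
    cluster_weights: one non-negative weight per cluster.
    list_size: target length of the combined output list.
    """
    total = sum(cluster_weights)
    if total <= 0:
        raise ValueError("cluster weights must sum to a positive value")
    output = []
    for terms, weight in zip(cluster_terms, cluster_weights):
        share = weight / total             # ratio of this weight to the sum
        count = round(share * list_size)   # assumed rounding rule; counts may
                                           # not always add up to list_size
        output.extend(terms[:count])
    return output

# Hypothetical ranked term sets and cluster weights.
first_terms = ["engine", "brake", "gearbox", "piston"]
second_terms = ["oven", "yeast", "dough"]

result = allocate_output_terms([first_terms, second_terms], [6.0, 2.0], 4)
print(result)  # ['engine', 'brake', 'gearbox', 'oven']
```

With weights 6.0 and 2.0 the shares are 75% and 25%, so a four-term list takes three terms from the first cluster and one from the second.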

9. A computer-implemented method, comprising:
- accessing text by a processing system;
  
  identifying, by the processing system, a plurality of terms from the text;
  
  determining, by the processing system, a plurality of term vectors associated with the identified terms;
  
  calculating a weight of each of the determined term vectors;
  
  clustering, by the processing system, the determined term vectors into a plurality of clusters, each of the clusters being related to a distinct concept of the text, each cluster comprising at least one of the determined term vectors, the clustering comprising selecting the at least one of the determined term vectors based on the determined weights of the term vectors and distances between the determined term vectors;
  
  identifying, by the processing system using latent semantic analysis (LSA), a first set of terms associated with a first cluster of the plurality of clusters and a second set of terms associated with a second cluster of the plurality of clusters;
  
  determining, by the processing system, a first weight associated with the first cluster and a second weight associated with the second cluster, wherein the first weight is based at least on the weights of the term vectors of the first cluster, and wherein the second weight is based at least on the weights of the term vectors of the second cluster;
  
  determining, by the processing system, a first percentage of a list of output terms that should come from the first cluster and a second percentage of the list of output terms that should come from the second cluster, the first percentage based on a ratio of the first weight to a sum of the first and second weights, the second percentage based on a ratio of the second weight to the sum of the first and second weights;
  
  selecting, by the processing system, one or more terms from the first set of terms according to the determined first percentage;
  
  selecting, by the processing system, one or more terms from the second set of terms according to the determined second percentage;
  
  creating, by the processing system, the list of output terms using at least a portion of the selected terms from the first and second sets of terms, the list of output terms having the distinct concepts of the plurality of clusters; and
  
  storing, by the processing system, the list of output terms in one or more memory units.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
- - 10. The computer-implemented method of claim 9, wherein clustering the determined term vectors into the plurality of clusters comprises using agglomerative clustering.
  - 11. The computer-implemented method of claim 9, wherein creating the list of output terms comprises using log-entropy mixing.
  - 12. The computer-implemented method of claim 9, further comprising:
    - creating a query pseudo-document from the determined term vectors; and
      
      creating a leaned pseudo-document using the query pseudo-document;
      
      wherein identifying the first or second set of terms comprises using LSA of the leaned pseudo-document.
  - 13. The computer-implemented method of claim 12, wherein creating the leaned pseudo-document comprises:
    - determining a query vector according to the query pseudo-document;
      
      determining a cluster vector;
      
      normalizing the query vector and the cluster vector to the same length; and
      
      determining a leaned cluster vector that points to a location that is between the query vector and the cluster vector.
  - 14. The computer-implemented method of claim 9, wherein:
    - the weights of each of the determined term vectors comprise log-entropy weights;
      
      the first weight associated with the first cluster comprises a sum of the determined log-entropy weights of the term vectors of the first cluster; and
      
      the second weight associated with the second cluster comprises a sum of the determined log-entropy weights of the term vectors of the second cluster.
  - 15. The computer-implemented method of claim 9, wherein calculating the weight of each of the determined term vectors comprises utilizing log-entropy weighting.
  - 16. The computer-implemented method of claim 9, further comprising:
    - identifying, based on the calculated weights, one or more important term vectors and one or more unimportant term vectors; and
      
      preventing important term vectors from being clustered with unimportant term vectors.
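The leaned cluster vector of claim 13 can be pictured with a small Python sketch. Both vectors are scaled to unit length, and the result is a point on the line segment between them. The linear interpolation and the `lean` parameter are assumptions; the claim only requires that the leaned vector point to a location between the query vector and the cluster vector.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def leaned_cluster_vector(query_vec, cluster_vec, lean=0.5):
    """Move the cluster vector part of the way toward the query vector.

    lean=0.0 returns the normalized cluster vector, lean=1.0 the normalized
    query vector; values in between lie on the segment joining them.
    """
    if not 0.0 <= lean <= 1.0:
        raise ValueError("lean must be between 0 and 1")
    q = normalize(query_vec)      # same length (1.0) for both vectors
    c = normalize(cluster_vec)
    return [(1.0 - lean) * ci + lean * qi for qi, ci in zip(q, c)]

# Hypothetical 2-D vectors of different lengths.
leaned = leaned_cluster_vector([3.0, 0.0], [0.0, 2.0], lean=0.5)
print(leaned)  # [0.5, 0.5]
```

Leaning each cluster toward the whole query keeps the per-cluster LSA lookup focused on its own concept while still reflecting the context of the full text.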

17. A non-transitory computer-readable medium comprising software, the software when executed by one or more processing units operable to perform operations comprising:
- accessing text;
  
  identifying a plurality of terms from the text;
  
  determining a plurality of term vectors associated with the identified plurality of terms;
  
  calculating a weight of each of the determined plurality of term vectors;
  
  clustering the determined plurality of term vectors into a plurality of clusters, the plurality of clusters comprising a first cluster related to a first concept of the text and a second cluster related to a second concept of the text, the first concept being distinct from the second concept, the first and second clusters each comprising two or more of the determined term vectors, the clustering comprising grouping two or more of the determined term vectors together based on the determined weights of the two or more term vectors and a distance between the two or more term vectors;
  
  identifying, using latent semantic analysis (LSA), a first set of terms associated with the first cluster;
  
  identifying, using LSA, a second set of terms associated with the second cluster;
  
  determining a first weight associated with the first cluster and a second weight associated with the second cluster, wherein the first weight is based at least on the weights of the term vectors of the first cluster, and wherein the second weight is based at least on the weights of the term vectors of the second cluster;
  
  determining a first percentage of a list of output terms that should come from the first cluster and a second percentage of the list of output terms that should come from the second cluster, the first percentage based on a ratio of the first weight to a sum of the first and second weights, the second percentage based on a ratio of the second weight to the sum of the first and second weights;
  
  selecting one or more terms from the first set of terms according to the determined first percentage;
  
  selecting one or more terms from the second set of terms according to the determined second percentage;
  
  combining the selected terms from the first and second sets of terms into the list of output terms, the list of output terms having the first and second concepts of the text; and
  
  storing the list of output terms in one or more memory units.
- Dependent claims (18, 19, 20, 21, 22, 23):
- - 18. The non-transitory computer-readable medium of claim 17, wherein clustering the determined plurality of term vectors into the plurality of clusters comprises using agglomerative clustering.
  - 19. The non-transitory computer-readable medium of claim 17, the one or more processing units further operable to perform operations comprising:
    - creating a query pseudo-document from the determined plurality of term vectors;
      
      creating a first leaned pseudo-document using the query pseudo-document;
      
      creating a second leaned pseudo-document using the query pseudo-document; and
      
      wherein:
      
      identifying the first set of terms associated with the first cluster comprises using LSA of the first leaned pseudo-document; and
      
      identifying the second set of terms associated with the second cluster comprises using LSA of the second leaned pseudo-document.
  - 20. The non-transitory computer-readable medium of claim 19, wherein creating the first and second leaned pseudo-documents comprises:
    - determining a query vector according to the query pseudo-document;
      
      determining a first cluster vector;
      
      determining a second cluster vector;
      
      normalizing the query vector and the first cluster vector to the same length;
      
      normalizing the query vector and the second cluster vector to the same length;
      
      determining a first leaned cluster vector that points to a location that is between the query vector and the first cluster vector; and
      
      determining a second leaned cluster vector that points to a location that is between the query vector and the second cluster vector.
  - 21. The non-transitory computer-readable medium of claim 17, wherein:
    - the weights of each of the determined plurality of term vectors comprise log-entropy weights;
      
      the first weight associated with the first cluster comprises a sum of the determined log-entropy weights of the term vectors of the first cluster; and
      
      the second weight associated with the second cluster comprises a sum of the determined log-entropy weights of the term vectors of the second cluster.
  - 22. The non-transitory computer-readable medium of claim 17, wherein calculating the weight of each of the determined plurality of term vectors comprises utilizing log-entropy weighting.
  - 23. The non-transitory computer-readable medium of claim 17, the one or more processing units further operable to perform operations comprising:
    - identifying, based on the calculated weights of each of the determined plurality of term vectors, one or more important term vectors and one or more unimportant term vectors from the determined plurality of term vectors; and
      
      preventing important term vectors from being clustered with unimportant term vectors.
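Claims 21 and 22 refer to log-entropy weights and to a cluster weight formed by summing them. The Python sketch below uses the standard global entropy component of log-entropy weighting, 1 + (sum over documents of p * log p) / log n, as the per-term weight. Treating that global component as "the" weight of a term vector is an assumption, and the per-document counts and the names `global_entropy_weight` and `cluster_weight` are hypothetical.

```python
import math

def global_entropy_weight(term_counts):
    """Global entropy weight of one term across n documents.

    term_counts: occurrences of the term in each document.
    Returns a value near 1.0 for a term concentrated in one document and
    near 0.0 for a term spread evenly over all documents.
    """
    n = len(term_counts)
    total = sum(term_counts)
    if n < 2 or total == 0:
        return 0.0  # assumed convention for degenerate input
    entropy = sum((tf / total) * math.log(tf / total)
                  for tf in term_counts if tf > 0)
    return 1.0 + entropy / math.log(n)

def cluster_weight(cluster, weights):
    """Sum the weights of the cluster's term vectors (as in claim 21)."""
    return sum(weights[term] for term in cluster)

# Hypothetical counts of each term in four documents.
COUNTS = {
    "engine": [4, 0, 0, 0],  # concentrated in one document
    "brake": [1, 1, 0, 0],   # split over two documents
    "the": [2, 2, 2, 2],     # spread evenly
}

weights = {term: global_entropy_weight(counts) for term, counts in COUNTS.items()}
first_weight = cluster_weight(["engine", "brake"], weights)
print(round(first_weight, 6))  # approximately 1.5
```

Cluster weights computed this way can feed directly into the ratio that sets each cluster's share of the output list.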

Current Assignee: Brainspace Corporation (Reveal Data Corp.)
Original Assignee: Brainspace Corporation (Reveal Data Corp.)
Inventors: Jakubik, Paul A.
Primary Examiner(s): Jalil, Neveen Abel
Assistant Examiner(s): Biskeborn, Kristofer M.

Application Number: US13/732,869
Publication Number: US 20130218554A1
Time in Patent Office: 853 Days
Field of Search: 707/737, 707/739, 707/748, 707/750
US Class Current: 707/737
CPC Class Codes:

G06F 16/35   Clustering; Classification

G06F 16/355   Class or cluster creation o...

G06F 16/358   Browsing; Visualisation the...

G06F 40/30   Semantic analysis
