Multi-concept latent semantic analysis queries
First Claim
1. A computer system, comprising:
one or more memory units; and
one or more processing units operable to:
access text;
identify a plurality of terms from the text;
determine a plurality of term vectors associated with the identified plurality of terms;
calculate a weight of each of the determined plurality of term vectors;
cluster the determined plurality of term vectors into a plurality of clusters, the plurality of clusters comprising a first cluster related to a first concept of the text and a second cluster related to a second concept of the text, the first concept being distinct from the second concept, the first and second clusters each comprising two or more of the determined term vectors, the clustering comprising grouping two or more of the determined term vectors together based on the determined weights of the two or more term vectors and a distance between the two or more term vectors;
identify, using latent semantic analysis (LSA), a first set of terms associated with the first cluster;
identify, using LSA, a second set of terms associated with the second cluster;
determine a first weight associated with the first cluster and a second weight associated with the second cluster, wherein the first weight is based at least on the weights of the term vectors of the first cluster, and wherein the second weight is based at least on the weights of the term vectors of the second cluster;
determine a first percentage of a list of output terms that should come from the first cluster and a second percentage of the list of output terms that should come from the second cluster, the first percentage based on a ratio of the first weight to a sum of the first and second weights, the second percentage based on a ratio of the second weight to the sum of the first and second weights;
select one or more terms from the first set of terms according to the determined first percentage;
select one or more terms from the second set of terms according to the determined second percentage;
combine the selected terms from the first and second sets of terms into the list of output terms, the list of output terms having the first and second concepts of the text; and
store the list of output terms in the one or more memory units.
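The weighting and percentage steps of the claim can be illustrated with a short sketch. It is not the patented implementation: the function names, the use of a plain sum as the cluster weight, and the rounding rule are all assumptions made for illustration. The claim only requires that each cluster weight be "based at least on" the term-vector weights and that each percentage be "based on" the stated ratio.

```python
def cluster_weight(term_weights):
    """Assumption: a cluster's weight is the plain sum of its term-vector weights."""
    return sum(term_weights)


def build_output_terms(first_terms, second_terms, first_weight, second_weight, list_size):
    """Fill a list of `list_size` output terms from two ranked term sets.

    Each cluster's share is the ratio of its weight to the sum of both
    weights, as recited in the claim. The rounding rule and the guarantee of
    at least one term per cluster are illustrative choices.
    """
    total = first_weight + second_weight
    if total <= 0:
        raise ValueError("cluster weights must sum to a positive value")
    if list_size < 2:
        raise ValueError("list_size must leave room for both clusters")

    first_share = first_weight / total                # first percentage, as a fraction
    first_count = int(first_share * list_size + 0.5)  # round half up
    # Keep at least one slot for each cluster ("select one or more terms").
    first_count = max(1, min(list_size - 1, first_count))
    second_count = list_size - first_count

    return first_terms[:first_count] + second_terms[:second_count]


# Hypothetical data: two clusters with summed weights 3.0 and 1.0.
w1 = cluster_weight([2.0, 1.0])
w2 = cluster_weight([0.5, 0.5])
output = build_output_terms(
    ["engine", "motor", "turbine", "piston"],  # terms related to the first cluster
    ["loan", "credit"],                        # terms related to the second cluster
    w1, w2, list_size=4,
)
print(output)
```

With these hypothetical weights the first cluster holds 75% of the total weight, so three of the four output slots go to its terms and one goes to the second cluster.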
Abstract
A method includes accessing text, identifying a plurality of terms from the text, determining a plurality of term vectors associated with the identified plurality of terms, and clustering the determined plurality of term vectors into a plurality of clusters, the plurality of clusters comprising a first and a second cluster, the first and second clusters each comprising two or more of the determined term vectors. The method further includes creating a first pseudo-document according to the first cluster, creating a second pseudo-document according to the second cluster, identifying a first set of terms associated with the first cluster using latent semantic analysis (LSA) of the first pseudo-document, identifying a second set of terms associated with the second cluster using LSA of the second pseudo-document, and combining the first and second sets of terms into a list of output terms.
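The pseudo-document step in the abstract might be sketched as follows. This is a minimal illustration under stated assumptions: the term vectors are taken to already live in a reduced LSA space (the toy two-dimensional vectors below are invented), the pseudo-document is formed as the weighted sum of a cluster's term vectors, and related terms are ranked by cosine similarity. The abstract does not specify any of these choices, and all function names are hypothetical.

```python
import math


def pseudo_document(cluster_terms, term_vectors, term_weights):
    """Assumption: a pseudo-document is the weighted sum of the cluster's term vectors."""
    dim = len(next(iter(term_vectors.values())))
    doc = [0.0] * dim
    for term in cluster_terms:
        for i, component in enumerate(term_vectors[term]):
            doc[i] += term_weights[term] * component
    return doc


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related_terms(doc_vector, term_vectors, exclude, top_n):
    """Rank vocabulary terms by similarity to the pseudo-document vector."""
    scored = [
        (cosine(doc_vector, vector), term)
        for term, vector in term_vectors.items()
        if term not in exclude
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by name
    return [term for _, term in scored[:top_n]]


# Invented LSA-space vectors for a tiny vocabulary.
TERM_VECTORS = {
    "engine": [0.9, 0.1], "piston": [0.8, 0.2], "turbine": [0.85, 0.05],
    "loan": [0.1, 0.9], "credit": [0.2, 0.8], "mortgage": [0.05, 0.95],
}
TERM_WEIGHTS = {"engine": 2.0, "piston": 1.0}

first_cluster = ["engine", "piston"]
doc = pseudo_document(first_cluster, TERM_VECTORS, TERM_WEIGHTS)
print(related_terms(doc, TERM_VECTORS, exclude=first_cluster, top_n=1))
```

Running the same lookup once per cluster yields one term set per concept, which is what lets the combined output list cover both concepts rather than an average of them.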
24 Claims
1. A computer system, comprising:
The full text of claim 1 appears under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 24
9. A computer-implemented method, comprising:
accessing text by a processing system;
identifying, by the processing system, a plurality of terms from the text;
determining, by the processing system, a plurality of term vectors associated with the identified terms;
calculating a weight of each of the determined term vectors;
clustering, by the processing system, the determined term vectors into a plurality of clusters, each of the clusters being related to a distinct concept of the text, each cluster comprising at least one of the determined term vectors, the clustering comprising selecting the at least one of the determined term vectors based on the determined weights of the term vectors and distances between the determined term vectors;
identifying, by the processing system using latent semantic analysis (LSA), a first set of terms associated with a first cluster of the plurality of clusters and a second set of terms associated with a second cluster of the plurality of clusters;
determining, by the processing system, a first weight associated with the first cluster and a second weight associated with the second cluster, wherein the first weight is based at least on the weights of the term vectors of the first cluster, and wherein the second weight is based at least on the weights of the term vectors of the second cluster;
determining, by the processing system, a first percentage of a list of output terms that should come from the first cluster and a second percentage of the list of output terms that should come from the second cluster, the first percentage based on a ratio of the first weight to a sum of the first and second weights, the second percentage based on a ratio of the second weight to the sum of the first and second weights;
selecting, by the processing system, one or more terms from the first set of terms according to the determined first percentage;
selecting, by the processing system, one or more terms from the second set of terms according to the determined second percentage;
creating, by the processing system, the list of output terms using at least a portion of the selected terms from the first and second sets of terms, the list of output terms having the distinct concepts of the plurality of clusters; and
storing, by the processing system, the list of output terms in one or more memory units.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
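The clustering limitation (grouping term vectors by their weights and the distances between them) could be realized in many ways. The sketch below shows one possibility only, and it is an assumption rather than the method the patent describes: terms are visited from heaviest to lightest, each term joins the nearest existing cluster seed if its cosine distance to that seed is within a hypothetical `max_distance` threshold, and otherwise it starts a new cluster. The vectors, weights, and function names are invented for the example.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; treated as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if not norm_a or not norm_b:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def cluster_terms(term_vectors, term_weights, max_distance):
    """Greedy, weight-ordered clustering (illustrative assumption).

    The first member of each cluster acts as its seed. Heavier terms are
    placed first, so they become seeds before lighter terms are assigned.
    """
    clusters = []
    ordered = sorted(term_vectors, key=lambda t: (-term_weights[t], t))
    for term in ordered:
        best_cluster, best_distance = None, max_distance
        for cluster in clusters:
            seed = cluster[0]
            distance = cosine_distance(term_vectors[term], term_vectors[seed])
            if distance <= best_distance:
                best_cluster, best_distance = cluster, distance
        if best_cluster is None:
            clusters.append([term])    # too far from every seed: new concept
        else:
            best_cluster.append(term)  # close enough: same concept
    return clusters


# Invented vectors and weights for four terms drawn from a two-concept text.
vectors = {
    "engine": [0.9, 0.1], "piston": [0.8, 0.2],
    "loan": [0.1, 0.9], "credit": [0.2, 0.8],
}
weights = {"engine": 3.0, "loan": 2.0, "piston": 1.0, "credit": 0.5}

clusters = cluster_terms(vectors, weights, max_distance=0.2)
print(clusters)
```

Ordering by weight means the most important terms define the concepts, and the distance threshold keeps unrelated terms from being merged into one averaged query.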
17. A non-transitory computer-readable medium comprising software, the software when executed by one or more processing units operable to perform operations comprising:
accessing text;
identifying a plurality of terms from the text;
determining a plurality of term vectors associated with the identified plurality of terms;
calculating a weight of each of the determined plurality of term vectors;
clustering the determined plurality of term vectors into a plurality of clusters, the plurality of clusters comprising a first cluster related to a first concept of the text and a second cluster related to a second concept of the text, the first concept being distinct from the second concept, the first and second clusters each comprising two or more of the determined term vectors, the clustering comprising grouping two or more of the determined term vectors together based on the determined weights of the two or more term vectors and a distance between the two or more term vectors;
identifying, using latent semantic analysis (LSA), a first set of terms associated with the first cluster;
identifying, using LSA, a second set of terms associated with the second cluster;
determining a first weight associated with the first cluster and a second weight associated with the second cluster, wherein the first weight is based at least on the weights of the term vectors of the first cluster, and wherein the second weight is based at least on the weights of the term vectors of the second cluster;
determining a first percentage of a list of output terms that should come from the first cluster and a second percentage of the list of output terms that should come from the second cluster, the first percentage based on a ratio of the first weight to a sum of the first and second weights, the second percentage based on a ratio of the second weight to the sum of the first and second weights;
selecting one or more terms from the first set of terms according to the determined first percentage;
selecting one or more terms from the second set of terms according to the determined second percentage;
combining the selected terms from the first and second sets of terms into the list of output terms, the list of output terms having the first and second concepts of the text; and
storing the list of output terms in one or more memory units.
Dependent claims: 18, 19, 20, 21, 22, 23
Specification