Information retrieval and text mining using distributed latent semantic indexing
First Claim
1. A computer-implemented method for processing a collection of data objects for use in information retrieval and data mining operations comprising the steps of:
generating a frequency count for each term in each data object in the collection;
partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information, wherein each sub-collection is based on the conceptual dependence of the data objects within;
generating a term-by-data object matrix for each sub-collection;
decomposing the term-by-data object matrix of each sub-collection into a reduced singular value representation;
determining the centroid vectors of each sub-collection;
finding a predetermined number of terms in each sub-collection closest to centroid vector; and
developing a similarity graph network to establish similarity between sub-collections.
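The per-sub-collection steps of the claim (frequency counts, term-by-data object matrix, reduced singular value representation, centroid, terms closest to the centroid) can be illustrated with a minimal sketch. This is not the patented implementation. The function names, whitespace tokenization, the scaling of term and document vectors by the singular values, and the use of cosine similarity as the closeness measure are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def term_document_matrix(docs):
    """Count term frequencies; rows are terms, columns are documents."""
    # Assumption: naive lowercase whitespace tokenization.
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    A = np.array([[c[t] for c in counts] for t in vocab], dtype=float)
    return vocab, A


def reduced_svd(A, k):
    """Keep the k largest singular triplets, so A is approximated by U_k diag(s_k) Vt_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def terms_nearest_centroid(vocab, U_k, s_k, Vt_k, n_terms):
    """Return the n_terms whose reduced-space vectors are closest (by cosine) to the document centroid."""
    # Assumption: both term and document vectors are scaled by the singular values.
    doc_vecs = Vt_k.T * s_k      # one row per document
    term_vecs = U_k * s_k        # one row per term
    centroid = doc_vecs.mean(axis=0)
    norms = np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(centroid)
    cosines = term_vecs @ centroid / np.where(norms == 0, 1.0, norms)
    order = np.argsort(-cosines)[:n_terms]
    return [vocab[i] for i in order]
```

A sub-collection would be passed through `term_document_matrix`, then `reduced_svd` with a chosen rank `k`, and the resulting factors handed to `terms_nearest_centroid` to obtain its representative terms.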
Abstract
The use of latent semantic indexing (LSI) for information retrieval and text mining operations is adapted to work on large heterogeneous data sets by first partitioning the data set into a number of smaller partitions having similar concept domains. A similarity graph network is generated in order to expose links between concept domains, which are then exploited in determining which domains to query as well as in expanding the query vector. LSI is performed on those partitioned data sets most likely to contain information related to the user query or text mining operation. In this manner, LSI can be applied to datasets that heretofore presented scalability problems. Additionally, the computation of the singular value decomposition of the term-by-document matrix can be accomplished at various distributed computers, increasing the robustness of the retrieval and text mining system while decreasing search times.
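As a rough illustration of how a similarity graph network might link concept domains and guide the choice of domains to query, the sketch below connects sub-collections whose representative term lists overlap. The abstract does not specify the similarity measure; Jaccard overlap, the thresholds, and every name used here are assumptions for illustration only.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two term lists (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_similarity_graph(top_terms, threshold=0.1):
    """Build a weighted, undirected graph as an adjacency dict.

    top_terms maps a sub-collection name to its list of representative terms.
    An edge is added when the overlap reaches the (hypothetical) threshold.
    """
    graph = {name: {} for name in top_terms}
    for a, b in combinations(top_terms, 2):
        weight = jaccard(top_terms[a], top_terms[b])
        if weight >= threshold:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


def domains_to_query(graph, seeds, min_weight=0.2):
    """Expand the seed sub-collections with neighbours linked at least min_weight strongly."""
    selected = set(seeds)
    for seed in seeds:
        neighbours = graph.get(seed, {})
        selected.update(n for n, w in neighbours.items() if w >= min_weight)
    return selected
```

With this shape, a query that matches one domain directly can also be routed to closely linked domains, while unrelated domains are skipped.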
98 Citations
28 Claims
1. A computer-implemented method for processing a collection of data objects for use in information retrieval and data mining operations comprising the steps of:
generating a frequency count for each term in each data object in the collection;
partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information, wherein each sub-collection is based on the conceptual dependence of the data objects within;
generating a term-by-data object matrix for each sub-collection;
decomposing the term-by-data object matrix of each sub-collection into a reduced singular value representation;
determining the centroid vectors of each sub-collection;
finding a predetermined number of terms in each sub-collection closest to centroid vector; and
developing a similarity graph network to establish similarity between sub-collections.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18

16. A computer-implemented method of information retrieval in response to a user query from a user comprising the steps of:
partitioning a collection of data-objects into a plurality of sub-collections based on conceptual dependence of data-objects wherein the relationship between such sub-collections is expressed by a similarity graph network;
generating a query vector based on the user query;
identifying all sub-collections likely to be responsive to the user query using the similarity graph network; and
identifying data objects similar to query vector in each identified sub-collection;
wherein the step of partitioning the collection of data objects further comprises the steps of:
generating a frequency count for each term in each data object in the collection;
partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information;
generating a term-by-data object matrix for each sub-collection;
decomposing the term-by-data object matrix into a reduced singular value representation;
determining the centroid vectors of each sub-collection;
finding a predetermined number of terms in each sub-collection closest to centroid vector; and
developing a similarity graph network to establish similarity between sub-collections.
Dependent claims: 17, 19, 20, 21, 22, 23, 24, 25

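The step of identifying data objects similar to the query vector within one sub-collection can be sketched with the standard LSI fold-in, which projects a term-count query q into the reduced space as q^T U_k S_k^{-1} and compares it with the document vectors (the rows of V_k). The claim does not name a similarity measure; cosine similarity and the function names below are assumptions for illustration.

```python
import numpy as np


def fold_in_query(query_counts, U_k, s_k):
    """Project a term-count query vector into the k-dimensional LSI space.

    Computes q_hat = q^T U_k S_k^{-1}; s_k must contain non-zero singular values.
    """
    return (query_counts @ U_k) / s_k


def rank_documents(q_hat, Vt_k):
    """Return document indices sorted by descending cosine similarity, plus the scores."""
    doc_vecs = Vt_k.T            # one row per document
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat)
    scores = doc_vecs @ q_hat / np.where(denom == 0, 1.0, denom)
    return np.argsort(-scores), scores
```

Here `U_k`, `s_k`, and `Vt_k` are the truncated factors of the sub-collection's term-by-data object matrix, for example as returned by `numpy.linalg.svd` and sliced to the first k components. A query identical to an indexed document folds in to exactly that document's vector, which makes a convenient sanity check.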
26. A system for the retrieval of information from a collection of data objects in response to a user query comprising:
means for inputting a user query;
one or more data servers for storing said collection of data objects and for partitioning said collection of data objects into a plurality of sub-collections based on the conceptual dependence of data objects within;
generating a term-by-data object matrix for each sub-collection;
an LSI processor hub in communication with each data server for:
(i) developing a similarity graph network based on the similarity of the plurality of the partitioned sub-collections,
(ii) generating a query vector based on the user query,
(iii) identifying sub-collections likely to be responsive to the user query based on the similarity graph network; and
(iv) coordinating the identification of data objects similar to query vector in each selected sub-collection.
Dependent claims: 27

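The hub-and-data-server arrangement of this claim can be pictured with a small sketch in which a hub sends the query to each selected data server concurrently and merges the returned hits. Nothing here comes from the patent text beyond that general arrangement: each server is modeled as a plain callable, the names are hypothetical, and merging raw scores across sub-collections assumes the scores are comparable, which may not hold when each sub-collection has its own LSI space.

```python
from concurrent.futures import ThreadPoolExecutor


def hub_search(query, servers, selected, top_n=5):
    """Query the selected data servers in parallel and merge their ranked hits.

    servers:  dict mapping a sub-collection name to a callable(query)
              that returns a list of (doc_id, score) pairs (hypothetical interface).
    selected: names of the sub-collections chosen for this query.
    """
    names = [name for name in selected if name in servers]
    if not names:
        return []

    def ask(name):
        # Tag each hit with the sub-collection it came from.
        return [(name, doc_id, score) for doc_id, score in servers[name](query)]

    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        per_server = list(pool.map(ask, names))

    merged = [hit for hits in per_server for hit in hits]
    merged.sort(key=lambda hit: hit[2], reverse=True)
    return merged[:top_n]
```

In a real deployment the callables would be replaced by network calls to the data servers; threads are used here only to show that the per-sub-collection searches are independent.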
28. A system for the processing of a collection of data objects for use in information retrieval and data mining operations comprising:
means for generating a frequency count for each term in each data object in the collection;
means for partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information;
means for generating a term-by-data object matrix for each sub-collection;
means for decomposing the term-by-data object matrix into a reduced singular value representation;
means for determining the centroid vectors of each sub-collection;
means for finding a predetermined number of terms in each sub-collection closest to centroid vector; and
means for developing a similarity graph network to establish similarity between sub-collections.

Specification