Information retrieval and text mining using distributed latent semantic indexing

US 7,152,065 B2
Filed: 05/01/2003
Issued: 12/19/2006
Est. Priority Date: 05/01/2003
Status: Active Grant

10 Assignments

0 Petitions

Abstract

The use of latent semantic indexing (LSI) for information retrieval and text mining operations is adapted to work on large heterogeneous data sets by first partitioning the data set into a number of smaller partitions having similar concept domains. A similarity graph network is generated in order to expose links between concept domains, which are then exploited in determining which domains to query as well as in expanding the query vector. LSI is performed on those partitioned data sets most likely to contain information related to the user query or text mining operation. In this manner, LSI can be applied to data sets that heretofore presented scalability problems. Additionally, the computation of the singular value decomposition of the term-by-document matrix can be accomplished at various distributed computers, increasing the robustness of the retrieval and text mining system while decreasing search times.
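The routing idea in the abstract (link concept domains through a similarity graph, then query only the domains likely to be relevant) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `build_similarity_graph` and `domains_to_query`, the shared-term edge weight, the `min_weight` threshold, and the toy partitions are all hypothetical.

```python
from collections import defaultdict


def build_similarity_graph(top_terms):
    """Link every pair of partitions that share at least one representative term.

    top_terms maps a partition name to the set of its representative terms.
    The edge weight is the number of shared terms (an illustrative choice).
    """
    graph = defaultdict(dict)
    names = sorted(top_terms)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = len(top_terms[a] & top_terms[b])
            if shared:
                graph[a][b] = shared
                graph[b][a] = shared
    return dict(graph)


def domains_to_query(query_terms, top_terms, graph, min_weight=1):
    """Return partitions that contain a query term, plus their linked neighbours."""
    wanted = set(query_terms)
    direct = {name for name, terms in top_terms.items() if terms & wanted}
    selected = set(direct)
    for name in direct:
        for neighbour, weight in graph.get(name, {}).items():
            if weight >= min_weight:
                selected.add(neighbour)
    return sorted(selected)


# Toy partitions (hypothetical data).
top_terms = {
    "cardiology": {"heart", "artery", "valve", "pressure"},
    "plumbing": {"pipe", "valve", "pressure", "leak"},
    "astronomy": {"star", "orbit", "galaxy"},
}
graph = build_similarity_graph(top_terms)
print(domains_to_query(["heart"], top_terms, graph))  # ['cardiology', 'plumbing']
```

Only the selected partitions would then be searched with LSI, which is where the scalability gain described in the abstract would come from.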

98 Citations

28 Claims

1. A computer-implemented method for processing a collection of data objects for use in information retrieval and data mining operations comprising the steps of:
- generating a frequency count for each term in each data object in the collection;
  
  partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information, wherein each sub-collection is based on the conceptual dependence of the data objects within;

  generating a term-by-data object matrix for each sub-collection;

  decomposing the term-by-data object matrix of each sub-collection into a reduced singular value representation;

  determining the centroid vectors of each sub-collection;

  finding a predetermined number of terms in each sub-collection closest to centroid vector; and

  developing a similarity graph network to establish similarity between sub-collections.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18)
  - 2. The method of claim 1 further comprising the step of preprocessing the documents to remove a pre-selected set of stop words prior to generating the term frequency count for each data object.
  - 3. The method of claim 2 wherein the step of preprocessing further comprises the reduction of various terms to a canonical form.
  - 4. The method of claim 1 wherein the step of partitioning the collection is performed using a bisecting k-means clustering algorithm.
  - 5. The method of claim 1 wherein the step of partitioning the collection is performed using a k-means clustering algorithm.
  - 6. The method of claim 1 wherein the step of partitioning the collection is performed using hierarchical clustering.
  - 7. The method of claim 1 wherein the predetermined number of terms is 10.
  - 8. The method of claim 1 wherein the step of determining the centroid vectors of each sub-collection uses a clustering algorithm on the reduced singular value representation of the term-by-data object matrix for said sub-collection.
  - 9. The method of claim 1 wherein the step of determining the centroid vectors of each sub-collection is based on the result of the partitioning step.
  - 10. The method of claim 1 wherein the reduced singular value representation of the term-by-data object for each sub-collection has approximately 200 orthogonal dimensions.
  - 11. The method of claim 1 wherein the step of establishing similarity between sub-collections is based on the frequency of occurrence of common terms between sub-collections.
  - 12. The method of claim 1 wherein the step of developing the similarity graph network is based on the semantic relationships between the common terms in each of the sub-collections.
  - 13. The method of claim 1 wherein the step of developing the similarity graph network is based on the product of the frequency of occurrence of common terms between sub-collections and the semantic relationships between the common terms in each of the sub-collections.
  - 14. The method of claim 11 wherein the step of developing the similarity graph network further comprises the steps of:
    - determining if a first sub-collection and a second sub-collection that have no common terms both have terms in common with one or more linking sub-collections; and
      
      choosing the linking sub-collection having the strongest link.
  - 15. The method of claim 12 wherein the step of developing the similarity graph network further comprises the steps of:
    - determining the correlation between a first sub-collection and a second sub-collection;
      
      permuting said first sub-collection against said second sub-collection;
      
      computing the Mantel test statistic for each permutation;
      
      counting the number of times where the Mantel test statistic is greater than or equal to the correlation between said first sub-collection and said second sub-collection;
      
      determining the p-value from said count;
      
      calculating the measure for a proximity of order zero;
      
      calculating the measure for the first order proximity; and
      
      determining the semantic relationship based similarity measure s2 wherein
      $s_2 = (s_{ij}^{p} + p)^{-1}$.
  - 18. The method of claim 16 wherein the method of claim 1 wherein the step of determining the centroid vectors of each sub-collection is based on the result of the partitioning step.
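The permutation steps recited in claim 15 (correlate, permute, recompute the statistic, count, derive a p-value) follow the general shape of a Mantel test. The sketch below is a generic Mantel permutation test in plain Python, not the patent's own procedure: it assumes each sub-collection is summarised by a symmetric distance matrix over the same set of common terms, uses the Pearson correlation of the upper triangles as the statistic, and uses the common `(count + 1) / (permutations + 1)` p-value estimate. The function names, the seed, and the toy matrix are hypothetical.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mantel_test(dist_a, dist_b, permutations=999, seed=0):
    """Return (observed correlation, permutation p-value) for two distance matrices."""
    rng = random.Random(seed)
    n = len(dist_a)
    flat_b = upper_triangle(dist_b)
    observed = pearson(upper_triangle(dist_a), flat_b)
    count = 0
    order = list(range(n))
    for _ in range(permutations):
        # Permute rows and columns of the first matrix together.
        rng.shuffle(order)
        permuted = [[dist_a[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(permuted), flat_b) >= observed:
            count += 1
    p_value = (count + 1) / (permutations + 1)
    return observed, p_value


# Toy distance matrix over four common terms (hypothetical data).
dist_a = [
    [0, 1, 2, 3],
    [1, 0, 4, 5],
    [2, 4, 0, 6],
    [3, 5, 6, 0],
]
observed, p_value = mantel_test(dist_a, dist_a)
print(observed, p_value)
```

Comparing a matrix with itself gives a correlation of 1 and a small p-value, since only permutations that leave the matrix unchanged can match the observed statistic. How the resulting p-value is combined with the proximity measures into s2 is left to the claim language above.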

16. A computer-implemented method of information retrieval in response to a user query from a user comprising the steps of:
- partitioning a collection of data-objects into a plurality of sub-collections based on conceptual dependence of data-objects wherein the relationship between such sub-collections is expressed by a similarity graph network;
  
  generating a query vector based on the user query;
  
  identifying all sub-collections likely to be responsive to the user query using the similarity graph network; and
  
  identifying data objects similar to query vector in each identified sub-collection;

  wherein the step of partitioning the collection of data objects further comprises the steps of:

  generating a frequency count for each term in each data object in the collection;

  partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information;
  
  generating a term-by-data object matrix for each sub-collection;
  
  decomposing the term-by-data object matrix into a reduced singular value representation;
  
  determining the centroid vectors of each sub-collection;
  
  finding a predetermined number of terms in each sub-collection closest to centroid vector; and
  
  developing a similarity graph network to establish similarity between sub-collections.
- Dependent Claims (17, 19, 20, 21, 22, 23, 24, 25)
  - 17. The method of claim 16 wherein the step of determining the centroid vectors uses a clustering algorithm on the reduced singular value representation of the term-by-data object matrix for said sub-collection.
  - 19. The method of claim 16 wherein the step of developing the similarity graph network further comprises the steps of:
    - determining if a first sub-collection and a second sub-collection that have no common terms both have terms in common with one or more linking sub-collections; and
      
      choosing the linking sub-collection having the strongest link.
  - 20. The method of claim 16 wherein the step of developing the similarity graph network further comprises the steps of:
    - determining the correlation between a first sub-collection and a second sub-collection;
      
      permuting said first sub-collection against said second sub-collection;
      
      computing the Mantel test statistic for each permutation;
      
      counting the number of times where the Mantel test statistic is greater than or equal to the correlation between said first sub-collection and said second sub-collection;
      
      determining the p-value from said count;
      
      calculating the measure for a proximity of order zero;
      
      calculating the measure for the first order proximity; and
      
      determining the semantic relationship based similarity measure s2 wherein
      $s_2 = (s_{ij}^{p} + p)^{-1}$.
  - 21. The method of claim 16 further comprising the step of preprocessing the documents to remove a pre-selected set of stop words prior to generating the term frequency count for each data object.
  - 22. The method of claim 16 wherein the step of partitioning the collection is performed using a bisecting k-means clustering algorithm.
  - 23. The method of claim 16 further comprising the steps of:
    - ranking the identified sub-collections based on the likelihood of each to contain data objects responsive to the user query;
      
      selecting which of the ranked sub-collections to query;
      
      presenting the ranked sub-collections to the user; and
      
      inputting user selection of the ranked sub-collections to be queried.
  - 24. The method of claim 16 wherein the step of generating a query vector based on the user query further comprises expanding the user query by computing the weighted sum of its projected term vectors in one or more concept domains that are similar to another concept domain that actually contains the query terms.
  - 25. The method of claim 16 further comprising the step of presenting the identified data objects to the user ranked by concept domain.
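The retrieval flow of claims 16, 23 and 25 (rank the sub-collections by how likely they are to answer the query, search the chosen ones, and present results grouped by concept domain) might look roughly like the following sketch. It makes several assumptions that are not in the claims: vectors are sparse term-to-weight dictionaries rather than reduced LSI vectors, likelihood is approximated by cosine similarity between the query vector and each centroid, and the names `rank_sub_collections`, `search`, and `top_domains` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as term -> weight dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sub_collections(query_vector, centroids):
    """Rank sub-collections by similarity of their centroid to the query."""
    scored = [(cosine(query_vector, c), name) for name, c in centroids.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored]


def search(query_vector, centroids, documents, top_domains=2):
    """Query only the best-ranked sub-collections; group results by domain."""
    results = {}
    for name, _ in rank_sub_collections(query_vector, centroids)[:top_domains]:
        docs = documents[name]
        results[name] = sorted(docs, key=lambda d: -cosine(query_vector, docs[d]))
    return results


# Toy data (hypothetical).
centroids = {
    "medicine": {"heart": 0.8, "valve": 0.4, "blood": 0.6},
    "plumbing": {"pipe": 0.9, "valve": 0.5},
    "astronomy": {"star": 1.0},
}
documents = {
    "medicine": {"doc1": {"heart": 1.0}, "doc2": {"blood": 1.0}},
    "plumbing": {"doc3": {"valve": 1.0, "pipe": 1.0}},
    "astronomy": {"doc4": {"star": 1.0}},
}
query = {"heart": 1.0, "valve": 1.0}
print(search(query, centroids, documents))
# {'medicine': ['doc1', 'doc2'], 'plumbing': ['doc3']}
```

In an interactive variant along the lines of claim 23, the ranked list from `rank_sub_collections` would be shown to the user, whose selection would replace the fixed `top_domains` cut-off.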

26. A system for the retrieval of information from a collection of data objects in response to a user query comprising:
- means for inputting a user query;
  
  one or more data servers for storing said collection of data objects and for partitioning said collection of data objects into a plurality of sub-collections based on the conceptual dependence of data objects within;
  
  generating a term-by-data object matrix for each sub-collection;
  
  an LSI processor hub in communication with each data server for:

  (i) developing a similarity graph network based on the similarity of the plurality of the partitioned sub-collections, (ii) generating a query vector based on the user query, (iii) identifying sub-collections likely to be responsive to the user query based on the similarity graph network; and

  for (iv) coordinating the identification of data objects similar to query vector in each selected sub-collection.
- Dependent Claims (27)
  - 27. The system of claim 26 further comprising a means for presenting the identified data objects to the user.

28. A system for the processing of a collection of data objects for use in information retrieval and data mining operations comprising:
- means for generating a frequency count for each term in each data object in the collection;
  
  means for partitioning the collection of data objects into a plurality of sub-collections using the term-by-data object information;
  
  means for generating a term-by-data object matrix for each sub-collection;
  
  means for decomposing the term-by-data object matrix into a reduced singular value representation;
  
  means for determining the centroid vectors of each sub-collection;
  
  means for finding a predetermined number of terms in each sub-collection closest to centroid vector; and
  
  means for developing a similarity graph network to establish similarity between sub-collections.
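Several of the processing steps recited above (term frequency counts, a term-by-data object matrix per sub-collection, a centroid vector, and the terms closest to it) can be illustrated with a small standard-library sketch. It is deliberately simplified and rests on assumptions: the singular value decomposition step is skipped, so "closest to the centroid" is approximated by the largest centroid weights in raw term space; the stop-word list, whitespace tokenisation, and function names are hypothetical. The default of 10 terms mirrors claim 7.

```python
from collections import Counter

# Illustrative stop-word list; the claims do not specify one.
STOP_WORDS = {"the", "a", "of", "and", "in", "is"}


def term_counts(text):
    """Frequency count of each non-stop-word term in one data object."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)


def term_by_object_matrix(texts):
    """Rows are terms (sorted), columns are data objects, cells are counts."""
    counts = [term_counts(t) for t in texts]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


def centroid(matrix):
    """Mean of the data-object column vectors, one weight per term."""
    n_objects = len(matrix[0])
    return [sum(row) / n_objects for row in matrix]


def top_terms(vocabulary, centroid_vector, n=10):
    """Terms with the largest centroid weight (ties broken alphabetically)."""
    ranked = sorted(zip(vocabulary, centroid_vector), key=lambda p: (-p[1], p[0]))
    return [term for term, _ in ranked[:n]]


# One toy sub-collection (hypothetical data).
texts = [
    "the heart valve",
    "valve of the heart and the artery",
    "heart pressure",
]
vocabulary, matrix = term_by_object_matrix(texts)
print(vocabulary)  # ['artery', 'heart', 'pressure', 'valve']
print(top_terms(vocabulary, centroid(matrix), n=2))  # ['heart', 'valve']
```

A fuller version would decompose `matrix` into a reduced singular value representation first and measure closeness in that reduced space; the representative terms per sub-collection are the kind of input a similarity graph between sub-collections could be built from.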

Current Assignee: Nytell Software LLC (Intellectual Ventures LLC)
Original Assignee: Telcordia Technologies Incorporated (Telefonaktiebolaget LM Ericsson)
Inventors: Behrens, Clifford A.; Bassu, Devasis
Primary Examiner(s): ROBINSON, GRETA LEE
Application Number: US10/427,595
Publication Number: US 20040220944A1
Time in Patent Office: 1,328 Days
Field of Search: 707/100, 707/102, 707/104.1, 707/5, 715/500
US Class Current: 1/1
CPC Class Codes:
- G06F 16/355   Class or cluster creation o...
- Y10S 707/99935   Query augmenting and refini...
- Y10S 707/99943   Generating database or data...
