Information processing using a hierarchy structure of randomized samples

US 7,216,129 B2
Filed: 02/19/2003
Issued: 05/08/2007
Est. Priority Date: 02/15/2002
Status: Active Grant


Abstract

A method is provided for retrieving information from massive databases (i.e., databases with millions of documents) in real time, allowing users to control the trade-off between the accuracy of retrieved results and response time. The method may be applied to databases whose contents (i.e., documents) have been modeled with a clearly defined metric that enables computation of the distance between any two documents, so that pairs of documents which are “closer” with respect to the metric are more similar than pairs which are “further apart”. The method can be applied to similarity ranking and/or combined with other methods to increase the scalability of information retrieval, detection, ranking, and tracking.
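As a minimal illustration of the kind of metric the abstract refers to, the sketch below builds term-count vectors and compares them by the angle between them. The helper names `term_vector` and `angular_distance`, the whitespace tokenizer, and the choice of angular distance are all assumptions made for illustration; the patent does not prescribe a particular metric here.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary keyword occurs in the text (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]


def angular_distance(u, v):
    """Angle in radians between two vectors; smaller means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an empty document is treated as orthogonal to everything.
        return math.pi / 2
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)
```

Under this sketch, two documents that share keywords in similar proportions sit at a small angle, while documents with no keywords in common sit at a right angle.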

16 Claims

1. A method for information processing, said information being stored in a database of documents and including attributes, said information at least including a vector of numeral elements and information identifiers to form a matrix, said vector being a node in a hierarchy structure of said information, said method comprising the steps of:
- transforming documents in the database into vectors using a vector space model to create a document-keyword matrix;
  
  reducing a dimension of said matrix to a predetermined order to provide a dimension reduced matrix;
  
  randomly assigning vectors of said dimension-reduced matrix to a set of nodes;
  
  constructing a hierarchy structure of said nodes, where the document-keyword vectors are introduced with the hierarchy structure using distance between the document-keyword vectors, said hierarchy structure being layered with hierarchy levels starting from a top node;
  
  determining parent nodes and child nodes thereof between adjacent hierarchy levels, said parent nodes being included in an upper level and said child nodes being included in a lower level;
  
  generating relations between said parent nodes and said child nodes by providing pointers to said parent nodes and said child nodes in relation to said distance;
  
  registering pointers by starting from a node pair having closest distance until a predetermined number of pairs being generated, providing a similarity-based query to rank said nodes with respect to said query;
  
  executing a similarity-based information retrieval using the document-keyword matrix;
  
  selecting said nodes to generate a cluster including said ranked nodes with respect to said query.
- Dependent claims (2, 3, 4):
  - 2. The method for information processing according to the claim 1, wherein said reduction step comprises the step of reducing the dimension of said matrix using latent semantic indexing or the covariance matrix method.
  - 3. The method for information processing according to the claim 1, wherein said generating step further comprises the second step of generating another pair of pointers between a parent node and at least one child node having failed to generate said relation, said parent node being permitted to generate said pair of pointers and not having reached a predetermined number of pointers indicating child nodes.
  - 4. The method for information processing according to the claim 1, wherein said information processing is selected from the group consisting of information retrieval, information detecting, information ranking, information tracking and any combination thereof.
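The hierarchy-building steps of claim 1 (random assignment of vectors to nodes, layered levels below a top node, and parent-child pointers registered closest pair first up to a predetermined number) might be sketched as follows. This is a simplified reading, not the patented implementation: the level sizes that double from the top, the brute-force comparison of all pairs, and the names `build_levels`, `link_levels`, `build_hierarchy`, `max_children`, and `max_parents` are assumptions introduced for illustration.

```python
import math
import random


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_levels(n_items, rng):
    """Randomly assign item indices to levels; level 0 holds the single top node."""
    order = list(range(n_items))
    rng.shuffle(order)
    levels, size, start = [], 1, 0
    while start < n_items:
        levels.append(order[start:start + size])
        start += size
        size *= 2  # assumption: each level is twice as large as the one above
    return levels


def link_levels(vectors, parents, children, max_children, max_parents):
    """Register pointers between adjacent levels, closest pairs first."""
    # Brute force over all pairs, for clarity only.
    pairs = sorted(
        (euclidean(vectors[p], vectors[c]), p, c)
        for p in parents
        for c in children
    )
    child_ptrs = {p: [] for p in parents}
    parent_ptrs = {c: [] for c in children}
    for _, p, c in pairs:
        # Stop adding to a node once it holds its predetermined number of pointers.
        if len(child_ptrs[p]) < max_children and len(parent_ptrs[c]) < max_parents:
            child_ptrs[p].append(c)
            parent_ptrs[c].append(p)
    return child_ptrs, parent_ptrs


def build_hierarchy(vectors, max_children=4, max_parents=2, seed=0):
    """Return (levels, parent->children pointers, child->parents pointers)."""
    rng = random.Random(seed)
    levels = build_levels(len(vectors), rng)
    down, up = {}, {}
    for upper, lower in zip(levels, levels[1:]):
        child_ptrs, parent_ptrs = link_levels(
            vectors, upper, lower, max_children, max_parents
        )
        down.update(child_ptrs)
        up.update(parent_ptrs)
    return levels, down, up
```

Calling `build_hierarchy(vectors)` on a list of dimension-reduced vectors returns the random levels together with both pointer maps. Because each level is a random sample, construction needs no clustering pass over the data.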

5. An information processing system comprising a computer, an output/input interface and a database, said information being stored as documents in a database and including attributes, said information at least including a vector of numeral elements and information identifiers to form a matrix, said vector being a node in a hierarchy structure of said information, said information processing system comprising:
- means for transforming the documents in the database into vectors using a vector space model to create a document-keyword matrix;
  
  means for reducing a dimension of said matrix to a predetermined order to provide a dimension reduced matrix;
  
  means for randomly assigning vectors of said dimension reduction matrix to a set of nodes;
  
  means for constructing a hierarchy structure of said nodes, where the document-keyword vectors are introduced with the hierarchy structure using distance between the document-keyword vectors, said hierarchy structure being layered with hierarchy levels starting from a top node;
  
  means for determining parent nodes and child nodes thereof between adjacent hierarchy levels, said parent nodes being included in an upper level and said child nodes being included in a lower level;
  
  means for generating relations between said parent nodes and said child nodes by providing pointers to said parent nodes and said child nodes in relation to said distance;
  
  registering pointers by starting from a node pair having closest distance until a predetermined number of pairs being generated;
  
  means for providing a similarity based query to rank said nodes with respect to said query;
  
  means for selecting said nodes to generate a cluster including said ranked nodes with respect to said query.
- Dependent claims (6, 7, 8):
  - 6. The system according to the claim 5, wherein said means for reducing dimension comprises means for reducing dimension of said matrix using latent semantic indexing or the covariance matrix method.
  - 7. The system according to the claim 5, wherein said means for generating relations further comprises means for executing a second generation of a pair of pointers between a parent node and at least one child node having failed to generate said relation, said parent node being permitted to generate said pair of pointers and not having reached a predetermined number of pointers indicating child nodes.
  - 8. The system according to the claim 5, wherein said information processing is selected from the group consisting of information retrieval, information detecting, information ranking, information tracking and any combination thereof.
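The similarity-based query of claim 5 can be pictured as a descent through the hierarchy that keeps only the most promising nodes at each level. The sketch below is an assumption-laden illustration: the function `approximate_query` and its `beam` parameter are hypothetical names, and the beam width is simply one plausible way to expose the accuracy versus response-time trade-off described in the abstract.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def approximate_query(query, vectors, top, children, k, beam):
    """Rank nodes by distance to the query, exploring at most `beam` nodes per level.

    `children` maps a node id to the ids of its child nodes (hypothetical layout).
    """
    frontier = [top]
    seen = {top}
    scored = [(euclidean(query, vectors[top]), top)]
    while frontier:
        # Gather the children of every node kept at the previous level.
        candidates = set()
        for node in frontier:
            candidates.update(children.get(node, []))
        candidates -= seen
        # Keep only the `beam` candidates closest to the query.
        ranked = sorted((euclidean(query, vectors[c]), c) for c in candidates)[:beam]
        scored.extend(ranked)
        seen.update(c for _, c in ranked)
        frontier = [c for _, c in ranked]
    scored.sort()
    return [node for _, node in scored[:k]]
```

A small `beam` visits few nodes and answers quickly but may miss true neighbors; a larger `beam` approaches an exhaustive search over the hierarchy.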

9. A computer readable medium storing a computer readable program for executing a method for information processing in a computer, said information being stored in a database as documents and including attributes, said information at least including a vector of numeral elements and information identifiers to form a matrix, said vector being a node in a hierarchy structure of said information, said method comprising the steps of:
- transforming documents in the database into vectors using a vector space model to create a document-keyword matrix;
  
  reducing a dimension of said matrix to a predetermined order to provide a dimension reduced matrix;
  
  randomly assigning vectors of said dimension-reduced matrix to a set of nodes;
  
  constructing a hierarchy structure of said nodes, where the document-keyword vectors are introduced with the hierarchy structure using distance between the document-keyword vectors, said hierarchy structure being layered with hierarchy levels starting from a top node;
  
  determining parent nodes and child nodes thereof between adjacent hierarchy levels, said parent nodes being included in an upper level and said child nodes being included in a lower level;
  
  generating relations between said parent nodes and said child nodes by providing pointers to said parent nodes and said child nodes in relation to said distance;
  
  registering pointers by starting from a node pair having closest distance until a predetermined number of pairs being generated, providing a similarity-based query to rank said nodes with respect to said query;
  
  executing a similarity-based information retrieval using the document-keyword matrix;
  
  selecting said nodes to generate a cluster including said ranked nodes with respect to said query.
- Dependent claims (10, 11, 12):
  - 10. The computer readable medium according to the claim 9, wherein said reduction step comprises the step of reducing dimension of said matrix using latent semantic indexing or the covariance matrix method.
  - 11. The computer readable medium according to the claim 9, wherein said generating step further comprises a second step of generation of a pair of pointers between a parent node and at least one child node having failed to generate said relation, said parent node being permitted to generate said pair of pointers and not having reached a predetermined number of pointers indicating child nodes.
  - 12. The computer readable medium according to the claim 9, wherein said information processing is selected from the group consisting of information retrieval, information detecting, information ranking, information tracking and any combination thereof.
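Claim 10 names the covariance matrix method as one option for the reduction step. One common reading of that method is to project each document vector onto the leading eigenvectors of the keyword covariance matrix. The sketch below follows that reading using plain power iteration with deflation; the function names, the fixed iteration count, and the deterministic start vector are illustrative assumptions, and a production system would more likely rely on a numerical library.

```python
import math


def covariance_matrix(rows):
    """Return the column means and the covariance matrix of the row vectors."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / n for j in range(d)]
        for i in range(d)
    ]
    return mean, cov


def top_eigenvectors(matrix, k, iterations=200):
    """Approximate the k leading eigenvectors by power iteration with deflation."""
    d = len(matrix)
    work = [row[:] for row in matrix]
    basis = []
    for index in range(k):
        # Deterministic, slightly uneven start vector (an arbitrary choice).
        vec = [1.0 + 0.01 * (i + index) for i in range(d)]
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                return basis  # no variance left to explain
            vec = [x / norm for x in nxt]
        eigenvalue = sum(
            vec[i] * work[i][j] * vec[j] for i in range(d) for j in range(d)
        )
        basis.append(vec)
        # Deflate: remove the component just found before seeking the next one.
        for i in range(d):
            for j in range(d):
                work[i][j] -= eigenvalue * vec[i] * vec[j]
    return basis


def reduce_dimension(rows, k):
    """Project each row onto at most k leading eigenvectors of the covariance matrix."""
    mean, cov = covariance_matrix(rows)
    basis = top_eigenvectors(cov, k)
    return [
        [sum((r[j] - mean[j]) * b[j] for j in range(len(mean))) for b in basis]
        for r in rows
    ]
```

Here `k` plays the role of the predetermined order in the claim: each document vector shrinks from one entry per keyword to at most `k` entries.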

13. A computer executable program stored in a computer readable medium for information processing being possible to be implemented into a computer, said information being stored in a database as documents and including attributes, said information at least including a vector of numeral elements and information identifiers to form a matrix, said vector being a node in a hierarchy structure of said information, said computer program executing the steps of:
- transforming documents in the database into vectors using a vector space model to create a document-keyword matrix;
  
  reducing a dimension of said matrix to a predetermined order to provide a dimension reduced matrix;
  
  randomly assigning vectors of said dimension-reduced matrix to a set of nodes;
  
  constructing a hierarchy structure of said nodes, where the document-keyword vectors are introduced with the hierarchy structure using distance between the document-keyword vectors, said hierarchy structure being layered with hierarchy levels starting from a top node;
  
  determining parent nodes and child nodes thereof between adjacent hierarchy levels, said parent nodes being included in an upper level and said child nodes being included in a lower level;
  
  generating relations between said parent nodes and said child nodes by providing pointers to said parent nodes and said child nodes in relation to said distance;
  
  registering pointers by starting from a node pair having closest distance until a predetermined number of pairs being generated, providing a similarity-based query to rank said nodes with respect to said query;
  
  executing a similarity-based information retrieval using the document-keyword matrix;
  
  selecting said nodes to generate a cluster including said ranked nodes with respect to said query.
- Dependent claims (14, 15, 16):
  - 14. A computer executable program according to the claim 13, wherein said reduction step comprises the step of reducing dimension of said matrix using latent semantic indexing or the covariance matrix method.
  - 15. A computer executable program according to the claim 13, wherein said generating step further comprises the second step of generating another pair of pointers between a parent node and at least one child node having failed to generate said relation, said parent node being permitted to generate said pair of pointers and not having reached a predetermined number of pointers indicating child nodes.
  - 16. A computer executable program according to the claim 13, wherein said information processing is selected from the group consisting of information retrieval, information detecting, information ranking, information tracking and any combination thereof.
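The second step described in claim 15 handles child nodes that ended the first pass without any parent. A minimal sketch, assuming a hypothetical `child_ptrs` map from each parent id to its list of child ids and a per-parent limit `max_children`, links each such orphan to the nearest parent that still has spare capacity. The nearest-first choice is an assumption; the claim only requires that the parent has not reached its predetermined number of child pointers.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def attach_orphans(vectors, child_ptrs, orphans, max_children):
    """Second pass: give each orphan child a parent that still has room.

    Mutates `child_ptrs` in place and returns a map of orphan id to chosen parent id.
    """
    attached = {}
    for child in orphans:
        # Only parents below their predetermined pointer count may accept a child.
        open_parents = [
            p for p, kids in child_ptrs.items() if len(kids) < max_children
        ]
        if not open_parents:
            break  # every parent is full; remaining orphans stay unattached
        best = min(open_parents, key=lambda p: euclidean(vectors[p], vectors[child]))
        child_ptrs[best].append(child)
        attached[child] = best
    return attached
```

This pass keeps every node reachable from the top of the hierarchy, which matters because a node with no parent could never be returned by a top-down query.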

Current Assignee: Rakuten Group, Inc.
Original Assignee: International Business Machines Corporation
Inventors: Kobayashi, Mei; Houle, Michael Edward; Aono, Masaki
Primary Examiner(s): Ali, Mohammad
Assistant Examiner(s): Ahluwalia, Navneet K
Application Number: US10/370,224
Publication Number: US 20040162834A1
Time in Patent Office: 1,539 Days
Field of Search: 707/1-10, 707/100-104.1, 707/200-205
US Class Current: 1/1
CPC Class Codes:
G06F 16/334   Query execution G06F16/335 ...
G06F 16/3344   using natural language anal...
G06F 16/3347   using vector based model
Y10S 707/99936   Pattern matching access
Y10S 707/99943   Generating database or data...
