Information processing using a hierarchy structure of randomized samples
First Claim
1. A method for information processing, said information being stored in a database of documents and including attributes, said information at least including a vector of numeral elements and information identifiers to form a matrix, said vector being a node in a hierarchy structure of said information, said method comprising the steps of:
- transforming documents in the database into vectors using a vector space model to create a document-keyword matrix;
reducing a dimension of said matrix to a predetermined order to provide a dimension reduced matrix;
randomly assigning vectors of said dimension-reduced matrix to a set of nodes;
constructing a hierarchy structure of said nodes, where the document-keyword vectors are introduced with the hierarchy structure using distance between the document-keyword vectors said hierarchy structure being layered with hierarchy levels starting from a top node;
determining parent nodes and child nodes thereof between adjacent hierarchy levels, said parent nodes being included in an upper level and said child nodes being included in a lower level;
generating relations between said parent nodes and said child nodes by providing pointers to said parent nodes and said child nodes in relation to said distance;
registering pointers by starting from a node pair having closest distance until a predetermined number of pairs being generated,providing a similarity-based query to rank said nodes with respect to said query;
executing a similarity-based information retrieval using the document-keyword matrix;
selecting said nodes to generate a cluster including said ranked nodes with respect to said query.
4 Assignments
0 Petitions
Accused Products
Abstract
A method is provided for retrieving information from massive databases (i.e., databases with millions of documents) in real time, that allows users to control the trade-off between accuracy in retrieved results and response times. The method may be applied to databases with contents, i.e., documents which have been modeled with a clearly defined metric that enables computation of distances between any two documents, so that pairs of documents which are “closer” with respect to the metric are more similar than pairs of documents which are “further apart”. Our method can be applied to similarity ranking and/or can be combined together with other methods to increase the scalability of information retrieval, detection, ranking, and tracking.
-
Citations
16 Claims
-
1. A method for information processing, said information being stored in a database of documents and including attributes, said information at least including a vector of numeral elements and information identifiers to form a matrix, said vector being a node in a hierarchy structure of said information, said method comprising the steps of:
-
transforming documents in the database into vectors using a vector space model to create a document-keyword matrix; reducing a dimension of said matrix to a predetermined order to provide a dimension reduced matrix; randomly assigning vectors of said dimension-reduced matrix to a set of nodes; constructing a hierarchy structure of said nodes, where the document-keyword vectors are introduced with the hierarchy structure using distance between the document-keyword vectors said hierarchy structure being layered with hierarchy levels starting from a top node; determining parent nodes and child nodes thereof between adjacent hierarchy levels, said parent nodes being included in an upper level and said child nodes being included in a lower level; generating relations between said parent nodes and said child nodes by providing pointers to said parent nodes and said child nodes in relation to said distance; registering pointers by starting from a node pair having closest distance until a predetermined number of pairs being generated, providing a similarity-based query to rank said nodes with respect to said query; executing a similarity-based information retrieval using the document-keyword matrix; selecting said nodes to generate a cluster including said ranked nodes with respect to said query. - View Dependent Claims (2, 3, 4)
-
-
5. An information processing system comprising a computer, an output/input interface and a database, said information being stored as documents in a database and including attributes, said information at least including a vector of numeral elements and information identifiers to form a matrix, said vector being a node in a hierarchy structure of said information, said information processing system comprising:
-
means for transforming the documents in the database into vectors using a vector space model to create a document-keyword matrix; means for reducing a dimension of said matrix to a predetermined order to provide a dimension reduced matrix; means for randomly assigning vectors of said dimension reduction matrix to a set of nodes; means for constructing a hierarchy structure of said nodes, where the document-keyword vectors are introduced with the hierarchy structure using distance between the document-keyword vectors, said hierarchy structure being layered with hierarchy levels starting from a top node; means for determining parent nodes and child nodes thereof between adjacent hierarchy levels, said parent nodes being included in an upper level and said child nodes being included in a lower level; means for generating relations between said parent nodes and said child nodes by providing pointers to said parent nodes and said child nodes in relation to said distance;
registering pointers by starting from a node pair having closest distance until a predetermined number of pairs being generated;means for providing a similarity based query to rank said nodes with respect to said query; means for selecting said nodes to generate a cluster including said ranked nodes with respect to said query. - View Dependent Claims (6, 7, 8)
-
-
9. A computer readable medium storing a computer readable program for executing a method for information processing in a computer, said information being stored in a database as documents and including attributes, said information at least including a vector of numeral elements and information identifiers to form a matrix, said vector being a node in a hierarchy structure of said information, said method comprising the steps of:
-
transforming documents in the database into vectors using a vector space model to create a document-keyword matrix; reducing a dimension of said matrix to a predetermined order to provide a dimension reduced matrix; randomly assigning vectors of said dimension-reduced matrix to a set of nodes; constructing a hierarchy structure of said nodes, where the document-keyword vectors are introduced with the hierarchy structure using distance between the document-keyword vectors said hierarchy structure being layered with hierarchy levels starting from a top node; determining parent nodes and child nodes thereof between adjacent hierarchy levels, said parent nodes being included in an upper level and said child nodes being included in a lower level; generating relations between said parent nodes and said child nodes by providing pointers to said parent nodes and said child nodes in relation to said distance;
registering pointers by starting from a node pair having closest distance until a predetermined number of pairs being generated,providing a similarity-based query to rank said nodes with respect to said query; executing a similarity-based information retrieval using the document-keyword matrix; selecting said nodes to generate a cluster including said ranked nodes with respect to said query. - View Dependent Claims (10, 11, 12)
-
-
13. A computer executable program stored in a computer readable medium for information processing being possible to be implemented into a computer, said information being stored in a database as documents and including attributes, said information at least including a vector of numeral elements and information identifiers to form a matrix, said vector being a node in a hierarchy structure of said information, said computer program executing the steps of:
-
transforming documents in the database into vectors using a vector space model to create a document-keyword matrix; reducing a dimension of said matrix to a predetermined order to provide a dimension reduced matrix; randomly assigning vectors of said dimension-reduced matrix to a set of nodes;
constructing a hierarchy structure of said nodes, where the document-keyword vectors are introduced with the hierarchy structure using distance between the document-keyword vectors said hierarchy structure being layered with hierarchy levels starting from a top node;determining parent nodes and child nodes thereof between adjacent hierarchy levels, said parent nodes being included in an upper level and said child nodes being included in a lower level; generating relations between said parent nodes and said child nodes by providing pointers to said parent nodes and said child nodes in relation to said distance;
registering pointers by starting from a node pair having closest distance until a predetermined number of pairs being generated,providing a similarity-based query to rank said nodes with respect to said query; executing a similarity-based information retrieval using the document-keyword matrix; selecting said nodes to generate a cluster including said ranked nodes with respect to said query. - View Dependent Claims (14, 15, 16)
-
Specification