Method and system for optimally searching a document database using a representative semantic space
Abstract
A term-by-document matrix, which records the frequency of occurrence of each term in each document, is compiled from a corpus of documents representative of a particular subject matter. A weighted term dictionary is created using a global weighting algorithm and then applied to the term-by-document matrix, forming a weighted term-by-document matrix. A term vector matrix and a singular value concept matrix are computed by singular value decomposition of the weighted term-by-document matrix. The k largest singular concept values are kept and all others are set to zero, thereby reducing the term vector matrix and the singular value concept matrix to k concept dimensions. The reduced term vector matrix, the reduced singular value concept matrix, and the weighted term dictionary can be used to project pseudo-document vectors, representing documents not appearing in the original document corpus, into a representative semantic space. The similarities of those documents can be ascertained from the positions of their respective pseudo-document vectors in the representative semantic space.
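
The pipeline described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract does not name the global weighting algorithm, so an inverse-document-frequency weight is assumed here, and the function names `build_semantic_space`, `fold_in`, and `cosine` are hypothetical. The projection uses the standard latent semantic indexing fold-in, which may differ in detail from the specification.

```python
import numpy as np

def build_semantic_space(counts, k):
    """Build a reduced semantic space from a terms x documents count matrix."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    # Assumed global weight: inverse document frequency (the abstract names none).
    doc_freq = np.count_nonzero(counts, axis=1)
    global_w = np.log(n_docs / np.maximum(doc_freq, 1)) + 1.0
    # Apply the weighted term dictionary to the term-by-document matrix.
    weighted = counts * global_w[:, None]
    # Singular value decomposition; singular values come back in descending order.
    U, s, _ = np.linalg.svd(weighted, full_matrices=False)
    # Keep only the k largest singular values and their term vectors.
    return U[:, :k], s[:k], global_w

def fold_in(term_counts, U_k, s_k, global_w):
    """Project a document outside the corpus into the k-dimensional space."""
    weighted = np.asarray(term_counts, dtype=float) * global_w
    # Standard fold-in: d^T U_k Sigma_k^{-1}; assumes the kept singular values are nonzero.
    return (weighted @ U_k) / s_k

def cosine(a, b):
    """Cosine similarity of two pseudo-document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A new document is represented by its term counts over the corpus vocabulary, so terms that never occur in the corpus are simply ignored. Two documents can then be compared with `cosine(fold_in(d1, ...), fold_in(d2, ...))` without recomputing the decomposition.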
31 Claims
1. A method for comparing documents for similarity comprising:
generating a first pseudo-document vector utilizing a first document and a term-by-concept matrix determined from a set of documents that does not include the first document;
generating a second pseudo-document vector utilizing a second document and said term-by-concept matrix; and
determining similarity of said first document and said second document based on said first pseudo-document vector and said second pseudo-document vector.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method for searching a collection of documents, the method comprising:
defining a representative semantic space, wherein the defining includes:
  compiling a corpus having a plurality of documents;
  identifying terms occurring in the corpus;
  constructing a weighted term-by-document matrix from weighted term frequencies of the corpus;
  decomposing the weighted term-by-document matrix into at least a term-by-concept matrix and a singular value matrix; and
  forming a reduced term-by-concept matrix and a reduced singular value matrix;
compiling a document collection having at least one document that is not in the corpus; and
using the representative semantic space to generate a pseudo-document vector for each document in a set of multiple documents in the document collection.
Dependent claims: 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
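
The decomposing, reducing, and projecting steps recited in claim 8 can be written compactly in conventional latent semantic indexing notation. The symbols below are assumptions chosen for illustration and are not taken from the claims: W is the weighted term-by-document matrix (t terms by n documents), U is the term-by-concept matrix, Σ is the singular value matrix, k is the number of retained concept dimensions, and d is the weighted term vector of a document outside the corpus.

```latex
% Singular value decomposition of the weighted term-by-document matrix
W = U \Sigma V^{T}, \qquad
U \in \mathbb{R}^{t \times r},\;
\Sigma = \operatorname{diag}(\sigma_1 \ge \dots \ge \sigma_r > 0),\;
V \in \mathbb{R}^{n \times r}

% Reduction: keep the k largest singular values
W_k = U_k \Sigma_k V_k^{T}, \qquad
U_k \in \mathbb{R}^{t \times k},\;
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k)

% Standard fold-in of a document d \in \mathbb{R}^{t} not in the corpus
\hat{d} = \Sigma_k^{-1} U_k^{T} d \in \mathbb{R}^{k}
```

Under this convention, folding in a column of W itself reproduces the corresponding row of V_k, which is why corpus documents and new documents can share one space.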
20. A system comprising:
a network;
a server connected to said network, said server comprising:
  an interface for interfacing with the network;
  a memory, said memory containing instructions for:
    identifying terms occurring in a plurality of documents;
    determining for each document a local term weight for each identified term;
    creating a weighted term-by-document matrix using at least the local term weights for each of the respective identified terms;
    decomposing the weighted term-by-document matrix into at least a term-by-concept matrix and a singular value matrix;
    using the term-by-concept matrix and the singular value matrix to determine a pseudo-document vector for a document not included in said plurality of documents; and
  a processor for executing instructions from memory.
Dependent claims: 21
22. A program product stored on a computer readable storage medium for performing a method, the program product comprising:
instructions for generating a first pseudo-document vector using a first document and a term-by-concept matrix determined from a set of documents that does not include the first document;
instructions for generating a second pseudo-document vector using a second document and the term-by-concept matrix; and
instructions for determining similarity of said first document and said second document based on said first pseudo-document vector and said second pseudo-document vector.
Dependent claims: 23, 24, 25
26. A computer implemented search method that comprises:
applying a representative semantic space to a document set to obtain a set of pseudo-document vectors;
applying the representative semantic space to a query to obtain a query pseudo-document vector; and
ranking documents from the document set by comparing the corresponding pseudo-document vectors to the query pseudo-document vector.
Dependent claims: 27, 28, 29, 30, 31
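
The ranking step of claim 26 might look like the following sketch. Claim 26 says only that the vectors are compared, so cosine similarity is an assumption here, and `rank_documents` is a hypothetical helper name. The inputs are assumed to be pseudo-document vectors already obtained from the same representative semantic space.

```python
import numpy as np

def rank_documents(doc_vectors, query_vector):
    """Order documents from most to least similar to the query (cosine, assumed)."""
    docs = np.asarray(doc_vectors, dtype=float)   # shape: (n_docs, k)
    q = np.asarray(query_vector, dtype=float)     # shape: (k,)
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(q)
    # Zero-length vectors get a score of 0 instead of raising a division warning.
    scores = np.divide(docs @ q, norms, out=np.zeros(len(docs)), where=norms > 0)
    # Stable sort on negated scores keeps the original order among ties.
    order = np.argsort(-scores, kind="stable")
    return order, scores[order]
```

The function returns both the document indices in ranked order and their scores, so a caller can apply a similarity cutoff or return only the top few results.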
Specification