Method and system for optimally searching a document database using a representative semantic space
Abstract
A term-by-document matrix, representing the frequency of occurrence of each term in each document, is compiled from a corpus of documents representative of a particular subject matter. A weighted term dictionary is created using a global weighting algorithm and then applied to the term-by-document matrix, forming a weighted term-by-document matrix. A term vector matrix and a singular value concept matrix are computed by singular value decomposition of the weighted term-by-document matrix. The k largest singular concept values are kept and all others are set to zero, thereby reducing the concept dimensions in the term vector matrix and the singular value concept matrix. The reduced term vector matrix, reduced singular value concept matrix, and weighted term dictionary can be used to project pseudo-document vectors, representing documents not appearing in the original document corpus, into a representative semantic space. The similarities of those documents can be ascertained from the positions of their respective pseudo-document vectors in the representative semantic space.
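The steps summarized in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the toy corpus, the helper names (`term_counts`, `pseudo_document`, `cosine`), the inverse-document-frequency global weight, and the choice `k = 2` are all hypothetical, since the abstract names no specific weighting formula or dimension.

```python
import numpy as np

# Toy corpus (hypothetical); rows of the matrix are terms, columns are documents.
corpus = [
    "latent semantic analysis of documents",
    "semantic search over document databases",
    "singular value decomposition of a matrix",
    "matrix decomposition for semantic search",
]
vocab = sorted({t for doc in corpus for t in doc.split()})
index = {t: i for i, t in enumerate(vocab)}

def term_counts(text):
    """Raw term-frequency vector over the corpus vocabulary (unknown terms ignored)."""
    v = np.zeros(len(vocab))
    for t in text.split():
        if t in index:
            v[index[t]] += 1.0
    return v

# Term-by-document matrix of raw counts.
A = np.column_stack([term_counts(d) for d in corpus])

# Assumed global weight: inverse document frequency (the abstract names no formula).
df = np.count_nonzero(A, axis=1)
global_w = np.log(len(corpus) / df) + 1.0
W = A * global_w[:, None]            # weighted term-by-document matrix

# Singular value decomposition; keep only the k largest singular values.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

def pseudo_document(text):
    """Project a document, possibly outside the corpus, into the k-dimensional space."""
    return (term_counts(text) * global_w) @ U_k / s_k

def cosine(a, b):
    """Cosine similarity between two pseudo-document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

p1 = pseudo_document("semantic analysis of a document database")
p2 = pseudo_document("decomposition of a matrix")
print(cosine(p1, p2))
```

Note that only `U_k`, `s_k`, and `global_w` are needed to project new documents; the document-side factor `Vt` is not used after the decomposition.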
100 Claims
1. A method for comparing documents for similarity by representative latent semantic analysis comprising:
generating a weighted term-by-document matrix for a group of selected documents, said group of selected documents being representative of a similarity criteria;
decomposing said weighted term-by-document matrix into a term matrix of terms occurring in said group of selected documents and a reduced concepts matrix of concepts, at least a portion of said concepts indicative of latent semantics in at least one of said selected documents;
generating a first pseudo-document vector utilizing a first document and said reduced concepts matrix;
generating a second pseudo-document vector utilizing a second document and said reduced concepts matrix;
comparing said first pseudo-document vector to said second pseudo-document vector; and
determining similarity of said first document and said second document based on the comparison of said first pseudo-document vector to said second pseudo-document vector.
Dependent claims: 2-20
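In conventional latent semantic analysis, generating a pseudo-document vector is often written as a "folding-in" projection, and the comparison as a cosine. The notation below is an assumption for illustration and does not appear in the claim: W is the weighted term-by-document matrix, U_k the term matrix truncated to k concepts, Σ_k the reduced concepts (singular value) matrix, d a weighted term vector for a document, and d̂ its pseudo-document vector.

```latex
W \approx U_k \Sigma_k V_k^{\mathsf T}, \qquad
\hat{d} = \Sigma_k^{-1} U_k^{\mathsf T} d, \qquad
\operatorname{sim}(d_1, d_2) =
  \frac{\hat{d}_1 \cdot \hat{d}_2}{\lVert \hat{d}_1 \rVert \, \lVert \hat{d}_2 \rVert}
```

Under this reading, the document-side factor V_k is not needed to project or compare documents that were never part of the original group.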
21. A method for comparing documents for similarity by representative latent semantic analysis comprising:
selecting documents based on a similarity criteria;
generating a weighted term-by-document matrix for said selected documents;
decomposing said weighted term-by-document matrix into a reduced concepts matrix, a term matrix, and a document matrix;
discarding said document matrix;
generating a first pseudo-document vector utilizing a first document, said reduced concepts matrix and said term matrix;
generating a second pseudo-document vector utilizing a second document, said reduced concepts matrix and said term matrix;
comparing said first pseudo-document vector to said second pseudo-document vector; and
determining similarity of said first document and said second document based on the comparison of said first pseudo-document vector to said second pseudo-document vector.
Dependent claims: 22-31
32. A method for comparing documents for similarity by representative latent semantic analysis comprising:
creating a concept database comprising:
  compiling a corpus from a plurality of documents representative of a similarity criteria;
  identifying terms occurring in the corpus;
  finding a global term weight for each identified term based on a frequency of occurrences in the corpus;
  finding a local term weight for each identified term based on a frequency of occurrences in each of the plurality of documents in the corpus;
  organizing said local term weights in a term-by-document matrix for the identified terms;
  creating a weighted term-by-document matrix by applying global term weights to corresponding local term weights for each of the respective identified terms in the term-by-document matrix;
  decomposing the weighted term-by-document matrix into at least a term matrix and a concepts matrix;
  forming a reduced concepts matrix by eliminating indices in said concept matrix identifying weak concepts; and
  forming a reduced term matrix by eliminating indices in said term matrix corresponding to the indices eliminated from the concept matrix;
creating a document database from a collection of documents and said concept database comprising:
  compiling said collection of documents;
  generating a plurality of pseudo-document vectors utilizing said collection of documents and each of said reduced concepts matrix and said term matrix of said concept database; and
determining document similarity between documents in the collection of documents comprising:
  comparing pseudo-document vectors from the document database for respective documents in the collection of documents; and
  determining similarity of said respective documents based on the comparison of the respective pseudo-document vectors from the document database.
Dependent claims: 33-37
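Claim 32 distinguishes a local term weight (per document) from a global term weight (per corpus) but names no formulas. The sketch below assumes log-entropy weighting, a common choice in latent semantic analysis; the function name `log_entropy_weights` and the toy documents are hypothetical, and the corpus must contain at least two documents.

```python
import math
from collections import Counter

def log_entropy_weights(docs):
    """Return (terms, global_weights, weighted_matrix) for tokenized documents.

    weighted_matrix[i][j] is the local weight of term i in document j
    multiplied by the global weight of term i.
    """
    n = len(docs)                                  # assumes n >= 2
    counts = [Counter(d) for d in docs]
    terms = sorted(set().union(*docs))
    global_w = {}
    for t in terms:
        gf = sum(c[t] for c in counts)             # occurrences in the whole corpus
        entropy = 0.0
        for c in counts:
            if c[t]:
                p = c[t] / gf
                entropy += p * math.log(p)
        # 1.0 for a term concentrated in one document, 0.0 for an evenly spread term.
        global_w[t] = 1.0 + entropy / math.log(n)
    # Local weight log(1 + tf), scaled by the term's global weight.
    matrix = [[math.log(1 + c[t]) * global_w[t] for c in counts] for t in terms]
    return terms, global_w, matrix

docs = [
    ["java", "developer", "java"],
    ["java", "tester"],
    ["python", "developer"],
]
terms, global_w, matrix = log_entropy_weights(docs)
for t, row in zip(terms, matrix):
    print(t, round(global_w[t], 3), [round(x, 3) for x in row])
```

Terms spread across many documents receive a low global weight, so they contribute less to the weighted term-by-document matrix than terms concentrated in a few documents.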
38. A method for comparing documents for similarity by representative latent semantic analysis comprising:
defining a representative semantic space;
creating a first pseudo-document vector for a first document using at least a portion of the definition for the representative semantic space;
creating a second pseudo-document vector for a second document using at least a portion of the definition for the representative semantic space;
comparing the first pseudo-document vector to the second pseudo-document vector; and
determining similarity of the first document and the second document based on the comparison of the first pseudo-document vector to the second pseudo-document vector.
Dependent claims: 39-44
45. A method for comparing resumés for similarity by representative latent semantic analysis comprising:
generating a weighted term-by-resumé matrix for a group of resumés;
decomposing said weighted term-by-resumé matrix into a term matrix of terms occurring in said group of resumés and a reduced concepts matrix of concepts, at least a portion of said concepts indicative of latent semantics in at least one of said resumés;
identifying a collection of resumés, wherein at least one resumé in the collection of resumés is not included in said group of resumés;
generating a plurality of pseudo-resumé vectors utilizing said collection of resumés and said reduced concepts matrix;
generating a query pseudo-document vector utilizing a query document and said reduced concepts matrix;
comparing said query pseudo-document vector to said plurality of pseudo-resumé vectors; and
ranking said plurality of resumés based on similarity between said query pseudo-document vector to said plurality of pseudo-resumé vectors.
Dependent claims: 46-57
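The comparing and ranking steps of claim 45 can be illustrated with a short sketch. It assumes cosine similarity as the comparison measure, which the claim does not specify, and uses made-up two-dimensional vectors in place of real pseudo-resumé vectors; the names `rank_by_similarity`, `resume_vectors`, and `query_vector` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_similarity(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical pseudo-resume vectors already projected into a 2-concept space.
resume_vectors = {
    "resume_a": [0.9, 0.1],
    "resume_b": [0.2, 0.8],
    "resume_c": [0.7, 0.6],
}
query_vector = [1.0, 0.0]   # e.g. a projected job description

ranking = rank_by_similarity(query_vector, resume_vectors)
for name, score in ranking:
    print(name, round(score, 3))
```

Because the pseudo-resumé vectors can be computed once and stored, each new query needs only one projection and one pass over the stored vectors.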
58. A method for comparing patent disclosures for similarity by representative latent semantic analysis comprising:
generating a weighted term-by-patent matrix for a group of patent disclosures;
decomposing said weighted term-by-patent disclosure matrix into a term matrix of terms occurring in said group of patent disclosures and a reduced concepts matrix of concepts, at least a portion of said concepts indicative of latent semantics in at least one of said patent disclosures;
generating a plurality of pseudo-patent vectors utilizing said reduced concepts matrix and a plurality of patent disclosures;
generating a query pseudo-document vector utilizing a query document and said reduced concepts matrix;
comparing said query pseudo-document vector to said plurality of pseudo-patent disclosure vectors; and
ranking said plurality of patent disclosures based on similarity between said query pseudo-document vector to said plurality of pseudo-patent vectors.
Dependent claims: 59-71
72. A system for comparing documents for similarity by representative latent semantic analysis comprising:
a network containing a plurality of network elements;
a server connected to said network, said server comprising:
an interface for interfacing with the network;
a memory, said memory containing instructions for:
identifying terms occurring in the plurality of documents;
determining a global term weight for each identified term;
determining a local term weight for each identified term;
organizing said local term weights in a term-by-document matrix for the identified terms;
creating a weighted term-by-document matrix for global term weights and local term weight for each of the respective identified terms in the weighted term-by-document matrix;
decomposing the weighted term-by-document matrix into at least a term matrix and a reduced concepts matrix;
generating a pseudo-document vector utilizing a document, said reduced concepts matrix, and said term matrix;
comparing pseudo-document vectors; and
determining similarity of respective documents based on the comparison of their respective pseudo-document vectors; and
a processor for executing instructions from memory.
Dependent claims: 73-78
79. A search engine method comparing documents for similarity by representative latent semantic analysis comprising:
receiving a corpus of documents;
creating a concept database from said corpus of documents, said concept database including at least a term matrix and a concepts matrix;
receiving a collection of documents, wherein at least one document in said collection of documents is not in said corpus of documents;
creating a document database from said collection of documents and said concept database, said document database including at least one pseudo-document vector for each of at least two documents in said collection of documents, said at least one pseudo-document vector being created utilizing said each of at least two documents in said collection of documents and each of said reduced concepts matrix and said term matrix of said concept database; and
determining document similarity between at least said each of at least two documents in said collection of documents using said pseudo-document vectors in said document database.
Dependent claim: 80
81. A data processing system implemented program product for performing a method for comparing documents for similarity by representative latent semantic analysis, the program product comprising:
instructions for generating a weighted term-by-document matrix for a group of selected documents, said group of selected documents being representative of a similarity criteria;
instructions for decomposing said weighted term-by-document matrix into a term matrix of terms occurring in said group of selected documents and a reduced concepts matrix of concepts, at least a portion of said concepts indicative of latent semantics in at least one of said selected documents;
instructions for generating a first pseudo-document vector utilizing a first document and said reduced concepts matrix;
instructions for generating a second pseudo-document vector utilizing a second document and said reduced concepts matrix;
comparing said first pseudo-document vector to said second pseudo-document vector; and
instructions for determining similarity of said first document and said second document based on the comparison of said first pseudo-document vector to said second pseudo-document vector.
Dependent claims: 82-100
Specification