Text influenced molecular indexing system and computer-implemented and/or computer-assisted method for same
Abstract
An extension of the vector space model for computing chemical similarity using textual and chemical descriptors is described. The method uses a chemical and/or textual description of a molecule/chemical and decomposes a molecule/chemical descriptor matrix by a suitable technique such as singular value decomposition to create a low dimensional representation of the original descriptor space. Similarities between a user probe and the textual and/or chemical descriptors are then computed and ranked.
37 Citations
58 Claims
1. A method of calculating similarity or substantial similarity between a first chemical descriptor and at least one other chemical descriptor in a matrix representing a plurality of chemical and textual descriptors, comprising the steps of:
(a) creating at least one chemical descriptor and at least one textual descriptor for each compound in a collection of compounds;
(b) preparing a descriptor matrix X, wherein the descriptor matrix comprises;
a plurality of columns, each column representing a text source containing textual and chemical descriptions, and;
a plurality of rows, each row comprising a descriptor associated with each respective text source, wherein the entries in the descriptor matrix indicate the number of times a descriptor occurs in each respective text source;
(c) performing a singular value decomposition (SVD) of the descriptor matrix to produce resultant matrices;
(d) using at least one of the resultant matrices to compute the similarity between the first chemical descriptor di and the at least one other chemical descriptor dj; and
(e) outputting at least a subset of the at least one other chemical descriptor ranked in order of similarity to the first chemical descriptor.
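Step (b) above can be illustrated with a minimal Python sketch, assuming each text source has already been reduced to a list of descriptor tokens. The toy corpus, the `chem:` and `text:` prefixes, and the function name `build_descriptor_matrix` are hypothetical and are not taken from the claims.

```python
import numpy as np

# Hypothetical toy corpus: each text source is a list of descriptor tokens.
# The "chem:" / "text:" prefixes are an illustrative naming choice only.
text_sources = [
    ["chem:benzene_ring", "chem:hydroxyl", "text:analgesic", "text:analgesic"],
    ["chem:benzene_ring", "chem:carboxyl", "text:analgesic"],
    ["chem:amine", "text:stimulant", "chem:benzene_ring"],
]

def build_descriptor_matrix(sources):
    """Return (X, descriptors), where X[i, j] counts descriptor i in source j."""
    descriptors = sorted({d for src in sources for d in src})
    row_of = {d: i for i, d in enumerate(descriptors)}
    X = np.zeros((len(descriptors), len(sources)))
    for j, src in enumerate(sources):      # one column per text source
        for d in src:                      # one row per descriptor
            X[row_of[d], j] += 1
    return X, descriptors

X, descriptors = build_descriptor_matrix(text_sources)
```

Rows index descriptors and columns index text sources, matching the orientation recited in step (b).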
generating matrices $P$, $\Sigma$, and $Q^T$, such that descriptor matrix $X = P \Sigma Q^T$, wherein;
$P$ is an $m \times r$ matrix, called the left singular matrix ($r$ is the rank of $X$), and its columns are the eigenvectors of $X X^T$ corresponding to nonzero eigenvalues;
$Q$ is an $n \times r$ matrix, called the right singular matrix, whose columns are the eigenvectors of $X^T X$ corresponding to nonzero eigenvalues; and
$\Sigma$ is an $r \times r$ diagonal matrix whose nonzero elements, $\sigma_1, \sigma_2, \ldots, \sigma_r$, called singular values, are the square roots of the eigenvalues and have the property that $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$.
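The factorization recited above can be sketched with NumPy's `numpy.linalg.svd`. This is one possible way to obtain the three matrices, not necessarily the one used in the patent; the toy matrix, the tolerance `tol`, and the helper name `reduced_svd` are assumptions made for illustration.

```python
import numpy as np

# Toy descriptor matrix (rows = descriptors, columns = text sources);
# the values are invented for illustration.
X = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 0.0],
    [2.0, 1.0, 0.0],
])

def reduced_svd(X, tol=1e-10):
    """Return P (m x r), sigma (length r), Q (n x r) with X = P diag(sigma) Q^T."""
    P, sigma, Qt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(sigma > tol))           # numerical rank: nonzero singular values
    return P[:, :r], sigma[:r], Qt[:r, :].T

P, sigma, Q = reduced_svd(X)

# The singular values are the square roots of the eigenvalues of X^T X.
eigenvalues = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
```

NumPy returns the singular values in non-increasing order, which corresponds to the ordering property stated for the diagonal of the middle matrix.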
5. The method as recited in claim 4 wherein said computing step comprises the step of computing the dot product between the ith and jth rows of the matrix $P \Sigma$.
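The dot product of claim 5 can be sketched as follows, assuming the singular value decomposition is computed with NumPy. The toy matrix and the function name `rank_similar_descriptors` are hypothetical.

```python
import numpy as np

# Toy descriptor matrix (rows = descriptors, columns = text sources).
X = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 3.0, 0.0],
    [2.0, 1.0, 1.0],
])

def rank_similar_descriptors(X, i):
    """Rank every other descriptor by the dot product of rows of P @ diag(sigma)."""
    P, sigma, _ = np.linalg.svd(X, full_matrices=False)
    PS = P * sigma                 # scales column c of P by sigma[c]
    scores = PS @ PS[i]            # dot product of row i with every row
    order = np.argsort(-scores)    # highest similarity first
    return [(int(j), float(scores[j])) for j in order if j != i]

ranked = rank_similar_descriptors(X, 0)
```

With all singular values retained, these scores equal the entries of $X X^T$; the low dimensional behaviour described in the abstract appears only once the decomposition is truncated to fewer dimensions.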
6. The method as recited in claim 1 wherein the first chemical descriptor is initially an ad hoc query vector q, further comprising the step of:
determining a matrix $X_k$, wherein $X_k$ is the matrix of rank $k$ which is equivalent to $P_k \Sigma_k Q_k^T$, and is the least squares closest to $X$; and
projecting the ad hoc query vector onto $X_k$.
7. The method as recited in claim 6 wherein the ad hoc query vector q is defined as being equal to $q^T P \Sigma_k^{-1}$.
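The rank-k matrix of claim 6 and the query projection of claim 7 can be sketched as below. The sketch assumes that the left singular matrix is truncated to its first k columns (written `P_k`) so that the product with the inverse of the k x k diagonal matrix is conformable; that truncation, the toy data, and the function names are assumptions rather than statements from the claims.

```python
import numpy as np

# Toy descriptor matrix (rows = descriptors, columns = text sources).
X = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 3.0, 0.0],
    [2.0, 1.0, 1.0],
])

def truncate(X, k):
    """Return P_k (m x k), sigma_k (length k), Q_k (n x k)."""
    P, sigma, Qt = np.linalg.svd(X, full_matrices=False)
    return P[:, :k], sigma[:k], Qt[:k, :].T

def rank_k_approximation(X, k):
    """Rank-k matrix X_k = P_k diag(sigma_k) Q_k^T, closest to X in least squares."""
    P_k, sigma_k, Q_k = truncate(X, k)
    return (P_k * sigma_k) @ Q_k.T

def project_query(X, q, k):
    """Map a descriptor-count query q (length m) to k dimensions: q^T P_k Sigma_k^{-1}."""
    P_k, sigma_k, _ = truncate(X, k)
    return (q @ P_k) / sigma_k     # dividing by sigma_k applies the inverse diagonal

# Hypothetical ad hoc query that mentions descriptors 0 and 3 once each.
q = np.array([1.0, 0.0, 0.0, 1.0])
q_hat = project_query(X, q, k=2)
```

A useful sanity check is that projecting column j of the descriptor matrix reproduces row j of `Q_k`, so a projected query lives in the same space as the text sources.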
8. A method of calculating similarity or substantial similarity between a first document Vi and at least one other document Vj in a matrix representing a plurality of chemical and textual descriptors, comprising the steps of:
(a) creating at least one chemical descriptor and at least one text descriptor for each compound in each document;
(b) preparing a descriptor matrix X, wherein the descriptor matrix comprises;
a plurality of columns, each column representing a text source containing textual and chemical descriptions, and;
a plurality of rows, each row comprising a descriptor associated with each respective text source, wherein the entries in the descriptor matrix indicate the number of times a descriptor occurs in each respective text source;
(c) performing a singular value decomposition (SVD) of the descriptor matrix to produce resultant matrices;
(d) using at least one of the resultant matrices to compute the similarity between the first document and the at least one other document; and
(e) outputting at least a subset of the at least one other document ranked in order of similarity to the first document.
generating matrices $P$, $\Sigma$, and $Q^T$, such that descriptor matrix $X = P \Sigma Q^T$, wherein;
$P$ is an $m \times r$ matrix, called the left singular matrix ($r$ is the rank of $X$), and its columns are the eigenvectors of $X X^T$ corresponding to nonzero eigenvalues;
$Q$ is an $n \times r$ matrix, called the right singular matrix, whose columns are the eigenvectors of $X^T X$ corresponding to nonzero eigenvalues; and
$\Sigma$ is an $r \times r$ diagonal matrix whose nonzero elements, $\sigma_1, \sigma_2, \ldots, \sigma_r$, called singular values, are the square roots of the eigenvalues and have the property that $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$.
12. The method as recited in claim 11 wherein said computing step comprises the step of computing the dot product between the ith and jth rows of the matrix $Q \Sigma$.
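For the document-to-document case of claim 12, the same idea applies to the rows of the right singular matrix scaled by the singular values. The sketch below is illustrative only: the optional parameter `k`, which keeps only the first k singular triplets, is an assumption borrowed from the rank-k matrix recited elsewhere in the claims, and the function name is hypothetical.

```python
import numpy as np

# Toy descriptor matrix (rows = descriptors, columns = documents).
X = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 3.0, 0.0],
    [2.0, 1.0, 1.0],
])

def rank_similar_documents(X, i, k=None):
    """Rank every other document by the dot product of rows of Q @ diag(sigma)."""
    _, sigma, Qt = np.linalg.svd(X, full_matrices=False)
    QS = (Qt.T * sigma)[:, :k]     # k=None keeps every dimension
    scores = QS @ QS[i]
    order = np.argsort(-scores)    # highest similarity first
    return [(int(j), float(scores[j])) for j in order if j != i]

ranked_documents = rank_similar_documents(X, 0)
```

Without truncation the scores coincide with the entries of $X^T X$, that is, with plain dot products between document columns.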
13. The method as recited in claim 8 wherein the first document is initially an ad hoc query vector q, further comprising the step of:
determining a matrix $X_k$, wherein $X_k$ is the matrix of rank $k$ which is equivalent to $P_k \Sigma_k Q_k^T$, and is the least squares closest to $X$; and
projecting the ad hoc query vector onto $X_k$.
14. The method as recited in claim 13 wherein the ad hoc query vector q is defined as being equal to $q^T P \Sigma_k^{-1}$.
15. A method of calculating similarity or substantial similarity between a chemical descriptor dj and at least one document Vi in a matrix representing a plurality of chemical and textual descriptors, comprising the steps of:
(a) creating at least one chemical descriptor and at least one text descriptor for each compound in each document;
(b) preparing a descriptor matrix X, wherein the descriptor matrix comprises;
a plurality of columns, each column representing a text source containing textual and chemical descriptions, and;
a plurality of rows, each row comprising a descriptor associated with each respective text source, wherein the entries in the descriptor matrix indicate the number of times a descriptor occurs in each respective text source;
(c) performing a singular value decomposition (SVD) of the descriptor matrix to produce resultant matrices;
(d) using at least one of the resultant matrices to compute the similarity between at least one of the at least one document Vi and chemical descriptor dj; and
(e) outputting at least a subset of the at least one document ranked in order of similarity to the chemical descriptor.
generating matrices $P$, $\Sigma$, and $Q^T$, such that descriptor matrix $X = P \Sigma Q^T$, wherein;
$P$ is an $m \times r$ matrix, called the left singular matrix ($r$ is the rank of $X$), and its columns are the eigenvectors of $X X^T$ corresponding to nonzero eigenvalues;
$Q$ is an $n \times r$ matrix, called the right singular matrix, whose columns are the eigenvectors of $X^T X$ corresponding to nonzero eigenvalues; and
$\Sigma$ is an $r \times r$ diagonal matrix whose nonzero elements, $\sigma_1, \sigma_2, \ldots, \sigma_r$, called singular values, are the square roots of the eigenvalues and have the property that $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$.
19. The method as recited in claim 18 wherein said computing step comprises the step of computing the dot product between the ith row of the matrix $P \Sigma$ and the jth row of the matrix $Q \Sigma$.
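The descriptor-to-document comparison of claim 19 can be sketched by following the wording of the claim as it appears here, taking one row from each scaled singular matrix. The toy data and the function name `rank_documents_for_descriptor` are hypothetical.

```python
import numpy as np

# Toy descriptor matrix (rows = descriptors, columns = documents).
X = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 3.0, 0.0],
    [2.0, 1.0, 1.0],
])

def rank_documents_for_descriptor(X, i):
    """Score each document j by (row i of P diag(sigma)) . (row j of Q diag(sigma))."""
    P, sigma, Qt = np.linalg.svd(X, full_matrices=False)
    PS = P * sigma                 # descriptor coordinates
    QS = Qt.T * sigma              # document coordinates
    scores = QS @ PS[i]            # one score per document
    order = np.argsort(-scores)    # highest similarity first
    return [(int(j), float(scores[j])) for j in order]

ranked_documents = rank_documents_for_descriptor(X, 0)
```

Under this reading the score matrix is $P \Sigma^2 Q^T$, so each shared dimension is weighted by its squared singular value.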
20. The method as recited in claim 15 wherein the chemical descriptor is initially an ad hoc query vector q, further comprising the step of:
determining a matrix $X_k$, wherein $X_k$ is the matrix of rank $k$ which is equivalent to $P_k \Sigma_k Q_k^T$, and is the least squares closest to $X$; and
projecting the ad hoc query vector onto $X_k$.
21. The method as recited in claim 20 wherein the ad hoc query vector q is defined as being equal to $q^T P \Sigma_k^{-1}$.
22. A method of calculating similarity or substantial similarity between a textual descriptor dj and at least one document Vi in a matrix representing a plurality of chemical and textual descriptors, comprising the steps of:
(a) creating at least one chemical descriptor and at least one textual descriptor for each compound in each document;
(b) preparing a descriptor matrix X, wherein the descriptor matrix comprises;
a plurality of columns, each column representing a text source containing textual and chemical descriptions, and;
a plurality of rows, each row comprising a descriptor associated with each respective text source, wherein the entries in the descriptor matrix indicate the number of times a descriptor occurs in each respective text source;
(c) performing a singular value decomposition (SVD) of the descriptor matrix to produce resultant matrices;
(d) using at least one of the resultant matrices to compute the similarity between at least one of the at least one document Vi and textual descriptor dj; and
(e) outputting at least a subset of the at least one document ranked in order of similarity to the chemical descriptor.
generating matrices $P$, $\Sigma$, and $Q^T$, such that descriptor matrix $X = P \Sigma Q^T$, wherein;
$P$ is an $m \times r$ matrix, called the left singular matrix ($r$ is the rank of $X$), and its columns are the eigenvectors of $X X^T$ corresponding to nonzero eigenvalues;
$Q$ is an $n \times r$ matrix, called the right singular matrix, whose columns are the eigenvectors of $X^T X$ corresponding to nonzero eigenvalues; and
$\Sigma$ is an $r \times r$ diagonal matrix whose nonzero elements, $\sigma_1, \sigma_2, \ldots, \sigma_r$, called singular values, are the square roots of the eigenvalues and have the property that $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$.
26. The method as recited in claim 25 wherein said computing step comprises the step of computing the dot product between the ith row of the matrix $P \Sigma$ and the jth row of the matrix $Q \Sigma$.
27. The method as recited in claim 22 wherein the textual descriptor dj is initially an ad hoc query vector q, further comprising the step of:
determining a matrix $X_k$, wherein $X_k$ is the matrix of rank $k$ which is equivalent to $P_k \Sigma_k Q_k^T$, and is the least squares closest to $X$; and
projecting the ad hoc query vector onto $X_k$.
28. The method as recited in claim 27 wherein the ad hoc query vector q is defined as being equal to $q^T P \Sigma_k^{-1}$.
29. A computer readable medium including instructions being executable by a computer, the instructions instructing the computer to generate a searchable representation of chemical structures, the instructions comprising:
(a) creating at least one chemical descriptor and at least one text descriptor for each compound in a collection of compounds;
(b) preparing a descriptor matrix X, wherein the descriptor matrix comprises a plurality of columns, each column representing a text source containing textual and chemical descriptions, and;
a plurality of rows, each row comprising a descriptor associated with each respective text source, wherein the entries in the descriptor matrix indicate the number of times a descriptor occurs in each respective text source;
(c) performing singular value decomposition (SVD) of the descriptor matrix to produce resultant matrices;
(d) using at least one of the resultant matrices to compute the similarity between the first chemical descriptor di and the at least one other chemical descriptor dj; and
(e) outputting at least a subset of the at least one other chemical descriptor ranked in order of similarity to the first chemical descriptor.
generating matrices $P$, $\Sigma$, and $Q^T$, such that descriptor matrix $X = P \Sigma Q^T$, wherein;
$P$ is an $m \times r$ matrix, called the left singular matrix ($r$ is the rank of $X$), and its columns are the eigenvectors of $X X^T$ corresponding to nonzero eigenvalues;
$Q$ is an $n \times r$ matrix, called the right singular matrix, whose columns are the eigenvectors of $X^T X$ corresponding to nonzero eigenvalues; and
$\Sigma$ is an $r \times r$ diagonal matrix whose nonzero elements, $\sigma_1, \sigma_2, \ldots, \sigma_r$, called singular values, are the square roots of the eigenvalues and have the property that $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$.
33. The computer readable medium as recited in claim 32 wherein said computing step comprises the step of computing the dot product between the ith and jth rows of the matrix $P \Sigma$.
34. The computer readable medium as recited in claim 29 wherein the first chemical descriptor is initially an ad hoc query vector q, further comprising the step of:
determining a matrix $X_k$, wherein $X_k$ is the matrix of rank $k$ which is equivalent to $P_k \Sigma_k Q_k^T$, and is the least squares closest to $X$; and
projecting the ad hoc query vector onto $X_k$.
35. The computer readable medium as recited in claim 34 wherein the ad hoc query vector q is defined as being equal to $q^T P \Sigma_k^{-1}$.
36. A computer readable medium for calculating the similarity between a first text source and at least one other text source in a matrix comprising a plurality of chemical and textual descriptors, comprising the steps of:
(a) creating at least one chemical descriptor and at least one text descriptor for each compound in each text source;
(b) preparing a descriptor matrix X, wherein the descriptor matrix comprises;
a plurality of columns, each column representing a text source containing textual and chemical descriptions, and;
a plurality of rows, each row comprising a descriptor associated with each respective text source, wherein the entries in the descriptor matrix indicate the number of times a descriptor occurs in each respective text source;
(c) performing a singular value decomposition (SVD) of the descriptor matrix to produce resultant matrices;
(d) using at least one of the resultant matrices to compute the similarity between the first text source Vi and the at least one other text source Vj; and
(e) outputting at least a subset of the at least one other text source ranked in order of similarity to the first text source.
generating matrices $P$, $\Sigma$, and $Q^T$, such that descriptor matrix $X = P \Sigma Q^T$, wherein;
$P$ is an $m \times r$ matrix, called the left singular matrix ($r$ is the rank of $X$), and its columns are the eigenvectors of $X X^T$ corresponding to nonzero eigenvalues;
$Q$ is an $n \times r$ matrix, called the right singular matrix, whose columns are the eigenvectors of $X^T X$ corresponding to nonzero eigenvalues; and
$\Sigma$ is an $r \times r$ diagonal matrix whose nonzero elements, $\sigma_1, \sigma_2, \ldots, \sigma_r$, called singular values, are the square roots of the eigenvalues and have the property that $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$.
40. The computer readable medium as recited in claim 39 wherein said computing step comprises the step of computing the dot product between the ith and jth rows of the matrix $Q \Sigma$.
41. The computer readable medium as recited in claim 36 wherein the first document is initially an ad hoc query vector q, further comprising the step of:
determining a matrix $X_k$, wherein $X_k$ is the matrix of rank $k$ which is equivalent to $P_k \Sigma_k Q_k^T$, and is the least squares closest to $X$; and
projecting the ad hoc query vector onto $X_k$.
42. The computer readable medium as recited in claim 41 wherein the ad hoc query vector q is defined as being equal to $q^T P \Sigma_k^{-1}$.
43. A computer readable medium for calculating the similarity between a chemical descriptor dj and at least one text source Vi in a matrix comprising a plurality of chemical and textual descriptors, comprising the steps of:
(a) creating at least one chemical descriptor and at least one text descriptor for each compound in each text source;
(b) preparing a descriptor matrix X, wherein the descriptor matrix comprises;
a plurality of columns, each column representing a text source containing textual and chemical descriptions, and;
a plurality of rows, each row comprising a descriptor associated with each respective text source, wherein the entries in the descriptor matrix indicate the number of times a descriptor occurs in a text source;
(c) performing a singular value decomposition (SVD) of the descriptor matrix to produce resultant matrices;
(d) using at least one of the resultant matrices to compute the similarity between at least one of the at least one text source Vi and chemical descriptor dj; and
(e) outputting at least a subset of the at least one text source ranked in order of similarity to the chemical descriptor.
generating matrices $P$, $\Sigma$, and $Q^T$, such that descriptor matrix $X = P \Sigma Q^T$, wherein;
$P$ is an $m \times r$ matrix, called the left singular matrix ($r$ is the rank of $X$), and its columns are the eigenvectors of $X X^T$ corresponding to nonzero eigenvalues;
$Q$ is an $n \times r$ matrix, called the right singular matrix, whose columns are the eigenvectors of $X^T X$ corresponding to nonzero eigenvalues; and
$\Sigma$ is an $r \times r$ diagonal matrix whose nonzero elements, $\sigma_1, \sigma_2, \ldots, \sigma_r$, called singular values, are the square roots of the eigenvalues and have the property that $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$.
47. The computer readable medium as recited in claim 46 wherein said computing step comprises the step of computing the dot product between the ith row of the matrix $P \Sigma$ and the jth row of the matrix $Q \Sigma$.
48. The computer readable medium as recited in claim 43 wherein the chemical descriptor is initially an ad hoc query vector q, further comprising the step of:
determining a matrix $X_k$, wherein $X_k$ is the matrix of rank $k$ which is equivalent to $P_k \Sigma_k Q_k^T$, and is the least squares closest to $X$; and
projecting the ad hoc query vector onto $X_k$.
49. The computer readable medium as recited in claim 48 wherein the ad hoc query vector q is defined as being equal to $q^T P \Sigma_k^{-1}$.
50. A computer readable medium for calculating the similarity between a textual descriptor dj and at least one text source Vi in a matrix comprising a plurality of chemical and textual descriptors, comprising the steps of:
(a) creating at least one chemical descriptor and at least one textual descriptor for each compound in each text source;
(b) preparing a descriptor matrix X, wherein the descriptor matrix comprises a plurality of columns, each column representing a text source containing textual and chemical descriptions, and;
a plurality of rows, each row comprising a descriptor associated with each respective text source, wherein the entries in the descriptor matrix indicate the number of times a descriptor occurs in a text source;
(c) performing a singular value decomposition (SVD) of the descriptor matrix to produce resultant matrices;
(d) using at least one of the resultant matrices to compute the similarity between at least one of the at least one text source Vi and textual descriptor dj; and
(e) outputting at least a subset of the at least one text source ranked in order of similarity to the chemical descriptor.
generating matrices $P$, $\Sigma$, and $Q^T$, such that descriptor matrix $X = P \Sigma Q^T$, wherein;
$P$ is an $m \times r$ matrix, called the left singular matrix ($r$ is the rank of $X$), and its columns are the eigenvectors of $X X^T$ corresponding to nonzero eigenvalues;
$Q$ is an $n \times r$ matrix, called the right singular matrix, whose columns are the eigenvectors of $X^T X$ corresponding to nonzero eigenvalues; and
$\Sigma$ is an $r \times r$ diagonal matrix whose nonzero elements, $\sigma_1, \sigma_2, \ldots, \sigma_r$, called singular values, are the square roots of the eigenvalues and have the property that $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$.
54. The computer readable medium as recited in claim 53 wherein said computing step comprises the step of computing the dot product between the ith row of the matrix $P \Sigma$ and the jth row of the matrix $Q \Sigma$.
55. The computer readable medium as recited in claim 50 wherein the textual descriptor dj is initially an ad hoc query vector q, further comprising the step of:
determining a matrix $X_k$, wherein $X_k$ is the matrix of rank $k$ which is equivalent to $P_k \Sigma_k Q_k^T$, and is the least squares closest to $X$; and
projecting the ad hoc query vector onto $X_k$.
56. The computer readable medium as recited in claim 55 wherein the ad hoc query vector q is defined as being equal to $q^T P \Sigma_k^{-1}$.
57. A method of calculating similarity or substantial similarity between a first chemical descriptor and at least one other chemical descriptor in a matrix representing a plurality of chemical and textual descriptors, comprising the steps of:
(a) creating at least one chemical descriptor and at least one textual descriptor for each compound in a collection of compounds;
(b) preparing a descriptor matrix X, wherein the descriptor matrix comprises;
a text source containing textual and chemical descriptions, and;
a descriptor associated with each respective text source, wherein the entries in the descriptor matrix indicate the relevancy of a descriptor with respect to a text source;
(c) performing a singular value decomposition (SVD) of the descriptor matrix to produce resultant matrices;
(d) using at least one of the resultant matrices to compute the similarity between the first chemical descriptor di and the at least one other chemical descriptor dj; and
(e) outputting at least a subset of the at least one other chemical descriptor ranked in order of similarity to the first chemical descriptor.
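Claim 57 replaces raw occurrence counts with entries that indicate the relevancy of a descriptor to a text source, without naming a weighting scheme here. One common choice, used below purely as an assumption, is a term-frequency times inverse-document-frequency weight; the function name `tfidf_weight` and the toy counts are hypothetical.

```python
import numpy as np

# Toy occurrence counts (rows = descriptors, columns = text sources).
counts = np.array([
    [1.0, 1.0, 1.0],   # descriptor present in every text source
    [2.0, 0.0, 0.0],   # descriptor present in one text source
    [0.0, 1.0, 3.0],   # descriptor present in two text sources
])

def tfidf_weight(counts):
    """Assumed relevancy weighting: count * log(n_sources / sources containing it)."""
    n_sources = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)           # sources containing each descriptor
    idf = np.log(n_sources / np.maximum(df, 1))     # guard against empty rows
    return counts * idf[:, np.newaxis]

X = tfidf_weight(counts)
```

Under this weighting a descriptor that appears in every text source receives zero weight, since it does not help distinguish one source from another. The weighted matrix can then be decomposed in the same way as a count matrix.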
58. A method of calculating similarity or substantial similarity between a first chemical descriptor and at least one other chemical descriptor in a matrix representing a plurality of chemical and textual descriptors, comprising the steps of:
(a) creating at least one chemical descriptor and at least one textual descriptor for each compound in a collection of compounds;
(b) preparing a descriptor matrix X, wherein the descriptor matrix comprises;
a text source containing textual and chemical descriptions, and;
a descriptor associated with each respective text source, wherein the entries in the descriptor matrix indicate the relevancy of a descriptor with respect to a text source;
(c) performing a decomposition operation on the descriptor matrix to produce resultant matrices;
(d) using at least one of the resultant matrices to compute the similarity between the first chemical descriptor di and the at least one other chemical descriptor dj; and
(e) outputting at least a subset of the at least one other chemical descriptor ranked in order of similarity to the first chemical descriptor.
Specification