Method for fast large scale data mining using logistic regression
First Claim
1. A computer-implemented method of using previously acquired search results to perform a new search using a computer, the method comprising:
- accessing a database of search terms and user selection clicks on retrieved documents;
calculating a matrix X, wherein matrix X is defined as a matrix having rows corresponding to rows of the database and columns corresponding to search terms in the database row;
calculating a vector Y, wherein Y is defined as a representation of a document chosen by a user corresponding to a row of the database;
calculating B vector values using the X matrix, the Y vector, and an iteratively reweighted least squares (IRLS) method, where the IRLS method executes in a time proportional to a time complexity function of O(d*k*k+imax*a2), where k is the average number of nonzero elements per row in X, d is the number of data point rows in X, a is the number of columns in X and imax is a constant;
storing the B vectors;
receiving a new set of search terms and generating a corresponding search vector z;
calculating the dot product of B and z to produce weighted scalar values of relevant documents in the database; and
ordering the relevant documents in the database using the weighted scalar values; and
displaying, on a computer monitor, an ordered list of the relevant documents corresponding to the new set of search terms;
wherein the X matrix is a binary matrix wherein a value of 1 represents that a particular keyword was used; and
wherein the Y vector comprises a binary vector wherein a 1 value represents that a particular document was selected by a user.
2 Assignments
0 Petitions
Accused Products
Abstract
A classifier using a logistic regression technique permits previously acquired search results to be used to perform a new search. A user inputs search terms and queries a database of previous search results. A logistical regression calculation is performed using sets of data such that the time execution performance is at least a factor of 10 improvement over a conventional technique. In our experiments where real world data was used, the execution time was reduced up to 353 times as compared to the conventional technique. The Iteratively Reweighted Least Squares (IRLS) method is used for the logistical regression method and beta vector values are calculated from the database data set. A vector of the user input terms is multiplied by the beta vector values to produce an ordered list of documents satisfying the user search terms.
42 Citations
16 Claims
-
1. A computer-implemented method of using previously acquired search results to perform a new search using a computer, the method comprising:
-
accessing a database of search terms and user selection clicks on retrieved documents; calculating a matrix X, wherein matrix X is defined as a matrix having rows corresponding to rows of the database and columns corresponding to search terms in the database row; calculating a vector Y, wherein Y is defined as a representation of a document chosen by a user corresponding to a row of the database;
calculating B vector values using the X matrix, the Y vector, and an iteratively reweighted least squares (IRLS) method, where the IRLS method executes in a time proportional to a time complexity function of O(d*k*k+imax*a2), where k is the average number of nonzero elements per row in X, d is the number of data point rows in X, a is the number of columns in X and imax is a constant;storing the B vectors; receiving a new set of search terms and generating a corresponding search vector z; calculating the dot product of B and z to produce weighted scalar values of relevant documents in the database; and ordering the relevant documents in the database using the weighted scalar values; and displaying, on a computer monitor, an ordered list of the relevant documents corresponding to the new set of search terms; wherein the X matrix is a binary matrix wherein a value of 1 represents that a particular keyword was used; and wherein the Y vector comprises a binary vector wherein a 1 value represents that a particular document was selected by a user. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. In a system comprising a database of keywords and corresponding related documents and computer that uses a logistic regression algorithm to search the database using user keywords, the improvement comprising:
-
a processor for calculating B vector values using an X matrix, a Y vector, and an iteratively reweighted least squares (IRLS) algorithm, wherein the X matrix is defined as a matrix having rows corresponding to rows of the database and columns corresponding to search terms in the database rows, wherein the Y vector is defined as a representation of a document chosen by a user corresponding to a row of the database, and where the IRLS method executes in a time proportional to a time complexity function of O(d*k*k+imax*a2), where k is the average number of nonzero elements per row in X, d is the number of data point rows in X, a is the number of columns in X and imax is a constant; storage means for storing the B vectors for use in association with a new user query; an input interface for receiving a new set of search terms and generating a corresponding search vector z; the processor also calculating weighted scalar values of relevant documents in the database in response to the query; and
ordering the relevant documents in the database using the weighted scalar values for display to the user; andwherein the X matrix is a binary matrix wherein a value of 1 represents that a particular keyword was used, and wherein the Y vector comprises a binary vector wherein a 1 value represents that a particular document was selected by a user. - View Dependent Claims (9, 10, 11)
-
-
12. A computer-readable storage medium having computer-executable instructions for performing a search of a database of keywords and relevant documents, the computer -executable instructions comprising:
-
calculating a matrix X, wherein matrix X is defined as a matrix having rows corresponding to rows of the database and columns corresponding to search terms in the database row; calculating a vector Y, wherein Y is defined as a representation of a document chosen by a user corresponding to a row of the database; calculating and storing B vector values using the X matrix, the Y vector, and an iteratively reweighted least squares (IRLS) algorithm, where the IRLS method executes in a time proportional to a time complexity function of O(d*k*k+imax*a2), where k is the average number of nonzero elements per row in X, d is the number of data point rows in X, a is the number of columns in X and imax is a constant; receiving a new set of search terms and generating a corresponding search vector z; multiplying B and z to produce weighted scalar values of relevant documents in the database; and arranging the relevant documents in the database in an order using the weighted scalar values; wherein the instructions further comprise assigning a value of 1 to represent that a particular keyword was used in the X matrix and assigning a value of 1 to represent that a particular document was selected by a user in the Y vector. - View Dependent Claims (13, 14, 15, 16)
-
Specification