Set Similarity selection queries at interactive speeds
Abstract
The similarity between a query set comprising query set tokens and a database set comprising database set tokens is determined by a similarity score. The database sets belong to a data collection set, which contains all database sets from which information may be retrieved. If the similarity score is greater than or equal to a user-defined threshold, the database set has information relevant to the query set. The similarity score is calculated with an inverse document frequency method (IDF) similarity measure independent of term frequency. The document frequency is based at least in part on the number of database sets in the data collection set and the number of database sets which contain at least one query set token. The length of the query set and the length of the database set are normalized.
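The scoring described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the formula from the specification: it assumes the idf weight is log2(1 + N / N(t)), where N is the number of database sets and N(t) the number of them containing token t; that a set's normalized length is the square root of the sum of its squared idf weights; and that the score is the sum of squared idf weights of shared tokens divided by the product of the two lengths. The function names (`idf_weights`, `normalized_length`, `idf_similarity`, `select`) are hypothetical.

```python
import math

def idf_weights(collection):
    """Map each token to an assumed idf weight log2(1 + N / df).

    N is the number of database sets and df is the number of database
    sets that contain the token.
    """
    n = len(collection)
    df = {}
    for s in collection:
        for tok in set(s):  # each set counts once per token (no term frequency)
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log2(1 + n / d) for tok, d in df.items()}

def normalized_length(tokens, idf):
    """Assumed length: sqrt of the summed squared idf weights.

    Tokens that occur in no database set have no weight and are skipped;
    this is a simplification made for the sketch.
    """
    return math.sqrt(sum(idf[t] ** 2 for t in set(tokens) if t in idf))

def idf_similarity(query, db_set, idf):
    """Score from the two normalized lengths and the idf weights of shared tokens."""
    q_len = normalized_length(query, idf)
    s_len = normalized_length(db_set, idf)
    if q_len == 0 or s_len == 0:
        return 0.0
    shared = set(query) & set(db_set)
    return sum(idf[t] ** 2 for t in shared if t in idf) / (q_len * s_len)

def select(query, collection, threshold):
    """Return the database sets whose score meets the user-defined threshold."""
    idf = idf_weights(collection)
    return [s for s in collection if idf_similarity(query, s, idf) >= threshold]
```

With this form, a database set identical to the query scores 1, and sets sharing only common (low-idf) tokens score lower than sets sharing rare ones. The linear scan in `select` is only for illustration and says nothing about how the sets are indexed for interactive speed.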
23 Claims
1. A method for calculating a similarity score of a query set comprising a query set of tokens and a first database set comprising a first database set of tokens, wherein the first database set is one of a plurality of database sets in a data collection set, comprising the steps of:
for each specific token in the query set, determining the number of database sets that contain the specific token;
for each specific token in the query set, calculating an idf weight, based at least in part on the number of database sets that contain the specific token and on the total number of database sets in the data collection set;
calculating a normalized length of the first database set;
calculating a normalized length of the query set; and
calculating a similarity score based at least in part on the normalized length of the first database set, the normalized length of the query set, and the idf weight of each of the tokens in the query set.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
10. An apparatus for calculating a similarity score of a query set comprising a query set of tokens and a first database set comprising a first database set of tokens, wherein the first database set is one of a plurality of database sets in a data collection set, comprising:
means for determining for each specific token in the query set the number of database sets in the data collection set that contain the specific token;
means for calculating an idf weight for each specific token in the query set, based at least in part on the number of database sets that contain the specific token and on the total number of database sets in the data collection set;
means for calculating a normalized length of the first database set;
means for calculating a normalized length of the query set; and
means for calculating a similarity score based at least in part on the normalized length of the first database set, the normalized length of the query set, and the idf weight of each of the tokens in the query set.
Dependent claims: 11, 12, 13, 14.
15. A computer readable medium storing computer program instructions for calculating a similarity score of a query set comprising a query set of tokens and a first database set comprising a first database set of tokens, wherein the first database set is one of a plurality of database sets in a data collection set, said computer program instructions defining the steps of:
for each specific token in the query set, determining the number of database sets in the data collection set that contain the specific token;
for each specific token in the query set, calculating an idf weight based at least in part on the number of database sets that contain the specific token and on the total number of database sets in the data collection set;
calculating a normalized length of the first database set;
calculating a normalized length of the query set; and
calculating a similarity score based at least in part on the normalized length of the first database set, the normalized length of the query set, and the idf weight of each of the tokens in the query set.
Dependent claims: 16, 17, 18, 19, 20, 21, 22, 23.
Specification