Set Similarity selection queries at interactive speeds

US 20090171944A1
Filed: 01/02/2008
Published: 07/02/2009
Est. Priority Date: 01/02/2008
Status: Active Grant

Abstract

The similarity between a query set comprising query set tokens and a database set comprising database set tokens is determined by a similarity score. The database sets belong to a data collection set, which contains all database sets from which information may be retrieved. If the similarity score is greater than or equal to a user-defined threshold, the database set has information relevant to the query set. The similarity score is calculated with an inverse document frequency (IDF) similarity measure that is independent of term frequency. The document frequency is based at least in part on the number of database sets in the data collection set and the number of database sets that contain at least one query set token. The length of the query set and the length of the database set are normalized.
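The steps summarized in the abstract can be sketched in Python. This is a minimal illustration, not the patented implementation: the idf weight uses the log₂(1 + N/N(t)) formula recited in claims 6 and 19, but the normalized length (Euclidean norm of the idf weights) and the score (squared idf summed over shared tokens, divided by both lengths) are assumptions, since the formulas for those steps are not reproduced on this page. All function names are hypothetical, as is the choice to give tokens that appear in no database set a weight of 0.

```python
import math
from collections import Counter


def document_frequencies(collection):
    """Count, for each token, how many database sets contain it."""
    df = Counter()
    for db_set in collection:
        df.update(set(db_set))  # each set contributes at most 1 per token
    return df


def idf(token, df, n_sets):
    """idf(t) = log2(1 + N / N(t)), as recited in claims 6 and 19."""
    n_t = df.get(token, 0)
    if n_t == 0:
        return 0.0  # assumption: tokens absent from the collection weigh 0
    return math.log2(1 + n_sets / n_t)


def normalized_length(tokens, df, n_sets):
    """Assumed normalization: Euclidean norm of the idf weights."""
    return math.sqrt(sum(idf(t, df, n_sets) ** 2 for t in set(tokens)))


def similarity(query, db_set, df, n_sets):
    """Assumed score: squared idf over shared tokens, divided by both lengths."""
    q, s = set(query), set(db_set)
    denom = normalized_length(q, df, n_sets) * normalized_length(s, df, n_sets)
    if denom == 0:
        return 0.0
    return sum(idf(t, df, n_sets) ** 2 for t in q & s) / denom


def select(query, collection, threshold):
    """Return the database sets whose score meets or exceeds the threshold."""
    df = document_frequencies(collection)
    n_sets = len(collection)
    return [s for s in collection if similarity(query, s, df, n_sets) >= threshold]


# Hypothetical data: three small token sets and one query.
collection = [{"main", "st"}, {"main", "ave"}, {"oak", "st"}]
query = {"main", "st"}
matches = select(query, collection, threshold=0.9)
```

Under these assumptions a database set identical to the query scores 1 (up to floating-point rounding), and term frequency plays no role because every set is reduced to its distinct tokens before weighting. This sketch scans the whole collection; it does not attempt the improved no random access, shortest-first, or hybrid methods named in the claims.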

23 Claims

1. A method for calculating a similarity score of a query set comprising a query set of tokens and a first database set comprising a first database set of tokens, wherein the first database set is one of a plurality of database sets in a data collection set, comprising the steps of:
- for each specific token in the query set, determining the number of database sets that contain the specific token;
  
  for each specific token in the query set, calculating an idf weight, based at least in part on the number of database sets that contain the specific token and on the total number of database sets in the data collection set;
  
  calculating a normalized length of the first database set;
  
  calculating a normalized length of the query set; and
  
  calculating a similarity score based at least in part on the normalized length of the first database set, the normalized length of the query set, and the idf weight of each of the tokens in the query set.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The method of claim 1 further comprising the step of performing an improved no random access method.
  - 3. The method of claim 1 further comprising the step of performing a shortest-first method.
  - 4. The method of claim 1 further comprising the step of performing a hybrid method.
  - 5. The method of claim 1 further comprising the step of determining that the first database set contains information relevant to the query set by performing the steps of:
    - defining a threshold value;
      
      comparing the similarity score to the threshold value; and
      
      determining that the first database set contains relevant information if the similarity score is greater than or equal to the threshold value.
  - 6. The method of claim 1 further comprising the step of calculating the idf weight of a specific token according to the formula:
    - idf(qⁱ) = log₂(1 + N/N(qⁱ)), wherein:
      
      qⁱ represents the specific token;
      
      idf(qⁱ) represents the idf weight of the specific token;
      
      N represents the total number of database sets in the data collection set; and
      
      N(qⁱ) represents the number of database sets that contain the token qⁱ.
  - 7. The method of claim 6 further comprising the step of calculating the normalized length of the first database set according to the formula:
  - 8. The method of claim 7 further comprising the step of calculating the normalized length of the query set according to the formula:
  - 9. The method of claim 8 further comprising the step of calculating the similarity score of s and q according to the formula:
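As a quick worked check of the idf formula in claim 6, consider a hypothetical collection of N = 1200 database sets. The counts below are illustrative round numbers, not figures from the patent.

```latex
\begin{align*}
N(q^i) = 1200 &: \quad \mathrm{idf}(q^i) = \log_2\!\left(1 + \tfrac{1200}{1200}\right) = \log_2 2 = 1 \\
N(q^i) = 400  &: \quad \mathrm{idf}(q^i) = \log_2\!\left(1 + \tfrac{1200}{400}\right) = \log_2 4 = 2 \\
N(q^i) = 80   &: \quad \mathrm{idf}(q^i) = \log_2\!\left(1 + \tfrac{1200}{80}\right) = \log_2 16 = 4
\end{align*}
```

Because N(qⁱ) can never exceed N, the argument of the logarithm is at least 2, so any token that occurs in the collection has a weight of at least 1, and rarer tokens receive larger weights.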

10. An apparatus for calculating a similarity score of a query set comprising a query set of tokens and a first database set comprising a first database set of tokens, wherein the first database set is one of a plurality of database sets in a data collection set, comprising:
- means for determining for each specific token in the query set the number of database sets in the data collection set that contain the specific token;
  
  means for calculating an idf weight for each specific token in the query set, based at least in part on the number of database sets that contain the specific token and on the total number of database sets in the data collection set;
  
  means for calculating a normalized length of the first database set;
  
  means for calculating a normalized length of the query set; and
  
  means for calculating a similarity score based at least in part on the normalized length of the first database set, the normalized length of the query set, and the idf weight of each of the tokens in the query set.
- Dependent claims (11, 12, 13, 14):
  - 11. The apparatus of claim 10 further comprising means for processing an improved no random access process.
  - 12. The apparatus of claim 10 further comprising means for processing a shortest-first process.
  - 13. The apparatus of claim 10 further comprising means for processing a hybrid process.
  - 14. The apparatus of claim 10 further comprising means for determining that the first database set contains information relevant to the query set, comprising:
    - means for defining a threshold value;
      
      means for comparing the similarity score to the threshold value; and
      
      means for determining that the first database contains relevant information if the similarity score is greater than or equal to the threshold value.

15. A computer readable medium storing computer program instructions for calculating a similarity score of a query set comprising a query set of tokens and a first database set comprising a first database set of tokens, wherein the first database set is one of a plurality of database sets in a data collection set, said computer program instructions defining the steps of:
- for each specific token in the query set, determining the number of database sets in the data collection set that contain the specific token;
  
  for each specific token in the query set, calculating an idf weight based at least in part on the number of database sets that contain the specific token and on the total number of database sets in the data collection set;
  
  calculating a normalized length of the first database set;
  
  calculating a normalized length of the query set; and
  
  calculating a similarity score based at least in part on the normalized length of the first database set, the normalized length of the query set, and the idf weight of each of the tokens in the query set.
- Dependent claims (16, 17, 18, 19, 20, 21, 22, 23):
  - 16. The computer readable medium of claim 15 wherein said computer program instructions further comprise computer program instructions defining the step of:
    - performing an improved no random access process.
  - 17. The computer readable medium of claim 15 wherein said computer program instructions further comprise computer program instructions defining the step of:
    - performing a shortest-first process.
  - 18. The computer readable medium of claim 15 wherein said computer program instructions further comprise computer program instructions defining the step of:
    - performing a hybrid process.
  - 19. The computer readable medium of claim 15 wherein said computer program instructions further comprise computer program instructions defining the step of calculating the idf weight of a specific token according to the formula:
    - idf(qⁱ) = log₂(1 + N/N(qⁱ)), wherein:
      
      qⁱ represents the specific token;
      
      idf(qⁱ) represents the idf weight of the specific token;
      
      N represents the total number of database sets in the data collection set; and
      
      N(qⁱ) represents the number of database sets that contain the token qⁱ.
  - 20. The computer readable medium of claim 19 wherein said computer program instructions further comprise computer program instructions defining the step of calculating the normalized length of the database set according to the formula:
  - 21. The computer readable medium of claim 20 wherein said computer program instructions further comprise computer program instructions defining the step of calculating the normalized length of the query set according to the formula:
  - 22. The computer readable medium of claim 21 wherein said computer program instructions further comprise computer program instructions defining the step of calculating the similarity score according to the formula:
  - 23. The computer readable medium of claim 15, wherein said computer program instructions further comprise computer program instructions defining the step of determining that the first database set contains information relevant to the query set by performing the steps of:
    - defining a threshold value;
      
      comparing the similarity score to the threshold value; and
      
      determining that the first database set contains relevant information if the similarity score is greater than or equal to the threshold value.

Current Assignee
AT&T Labs Incorporated (AT&T, Inc.)
Original Assignee
AT&T Labs Incorporated (AT&T, Inc.)
Inventors
Divesh Srivastava, Nick Koudas, Marios Hadjieleftheriou, Amit Chandel

Granted Patent

US 7,921,100 B2
US Class Current

1/1
CPC Class Codes

G06F 16/2453 Query optimisation
