Similarity search engine for use with relational databases

US 20030182282A1
Filed: 02/13/2003
Published: 09/25/2003
Est. Priority Date: 02/14/2002
Status: Active Grant

Abstract

The invention provides a system and method for defining a schema and sending a query to a Similarity Search Engine to determine a quantitative assessment of the similarity of attributes between an anchor record and one or more target records. The Similarity Search Engine makes a similarity assessment in a single pass through the target records having multiple relationship characteristics. The Similarity Search Engine is a server configuration that comprises a Gateway for command and response routing, a Virtual Document Manager for document generation, a Search Manager for document scoring, and a Relational Database Management System for providing data persistence, data retrieval and access to User Defined Functions. The Similarity Search Engine uses a unique command syntax based on the Extensible Markup Language to implement functions necessary for similarity searching and scoring.

47 Claims

1. A method for performing similarity searching, comprising the steps of:
- receiving a request instruction from a client for initiating a similarity search;
  
  generating one or more query commands from the request instruction, each query command designating an anchor document and at least one search document;
  
  executing each query command, including:
  
  computing a normalized document similarity score having a value of between 0.00 and 1.00 for each search document in each query command for indicating a degree of similarity between the anchor document and each search document;
  
  creating a result dataset containing the computed normalized document similarity scores for each search document; and
  
  sending a response including the result dataset to the client.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
- - 2. The method of claim 1, wherein the step of generating one or more query commands further comprises identifying a schema document for defining structure of search terms, mapping of datasets providing target search values to relational database locations, and designating measures, choices and weight to be used in a similarity search.
  - 3. The method of claim 1, wherein the step of computing a normalized document similarity score comprises:
    - computing attribute token similarity scores having values of between 0.00 and 1.00 for the corresponding leaf nodes of the anchor document and a search document using designated measure algorithms;
      
      multiplying each token similarity score by a designated weighting factor;
      
      aggregating the token similarity scores using designated choice algorithms for determining a document similarity score having a value of between 0.00 and 1.00 for the search document.
  - 4. The method of claim 3, wherein:
    - the step of computing attribute token similarity scores further comprises computing attribute token similarity scores in a relational database management system;
      
      the step of multiplying each token similarity score further comprises multiplying each token similarity score in a similarity search engine; and
      
      the step of aggregating the token similarity scores further comprises aggregating the token similarity scores in the similarity search engine.
  - 5. The method of claim 1, wherein the step of generating one or more query commands comprises:
    - populating an anchor document with search criteria values;
      
      identifying documents to be searched;
      
      defining semantics for overriding parameters specified in an associated schema document;
      
      defining a structure to be used by the result dataset; and
      
      imposing restrictions on the result dataset.
  - 6. The method of claim 5, wherein the step of defining semantics comprises:
    - designating overriding measures for determining attribute token similarity scores;
      
      designating overriding choice algorithms for aggregating token similarity scores into document similarity scores; and
      
      designating overriding weights to be applied to token similarity scores.
  - 7. The method of claim 5, wherein the step of imposing restrictions is selected from the group consisting of defining a range of similarity indicia scores to be selected and defining percentiles of similarity indicia scores to be selected.
  - 8. The method of claim 1, wherein the step of computing a normalized document similarity score further comprises computing a normalized document similarity score having a value of between 0.00 and 1.00, whereby a normalized similarity indicia value of 0.00 represents no similarity matching, a value of 1.00 represents exact similarity matching, and values between 0.00 and 1.00 represent degrees of similarity matching.
  - 9. The method of claim 3, wherein the step of computing attribute token similarity scores having values of between 0.00 and 1.00 further comprises computing attribute token similarity scores having values of between 0.00 and 1.00, whereby an attribute token similarity value of 0.00 represents no similarity matching, a value of 1.00 represents exact similarity matching, and values between 0.00 and 1.00 represent degrees of similarity matching.
  - 10. The method of claim 1, wherein the step of generating one or more query commands further comprises generating one or more query commands whereby each query command includes attributes of command operation, name identification, and associated schema document identification.
  - 11. The method of claim 1, further comprising:
    - receiving a schema instruction from a client;
      
      generating a schema command document comprising the steps of:
      
      defining a structure of target search terms in one or more search documents;
      
      creating a mapping of database record locations to the target search terms;
      
      listing semantic elements for defining measures, weights and choices to be used in similarity searches; and
      
      storing the schema command document into a database management system.
  - 12. The method of claim 1, further comprising the step of representing documents and commands as hierarchical XML documents.
  - 13. The method of claim 1, wherein the step of sending a response to the client further comprises sending a response including an error message and a warning message to the client.
  - 14. The method of claim 1, wherein the step of sending a response to the client further comprises sending a response to the client containing the result datasets, whereby each result dataset includes at least one normalized document similarity score, at least one search document name, a path to the search documents having a returned score, and at least one designated schema.
  - 15. The method of claim 1, further comprising:
    - receiving a statistics instruction from a client;
      
      generating a statistics command from the statistics instruction, comprising the steps of:
      
      identifying a statistics definition to be used for generating statistics;
      
      populating an anchor document with search criteria values;
      
      identifying documents to be searched;
      
      delineating semantics for overriding measures, parsers and choices defined in a semantics clause in an associated schema document;
      
      defining a structure to be used by a result dataset;
      
      imposing restrictions to be applied to the result dataset;
      
      identifying a schema to be used for the basis of generating statistics;
      
      designating a name for the target statistics table for storing results;
      
      executing the statistics command for generating a statistics schema with statistics table, mappings and measures; and
      
      storing the statistics schema in a database management system.
  - 16. The method of claim 1, further comprising the step of executing a batch command comprising executing a plurality of commands in sequence for collecting results of several related operations.
  - 17. The method of claim 3, further comprising selecting measure algorithms from the group consisting of name equivalents, foreign name equivalents, textual, sound coding, string difference, numeric, numbered difference, ranges, numeric combinations, range combinations, fuzzy, date oriented, date to range, date difference, and date combination.
  - 18. The method of claim 3, further comprising selecting choice algorithms from the group consisting of single best, greedy sum, overall sum, greedy minimum, overall minimum, and overall maximum.
  - 19. A computer-readable medium containing instructions for controlling a computer system to implement the method of claim 1.
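
The scoring steps recited in claims 3 and 7 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `token_score`, `document_score` and `restrict_by_range` are hypothetical, `difflib.SequenceMatcher` is only a stand-in for a "string difference" measure, and the weighted average is an assumed aggregation, since the claims name the choice algorithms without defining them.

```python
from difflib import SequenceMatcher


def token_score(anchor_value: str, search_value: str) -> float:
    """Stand-in measure (assumption): string similarity in [0.0, 1.0]."""
    return SequenceMatcher(None, anchor_value.lower(), search_value.lower()).ratio()


def document_score(anchor: dict, search: dict, weights: dict) -> float:
    """Weight each token score, then aggregate into one normalized score."""
    total_weight = sum(weights[attr] for attr in anchor)
    if total_weight == 0:
        return 0.0
    weighted = sum(
        weights[attr] * token_score(anchor[attr], search.get(attr, ""))
        for attr in anchor
    )
    # Dividing by the total weight keeps the result within [0.0, 1.0].
    return weighted / total_weight


def restrict_by_range(scores: dict, low: float, high: float) -> dict:
    """Keep only documents whose score falls inside [low, high]."""
    return {name: s for name, s in scores.items() if low <= s <= high}


# Hypothetical anchor document, search documents and weights.
anchor = {"name": "John Smith", "city": "Austin"}
search_documents = {
    "doc1": {"name": "John Smith", "city": "Austin"},
    "doc2": {"name": "Jon Smyth", "city": "Austin"},
    "doc3": {"name": "Mary Jones", "city": "Dallas"},
}
weights = {"name": 2.0, "city": 1.0}

result_dataset = {
    name: round(document_score(anchor, doc, weights), 2)
    for name, doc in search_documents.items()
}
print(result_dataset)
print(restrict_by_range(result_dataset, 0.80, 1.00))
```

In this sketch an identical document scores 1.00, and the range restriction drops the clearly dissimilar document from the result dataset.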

20. A system for performing similarity searching, comprising:
- a gateway for receiving a request instruction from a client for initiating a similarity search;
  
  the gateway for generating one or more query commands from the request instruction, each query command designating an anchor document and at least one search document;
  
  a search manager for executing each query command, including:
  
  means for computing a normalized document similarity score having a value of between 0.00 and 1.00 for each search document in each query command for indicating a degree of similarity between the anchor document and each search document;
  
  means for creating a result dataset containing the computed normalized document similarity scores for each search document; and
  
  the gateway for sending a response including the result dataset to the client.
- View Dependent Claims (21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33)
- - 21. The system of claim 20, wherein the means for computing a normalized similarity score comprises:
    - a relational database management system for computing attribute token similarity scores having values of between 0.00 and 1.00 for the corresponding leaf nodes of the anchor document and a search document using designated measure algorithms; and
      
      the search manager for multiplying each token similarity score by a designated weighting factor and aggregating the token similarity scores using designated choice algorithms for determining a document similarity score having a value of between 0.00 and 1.00 for the search document.
  - 22. The system of claim 20, wherein:
    - each one or more query commands further comprises a measure designation; and
      
      the database management system further comprises designated measure algorithms for computing a token similarity score.
  - 23. The system of claim 20, wherein each query command comprises:
    - an anchor document populated with search criteria values;
      
      at least one search document;
      
      designated measure algorithms for determining token similarity scores;
      
      designated choice algorithms for aggregating token similarity scores into document similarity scores;
      
      designated weights for weighting token similarity scores;
      
      restrictions to be applied to a result dataset document; and
      
      a structure to be used by the result dataset.
  - 24. The system of claim 20, wherein the computed document similarity scores have a value of between 0.00 and 1.00, whereby a normalized similarity indicia value of 0.00 represents no similarity matching, a value of 1.00 represents exact similarity matching, and values between 0.00 and 1.00 represent degrees of similarity matching.
  - 25. The system of claim 21, wherein the relational database management system includes means for computing an attribute token similarity score having a value of between 0.00 and 1.00, whereby a token similarity indicia value of 0.00 represents no similarity matching, a value of 1.00 represents exact similarity matching, and values between 0.00 and 1.00 represent degrees of similarity matching.
  - 26. The system of claim 20, wherein each query command includes attributes of command operation, name identification, and associated schema document identification for providing a mapping of search documents to database management system locations.
  - 27. The system of claim 20, further comprising:
    - the gateway for receiving a schema instruction from a client;
      
      a virtual document manager for generating a schema command document;
      
      the schema command document comprising:
      
      a structure of target search terms in one or more search documents;
      
      a mapping of database record locations to the target search terms;
      
      semantic elements for defining measures, weights, and choices for use in searches; and
      
      a relational database management system for storing the schema command document.
  - 28. The system of claim 20, wherein each result dataset includes at least one normalized document similarity score, at least one search document name, a path to the search documents having a returned score and at least one designated schema.
  - 29. The system of claim 20, wherein each result dataset includes an error message and a warning message to the client.
  - 30. The system of claim 20, further comprising:
    - the gateway for receiving a statistics instruction from a client and for generating a statistics command from the statistics instruction;
      
      the search manager for identifying a statistics definition to be used for generating statistics, populating an anchor document with search criteria values, identifying documents to be searched, delineating semantics for overriding measures, weights and choices defined in a semantics clause in an associated schema document, defining a structure to be used by a result dataset, imposing restrictions to be applied to the result dataset, identifying a schema to be used for the basis of generating statistics, designating a name for the target statistics table for storing results; and
      
      a statistics processing module for executing the statistics command for generating a statistics schema with statistics table, mappings and measures, and storing the statistics schema in a database management system.
  - 31. The system of claim 20, further comprising the gateway for receiving a batch command from a client for executing a plurality of commands in sequence for collecting results of several related operations.
  - 32. The system of claim 21, wherein the measure algorithms are selected from the group consisting of name equivalents, foreign name equivalents, textual, sound coding, string difference, numeric, numbered difference, ranges, numeric combinations, range combinations, fuzzy, date oriented, date to range, date difference, and date combination.
  - 33. The system of claim 21, wherein the choice algorithms are selected from the group consisting of single best, greedy sum, overall sum, greedy minimum, overall minimum, and overall maximum.
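
The division of work between the gateway and the search manager in claims 20, 29 and 31 might be pictured with the following Python sketch. All class, method and field names are hypothetical, the request is modeled as a plain dictionary rather than the XML command syntax the patent describes, and an exact-match measure is assumed purely to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class QueryCommand:
    anchor: dict
    search_documents: dict  # document name -> attribute values


@dataclass
class Response:
    result_dataset: dict = field(default_factory=dict)
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


class SearchManager:
    """Scores each search document against the anchor (exact match assumed)."""

    def execute(self, command: QueryCommand) -> dict:
        scores = {}
        attrs = list(command.anchor)
        for name, doc in command.search_documents.items():
            matches = sum(1 for a in attrs if doc.get(a) == command.anchor[a])
            # Fraction of matching attributes, always within [0.0, 1.0].
            scores[name] = matches / len(attrs) if attrs else 0.0
        return scores


class Gateway:
    """Receives requests, generates query commands, and returns responses."""

    def __init__(self, search_manager: SearchManager):
        self.search_manager = search_manager

    def handle(self, request: dict) -> Response:
        response = Response()
        if "anchor" not in request or not request.get("documents"):
            response.errors.append("request needs an anchor and at least one document")
            return response
        command = QueryCommand(request["anchor"], request["documents"])
        response.result_dataset = self.search_manager.execute(command)
        return response

    def handle_batch(self, requests: list) -> list:
        """Run several requests in sequence and collect their responses."""
        return [self.handle(r) for r in requests]


gateway = Gateway(SearchManager())
request = {
    "anchor": {"name": "Smith", "city": "Austin"},
    "documents": {
        "doc1": {"name": "Smith", "city": "Austin"},
        "doc2": {"name": "Smith", "city": "Dallas"},
    },
}
response = gateway.handle(request)
print(response.result_dataset)
```

Keeping routing in `Gateway` and scoring in `SearchManager` mirrors the separation the claims draw, so either part could be replaced without touching the other.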

34. A system for performing similarity searching, comprising:
- a gateway for handling all communication between a client, a virtual document manager and a search manager;
  
  the virtual document manager connected between the gateway and a relational database management system for providing document management;
  
  the search manager connected between the gateway and the relational database management system for searching and scoring documents; and
  
  the relational database management system for providing relational data management, document and measure persistence, and similarity measure execution.
- View Dependent Claims (35, 36, 37, 38)
- - 35. The system of claim 34, wherein the virtual document manager includes a relational database driver for mapping XML documents to relational database tables.
  - 36. The system of claim 34, wherein the virtual document manager includes a statistics processing module for generating statistics based on similarity search results.
  - 37. The system of claim 34, wherein the relational database management system includes means for storing and executing user defined functions.
  - 38. The system of claim 37, wherein the user defined functions include measurement algorithms for determining attribute token similarity scores.
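
Claims 37 and 38 place the measure algorithms inside the database as user defined functions. A small Python sketch of that idea follows, using SQLite as a stand-in relational database and `sqlite3.Connection.create_function` to register the measure. The table, column and function names are hypothetical, and `difflib.SequenceMatcher` is an assumed substitute for whatever measure the patent's system would install.

```python
import sqlite3
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Stand-in measure returning a score in [0.0, 1.0]; NULLs score 0.0."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


conn = sqlite3.connect(":memory:")
# Register the Python function as a two-argument SQL user defined function.
conn.create_function("string_similarity", 2, string_similarity)

conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO person (name) VALUES (?)",
    [("John Smith",), ("Jon Smyth",), ("Mary Jones",)],
)

# The anchor value is passed as a query parameter; scoring runs inside the database.
anchor_name = "John Smith"
rows = conn.execute(
    "SELECT name, string_similarity(name, ?) AS score "
    "FROM person ORDER BY score DESC",
    (anchor_name,),
).fetchall()

for name, score in rows:
    print(f"{name}: {score:.2f}")
```

Running the measure in the query means only names and scores leave the database, which is one plausible reading of why the claims assign token scoring to the relational database management system.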

39. A method for performing similarity searching, comprising the steps of:
- creating a search schema document by a virtual document manager;
  
  generating one or more query commands by a gateway;
  
  executing one or more query commands in a search manager and relational database management system for determining the degree of similarity between an anchor document and search documents; and
  
  assembling a result document containing document similarity scores of between 0.00 and 1.00.
- View Dependent Claims (40, 41, 42, 43, 44, 45, 46, 47)
- - 40. The method of claim 39, wherein the step of creating a schema document comprises designating a structure of search documents, datasets for mapping search document attributes to relational database locations, and semantics identifying measures for computing token attribute similarity search scores between search documents and an anchor document, weights for modulating token attribute similarity search scores, choices for aggregating token attribute similarity search scores into document similarity search scores, and paths to the search document structure attributes.
  - 41. The method of claim 39, wherein the step of generating one or more query commands comprises designating an anchor document, search or schema documents, restrictions on result sets, structure of result sets, and semantics for overriding schema document semantics including measures, weights, choices and paths.
  - 42. The method of claim 39, wherein the step of executing one or more query commands comprises:
    - computing token attribute similarity search scores having values of between 0.00 and 1.00 for each search document and an anchor document in a relational database management system using measures; and
      
      modulating the token attribute similarity search scores using weights and aggregating the token attribute similarity scores into document similarity scores having values of between 0.00 and 1.00 in the search manager using choices.
  - 43. The method of claim 39, wherein the step of assembling a result document comprises identifying associated query commands and schema documents, document structure, paths to search terms, and similarity scores by the search manager.
  - 44. The method of claim 39, wherein the search schema, the query commands, the search documents, the anchor document and the result document are represented by hierarchical XML documents.
  - 45. The method of claim 40, further comprising selecting measure algorithms from the group consisting of name equivalents, foreign name equivalents, textual, sound coding, string difference, numeric, numbered difference, ranges, numeric combinations, range combinations, fuzzy, date oriented, date to range, date difference, and date combination.
  - 46. The method of claim 40, further comprising selecting choice algorithms from the group consisting of single best, greedy sum, overall sum, greedy minimum, overall minimum, and overall maximum.
  - 47. A computer-readable medium containing instructions for controlling a computer system to implement the method of claim 39.
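
Claims 40, 42 and 44 describe hierarchical XML documents whose attributes are addressed by paths and scored leaf by leaf. The Python sketch below shows one way to pair corresponding leaf nodes by path. The element names and the `leaf_values` helper are hypothetical, the documents are invented for illustration, and exact match is an assumed stand-in for a real measure.

```python
import xml.etree.ElementTree as ET


def leaf_values(xml_text: str) -> dict:
    """Map each leaf node's path (e.g. 'person/city') to its text value.

    Repeated sibling tags would overwrite each other in this simplified sketch.
    """
    root = ET.fromstring(xml_text.strip())
    leaves = {}

    def walk(node, path):
        children = list(node)
        if not children:
            leaves[path] = (node.text or "").strip()
        for child in children:
            walk(child, f"{path}/{child.tag}")

    walk(root, root.tag)
    return leaves


anchor_xml = """
<person>
  <name><first>John</first><last>Smith</last></name>
  <city>Austin</city>
</person>
"""
search_xml = """
<person>
  <name><first>John</first><last>Smyth</last></name>
  <city>Austin</city>
</person>
"""

anchor_leaves = leaf_values(anchor_xml)
search_leaves = leaf_values(search_xml)

# Pair corresponding leaf nodes by path; exact match is an assumed measure.
token_scores = {
    path: 1.0 if search_leaves.get(path) == value else 0.0
    for path, value in anchor_leaves.items()
}
print(token_scores)
```

The resulting per-path token scores are the inputs that weights and a choice algorithm would then combine into a single document similarity score.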

Current Assignee: Fair Isaac Corporation
Original Assignee: Infoglide Software Corp. (Fair Isaac Corporation)
Inventors: Ripley, John R.
Granted Patent: US 6,829,606 B2
Field of Search
US Class Current: 707/5
CPC Class Codes:
G06F 16/2455   Query execution
G06F 16/2462   Approximate or statistical ...
G06F 16/2468   Fuzzy queries
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
Y10S 707/99945   Object-oriented database st...
