System and method for performing similarity searching using pointer optimization
First Claim
1. A computer implemented method for optimizing similarity searching while detecting and scoring similarities between documents and a search criteria comprising:
- a. using a set of hierarchical documents having root, interior and leaf nodes, wherein the leaf nodes contain data items;
b. using assigned unique identifiers assigned to each unique data item contained within each leaf node in the set of documents, wherein each associated unique identifier, unique data item and leaf node are entered into a data structure called a data band, the unique identifiers being unique within a selected context in the set of hierarchical documents;
c. for each data item in each leaf node, computing a data item numerical score that represents a quantitative measure of the similarity between the data item in the leaf node and the search criteria using a scoring method; and
d. using parent score computing and weighting algorithms, computing interior and root node scores from the leaf node data item similarity scores;
e. whereby the search criteria is created by a user for specifying a scoring method, parent score algorithm, weighting algorithms, and indices linking each root, interior and leaf node connection.
5 Assignments
0 Petitions
Accused Products
Abstract
The present invention is a computer-implemented method for optimizing the detecting and scoring of similarities between documents in a source database and search criteria. It uses a set of hierarchical documents having root, interior and leaf nodes where the leaf nodes contain data items. Unique identifiers are assigned to each unique data item contained within each leaf node within the set of documents. Unique identifiers may be assigned to the data items depending upon a selected context in the hierarchy of nodes within the documents. For each data item in each leaf node, a data item score is computed that represents a similarity between the data item in the leaf node and the search criteria. A parent nodes score is then computed by combining the data item scores for all its child nodes. A data and indexing structures provides for efficient similarity searching and the quick reporting of results because the data is organized by the categories a user wants to search. Assigning a pointer or unique identifier only to unique data items causes the memory requirements to be reduced and optimizes the search time. Depending upon the particular set of documents being searched and the search criteria, significant improvements in speed and memory requirements may be achieved.
-
Citations
57 Claims
-
1. A computer implemented method for optimizing similarity searching while detecting and scoring similarities between documents and a search criteria comprising:
-
a. using a set of hierarchical documents having root, interior and leaf nodes, wherein the leaf nodes contain data items;
b. using assigned unique identifiers assigned to each unique data item contained within each leaf node in the set of documents, wherein each associated unique identifier, unique data item and leaf node are entered into a data structure called a data band, the unique identifiers being unique within a selected context in the set of hierarchical documents;
c. for each data item in each leaf node, computing a data item numerical score that represents a quantitative measure of the similarity between the data item in the leaf node and the search criteria using a scoring method; and
d. using parent score computing and weighting algorithms, computing interior and root node scores from the leaf node data item similarity scores;
e. whereby the search criteria is created by a user for specifying a scoring method, parent score algorithm, weighting algorithms, and indices linking each root, interior and leaf node connection. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55)
the root node has no parent node and the root node is a parent of at least one child node selected from the group consisting of interior nodes and leaf nodes;
the interior node has a parent node and the interior node is itself a parent node having at least one child node selected from the group consisting of interior nodes and leaf nodes; and
the leaf node is a child node that has no children and the leaf node has a parent node selected from the group consisting of root nodes and interior nodes.
-
-
3. The method of claim 2 further comprising computing a parent node score by combining the data item scores for all its child nodes.
-
4. The method of claim 3 further comprising assigning an identifier to each parent node.
-
5. The method of claim 2 further comprising a schema having a hierarchy, wherein the schema defines a hierarchy of parent and child nodes within a set of hierarchical documents and an organization of the set of hierarchical documents.
-
6. The method of claim 5 further comprising a node label assigned to each node in the schema.
-
7. The method of claim 5 further comprising, converting at least one document into at least one hierarchical document having root, interior and leaf nodes, wherein said root, interior and leaf nodes correspond to the schema.
-
8. The method of claim 7 wherein the converting comprises allowing a user to map between the schema and documents in a preexisting database to form the set of hierarchical documents.
-
9. The method of claim 8 wherein the preexisting database is a relational database.
-
10. The method of claim 5 further comprising, converting at least one document into at least one hierarchical document having at least one root node and at least one leaf node, wherein said root and leaf nodes correspond to the schema.
-
11. The method of claim 5 wherein the schema is defined by a user.
-
12. The method of claim 5 wherein the schema is retrieved from a database containing stored schemas.
-
13. The method of claim 5 wherein the schema further comprises:
-
a. a scoring method for calculating a leaf node score for each leaf node;
b. a weighting algorithm for calculating a parent node score for each leaf node when the parent node contains more than one leaf node;
c. a parent score computing algorithm for computing the similarity score of the parent node using the leaf node scores and the weighting algorithm.
-
-
14. The method of claim 2 further comprising computing an interior node score for all interior nodes by combining the scores for all the child nodes of the interior nodes.
-
15. The method of claim 14 further comprising computing a root node score by combining the interior node scores for the children of the root node.
-
16. The method of claim 1 wherein the context for a node is its position in the schema.
-
17. The method of claim 1 wherein the context for a node is the set of node labels that comprise its position in the schema.
-
18. The method of claim 1 further comprising:
-
reserving space in a score buffer for each assigned unique identifier; and
associating the score for the data item for each assigned unique identifier with its reserved space in the score buffer.
-
-
19. The method of claim 18 wherein the score buffer is indexed by the data item'"'"'s assigned unique identifier.
-
20. The method of claim 18 wherein the assigned unique identifier serves as an index to the data item.
-
21. The method of claim 20 further comprising identifying the child nodes belonging to each parent node.
-
22. The method of claim 21 wherein the identifying comprises associating the data item'"'"'s assigned unique identifier for each leaf node with its parent'"'"'s assigned identifier and saving a resulting association.
-
23. The method of claim 22 wherein the resulting association is stored in a relation band.
-
24. The method of claim 21 wherein the identifying comprises associating the child node with its parent node.
-
25. A computer-readable data transmission medium containing an association between parent and child nodes as recited in claim 24.
-
26. The method of claim 18 wherein the assigned unique identifier is the same for all identical data items for a selected context within the hierarchical database.
-
27. The method of claim 26 wherein the context is selected from the group consisting of its position in the schema and the set of node labels.
-
28. The method of claim 18 wherein the data item score is a number that represents how similar and dissimilar the data item is to the search criteria.
-
29. The method of claim 28 wherein the number representing the data item score is assigned based on a method selected from the group consisting of an algorithmic scoring method and a non-algorithmic scoring method.
-
30. The method of claim 29 wherein if the scoring method is a non-algorithmic scoring method and if the data item does not match the search criteria, assigning as the data item score a value that represents a neutral score.
-
31. The method of claim 29 wherein if a non-algorithmic scoring method is chosen generates a set of data values along with data item scores.
-
32. The method of claim 31 wherein if a data item occurs within this set, the data item'"'"'s unique identifier is associated with its corresponding score.
-
33. The method of claim 32 wherein the non-algorithmic scoring method uses a user-defined table of data items, their corresponding synonyms and their scores.
-
34. The method of claim 31 wherein if the data items are not in this set, the data items are assigned a neutral score.
-
35. The method of claim 1 further comprising:
-
a. for all the data items in the set of hierarchical documents, organizing each data item in a data band according to its position in the schema and associating each data item'"'"'s assigned unique identifier with the data item and storing the association in the data band; and
b. for each child node in the set of hierarchical documents, linking each node with its parent node using a relation band according to its position in the schema, where the parent node is selected from the group consisting of interior nodes and root nodes.
-
-
36. The method of claim 35 wherein computing a data item score comprises calculating a leaf node score for each data item within each leaf node, combining all the data item scores within the leaf node into an overall leaf node score and saving the overall node score as the leaf node score.
-
37. The method of claim 36 wherein the leaf node score is saved in a leaf score buffer.
-
38. The method of claim 37 further comprising indexing the leaf score buffer by the data item'"'"'s assigned unique identifier.
-
39. The method of claim 36 further comprising using the assigned unique identifier as an index to the data item.
-
40. The method of claim 36 further comprising:
-
a. using the saved leaf node scores, selecting a parent node as the current parent node and calculating a current parent node score for all leaf nodes that have the same parent using a parent score computing algorithm and saving the current parent node score;
b. if the current parent node is a root node, saving the parent node score as a final similarity search score and processing ends; and
c. if the current parent node is an interior node;
i. saving the current parent node score as an interior node score;
ii. setting the current parent node to the parent of the interior node;
iii. using the saved interior node scores, calculating the parent node score for all interior nodes that have the same parent using a parent score computing algorithm; and
iv. repeating steps b and c until the current parent node is a root node.
-
-
41. The method of claim 40 further comprising making the final similarity score available for display.
-
42. The method of claim 40 further comprising associating the leaf and interior nodes with their parent nodes using the relation band.
-
43. The method of claim 40 wherein the parent score computing algorithm comprises determining the weight to be given to each leaf node score in calculating the current parent node score.
-
44. The method of claim 40 wherein the parent score computing algorithm is selected from the group consisting of single best, greedy sum, overall sum, greedy minimum, overall minimum and overall maximum.
-
45. Computer-readable media having computer-executable instructions for performing the method as recited in claim 40.
-
46. The method of claim 35 further comprising calculating a root node score for each root node within the set of hierarchical documents comprising:
-
a. using the relation bands, for 1 to N parent nodes, identifying the data item scores for their child nodes of the 1 to N parent nodes;
b. selecting a current parent node from the 1 to N parent nodes;
c. computing a parent score for the current parent node using the data item scores of its children and a parent score computing algorithm and saving the parent node score;
d. if the current parent node is a root node, saving the parent node score as the similarity search score and processing ends;
e. if the current parent node is not a root node, selecting another current parent node from the 1 to N parent nodes that has not had its score calculated and repeating steps c and d.
-
-
47. The method of claim 35 wherein computing a data item score comprises:
-
a. using a search criteria; and
b. comparing each data item to the search criteria and assigning a data item score that represents a degree of similarity between the search criteria and the data item.
-
-
48. The method of claim 1 further comprising for all data items in the set of hierarchical documents, organizing each data item within each leaf node in a data band according to its position in the schema and associating each data item'"'"'s assigned unique identifier with the data item and storing the association in the data band.
-
49. The method of claim 48 wherein computing a data item score comprises calculating a leaf node score for each data item, combining all the data item scores within the leaf node into an overall leaf node score and saving the overall node score as the leaf node score.
-
50. The method of claim 1 wherein the search criteria is dynamically defined by a user.
-
51. The method of claim 1 wherein the search criteria is retrieved from a database of stored queries.
-
52. The method of claim 1 further comprising using the same search criteria and repeating steps a and b for each of N number of sets of hierarchical documents.
-
53. The method of claim 52 further comprising displaying the results for the N number of sets of hierarchical documents to a user.
-
54. The method of claim 1 wherein the hierarchical documents are stored in Extensible Markup Language (XML).
-
55. Computer-readable media having computer-executable instructions for performing the method as recited in claim 1.
-
56. A computer data signal embodied in a transmission medium for optimizing similarity searching while detecting and scoring similarities between documents and a search criteria comprising:
-
a. a first portion identifying a set of hierarchical documents having root, interior and leaf nodes, wherein the leaf nodes contain data items that a client computer is requesting from a server;
b. a second portion identifying assigned unique identifiers assigned to each unique data item contained within each leaf node in the set of documents that a client computer is requesting from a server, wherein each associated unique identifier, unique data item and leaf node are entered into a data structure called a data band, the unique identifiers being unique within a selected context in the set of hierarchical documents;
c. a third portion identifying a session for computing a data item numerical score that represents a quantitative measure of the similarity between each data item in each leaf node and the search criteria using a scoring method requested by the client computer during the identified session; and
d. a fourth portion identifying parent score computing and weighting algorithms for computing interior and root node scores from the leaf node data item similarity scores;
e. whereby the search criteria is created by a user for specifying a scoring method, parent score algorithm, weighting algorithms, and indices linking each root, interior and leaf node connection.
-
-
57. A computer system for optimizing similarity searching while detecting and scoring similarities between documents and a search criteria comprising:
-
a. a set of hierarchical documents that are to be similarity searched, the documents having root, interior and leaf nodes, wherein the leaf nodes contain data items, b. a unique identifier component having unique identifiers that are assigned to each unique data item contained within each leaf node in the set of documents, wherein each associated unique identifier, unique data item and leaf node are entered into a data structure called a data band, the unique identifiers being unique within a selected context in the set of hierarchical documents;
c. a data item score component that computes a data item numerical score for each data item in each leaf node, the score representing a quantitative measure of the similarity between the data item in the leaf node and the search criteria using a scoring method; and
d. parent score computing and weighting algorithms for computing interior and root node scores from the leaf node data item similarity score;
e. whereby the search criteria is created by a user for specifying a scoring method, parent score algorithm, weighting algorithms, and indices linking each root, interior and leaf node connection.
-
Specification