Method for searching in large databases of automatically recognized text
First Claim
1. A method of searching for a query word from among a plurality of words in a hierarchical data structure having branch nodes and leaf nodes, each branch node representing a respective component of one or more of the words and each leaf node representing a respective one of the words, the method comprising the steps of:
- a) selecting a root node in the hierarchical data structure as the current node;
b) identifying all possible child nodes of the current node in the hierarchical data structure;
c) calculating, for each of the identified child nodes, a respective estimated probability value for matching each component of the query word with the component associated with a respective one of the branch nodes in a path taken in the hierarchical data structure from the root node through the current node, wherein the estimated probability is based on a simplified error model which is independent of specific letters, until a sufficient amount of training data is available and after the sufficient amount of training data is available the estimated probability is based on a more detailed error model;
d) adding the identified child nodes to a list of candidate nodes;
e) selecting, from the list of candidate nodes, one node having the respective estimated probability value which is greater than any other probability value as the current node;
f) determining if the current node is a leaf node and, if so, then determining whether to store the word representing the leaf node into a list of best matches; and
g) repeating steps (b) through (g) until all components of the query word have been processed.
1 Assignment
0 Petitions
Accused Products
Abstract
A method for searching for a query word in a database of automatically recognized text generated, for example by an optical character recognition (OCR) system or a speech recognition (SR) system finds entries that most closely match the query word. The database is indexed into a trie data structure, which represents all possible words in the database. The trie data structure has a plurality of branch nodes, each representing a letter of at least one word, and a plurality of leaf nodes, each representing a respective word. The trie data structure is searched for each query word by selecting the first letter of the query word and also selecting a root node in the trie data structure as the current node. All possible child nodes of the current node are identified. Respective estimated probability values for matching respective letters of the query word with the letters associated with the nodes in the path taken through the trie data structure are calculated for each identified child node. The identified child nodes are then placed into a list of candidate nodes. The node, in the list of candidate nodes, having the highest probability value is selected as the current node and is then deleted from the list of candidate nodes. The process repeats with this current node until a leaf node is reached. When a leaf node is reached, a determination is made whether to store the word into a list of best matches based on the probability value of the word compared to the probability values for all the words in the list of best matches.
112 Citations
11 Claims
-
1. A method of searching for a query word from among a plurality of words in a hierarchical data structure having branch nodes and leaf nodes, each branch node representing a respective component of one or more of the words and each leaf node representing a respective one of the words, the method comprising the steps of:
-
a) selecting a root node in the hierarchical data structure as the current node;
b) identifying all possible child nodes of the current node in the hierarchical data structure;
c) calculating, for each of the identified child nodes, a respective estimated probability value for matching each component of the query word with the component associated with a respective one of the branch nodes in a path taken in the hierarchical data structure from the root node through the current node, wherein the estimated probability is based on a simplified error model which is independent of specific letters, until a sufficient amount of training data is available and after the sufficient amount of training data is available the estimated probability is based on a more detailed error model;
d) adding the identified child nodes to a list of candidate nodes;
e) selecting, from the list of candidate nodes, one node having the respective estimated probability value which is greater than any other probability value as the current node;
f) determining if the current node is a leaf node and, if so, then determining whether to store the word representing the leaf node into a list of best matches; and
g) repeating steps (b) through (g) until all components of the query word have been processed. - View Dependent Claims (2, 3, 4, 5, 6, 7)
calculating the estimated probability value associated with the selected element, according to the equation;
-
-
4. A method as defined in claim 1 further comprising the steps of:
-
establishing a data structure which includes a plurality of entries, wherein each entry consists of a plurality of elements for each of the nodes in the hierarchical data structure, each entry containing elements identifying;
(1) a respective type of error condition that possibly results in a predetermined query sequence occurring; and
(2) information identifying an estimated probability that the predetermined query sequence matches any of the plurality of words represented by the respective paths in the hierarchical data structure that pass through the child node associated with the entry, if the identified type of error condition is present, wherein, step (e) includes selecting the plurality of elements from among the elements with which the entries in the data structure are associated.
-
-
5. A method as defined in claim 4 wherein, the types of error conditions include error free matches, insertion errors, deletion errors and substitution errors.
-
6. A method as defined in claim 1 further including the steps of:
-
assigning a search time budget; and
decrementing the search time budget with each execution of steps (c) through (g), and wherein, the selecting of new components in the query word in step (g) is inhibited when the search time budget has been exhausted.
-
-
7. A method as defined in claim 1 wherein, the list of best matches has a predetermined maximum number of entries and, after exceeding the predetermined number, the word in the list of best matches having the probability value which is less than any other probability value in the list of best matches is deleted from the list of best matches.
-
8. A computer readable medium encoded with a computer program which, when executed, causes a computer to search for a query word from among a plurality of words in a hierarchical data structure having branch nodes and leaf nodes, each branch node representing a respective component of one or more of the plurality of words and each leaf node representing a respective one of the plurality of words, the computer program causing the computer to perform the steps of:
-
a) selecting a root node in the hierarchical data structure as the current node;
b) identifying all possible child nodes of the current node in the hierarchical data structure;
c) calculating, for each of the identified child nodes, a respective estimated probability value for matching each component of the query word with the component associated with a respective one of the branch nodes in a path taken In the hierarchical data structure from the root node through the current node, wherein the estimated probability is based on a simplified error model, which is independent of specific letters, until a sufficient amount of training data is available and after the sufficient amount of training data is available the estimated probability is based on a more detailed error model;
d) adding the identified child nodes to a list of candidate nodes;
e) selecting, from the list of candidate nodes, one node having the respective estimated probability value which is greater than any other probability value as the current node;
f) determining if the current node is a leaf node and, if so, then determining whether to store the word representing the leaf node into a list of best matches; and
g) repeating steps (b) through (g) until all components of the query word have been matched. - View Dependent Claims (9, 10, 11)
establishing a data structure which includes a plurality of entries, wherein each entry consists of a plurality of elements for each of the nodes in the hierarchical data structure, each entry containing elements identifying;
(1) a respective type of error condition that possibly results in a query sequence occurring; and
(2) information identifying an estimated probability that the query sequence matches any of the plurality of words represented by the respective paths In the hierarchical data structure that pass through the child node associated with the entry, if the identified type of error condition is present, wherein, step (b) includes selecting the plurality of elements from among the elements with which the entries in the data structure are associated.
-
-
10. A computer readable medium according to claim 8 wherein, the hierarchical data structure is a trie data structure representing a plurality of words wherein, the trie data structure has a plurality of branch nodes, wherein, each branch node includes at least one child node, wherein the computer program, at step (c) causes the computer to perform the step of determining a probability that a next portion of the query word matches each child node of the branch node.
-
11. A computer readable medium according to claim 10 wherein, the trie data structure has N levels, ordinally numbered 0 through N-1 and step (d) includes for each candidate node, further causing the computer to perform the step of:
calculating the estimated probability value associated with the selected element, according to the equation;
Specification