Non-literal textual search using fuzzy finite-state linear non-deterministic automata
First Claim
1. A computer-implemented method for selectively retrieving information, including a plurality of stored target strings contained in a document set stored on a data storage medium and accessible by a computer processor, the method comprising the steps of:
- transmitting a search expression to the processor;
constructing a linear finite-state non-deterministic automation corresponding to the search expression wherein the automation permits transitions only from a state to itself and from the state to a next state and wherein a linear finite-state non-deterministic automation is constructed for any transmitted search expression;
applying the plurality of target strings to the automation and generating thereby a dissimilarity metric for each target string; and
producing a list of matching target strings based upon a true dissimilarity metric of each target string.
1 Assignment
0 Petitions
Accused Products
Abstract
Method and system for selectively retrieving information contained in a stored document set using a metric-based or "fuzzy" finite-state non-deterministic automation. The system receives a generalized regular search expression from a user. The system then performs prematching during which it estimates a dissimilarity metric for each target string in the stored document set with respect to the search expression. The strings are then sorted by dissimilarity metric, with the best matches, i.e., the strings having the lowest dissimilarity metrics, first. The search expression is broken down into one or more segments. A linear fizzy finite-state non-deterministic automation is constructed (501) by matching each segment of the search expression with a corresponding set of states and transitions. The automation is initialized and then processes target strings read (502) from the sorted list, thereby generating a dissimilarity value for each target string. A dissimilarity value for each string is determined based upon penalties associated with one-to-one fuzzy character substitution, exchanged adjacent characters, one-to-many, many-to-one, and many-to-many character substitution, and other differences between the search expression and a target string read from storage.
207 Citations
32 Claims
-
1. A computer-implemented method for selectively retrieving information, including a plurality of stored target strings contained in a document set stored on a data storage medium and accessible by a computer processor, the method comprising the steps of:
-
transmitting a search expression to the processor; constructing a linear finite-state non-deterministic automation corresponding to the search expression wherein the automation permits transitions only from a state to itself and from the state to a next state and wherein a linear finite-state non-deterministic automation is constructed for any transmitted search expression; applying the plurality of target strings to the automation and generating thereby a dissimilarity metric for each target string; and producing a list of matching target strings based upon a true dissimilarity metric of each target string. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)
-
-
19. A computer system for selectively retrieving information, including a plurality of stored target strings contained in a document set stored on a data storage medium and accessible by a computer processor, the system comprising:
-
a data input device providing a user-defined search expression to the processor; a finite-state deterministic automation adapted for receiving the search expression and generating a linear finite-state non-deterministic automation therefrom adapted to accept as input each distinct stored target string and to produce in response a dissimilarity metric associated with each distinct stored target string; and an output device producing a subset of the distinct stored text strings based upon the dissimilarity metrics. - View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27, 28, 29)
-
-
30. A computer-program method for selectively retrieving information, including a plurality of stored target strings contained in a document set stored on a data storage medium and accessible by a computer processor, the method comprising the steps of:
-
transmitting a search expression to the processor; sorting the plurality of stored target strings by an estimated dissimilarity metric; constructing a linear finite-state non-deterministic automation corresponding to the search expression; applying the sorted target strings to the automation and generating thereby a true dissimilarity metric for each target string, wherein the true dissimilarity metric reflects predetermined differences between each target string and the search expression including; one-to-one fuzzy character substitutions; exchanged adjacent characters; and one-to-many, many-to-one, and many-to-many character substitutions; and storing target strings having the N-lowest true dissimilarity metrics in an N-item data structure. - View Dependent Claims (31, 32)
-
Specification