Method for efficiently supporting interactive, fuzzy search on structured data
First Claim
Patent Images
1. A method for searching a structured data table T with m attributes and n records, where A={a1;
- a2, ;
;
;
;
am} denotes an attribute set, R={r1;
r2, ;
;
;
, rn} denotes the record set, and W={w1;
w2, ;
;
;
;
wp} denotes a distinct word set in T, where given two words, wi and wi, “
wi≦
wj”
denotes that wi is a prefix string of wj, where a query consists of a set of prefixes Q={p1, p2, . . . , pl}, where a predicted-word set is Wkl={w|w is a member of W and kl≦
w}, the method comprising for each prefix pi finding the set of prefixes from the data set that are similar to pi, by;
determining the predicted-record set RQ={r|r is a member of R, for every i;
1≦
i≦
·
l−
1, pi appears in r, and there exists a w included in Wkl, w appears in r}; and
for a keystroke that invokes query Q, returning the top-t records in RQ for a given value t, ranked by their relevancy to the query, treating every keyword as a partial keyword, namely given an input Q={k1;
k2;
;
;
;
;
kl for each predicted record r, for each 1≦
i≦
·
l, there exists at least one predicted word wi for ki in r, since ki must be a prefix of wi,quantifying their similarity as;
sim=(ki;
wi)=|ki|/|wi|if there are multiple predicted words in r for a partial keyword kj, selecting the predicted word wi with the maximal similarity to ki and quantifying a weight of a predicted word to capture the importance of a predicted word, and taking into account the number of attributes that the l predicted words appear in, denoted as na, to combine similarity, weight and number of attributes to generate a ranking function to score r for the query Q as follows;
0 Assignments
0 Petitions
Accused Products
Abstract
A method to support efficient, interactive, and fuzzy search on text data includes an interactive, fuzzy search on structured data used in applications such as query relaxation, autocomplete, and spell checking, where inconsistencies and errors exist in user queries as well as data. It utilizes techniques to efficiently and interactively answer fuzzy queries on structured data to allow users to efficiently search for information interactively, and they can find records and documents even if these records and documents are slightly different from the user keywords.
30 Citations
39 Claims
-
1. A method for searching a structured data table T with m attributes and n records, where A={a1;
- a2, ;
;
;
;
am} denotes an attribute set, R={r1;
r2, ;
;
;
, rn} denotes the record set, and W={w1;
w2, ;
;
;
;
wp} denotes a distinct word set in T, where given two words, wi and wi, “
wi≦
wj”
denotes that wi is a prefix string of wj, where a query consists of a set of prefixes Q={p1, p2, . . . , pl}, where a predicted-word set is Wkl ={w|w is a member of W and kl≦
w}, the method comprising for each prefix pi finding the set of prefixes from the data set that are similar to pi, by;determining the predicted-record set RQ={r|r is a member of R, for every i;
1≦
i≦
·
l−
1, pi appears in r, and there exists a w included in Wkl , w appears in r}; andfor a keystroke that invokes query Q, returning the top-t records in RQ for a given value t, ranked by their relevancy to the query, treating every keyword as a partial keyword, namely given an input Q={k1;
k2;
;
;
;
;
kl for each predicted record r, for each 1≦
i≦
·
l, there exists at least one predicted word wi for ki in r, since ki must be a prefix of wi,quantifying their similarity as;
sim=(ki;
wi)=|ki|/|wi|if there are multiple predicted words in r for a partial keyword kj, selecting the predicted word wi with the maximal similarity to ki and quantifying a weight of a predicted word to capture the importance of a predicted word, and taking into account the number of attributes that the l predicted words appear in, denoted as na, to combine similarity, weight and number of attributes to generate a ranking function to score r for the query Q as follows; - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
- a2, ;
-
22. A method for fuzzy type ahead search where R is a collection of records such as the tuples in a relational table, where D is a data set of words in R, where a user inputs a keyword query letter by letter, comprising:
-
finding on-the-fly records with keywords similar to the query keywords by using edit distance to measure the similarity between strings, where the edit distance between two strings s1 and s2, denoted by ed(s1, s2), is the minimum number of single-character edit operations, where Q is the keyword query the user has input which is a sequence of keywords [w1, w2, ;
;
;
, wm];treating the last keyword wm as a partial keyword finding the keywords in the data set that are similar to query keywords, where π
is a function that quantifies the similarity between a string s and a query keyword w in D, including, but not limited to;
π
(s,w)=1−
ed(s,w)/|w|,where |w| is the length of the keyword w; and normalizing the edit distance based on the query-keyword length in order to allow more errors for longer query keywords, where d be a keyword in D, for each complete keyword wi (i=1, ;
;
;
, m−
1), defining the similarity of d to wi as;
Sim(d,wi)=π
(d,wi),since the last keyword wm is treated as a prefix condition, defining the similarity of d to wm as the maximal similarity of d'"'"'s prefixes using function π
, namely Sim(d, wm )=max prefix p of d π
(p, wm ), where τ
is a similarity threshold, where a keyword in D is similar to a query keyword w if Sim(d, w)≧
τ
, where a prefix p of a keyword in D is similar to the query keyword wm if π
(p, wm)≧
τ
, where φ
(wi) (i=1, ;
;
;
, m) denotes the set of keywords in D similar to wi, and where P(wm) denotes the set of prefixes of keywords in D similar to wm.- View Dependent Claims (23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38)
-
28. The method of claim 26 comprising partitioning the inverted lists into several groups based on their corresponding query keywords, where each query keyword w has a group of inverted lists, producing a list of record IDs sorted on their scores with respect to this keyword, and using a top-k algorithm to find the k best records for the query.
-
29. The method of claim 28 where using a top-k algorithm to find the k best records for the query comprises for each group of inverted lists for the query keyword w, retrieving the next most relevant record ID for w by building a max heap on the inverted lists comprising maintaining a cursor on each inverted list in the group, where the heap is comprised of the record IDs pointed by the cursors so far, sorted on the scores of the similar keywords in these records since each inverted list is already sorted based on the weights of its keyword in the records and all the records on this list share the same similarity between this keyword and the query keyword w the list is also sorted based on the scores of this keyword in these records, retrieving the next best record from this group by popping the top element from the heap, incrementing the cursor of the list of the popped element by one, and pushing the new element of this list to the heap, ignoring other lists that may produce this popped record, since their corresponding scores will no longer affect the score of this record with respect to the query keyword w.
-
30. The method of claim 29 where L1, :
- ;
;
, Lt are inverted lists with the similar keywords d1, ;
;
;
, dt, respectively, and further comprising sorting these inverted lists based on the similarities of their keywords to w, Sim(d1, w), ;
;
;
, sim(dt, w), constructing the max heap using the lists with the highest similarity values.
- ;
-
31. The method of claim 22 where the dataset D comprises a trie for the data keywords in D, where each trie node has a character label, where each keyword in D corresponds to a unique path from the root to a leaf node on the trie, where a leaf node has an inverted list of pairs [rid, weight]i, where rid is the ID of a record containing the leaf-node string, and weight is the weight of the keyword in the record and further comprising determining the top-k answers to the query Q in two steps comprising:
-
for each keyword wi in the query, determining the similar keywords φ
(wi) and similar prefixes P(wm) on the trie; andaccessing the inverted lists of these similar data keywords to determine the k best answers to the query.
-
-
32. The method of claim 31 where accessing the inverted lists of these similar data keywords to determine the k best answers to the query comprises randomly accessing the inverted list, in each random access, given an ID of a record r, retrieving information related to the keywords in the query Q, to determine the score F(r, Q) using a forward index in which each record has a forward list of the IDs of its keywords and their corresponding weights, where each keyword has a unique ID corresponding its leaf node on the trie, and the IDs of the keywords follow their alphabetical order.
-
33. The method of claim 32 where randomly accessing the inverted list comprises:
-
maintaining for each trie node n, a keyword range [ln, un], where In and un are the minimal and maximal keyword IDs of its leaf nodes, respectively; verifying whether record r contains a keyword with a prefix similar to wm, where for a prefix p on the trie similar to wm checking if there is a keyword ID on the forward list of r in the keyword range [lp, up] of the trie node of p, since the forward list of r sorted, this checking is performed a binary search using the lower bound lp on the forward list of r to get the smallest ID°
no less than lp, the record having a keyword similar to wm if γ
exists and is no greater than the upper bound up, i.e., γ
≦
up.
-
-
34. The method of claim 32 where randomly accessing the inverted list comprises:
-
for each prefix p similar to wm, traversing the subtrie of p and identifying its leaf nodes; for each leaf node d, for the query Q, this keyword d has a prefix similar to wm in the query, storing [Query ID, partial keyword wm, sim(p, wm)]. in order to differentiate the query from other queries in case multiple queries are answered concurrently; storing the similarity between wm and p; determining the score of this keyword in a candidate record, where in the case of the leaf node having several prefixes similar to wm, storing their maximal similarity to wm; for each keyword wi in the query, storing the same information for those trie nodes similar to wi, defining stored entries for the leaf node as its collection of relevant query keywords; using collection of relevant query keywords to efficiently check if a record r contains a complete word with a prefix similar to the partial keyword wm by scanning the forward list of r, for each of its keyword IDs, locating the corresponding leaf node on the trie, and testing whether its collection of relevant query keywords includes this query and the keyword wm, and if so, using the stored string similarity to determine the score of this keyword in the query.
-
-
35. The method of claim 31 further comprising improving sorted access by precomputing and storing the unions of some of the inverted lists on the trie, where v is a trie node, and ∪
- (v) is the union of the inverted lists of v'"'"'s leaf nodes, sorted by their record weights, and if a record appears more than once on these lists, selecting its maximal weight as its weight on list ∪
(v), where ∪
(v) is defined as the union list of node v.
- (v) is the union of the inverted lists of v'"'"'s leaf nodes, sorted by their record weights, and if a record appears more than once on these lists, selecting its maximal weight as its weight on list ∪
-
36. The method of claim 35 where v is a trie node comprising materializing union list ∪
- (v), and using ∪
(v) to speed up sorted access for the prefix keyword wm is that ∪
(v) is sorted based on its record weights, where the value Score(r, wm, di) of a record r on the list of a keyword di with respect to wm is based on both Weight(di, r) and Sim(di, wm ), where all the leaf nodes of v have the same similarity to wm, where all the leaf nodes of v are similar to wm, namely their similarity to wm is no less than the threshold τ
so that the sorting order of the union list ∪
(v) is also the order of the scores of the records on the leaf-node lists with respect to wm.
- (v), and using ∪
-
37. The method of claim 36 where B is a budget of storage space available to materialize union lists comprising selecting trie nodes to materialize their union lists to maximize the performance of queries, where a node is defined as “
- materialized”
if its union list has been materialized, where for a query Q with a prefix keyword wm, some of the trie nodes have their union lists materialized, where v is the highest trie node that is usable for the max heap of wm, and for which ∪
(v) has not been materialized, where for each nonleaf trie descendant c of v, such that no node on the path from v to c (including c) has been materialized comprising;performing a cost-based analysis to quantify the benefit of materializing ∪
(c) on the performance of operations on the max heap of wm based on reduction of traversal time, reduction of heap-construction time and reduction of sorted-access time, the overall benefit Bv of materializing v for the query keyword query wm being;
Bv=Breduction of traversal time+Breduction of heap-construction time+Av*Breduction of sorted-access time,where Av is the number of sorted accesses on ∪
(v) for each query andthen summing the benefits of materializing its union list to all the queries in the query workload or trie according to probability of occurrence of the query, and recomputing Bv the benefit Bv of materializing other affected nodes after the benefit of each node is computed until the given budget B of storage space is realized.
- materialized”
-
38. The method of claim 37 where selecting trie nodes to materialize their union lists to maximize the performance of queries comprises randomly select trie nodes, selecting nodes top down from the trie root, or selecting nodes bottom up from the leaf nodes.
-
-
39. A software product including instructions stored on a non-transitory tangible medium for controlling a computer for searching a structured data table T with m attributes and n records, where A={a1;
- a2;
;
;
;
;
am} denotes an attribute set, R={r1;
;
;
;
;
rn} denotes the record set and W={w1;
w2;
;
;
;
;
wp} denotes a distinct word set in T where given two words, wi and wj, “
wi≦
wj”
denotes that wi is a prefix string of wj, where a query consists of a set of prefixes Q={p1, p2, . . . , pl}, where a predicted-word set is Wki ={w|w is a member of W and kl≦
w}, the method comprising for each prefix pi finding the set of prefixes from the data set that are similar to pi, by;determining the predicted-record set RQ={r|r is a member of R for ever i;
1≦
i≦
·
l−
1, pi appears in r, and there exists a w included in Wki , w appears in r}; andfor a keystroke that invokes query Q, returning the top-t records in RQ for a given value t, ranked by their relevancy to the query, treating every keyword as a partial keyword, namely given an input query Q={k1;
k2;
;
;
;
;
kl}, for each predicted record r, for each 1≦
i≦
·
l, there exists at least one predicted word wi for ki in r, since ki must be a prefix of wi, quantifying similarity as;
sim(ki;
wi)=|ki|/|wi|if there are multiple predicted words in r for a partial keyword ki, selecting the predicted word wi with the maximal similarity to ki and quantifying a weight of a predicted word to capture the importance of a predicted word, and taking into account the number of attributes that the l predicted words a ear in denoted as na, to combine similarity, weight and number of attributes to generate a ranking function to score r for the query Q as follows;
- a2;
Specification