Method for efficiently supporting interactive, fuzzy search on structured data

US 8,073,869 B2
Filed: 07/02/2009
Issued: 12/06/2011
Est. Priority Date: 07/03/2008
Status: Active Grant

Abstract

A method to support efficient, interactive, and fuzzy search on text data includes an interactive, fuzzy search on structured data used in applications such as query relaxation, autocomplete, and spell checking, where inconsistencies and errors exist in user queries as well as data. It utilizes techniques to efficiently and interactively answer fuzzy queries on structured data, so that users can search for information interactively and find records and documents even if these records and documents are slightly different from the user keywords.

30 Citations

39 Claims

1. A method for searching a structured data table $T$ with $m$ attributes and $n$ records, where $A=\{a_1, a_2, \ldots, a_m\}$ denotes an attribute set, $R=\{r_1, r_2, \ldots, r_n\}$ denotes the record set, and $W=\{w_1, w_2, \ldots, w_p\}$ denotes a distinct word set in $T$, where given two words $w_i$ and $w_j$, “$w_i \le w_j$” denotes that $w_i$ is a prefix string of $w_j$, where a query consists of a set of prefixes $Q=\{p_1, p_2, \ldots, p_l\}$, where a predicted-word set is $W_{k_l}=\{w \mid w \text{ is a member of } W \text{ and } k_l \le w\}$, the method comprising for each prefix $p_i$ finding the set of prefixes from the data set that are similar to $p_i$, by:
  
  determining the predicted-record set $R_Q=\{r \mid r \text{ is a member of } R, \text{ for every } i,\ 1 \le i \le l-1,\ p_i \text{ appears in } r, \text{ and there exists a } w \text{ included in } W_{k_l},\ w \text{ appears in } r\}$; and
  
  for a keystroke that invokes query $Q$, returning the top-$t$ records in $R_Q$ for a given value $t$, ranked by their relevancy to the query, treating every keyword as a partial keyword, namely given an input $Q=\{k_1, k_2, \ldots, k_l\}$, for each predicted record $r$, for each $1 \le i \le l$, there exists at least one predicted word $w_i$ for $k_i$ in $r$, since $k_i$ must be a prefix of $w_i$, quantifying their similarity as:
  
  $\mathrm{sim}(k_i, w_i) = |k_i| / |w_i|$, if there are multiple predicted words in $r$ for a partial keyword $k_i$, selecting the predicted word $w_i$ with the maximal similarity to $k_i$ and quantifying a weight of a predicted word to capture the importance of a predicted word, and taking into account the number of attributes that the $l$ predicted words appear in, denoted as $n_a$, to combine similarity, weight and number of attributes to generate a ranking function to score $r$ for the query $Q$ as follows:
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
- - 2. The method of claim 1 where returning the top-$t$ records in $R_Q$ for a given value $t$, ranked by their relevancy to the query, comprises finding a trie node corresponding to a keyword in a trie with inverted lists on leaf nodes by traversing the trie from the root; locating leaf descendants of the trie node corresponding to the keyword, and retrieving the corresponding predicted words and the predicted records on inverted lists.
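As an illustration of the lookup recited in claim 2, the following minimal Python sketch builds a character trie whose word-ending nodes carry inverted lists of record IDs, then answers a prefix by walking from the root and collecting the words and records below the node reached. The names `TrieNode`, `build_trie`, `find_node`, and `predicted` are hypothetical, the sample records are invented, and treating every word-ending node as a "leaf" (instead of appending a terminator character) is an assumption of this sketch, not something the claim specifies.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)  # character label -> TrieNode
    word: Optional[str] = None                    # set on nodes that end a word
    records: list = field(default_factory=list)   # inverted list of record IDs


def build_trie(records):
    """records: dict mapping record ID -> text. Returns the trie root."""
    root = TrieNode()
    for rid, text in records.items():
        for word in set(text.lower().split()):
            node = root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.word = word
            node.records.append(rid)
    return root


def find_node(root, prefix):
    """Walk from the root along the prefix; None if the prefix is absent."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def predicted(root, prefix):
    """Return (predicted words, predicted record IDs) for a prefix."""
    start = find_node(root, prefix)
    words, rids = [], set()
    stack = [start] if start is not None else []
    while stack:
        node = stack.pop()
        if node.word is not None:
            words.append(node.word)
            rids.update(node.records)
        stack.extend(node.children.values())
    return sorted(words), sorted(rids)


sample = {1: "li data", 2: "lin database", 3: "liu search"}
root = build_trie(sample)
print(predicted(root, "li"))  # (['li', 'lin', 'liu'], [1, 2, 3])
```

A prefix that matches no trie path yields two empty lists, which corresponds to the empty answer described for the incremental case in claim 8.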
  - 3. The method of claim 2 where returning the top-$t$ records in $R_Q$ for a given value $t$, ranked by their relevancy to the query, comprises tokenizing a query string into several keywords, $k_1, k_2, \ldots, k_l$; for each keyword $k_i$ ($1 \le i \le l-1$) determining only a predicted word, $k_i$, and one predicted-record list of a trie node corresponding to $k_i$, denoted as $l_i$, where there are $q$ predicted words for $k_l$, and their corresponding predicted-record lists are $l_{l_1}, l_{l_2}, \ldots, l_{l_q}$, and determining the predicted records by $\bigcap_{i=1}^{l-1} l_i \cap \left(\bigcup_{j=1}^{q} l_{l_j}\right)$, namely taking the union of the lists of predicted keywords for partial words, and intersecting the union of lists of predicted keywords for partial words with lists of the complete keywords.
  - 4. The method of claim 3 where determining the predicted records by $\bigcap_{i=1}^{l-1} l_i \cap \left(\bigcup_{j=1}^{q} l_{l_j}\right)$ comprises determining the union $l_l=\bigcup_{j=1}^{q} l_{l_j}$ of the predicted-record lists of the partial keyword $k_l$ to generate an ordered predicted list by using a sort-merge algorithm and then determining the intersection of several lists $\bigcap_{i=1}^{l} l_i$ by using a merge-join algorithm to intersect the lists, assuming these lists are pre-sorted, or determining whether each record on the short lists appears in other long lists by doing a binary search or a hash-based lookup.
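The union-then-intersect step of claims 3 and 4 can be sketched as follows, assuming every inverted list is a sorted list of integer record IDs. The function names and the sample lists are hypothetical; `heapq.merge` from the Python standard library stands in for the sort-merge step, and a two-pointer loop stands in for the merge-join.

```python
import heapq


def union_sorted(lists):
    """Sort-merge union of sorted record-ID lists, dropping duplicates."""
    out = []
    for rid in heapq.merge(*lists):
        if not out or out[-1] != rid:
            out.append(rid)
    return out


def intersect_sorted(a, b):
    """Merge-join intersection of two sorted, duplicate-free lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def predicted_records(complete_lists, partial_lists):
    """Union the lists of the partial keyword, then intersect with each complete-keyword list."""
    result = union_sorted(partial_lists)
    for lst in complete_lists:
        result = intersect_sorted(result, lst)
    return result


# One complete keyword with list [1, 3, 5, 7]; the partial keyword has three predicted words.
print(predicted_records([[1, 3, 5, 7]], [[1, 2], [5, 9], [3]]))  # [1, 3, 5]
```

When one list is much shorter than the others, the alternative named in the claim (probing the long lists by binary search or hash lookup) avoids scanning the long lists in full.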
  - 5. The method of claim 2 where returning the top-$t$ records in $R_Q$ for a given value $t$, ranked by their relevancy to the query, comprises among the union lists $U_1, U_2, \ldots, U_t$ of the leaf nodes of each prefix node identifying the shortest union list, verifying each record ID on the shortest list by checking if it exists on all the other union lists by maintaining a forward list for each record $r$, which is a sorted list of IDs of keywords in $r$, denoted as $F_r$, so that each prefix $p_i$ has a range of keyword IDs $[MinId_i, MaxId_i]$, verifying whether $r$ appears on a union list $U_k$ of a query prefix $p_k$ for a record $r$ on the shortest union list by testing if $p_k$ appears in the forward list $F_r$ as a prefix by performing a binary search for $MinId_k$ on the forward list $F_r$ to get a lower bound $Id_{lb}$, and checking if $Id_{lb}$ is no larger than $MaxId_k$, where the probing succeeds if the condition holds, and fails otherwise.
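The forward-list probe of claim 5 amounts to one binary search per record and prefix. The sketch below assumes keyword IDs are integers assigned in alphabetical order, so that all keywords sharing a prefix occupy one contiguous ID range; `has_prefix`, `verify`, and the sample data are hypothetical names and values.

```python
from bisect import bisect_left


def has_prefix(forward_list, min_id, max_id):
    """True if the sorted forward list holds a keyword ID in [min_id, max_id]."""
    pos = bisect_left(forward_list, min_id)  # lower bound Id_lb for MinId
    return pos < len(forward_list) and forward_list[pos] <= max_id


def verify(shortest_union, forward_lists, other_ranges):
    """Keep records on the shortest union list that match every other prefix range."""
    return [
        rid
        for rid in shortest_union
        if all(has_prefix(forward_lists[rid], lo, hi) for lo, hi in other_ranges)
    ]


forward_lists = {1: [3, 8, 15, 21], 2: [4, 17], 3: [9, 30]}
# The other query prefix covers keyword IDs 9..16.
print(verify([1, 2, 3], forward_lists, [(9, 16)]))  # [1, 3]
```

Record 2 fails because the smallest ID that is at least 9 on its forward list is 17, which exceeds the upper bound 16.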
  - 6. The method of claim 5 where each query keyword has multiple active nodes of similar prefixes, instead of determining the union of the leaf nodes of one prefix node, determining the unions of the leaf nodes for all active nodes of a prefix keyword, estimating the lengths of these union lists to find a shortest one, for each record r on the shortest union list, for each of the other query keywords, for each of its active nodes, testing if the corresponding similar prefix appears in the record r as a prefix using the forward list of r, F_r.
  - 7. The method of claim 1 where returning the top-$t$ records in $R_Q$ for a given value $t$, ranked by their relevancy to the query, comprises maintaining a session cache for each user where each session cache keeps keywords that the user has input in the past and other information for each keyword, including its corresponding trie node and the top-$t$ predicted records.
  - 8. The method of claim 7 where maintaining a session cache for each user comprises inputting in a query string $c_1 c_2 \ldots c_x$ letter by letter, where $p_i = c_1 c_2 \ldots c_i$ is a prefix query ($1 \le i \le x$) and where $n_i$ is the trie node corresponding to $p_i$, and after inputting in the prefix query $p_i$, storing node $n_i$ for $p_i$ and its top-$t$ predicted records, inputting a new character $c_{x+1}$ at the end of the previous query string $c_1 c_2 \ldots c_x$, determining whether node $n_x$ that has been kept for $p_x$ has a child with a label of $c_{x+1}$, if so, locating leaf descendants of node $n_{x+1}$, and retrieving corresponding predicted words and the predicted records, otherwise, there is no word that has a prefix of $p_{x+1}$, and then returning an empty answer.
  - 9. The method of claim 8 where, for a keystroke that invokes query $Q$, returning the top-$t$ records in $R_Q$ for a given value $t$, ranked by their relevancy to the query, comprises matching prefixes which includes the similarity between a query keyword and its best matching prefix; predicted keywords where different predicted keywords for the same prefix can have different weights, and record weights where different records have different weights, where a query is $Q=\{p_1, p_2, \ldots\}$, where $p'_i$ is the best matching prefix for $p_i$, and where $k_i$ is the best predicted keyword for $p'_i$, where $\mathrm{sim}(p_i, p'_i)$ is an edit similarity between $p'_i$ and $p_i$ and where the score of a record $r$ for $Q$ can be defined as:
    
    $\mathrm{Score}(r,Q)=\sum_i \left[\mathrm{sim}(p_i,p'_i)+\alpha \cdot (|p'_i|-|k_i|)+\beta \cdot \mathrm{score}(r,k_i)\right]$, where $\alpha$ and $\beta$ are weights ($0<\beta<\alpha<1$), and $\mathrm{score}(r, k_i)$ is a score of record $r$ for keyword $k_i$.
  - 10. The method of claim 7 further comprising modifying a previous query string arbitrarily, or copying and pasting a completely different string for a new query string, among all the keywords input by the user, identifying the cached keyword that has the longest prefix with the new query.
  - 11. The method of claim 7 where prefix queries $p_1, p_2, \ldots, p_x$ have been cached, further comprising inputting a new query $p' = c_1 c_2 \ldots c_i c' \ldots c_y$, finding $p_i$ that has a longest prefix with $p'$, using node $n_i$ of $p_i$ to incrementally answer the new query $p'$ by inserting the characters after the longest prefix of the new query $c' \ldots c_y$ one by one, if there exists a cached keyword $p_i = p'$, using the cached top-$t$ records of $p_i$ to directly answer the query $p'$; otherwise if there is no such cached keyword, answering the query without use of any cache.
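The cache lookup of claims 10 and 11 can be pictured with a small sketch. It assumes the session cache is a dictionary keyed by previously typed prefix queries, with whatever the cache stores per prefix (trie node, top-$t$ records) as opaque values; `longest_cached_prefix` and `plan_answer` are hypothetical helper names, and the three-way result tuple is only one possible way to report the decision.

```python
def longest_cached_prefix(cache, new_query):
    """Return the longest cached key that is a prefix of new_query ('' if none)."""
    for end in range(len(new_query), 0, -1):
        if new_query[:end] in cache:
            return new_query[:end]
    return ""


def plan_answer(cache, new_query):
    """Decide how to answer new_query: from cache, incrementally, or from scratch."""
    best = longest_cached_prefix(cache, new_query)
    if best and best == new_query:
        return ("cached", best, "")                         # reuse cached top-t records
    if best:
        return ("incremental", best, new_query[len(best):])  # feed remaining letters one by one
    return ("no-cache", "", new_query)


# The user previously typed "compu" letter by letter, so every prefix is cached.
cache = {p: object() for p in ["c", "co", "com", "comp", "compu"]}
print(plan_answer(cache, "compare"))  # ('incremental', 'comp', 'are')
print(plan_answer(cache, "com"))      # ('cached', 'com', '')
print(plan_answer(cache, "data"))     # ('no-cache', '', 'data')
```

In the incremental case the cached trie node for the returned prefix would be the starting point, and the remaining characters would be applied one at a time as in claim 8.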
  - 12. The method of claim 7 where maintaining a session cache for each user comprises:
    - caching query results and using them to answer subsequent queries;
    - increasing the edit-distance threshold $\delta$ as a query string is getting longer in successive queries;
    - using pagination to show query results in different pages to partially traverse the shortest list, until enough results have been obtained for a first page, continuing traversing the shortest list to determine more query results and caching them; or
    - retrieving the top-$k$ records according to a ranking function, for a predefined constant $k$, verifying each record accessed in the traversal by probing the keyword range using the forward list of the record, caching records that pass verification, then when answering a query incrementally, first verifying each record in the cached result of the previous increment of the query by probing the keyword range, if the results from the cache are insufficient to compute the new top-$k$, resuming the traversal on the list starting from the stopping point of the previous query, until we have enough top-$k$ results for the new query.
  - 13. The method of claim 1 for searching a structured data table $T$ with the query $Q=\{k_1, k_2, \ldots, k_l\}$, where an edit distance between two strings $s_1$ and $s_2$, denoted by $ed(s_1, s_2)$, is the minimum number of edit operations of single characters needed to transform the first string $s_1$ to the second string $s_2$, and an edit-distance threshold $\delta$, for $1 \le i \le l$, where a predicted-word set $W_{k_i}$ for $k_i$ is $\{w \mid \exists w' \le w,\ w \in W,\ ed(k_i, w') \le \delta\}$, where a predicted-record set $R_Q$ is $\{r \mid r \in R,\ \forall i,\ 1 \le i \le l,\ \exists w_i \in W_{k_i},\ w_i \text{ appears in } r\}$, comprising determining the top-$t$ predicted records in $R_Q$ ranked by their relevancy to $Q$ with the edit-distance threshold $\delta$.
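The fuzzy predicted-word set of claim 13 can be checked by brute force, which is useful as a reference even though the claims that follow describe a trie-based method. In the sketch below, the last row of the standard edit-distance table gives the distance from the keyword to every prefix of a data word at once. The function names and the sample word list are hypothetical.

```python
def prefix_edit_distances(k, w):
    """Return [ed(k, w[:j]) for j in 0..len(w)], computed row by row."""
    row = list(range(len(w) + 1))  # distances from the empty keyword
    for i, ck in enumerate(k, start=1):
        prev, row = row, [i] + [0] * len(w)
        for j, cw in enumerate(w, start=1):
            row[j] = min(
                prev[j] + 1,               # drop ck from the keyword
                row[j - 1] + 1,            # add cw to the keyword
                prev[j - 1] + (ck != cw),  # match or substitute
            )
    return row


def predicted_words(k, words, delta):
    """Words having some prefix w' with ed(k, w') <= delta."""
    return [w for w in words if min(prefix_edit_distances(k, w)) <= delta]


words = ["vldb", "vldbj", "sigmod", "icde"]
print(prefix_edit_distances("vlb", "vldb"))  # [3, 2, 1, 1, 1]
print(predicted_words("vlb", words, 1))      # ['vldb', 'vldbj']
```

Note that the definition admits every word once $\delta$ reaches the keyword length, because the empty prefix is then within the threshold.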
  - 14. The method of claim 13 further comprising:
    - inputting a keyword $k$, storing a set of active nodes $\Phi_k=\{[n, \xi_n]\}$, where $n$ is an active node for $k$, and $\xi_n=ed(k, n) \le \delta$,
    - inputting one more letter after $k$, and
    - finding only the descendants of the active nodes of $k$ as active nodes of the new query which comprises initializing an active-node set for an empty keyword $\epsilon$, i.e., $\Phi_\epsilon=\{[n, \xi_n] \mid \xi_n=|n| \le \delta\}$, namely including all trie nodes $n$ whose corresponding string has a length $|n|$ within the edit-distance threshold $\delta$, inputting a query string $c_1 c_2 \ldots c_x$ letter by letter as follows:
      
      after inputting in a prefix query $p_i=c_1 c_2 \ldots c_i$ ($i \le x$), storing an active-node set $\Phi_{p_i}$ for $p_i$, when inputting a new character $c_{x+1}$ and submitting a new query $p_{x+1}$, incrementally determining the active-node set $\Phi_{p_{x+1}}$ for $p_{x+1}$ by using $\Phi_{p_x}$ as follows:
      
      for each $[n, \xi_n]$ in $\Phi_{p_x}$, we consider whether the descendants of $n$ are active nodes for $p_{x+1}$, for the node $n$, if $\xi_n+1 < \delta$, then $n$ is an active node for $p_{x+1}$, then storing $[n, \xi_n+1]$ into $\Phi_{p_{x+1}}$, for each child $n_c$ of node $n$, (1) the child node $n_c$ has a character different from $c_{x+1}$, $ed(n_s, p_{x+1}) \le ed(n, p_x)+1=\xi_n+1$, if $\xi_n+1 \le \delta$, then $n_s$ is an active node for the new string, then storing $[n_s, \xi_n+1]$ into $\Phi_{p_{x+1}}$, or (2) the child node $n_c$ has a label $c_{x+1}$ and is denoted as a matching node $n_m$, $ed(n_m, p_{x+1}) \le ed(n, p_x)=\xi_n \le \delta$, so that $n_m$ is an active node of the new string, then storing $[n_m, \xi_n]$ into $\Phi_{p_{x+1}}$, but if the distance for the node $n_m$ is smaller than $\delta$, i.e., $\xi_n < \delta$, then for each $n_m$'s descendant $d$ that is at most $\delta-\xi_n$ letters away from $n_m$, adding $[d, \xi_d]$ to the active-node set for the new string $p_{x+1}$, where $\xi_d=\xi_n+|d|-|n_m|$.
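The incremental update of claim 14 can be sketched in Python as below. The class and function names are hypothetical and the word list is invented. Two points are assumptions of this sketch: the deletion case uses $\xi_n + 1 \le \delta$, so that the active set is exactly the set of trie nodes within distance $\delta$ of the query, and a repeated node keeps the smaller distance, as claim 15 describes.

```python
class Node:
    """Trie node; `text` is the string spelled from the root to this node."""

    def __init__(self, text=""):
        self.text = text
        self.children = {}  # character label -> Node


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = Node(node.text + ch)
            node = node.children[ch]
    return root


def descendants(node, limit):
    """Yield (descendant, k) for each descendant k letters below node, 1 <= k <= limit."""
    stack = [(node, 0)]
    while stack:
        cur, k = stack.pop()
        if k > 0:
            yield cur, k
        if k < limit:
            stack.extend((child, k + 1) for child in cur.children.values())


def add(active, node, dist):
    """Store [node, dist], keeping the smaller distance if the node repeats."""
    if node not in active or dist < active[node]:
        active[node] = dist


def initial_active(root, delta):
    """Active nodes for the empty keyword: every node of depth <= delta."""
    active = {root: 0}
    for d, k in descendants(root, delta):
        active[d] = k
    return active


def extend_active(active, c, delta):
    """Active-node set for p + c, given the active-node set for p."""
    new = {}
    for n, xi in active.items():
        if xi + 1 <= delta:
            add(new, n, xi + 1)                  # drop c from the query
        for label, child in n.children.items():
            if label != c:
                if xi + 1 <= delta:
                    add(new, child, xi + 1)      # substitute label for c
            else:
                add(new, child, xi)              # matching node n_m
                for d, k in descendants(child, delta - xi):
                    add(new, d, xi + k)          # extra letters below n_m
    return new


def active_nodes(root, query, delta):
    """Feed the query letter by letter; return {node string: edit distance}."""
    active = initial_active(root, delta)
    for c in query:
        active = extend_active(active, c, delta)
    return {n.text: xi for n, xi in active.items()}


print(active_nodes(build_trie(["li"]), "l", 1))  # {'': 1, 'l': 0, 'li': 1}
```

With $\delta = 0$ the update degenerates to an exact walk down the trie, which matches the non-fuzzy prefix lookup of claim 8.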
  - 15. The method of claim 14 where during storing set $\Phi_{p_{x+1}}$, it is possible to add two new pairs $[v, \xi_1]$ and $[v, \xi_2]$ for the same trie node $v$, in which case storing the one of the new pairs $[v, \xi_1]$ and $[v, \xi_2]$ for the same trie node $v$ with the smaller edit distance.
  - 16. The method of claim 13 where given two words $w_i$ and $w_j$, their normalized edit distance is:
    
    $ned(w_i, w_j)=ed(w_i, w_j)/\max(|w_i|, |w_j|)$;
    
    where $|w_i|$ denotes the length of $w_i$, where given an input keyword and one of its predicted words, the prefix of the predicted word with the minimum $ned$ is defined as a best predicted prefix, and the corresponding normalized edit distance is defined as the “minimal normalized edit distance,” denoted as “$mned$” and
    
    where returning the top-$t$ records in $R_Q$ for a given value $t$, ranked by their relevancy to the query, comprises determining if $k_i$ is a complete keyword, then using $ned$ to quantify the similarity; otherwise, if $k_i$ is a partial keyword, then using $mned$ to quantify their similarity, namely quantifying similarity of two words using:
    
    $\mathrm{sim}(k_i, w_i)=\gamma \cdot (1-ned(k_i, w_i))+(1-\gamma) \cdot (1-mned(k_i, w_i))$;
    
    where $\gamma$ is a tuning parameter between 0 and 1.
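To make the three quantities of claim 16 concrete, here is a small worked example under assumed inputs. The strings and the value of $\gamma$ are illustrative only and do not come from the patent: take the partial keyword $k_i = \texttt{vld}$, the predicted word $w_i = \texttt{vldb}$, and $\gamma = 0.5$.

```latex
\begin{aligned}
ned(k_i, w_i) &= \frac{ed(\texttt{vld}, \texttt{vldb})}{\max(3, 4)} = \frac{1}{4} \\
mned(k_i, w_i) &= \min_{w' \le w_i} ned(k_i, w') = ned(\texttt{vld}, \texttt{vld}) = 0 \\
\mathrm{sim}(k_i, w_i) &= \gamma \left(1 - \tfrac{1}{4}\right) + (1 - \gamma)(1 - 0)
  = 0.5 \cdot 0.75 + 0.5 \cdot 1 = 0.875
\end{aligned}
```

The best predicted prefix here is `vld` itself, so the $mned$ term contributes its maximum and only the $ned$ term penalizes the untyped final letter.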
  - 17. The method of claim 1 where determining the predicted-record set $R_Q$ comprises determining possible multiple words which have a prefix similar to a partial keyword, including multiple trie nodes corresponding to these words defined as the active nodes for the keyword $k$, locating leaf descendants of the active nodes, and determining the predicted records corresponding to these leaf nodes.
  - 18. The method of claim 1 where returning the top-$t$ records in $R_Q$ for a given value $t$, ranked by their relevancy to the query, comprises for a given query $Q=\{p_1, p_2, \ldots, p_l\}$, where $\{k_{i_1}, k_{i_2}, \ldots\}$ is the set of keywords that share the prefix $p_i$, where $L_{i_j}$ is the inverted list of $k_{i_j}$, and $U_i=\bigcup_j L_{i_j}$ is the union of the lists for $p_i$, for each prefix $p_i$, determining the corresponding union list $U_i$ on the fly and intersecting the union lists of different keywords to find $\bigcap_i U_i$.
  - 19. The method of claim 1 where returning the top-$t$ records in $R_Q$ for a given value $t$, ranked by their relevancy to the query, comprises for a given query $Q=\{p_1, p_2, \ldots, p_l\}$, where $\{k_{i_1}, k_{i_2}, \ldots\}$ is the set of keywords that share the prefix $p_i$, where $L_{i_j}$ is the inverted list of $k_{i_j}$, and $U_i=\bigcup_j L_{i_j}$ is the union of the lists for $p_i$, for each prefix $p_i$, predetermining and storing the union list $U_i$ of each prefix $p_i$, and intersecting the union lists $\bigcap_i U_i$ of query keywords when a query is initiated.
  - 20. The method of claim 1 where returning the top-$t$ records in $R_Q$ for a given value $t$, ranked by their relevancy to the query, comprises assigning to each record a score on a list, and combining the scores of the records on different lists using an aggregation function to determine an overall relevance of the record to the query, where the aggregation function is monotonic, for each data keyword similar to a keyword in the query providing an inverted list sorted based on the weight of the keyword in a record, and accessing the sorted inverted lists.
  - 21. The method of claim 1 where the data table $T$ is structured into a trie and further comprising linking each node on the trie corresponding to a word, $w_i$, to each node corresponding to the synonyms of the word, $w_i$, in the trie and vice versa to return both $w_i$ and its synonyms using the link when the word $w_i$ is retrieved.

22. A method for fuzzy type ahead search where R is a collection of records such as the tuples in a relational table, where D is a data set of words in R, where a user inputs a keyword query letter by letter, comprising:
- finding on-the-fly records with keywords similar to the query keywords by using edit distance to measure the similarity between strings, where the edit distance between two strings $s_1$ and $s_2$, denoted by $ed(s_1, s_2)$, is the minimum number of single-character edit operations, where $Q$ is the keyword query the user has input which is a sequence of keywords $[w_1, w_2, \ldots, w_m]$;
  
  treating the last keyword $w_m$ as a partial keyword, finding the keywords in the data set that are similar to query keywords, where $\pi$ is a function that quantifies the similarity between a string $s$ and a query keyword $w$ in $D$, including, but not limited to:
  
  $\pi(s,w)=1-ed(s,w)/|w|$, where $|w|$ is the length of the keyword $w$; and
  
  normalizing the edit distance based on the query-keyword length in order to allow more errors for longer query keywords, where $d$ is a keyword in $D$, for each complete keyword $w_i$ ($i=1, \ldots, m-1$), defining the similarity of $d$ to $w_i$ as:
  
  $\mathrm{Sim}(d,w_i)=\pi(d,w_i)$, since the last keyword $w_m$ is treated as a prefix condition, defining the similarity of $d$ to $w_m$ as the maximal similarity of $d$'s prefixes using function $\pi$, namely $\mathrm{Sim}(d, w_m)=\max_{\text{prefix } p \text{ of } d} \pi(p, w_m)$, where $\tau$ is a similarity threshold, where a keyword in $D$ is similar to a query keyword $w$ if $\mathrm{Sim}(d, w) \ge \tau$, where a prefix $p$ of a keyword in $D$ is similar to the query keyword $w_m$ if $\pi(p, w_m) \ge \tau$, where $\varphi(w_i)$ ($i=1, \ldots, m$) denotes the set of keywords in $D$ similar to $w_i$, and where $P(w_m)$ denotes the set of prefixes of keywords in $D$ similar to $w_m$.
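The similarity definitions of claim 22 translate directly into a few lines of Python. This is a brute-force sketch for illustration, with hypothetical function names and an invented two-word data set; it assumes non-empty keywords and evaluates $\pi$ on every non-empty prefix of a data keyword instead of using a trie.

```python
def edit_distance(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        prev, row = row, [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            row[j] = min(prev[j] + 1, row[j - 1] + 1, prev[j - 1] + (ca != cb))
    return row[-1]


def pi(s, w):
    """pi(s, w) = 1 - ed(s, w) / |w|, normalized by the query-keyword length."""
    return 1 - edit_distance(s, w) / len(w)


def sim_complete(d, w_i):
    """Similarity of data keyword d to a complete query keyword w_i."""
    return pi(d, w_i)


def sim_partial(d, w_m):
    """Similarity of d to the partial keyword w_m: best score over d's prefixes."""
    return max(pi(d[:j], w_m) for j in range(1, len(d) + 1))


def similar_to_partial(data_words, w_m, tau):
    """Data keywords d with Sim(d, w_m) >= tau."""
    return [d for d in data_words if sim_partial(d, w_m) >= tau]


data_words = ["vldb", "icde"]
print(similar_to_partial(data_words, "vlb", 0.6))  # ['vldb']
```

For the query keyword `vlb`, the prefix `vl` of `vldb` is one edit away, giving a similarity of about 0.67, while no prefix of `icde` scores above 0.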
- View Dependent Claims (23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38)
- - 23. The method of claim 22 further comprising:
    - ranking each record $r$ in $R$ based on its relevance to the query, where $F(\cdot, \cdot)$ is a ranking function that takes the query $Q$ and a record $r \in R$;
    - determining a score $F(r, Q)$ as the relevance of the record $r$ to the query $Q$, and
    - given a positive integer $k$, determining the $k$ best records in $R$ ranked by their relevance to $Q$ based on the score $F(r, Q)$.
  - 24. The method of claim 23 where determining a score F(r, Q) as the relevance of the record r to the query Q comprises determining the relevance score F(r, Q) based on the similarities of the keywords in r and those keywords in the query given that a keyword d in the record could have different prefixes with different similarities to the partial keyword w_m, by taking their maximal value as the overall similarity between d and w_m, where a keyword in record r has a weight with respect to r, such as the term frequency TF and inverse document frequency IDF of the keyword in the record.
  - 25. The method of claim 23 where determining the score $F(r, Q)$ comprises:
    - for each keyword $w$ in the query, determining a score of the keyword with respect to the record $r$ and the query, denoted by $\mathrm{Score}(r, w, Q)$; and
    - determining the score $F(r, Q)$ by applying a monotonic function on the $\mathrm{Score}(r, w, Q)$'s for all the keywords $w$ in the query.
  - 26. The method of claim 25 where $d$ is a keyword in record $r$ such that $d$ is similar to the query keyword $w$, $d \in \varphi(w)$, where $\mathrm{Score}(r, w, d)$ denotes the relevance of this query keyword $w$ to the record keyword $d$, where the relevance value $\mathrm{Score}(r, w, Q)$ for a query keyword $w$ in the record is the maximal value of the $\mathrm{Score}(r, w, d)$'s for all the keywords $d$ in the record, where determining a score of the keyword with respect to the record $r$ and the query, denoted by $\mathrm{Score}(r, w, Q)$, comprises finding the most relevant keyword in a record to a query keyword when computing the relevance of the query keyword to the record as an indicator of how important this record is to the user query.
  - 27. The method of claim 26 where F(r, Q) is
- 28. The method of claim 26 comprising partitioning the inverted lists into several groups based on their corresponding query keywords, where each query keyword w has a group of inverted lists, producing a list of record IDs sorted on their scores with respect to this keyword, and using a top-k algorithm to find the k best records for the query.
- 29. The method of claim 28 where using a top-k algorithm to find the k best records for the query comprises for each group of inverted lists for the query keyword w, retrieving the next most relevant record ID for w by building a max heap on the inverted lists comprising maintaining a cursor on each inverted list in the group, where the heap is comprised of the record IDs pointed by the cursors so far, sorted on the scores of the similar keywords in these records since each inverted list is already sorted based on the weights of its keyword in the records and all the records on this list share the same similarity between this keyword and the query keyword w the list is also sorted based on the scores of this keyword in these records, retrieving the next best record from this group by popping the top element from the heap, incrementing the cursor of the list of the popped element by one, and pushing the new element of this list to the heap, ignoring other lists that may produce this popped record, since their corresponding scores will no longer affect the score of this record with respect to the query keyword w.
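The heap-based sorted access of claim 29 can be sketched with Python's `heapq`. The sketch assumes that the score of a record on a list is the product of the list's keyword similarity and the record's weight; the claim only says the score depends on both, so the product is an illustrative choice. `sorted_access` and the sample lists are hypothetical.

```python
import heapq


def sorted_access(lists):
    """Yield (record ID, score) in descending score order for one query keyword.

    lists: sequence of (similarity, postings), where postings is a list of
    (record ID, weight) pairs sorted by weight in descending order.
    """
    lists = list(lists)
    heap = []
    for idx, (similarity, postings) in enumerate(lists):
        if postings:
            weight = postings[0][1]
            # heapq is a min-heap, so scores are negated to pop the maximum first.
            heap.append((-similarity * weight, idx, 0))
    heapq.heapify(heap)
    seen = set()
    while heap:
        neg_score, idx, cursor = heapq.heappop(heap)
        similarity, postings = lists[idx]
        rid = postings[cursor][0]
        if cursor + 1 < len(postings):
            # Advance this list's cursor and push its next element.
            next_weight = postings[cursor + 1][1]
            heapq.heappush(heap, (-similarity * next_weight, idx, cursor + 1))
        if rid not in seen:
            # Later occurrences of rid have lower scores and are ignored.
            seen.add(rid)
            yield rid, -neg_score


group = [
    (1.0, [(1, 8), (2, 4)]),   # exact keyword match
    (0.5, [(2, 10), (3, 6)]),  # similar keyword
]
print(list(sorted_access(group)))  # [(1, 8.0), (2, 5.0), (3, 3.0)]
```

Record 2 is reported once with its best score 5.0; its second appearance with score 4.0 is skipped, which corresponds to ignoring other lists that may produce an already popped record.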
- 30. The method of claim 29 where $L_1, \ldots, L_t$ are inverted lists with the similar keywords $d_1, \ldots, d_t$, respectively, and further comprising sorting these inverted lists based on the similarities of their keywords to $w$, $\mathrm{Sim}(d_1, w), \ldots, \mathrm{Sim}(d_t, w)$, constructing the max heap using the lists with the highest similarity values.
- 31. The method of claim 22 where the dataset $D$ comprises a trie for the data keywords in $D$, where each trie node has a character label, where each keyword in $D$ corresponds to a unique path from the root to a leaf node on the trie, where a leaf node has an inverted list of pairs [rid, weight], where rid is the ID of a record containing the leaf-node string, and weight is the weight of the keyword in the record and further comprising determining the top-$k$ answers to the query $Q$ in two steps comprising:
  - for each keyword $w_i$ in the query, determining the similar keywords $\varphi(w_i)$ and similar prefixes $P(w_m)$ on the trie; and
  - accessing the inverted lists of these similar data keywords to determine the $k$ best answers to the query.
- 32. The method of claim 31 where accessing the inverted lists of these similar data keywords to determine the $k$ best answers to the query comprises randomly accessing the inverted list, in each random access, given an ID of a record $r$, retrieving information related to the keywords in the query $Q$, to determine the score $F(r, Q)$ using a forward index in which each record has a forward list of the IDs of its keywords and their corresponding weights, where each keyword has a unique ID corresponding to its leaf node on the trie, and the IDs of the keywords follow their alphabetical order.
- 33. The method of claim 32 where randomly accessing the inverted list comprises:
  - maintaining for each trie node $n$ a keyword range $[l_n, u_n]$, where $l_n$ and $u_n$ are the minimal and maximal keyword IDs of its leaf nodes, respectively;
  - verifying whether record $r$ contains a keyword with a prefix similar to $w_m$, where for a prefix $p$ on the trie similar to $w_m$ checking if there is a keyword ID on the forward list of $r$ in the keyword range $[l_p, u_p]$ of the trie node of $p$, since the forward list of $r$ is sorted, this checking is performed by a binary search using the lower bound $l_p$ on the forward list of $r$ to get the smallest ID $\gamma$ no less than $l_p$, the record having a keyword similar to $w_m$ if $\gamma$ exists and is no greater than the upper bound $u_p$, i.e., $\gamma \le u_p$.
- 34. The method of claim 32 where randomly accessing the inverted list comprises:
  - for each prefix $p$ similar to $w_m$, traversing the subtrie of $p$ and identifying its leaf nodes;
  - for each leaf node $d$, for the query $Q$, this keyword $d$ has a prefix similar to $w_m$ in the query, storing [Query ID, partial keyword $w_m$, $\mathrm{sim}(p, w_m)$] in order to differentiate the query from other queries in case multiple queries are answered concurrently;
  - storing the similarity between $w_m$ and $p$;
  - determining the score of this keyword in a candidate record, where in the case of the leaf node having several prefixes similar to $w_m$, storing their maximal similarity to $w_m$;
  - for each keyword $w_i$ in the query, storing the same information for those trie nodes similar to $w_i$, defining stored entries for the leaf node as its collection of relevant query keywords;
  - using the collection of relevant query keywords to efficiently check if a record $r$ contains a complete word with a prefix similar to the partial keyword $w_m$ by scanning the forward list of $r$, for each of its keyword IDs, locating the corresponding leaf node on the trie, and testing whether its collection of relevant query keywords includes this query and the keyword $w_m$, and if so, using the stored string similarity to determine the score of this keyword in the query.
- 35. The method of claim 31 further comprising improving sorted access by precomputing and storing the unions of some of the inverted lists on the trie, where $v$ is a trie node, and $U(v)$ is the union of the inverted lists of $v$'s leaf nodes, sorted by their record weights, and if a record appears more than once on these lists, selecting its maximal weight as its weight on list $U(v)$, where $U(v)$ is defined as the union list of node $v$.
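Building the union list $U(v)$ of claim 35 is a small aggregation. The sketch below assumes each leaf list holds (record ID, weight) pairs; `union_list` and the sample lists are hypothetical, and breaking weight ties by record ID is an added convention to keep the output deterministic.

```python
def union_list(leaf_lists):
    """Union of leaf inverted lists, one entry per record with its maximal
    weight, sorted by weight in descending order."""
    best = {}
    for postings in leaf_lists:
        for rid, weight in postings:
            if rid not in best or weight > best[rid]:
                best[rid] = weight
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


leaves = [[(1, 8), (2, 4)], [(2, 10), (3, 6)]]
print(union_list(leaves))  # [(2, 10), (1, 8), (3, 6)]
```

Record 2 appears on both leaf lists and keeps its larger weight, 10, which is why it moves to the front of the union list.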
- 36. The method of claim 35 where $v$ is a trie node comprising materializing union list $U(v)$, and using $U(v)$ to speed up sorted access for the prefix keyword $w_m$ is that $U(v)$ is sorted based on its record weights, where the value $\mathrm{Score}(r, w_m, d_i)$ of a record $r$ on the list of a keyword $d_i$ with respect to $w_m$ is based on both $\mathrm{Weight}(d_i, r)$ and $\mathrm{Sim}(d_i, w_m)$, where all the leaf nodes of $v$ have the same similarity to $w_m$, where all the leaf nodes of $v$ are similar to $w_m$, namely their similarity to $w_m$ is no less than the threshold $\tau$ so that the sorting order of the union list $U(v)$ is also the order of the scores of the records on the leaf-node lists with respect to $w_m$.
- 37. The method of claim 36 where $B$ is a budget of storage space available to materialize union lists comprising selecting trie nodes to materialize their union lists to maximize the performance of queries, where a node is defined as “materialized” if its union list has been materialized, where for a query $Q$ with a prefix keyword $w_m$, some of the trie nodes have their union lists materialized, where $v$ is the highest trie node that is usable for the max heap of $w_m$, and for which $U(v)$ has not been materialized, where for each nonleaf trie descendant $c$ of $v$, such that no node on the path from $v$ to $c$ (including $c$) has been materialized comprising:
  
  performing a cost-based analysis to quantify the benefit of materializing $U(c)$ on the performance of operations on the max heap of $w_m$ based on reduction of traversal time, reduction of heap-construction time and reduction of sorted-access time, the overall benefit $B_v$ of materializing $v$ for the query keyword $w_m$ being:
  
  $B_v=B_{\text{reduction of traversal time}}+B_{\text{reduction of heap-construction time}}+A_v \cdot B_{\text{reduction of sorted-access time}}$, where $A_v$ is the number of sorted accesses on $U(v)$ for each query; and
  
  then summing the benefits of materializing its union list to all the queries in the query workload or trie according to probability of occurrence of the query, and recomputing the benefit $B_v$ of materializing other affected nodes after the benefit of each node is computed until the given budget $B$ of storage space is realized.
- 38. The method of claim 37 where selecting trie nodes to materialize their union lists to maximize the performance of queries comprises randomly selecting trie nodes, selecting nodes top down from the trie root, or selecting nodes bottom up from the leaf nodes.

39. A software product including instructions stored on a non-transitory tangible medium for controlling a computer for searching a structured data table $T$ with $m$ attributes and $n$ records, where $A=\{a_1, a_2, \ldots, a_m\}$ denotes an attribute set, $R=\{r_1, \ldots, r_n\}$ denotes the record set and $W=\{w_1, w_2, \ldots, w_p\}$ denotes a distinct word set in $T$, where given two words $w_i$ and $w_j$, “$w_i \le w_j$” denotes that $w_i$ is a prefix string of $w_j$, where a query consists of a set of prefixes $Q=\{p_1, p_2, \ldots, p_l\}$, where a predicted-word set is $W_{k_l}=\{w \mid w \text{ is a member of } W \text{ and } k_l \le w\}$, the method comprising for each prefix $p_i$ finding the set of prefixes from the data set that are similar to $p_i$, by:
  
  determining the predicted-record set $R_Q=\{r \mid r \text{ is a member of } R, \text{ for every } i,\ 1 \le i \le l-1,\ p_i \text{ appears in } r, \text{ and there exists a } w \text{ included in } W_{k_l},\ w \text{ appears in } r\}$; and
  
  for a keystroke that invokes query $Q$, returning the top-$t$ records in $R_Q$ for a given value $t$, ranked by their relevancy to the query, treating every keyword as a partial keyword, namely given an input query $Q=\{k_1, k_2, \ldots, k_l\}$, for each predicted record $r$, for each $1 \le i \le l$, there exists at least one predicted word $w_i$ for $k_i$ in $r$, since $k_i$ must be a prefix of $w_i$, quantifying similarity as:
  
  $\mathrm{sim}(k_i, w_i) = |k_i| / |w_i|$, if there are multiple predicted words in $r$ for a partial keyword $k_i$, selecting the predicted word $w_i$ with the maximal similarity to $k_i$ and quantifying a weight of a predicted word to capture the importance of a predicted word, and taking into account the number of attributes that the $l$ predicted words appear in, denoted as $n_a$, to combine similarity, weight and number of attributes to generate a ranking function to score $r$ for the query $Q$ as follows:

Current Assignee: Regents of the University of California (University of California)
Original Assignee: Regents of the University of California (University of California)
Inventors: Li, Chen; Ji, Shengyue; Li, Guoliang; Wang, Jiannan; Feng, Jianhua
Primary Examiner(s): Lovel, Kimberly
Assistant Examiner(s): Uddin, Mohammed R
Application Number: US12/497,489
Publication Number: US 20100010989A1
Time in Patent Office: 887 Days
Field of Search: 707/780, 707/999.004
US Class Current: 707/780
CPC Class Codes: G06F 16/334 Query execution; G06F16/335 ...; G06F 40/232 Orthographic correction, e....
