Hybrid approach to collating unicode text strings consisting primarily of ASCII characters

US 10,089,282 B1
Filed: 01/31/2018
Issued: 10/02/2018
Est. Priority Date: 11/06/2016
Status: Active Grant

Abstract

Collating text strings having Unicode encoding includes receiving two text strings S = s₁s₂...s_n and T = t₁t₂...t_m. When the two text strings are not identical, there is a smallest positive integer p for which the two text strings differ. The process looks up the characters s_p and t_p in a predefined lookup table. If either of these characters is missing from the lookup table, the collation of the text strings is determined using the standard Unicode comparison of the text strings s_p s_{p+1}...s_n and t_p t_{p+1}...t_m. Otherwise, the lookup table assigns weights v_p and w_p for the characters s_p and t_p. When v_p ≠ w_p, these weights define the collation order of the strings S and T. When v_p = w_p, the collation of S and T is determined recursively using the suffix strings s_{p+1}...s_n and t_{p+1}...t_m.
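
The comparison flow described in the abstract can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: `FAST_WEIGHTS`, `code_point_fallback`, and `hybrid_compare` are hypothetical names, the weights are invented for the example, the fallback is a plain code point comparison standing in for a real Unicode collation routine, and a string that is a strict prefix of the other is assumed to sort first.

```python
# Hypothetical fast lookup table: small integer weights for a few characters.
# Accented letters share the weight of their base letter in this toy table.
FAST_WEIGHTS = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz", start=1)}
FAST_WEIGHTS.update({"é": FAST_WEIGHTS["e"], "è": FAST_WEIGHTS["e"], "ü": FAST_WEIGHTS["u"]})


def code_point_fallback(s, t):
    """Stand-in for a full Unicode collation (assumption): compares code points."""
    return (s > t) - (s < t)


def hybrid_compare(s, t, table=FAST_WEIGHTS, fallback=code_point_fallback):
    """Return -1 if s collates before t, 1 if after, 0 if at the same position."""
    while True:
        # Find the first position p (0-based here) where the strings differ.
        limit = min(len(s), len(t))
        p = 0
        while p < limit and s[p] == t[p]:
            p += 1
        if p == limit:
            # No difference within the shared length: equal, or one is a prefix.
            return (len(s) > len(t)) - (len(s) < len(t))
        v = table.get(s[p])
        w = table.get(t[p])
        if v is None or w is None:
            # A character is missing from the table: use the slower fallback
            # on the suffixes that start at the differing position.
            return fallback(s[p:], t[p:])
        if v != w:
            return -1 if v < w else 1
        # Equal weights: continue with the suffixes after position p.
        # The loop plays the role of the recursion described in the abstract.
        s, t = s[p + 1:], t[p + 1:]
```

With this toy table, `hybrid_compare("zebra", "école")` returns `1` because the table weight of `é` is lower than that of `z`, whereas `hybrid_compare("naïve", "nave")` reaches the fallback because `ï` is not in the table.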

20 Claims

1. A method of collating text strings having Unicode encoding, comprising:
- at a computer having one or more processors, and memory storing one or more programs configured for execution by the one or more processors;

  receiving a first text string S = s₁s₂...s_n having Unicode encoding and a second text string T = t₁t₂...t_m having Unicode encoding, wherein n and m are positive integers, s₁, s₂, ..., s_n and t₁, t₂, ..., t_m are Unicode characters, and S is not identical to T;

  (1) identifying a positive integer p with s₁ = t₁, s₂ = t₂, ..., s_{p−1} = t_{p−1} and s_p ≠ t_p, wherein at least one of s_p and t_p is a non-ASCII character;

  (2) looking up the characters s_p and t_p in a predefined lookup table to determine a weight v_p for the character s_p and a weight w_p for the character t_p;

  (3) when at least one of s_p and t_p is not found in the lookup table, determining the collation order of the strings S and T using Unicode weights for the corresponding strings s_p s_{p+1}...s_n and t_p t_{p+1}...t_m;

  (4) when both s_p and t_p are found in the lookup table and v_p < w_p, determining that S is collated before T;

  (5) when both s_p and t_p are found in the lookup table and w_p < v_p, determining that T is collated before S;

  (6) when both s_p and t_p are found in the lookup table, v_p = w_p, and s_{p+1}...s_n = t_{p+1}...t_m, determining that S and T have the same collation position; and

  when both s_p and t_p are found in the lookup table, v_p = w_p, and s_{p+1}...s_n ≠ t_{p+1}...t_m, determining the collation order of S and T recursively according to steps (1)-(6) using the suffix strings s_{p+1}...s_n and t_{p+1}...t_m.
- 2. The method of claim 1, wherein the lookup table includes lookup values for each non-control ASCII character plus a plurality of accented Roman characters.
- 3. The method of claim 1, wherein each weight in the lookup table is encoded as a respective single byte.
- 4. The method of claim 1, further comprising, when m ≠ n, padding the shorter of the text strings S and T on the right so that the text strings S and T have the same length.
- 5. The method of claim 4, wherein the padding comprises ASCII null characters.
- 6. The method of claim 1, wherein the Unicode weights for the strings s_p s_{p+1}...s_n and t_p t_{p+1}...t_m are computed, the computation comprising:
  - for each character, performing a lookup in a Unicode weight table to identify a respective primary weight, a respective accent weight, and a respective case-weight;

    forming a primary Unicode weight w_p as a concatenation of the identified primary weights;

    forming an accent Unicode weight w_a as a concatenation of the identified accent weights;

    forming a case Unicode weight w_c as a concatenation of the identified case weights; and

    forming the Unicode weight as a concatenation w_p + w_a + w_c of the primary Unicode weight, the accent Unicode weight, and the case Unicode weight.
- 7. The method of claim 6, wherein the collation order is in accordance with a specified language, and the Unicode weight table is selected according to the specified language.
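
The weight construction recited in claim 6, together with the right-padding of claims 4 and 5, can be illustrated with a short Python sketch. Everything named here is an assumption made for illustration: `WEIGHT_TABLE`, `unicode_weight`, and `pad_right` are hypothetical, and the (primary, accent, case) values are invented rather than taken from any real Unicode weight table.

```python
# Hypothetical per-character weights: (primary, accent, case).
# "\0" is the ASCII null used for padding and gets the lowest weights.
WEIGHT_TABLE = {
    "\0": (0, 0, 0),
    "a": (1, 0, 0), "A": (1, 0, 1),
    "e": (2, 0, 0), "E": (2, 0, 1),
    "é": (2, 1, 0), "É": (2, 1, 1),
    "z": (3, 0, 0), "Z": (3, 0, 1),
}


def pad_right(s, t, pad="\0"):
    """Pad the shorter string on the right so both have the same length."""
    width = max(len(s), len(t))
    return s.ljust(width, pad), t.ljust(width, pad)


def unicode_weight(text, table=WEIGHT_TABLE):
    """Concatenate all primary weights, then all accent weights, then all case weights."""
    triples = [table[ch] for ch in text]  # KeyError for characters outside the toy table
    primary = tuple(p for p, _, _ in triples)
    accent = tuple(a for _, a, _ in triples)
    case = tuple(c for _, _, c in triples)
    return primary + accent + case


# Equal-length strings can be ordered directly by their concatenated weights.
ordered = sorted(["Ze", "éa", "ea", "Ea"], key=unicode_weight)
```

Because the primary weights come first in the concatenation, accents and case only break ties between strings whose base letters match. Padding both strings to the same length keeps the three weight levels aligned when the concatenated weights are compared.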

8. A computing device, comprising:
- one or more processors;

  memory; and

  one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs comprising instructions for:

  receiving a first text string S = s₁s₂...s_n having Unicode encoding and a second text string T = t₁t₂...t_m having Unicode encoding, wherein n and m are positive integers, s₁, s₂, ..., s_n and t₁, t₂, ..., t_m are Unicode characters, and S is not identical to T;

  (1) identifying a positive integer p with s₁ = t₁, s₂ = t₂, ..., s_{p−1} = t_{p−1} and s_p ≠ t_p, wherein at least one of s_p and t_p is a non-ASCII character;

  (2) looking up the characters s_p and t_p in a predefined lookup table to determine a weight v_p for the character s_p and a weight w_p for the character t_p;

  (3) when at least one of s_p and t_p is not found in the lookup table, determining the collation order of the strings S and T using Unicode weights for the corresponding strings s_p s_{p+1}...s_n and t_p t_{p+1}...t_m;

  (4) when both s_p and t_p are found in the lookup table and v_p < w_p, determining that S is collated before T;

  (5) when both s_p and t_p are found in the lookup table and w_p < v_p, determining that T is collated before S;

  (6) when both s_p and t_p are found in the lookup table, v_p = w_p, and s_{p+1}...s_n = t_{p+1}...t_m, determining that S and T have the same collation position; and

  when both s_p and t_p are found in the lookup table, v_p = w_p, and s_{p+1}...s_n ≠ t_{p+1}...t_m, determining the collation order of S and T recursively according to steps (1)-(6) using the suffix strings s_{p+1}...s_n and t_{p+1}...t_m.
- 9. The computing device of claim 8, wherein the lookup table includes lookup values for each non-control ASCII character plus a plurality of accented Roman characters.
- 10. The computing device of claim 8, wherein each weight in the lookup table is encoded as a respective single byte.
- 11. The computing device of claim 8, wherein the one or more programs further comprise instructions padding the shorter of the text strings S and T on the right so that the text strings S and T have the same length when m ≠ n.
- 12. The computing device of claim 11, wherein the padding comprises ASCII null characters.
- 13. The computing device of claim 8, wherein the one or more programs comprise instructions for computing the Unicode weights for the strings s_p s_{p+1}...s_n and t_p t_{p+1}...t_m are computed, the computation comprising:
  - for each character, performing a lookup in a Unicode weight table to identify a respective primary weight, a respective accent weight, and a respective case-weight;

    forming a primary Unicode weight w_p as a concatenation of the identified primary weights;

    forming an accent Unicode weight w_a as a concatenation of the identified accent weights;

    forming a case Unicode weight w_c as a concatenation of the identified case weights; and

    forming the Unicode weight as a concatenation w_p + w_a + w_c of the primary Unicode weight, the accent Unicode weight, and the case Unicode weight.
- 14. The computing device of claim 13, wherein the collation order is in accordance with a specified language, and the Unicode weight table is selected according to the specified language.

15. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computing device having one or more processors and memory, the one or more programs comprising instructions for:
- receiving a first text string S = s₁s₂...s_n having Unicode encoding and a second text string T = t₁t₂...t_m having Unicode encoding, wherein n and m are positive integers, s₁, s₂, ..., s_n and t₁, t₂, ..., t_m are Unicode characters, and S is not identical to T;

  (1) identifying a positive integer p with s₁ = t₁, s₂ = t₂, ..., s_{p−1} = t_{p−1} and s_p ≠ t_p, wherein at least one of s_p and t_p is a non-ASCII character;

  (2) looking up the characters s_p and t_p in a predefined lookup table to determine a weight v_p for the character s_p and a weight w_p for the character t_p;

  (3) when at least one of s_p and t_p is not found in the lookup table, determining the collation order of the strings S and T using Unicode weights for the corresponding strings s_p s_{p+1}...s_n and t_p t_{p+1}...t_m;

  (4) when both s_p and t_p are found in the lookup table and v_p < w_p, determining that S is collated before T;

  (5) when both s_p and t_p are found in the lookup table and w_p < v_p, determining that T is collated before S;

  (6) when both s_p and t_p are found in the lookup table, v_p = w_p, and s_{p+1}...s_n = t_{p+1}...t_m, determining that S and T have the same collation position; and

  when both s_p and t_p are found in the lookup table, v_p = w_p, and s_{p+1}...s_n ≠ t_{p+1}...t_m, determining the collation order of S and T recursively according to steps (1)-(6) using the suffix strings s_{p+1}...s_n and t_{p+1}...t_m.
- 16. The computer readable storage medium of claim 15, wherein the lookup table includes lookup values for each non-control ASCII character plus a plurality of accented Roman characters.
- 17. The computer readable storage medium of claim 15, wherein each weight in the lookup table is encoded as a respective single byte.
- 18. The computer readable storage medium of claim 15, wherein the one or more programs further comprise instructions padding the shorter of the text strings S and T on the right so that the text strings S and T have the same length when m ≠ n.
- 19. The computer readable storage medium of claim 15, wherein the one or more programs comprise instructions for computing the Unicode weights for the strings s_p s_{p+1}...s_n and t_p t_{p+1}...t_m are computed, the computation comprising:
  - for each character, performing a lookup in a Unicode weight table to identify a respective primary weight, a respective accent weight, and a respective case-weight;

    forming a primary Unicode weight w_p as a concatenation of the identified primary weights;

    forming an accent Unicode weight w_a as a concatenation of the identified accent weights;

    forming a case Unicode weight w_c as a concatenation of the identified case weights; and

    forming the Unicode weight as a concatenation w_p + w_a + w_c of the primary Unicode weight, the accent Unicode weight, and the case Unicode weight.
- 20. The computer readable storage medium of claim 19, wherein the collation order is in accordance with a specified language, and the Unicode weight table is selected according to the specified language.

Current Assignee: Tableau Software Incorporated (Salesforce.com, Inc.)
Original Assignee: Tableau Software Incorporated (Salesforce.com, Inc.)
Inventors: Neumann, Thomas; Leis, Viktor; Kemper, Alfons
Primary Examiner(s): Mai, Lam T
Application Number: US15/885,646
Time in Patent Office: 244 Days
Field of Search: 341 55
US Class Current
CPC Class Codes

G06F 16/24535   of sub-queries or views

G06F 16/24537   of operators

G06F 16/24542   Plan optimisation

G06F 40/126   Character encoding

G06F 40/166   Editing, e.g. inserting or ...

H03M 7/02   Conversion to or from weigh...

H03M 7/14   Conversion to or from non-w...

H03M 7/705   Unicode
