Hybrid approach to collating unicode text strings consisting primarily of ASCII characters
First Claim
1. A method of collating text strings having Unicode encoding, comprising:
- at a computer having one or more processors, and memory storing one or more programs configured for execution by the one or more processors;
receiving a first text string S=s1s2 . . . sn having Unicode encoding and a second text string T=t1t2 . . . tm having Unicode encoding, wherein n and m are positive integers, s1, s2, . . . , sn and t1, t2, . . . , tm are Unicode characters, and S is not identical to T;
(1) identifying a positive integer p with s1=t1, s2=t2, . . . , sp−
1=tp−
1 and sp≠
tp, wherein at least one of sp and tp is a non-ASCII character;
(2) looking up the characters sp and tp in a predefined lookup table to determine a weight vp for the character sp and a weight wp for the character tp;
(3) when at least one of sp and tp is not found in the lookup table, determining the collation order of the strings S and T using Unicode weights for the corresponding strings spsp+1 . . . sn and tptp+1 . . . tm;
(4) when both sp and tp are found in the lookup table and vp<
wp, determining that S is collated before T;
(5) when both sp and tp are found in the lookup table and wp<
vp, determining that T is collated before S;
(6) when both sp and tp are found in the lookup table, vp=wp, and sp+1 . . . sn=tp+1 . . . tm, determining that S and T have the same collation position; and
when both sp and tp are found in the lookup table, vp=wp, and sp+1 . . . sn≠
tp+1 . . . tm, determining the collation order of S and T recursively according to steps (1)-(6) using the suffix strings sp+1 . . . sn and tp+1 . . . tm.
0 Assignments
0 Petitions
Accused Products
Abstract
Collating text strings having Unicode encoding includes receiving two text strings S=s1s2 . . . sn and T=t1t2 . . . tm. When the two text strings are not identical, there is a smallest positive integer p for which the two text strings differ. The process looks up the characters sp and tp in a predefined lookup table. If either of these characters is missing from the lookup table, the collation of the text strings is determined using the standard Unicode comparison of the text strings spsp+1 . . . sn and tptp+1 . . . tm. Otherwise, the lookup table assigns weights vp and wp for the characters sp and tp. When vp≠wp, these weights define the collation order of the strings S and T. When vp=wp, the collation of S and T is determined recursively using the suffix strings sp+1 . . . sn and tp+1 . . . tm.
-
Citations
20 Claims
-
1. A method of collating text strings having Unicode encoding, comprising:
-
at a computer having one or more processors, and memory storing one or more programs configured for execution by the one or more processors; receiving a first text string S=s1s2 . . . sn having Unicode encoding and a second text string T=t1t2 . . . tm having Unicode encoding, wherein n and m are positive integers, s1, s2, . . . , sn and t1, t2, . . . , tm are Unicode characters, and S is not identical to T; (1) identifying a positive integer p with s1=t1, s2=t2, . . . , sp−
1=tp−
1 and sp≠
tp, wherein at least one of sp and tp is a non-ASCII character;(2) looking up the characters sp and tp in a predefined lookup table to determine a weight vp for the character sp and a weight wp for the character tp; (3) when at least one of sp and tp is not found in the lookup table, determining the collation order of the strings S and T using Unicode weights for the corresponding strings spsp+1 . . . sn and tptp+1 . . . tm; (4) when both sp and tp are found in the lookup table and vp<
wp, determining that S is collated before T;(5) when both sp and tp are found in the lookup table and wp<
vp, determining that T is collated before S;(6) when both sp and tp are found in the lookup table, vp=wp, and sp+1 . . . sn=tp+1 . . . tm, determining that S and T have the same collation position; and when both sp and tp are found in the lookup table, vp=wp, and sp+1 . . . sn≠
tp+1 . . . tm, determining the collation order of S and T recursively according to steps (1)-(6) using the suffix strings sp+1 . . . sn and tp+1 . . . tm. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A computing device, comprising:
-
one or more processors; memory; and one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs comprising instructions for; receiving a first text string S=s1s2 . . . sn having Unicode encoding and a second text string T=t1t2 . . . tm having Unicode encoding, wherein n and m are positive integers, s1, s2, . . . , sn and t1, t2, . . . , tm are Unicode characters, and S is not identical to T; (1) identifying a positive integer p with s1=t1, s2=t2, . . . , sp−
1=tp−
1 and sp≠
tp, wherein at least one of sp and tp is a non-ASCII character;(2) looking up the characters sp and tp in a predefined lookup table to determine a weight vp for the character sp and a weight wp for the character tp; (3) when at least one of sp and tp is not found in the lookup table, determining the collation order of the strings S and T using Unicode weights for the corresponding strings spsp+1 . . . sn and tptp+1 . . . tm; (4) when both sp and tp are found in the lookup table and vp<
wp, determining that S is collated before T;(5) when both sp and tp are found in the lookup table and wp<
vp, determining that T is collated before S;(6) when both sp and tp are found in the lookup table, vp=wp, and sp+1 . . . sn=tp+1 . . . tm, determining that S and T have the same collation position; and when both sp and tp are found in the lookup table, vp=wp, and sp+1 . . . sn≠
tp+1 . . . tm, determining the collation order of S and T recursively according to steps (1)-(6) using the suffix strings sp+1 . . . sn and tp+1 . . . tm. - View Dependent Claims (9, 10, 11, 12, 13, 14)
-
-
15. A non-transitory computer readable storage medium storing one or more programs configured for execution by a computing device having one or more processors and memory, the one or more programs comprising instructions for:
-
receiving a first text string S=s1s2 . . . sn having Unicode encoding and a second text string T=t1t2 . . . tm having Unicode encoding, wherein n and m are positive integers, s1, s2, . . . , sn and t1, t2, . . . , tm are Unicode characters, and S is not identical to T; (1) identifying a positive integer p with s1=t1, s2=t2, . . . , sp−
1=tp−
1 and sp≠
tp, wherein at least one of sp and tp is a non-ASCII character;(2) looking up the characters sp and tp in a predefined lookup table to determine a weight vp for the character sp and a weight wp for the character tp; (3) when at least one of sp and tp is not found in the lookup table, determining the collation order of the strings S and T using Unicode weights for the corresponding strings spsp+1 . . . sn and tptp+1 . . . tm; (4) when both sp and tp are found in the lookup table and vp<
wp, determining that S is collated before T;(5) when both sp and tp are found in the lookup table and wp<
vp, determining that T is collated before S;(6) when both sp and tp are found in the lookup table, vp=wp, and sp+1 . . . sn=tp+1 . . . tm, determining that S and T have the same collation position; and when both sp and tp are found in the lookup table, vp=wp, and sp+1 . . . sn≠
tp+1 . . . tm, determining the collation order of S and T recursively according to steps (1)-(6) using the suffix strings sp+1 . . . sn and tp+1 . . . tm. - View Dependent Claims (16, 17, 18, 19, 20)
-
Specification