Clustering strings using N-grams
First Claim
1. A method implemented in a computer system said computer system having a memory and processor, for clustering a string, the string including a plurality of characters, the method including:
identifying R unique n-grams T1 . . . R in the string;
for every unique n-gram TS;
    if the frequency of TS in a set of n-gram statistics is not greater than a first threshold;
        clustering the string with a cluster associated with TS;
    otherwise;
        for every other n-gram TV in the string T1 . . . R, except S;
            concluding that the frequency of n-gram TV is greater than the first threshold, and in response;
                if the frequency of n-gram pair TS-TV is not greater than a second threshold;
                    clustering the string with a cluster associated with the n-gram pair TS-TV;
                otherwise;
                    for every other n-gram TX in the string T1 . . . R, except S and V;
                        clustering the string with a cluster associated with the n-gram triple TS-TV-TX;
where T1 . . . R is a set of n-grams, R is the number of elements in T1 . . . R, and TS, TV, and TX are members of T1 . . . R, and S, V, and X are integer indexes to identify members of T1 . . . R.
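The nested steps of claim 1 can be sketched in Python as follows. This is an illustrative reading of the claim, not the patentee's implementation. The names `unique_ngrams` and `cluster_keys`, the character-level n-grams of length `n`, the use of a dictionary keyed by `frozenset` for the n-gram statistics, and the treatment of missing entries as frequency 0 are all assumptions of the sketch. Using unordered sets also means the pair TS-TV and the pair TV-TS map to the same cluster, which the claim does not state.

```python
def unique_ngrams(text, n=3):
    """Return the unique character n-grams of text, in order of first appearance."""
    seen = {}
    for i in range(len(text) - n + 1):
        seen.setdefault(text[i:i + n], None)
    return list(seen)


def cluster_keys(text, freq, first_threshold, second_threshold, n=3):
    """Return the set of cluster keys (single n-gram, pair, or triple) for text.

    freq maps a frozenset of n-grams to its frequency in the n-gram
    statistics; a missing entry is treated as 0 (assumption).
    """
    grams = unique_ngrams(text, n)
    keys = set()
    for s in grams:
        # Rare n-gram TS: it identifies a cluster on its own.
        if freq.get(frozenset([s]), 0) <= first_threshold:
            keys.add(frozenset([s]))
            continue
        for v in grams:
            # Only other n-grams TV that are themselves frequent are considered.
            if v == s or freq.get(frozenset([v]), 0) <= first_threshold:
                continue
            pair = frozenset([s, v])
            # Rare pair TS-TV: it identifies a cluster.
            if freq.get(pair, 0) <= second_threshold:
                keys.add(pair)
                continue
            # Frequent pair: fall back to every triple TS-TV-TX.
            for x in grams:
                if x != s and x != v:
                    keys.add(frozenset([s, v, x]))
    return keys
```

A string is then placed in every cluster whose key is returned, so one string may belong to several clusters.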
Abstract
A method and computer program for clustering a string are described. The string includes a plurality of characters. R unique n-grams T1 . . . R are identified in the string. For every unique n-gram TS, if the frequency of TS in a set of n-gram statistics is not greater than a first threshold, the string is associated with a cluster associated with TS. Otherwise, for every other n-gram TV in the string T1 . . . R, except S, if the frequency of n-gram TV is greater than the first threshold, and if the frequency of n-gram pair TS-TV is not greater than a second threshold, the string is associated with a cluster associated with the n-gram pair TS-TV. Otherwise, for every other n-gram TX in the string T1 . . . R, except S and V, the string is associated with a cluster associated with the n-gram triple TS-TV-TX. Otherwise, nothing is done.
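The abstract refers to "a set of n-gram statistics" without saying here how those statistics are obtained. One plausible way to build them, sketched below, is to count how many strings of a reference corpus contain each n-gram and each group of n-grams. The function name `build_ngram_statistics`, the `max_group` parameter, and the choice of "number of strings containing the group" as the frequency measure are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations


def build_ngram_statistics(corpus, n=3, max_group=2):
    """Count, for each group of up to max_group n-grams, the strings containing it.

    Keys are frozensets of n-grams, so a single n-gram and a pair of
    n-grams share one table (assumption of this sketch).
    """
    stats = Counter()
    for text in corpus:
        # Unique character n-grams of this string.
        grams = sorted({text[i:i + n] for i in range(len(text) - n + 1)})
        for size in range(1, max_group + 1):
            for group in combinations(grams, size):
                stats[frozenset(group)] += 1
    return stats
```

The number of groups grows quickly with `max_group` and string length, so a practical system would likely cap or sample these counts.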
7 Claims
1. A method implemented in a computer system said computer system having a memory and processor, for clustering a string, the string including a plurality of characters, the method including:
identifying R unique n-grams T1 . . . R in the string;
for every unique n-gram TS;
    if the frequency of TS in a set of n-gram statistics is not greater than a first threshold;
        clustering the string with a cluster associated with TS;
    otherwise;
        for every other n-gram TV in the string T1 . . . R, except S;
            concluding that the frequency of n-gram TV is greater than the first threshold, and in response;
                if the frequency of n-gram pair TS-TV is not greater than a second threshold;
                    clustering the string with a cluster associated with the n-gram pair TS-TV;
                otherwise;
                    for every other n-gram TX in the string T1 . . . R, except S and V;
                        clustering the string with a cluster associated with the n-gram triple TS-TV-TX;
where T1 . . . R is a set of n-grams, R is the number of elements in T1 . . . R, and TS, TV, and TX are members of T1 . . . R, and S, V, and X are integer indexes to identify members of T1 . . . R.
Dependent claims: 2, 3.
4. A method implemented in a computer system said computer system having a memory and processor, for clustering a string, the string including a plurality of characters, the method including:
identifying R unique n-grams T1 . . . R in the string;
for every unique n-gram TS;
    if the frequency of TS in a set of n-gram statistics is not greater than a first threshold;
        clustering the string with a cluster associated with TS;
    otherwise;
        for i=1 to Y;
            for every unique set of i n-grams TU in the string T1 . . . R, except S;
                if the frequency of the n-gram set TS-TU is not greater than a second threshold;
                    clustering the string with a cluster associated with the n-gram set TS-TU;
        if the string has not been associated with a cluster with this value of TS;
            for every unique set of Y+1 n-grams TUY in the string T1 . . . R, except S;
                clustering the string with a cluster associated with the Y+2 n-gram group TS-TUY,
where T1 . . . R is a set of n-grams, R is the number of elements in T1 . . . R, and TS, TV, and TX are members of T1 . . . R, and S, V, and X are integer indexes to identify members of T1 . . . R.
Dependent claims: 5, 6, 7.
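Claim 4 generalizes the pair and triple steps to n-gram sets of size up to Y, with a fallback to groups of Y+2 n-grams. A minimal Python sketch of one reading of that claim follows. The names `cluster_keys_general` and `y`, the `frozenset`-keyed statistics dictionary, and the treatment of missing statistics as frequency 0 are assumptions. The sketch also follows the claim text in checking every set size from 1 to Y rather than stopping at the first match.

```python
from itertools import combinations


def unique_ngrams(text, n=3):
    """Return the unique character n-grams of text, in order of first appearance."""
    return list(dict.fromkeys(text[i:i + n] for i in range(len(text) - n + 1)))


def cluster_keys_general(text, freq, first_threshold, second_threshold, y=1, n=3):
    """Return cluster keys for text under the generalized scheme of claim 4.

    freq maps a frozenset of n-grams to a frequency; missing entries
    are treated as 0 (assumption).
    """
    grams = unique_ngrams(text, n)
    keys = set()
    for s in grams:
        # Rare n-gram TS: cluster on TS alone.
        if freq.get(frozenset([s]), 0) <= first_threshold:
            keys.add(frozenset([s]))
            continue
        others = [g for g in grams if g != s]
        matched = False
        # Try every set TU of i other n-grams, for i = 1 .. Y.
        for i in range(1, y + 1):
            for group in combinations(others, i):
                key = frozenset((s,) + group)
                if freq.get(key, 0) <= second_threshold:
                    keys.add(key)
                    matched = True
        # No rare set found for this TS: use every group of Y+2 n-grams.
        if not matched:
            for group in combinations(others, y + 1):
                keys.add(frozenset((s,) + group))
    return keys
```

With `y=1` this resembles the pair and triple steps of claim 1, except that claim 4 does not require the other n-grams to be frequent themselves.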
Specification