
Clustering strings using N-grams

  • US 7,644,076 B1
  • Filed: 09/12/2003
  • Issued: 01/05/2010
  • Est. Priority Date: 09/12/2003
  • Status: Expired due to Term
First Claim

1. A method implemented in a computer system said computer system having a memory and processor, for clustering a string, the string including a plurality of characters, the method including:

  • identifying R unique n-grams T1 . . . R in the string;

    for every unique n-gram TS;

      if the frequency of TS in a set of n-gram statistics is not greater than a first threshold;

        clustering the string with a cluster associated with TS;

      otherwise;

        for every other n-gram TV in the string T1 . . . R, except S;

          concluding that the frequency of n-gram TV is greater than the first threshold, and in response;

            if the frequency of n-gram pair TS-TV is not greater than a second threshold;

              clustering the string with a cluster associated with the n-gram pair TS-TV;

            otherwise;

              for every other n-gram TX in the string T1 . . . R, except S and V;

                clustering the string with a cluster associated with the n-gram triple TS-TV-TX;

    where T1 . . . R is a set of n-grams, R is the number of elements in T1 . . . R, and TS, TV, and TX are members of T1 . . . R, and S, V, and X are integer indexes to identify members of T1 . . . R.
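
The control flow recited in claim 1 can be illustrated with a short Python sketch. This is one reading of the claim language, not the patented implementation. The function names (`unique_ngrams`, `cluster_keys`), the dictionary-based frequency tables `gram_freq` and `pair_freq`, the default n-gram length, and the choice to sort n-grams inside a pair or triple key so that TS-TV and TV-TS name the same cluster are all assumptions made for illustration. The step "concluding that the frequency of n-gram TV is greater than the first threshold" is interpreted here as skipping any TV whose frequency does not exceed that threshold.

```python
def unique_ngrams(s, n=3):
    """Return the unique character n-grams of s, in first-seen order."""
    seen = {}
    for i in range(len(s) - n + 1):
        seen.setdefault(s[i:i + n], None)
    return list(seen)


def cluster_keys(s, gram_freq, pair_freq, t1, t2, n=3):
    """Return the set of cluster keys the string s would be assigned to.

    gram_freq: hypothetical table mapping an n-gram to its frequency.
    pair_freq: hypothetical table mapping a sorted (gram, gram) tuple
               to the frequency of that pair.
    t1, t2:    the first and second thresholds of the claim.
    """
    grams = unique_ngrams(s, n)
    keys = set()
    for ts in grams:
        # Rare n-gram: it alone identifies a cluster.
        if gram_freq.get(ts, 0) <= t1:
            keys.add((ts,))
            continue
        # Frequent n-gram: combine it with every other frequent n-gram.
        for tv in grams:
            if tv == ts or gram_freq.get(tv, 0) <= t1:
                continue
            pair = tuple(sorted((ts, tv)))  # assumption: order-independent key
            if pair_freq.get(pair, 0) <= t2:
                keys.add(pair)
                continue
            # Frequent pair: fall back to triples with every remaining n-gram.
            for tx in grams:
                if tx in (ts, tv):
                    continue
                keys.add(tuple(sorted((ts, tv, tx))))
    return keys


if __name__ == "__main__":
    # Made-up statistics for the 2-grams of "abcd".
    freqs = {"ab": 1, "bc": 10, "cd": 10}
    pairs = {("bc", "cd"): 2}
    print(sorted(cluster_keys("abcd", freqs, pairs, t1=5, t2=3, n=2)))
```

In this sketch a string can land in several clusters at once, since each qualifying n-gram, pair, or triple contributes its own key; n-grams missing from the statistics are treated as having frequency zero, which is also an assumption.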
