Method and system for performing phrase/word clustering and cluster merging
Abstract
Text classification has become an important aspect of information technology. Present text classification techniques range from simple text matching to more complex clustering methods. Clustering describes a process of discovering structure in a collection of characters. The invention automatically analyzes a text string and either updates an existing cluster or creates a new cluster. To that end, the invention may use a character n-gram matching process in addition to other heuristic-based clustering techniques.

In the character n-gram matching process, each text string is first normalized using several heuristics. It is then divided into a set of overlapping character n-grams, where n is the number of adjacent characters. If the commonality between the text string and the existing cluster members satisfies a pre-defined threshold, the text string is added to the cluster. If, on the other hand, the commonality does not satisfy the pre-defined threshold, a new cluster may be created.

Each cluster may have a selected topic name. The topic name allows whole clusters to be compared in a similar way to the individual clusters, and merged when a predetermined level of commonality exists between the subject clusters. The topic name also may be used as a suggested alternative to the text string. In this instance, the topic name of the cluster to which the text string was added may be outputted as an alternative to the text string.
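The n-gram matching, clustering, and merging steps summarized in the abstract can be illustrated with a minimal Python sketch. Several details here are assumptions and are not stated in the abstract: the normalization heuristics (lowercasing, stripping punctuation, collapsing whitespace), the use of n = 3, the Dice coefficient as the measure of commonality, the 0.5 threshold, and the rule that the shortest member serves as the topic name. All function names are hypothetical.

```python
import re
from typing import List, Set


def normalize(text: str) -> str:
    # Assumed heuristics: lowercase, replace punctuation with spaces, collapse whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    # Overlapping character n-grams of the normalized string.
    s = normalize(text)
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def commonality(a: str, b: str, n: int = 3) -> float:
    # Assumed measure: Dice coefficient over the two n-gram sets (0.0 to 1.0).
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def add_to_clusters(clusters: List[List[str]], text: str,
                    threshold: float = 0.5, n: int = 3) -> int:
    # Add text to the best-matching cluster if the threshold is met;
    # otherwise create a new cluster. Returns the index of the cluster used.
    best_idx, best_score = -1, 0.0
    for idx, members in enumerate(clusters):
        score = max(commonality(text, m, n) for m in members)
        if score > best_score:
            best_idx, best_score = idx, score
    if best_idx >= 0 and best_score >= threshold:
        clusters[best_idx].append(text)
        return best_idx
    clusters.append([text])
    return len(clusters) - 1


def pick_topic(members: List[str]) -> str:
    # Assumed rule: the shortest normalized member, ties broken alphabetically.
    return min(members, key=lambda m: (len(normalize(m)), normalize(m)))


def merge_by_topic(clusters: List[List[str]],
                   threshold: float = 0.5, n: int = 3) -> List[List[str]]:
    # Compare clusters through their topic names and merge those whose
    # topics meet the commonality threshold.
    merged: List[List[str]] = []
    for members in clusters:
        topic = pick_topic(members)
        for target in merged:
            if commonality(topic, pick_topic(target), n) >= threshold:
                target.extend(members)
                break
        else:
            merged.append(list(members))
    return merged


if __name__ == "__main__":
    clusters: List[List[str]] = []
    for phrase in ["digital camera", "Digital Cameras", "laptop bag"]:
        add_to_clusters(clusters, phrase)
    for members in clusters:
        print(pick_topic(members), "<-", members)
```

In this sketch a string is compared against every member of each existing cluster and joins the cluster with the highest-scoring member. Comparing against a per-cluster summary instead would be an equally plausible reading of the abstract.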
34 Claims
1. A method for classifying character strings, comprising:
- receiving at least one character string, wherein each character string comprises a word or a phrase;
- clustering a first character string with another character string into a first group, when said first character string satisfies a predetermined degree of commonality with said another character string; and
- selecting at least one of said character strings in each of said groups to be a topic.
Dependent claims: 2, 3, 4, 5, 6, 7

8. A system for classifying character strings, comprising:
- an input device for receiving at least one character string;
- a clustering component for placing a first character string with another character string into a group, when said first character string satisfies a predetermined degree of commonality with said another character string; and
- a selection component for selecting at least one of said character strings in each of said groups to be a topic.
Dependent claims: 9, 10, 11, 12, 13, 14

15. A computer-readable medium having computer-executable instructions for steps comprising:
- receiving at least one character string, wherein each character string comprises a word or a phrase;
- clustering a first character string with another character string into a first group, when said first character string satisfies a predetermined degree of commonality with said another character string; and
- selecting at least one of said character strings in each of said groups to be a topic.
Dependent claims: 16, 17, 18, 19, 20

21. A method for suggesting alternative words or phrases, comprising:
- receiving a first word or phrase;
- creating a cluster of said first word or phrase with another word or phrase; and
- outputting a topic, wherein said topic is at least one of said words or phrases that satisfy a predetermined criteria.
Dependent claims: 22, 23

24. A method for searching a database, comprising:
- receiving a first word or phrase;
- clustering said first word or phrase with another word or phrase; and
- searching said database for a topic, wherein said topic is at least one of said words or phrases that satisfy a predetermined criteria.
Dependent claims: 25, 26

27. A database search engine system, comprising:
- an editorial database that stores one or more clusters of words and phrases, wherein each cluster is identified by one or more topic names;
- a search engine database that stores a catalogue of items; and
- a computer coupled to said editorial database and said search engine database, wherein said computer receives a query relevant to said catalogue of items and compares said query to said one or more clusters of words and phrases stored in said editorial database, and wherein said computer queries said search engine database with a modified query, wherein said modified query is one or more of said topic names that satisfy a predetermined commonality with said query.
Dependent claims: 28, 29, 30, 31, 32, 33

34. A system for suggesting alternative words or phrases, comprising:
- an editorial database that stores one or more clusters of words and phrases, wherein each cluster is identified by one or more topic names;
- a computer coupled to said editorial database, wherein said computer receives a query and segments said query into a first plurality of character sets and said words or phrases in said editorial database into another plurality of character sets, and wherein said computer further outputs at least one alternative word or phrase that satisfied a predetermined commonality with said query.
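
Claims 27 and 34 describe comparing a query against stored clusters and returning a topic name as a modified query or suggested alternative. The following Python sketch shows one way this lookup might work. It rests on assumptions not found in the claims: the editorial database is modeled as an in-memory dictionary from topic name to member phrases, the "character sets" are taken to be overlapping character trigrams, commonality is measured with the Dice coefficient, and the threshold is 0.6. The names `char_sets`, `overlap`, and `suggest_topic` are hypothetical.

```python
from typing import Dict, List, Optional, Set


def char_sets(text: str, n: int = 3) -> Set[str]:
    # Segment a lowercased, whitespace-collapsed string into overlapping n-grams.
    s = " ".join(text.lower().split())
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def overlap(a: Set[str], b: Set[str]) -> float:
    # Assumed commonality measure: Dice coefficient over two n-gram sets.
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def suggest_topic(query: str, editorial_db: Dict[str, List[str]],
                  threshold: float = 0.6, n: int = 3) -> Optional[str]:
    # Return the topic name of the cluster whose best-matching phrase meets
    # the threshold, or None when no cluster is close enough to the query.
    q = char_sets(query, n)
    best_topic, best_score = None, 0.0
    for topic, phrases in editorial_db.items():
        for phrase in phrases:
            score = overlap(q, char_sets(phrase, n))
            if score > best_score:
                best_topic, best_score = topic, score
    return best_topic if best_score >= threshold else None


if __name__ == "__main__":
    # Hypothetical editorial database: topic name -> cluster members.
    db = {
        "digital camera": ["digital camera", "digital cameras", "digtal camera"],
        "laptop bag": ["laptop bag", "notebook bag"],
    }
    print(suggest_topic("digitl cameras", db))
    print(suggest_topic("garden hose", db))
```

In a system along the lines of claim 27, the returned topic name would replace the original query before the search engine database is queried. When the function returns `None`, the original query would presumably be used unchanged.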
Specification