Method for automatically finding frequently asked questions in a helpdesk data set
Abstract
A system and method automatically identify candidate helpdesk problem categories that are most amenable to automated solutions. The system generates a dictionary wherein each word in the text data set is identified, and the number of documents containing these words is counted, and a corresponding count is generated. The documents are partitioned into clusters. For each generated cluster, the system sorts the dictionary terms in order of decreasing occurrence frequency. It then determines a search space by selecting the top dictionary terms as specified by a user-defined depth of search. Next, the system chooses a set of terms from the search space as specified by a user-defined value indicating the desired level of detail. For each possible combination of frequent terms in the search space, the system finds the set of examples containing all the terms, and then determines if the frequency is sufficiently high and the overlap sufficiently low for this candidate set of examples to be a frequently asked question.
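The per-cluster search described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `find_faq_candidates`, the parameters `min_count` and `max_overlap`, and the definition of overlap as the fraction of a candidate's examples already covered by previously accepted candidates are all assumptions made for illustration. Each document is assumed to be already reduced to the set of dictionary terms it contains.

```python
from collections import Counter
from itertools import combinations


def find_faq_candidates(cluster_docs, depth, detail, min_count, max_overlap):
    """Search one cluster for candidate frequently asked questions.

    cluster_docs: list of sets of dictionary terms, one set per document.
    depth:        number of top terms forming the search space (depth of search).
    detail:       number of terms per combination (level of detail).
    min_count:    hypothetical minimum number of matching examples.
    max_overlap:  hypothetical threshold, playing the role of the value P.
    """
    # Sort dictionary terms by decreasing occurrence frequency within the cluster
    # (ties broken alphabetically so the result is deterministic).
    doc_freq = Counter(term for doc in cluster_docs for term in doc)
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))

    # The search space is the top `depth` terms.
    search_space = ranked[:depth]

    accepted = []    # list of (terms, set of example indices)
    covered = set()  # examples already assigned to an accepted candidate
    for terms in combinations(search_space, detail):
        # Examples are the documents that contain every selected term.
        examples = {i for i, doc in enumerate(cluster_docs)
                    if all(t in doc for t in terms)}
        if not examples or len(examples) < min_count:
            continue
        # Assumed overlap measure: share of examples already covered.
        overlap = len(examples & covered) / len(examples)
        if overlap < max_overlap:
            accepted.append((terms, examples))
            covered |= examples
    return accepted


docs = [
    {"password", "reset", "login"},
    {"password", "reset"},
    {"password", "reset", "email"},
    {"printer", "jam"},
    {"printer", "jam", "paper"},
]
result = find_faq_candidates(docs, depth=4, detail=2, min_count=2, max_overlap=0.5)
for terms, examples in result:
    print(terms, sorted(examples))
```

With this toy cluster, the combinations "password, reset" and "jam, printer" are the only ones that match at least two documents, and their example sets do not overlap, so both are kept as candidates.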
22 Claims
1. A method for automatically classifying frequently asked questions, comprising:
generating a dictionary including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set;
generating a count of occurrences of each word in the dictionary within each document in the document set;
partitioning the set of documents into a plurality of clusters, each cluster containing at least one document;
for each cluster, sorting dictionary terms with reference to occurrence frequency within the cluster;
determining a search space by selecting candidate dictionary terms within a desired depth of search;
selecting a plurality of terms from the candidate dictionary terms that correspond to a predetermined level of detail;
identifying a set of examples containing the selected set of terms;
setting the identified set of examples as a frequently asked question;
wherein setting the identified set of examples includes the step of determining if the number of identified set of examples exceeds zero; and
wherein if the number of identified set of examples exceeds zero, selecting an overlap between the identified set of examples and other sets of examples is less than a predetermined value, P, then setting the identified set of examples as a frequently asked question.
wherein if the number of identified set of examples exceeds zero, comparing the identified set of examples to the centroid.
5. The method of claim 3, further including preparing a report listing frequently asked questions having the user-selected confidence.
6. The method of claim 1, wherein sorting includes sorting the dictionary terms in order of decreasing occurrence frequency within the cluster.
7. The method of claim 1, further including generating a name for each cluster.
8. The method of claim 1, further including displaying a table including a name of each cluster and a frequency of occurrence of the frequently asked question.
9. A system for automatically classifying frequently asked questions, comprising:
a dictionary including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set;
a count of occurrences of each word in the dictionary generated within each document in the document set;
a cluster module that partitions the set of documents into a plurality of clusters, each cluster containing at least one document, wherein dictionary terms for each cluster are sorted with reference to occurrence frequency;
a processing routine that determines a search space by selecting candidate dictionary terms within a desired depth of search, and that selects a plurality of terms from the candidate dictionary terms that correspond to a predetermined level of detail;
wherein the processing routine selects a set of examples containing the selected set of terms;
wherein the processing routine further sets the identified set of examples as a frequently asked question, and determines if the number of identified set of examples exceeds zero; and
wherein if the number of identified set of examples exceeds zero, the processing routine selects an overlap between the identified set of examples and other sets of examples is less than a predetermined value, P, then sets the identified set of examples as a frequently asked question.
16. A computer program product for automatically classifying frequently asked questions, comprising:
a dictionary including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set;
means for generating a count of occurrences of each word in the dictionary within each document in the document set;
means for partitioning the set of documents into a plurality of clusters, each cluster containing at least one document, means for sorting dictionary terms for each cluster with reference to occurrence frequency;
means for determining a search space by selecting candidate dictionary terms within a desired depth of search, and that selects a plurality of terms from the candidate dictionary terms that correspond to a predetermined level of detail, wherein the means for determining the search space identifies a set of examples containing the selected set of terms;
wherein the means for determining the search space further sets the identified set of examples as a frequently asked question, and determines if the number of identified set of examples exceeds zero; and
wherein if the number of identified set of examples exceeds zero, means for determining the search space selects an overlap between the identified set of examples and other sets of examples is less than a predetermined value, P, then sets the identified set of examples as a frequently asked question.
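The dictionary and the per-document word counts recited at the start of each independent claim can be pictured with a short Python sketch. The claims do not state how the subset of words is chosen, so the rule used here (keep the `size` words that appear in the most documents) is an assumption. The helper names `tokenize`, `build_dictionary`, and `count_matrix` are likewise hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Simple assumed tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_dictionary(documents, size):
    """Return the `size` words that occur in the most documents."""
    doc_freq = Counter()
    for text in documents:
        # Count each word at most once per document.
        doc_freq.update(set(tokenize(text)))
    ranked = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
    return ranked[:size]


def count_matrix(documents, dictionary):
    """Count occurrences of each dictionary word within each document."""
    rows = []
    for text in documents:
        counts = Counter(tokenize(text))
        rows.append([counts[w] for w in dictionary])
    return rows


tickets = [
    "Cannot reset password",
    "Password reset link expired",
    "Printer jam",
]
dictionary = build_dictionary(tickets, size=3)
matrix = count_matrix(tickets, dictionary)
print(dictionary)
print(matrix)
```

The resulting rows, one per document, are the kind of input a clustering step could partition before the per-cluster term sorting takes place.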
Specification