Method for automatically finding frequently asked questions in a helpdesk data set

US 6,804,670 B2
Filed: 08/22/2001
Issued: 10/12/2004
Est. Priority Date: 08/22/2001
Status: Active Grant

Abstract

A system and method automatically identify candidate helpdesk problem categories that are most amenable to automated solutions. The system generates a dictionary by identifying each word in the text data set and counting the number of documents that contain it. The documents are partitioned into clusters. For each generated cluster, the system sorts the dictionary terms in order of decreasing occurrence frequency. It then determines a search space by selecting the top dictionary terms, as specified by a user-defined depth of search. Next, the system chooses a set of terms from the search space, as specified by a user-defined value indicating the desired level of detail. For each possible combination of frequent terms in the search space, the system finds the set of examples containing all the terms, and then determines whether the frequency is sufficiently high and the overlap sufficiently low for this candidate set of examples to be a frequently asked question.
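
The per-cluster search described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `find_faqs`, the parameters `depth`, `detail`, `min_frequency`, and `max_overlap`, and the choice to measure overlap as the fraction of a candidate's examples already covered by an accepted FAQ are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def find_faqs(cluster_docs, depth=4, detail=2, min_frequency=2, max_overlap=0.5):
    """Return (terms, example_indices) pairs for candidate FAQs in one cluster.

    cluster_docs: list of sets of dictionary words, one set per document.
    depth: number of top-ranked terms that form the search space (assumed name).
    detail: number of terms in each candidate combination (assumed name).
    """
    # Document frequency of each term within the cluster.
    doc_freq = Counter(term for doc in cluster_docs for term in doc)
    # Decreasing frequency; ties are broken alphabetically for determinism.
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    search_space = ranked[:depth]

    faqs = []
    for combo in combinations(search_space, detail):
        # Examples are the documents that contain every term in the combination.
        examples = {i for i, doc in enumerate(cluster_docs) if set(combo) <= doc}
        if len(examples) < min_frequency:
            continue
        # Assumed overlap measure: largest share of these examples that an
        # already accepted FAQ also covers.
        overlap = max(
            (len(examples & prev) / len(examples) for _, prev in faqs),
            default=0.0,
        )
        if overlap < max_overlap:
            faqs.append((combo, examples))
    return faqs


docs = [
    {"password", "reset", "login"},
    {"password", "reset", "email"},
    {"password", "reset"},
    {"printer", "jam", "paper"},
    {"printer", "jam"},
    {"login", "email"},
]
result = find_faqs(docs, depth=6, detail=2)
```

Raising `depth` widens the search space and so surfaces less frequent term combinations, while raising `detail` demands more shared terms per candidate and yields narrower question categories.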

22 Claims

1. A method for automatically classifying frequently asked questions, comprising:
- generating a dictionary including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set;
  
  generating a count of occurrences of each word in the dictionary within each document in the document set;
  
  partitioning the set of documents into a plurality of clusters, each cluster containing at least one document;
  
  for each cluster, sorting dictionary terms with reference to occurrence frequency within the cluster;
  
  determining a search space by selecting candidate dictionary terms within a desired depth of search;
  
  selecting a plurality of terms from the candidate dictionary terms that correspond to a predetermined level of detail;
  
  identifying a set of examples containing the selected set of terms;
  
  setting the identified set of examples as a frequently asked question;
  
  wherein setting the identified set of examples includes the step of determining if the number of identified set of examples exceeds zero; and
  
  wherein if the number of identified set of examples exceeds zero, selecting an overlap between the identified set of examples and other sets of examples is less than a predetermined value, P, then setting the identified set of examples as a frequently asked question.
- 2. The method of claim 1, wherein setting the identified set of examples further includes removing frequently asked questions whose frequencies occur below a user-selected confidence.
- 3. The method of claim 2, further including specifying the user-selected confidence by defining a maximum number of frequently asked questions.
- 4. The method of claim 3, further including generating a centroid for each cluster in the search space; and
- 5. The method of claim 3, further including preparing a report listing frequently asked questions having the user-selected confidence.
- 6. The method of claim 1, wherein sorting includes sorting the dictionary terms in order of decreasing occurrence frequency within the cluster.
- 7. The method of claim 1, further including generating a name for each cluster.
- 8. The method of claim 1, further including displaying a table including a name of each cluster and a frequency of occurrence of the frequently asked question.
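
The first two steps of claim 1 (generating a frequency-based dictionary and counting dictionary-word occurrences in each document) might be sketched as follows. The tokenizer, the helper names `build_dictionary` and `count_matrix`, and the thresholds `min_docs` and `max_words` are hypothetical; the claims do not specify how the word subset is chosen beyond its being based on frequency of occurrence.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphanumeric words (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_dictionary(documents, min_docs=2, max_words=1000):
    """Keep words that appear in at least min_docs documents, most frequent first."""
    doc_freq = Counter()
    for text in documents:
        # Use a set so each document contributes at most once per word.
        doc_freq.update(set(tokenize(text)))
    kept = [word for word, n in doc_freq.items() if n >= min_docs]
    kept.sort(key=lambda w: (-doc_freq[w], w))
    return kept[:max_words]


def count_matrix(documents, dictionary):
    """Return one row per document with the occurrence count of each dictionary word."""
    rows = []
    for text in documents:
        counts = Counter(tokenize(text))
        rows.append([counts[word] for word in dictionary])
    return rows


tickets = [
    "Cannot reset password",
    "Password reset link expired",
    "Printer jam again, printer offline",
]
dictionary = build_dictionary(tickets)
matrix = count_matrix(tickets, dictionary)
```

The resulting rows are the kind of per-document word counts that a clustering step could take as input.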

9. A system for automatically classifying frequently asked questions, comprising:
- a dictionary including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set;
  
  a count of occurrences of each word in the dictionary generated within each document in the document set;
  
  a cluster module that partitions the set of documents into a plurality of clusters, each cluster containing at least one document, wherein dictionary terms for each cluster are sorted with reference to occurrence frequency;
  
  a processing routine that determines a search space by selecting candidate dictionary terms within a desired depth of search, and that selects a plurality of terms from the candidate dictionary terms that correspond to a predetermined level of detail;
  
  wherein the processing routine selects a set of examples containing the selected set of terms;
  
  wherein the processing routine further sets the identified set of examples as a frequently asked question, and determines if the number of identified set of examples exceeds zero; and
  
  wherein if the number of identified set of examples exceeds zero, the processing routine selects an overlap between the identified set of examples and other sets of examples is less than a predetermined value, P, then sets the identified set of examples as a frequently asked question.
- 10. The system of claim 9, further including a database system that generates a centroid for each cluster in the search space.
- 11. The system of claim 10, wherein if the number of identified set of examples exceeds zero, the database system compares the identified set of examples to the centroid.
- 12. The system of claim 9, wherein the processing routine prepares a report listing frequently asked questions having a user-selected confidence.
- 13. The system of claim 9, wherein the cluster module sorts the dictionary terms in order of decreasing occurrence frequency within the cluster.
- 14. The system of claim 10, wherein the database system generates a name for each cluster.
- 15. The system of claim 9, further including a display that displays a table including a name of each cluster and a frequency of occurrence of the frequently asked question.
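
Claims 10 and 11 refer to generating a centroid for each cluster and comparing a set of examples to it. A minimal sketch of one way this could be done is shown below, assuming each document is a vector of dictionary-word counts. The use of the mean vector as the centroid and cosine similarity as the comparison are assumptions; the claims do not name a specific measure.

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length count vectors (assumed centroid definition)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical count vectors for the documents of one cluster.
cluster_vectors = [[1, 0], [3, 2]]
cluster_centroid = centroid(cluster_vectors)

# Compare the centroid of a candidate example set with the cluster centroid.
example_centroid = centroid([[3, 2]])
similarity = cosine_similarity(example_centroid, cluster_centroid)
```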

16. A computer program product for automatically classifying frequently asked questions, comprising:
- a dictionary including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set;
  
  means for generating a count of occurrences of each word in the dictionary within each document in the document set;
  
  means for partitioning the set of documents into a plurality of clusters, each cluster containing at least one document, means for sorting dictionary terms for each cluster with reference to occurrence frequency;
  
  means for determining a search space by selecting candidate dictionary terms within a desired depth of search, and that selects a plurality of terms from the candidate dictionary terms that correspond to a predetermined level of detail, wherein the means for determining the search space identifies a set of examples containing the selected set of terms;
  
  wherein the means for determining the search space further sets the identified set of examples as a frequently asked question, and determines if the number of identified set of examples exceeds zero; and
  
  wherein if the number of identified set of examples exceeds zero, means for determining the search space selects an overlap between the identified set of examples and other sets of examples is less than a predetermined value, P, then sets the identified set of examples as a frequently asked question.
- 17. The computer program product of claim 16, further including means for generating a centroid for each cluster in the search space.
- 18. The computer program product of claim 17, wherein if the number of identified set of examples exceeds zero, the means for determining the search space compares the identified set of examples to the centroid.
- 19. The computer program product of claim 16, further including means for preparing a report listing frequently asked questions having a user-selected confidence.
- 20. The computer program product of claim 16, wherein the means for sorting sorts the dictionary terms in order of decreasing occurrence frequency within the cluster.
- 21. The computer program product of claim 16, further including means for generating a name for each cluster.
- 22. The computer program product of claim 16, further including means for displaying a table including a name of each cluster and a frequency of occurrence of the frequently asked question.
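
The cluster naming and table display recited in claims 21 and 22 could look roughly like the sketch below. Naming a cluster by joining its most frequent terms is an assumption made for illustration, as are the helper names `name_cluster` and `faq_table` and the column layout.

```python
from collections import Counter


def name_cluster(cluster_docs, n_terms=2):
    """Name a cluster after its n_terms most frequent terms (assumed naming rule)."""
    doc_freq = Counter(term for doc in cluster_docs for term in doc)
    top = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))[:n_terms]
    return "_".join(top)


def faq_table(clusters, faq_counts):
    """Format a plain-text table of cluster names and FAQ occurrence counts."""
    lines = [f"{'Cluster':<20}{'FAQ frequency':>15}"]
    for docs, count in zip(clusters, faq_counts):
        lines.append(f"{name_cluster(docs):<20}{count:>15}")
    return "\n".join(lines)


clusters = [
    [{"password", "reset"}, {"password", "reset", "email"}],
    [{"printer", "jam"}, {"printer", "jam", "paper"}],
]
table = faq_table(clusters, [2, 2])
print(table)
```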

Current Assignee: Lenovo PC International Limited (Lenovo Group Ltd.)
Original Assignee: International Business Machines Corporation
Inventors: Lessler, Justin Thomas; Sanchez, Michael Ponce; Spangler, William Scott; Kreulen, Jeffrey Thomas
Primary Examiner(s): Metjahic, Safet
Assistant Examiner(s): NGUYEN, MERILYN P
Application Number: US09/935,473
Publication Number: US 20030050908A1
Time in Patent Office: 1,147 Days
Field of Search: 707/3, 707/6, 707/102, 707/1, 707/7, 707/5, 382/225, 704/9, 704/4, 704/235, 704/257, 706/45, 434/350
US Class Current: 1/1
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99937   Sorting
Y10S 707/99943   Generating database or data...
