Method and apparatus for text classification

US 5,371,807 A
Filed: 03/20/1992
Issued: 12/06/1994
Est. Priority Date: 03/20/1992
Status: Expired due to Fees

Abstract

A text classification system and method that can be used by an application for classifying natural language text input into a computer system having a domain specific knowledge base that includes a knowledge base having a plurality of categories. The text classification system classifies natural language input text by first parsing the natural language input text into a first list of recognized keywords. This list is then used to deduce further facts from the natural language input text which are then compiled into a second list. Next, a numeric similarity score for each one of the plurality of categories in the knowledge base is calculated which indicates how similar one of the plurality of categories is to the natural language input text. A dynamic threshold is then applied to determine which ones of the plurality of categories are most similar to the recognized keywords of the natural language input text. A third list is compiled of the ones of the plurality of categories determined to be most similar to the recognized keywords. An optional rule base can be utilized to further refine the determination of which ones of the plurality of categories are most similar to the recognized keywords of the natural language input text. Also, an optional learning capability can be added to improve the accuracy of the text classification system.
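
The first stage named in the abstract, parsing input text into a first list of recognized keywords, might be sketched as follows. This is a minimal illustration, not the patented implementation: the lexicon entries, the keyword names and the `parse_keywords` function are all hypothetical, and the choice to order keywords by first appearance is an assumption. It reflects the statement in claim 2 that keywords may be words, phrases and regular expressions.

```python
import re

# Hypothetical lexicon: each entry pairs a pattern with the keyword it signals.
# The first entry is a word, the second a phrase, the third a regular expression.
LEXICON = [
    (re.compile(r"\bprinter\b", re.IGNORECASE), "printer"),
    (re.compile(r"\bwill not boot\b", re.IGNORECASE), "boot-failure"),
    (re.compile(r"\bVAX\s?\d{4}\b", re.IGNORECASE), "vax-model"),
]


def parse_keywords(text):
    """Return the first list: recognized keywords in order of first appearance."""
    hits = []
    for pattern, keyword in LEXICON:
        match = pattern.search(text)
        if match:
            hits.append((match.start(), keyword))
    # Sort by position in the text (an assumption; the source does not specify order).
    return [keyword for _, keyword in sorted(hits)]


first_list = parse_keywords("My VAX 6000 will not boot since Monday.")
print(first_list)  # ['vax-model', 'boot-failure']
```

Text containing no lexicon entry simply yields an empty first list.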

315 Citations

24 Claims

1. A method for classifying natural language text input into a computer system, the system includes memory having a domain specific knowledge base having a plurality of categories stored therein, the method comprising the steps of:
- (a) accepting as input natural language input text;
  
  (b) parsing the natural language input text into a first list of recognized keywords;
  
  (c) using the first list to deduce further facts from the natural language input text;
  
  (d) compiling the deduced facts into a second list;
  
  (e) calculating a numeric similarity score for each one of the plurality of categories in the knowledge base to indicate how similar one of the plurality of categories is to the natural language input text;
  
  (f) applying a dynamic threshold to determine which ones of the plurality of categories are most similar to the recognized keywords of the first list, comprising the sub-steps of:
  
  (I) calculating a value for the dynamic threshold based upon a similarity score of a most similar category and a predefined threshold offset, and
  
  (II) classifying the categories based upon their respective similarity scores by discarding categories whose similarity scores are below the threshold value;
  
  (g) compiling the ones of the plurality of categories determined to be most similar in step (f) into a third list; and
  
  (i) passing the first list, the second list and the third list to an external application.
- Dependent Claims (2, 3, 4, 5, 6, 9, 10)
- - 2. The method according to claim 1 wherein the keywords comprise words, phrases and regular expressions.
  - 3. The method according to claim 1 wherein the knowledge base includes a keyword class hierarchy structured such that keywords that share something in common are grouped into classes, each class has associated facts that are true when a member of the class is identified in the natural language input text, wherein the steps of using the first list to deduce further facts from the natural language input text and compiling the deduced facts into a second list further are performed by the steps of:
    - (a) searching the keyword class hierarchy to determine if a keyword identified in the first list is a member of a class in the keyword class hierarchy;
      
      (b) when a keyword identified in the first list is a member of a class,
      
      (i) inferring all the facts attached to that class by adding them to the second list, and
      
      (ii) adding all the facts attached to all classes above the classes of which the identified keyword is a member in the keyword class hierarchy to the second list; and
      
      (c) repeating steps (a) through (b) for each keyword in the first list.
  - 4. The method according to claim 2 wherein the knowledge base includes a keyword class hierarchy structured such that keywords that share something in common are grouped into classes, each class has associated facts that are true when a member of the class is identified in the natural language input text, wherein the step of using the first list to deduce further facts from the natural language input text further comprises the step of substituting general descriptions of an identified keyword in the first list in an attempt to match other phrases that could not be matched explicitly so that a group of similar keywords can be grouped into a class and a word can be attached to the class to be used as a substitute for matching phrases.
  - 5. The method according to claim 1 wherein the knowledge base includes a keyword class hierarchy structured such that keywords that share something in common are grouped into classes, each class has associated facts that are true when a member of the class is identified in the natural language input text, wherein the steps of using the first list to deduce further facts from the natural language input text and compiling the deduced facts into a second list further are performed by the steps of:
    - (a) searching the keyword class hierarchy for all classes of which an identified keyword in the first list is a member;
      
      (b) adding all facts associated with each one of the classes of which the identified keyword is a member to a global list of deduced facts;
      
      (c) recursively applying step (b) on all classes above the classes of which the identified keyword is a member in the keyword class hierarchy; and
      
      (d) repeating steps (a) through (c) for each keyword in the first list.
  - 6. The method according to claim 1 wherein the knowledge base includes a lexicon that includes words, phrases and expressions, and a keyword class hierarchy structured such that keywords that share something in common are grouped into classes, each class has associated facts that are true when a member of the class is identified in the natural language input text, wherein the step of using the first list to deduce further facts from the natural language input text further comprises the steps of:
    - (a) searching the keyword class hierarchy for all classes of which an identified keyword in the first list is a member;
      
      (b) locating all substitution keywords associated with each class of which the identified keyword is a member;
      
      (c) retrieving the located substitution keywords;
      
      (d) substituting the located substitution keywords for the identified keyword;
      
      (e) using the located substitution keywords to identify matches between the located substitution keywords and phrases in the lexicon;
      
      (f) recursively applying steps (b) through (e) on all classes above the classes of which the identified keyword is a member in the keyword class hierarchy; and
      
      (g) repeating steps (a) through (f) for each keyword in the first list.
  - 9. The method according to claim 1 wherein the domain specific knowledge base further includes a rule base, the method further comprising the steps of:
    - (a) utilizing the rule base to select certain ones of the plurality of categories determined to be most similar to the recognized keywords over other ones of the plurality of categories based on the first and second lists; and
      
      (b) modifying the third list of the most similar categories to include the certain ones of the plurality of categories selected.
  - 10. The method according to claim 1 wherein the domain specific knowledge base includes a knowledge base of keyword/category profiles, each category in the keyword/category profiles knowledge base having an associated profile which indicates what information provides evidence for a given category, the keyword/profile weight knowledge base is arranged to have associated with each keyword in a profile a profile weight that represents the amount of evidence a keyword provides for a given category, the method further comprising the step of adjusting the profile weights in the keyword/category profiles in the domain specific knowledge base based upon the ones of the plurality of categories determined most relevant to the natural language input text and a second ones of the plurality of categories determined most relevant to the natural language input text by an external source.
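
The fact deduction of claims 3 and 5 can be illustrated with a small sketch. Everything named here is hypothetical: the `CLASSES` and `MEMBERSHIP` tables, the fact strings and the function names are invented for illustration, and a single-parent hierarchy is assumed for brevity. The sketch adds the facts of each class a keyword belongs to, then recursively adds the facts of every class above it.

```python
# Hypothetical keyword class hierarchy: class -> (parent class, associated facts).
CLASSES = {
    "laser-printer": ("printer", ["uses-toner"]),
    "printer": ("hardware", ["produces-paper-output"]),
    "hardware": (None, ["physical-device"]),
}

# Hypothetical membership table: keyword -> classes it is a member of.
MEMBERSHIP = {"LN03": ["laser-printer"]}


def collect_facts(class_name, facts):
    """Add the facts of class_name, then recurse into the class above it."""
    if class_name is None:
        return
    parent, class_facts = CLASSES[class_name]
    for fact in class_facts:
        if fact not in facts:
            facts.append(fact)
    collect_facts(parent, facts)


def deduce_facts(first_list):
    """Build the second list from every keyword in the first list."""
    facts = []
    for keyword in first_list:
        for class_name in MEMBERSHIP.get(keyword, []):
            collect_facts(class_name, facts)
    return facts


second_list = deduce_facts(["LN03", "unknown-word"])
print(second_list)  # ['uses-toner', 'produces-paper-output', 'physical-device']
```

Keywords that belong to no class contribute nothing, so the second list only grows when the hierarchy recognizes a keyword.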

7. A text classification system comprising:
- memory;
  
  a domain specific knowledge base stored in said memory having a plurality of categories, the domain specific knowledge base includes a knowledge base of keyword/category profiles, each category in the keyword/category profiles knowledge base having an associated profile which indicates what information provides evidence for a given category, the keyword/profile weight knowledge base arranged to have associated with each keyword in a profile a profile weight that represents the amount of evidence a keyword provides for a given category; and
  
  a computer coupled to the memory, the computer including:
  
  a natural language module for accepting as input into the computer natural language input text, the natural language module includes means for parsing the natural language input text into a first list of recognized keywords;
  
  an intelligent inferencer module for using the first list to deduce further facts from the information explicitly stated in the natural language input text, the intelligent inferencer module includes means for compiling the deduced facts into a second list;
  
  a similarity measuring module for calculating a numeric similarity score for each one of the plurality of categories in the knowledge base to indicate how similar one of the plurality of categories is to the natural language input text, the similarity measuring module includes:
  
  means for applying a dynamic threshold to determine which ones of the plurality of categories are most similar to the recognized keywords of the natural language input text, and
  
  means for compiling the ones of the plurality of categories determined to be most similar into a third list; and
  
  a relevance feedback learning module for adjusting the profile weights in the keyword/category profiles in the domain specific knowledge base based upon the ones of the plurality of categories determined most relevant to the natural language input text by the similarity measuring module and a second ones of the plurality of categories determined most relevant to the natural language input text by an external source.
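
The relevance feedback learning module adjusts profile weights by comparing the categories the classifier chose with those chosen by an external source. The claim does not state the update rule, so the sketch below is only one plausible reading: a simple additive update, assumed here, that raises weights for categories the external source chose but the classifier missed and lowers them for categories the classifier chose but the external source did not. The `adjust_weights` function, the `rate` parameter and the sample profiles are hypothetical.

```python
def adjust_weights(profiles, keywords, predicted, confirmed, rate=0.25):
    """Nudge profile weights toward the externally confirmed categories.

    profiles:  {category: {keyword: weight}}, modified in place
    keywords:  recognized keywords from the input text
    predicted: categories the classifier chose
    confirmed: categories chosen by the external source
    rate:      assumed step size; not taken from the source
    """
    # Missed categories: strengthen the evidence these keywords provide.
    for category in set(confirmed) - set(predicted):
        profile = profiles.setdefault(category, {})
        for keyword in keywords:
            profile[keyword] = profile.get(keyword, 0.0) + rate
    # Wrongly chosen categories: weaken the evidence, never below zero.
    for category in set(predicted) - set(confirmed):
        profile = profiles.get(category, {})
        for keyword in keywords:
            if keyword in profile:
                profile[keyword] = max(0.0, profile[keyword] - rate)
    return profiles


profiles = {"printing": {"printer": 0.5}, "networking": {"printer": 0.5}}
adjust_weights(profiles, ["printer"], predicted=["networking"], confirmed=["printing"])
print(profiles)  # {'printing': {'printer': 0.75}, 'networking': {'printer': 0.25}}
```

When the classifier and the external source agree, both set differences are empty and the profiles are left unchanged.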

8. A method for classifying natural language text input into a computer system, the system includes memory having a domain specific knowledge base having a plurality of categories stored therein, the method comprising the steps of:
- (a) accepting as input natural language input text;
  
  (b) parsing the natural language input text into a first list of recognized keywords;
  
  (c) using the first list to deduce further facts from the natural language input text;
  
  (d) compiling the deduced facts into a second list;
  
  (e) calculating a numeric similarity score for each one of the plurality of categories in the knowledge base to indicate how similar one of the plurality of categories is to the natural language input text;
  
  (f) applying a dynamic threshold to determine which ones of the plurality of categories are most similar to the recognized keywords of the first list, the step of applying a dynamic threshold further comprising the sub-steps of:
  
  (1) calculating a value for the dynamic threshold based upon a similarity score of a most similar category and a predefined threshold offset, and
  
  (2) classifying the categories based upon their respective similarity scores by discarding categories whose similarity scores are below the threshold value; and
  
  (g) compiling the ones of the plurality of categories determined to be most similar in step (f) into a third list.
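
Sub-steps (1) and (2) of the dynamic threshold can be sketched in a few lines. The claim says only that the threshold is "based upon" the best score and a predefined offset; subtracting the offset from the best score is an assumption made here, as are the function name and the sample scores.

```python
def apply_dynamic_threshold(scores, offset):
    """Return the third list: categories scoring within `offset` of the best one.

    scores: {category: similarity score}
    offset: predefined threshold offset
    """
    if not scores:
        return []
    # Sub-step (1): derive the threshold from the most similar category.
    # Subtraction is an assumed reading of "based upon".
    threshold = max(scores.values()) - offset
    # Sub-step (2): discard categories whose scores fall below the threshold.
    kept = [(category, score) for category, score in scores.items() if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [category for category, _ in kept]


scores = {"printing": 8.0, "networking": 7.0, "billing": 2.0}
third_list = apply_dynamic_threshold(scores, offset=2.0)
print(third_list)  # ['printing', 'networking']
```

Because the threshold moves with the top score, a weak best match still yields at least one category, while a fixed cutoff could return none.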

11. A method for routing customer service requests by a computer system in a customer support center which includes support groups to service customer requests, the computer system including a call handling system, a text classification system and memory having a domain specific knowledge base having a plurality of categories stored therein representative of the support groups within the customer support center, each support group being identified by a name, the method comprising the steps of:
- (a) receiving a customer service request by the computer system from the call handling system;
  
  (b) passing the customer service request to the text classification system to determine where to route the customer service request within the customer support center;
  
  (c) parsing the customer service request into a first list of recognized keywords;
  
  (d) using the first list to deduce further facts from the customer service request;
  
  (e) compiling the deduced facts into a second list;
  
  (f) calculating a numeric similarity score for each one of the plurality of categories in the knowledge base to indicate how similar each one of the plurality of categories is to the customer service request;
  
  (g) applying a dynamic threshold to identify which one of the support groups should handle the customer service request by determining which ones of the plurality of categories are most similar to the recognized keywords of the customer service request;
  
  (h) compiling the ones of the plurality of categories determined to be most similar in step (g) into a third list;
  
  (i) passing the first list, the second list and the third list back to the call handling system; and
  
  (j) routing the customer service request to the identified one of the support groups.
- Dependent Claims (13)
- - 13. The method according to claim 11 or 12 wherein the domain specific knowledge base includes a knowledge base of keyword/category profiles, each category in the keyword/category profiles knowledge base having an associated profile which indicates what information provides evidence for a given category, the keyword/profile weight knowledge base is arranged to have associated with each keyword in a profile a profile weight that represents the amount of evidence a keyword provides for a given category, the method further comprising the step of adjusting the profile weights in the keyword/category profiles in the domain specific knowledge base based upon the one of the support groups selected to handle the customer service request and a second one of the support groups determined most relevant to the natural language input text by an external source.
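
The final routing step of claim 11 might look like the sketch below once the third list is available. It assumes the third list is ordered from most to least similar and that each category name doubles as a support group name. The fallback queue for an empty third list is an addition not found in the claims, and all names are hypothetical.

```python
from collections import defaultdict


def route_request(request, third_list, queues, fallback="general-support"):
    """Append the request to the queue of the top-ranked support group."""
    # The fallback group is an assumption for requests that match no category.
    group = third_list[0] if third_list else fallback
    queues[group].append(request)
    return group


queues = defaultdict(list)
chosen = route_request("My LN03 prints blank pages", ["printing", "networking"], queues)
unmatched = route_request("Hello?", [], queues)
print(chosen, unmatched)  # printing general-support
```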

12. A method for routing customer service requests by a computer system in a customer support center which includes support groups to service customer requests, the computer system including a call handling system, a text classification system and memory having a domain specific knowledge base having a plurality of categories stored therein representative of the support groups within the customer support center, each support group being identified by a name, and a rule base, the method comprising the steps of:
- (a) receiving a customer service request by the computer system from the call handling system;
  
  (b) passing the customer service request to the text classification system to determine where to route the customer service request within the customer support center;
  
  (c) parsing the customer service request into a first list of recognized keywords;
  
  (d) using the first list to deduce further facts from the customer service request;
  
  (e) compiling the deduced facts into a second list;
  
  (f) calculating, utilizing the first list, a numeric similarity score for each one of the plurality of categories in the knowledge base to indicate how similar each one of the plurality of categories is to the customer service request;
  
  (g) applying a dynamic threshold to identify which support groups should handle the customer service request by determining which ones of the plurality of categories are most similar to the recognized keywords of the customer service request;
  
  (h) compiling the ones of the plurality of categories determined to be most similar in step (g) into a third list;
  
  (i) utilizing the rule base to select certain ones of the plurality of categories determined to be most similar to the recognized keywords over other ones of the plurality of categories based on the first and second lists;
  
  (j) modifying the third list of the most similar categories to include the certain ones of the plurality of categories selected;
  
  (k) passing the first list, the second list and the third list back to the call handling system; and
  
  (l) routing the customer service request to the selected one of the support groups.

14. A text classification system comprising:
- a memory;
  
  a domain specific knowledge base stored in said memory having a plurality of categories wherein the domain specific knowledge base includes a knowledge base of keyword/category profiles, each category in the keyword/category profiles knowledge base having an associated profile which indicates what information provides evidence for a given category, the keyword/profile knowledge base is arranged to have associated with each keyword in a profile a profile weight that represents the amount of evidence a keyword provides for a given category; and
  
  a computer coupled to the memory, the computer including:
  
  means for accepting as input into the computer, natural language input text,
  
  means for parsing the natural language input text into a first list of recognized keywords,
  
  means for using the first list to deduce further facts from the natural language input text,
  
  means for compiling the deduced facts into a second list,
  
  means for calculating a numeric similarity score for each one of the plurality of categories in the knowledge base to indicate how similar one of the plurality of categories is to the natural language input text,
  
  means for applying a dynamic threshold to determine which ones of the plurality of categories are most similar to the recognized keywords of the first list,
  
  means for adjusting the profile weights in the keyword/categories determined to be the most relevant to the natural language input text and a second ones of the plurality of categories determined most relevant to the natural language input text by an external source,
  
  means for compiling the ones of the plurality of categories determined to be most similar into a third list, and
  
  means for passing the first list, the second list and the third list to an external application.
- Dependent Claims (15, 16, 17, 19)
- - 15. The text classification system according to claim 14 wherein the keywords comprise words, phrases and regular expressions.
  - 16. The text classification system according to claim 14 wherein the domain specific knowledge base further includes a rule base and the computer further comprises:
    - means for utilizing the rule base to select certain ones of the plurality of categories that were determined to be most similar to the recognized keywords over other ones of the plurality of categories based on the first and second lists; and
      
      means for modifying the third list of the most similar categories to include the certain ones of the plurality of categories selected.
  - 17. The text classification system according to claim 14 wherein the domain specific knowledge base includes a knowledge base of keyword/category profiles, each category in the keyword/category profiles knowledge base having an associated profile which indicates what information provides evidence for a given category, the keyword/profile weight knowledge base is arranged to have associated with each keyword in a profile a profile weight that represents the amount of evidence a keyword provides for a given category, wherein the computer further comprises means for adjusting the profile weights in the keyword/category profiles in the domain specific knowledge base based upon the ones of the plurality of categories determined most relevant to the natural language input text and a second ones of the plurality of categories determined most relevant to the natural language input text by an external source.
  - 19. The text classification system according to claim 14 wherein the means for applying a dynamic threshold further includes:
    - means for calculating a value for the dynamic threshold based upon a similarity score of a most similar category and a predefined threshold offset; and
      
      means for classifying the categories based upon their respective similarity scores by discarding categories whose similarity scores are below the threshold value.
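
The keyword/category profiles of claim 14 attach to each keyword a weight representing the evidence it provides for a category. The claims do not say how the weights combine into a similarity score. The sketch below assumes the simplest option, summing the weights of every item of evidence present, and uses hypothetical profile data and names.

```python
def similarity_scores(profiles, evidence):
    """Score each category by summing the profile weights of the evidence found.

    profiles: {category: {keyword or fact: weight}}
    evidence: recognized keywords plus deduced facts
    Summation is an assumed combination rule, not one stated in the source.
    """
    return {
        category: sum(profile.get(item, 0.0) for item in evidence)
        for category, profile in profiles.items()
    }


profiles = {
    "printing": {"printer": 2.0, "uses-toner": 1.5},
    "networking": {"printer": 0.5, "ethernet": 3.0},
}
scores = similarity_scores(profiles, ["printer", "uses-toner"])
print(scores)  # {'printing': 3.5, 'networking': 0.5}
```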

18. A method for classifying natural language text input into a computer system, the system includes memory having a domain specific knowledge base having a plurality of categories stored therein and including a rule base, the method comprising the steps of:
- (a) accepting as input natural language input text;
  
  (b) parsing the natural language input text into a first list of recognized keywords;
  
  (c) using the first list to deduce further facts from the natural language input text;
  
  (d) compiling the deduced facts into a second list;
  
  (e) calculating a numeric similarity score for each one of the plurality of categories in the knowledge base to indicate how similar one of the plurality of categories is to the natural language input text;
  
  (f) applying a dynamic threshold to determine which ones of the plurality of categories are most similar to the recognized keywords of the first list;
  
  (g) compiling the ones of the plurality of categories determined to be most similar in step (f) into a third list;
  
  (h) utilizing the rule base to select certain ones of the plurality of categories determined to be most similar to the recognized keywords over other ones of the plurality of categories based on the first and second lists; and
  
  (i) modifying the third list of the most similar categories to include the certain ones of the plurality of categories selected.
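
Steps (h) and (i) use a rule base to prefer some of the most similar categories over others, based on the first and second lists. The rule format is not given in the claims, so the sketch assumes a hypothetical one: each rule names the evidence it requires, the category it prefers and the competing category it removes from the third list.

```python
# Hypothetical rule format: (required evidence, preferred category, rejected category).
RULES = [
    ({"uses-toner"}, "printing", "networking"),
]


def disambiguate(third_list, first_list, second_list, rules=RULES):
    """Return a modified third list after applying every rule that fires."""
    evidence = set(first_list) | set(second_list)
    result = list(third_list)
    for required, preferred, rejected in rules:
        # A rule fires only when its evidence is present and both categories compete.
        if required <= evidence and preferred in result and rejected in result:
            result.remove(rejected)
    return result


refined = disambiguate(["networking", "printing"], ["LN03"], ["uses-toner"])
print(refined)  # ['printing']
```

When no rule fires, the third list passes through unchanged, which keeps the rule base optional as the abstract describes.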

20. A method for classifying natural language text input into a computer system, the system includes memory having a domain specific knowledge base having a plurality of categories stored therein, the knowledge base including a lexicon that includes words, phrases and expressions and a keyword class hierarchy structured such that keywords that share something in common are grouped into classes, each class has associated facts that are true when a member of the class is identified in the natural language input text, the method comprising the steps of:
- (a) accepting as input natural language input text;
  
  (b) parsing the natural language input text into a first list of recognized keywords;
  
  (c) using the first list to deduce further facts from the natural language input text comprising the sub-steps of:
  
  (1) searching the keyword class hierarchy for all classes of which an identified keyword in the first list is a member,
  
  (2) locating all substitution keywords associated with each class of which the identified keyword is a member,
  
  (3) retrieving the located substitution keywords,
  
  (4) substituting the located substitution keywords for the identified keyword,
  
  (5) using the located substitution keywords to identify matches between the located substitution keywords and phrases in the lexicon,
  
  (6) recursively applying sub-steps (2) through (5) on all classes above the classes of which the identified keyword is a member in the keyword class hierarchy, and
  
  (7) repeating sub-steps (1) through (6) for each keyword in the first list;
  
  (d) compiling the deduced facts into a second list;
  
  (e) calculating a numeric similarity score for each one of the plurality of categories in the knowledge base to indicate how similar one of the plurality of categories is to the natural language input text;
  
  (f) applying a dynamic threshold to determine which ones of the plurality of categories are most similar to the recognized keywords of the first list; and
  
  (g) compiling the ones of the plurality of categories determined to be most similar in step (f) into a third list.
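
Sub-steps (1) through (6) of step (c) describe substituting a class-level keyword for a specific one so that lexicon phrases can match text they would otherwise miss. The sketch below is a rough illustration under several assumptions: one substitution keyword per class, a single-parent hierarchy, and plain substring matching against the lexicon. All table contents and names are hypothetical.

```python
# Hypothetical hierarchy: class -> (parent class, substitution keyword).
CLASSES = {
    "laser-printer": ("printer", "laser printer"),
    "printer": (None, "printer"),
}

# Hypothetical membership table and lexicon phrases (phrase -> keyword it yields).
MEMBERSHIP = {"LN03": ["laser-printer"]}
PHRASES = {"printer is jammed": "paper-jam"}


def match_by_substitution(text, keyword):
    """Return lexicon keywords matched after substituting class-level words."""
    found = []

    def visit(class_name):
        if class_name is None:
            return
        parent, substitute = CLASSES[class_name]
        # Sub-step (4): substitute the class keyword for the identified keyword.
        candidate = text.replace(keyword, substitute)
        # Sub-step (5): look for lexicon phrases in the rewritten text.
        for phrase, phrase_keyword in PHRASES.items():
            if phrase in candidate and phrase_keyword not in found:
                found.append(phrase_keyword)
        # Sub-step (6): repeat for the class above.
        visit(parent)

    for class_name in MEMBERSHIP.get(keyword, []):
        visit(class_name)
    return found


matches = match_by_substitution("my LN03 is jammed", "LN03")
print(matches)  # ['paper-jam']
```

The phrase "printer is jammed" never appears in the input, yet it matches once the model name is replaced by the substitution keyword of its class.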

21. A text classification system comprising:
- memory;
  
  a domain specific knowledge base stored in said memory having a plurality of categories, the domain specific knowledge base including a rule base; and
  
  a computer coupled to the memory, the computer including:
  
  a natural language module for accepting as input into the computer natural language input text, the natural language module includes means for parsing the natural language input text into a first list of recognized keywords;
  
  an intelligent inferencer module for using the first list to deduce further facts from the information explicitly stated in the natural language input text, the intelligent inferencer module includes means for compiling the deduced facts into a second list;
  
  a similarity measuring module for calculating a numeric similarity score for each one of the plurality of categories in the knowledge base to indicate how similar one of the plurality of categories is to the natural language input text, the similarity measuring module includes:
  
  means for applying a dynamic threshold to determine which ones of the plurality of categories are most similar to the recognized keywords of the natural language input text, and
  
  means for compiling the ones of the plurality of categories determined to be most similar into a third list; and
  
  a category disambiguation module for utilizing the rule base to select certain ones of the plurality of categories determined to be most similar to the recognized keywords over other ones of the plurality of categories based on the first and second lists, the category disambiguation module includes means for modifying the third list of the most similar categories to include the certain ones of the plurality of categories selected.

22. A text classification system comprising:
- a memory;
  
  a domain specific knowledge base stored in said memory having a rule base and a plurality of categories; and
  
  a computer coupled to the memory, the computer including:
  
  means for accepting as input into the computer, natural language input text,
  
  means for parsing the natural language input text into a first list of recognized keywords,
  
  means for using the first list to deduce further facts from the natural language input text,
  
  means for compiling the deduced facts into a second list,
  
  means for calculating a numeric similarity score for each one of the plurality of categories in the knowledge base to indicate how similar one of the plurality of categories is to the natural language input text,
  
  means for applying a dynamic threshold to determine which ones of the plurality of categories are most similar to the recognized keywords of the first list,
  
  means for compiling the ones of the plurality of categories determined to be most similar into a third list,
  
  means for utilizing the rule base to select certain ones of the plurality of categories that were determined to be most similar to the recognized keywords over other ones of the plurality of categories based on the first and second lists, and
  
  means for modifying the third list of the most similar categories to include the certain ones of the plurality of categories selected.

23. A text classification system comprising:
- a memory;
  
  a domain specific knowledge base stored in said memory having a plurality of categories; and
  
  a computer coupled to the memory, the computer including:
  
  means for accepting as input into the computer, natural language input text,
  
  means for parsing the natural language input text into a first list of recognized keywords,
  
  means for using the first list to deduce further facts from the natural language input text,
  
  means for compiling the deduced facts into a second list,
  
  means for calculating a numeric similarity score for each one of the plurality of categories in the knowledge base to indicate how similar one of the plurality of categories is to the natural language input text,
  
  means for applying a dynamic threshold to determine which ones of the plurality of categories are most similar to the recognized keywords of the first list,
  
  means for calculating a value for the dynamic threshold based upon a similarity score of a most similar category and a predefined threshold offset,
  
  means for classifying the categories based upon their respective similarity scores by discarding categories whose similarity scores are below the threshold value, and
  
  means for compiling the ones of the plurality of categories determined to be most similar into a third list.

24. A text classification system comprising:
- a memory;
  
  a domain specific knowledge base stored in said memory having a plurality of categories, the domain specific knowledge base including a knowledge base of keyword/category profiles, each category in the keyword/category profiles knowledge base having an associated profile which indicates what information provides evidence for a given category, the keyword/profile weight knowledge base is arranged to have associated with each keyword in a profile a profile weight that represents the amount of evidence a keyword provides for a given category; and
  
  a computer coupled to the memory, the computer including:
  
  means for accepting as input into the computer, natural language input text,
  
  means for parsing the natural language input text into a first list of recognized keywords,
  
  means for using the first list to deduce further facts from the natural language input text,
  
  means for compiling the deduced facts into a second list,
  
  means for calculating a numeric similarity score for each one of the plurality of categories in the knowledge base to indicate how similar one of the plurality of categories is to the natural language input text,
  
  means for applying a dynamic threshold to determine which ones of the plurality of categories are most similar to the recognized keywords of the first list,
  
  means for compiling the ones of the plurality of categories determined to be most similar into a third list, and
  
  means for adjusting the profile weights in the keyword/category profiles in the domain specific knowledge base based upon the ones of the plurality of categories determined most relevant to the natural language input text and a second ones of the plurality of categories determined most relevant to the natural language input text by an external source.

Current Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Original Assignee: Digital Equipment Corporation (HP Inc.)
Inventors: Kannan, Narasimhan; Register, Michael S.
Primary Examiner(s): Boudreau, Leo H.
Assistant Examiner(s): Kelley, Christopher S.
Application Number: US07/855,378
Time in Patent Office: 991 Days
Field of Search: 382/36-38, 382/40, 382/14, 382/15, 395/21-23, 364/419
US Class Current: 382/159
CPC Class Codes:
G06F 16/353 into predefined classes
G06F 40/253 Grammatical analysis; Style...
