Methods and systems for classifying data using a hierarchical taxonomy

US 9,367,814 B1
Filed: 06/22/2012
Issued: 06/14/2016
Est. Priority Date: 12/27/2011
Status: Expired due to Fees


Abstract

A method and system for classifying documents is provided. A set of document classifiers is generated by applying a classification algorithm to a trusted corpus that includes a set of training documents representing a taxonomy. One or more of the generated document classifiers are executed against a plurality of input documents to create a plurality of classified documents. Each classified document is associated with a classification within the taxonomy and a classification confidence level. One or more classified documents that are associated with a classification confidence level below a predetermined threshold value are selected to create a set of low-confidence documents. The low-confidence documents are disassociated from each of the associated classifications. A user is prompted to enter a classification within the taxonomy for at least one low-confidence document. The low-confidence document is associated with the entered classification and with a predetermined confidence level to create a newly classified document.
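
The review loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the word-overlap classifier is a stand-in for the unspecified classification algorithm, and the names `Doc`, `train`, `classify`, `review_round`, `ask_user`, and the 0-to-1 confidence scale are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Callable, Optional

HIGHEST_CONFIDENCE = 1.0  # assumption: confidence is a value in [0, 1]


@dataclass
class Doc:
    text: str
    label: Optional[str] = None  # classification within the taxonomy
    confidence: float = 0.0      # classification confidence level


def train(trusted_corpus: list[Doc]) -> dict[str, Counter]:
    """Stand-in classification algorithm: collect word counts per label."""
    model: dict[str, Counter] = defaultdict(Counter)
    for doc in trusted_corpus:
        model[doc.label].update(doc.text.lower().split())
    return dict(model)


def classify(model: dict[str, Counter], doc: Doc) -> None:
    """Pick the label whose vocabulary covers the most words of the document.

    The confidence is the fraction of the document's words seen for that label.
    """
    words = doc.text.lower().split()
    scores = {
        label: sum(counts[w] > 0 for w in words)
        for label, counts in model.items()
    }
    best = max(scores, key=scores.get)
    doc.label = best
    doc.confidence = scores[best] / len(words) if words else 0.0


def review_round(
    trusted: list[Doc],
    inputs: list[Doc],
    threshold: float,
    ask_user: Callable[[Doc], str],
) -> list[Doc]:
    """Classify inputs, send low-confidence ones to a user, return the updated corpus."""
    model = train(trusted)
    for doc in inputs:
        classify(model, doc)
    low_confidence = [d for d in inputs if d.confidence < threshold]
    for doc in low_confidence:
        doc.label = None                      # disassociate from the machine label
        doc.label = ask_user(doc)             # user enters a classification
        doc.confidence = HIGHEST_CONFIDENCE   # user labels get the highest level
    return trusted + low_confidence
```

A second batch of input documents would then be classified with a model trained on the list returned by `review_round`, so that the user-entered labels influence later classifications.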

65 Citations

20 Claims

1. A computer-implemented method including executing instructions stored on a computer-readable medium, the method comprising:
- generating a set of document classifiers by applying a classification algorithm to a trusted corpus, wherein the trusted corpus includes a set of training documents representing a hierarchical taxonomy, the hierarchical taxonomy including a hierarchical tree structure of domain specific issues that includes multiple levels of issue categories, subcategories, and sub issues of each issue, the trusted corpus further includes previously classified documents associated with a classification confidence level above a predetermined confidence level threshold;
  
  executing one or more of the generated document classifiers against a first plurality of input documents to create a first plurality of classified documents, wherein each classified document is associated with a classification within the taxonomy and a classification confidence level;
  
  selecting one or more classified documents that are associated with a classification confidence level below the predetermined confidence level threshold to create a set of low-confidence documents;
  
  disassociating the low-confidence documents from each of the associated classifications;
  
  prompting a user to enter a new classification within the hierarchical taxonomy for at least one low-confidence document, wherein the low-confidence document is associated with the entered classification and with a predetermined confidence level to create a newly classified document in at least one of the multiple levels of issue categories, subcategories, and sub issues of each issue of the hierarchical taxonomy;
  
  applying a highest classification confidence level to the newly classified document;
  
  including the newly classified document in the trusted corpus to create an updated trusted corpus; and
  
  executing one or more of the generated document classifiers, by applying the classification algorithm to the updated trusted corpus against a second plurality of input documents to create a second plurality of classified documents, wherein each classified document is associated with a classification within the taxonomy and a classification confidence level.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, wherein generating a set of document classifiers further comprises applying a classification algorithm to a trusted corpus by applying the classification algorithm to a trusted corpus including classified documents having a classification confidence level greater than a predetermined confidence level threshold.
  - 3. The method of claim 1, wherein generating a set of document classifiers further comprises applying a classification algorithm to a trusted corpus by applying the classification algorithm to a trusted corpus including classified documents associated with a classification entered by a user.
  - 4. The method of claim 1, further comprising selecting one or more of the classified documents from the second plurality of input documents that are associated with a classification confidence below a different predetermined confidence level threshold to create a new set of low-confidence documents, wherein the different predetermined confidence level threshold is lower than the predetermined confidence level threshold.
  - 5. The method of claim 1, further comprising, when one or more of the generated document classifiers returns a plurality of classifications for a first input document of the plurality of input documents:
    - identifying a lowest common parent node of the plurality of classifications within the taxonomy; and
      
      associating the first input document with the identified lowest common parent node.
  - 6. The method of claim 1, wherein generating a set of document classifiers further comprises applying a classification algorithm to a trusted corpus by applying the classification algorithm to a trusted corpus including one or more of the following:
    - email messages, chat logs, and transcribed telephone conversations.
  - 7. The method of claim 1, wherein generating a set of document classifiers further comprises applying a classification algorithm to a trusted corpus by applying the classification algorithm to a trusted corpus including one or more of the following:
    - customer support documentation, predetermined responses to support issues, and customer service representative training documents.
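
Claim 5 resolves multiple returned classifications to their lowest common parent node in the taxonomy. One way to compute such a node is sketched below, assuming the taxonomy is stored as a child-to-parent mapping with a single root. The `PARENT` table, its node names, and the function names are hypothetical, and the sketch treats a node as its own ancestor, so if one classification is an ancestor of another, that ancestor is returned.

```python
from typing import Optional

# Hypothetical taxonomy stored as child -> parent; the root maps to None.
PARENT: dict[str, Optional[str]] = {
    "support": None,
    "billing": "support",
    "refund": "billing",
    "double-charge": "billing",
    "account": "support",
    "login": "account",
}


def path_to_root(parent: dict[str, Optional[str]], node: str) -> list[str]:
    """Return [node, parent of node, ..., root]."""
    path = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    return path


def lowest_common_parent(parent: dict[str, Optional[str]], nodes: list[str]) -> str:
    """Return the deepest node that lies on every node's path to the root."""
    common = path_to_root(parent, nodes[0])  # ordered deepest first
    for node in nodes[1:]:
        ancestors = set(path_to_root(parent, node))
        # Filtering keeps the deepest-first order of the surviving nodes.
        common = [n for n in common if n in ancestors]
    return common[0]
```

For two sub-issues under the same subcategory, such as `refund` and `double-charge`, the function returns `billing`; for classifications in different branches it falls back to a higher node, up to the root.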

8. A computer system comprising:
- a memory for storing a trusted corpus, wherein the trusted corpus includes a set of training documents representing a hierarchical taxonomy, the hierarchical taxonomy including a hierarchical tree structure of domain specific issues that includes multiple levels of issue categories, subcategories, and sub issues of each issue, the trusted corpus further includes previously classified documents associated with a classification confidence level above a predetermined confidence level threshold; and
  
  a processor coupled to the memory and programmed to:
  
  generate a set of document classifiers by applying a classification algorithm to the trusted corpus;
  
  execute one or more of the generated document classifiers against a first plurality of input documents to create a first plurality of classified documents, wherein each classified document is associated with a classification within the taxonomy and a classification confidence level;
  
  select one or more classified documents that are associated with a classification confidence level below the predetermined confidence level threshold to create a set of low-confidence documents;
  
  disassociate the low-confidence documents from each of the associated classifications;
  
  prompt a user to enter a classification within the hierarchical taxonomy for at least one low-confidence document, wherein the low-confidence document is associated with the entered classification and with a predetermined confidence level to create a newly classified document;
  
  apply a highest classification confidence level to the newly classified document;
  
  include the newly classified document in the trusted corpus to create an updated trusted corpus; and
  
  execute one or more of the generated document classifiers, by applying the classification algorithm to the updated trusted corpus against a second plurality of input documents to create a second plurality of classified documents, wherein each classified document is associated with a classification within the taxonomy and a classification confidence level.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The system of claim 8, wherein the processor is programmed to apply the classification algorithm against a trusted corpus including classified documents with a classification confidence level greater than a predetermined confidence level threshold.
  - 10. The system of claim 8, wherein the processor is programmed to apply the classification algorithm to a trusted corpus including classified documents associated with a classification entered by a user.
  - 11. The system of claim 8, wherein the processor is further programmed to select one or more of the classified documents from the second plurality of input documents that are associated with a classification confidence below a different predetermined confidence level threshold to create a new set of low-confidence documents, wherein the different predetermined confidence level threshold is lower than the predetermined confidence level threshold.
  - 12. The system of claim 8, wherein when one or more of the generated document classifiers returns a plurality of classifications for a first input document of the plurality of input documents, the processor is further programmed to:
    - identify a lowest common parent of the plurality of classifications within the taxonomy; and
      
      associate the first input document with the identified lowest common parent.
  - 13. The system of claim 8, wherein the processor is programmed to apply the classification algorithm to a trusted corpus including one or more of the following:
    - email messages, chat logs, and transcribed telephone conversations.
  - 14. The system of claim 8, wherein the processor is programmed to apply the classification algorithm to a trusted corpus including one or more of the following:
    - customer support documentation, predetermined responses to support issues, and customer service representative training documents.
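
Claims 4 and 11 apply a lower confidence threshold when selecting low-confidence documents from the second plurality of classified documents. The selection step might look like the sketch below; the threshold values, the `select_low_confidence` helper, and the sample results are assumptions chosen only to show the effect.

```python
FIRST_THRESHOLD = 0.8   # assumed value for the first pass
SECOND_THRESHOLD = 0.6  # assumed value; the claims only require it to be lower


def select_low_confidence(
    classified: dict[str, tuple[str, float]], threshold: float
) -> list[str]:
    """Return the ids of documents whose confidence is below the threshold.

    `classified` maps a document id to (classification, confidence level).
    """
    return [
        doc_id
        for doc_id, (_label, confidence) in classified.items()
        if confidence < threshold
    ]


# Hypothetical results from classifying the second plurality of documents.
second_pass = {
    "d1": ("billing/refund", 0.9),
    "d2": ("account/login", 0.7),
    "d3": ("account/login", 0.4),
}

with_first = select_low_confidence(second_pass, FIRST_THRESHOLD)
with_second = select_low_confidence(second_pass, SECOND_THRESHOLD)
```

With these sample values the lower threshold sends fewer documents to the user for review than the first threshold would have.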

15. One or more non-transitory computer-readable media having computer-executable instructions embodied thereon, wherein when executed by a computing device, the computer-executable instructions cause the computing device to:
- generate a set of document classifiers by applying a classification algorithm to a trusted corpus, wherein the trusted corpus includes a set of training documents representing a hierarchical taxonomy, the hierarchical taxonomy including a hierarchical tree structure of domain specific issues that includes multiple levels of issue categories, subcategories, and sub issues of each issue, the trusted corpus further includes previously classified documents associated with a classification confidence level above a predetermined confidence level threshold;
  
  execute one or more of the generated document classifiers against a first plurality of input documents to create a first plurality of classified documents, wherein each classified document is associated with a classification within the taxonomy and a classification confidence level;
  
  select one or more classified documents that are associated with a classification confidence level below the predetermined confidence level threshold to create a set of low-confidence documents;
  
  disassociate the low-confidence documents from each of the associated classifications;
  
  prompt a user to enter a classification within the hierarchical taxonomy for at least one low-confidence document, wherein the low-confidence document is associated with the entered classification and with a predetermined confidence level to create a newly classified document;
  
  apply a highest classification confidence level to the newly classified document;
  
  include the newly classified document in the trusted corpus to create an updated trusted corpus; and
  
  execute one or more of the generated document classifiers, by applying the classification algorithm to the updated trusted corpus against a second plurality of input documents to create a second plurality of classified documents, wherein each classified document is associated with a classification within the taxonomy and a classification confidence level.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The computer-readable media of claim 15, wherein the computer-executable instructions cause the processor to apply the classification algorithm to a trusted corpus including classified documents with a classification confidence level greater than a predetermined confidence level threshold.
  - 17. The computer-readable media of claim 15, wherein the computer-executable instructions cause the processor to apply the classification algorithm to a trusted corpus including classified documents associated with a classification entered by a user.
  - 18. The computer-readable media of claim 15, wherein the computer-executable instructions further cause the processor to select one or more of the classified documents from the second plurality of input documents that are associated with a classification confidence below a different predetermined confidence level threshold to create a new set of low-confidence documents, wherein the different predetermined confidence level threshold is lower than the predetermined confidence level threshold.
  - 19. The computer-readable media of claim 15, wherein when one or more of the generated document classifiers returns a plurality of classifications for a first input document of the plurality of input documents, the computer-executable instructions cause the processor to apply the classification algorithm to:
    - identify a lowest common parent of the plurality of classifications within the taxonomy; and
      
      associate the first input document with the identified lowest common parent.
  - 20. The computer-readable media of claim 15, wherein the computer-executable instructions cause the processor to apply the classification algorithm to a trusted corpus including one or more of the following:
    - email messages, chat logs, transcribed telephone conversations, customer support documentation, predetermined responses to support issues, and customer service representative training documents.
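
Claims 16 and 17 describe two kinds of documents a trusted corpus may include: classified documents above the confidence threshold and documents whose classification was entered by a user. The sketch below combines both criteria in one filter, which is an assumption of this example rather than something the claims require. `ClassifiedDoc`, its `user_entered` flag, and `build_trusted_corpus` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifiedDoc:
    text: str
    label: str                  # classification within the taxonomy
    confidence: float           # classification confidence level
    user_entered: bool = False  # True when a user supplied the label


def build_trusted_corpus(
    docs: list[ClassifiedDoc], threshold: float
) -> list[ClassifiedDoc]:
    """Keep documents that are user-labeled or strictly above the threshold."""
    return [d for d in docs if d.user_entered or d.confidence > threshold]


candidates = [
    ClassifiedDoc("refund for double charge", "billing/refund", 0.95),
    ClassifiedDoc("something about my order", "billing/refund", 0.50),
    ClassifiedDoc("parcel never arrived", "shipping/tracking", 1.0, user_entered=True),
]
trusted = build_trusted_corpus(candidates, threshold=0.8)
```

Because the independent claims assign the highest confidence level to user-entered classifications, those documents would normally pass the threshold test anyway; the explicit flag only makes the second criterion visible.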


Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Lewis, Glenn M.; Buryak, Kirill; Ben-Artzi, Aner; Peng, Jun; Benbarak, Nadav
Primary Examiner(s): Chaki, Kakali
Assistant Examiner(s): Hanchak, Walter
Application Number: US13/530,505
Time in Patent Office: 1,453 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
- G06F 16/285: Clustering or classification
- G06F 16/353: into predefined classes
- G06F 16/93: Document management systems
- G06N 20/00: Machine learning
- G06N 5/022: Knowledge engineering; Know...
- G06N 7/01: Probabilistic graphical mod...
- G06Q 10/00: Administration; Management
