Document categorisation system
Abstract
A document categorization system, including a clusterer for generating clusters of related electronic documents based on features extracted from the documents, and a filter module for generating a filter on the basis of the clusters to categorize further documents received by the system. The system may include an editor for manually browsing and modifying the clusters. The categorization of the documents is based on n-grams, which are used to determine significant features of the documents. The system includes a trend analyzer for determining trends of changing document categories over time, and for identifying novel clusters. The system may be implemented as a plug-in module for a spreadsheet application for permitting one-off or ongoing analysis of text entries in a worksheet.
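The n-gram based clustering described in the abstract can be illustrated with a minimal sketch. The abstract does not name a clustering algorithm, similarity measure, or n-gram length, so everything below is an assumption made for illustration: character trigrams, cosine similarity, a greedy single-pass assignment, and the hypothetical names `char_ngrams`, `cosine`, `greedy_cluster` and `threshold`.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased, whitespace-normalised string."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors held as Counters."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def greedy_cluster(texts, threshold=0.3):
    """Put each text in its most similar existing cluster if the similarity
    reaches `threshold` (an assumed cut-off); otherwise open a new cluster.
    Returns a list of clusters, each a list of indices into `texts`."""
    centroids, clusters = [], []
    for idx, text in enumerate(texts):
        feats = char_ngrams(text)
        best, best_sim = None, threshold
        for c, centroid in enumerate(centroids):
            sim = cosine(feats, centroid)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            centroids.append(Counter(feats))
            clusters.append([idx])
        else:
            # Summed counts act as the centroid; cosine ignores the scale.
            centroids[best].update(feats)
            clusters[best].append(idx)
    return clusters
```

Calling `greedy_cluster(list_of_documents)` returns index groups that a user could then browse and modify in an editor before they are used as training data. The result depends on input order and on the chosen threshold, which is one reason a manual editing step is useful.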
37 Claims
1. A data categorization computer system comprising:
- data processing logic having;
a clusterer module configured to apply unsupervised learning to first items of electronic data stored on a computer-readable storage medium to generate a set of clusters of related ones of said first items of electronic data based on features extracted from said first items of electronic data, said features including at least one of n-grams, words and phrases, and the clusters representing respective item categories;
an interactive cluster editor configured to display the set of clusters to a user and to modify the set of clusters based on input from the user to provide a set of training clusters;
a filter module configured to use the set of training clusters as training data for supervised learning to generate categorization data representing models that distinguish respective ones of said clusters from the other clusters;
a classifier configured to use the categorization data to categorize second items of electronic data stored on the computer-readable storage medium into the training clusters; and
a trend analyzer for determining trends of item categories over time.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A data categorization process executed by a computer system, the process including:
using the computer system to apply unsupervised learning to first items of electronic data stored on a computer-readable storage medium to generate a set of clusters of related ones of said first items of electronic data based on features extracted from said first items of electronic data, said features including at least one of n-grams, words and phrases, and the clusters representing respective item categories;
displaying the set of clusters to a user;
receiving cluster modification data from the user;
modifying the set of clusters on the basis of the cluster modification data to provide a set of training clusters;
using the training clusters as training data for applying supervised learning to generate categorization data representing models that distinguish respective ones of said training clusters from the other training clusters;
using the categorization data to categorize second items of electronic data into the training clusters; and
determining one or more trends of respective item categories over time.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35
36. A data categorization system embodied on a computer-readable storage medium for use with first items of electronic data and second items of electronic data, the data categorization system including:
a clusterer configured to apply unsupervised learning to the first items of electronic data to generate clusters of the first items of electronic data based on features extracted from the first items of electronic data, said features including at least one of n-grams, words and phrases;
a cluster editor having user interface components for displaying the clusters of the first items of electronic data to a user of the system and for receiving cluster modification data from the user;
the clusterer being configured to create or delete one or more clusters of the first items of electronic data and to move one or more of the first items of electronic data or one or more of said clusters to another cluster on the basis of the cluster modification data to form one or more modified clusters;
a filter module configured to use the modified clusters as training data for supervised learning to generate classification data;
a classifier for using the classification data to classify the second items of electronic data into the modified clusters; and
a trend analyzer for determining trends of item classifications over time.
Dependent claims: 37
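The supervised stages recited in the independent claims (training models that distinguish each cluster from the other clusters, classifying further items, and determining category trends over time) can be sketched as follows. The claims do not specify a learning algorithm or a time granularity, so this sketch assumes a simple smoothed log-odds word model per cluster (a naive-Bayes-style stand-in) and caller-supplied period labels. The names `train_one_vs_rest`, `classify` and `category_trends` are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def tokens(text):
    """Lowercased whitespace tokens; words are one of the feature types named in the claims."""
    return text.lower().split()


def train_one_vs_rest(training_clusters):
    """training_clusters maps label -> list of texts (the user-edited clusters).
    Returns label -> {word: weight}, where each weight is the add-one smoothed
    log-odds of the word appearing in that cluster versus all other clusters."""
    counts = {label: Counter(w for t in texts for w in tokens(t))
              for label, texts in training_clusters.items()}
    vocab = set().union(*counts.values())
    total = Counter()
    for c in counts.values():
        total.update(c)
    all_n = sum(total.values())
    models = {}
    for label, own in counts.items():
        own_n = sum(own.values())
        rest_n = all_n - own_n
        models[label] = {
            w: log((own[w] + 1) / (own_n + len(vocab)))
               - log((total[w] - own[w] + 1) / (rest_n + len(vocab)))
            for w in vocab
        }
    return models


def classify(models, text):
    """Return the label with the highest summed weight; unseen words contribute 0."""
    words = tokens(text)
    return max(models, key=lambda label: sum(models[label].get(w, 0.0) for w in words))


def category_trends(models, dated_items):
    """dated_items is an iterable of (period, text) pairs.
    Returns period -> Counter of predicted labels, i.e. category counts over time."""
    trends = defaultdict(Counter)
    for period, text in dated_items:
        trends[period][classify(models, text)] += 1
    return trends
```

For example, `category_trends(train_one_vs_rest(clusters), [("2024-01", "printer jam"), ...])` yields per-period category counts that can be compared across periods to see which categories are growing or shrinking. A production system would likely use a stronger learner and a rule for items that fit no existing category.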
Specification