Clustering communications based on classification
First Claim
1. A computer implemented method, comprising:
identifying a plurality of classification terms indicative of a classification;
identifying a corpus of communications from one or more databases, the corpus of communications including a plurality of communications that are not labeled with an association to the classification;
determining a cluster of the communications based on occurrence of one or more of the classification terms in the communications of the cluster;
subsequent to determining the cluster, determining a feature set based on the communications of the cluster, wherein determining the feature set comprises:
determining one or more features that are based on content that appears in a plurality of the communications of the cluster, wherein the content is in addition to the classification terms used in determining the cluster, and wherein determining the features based on the content that is in addition to the classification terms comprises determining the features based on the content appearing in the plurality of the communications of the cluster;
assigning the feature set to an indication of the classification; and
using the assigned feature set to classify an additional communication with the classification or using the assigned feature set to select a data extraction parser, for the classification, for the additional communication.
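The steps recited in claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: communications are plain strings, tokens are lowercase alphanumeric words, a feature is any non-classification token that appears in at least two communications and at least `min_fraction` of the cluster, and classification uses a simple overlap score. All function names, thresholds, and sample messages are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphanumeric tokens (a simplifying assumption of this sketch)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def determine_cluster(corpus, classification_terms):
    """Keep unlabeled communications that contain at least one classification term."""
    terms = {t.lower() for t in classification_terms}
    return [c for c in corpus if tokenize(c) & terms]

def determine_feature_set(cluster, classification_terms, min_fraction=0.5):
    """Features are tokens, other than the classification terms, that recur
    across the cluster. The min_fraction threshold is hypothetical."""
    terms = {t.lower() for t in classification_terms}
    counts = Counter()
    for c in cluster:
        counts.update(tokenize(c) - terms)
    needed = max(2, min_fraction * len(cluster))  # "a plurality": at least two
    return {tok for tok, n in counts.items() if n >= needed}

def classify(communication, assigned, min_overlap=0.5):
    """Return the classification whose feature set overlaps best, or None."""
    tokens = tokenize(communication)
    best, best_score = None, 0.0
    for label, features in assigned.items():
        if not features:
            continue
        score = len(tokens & features) / len(features)
        if score >= min_overlap and score > best_score:
            best, best_score = label, score
    return best

# Hypothetical sample data.
corpus = [
    "Your flight confirmation: departure gate B4, boarding at 9:00",
    "Flight itinerary attached. Departure at noon, boarding pass enclosed",
    "Lunch on Friday?",
    "Itinerary update: departure delayed, boarding moved to gate C2",
]
terms = ["flight", "itinerary"]

cluster = determine_cluster(corpus, terms)
features = determine_feature_set(cluster, terms)
assigned = {"travel": features}  # assign the feature set to the classification

# The additional communication contains no classification term,
# yet it matches on the derived features.
result = classify("Boarding begins soon; departure gate A1", assigned)
print(sorted(features), result)
```

In this sketch the derived features ("departure", "boarding", and so on) do the work that the seed terms cannot: the additional message is matched even though it never mentions "flight" or "itinerary". The same `assigned` mapping could instead key a lookup of per-classification data extraction parsers, which is the alternative use recited in the claim.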
Abstract
Methods and apparatus related to clustering documents based on one or more classification terms and optionally based on similarity of structural paths of the documents. In some implementations, the documents are communications such as structured emails or other structured communications. In some of those implementations, clustering the communications includes identifying a plurality of classification terms indicative of a classification, identifying a corpus of communications that includes communications that are not labeled with an association to the classification, and determining a cluster of the communications based on occurrence of one or more of the classification terms in the communications of the cluster.
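The abstract mentions optional clustering by similarity of structural paths but does not define either term here. As a rough sketch only, the following assumes that a structural path is the chain of HTML tag names from the document root to an element, and that similarity is the Jaccard index over the two sets of paths. Both choices, and every name in the code, are assumptions for illustration rather than the method described in the specification.

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Collects root-to-element tag paths such as /html/body/table."""
    VOID = {"br", "img", "hr", "meta", "input", "link"}  # tags with no close

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/" + "/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the most recent matching open tag.
            while self.stack.pop() != tag:
                pass

def structural_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths

def path_similarity(html_a, html_b):
    """Jaccard index over structural paths (an assumed similarity measure)."""
    pa, pb = structural_paths(html_a), structural_paths(html_b)
    if not pa and not pb:
        return 1.0
    return len(pa & pb) / len(pa | pb)

# Hypothetical structured emails.
order_1 = "<html><body><table><tr><td>Order 1</td></tr></table></body></html>"
order_2 = ("<html><body><table><tr><td>Order 2</td></tr></table>"
           "<p>Thanks</p></body></html>")
note = "<html><body><p>Hi</p></body></html>"

print(path_similarity(order_1, order_2))  # 5 shared paths out of 6
print(path_similarity(order_1, note))     # 2 shared paths out of 6
```

Under these assumptions, two emails generated from the same template score close to 1 even when their text differs, so a threshold on this score could be combined with the classification-term test when forming clusters.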
38 Citations
25 Claims
1. A computer implemented method, comprising:
identifying a plurality of classification terms indicative of a classification;
identifying a corpus of communications from one or more databases, the corpus of communications including a plurality of communications that are not labeled with an association to the classification;
determining a cluster of the communications based on occurrence of one or more of the classification terms in the communications of the cluster;
subsequent to determining the cluster, determining a feature set based on the communications of the cluster, wherein determining the feature set comprises:
determining one or more features that are based on content that appears in a plurality of the communications of the cluster, wherein the content is in addition to the classification terms used in determining the cluster, and wherein determining the features based on the content that is in addition to the classification terms comprises determining the features based on the content appearing in the plurality of the communications of the cluster;
assigning the feature set to an indication of the classification; and
using the assigned feature set to classify an additional communication with the classification or using the assigned feature set to select a data extraction parser, for the classification, for the additional communication.
Dependent claims: 2-20.
21. A system including memory and one or more processors configured to execute instructions stored in the memory, comprising instructions to:
identify a plurality of classification terms indicative of a classification;
identify a corpus of communications from one or more databases, the corpus of communications including a plurality of communications that are not labeled with an association to the classification;
determine a cluster of the communications based on occurrence of one or more of the classification terms in the communications of the cluster;
subsequent to determining the cluster, determine a feature set based on the communications of the cluster, wherein the instructions to determine the feature set comprise instructions to determine one or more features that are based on content that appears in a plurality of the communications of the cluster, wherein the content is in addition to the classification terms used in determining the cluster, and wherein the instructions to determine the features based on the content that is in addition to the classification terms include instructions to determine the content based on the content appearing in the plurality of the communications of the cluster;
assign the feature set to an indication of the classification; and
using the assigned feature set to classify an additional communication with the classification or using the assigned feature set to select a data extraction parser, for the classification, for the additional communication.
Dependent claims: 22-25.
Specification