Generating and applying data extraction templates
Abstract
Methods, apparatus, systems, and computer-readable media are provided for generating and applying data extraction templates. In various implementations, a corpus of plain text communications such as emails may be grouped into clusters based on one or more similarities between the plain text communications. One or more segments of communications of a particular cluster may be classified as transient based on textual pattern matching. One or more other segments of the communications of the particular cluster may be classified as transient based on various criteria. One or more transient segments may be assigned a generic and/or specific semantic data type and/or a confidentiality designation based on various signals. A data extraction template may be generated to extract, from subsequent plain text communications, content associated with transient (and in some cases, non-confidential) segments.
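As an illustration of the grouping and pattern-matching steps summarized above, the following is a minimal sketch and not the patented implementation. The message fields (`sender`, `subject`), the digit-masking cluster key, and the regular expressions are all assumptions chosen for the example; the abstract does not specify which similarities or patterns are used.

```python
import re
from collections import defaultdict

# Hypothetical patterns for text that tends to change from message to message.
TRANSIENT_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),   # ISO-style dates
    re.compile(r"\$\d+(?:\.\d{2})?"),       # dollar amounts
    re.compile(r"\b\d{5,}\b"),              # long numeric identifiers
]

def cluster_key(message):
    """Shared attributes for grouping: sender plus subject with digit runs masked."""
    return (message["sender"], re.sub(r"\d+", "#", message["subject"]))

def cluster_messages(messages):
    """Group messages whose cluster keys are identical."""
    clusters = defaultdict(list)
    for message in messages:
        clusters[cluster_key(message)].append(message)
    return dict(clusters)

def is_transient(segment):
    """True if any assumed pattern matches somewhere in the segment."""
    return any(p.search(segment) for p in TRANSIENT_PATTERNS)

messages = [
    {"sender": "orders@example.com", "subject": "Order 1001 has shipped"},
    {"sender": "orders@example.com", "subject": "Order 2002 has shipped"},
    {"sender": "news@example.com", "subject": "Weekly digest"},
]
clusters = cluster_messages(messages)
```

Here the two shipping notices fall into one cluster because their subjects differ only in the order number, while the digest message forms its own cluster.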
16 Claims
1. A computer-implemented method for generating and applying data extraction templates to extract transient content from plain text communications created automatically using templates, comprising:
- grouping a corpus of plain text communications into a plurality of clusters based on one or more shared attributes;
- classifying one or more plain text segments of each plain text communication of a particular cluster as fixed in response to a determination that a count of occurrences of the one or more plain text segments across the particular cluster satisfies a criterion;
- classifying one or more remaining plain text segments of each plain text communication of the particular cluster as transient;
- generating a tree to represent sequences of classified plain text segments associated with each plain text communication of the particular cluster, wherein the tree includes at least a first branch to represent a first sequence of classified plain text segments corresponding to a first plain text communication of the particular cluster and a second branch to represent at least part of a second sequence of classified plain text segments corresponding to a second plain text communication of the particular cluster, wherein the second sequence of classified plain text segments is different than the first sequence of classified plain text segments;
- generating, based on the tree, a data extraction template to extract, from one or more subsequent plain text communications, content associated with transient segments;
- extracting content associated with at least one transient segment from a given subsequent plain text communication addressed to a user by applying the data extraction template to the given subsequent plain text communication; and
- rating the extracting performed on the given subsequent plain text communication based on how closely a sequence of classified plain text segments generated for the given subsequent plain text communication traverses a branch of the tree.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
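The classification, tree-building, and rating steps recited in claim 1 can be pictured with a small sketch. This is one possible reading under stated assumptions, not the claimed implementation: segments are taken to be whole lines, the count criterion is assumed to be "appears in at least 80% of the cluster's messages", the tree is a nested dictionary, and the rating is assumed to be the fraction of a sequence that follows a branch from the root before diverging. The names `fixed_segments`, `label_sequence`, `build_tree`, and `rate_traversal` are hypothetical.

```python
from collections import Counter

TRANSIENT = "<TRANSIENT>"  # placeholder label for segments that vary

def fixed_segments(cluster, min_fraction=0.8):
    """Segments that occur in at least min_fraction of the cluster's messages."""
    counts = Counter(seg for msg in cluster for seg in set(msg))
    return {seg for seg, n in counts.items() if n >= min_fraction * len(cluster)}

def label_sequence(message, fixed):
    """Keep fixed segments as-is; replace every other segment with TRANSIENT."""
    return tuple(seg if seg in fixed else TRANSIENT for seg in message)

def build_tree(sequences):
    """Nested-dict tree; sequences sharing a prefix share a path from the root."""
    root = {}
    for seq in sequences:
        node = root
        for label in seq:
            node = node.setdefault(label, {})
    return root

def rate_traversal(tree, sequence):
    """Fraction of the sequence that follows a branch before diverging."""
    node, matched = tree, 0
    for label in sequence:
        if label not in node:
            break
        node = node[label]
        matched += 1
    return matched / len(sequence) if sequence else 0.0

cluster = [
    ["Your order has shipped", "Order 111", "Thanks for shopping"],
    ["Your order has shipped", "Order 222", "Thanks for shopping"],
    ["Your order has shipped", "Order 333", "Gift note: hi", "Thanks for shopping"],
]
fixed = fixed_segments(cluster)
tree = build_tree(label_sequence(msg, fixed) for msg in cluster)
```

In this toy cluster the third message carries an extra varying line, so the tree splits into two branches after the first transient segment. A later message whose labeled sequence follows either branch to the end rates 1.0 under this assumed measure, and one that diverges early rates lower.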
10. A system including memory and one or more processors operable to execute instructions stored in the memory, comprising instructions to:
- group a corpus of plain text communications into a plurality of clusters based on one or more shared attributes;
- classify one or more plain text segments of each plain text communication of a particular cluster as fixed in response to a determination that a count of occurrences of the one or more plain text segments across the particular cluster satisfies a criterion;
- classify one or more remaining plain text segments of each plain text communication of the particular cluster as transient;
- generate a tree to represent sequences of classified plain text segments associated with each plain text communication of the particular cluster, wherein the tree includes at least a first branch to represent a first sequence of classified plain text segments corresponding to a first plain text communication of the particular cluster and a second branch to represent at least part of a second sequence of classified plain text segments corresponding to a second plain text communication of the particular cluster, wherein the second sequence of classified plain text segments is different than the first sequence of classified plain text segments;
- generate, based on the tree, a data extraction template to extract, from one or more subsequent plain text communications, content associated with transient segments;
- extract content associated with at least one transient segment from a given subsequent plain text communication addressed to a user by applying the data extraction template to the given subsequent plain text communication; and
- rate the extraction performed on the given subsequent plain text communication based on how closely a sequence of classified plain text segments generated for the given subsequent plain text communication traverses a branch of the tree.
Dependent claims: 11, 12, 13, 14, 15.
16. At least one non-transitory computer-readable medium comprising instructions that, when executed by a computing system, cause the computing system to perform the following operations:
- grouping a corpus of plain text communications into a plurality of clusters based on one or more shared attributes;
- classifying one or more plain text segments of each plain text communication of a particular cluster as fixed in response to a determination that a count of occurrences of the one or more plain text segments across the particular cluster satisfies a criterion;
- classifying one or more remaining plain text segments of each plain text communication of the particular cluster as transient;
- generating a tree to represent sequences of classified plain text segments associated with each plain text communication of the particular cluster, wherein the tree includes at least a first branch to represent a first sequence of classified plain text segments corresponding to a first plain text communication of the particular cluster and a second branch to represent at least part of a second sequence of classified plain text segments corresponding to a second plain text communication of the particular cluster, wherein the second sequence of classified plain text segments is different than the first sequence of classified plain text segments;
- generating, based on the tree, a data extraction template to extract, from one or more subsequent plain text communications, content associated with transient segments;
- extracting content associated with at least one transient segment from a given subsequent plain text communication addressed to a user by applying the data extraction template to the given subsequent plain text communication; and
- rating the extracting performed on the given subsequent plain text communication based on how closely a sequence of classified plain text segments generated for the given subsequent plain text communication traverses a branch of the tree.
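To make the template generation and extraction operations in the independent claims more concrete, here is a hedged sketch. It assumes a template is derived from a single branch of labeled segments, that each slot is either a fixed line that must match exactly or a transient line whose content is captured, and that a message with a different number of segments is simply rejected. The functions `make_template` and `apply_template` are hypothetical names, and the claims do not prescribe this matching rule.

```python
TRANSIENT = "<TRANSIENT>"  # placeholder label marking a transient segment

def make_template(branch):
    """Turn one branch (a sequence of labels) into a list of (kind, text) slots."""
    return [
        ("transient", None) if label == TRANSIENT else ("fixed", label)
        for label in branch
    ]

def apply_template(template, message):
    """Return the content of transient slots, or None if the message does not fit."""
    if len(template) != len(message):
        return None
    extracted = []
    for (kind, text), segment in zip(template, message):
        if kind == "transient":
            extracted.append(segment)   # capture the varying content
        elif segment != text:
            return None                 # a fixed segment failed to match
    return extracted

template = make_template(
    ("Your order has shipped", TRANSIENT, "Thanks for shopping")
)
result = apply_template(
    template,
    ["Your order has shipped", "Order 444", "Thanks for shopping"],
)
```

In this example `result` holds the single transient line of the new message. A production system would likely tolerate partial matches and pair the extraction with a rating, as the final operation of each claim describes.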
Specification