Template-based structured document classification and extraction
First Claim
1. A computer-implemented method, comprising:
identifying a data extraction template generated from a cluster of electronic messages that share at least some underlying structural and textual similarities;
applying features of the cluster of electronic messages as input to one or more category machine learning models, wherein the one or more category machine learning models are trained to classify electronic messages into one or more of a plurality of document categories;
determining a document category associated with the data extraction template based on output generated over the one or more category machine learning models based on the input provided to the one or more category machine learning models;
applying the same features or different features of the cluster of electronic messages as input to one or more extraction machine learning models, wherein the one or more extraction machine learning models are trained to provide one or more locations of one or more transient fields in electronic messages, and wherein the one or more extraction machine learning models are selected from a plurality of extraction machine learning models based on the determined document category;
determining one or more locations of one or more transient fields in the cluster of electronic messages based on output generated from the one or more extraction machine learning models based on the input provided to the one or more extraction machine learning models;
storing, in computer memory, a first association between the data extraction template and the determined one or more transient field locations in the cluster of electronic messages;
extracting at least two data points from a given electronic message of a user that shares at least some structural and textual similarities with the cluster of electronic messages, wherein the extracting is based on the first association; and
providing the at least two extracted data points for surfacing to the user via one or more computing devices operated by the user.
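The steps recited in claim 1 can be sketched in code. The following is only a minimal illustration under stated assumptions, not the patented implementation: the `Message` representation (a mapping from a location string to text), the keyword-count features, the rule-based stand-ins for the trained category and extraction models, and every function name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical representation: a message maps a location (e.g., a path) to its text.
Message = Dict[str, str]


@dataclass
class Template:
    template_id: str
    cluster: List[Message]
    category: str = ""
    transient_locations: List[str] = field(default_factory=list)


def cluster_features(cluster: List[Message]) -> Dict[str, int]:
    """Toy features: keyword counts over all text in the cluster (assumption)."""
    text = " ".join(v.lower() for m in cluster for v in m.values())
    return {"total": text.count("total"), "flight": text.count("flight")}


def category_model(features: Dict[str, int]) -> str:
    """Rule-based stand-in for a trained category machine learning model."""
    return "receipt" if features["total"] >= features["flight"] else "travel"


def varying_locations(cluster: List[Message]) -> List[str]:
    """Locations present in every message whose text differs across the cluster."""
    shared = set(cluster[0]).intersection(*(set(m) for m in cluster[1:]))
    return sorted(k for k in shared if len({m[k] for m in cluster}) > 1)


def receipt_model(cluster: List[Message]) -> List[str]:
    """Stand-in extraction model: varying locations whose text contains a digit."""
    return [k for k in varying_locations(cluster)
            if any(ch.isdigit() for m in cluster for ch in m[k])]


def travel_model(cluster: List[Message]) -> List[str]:
    """Stand-in extraction model: every varying location."""
    return varying_locations(cluster)


# One extraction model per document category; selection is keyed on the category.
EXTRACTION_MODELS: Dict[str, Callable[[List[Message]], List[str]]] = {
    "receipt": receipt_model,
    "travel": travel_model,
}

# Stored association: template id -> transient field locations.
ASSOCIATIONS: Dict[str, List[str]] = {}


def annotate_template(template: Template) -> Template:
    """Classify the cluster, pick the extraction model, and store the association."""
    template.category = category_model(cluster_features(template.cluster))
    model = EXTRACTION_MODELS[template.category]
    template.transient_locations = model(template.cluster)
    ASSOCIATIONS[template.template_id] = template.transient_locations
    return template


def extract(message: Message, template_id: str) -> Dict[str, str]:
    """Pull data points from a new message using the stored association."""
    return {loc: message[loc] for loc in ASSOCIATIONS[template_id] if loc in message}
```

In this sketch the category is determined once per template rather than once per message, so a later message that matches the template only needs a lookup in `ASSOCIATIONS` followed by `extract`.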
Abstract
Techniques are described herein for automatically generating data extraction templates for structured documents (e.g., B2C emails, invoices, bills, invitations, etc.), and for assigning classifications to those data extraction templates to streamline data extraction from subsequent structured documents. In various implementations, a data extraction template generated from a cluster of structured documents that share fixed content may be identified. Features of the cluster of structured documents may be applied as input to extraction machine learning model(s) trained to provide location(s) of transient field(s) in structured documents, to determine location(s) of transient field(s) in the cluster of structured documents. An association between the data extraction template and the determined transient field location(s) may be stored. Based on the association, data point(s) may be extracted from a given structured document of a user that shares fixed content with the cluster of structured documents. The extracted data point(s) may be surfaced to the user.
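As an illustration of what "a cluster of structured documents that share fixed content" could look like in practice, the sketch below groups documents by a structural signature and keeps the segments that are identical across a cluster. The abstract does not say how clusters or templates are formed, so the `Document` representation, the signature-based grouping, and all function names here are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical representation: an ordered list of (path, text) segments,
# with each path assumed unique within a document.
Document = List[Tuple[str, str]]


def structural_signature(doc: Document) -> Tuple[str, ...]:
    """The ordered sequence of paths, ignoring the text at each path."""
    return tuple(path for path, _ in doc)


def cluster_documents(docs: List[Document]) -> Dict[Tuple[str, ...], List[Document]]:
    """Group documents that have the same structural signature."""
    clusters: Dict[Tuple[str, ...], List[Document]] = defaultdict(list)
    for doc in docs:
        clusters[structural_signature(doc)].append(doc)
    return dict(clusters)


def build_template(cluster: List[Document]) -> Dict[str, str]:
    """Keep only segments whose text is identical in every document of the cluster."""
    fixed: Dict[str, str] = {}
    for i, (path, text) in enumerate(cluster[0]):
        if all(doc[i][1] == text for doc in cluster):
            fixed[path] = text
    return fixed
```

Segments left out of the resulting template are the candidates for transient fields; in the abstract, locating those fields is the job of the extraction machine learning model(s), not of a simple comparison like this one.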
18 Claims
1. A computer-implemented method, comprising:
identifying a data extraction template generated from a cluster of electronic messages that share at least some underlying structural and textual similarities;
applying features of the cluster of electronic messages as input to one or more category machine learning models, wherein the one or more category machine learning models are trained to classify electronic messages into one or more of a plurality of document categories;
determining a document category associated with the data extraction template based on output generated over the one or more category machine learning models based on the input provided to the one or more category machine learning models;
applying the same features or different features of the cluster of electronic messages as input to one or more extraction machine learning models, wherein the one or more extraction machine learning models are trained to provide one or more locations of one or more transient fields in electronic messages, and wherein the one or more extraction machine learning models are selected from a plurality of extraction machine learning models based on the determined document category;
determining one or more locations of one or more transient fields in the cluster of electronic messages based on output generated from the one or more extraction machine learning models based on the input provided to the one or more extraction machine learning models;
storing, in computer memory, a first association between the data extraction template and the determined one or more transient field locations in the cluster of electronic messages;
extracting at least two data points from a given electronic message of a user that shares at least some structural and textual similarities with the cluster of electronic messages, wherein the extracting is based on the first association; and
providing the at least two extracted data points for surfacing to the user via one or more computing devices operated by the user.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system comprising one or more processors and memory operably coupled with the one or more processors, wherein the memory stores instructions that, in response to execution of the instructions by one or more processors, cause the one or more processors to:
identify a data extraction template generated from a cluster of electronic messages that share at least some underlying structural and textual similarities;
apply features of the cluster of electronic messages as input to one or more category machine learning models, wherein the one or more category machine learning models are trained to classify electronic messages into one or more of a plurality of document categories;
determine a document category associated with the data extraction template based on output generated over the one or more category machine learning models based on the input provided to the one or more category machine learning models;
apply the same features or different features of the cluster of electronic messages as input to one or more extraction machine learning models, wherein the one or more extraction machine learning models are trained to provide one or more locations of one or more transient fields in electronic messages, wherein the one or more extraction machine learning models are selected from a plurality of extraction machine learning models based on the determined document category;
determine one or more locations of one or more transient fields in the cluster of electronic messages based on output generated from the one or more extraction machine learning models based on the input provided to the one or more extraction machine learning models;
store, in the memory, a first association between the data extraction template and the determined one or more transient field locations in the cluster of electronic messages;
extract at least two data points from a given electronic message of a user that shares at least some structural and textual similarities with the cluster of electronic messages, wherein the extraction is based on the first association; and
provide the at least two extracted data points for surfacing to the user via one or more computing devices operated by the user.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. At least one non-transitory computer-readable medium comprising instructions that, in response to execution of the instructions by one or more processors, cause the one or more processors to perform the following operations:
identifying a data extraction template generated from a cluster of electronic messages that share at least some underlying structural and textual similarities;
applying features of the cluster of electronic messages as input to one or more category machine learning models, wherein the one or more category machine learning models are trained to classify electronic messages into one or more of a plurality of document categories; and
determining a document category associated with the data extraction template based on output generated over the one or more category machine learning models based on the input provided to the one or more category machine learning models;
applying features of the cluster of electronic messages as input to one or more extraction machine learning models, wherein the one or more extraction machine learning models are trained to provide one or more locations of one or more transient fields in electronic messages, wherein the one or more extraction machine learning models are selected from a plurality of extraction machine learning models based on the determined document category;
determining one or more locations of one or more transient fields in the cluster of electronic messages based on output generated from the one or more extraction machine learning models based on the input provided to the one or more extraction machine learning models;
storing, in computer memory, a first association between the data extraction template and the determined document category, and a second association between the data extraction template and the determined one or more transient field locations in the cluster of electronic messages;
extracting at least two data points from a given electronic message of a user that shares at least some structural and textual similarities with the cluster of electronic messages, wherein the extraction is based on the first and second associations; and
providing the at least two extracted data points for surfacing to the user via one or more computing devices operated by the user.
Dependent claims: 18
Specification