Template-based structured document classification and extraction

US 10,657,158 B2
Filed: 11/23/2016
Issued: 05/19/2020
Est. Priority Date: 11/23/2016
Status: Active Grant

2 Assignments

0 Petitions

Abstract

Techniques are described herein for automatically generating data extraction templates for structured documents (e.g., B2C emails, invoices, bills, invitations, etc.), and for assigning classifications to those data extraction templates to streamline data extraction from subsequent structured documents. In various implementations, a data extraction template generated from a cluster of structured documents that share fixed content may be identified. Features of the cluster of structured documents may be applied as input to extraction machine learning model(s) trained to provide location(s) of transient field(s) in structured documents, to determine location(s) of transient field(s) in the cluster of structured documents. An association between the data extraction template and the determined transient field location(s) may be stored. Based on the association, data point(s) may be extracted from a given structured document of a user that shares fixed content with the cluster of structured documents. The extracted data point(s) may be surfaced to the user.
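
The distinction the abstract draws between fixed content and transient fields can be illustrated with a minimal sketch. It assumes messages are already tokenized into equal-length lists and substitutes a plain position-wise comparison for the trained extraction machine learning models the patent describes; the function name `split_fixed_and_transient` and the sample messages are hypothetical.

```python
from typing import List, Tuple


def split_fixed_and_transient(cluster: List[List[str]]) -> Tuple[List[int], List[int]]:
    """Return (fixed_positions, transient_positions) for equal-length token lists.

    Assumption: a position is "fixed" when every message in the cluster has the
    same token there, and "transient" otherwise. This is a stand-in heuristic,
    not the patent's machine learning models.
    """
    fixed, transient = [], []
    for i, column in enumerate(zip(*cluster)):
        (fixed if len(set(column)) == 1 else transient).append(i)
    return fixed, transient


# Hypothetical cluster of messages that share fixed content.
cluster = [
    "Your order 1001 ships on Monday".split(),
    "Your order 2002 ships on Friday".split(),
    "Your order 3003 ships on Sunday".split(),
]
fixed_positions, transient_positions = split_fixed_and_transient(cluster)

# Apply the stored transient positions to a later message with the same layout.
new_message = "Your order 4004 ships on Tuesday".split()
data_points = [new_message[i] for i in transient_positions]
print(data_points)
```

Once the transient positions are stored against the template, extraction from a later message reduces to a lookup, which is the streamlining effect the abstract refers to.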

19 Citations

18 Claims

1. A computer-implemented method, comprising:
- identifying a data extraction template generated from a cluster of electronic messages that share at least some underlying structural and textual similarities;
  
  applying features of the cluster of electronic messages as input to one or more category machine learning models, wherein the one or more category machine learning models are trained to classify electronic messages into one or more of a plurality of document categories;
  
  determining a document category associated with the data extraction template based on output generated over the one or more category machine learning models based on the input provided to the one or more category machine learning models;
  
  applying the same features or different features of the cluster of electronic messages as input to one or more extraction machine learning models, wherein the one or more extraction machine learning models are trained to provide one or more locations of one or more transient fields in electronic messages, and wherein the one or more extraction machine learning models are selected from a plurality of extraction machine learning models based on the determined document category;
  
  determining one or more locations of one or more transient fields in the cluster of electronic messages based on output generated from the one or more extraction machine learning models based on the input provided to the one or more extraction machine learning models;
  
  storing, in computer memory, a first association between the data extraction template and the determined one or more transient field locations in the cluster of electronic messages;
  
  extracting at least two data points from a given electronic message of a user that shares at least some structural and textual similarities with the cluster of electronic messages, wherein the extracting is based on the first association; and
  
  providing the at least two extracted data points for surfacing to the user via one or more computing devices operated by the user.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, wherein the extracted data point is surfaced to the user in a manner that is selected based on the determined document category.
  - 3. The method of claim 1, further comprising storing, in the computer memory, a second association between the data extraction template and the determined document category.
  - 4. The method of claim 3, wherein the second association is stored in response to a determination that a count of electronic messages in the cluster that were classified into the document category satisfies a threshold.
  - 5. The method of claim 1, wherein the electronic messages comprise emails, SMS messages, or MMS messages.
  - 6. The method of claim 1, wherein the one or more extraction machine learning models are further trained to provide, in association with the one or more transient field locations, one or more semantic classifications, and wherein the first association further includes an association between the data extraction template and one or more semantic classifications.
  - 7. The method of claim 6, wherein the extracted data point is surfaced to the user in a manner that is selected based on a semantic classification of the one or more semantic classifications that is associated with a transient field location of the one or more transient field locations that contained the extracted data point.
  - 8. The method of claim 1, wherein the first association is stored in response to a determination that a count of electronic messages in the cluster for which a particular transient field location is provided by the one or more extraction machine learning models satisfies a threshold.
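
The flow recited in claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the category and extraction "models" are hand-written stubs standing in for trained machine learning models, and every name (`category_model`, `travel_extractor`, `purchase_extractor`, `EXTRACTORS`, `template_store`, `tmpl-1`) is hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List

Message = List[str]  # assumption: a message is represented as a list of tokens


def category_model(message: Message) -> str:
    """Stub category classifier standing in for a trained category model."""
    return "travel" if "flight" in message else "purchase"


def travel_extractor(message: Message) -> List[int]:
    """Stub extraction model: positions right after the words 'flight' and 'to'."""
    return [i + 1 for i, tok in enumerate(message[:-1]) if tok in {"flight", "to"}]


def purchase_extractor(message: Message) -> List[int]:
    """Stub extraction model: positions of all-digit tokens."""
    return [i for i, tok in enumerate(message) if tok.isdigit()]


# One extraction model per document category; the category selects the model.
EXTRACTORS: Dict[str, Callable[[Message], List[int]]] = {
    "travel": travel_extractor,
    "purchase": purchase_extractor,
}

# Association between a template identifier and its transient field locations.
template_store: Dict[str, List[int]] = {}


def build_association(template_id: str, cluster: List[Message]) -> str:
    """Classify the cluster, pick an extractor by category, store the locations."""
    category = Counter(category_model(m) for m in cluster).most_common(1)[0][0]
    extractor = EXTRACTORS[category]
    # Keep only locations reported for every message in the cluster.
    common = set(extractor(cluster[0]))
    for message in cluster[1:]:
        common &= set(extractor(message))
    template_store[template_id] = sorted(common)
    return category


def extract(template_id: str, message: Message) -> List[str]:
    """Extract data points from a new message using the stored association."""
    return [message[i] for i in template_store[template_id]]


cluster = [
    "Your flight UA123 to Denver is confirmed".split(),
    "Your flight DL456 to Boston is confirmed".split(),
]
category = build_association("tmpl-1", cluster)
points = extract("tmpl-1", "Your flight AA789 to Austin is confirmed".split())
print(category, points)
```

The design point the sketch tries to capture is that the models run once per cluster, when the association is built, while each later message only needs the stored locations.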

9. A system comprising one or more processors and memory operably coupled with the one or more processors, wherein the memory stores instructions that, in response to execution of the instructions by one or more processors, cause the one or more processors to:
- identify a data extraction template generated from a cluster of electronic messages that share at least some underlying structural and textual similarities;
  
  apply features of the cluster of electronic messages as input to one or more category machine learning models, wherein the one or more category machine learning models are trained to classify electronic messages into one or more of a plurality of document categories;
  
  determine a document category associated with the data extraction template based on output generated over the one or more category machine learning models based on the input provided to the one or more category machine learning models;
  
  apply the same features or different features of the cluster of electronic messages as input to one or more extraction machine learning models, wherein the one or more extraction machine learning models are trained to provide one or more locations of one or more transient fields in electronic messages, wherein the one or more extraction machine learning models are selected from a plurality of extraction machine learning models based on the determined document category;
  
  determine one or more locations of one or more transient fields in the cluster of electronic messages based on output generated from the one or more extraction machine learning models based on the input provided to the one or more extraction machine learning models;
  
  store, in the memory, a first association between the data extraction template and the determined one or more transient field locations in the cluster of electronic messages;
  
  extract at least two data points from a given electronic message of a user that shares at least some structural and textual similarities with the cluster of electronic messages, wherein the extraction is based on the first association; and
  
  provide the at least two extracted data points for surfacing to the user via one or more computing devices operated by the user.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The system of claim 9, wherein the extracted data point is surfaced to the user in a manner that is selected based on the determined document category.
  - 11. The system of claim 9, further comprising instructions to store, in the computer memory, a second association between the data extraction template and the determined document category.
  - 12. The system of claim 11, wherein the second association is stored in response to a determination that a count of electronic messages in the cluster that were classified into the document category satisfies a threshold.
  - 13. The system of claim 9, wherein the electronic messages comprise emails, SMS messages, or MMS messages.
  - 14. The system of claim 9, wherein the one or more extraction machine learning models are further trained to provide, in association with the one or more transient field locations, one or more semantic classifications, and wherein the first association further includes an association between the data extraction template and one or more semantic classifications.
  - 15. The system of claim 14, wherein the extracted data point is surfaced to the user in a manner that is selected based on a semantic classification of the one or more semantic classifications that is associated with a transient field location of the one or more transient field locations that contained the extracted data point.
  - 16. The system of claim 9, wherein the first association is stored in response to a determination that a count of electronic messages in the cluster for which a particular transient field location is provided by the one or more extraction machine learning models satisfies a threshold.
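
Claims 12 and 16 gate the storing of an association on a count of messages satisfying a threshold. A minimal sketch of that gating follows, assuming "satisfies" means "greater than or equal to", which the claims do not specify; the helper name `items_meeting_threshold` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Set


def items_meeting_threshold(per_message_items: List[Set[Hashable]],
                            min_count: int) -> Set[Hashable]:
    """Return the items reported for at least `min_count` messages in a cluster.

    An "item" may be a document category (as in claim 12) or a transient
    field location (as in claim 16).
    """
    counts = Counter(item for items in per_message_items for item in items)
    return {item for item, n in counts.items() if n >= min_count}


# Hypothetical per-message model outputs for a four-message cluster.
categories_per_message = [{"receipt"}, {"receipt"}, {"receipt"}, {"promo"}]
locations_per_message = [{2, 5}, {2, 5}, {2}, {2, 7}]

# Store an association only for items seen in at least three messages.
stored_categories = items_meeting_threshold(categories_per_message, min_count=3)
stored_locations = items_meeting_threshold(locations_per_message, min_count=3)
print(sorted(stored_categories), sorted(stored_locations))
```

Gating on a count keeps a single misclassified or oddly formatted message from attaching a wrong category or field location to the whole template.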

17. At least one non-transitory computer-readable medium comprising instructions that, in response to execution of the instructions by one or more processors, cause the one or more processors to perform the following operations:
- identifying a data extraction template generated from a cluster of electronic messages that share at least some underlying structural and textual similarities;
  
  applying features of the cluster of electronic messages as input to one or more category machine learning models, wherein the one or more category machine learning models are trained to classify electronic messages into one or more of a plurality of document categories; and
  
  determining a document category associated with the data extraction template based on output generated over the one or more category machine learning models based on the input provided to the one or more category machine learning models;
  
  applying features of the cluster of electronic messages as input to one or more extraction machine learning models, wherein the one or more extraction machine learning models are trained to provide one or more locations of one or more transient fields in electronic messages, wherein the one or more extraction machine learning models are selected from a plurality of extraction machine learning models based on the determined document category;
  
  determining one or more locations of one or more transient fields in the cluster of electronic messages based on output generated from the one or more extraction machine learning models based on the input provided to the one or more extraction machine learning models;
  
  storing, in computer memory, a first association between the data extraction template and the determined document category, and a second association between the data extraction template and the determined one or more transient field locations in the cluster of electronic messages;
  
  extracting at least two data points from a given electronic message of a user that shares at least some structural and textual similarities with the cluster of electronic messages, wherein the extraction is based on the first and second associations; and
  
  providing the at least two extracted data points for surfacing to the user via one or more computing devices operated by the user.
- Dependent claims (18):
  - 18. The at least one non-transitory computer-readable medium of claim 17, wherein the extracted data point is surfaced to the user in a manner that is selected based on the determined document category.
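
Claim 17 stores two associations for a template (its document category and its transient field locations), and claim 18 selects the manner of surfacing from the category. The sketch below illustrates one way those two pieces could fit together; the `TemplateRecord` type, the `surface` function, and both presentation strings are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TemplateRecord:
    category: str               # first association of claim 17
    field_locations: List[int]  # second association of claim 17


def surface(record: TemplateRecord, message: List[str]) -> str:
    """Extract data points by location, then pick a presentation by category."""
    values = [message[i] for i in record.field_locations]
    if record.category == "travel":
        # Hypothetical presentation for travel documents.
        return "Add to calendar: " + " / ".join(values)
    # Hypothetical fallback presentation for any other category.
    return "Summary card: " + ", ".join(values)


record = TemplateRecord(category="travel", field_locations=[2, 4])
message = "Your flight AA789 to Austin is confirmed".split()
print(surface(record, message))
```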

Current Assignee
Google LLC (Alphabet Inc.)
Original Assignee
Google LLC (Alphabet Inc.)
Inventors
Sheng, Ying; Lu, Yifeng; Xie, Jing; Yang, Jie; Pueyo, Luis Garcia; Lou, Jinan; Wendt, James
Primary Examiner(s)
Lin, Shew Fen

Application Number
US15/360,939
Publication Number
US 20180144042A1
Time in Patent Office
1,273 Days
CPC Class Codes

G06F 16/285   Clustering or classification

G06F 16/93   Document management systems

G06F 40/174   Form filling; Merging

G06F 40/186   Templates

G06N 20/00   Machine learning

G06N 20/20   Ensemble learning

G06Q 10/10   Office automation; Time man...
