Identification of key segments in document images

US 10,699,112 B1
Filed: 09/28/2018
Issued: 06/30/2020
Est. Priority Date: 09/28/2018
Status: Active Grant

Abstract

A system and method for automatically learning new keywords in a document image based on context, such as when a never-before-seen keyword is surrounded by other key-value pairs. A machine-learning-based approach leverages subword embeddings and two-dimensional geometric contexts in a gradient boosted trees classifier. Keys may be composed of multi-word strings or single-word strings.
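
The abstract names a gradient boosted trees classifier, but this record does not show how it is configured. As a rough illustration only, the sketch below boosts depth-one regression trees (stumps) on squared-error residuals and thresholds the score at 0.5. The loss, tree depth, learning rate, round count, and every function name here are assumptions chosen for brevity, not details taken from the patent.

```python
def fit_stump(X, residuals):
    """Pick the single-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue  # a split must leave samples on both sides
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    if best is None:
        # No valid split exists: fall back to predicting the mean residual.
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_boosted(X, y, n_rounds=20, lr=0.5):
    """Fit stumps one after another, each on the residuals left by the previous ones."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + lr * stump_predict(stump, row) for p, row in zip(preds, X)]
    return base, lr, stumps


def predict(model, row):
    """Return 1 (keyword) or 0 (not a keyword) for one feature vector."""
    base, lr, stumps = model
    score = base + lr * sum(stump_predict(s, row) for s in stumps)
    return int(score >= 0.5)


# Toy feature vectors (made up): label 1 marks a keyword segment.
X = [[0.1, 1.0], [0.2, 0.9], [0.8, 0.1], [0.9, 0.2]]
y = [0, 0, 1, 1]
model = fit_boosted(X, y)
```

In practice an established gradient boosting library would replace this hand-rolled loop; the sketch only shows the residual-fitting idea behind the term.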

18 Claims

1. A computerized method for identifying keywords in a document image, comprising:
- (i) retrieving a document image from a set of document images where each document in the set of document images contains information organized in a two-dimensional structure and contains keywords, where each keyword of a set of the keywords has a value associated therewith;
  
  (ii) processing the document image to identify text segments contained within the document image;
  
  (iii) processing the text segments to identify subword embeddings associated with each of the text segments, wherein each of the subword embeddings associated with a text segment represents a character group in the document image;
  
  (iv) generating an n-dimensional vector for each text segment from its subword embeddings;
  
  (v) for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment;
  
  (vi) retrieving an annotated version of the document image containing a visual indication annotation associated with each visual indication of a keyword in the document;
  
  (vii) associating with each visual indication of a keyword in the annotated version of the document image a corresponding feature vector to generate a training document; and
  
  (viii) repeating steps (i) through (vii) for each document from the set of document images to generate a set of training documents.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The computerized method of claim 1 wherein each of the subword embeddings utilize a vector representation of one or more n-character groupings of a word, where n is a preselected integer and where a word is represented by a sum of the vector representations.
  - 3. The computerized method of claim 1 wherein step (v) comprises:
    - selecting for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment.
  - 4. The computerized method of claim 1 wherein step (v) comprises:
    - selecting for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment, and which overlap the identified text segment by greater than a preselected overlap amount.
  - 5. The computerized method of claim 1 wherein step (v) comprises:
    - selecting for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment, and wherein the feature vector comprises a concatenation of vectors corresponding to the identified text segment and of a vector corresponding to each of the text segments that are positioned above, below, to the left and to the right of the identified text segment.
  - 6. The computerized method of claim 1 further comprising providing the set of training documents to a supervised learning engine to generate a trained model.
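
The subword representation recited in claim 2 (a word represented by a sum of vectors for its n-character groupings) can be illustrated with a minimal Python sketch. The claims do not say how the n-gram vectors are obtained, so `ngram_vector` below is a stand-in that derives a deterministic pseudo-random vector from a hash; a real system would presumably use learned embeddings. The dimension `DIM`, the `<` and `>` boundary markers, the choice n = 3, and summing word vectors to get a segment vector are all assumptions made for the example.

```python
import hashlib
import random

DIM = 8  # hypothetical size; the claims only say "n-dimensional"


def char_ngrams(word, n=3):
    """Return the n-character groupings of a word, with boundary markers added."""
    padded = f"<{word.lower()}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_vector(ngram, dim=DIM):
    """Stand-in for a learned subword embedding (assumption, not from the patent)."""
    digest = hashlib.sha256(ngram.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def word_vector(word, n=3, dim=DIM):
    """Represent a word as the sum of its n-gram vectors, as in claim 2."""
    total = [0.0] * dim
    for gram in char_ngrams(word, n):
        for i, value in enumerate(ngram_vector(gram, dim)):
            total[i] += value
    return total


def segment_vector(text, n=3, dim=DIM):
    """Combine word vectors into one vector per text segment (summing is an assumption)."""
    total = [0.0] * dim
    for word in text.split():
        for i, value in enumerate(word_vector(word, n, dim)):
            total[i] += value
    return total


print(char_ngrams("total"))
print(len(segment_vector("Invoice No")))
```

Because words that share character groups share n-gram vectors, a keyword absent from the training vocabulary still receives a vector related to those of similar known keywords.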

7. A document processing system comprising:
- data storage for storing a set of document images where each document in the set of document images contains information organized in a two-dimensional structure and contains keywords, where each keyword of a set of the keywords has a value associated therewith; and
  
  a processor operatively coupled to the data storage and configured to execute instructions that when executed cause the processor to generate a set of training documents from at least a portion of the documents in the set of document images by, for each document in the portion of the documents in the set of document images;
  
  retrieving a document image from the data storage;
  
  processing the document image to identify text segments contained within the document image;
  
  processing the text segments to identify subword embeddings associated with each of the text segments, wherein each subword embedding associated with a text segment represents a character group in the document image;
  
  generating an n-dimensional vector for each text segment from its subword embeddings;
  
  for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment;
  
  retrieving an annotated version of the document image containing a visual indication annotation associated with each visual indication of a keyword in the document; and
  
  associating with each visual indication of a keyword in the annotated version of the document image a corresponding feature vector to generate a training document for the set of training documents.
- Dependent claims (8, 9, 10, 11, 12):
  - 8. The document processing system of claim 7 wherein the subword embeddings utilize a vector representation of one or more n-character groupings of a word, where n is a preselected integer and where a word is represented by a sum of the vector representations.
  - 9. The document processing system of claim 7 wherein the instructions that when executed cause, for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment, comprise instructions that when executed cause the processor to:
    - select for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment.
  - 10. The document processing system of claim 7 wherein the instructions that when executed cause, for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment, comprise instructions that when executed cause the processor to:
    - select for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment and which overlap the identified text segment by greater than a preselected overlap amount.
  - 11. The document processing system of claim 7 wherein the instructions that when executed cause, for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment, comprise instructions that when executed cause the processor to:
    - select for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment, and wherein the feature vector comprises a concatenation of vectors corresponding to the identified text segment and of a vector corresponding to each of the text segments that are positioned above, below, to the left and to the right of the identified text segment.
  - 12. The document processing system of claim 7 further comprising instructions that when executed cause the processor to provide the set of training documents to a supervised learning engine to generate a trained model.
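
Claims 9 through 11 describe selecting the text segments above, below, left, and right of a segment, optionally requiring an overlap greater than a preselected amount, and concatenating their vectors into the feature vector. One possible reading is sketched below. The `Segment` type, the bounding-box convention (y grows downward), the overlap measure (fraction of the target's extent covered along the shared axis), picking only the nearest neighbor per direction, and zero-filling missing neighbors are all assumptions; the claims do not fix any of them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    vec: List[float]


def overlap_fraction(a0, a1, b0, b1):
    """Fraction of interval [a0, a1] that is covered by interval [b0, b1]."""
    inter = min(a1, b1) - max(a0, b0)
    length = a1 - a0
    return max(inter, 0.0) / length if length > 0 else 0.0


def nearest_neighbor(target, segments, direction, min_overlap=0.5) -> Optional[Segment]:
    """Nearest segment in one direction whose overlap exceeds min_overlap."""
    best, best_gap = None, float("inf")
    for s in segments:
        if s is target:
            continue
        if direction in ("left", "right"):
            ov = overlap_fraction(target.y0, target.y1, s.y0, s.y1)
            gap = target.x0 - s.x1 if direction == "left" else s.x0 - target.x1
        else:
            ov = overlap_fraction(target.x0, target.x1, s.x0, s.x1)
            gap = target.y0 - s.y1 if direction == "above" else s.y0 - target.y1
        if ov > min_overlap and 0 <= gap < best_gap:
            best, best_gap = s, gap
    return best


def context_feature(target, segments, dim, min_overlap=0.5):
    """Concatenate the target vector with its above/below/left/right neighbor vectors."""
    feature = list(target.vec)
    for direction in ("above", "below", "left", "right"):
        neighbor = nearest_neighbor(target, segments, direction, min_overlap)
        feature.extend(neighbor.vec if neighbor is not None else [0.0] * dim)
    return feature


# Toy layout (made up): "Tax" sits above "Total", and "42.00" is to its right.
total = Segment("Total", 10, 10, 50, 20, [1.0, 1.0])
value = Segment("42.00", 60, 10, 100, 20, [2.0, 2.0])
tax = Segment("Tax", 10, 0, 50, 8, [3.0, 3.0])
feature = context_feature(total, [total, value, tax], dim=2)
```

The resulting vector has a fixed length of five times the segment dimension, which keeps every row the same shape for a downstream classifier.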

13. A computer program product for generating a set of training documents, the computer program product comprising a non-transitory computer readable storage medium and including instructions for causing the computer system to execute a method for generating a set of training documents, the method comprising the actions of:
- retrieving a document image from data storage which has stored thereon a set of document images where each document in the set of document images contains information organized in a two-dimensional structure and contains keywords, where each keyword of a set of the keywords has a value associated therewith;
  
  generating the set of training documents from at least a portion of the documents in the set of document images, by, for each document in the portion of the documents in the set of document images;
  
  processing the document image to identify text segments contained within the document image;
  
  processing the text segments to identify subword embeddings associated with each of the text segments, wherein each subword embedding associated with a text segment represents a character group in the document image;
  
  generating an n-dimensional vector for each text segment from its subword embeddings;
  
  for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment;
  
  retrieving an annotated version of the document image containing a visual indication annotation associated with each visual indication of a keyword in the document; and
  
  associating with each visual indication of a keyword in the annotated version of the document image a corresponding feature vector to generate a training document for the set of training documents.
- Dependent claims (14, 15, 16, 17, 18):
  - 14. The computer program product of claim 13 wherein the subword embeddings utilize a vector representation of one or more n-character groupings of a word, where n is a preselected integer and where a word is represented by a sum of the vector representations.
  - 15. The computer program product of claim 13 wherein the operation of for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment, comprises:
    - selecting for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment.
  - 16. The computer program product of claim 13 wherein the operation of for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment, comprises:
    - selecting for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment and which overlap the identified text segment by greater than a preselected overlap amount.
  - 17. The computer program product of claim 13 wherein the operation of for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment, comprises:
    - selecting for the identified text segment one or more of text segments that are positioned above, below, to the left and to the right of the identified text segment, and wherein the feature vector comprises a concatenation of vectors corresponding to the identified text segment and of a vector corresponding to each of the text segments that are positioned above, below, to the left and to the right of the identified text segment.
  - 18. The computer program product of claim 13 further comprising the operation of providing the set of training documents to a supervised learning engine to generate a trained model.
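
The final steps of the independent claims associate each visual indication of a keyword in the annotated image with a feature vector, yielding one training document per image. A minimal sketch of that association follows, assuming the visual indications have already been extracted as rectangles and that a segment counts as a keyword when its box matches an annotation box by intersection-over-union. The data layout, the IoU criterion, and the 0.5 threshold are hypothetical choices, not details from the claims.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def build_training_document(
    segments: List[Tuple[Box, List[float]]],
    keyword_boxes: List[Box],
    min_iou: float = 0.5,
) -> List[Tuple[List[float], int]]:
    """Pair each segment's feature vector with a label: 1 if annotated as a keyword."""
    rows = []
    for box, feature in segments:
        label = int(any(iou(box, kb) >= min_iou for kb in keyword_boxes))
        rows.append((feature, label))
    return rows


def build_training_set(documents: List[Dict]) -> List[List[Tuple[List[float], int]]]:
    """Repeat the labeling for every document to produce the set of training documents."""
    return [
        build_training_document(doc["segments"], doc["keyword_boxes"])
        for doc in documents
    ]


# One toy document (made up): the first segment is boxed as a keyword.
docs = [
    {
        "segments": [
            ((10, 10, 50, 20), [0.1, 0.2]),
            ((60, 10, 100, 20), [0.3, 0.4]),
        ],
        "keyword_boxes": [(9, 9, 51, 21)],
    }
]
training_set = build_training_set(docs)
```

The labeled rows could then be handed to a supervised learning engine, as claims 6, 12, and 18 recite.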

Current Assignee: Automation Anywhere, Inc.
Original Assignee: Automation Anywhere, Inc.
Inventors: Corcoran, Thomas; Gejji, Vibhas; Van Lare, Stephen
Primary Examiner(s): Osinski, Michael S
Application Number: US16/146,562
Time in Patent Office: 641 Days
CPC Class Codes

G06F 16/313   Selection or weighting of t...

G06F 16/3347   using vector based model

G06F 16/34   Browsing; Visualisation the...

G06F 16/5846   using extracted text

G06F 18/214   Generating training pattern...

G06N 20/00   Machine learning

G06N 20/20   Ensemble learning

G06N 5/01   Dynamic search techniques; ...

G06V 10/23   based on positionally close...

G06V 30/414   Extracting the geometrical ...

G06V 30/416   Extracting the logical stru...
