Identification of key segments in document images
Abstract
A system and method for automatically learning new keywords in a document image based on context, such as when a never-before-seen keyword is surrounded by other key-value pairs. A machine-learning-based approach leverages subword embeddings and two-dimensional geometric contexts in a gradient boosted trees classifier. Keys may be composed of multi-word strings or single-word strings.
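The role of subword embeddings can be illustrated with a minimal sketch. It assumes that the "character groups" are character n-grams, that a hashed lookup of pseudo-random values stands in for trained embeddings, and that simple averaging produces the n-dimensional vector. None of these choices is stated in the source, and the names `char_ngrams`, `subword_vector`, `segment_vector`, `DIM`, and `BUCKETS` are hypothetical.

```python
import math
import random
import zlib

DIM = 16        # n, the embedding dimension (illustrative value)
BUCKETS = 2048  # number of hashed subword slots (illustrative value)


def char_ngrams(text, n_min=3, n_max=5):
    """Return the character groups (n-grams) of a text segment, with boundary markers."""
    padded = f"<{text.lower()}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def subword_vector(ngram):
    """Return a deterministic stand-in embedding for one subword.

    Assumption: a trained model would supply learned values; here they are
    seeded from a hash bucket only so the sketch runs without external data.
    """
    bucket = zlib.crc32(ngram.encode("utf-8")) % BUCKETS
    rng = random.Random(bucket)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def segment_vector(text):
    """Average the subword embeddings of a segment into one n-dimensional vector."""
    grams = char_ngrams(text)
    if not grams:
        return [0.0] * DIM
    total = [0.0] * DIM
    for gram in grams:
        for i, value in enumerate(subword_vector(gram)):
            total[i] += value
    return [t / len(grams) for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Because a string such as "Invoice Num" shares most of its character groups with "Invoice Number", the two segments tend to receive related vectors even when one of them was never seen before. This is the property that lets a classifier generalize to new keywords.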
18 Claims
1. A computerized method for identifying keywords in a document image, comprising:
(i) retrieving a document image from a set of document images where each document in the set of document images contains information organized in a two-dimensional structure and contains keywords, where each keyword of a set of the keywords has a value associated therewith;
(ii) processing the document image to identify text segments contained within the document image;
(iii) processing the text segments to identify subword embeddings associated with each of the text segments, wherein each of the subword embeddings associated with a text segment represents a character group in the document image;
(iv) generating an n-dimensional vector for each text segment from its subword embeddings;
(v) for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment;
(vi) retrieving an annotated version of the document image containing a visual indication annotation associated with each visual indication of a keyword in the document;
(vii) associating with each visual indication of a keyword in the annotated version of the document image a corresponding feature vector to generate a training document; and
(viii) repeating steps (i) through (vii) for each document from the set of document images to generate a set of training documents.
Dependent claims: 2, 3, 4, 5, 6
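One way to read the step that maps n-dimensional vectors onto each text segment to describe its local context is sketched below. This is an illustration under stated assumptions, not the claimed method: it assumes "local context" means the k nearest segments on the page, measured between bounding-box centres, and it appends each neighbour's vector and relative offset to the segment's own vector. The `Segment` class, `context_feature`, and the parameter `k` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    text: str
    x: float             # centre of the segment's bounding box (page coordinates)
    y: float
    vector: List[float]  # n-dimensional vector derived from subword embeddings


def context_feature(segments: List[Segment], index: int, k: int = 2) -> List[float]:
    """Build a feature vector for segments[index] from its 2-D neighbourhood.

    Layout: own vector, then for each of the k nearest neighbours its vector
    followed by its (dx, dy) offset. Missing neighbours are zero-padded so
    every feature vector has the same length.
    """
    target = segments[index]
    dim = len(target.vector)
    others = [s for i, s in enumerate(segments) if i != index]
    others.sort(key=lambda s: math.hypot(s.x - target.x, s.y - target.y))

    feature = list(target.vector)
    for j in range(k):
        if j < len(others):
            neighbour = others[j]
            feature += neighbour.vector
            feature += [neighbour.x - target.x, neighbour.y - target.y]
        else:
            feature += [0.0] * (dim + 2)
    return feature


# Toy page: a key, the value to its right, and an unrelated label further down.
page = [
    Segment("Invoice No.", 10.0, 10.0, [1.0, 0.0]),
    Segment("12345", 60.0, 10.0, [0.0, 1.0]),
    Segment("Total", 10.0, 200.0, [0.5, 0.5]),
]
feature = context_feature(page, 0, k=2)
```

Each feature vector has length `dim + k * (dim + 2)`. In this arrangement, a segment that sits next to other key-value pairs carries their embeddings and geometry along with its own, which is the kind of contextual signal the abstract describes.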
7. A document processing system comprising:
data storage for storing a set of document images where each document in the set of document images contains information organized in a two-dimensional structure and contains keywords, where each keyword of a set of the keywords has a value associated therewith; and
a processor operatively coupled to the data storage and configured to execute instructions that when executed cause the processor to generate a set of training documents from at least a portion of the documents in the set of document images by, for each document in the portion of the documents in the set of document images:
  retrieving a document image from the data storage;
  processing the document image to identify text segments contained within the document image;
  processing the text segments to identify subword embeddings associated with each of the text segments, wherein each subword embedding associated with a text segment represents a character group in the document image;
  generating an n-dimensional vector for each text segment from its subword embeddings;
  for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment;
  retrieving an annotated version of the document image containing a visual indication annotation associated with each visual indication of a keyword in the document; and
  associating with each visual indication of a keyword in the annotated version of the document image a corresponding feature vector to generate a training document for the set of training documents.
Dependent claims: 8, 9, 10, 11, 12
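The association between annotated keywords and feature vectors could be sketched as follows. The sketch assumes that each visual indication in the annotated image has already been reduced to a rectangle, and that a text segment counts as a keyword when at least half of its area lies inside such a rectangle. The source specifies neither the annotation format nor a threshold, and `overlap_ratio`, `build_training_document`, and `build_training_set` are hypothetical names.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def overlap_ratio(segment_box: Box, annotation_box: Box) -> float:
    """Fraction of the segment's area that falls inside the annotation box."""
    left = max(segment_box[0], annotation_box[0])
    top = max(segment_box[1], annotation_box[1])
    right = min(segment_box[2], annotation_box[2])
    bottom = min(segment_box[3], annotation_box[3])
    intersection = max(0.0, right - left) * max(0.0, bottom - top)
    area = (segment_box[2] - segment_box[0]) * (segment_box[3] - segment_box[1])
    return intersection / area if area > 0 else 0.0


def build_training_document(
    segment_boxes: Sequence[Box],
    features: Sequence[List[float]],
    annotation_boxes: Sequence[Box],
    threshold: float = 0.5,  # assumed cut-off, not taken from the source
) -> List[Dict]:
    """Pair every segment's feature vector with a keyword / non-keyword label."""
    rows = []
    for box, feature in zip(segment_boxes, features):
        is_keyword = any(
            overlap_ratio(box, annotation) >= threshold
            for annotation in annotation_boxes
        )
        rows.append({"features": feature, "label": 1 if is_keyword else 0})
    return rows


def build_training_set(documents) -> List[List[Dict]]:
    """Repeat the labelling for each (segment_boxes, features, annotation_boxes) triple."""
    return [build_training_document(*document) for document in documents]


# Toy document: two segments, one of which is marked as a keyword.
doc = (
    [(0.0, 0.0, 10.0, 10.0), (20.0, 0.0, 30.0, 10.0)],  # segment boxes
    [[1.0], [2.0]],                                      # feature vectors
    [(0.0, 0.0, 10.0, 10.0)],                            # annotation boxes
)
training_set = build_training_set([doc])
```

The resulting rows are in the shape a classifier such as the gradient boosted trees model named in the abstract would typically consume: one feature vector and one label per text segment.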
13. A computer program product for generating a set of training documents, the computer program product comprising a non-transitory computer readable storage medium and including instructions for causing the computer system to execute a method for generating a set of training documents, the method comprising the actions of:
retrieving a document image from data storage which has stored thereon a set of document images where each document in the set of document images contains information organized in a two-dimensional structure and contains keywords, where each keyword of a set of the keywords has a value associated therewith;
generating the set of training documents from at least a portion of the documents in the set of document images, by, for each document in the portion of the documents in the set of document images:
  processing the document image to identify text segments contained within the document image;
  processing the text segments to identify subword embeddings associated with each of the text segments, wherein each subword embedding associated with a text segment represents a character group in the document image;
  generating an n-dimensional vector for each text segment from its subword embeddings;
  for each identified text segment, mapping one or more of the n-dimensional vectors to each of the identified text segments to generate for each identified text segment, a feature vector which describes a local context of the identified text segment;
  retrieving an annotated version of the document image containing a visual indication annotation associated with each visual indication of a keyword in the document; and
  associating with each visual indication of a keyword in the annotated version of the document image a corresponding feature vector to generate a training document for the set of training documents.
Dependent claims: 14, 15, 16, 17, 18
Specification