Document alteration based on native text analysis and OCR
First Claim
1. A system for document alteration based on native text analysis and optical character recognition (OCR), the system comprising:
- at least one processor to:
analyze native text obtained from a native document to identify a text entity in the native document;
use a native application interface to convert the native document to a document image, wherein the native application interface is determined based on a document type of the native document;
perform OCR on the document image to identify a text location of the text entity, wherein the identifying of the text location of the text entity comprises:
recognizing a plurality of words in the document image,
matching a given word of the plurality of words recognized in the document image with the text entity identified by the analyzing of the native text obtained from the native document, wherein the matching comprises matching variations of a root portion of the text entity to the given word of the plurality of words,
generating a plurality of bounding coordinates for each of the plurality of words, wherein the plurality of bounding coordinates describe a bounding rectangle of a plurality of bounding rectangles that surrounds the given word of the plurality of words, and
using the boundary rectangle that surrounds the given word to identify the text location of the text entity; and
generate a redaction box at the text location in the document image to conceal the text entity.
Abstract
Example embodiments relate to document alteration based on native text analysis and optical character recognition (OCR). In example embodiments, a system analyzes native text obtained from a native document to identify a text entity in the native document. At this stage, the system may use a native application interface to convert the native document to a document image and perform OCR on the document image to identify a text location of the text entity. The system may then generate an alteration box (e.g., redaction box, highlight box) at the text location in the document image to alter a presentation of the text entity.
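The flow summarized in the abstract can be illustrated with a minimal Python sketch. It assumes the native-text analysis and the OCR step have already run, and it starts from their outputs: a list of entity strings and a list of OCR words with pixel rectangles. The names `OcrWord`, `AlterationBox`, and `alteration_boxes` are hypothetical and are not taken from the patent; the exact-match rule (case-insensitive, trailing punctuation ignored) is likewise an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels


@dataclass(frozen=True)
class OcrWord:
    """One word recognized in the document image (hypothetical OCR output)."""
    text: str
    box: Rect


@dataclass(frozen=True)
class AlterationBox:
    """A box to draw over the image; kind is 'redaction' or 'highlight'."""
    kind: str
    box: Rect


def alteration_boxes(entities: Iterable[str],
                     ocr_words: List[OcrWord],
                     kind: str = "redaction") -> List[AlterationBox]:
    # Normalize entities once so the comparison is case-insensitive.
    wanted = {e.casefold() for e in entities}
    result = []
    for word in ocr_words:
        # Assumption: trailing punctuation glued to a word by OCR is ignored.
        normalized = word.text.strip(".,;:!?").casefold()
        if normalized in wanted:
            result.append(AlterationBox(kind, word.box))
    return result


ocr_words = [
    OcrWord("Contact", (10, 10, 80, 30)),
    OcrWord("Alice", (90, 10, 140, 30)),
    OcrWord("today.", (150, 10, 210, 30)),
]
boxes = alteration_boxes(["Alice"], ocr_words)
```

Passing `kind="highlight"` yields highlight boxes at the same locations, mirroring the two alteration types the abstract mentions. Rendering the boxes onto the image is left out of the sketch.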
11 Claims
1. A system for document alteration based on native text analysis and optical character recognition (OCR), the system comprising:
- at least one processor to:
analyze native text obtained from a native document to identify a text entity in the native document;
use a native application interface to convert the native document to a document image, wherein the native application interface is determined based on a document type of the native document;
perform OCR on the document image to identify a text location of the text entity, wherein the identifying of the text location of the text entity comprises:
recognizing a plurality of words in the document image,
matching a given word of the plurality of words recognized in the document image with the text entity identified by the analyzing of the native text obtained from the native document, wherein the matching comprises matching variations of a root portion of the text entity to the given word of the plurality of words,
generating a plurality of bounding coordinates for each of the plurality of words, wherein the plurality of bounding coordinates describe a bounding rectangle of a plurality of bounding rectangles that surrounds the given word of the plurality of words, and
using the boundary rectangle that surrounds the given word to identify the text location of the text entity; and
generate a redaction box at the text location in the document image to conceal the text entity.
Dependent claims: 2, 3, 4
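The matching step in claim 1 (variations of a root portion of the text entity matched against OCR words, each carrying bounding coordinates) might look roughly like the Python sketch below. The claim does not say how variations are produced, so the suffix list `SUFFIXES` is purely an assumption, as are the helper names `variations`, `bounding_coordinates`, and `locate`. Rectangles are assumed to be `(left, top, right, bottom)` pixel tuples.

```python
from typing import List, Set, Tuple

Rect = Tuple[int, int, int, int]   # (left, top, right, bottom)
Point = Tuple[int, int]

# Assumed variation rule: the root plus a few common English suffixes.
SUFFIXES = ("", "s", "es", "'s")


def variations(root: str) -> Set[str]:
    """Case-folded surface forms derived from the root portion of an entity."""
    return {(root + suffix).casefold() for suffix in SUFFIXES}


def bounding_coordinates(rect: Rect) -> List[Point]:
    """Four corner points describing the rectangle, clockwise from top-left."""
    left, top, right, bottom = rect
    return [(left, top), (right, top), (right, bottom), (left, bottom)]


def locate(entity_root: str, words: List[Tuple[str, Rect]]) -> List[Rect]:
    """Return the rectangle of every OCR word that matches a variation."""
    forms = variations(entity_root)
    return [rect for text, rect in words if text.casefold() in forms]


# Hypothetical OCR output for one line of a document image.
words = [
    ("The", (0, 0, 30, 20)),
    ("Smiths", (35, 0, 95, 20)),
    ("hired", (100, 0, 150, 20)),
    ("Smith's", (155, 0, 225, 20)),
    ("lawyer", (230, 0, 290, 20)),
]
locations = locate("Smith", words)
corners = bounding_coordinates(locations[0])
```

Here both "Smiths" and "Smith's" are located from the single root "Smith", so each would receive its own redaction box. A production system would likely need a more robust stemming or fuzzy-matching rule to tolerate OCR errors.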
5. A method for document alteration based on native text analysis and optical character recognition (OCR) on a computing device, the method comprising:
performing, by the computing device, named-entity recognition on native text from a native document to categorize a text entity of the native text in a predefined text category, wherein the text entity is designated for redaction based on the predefined text category;
using a native application interface to convert the native document to a document image, wherein the native application interface is determined based on a document type of the native document;
performing OCR on the document image to identify a text location of the text entity, wherein performing the OCR on the document image to identify the text location of the text entity comprises:
recognizing a plurality of words in the document image;
generating a plurality of bounding coordinates for each of the plurality of words, wherein the plurality of bounding coordinates describe a bounding rectangle of a plurality of bounding rectangles that surrounds one of the plurality of words;
using the plurality of words and the plurality of bounding rectangles to identify the text location of the text entity, wherein using the plurality of words and the plurality of bounding rectangles to identify the text location of the text entity comprises matching variations of a root portion of the text entity to a matching word of the plurality of words, wherein the matching word is associated with the text location; and
generating a redaction box at the text location in the document image to conceal the text entity.
Dependent claims: 6, 7
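The first step of claim 5 (categorizing a text entity in a predefined category and designating it for redaction based on that category) can be sketched as follows. Real named-entity recognition normally relies on a trained model; this sketch substitutes simple regular expressions as a stand-in so that it stays self-contained. The category names, the patterns, and the `REDACT_CATEGORIES` policy are all assumptions for illustration.

```python
import re
from typing import List, Tuple

# Stand-in for a named-entity recognizer: one regex per assumed category.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

# Assumed policy: only entities in these categories are redacted.
REDACT_CATEGORIES = {"SSN"}


def categorize(native_text: str) -> List[Tuple[str, str]]:
    """Return (entity text, category) pairs found in the native text."""
    found = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(native_text):
            found.append((match.group(), category))
    return found


def designated_for_redaction(native_text: str) -> List[str]:
    """Entities whose category is on the redaction list."""
    return [text for text, category in categorize(native_text)
            if category in REDACT_CATEGORIES]


native_text = "SSN 123-45-6789, email jane@example.com"
entities = categorize(native_text)
to_redact = designated_for_redaction(native_text)
```

In this example the e-mail address is recognized and categorized but left visible, because only the `SSN` category is designated for redaction.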
8. A non-transitory machine-readable storage medium encoded with instructions executable by at least one processor to:
perform named-entity recognition on native text from a native document to categorize each of a plurality of text entities in a respective text category of a plurality of predefined text categories;
use a native application interface to convert the native document to a document image, wherein the native application interface is determined based on a document type of the native document;
perform OCR on the document image to identify a plurality of text locations for the plurality of text entities, wherein the identifying of a first text location of a first text entity of the plurality of text entities comprises matching a given word of a plurality of words recognized by the OCR on the document image with the first text entity identified by the named-entity recognition on the native text from the native document, and using a location of the given word recognized by the OCR as the first text location;
generate redaction boxes at the plurality of text locations in the document image to conceal the plurality of text entities;
calculate a quantity of text entities of the plurality of text entities that are categorized in a first text category of the plurality of predefined text categories; and
cause display of the quantity of text entities that are categorized in the first text category.
Dependent claims: 9
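The last two steps of claim 8 (calculating and displaying the quantity of entities in a given category) reduce to a per-category tally. The Python sketch below assumes the entities have already been categorized; the entity strings, the category labels, and the function name `count_by_category` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_by_category(entities: Iterable[Tuple[str, str]]) -> Counter:
    """Tally categorized entities; each item is (entity text, category)."""
    return Counter(category for _, category in entities)


# Hypothetical output of the named-entity recognition step.
entities = [
    ("Alice Smith", "PERSON"),
    ("Acme Corp", "ORGANIZATION"),
    ("Bob Jones", "PERSON"),
    ("Denver", "LOCATION"),
]

counts = count_by_category(entities)
# "Cause display" is reduced to a console message in this sketch.
print(f"PERSON entities redacted: {counts['PERSON']}")
```

A `Counter` returns 0 for categories that never occurred, which keeps the display step simple when a category has no matches.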
10. A non-transitory machine-readable storage medium encoded with instructions executable by at least one processor to:
perform named-entity recognition on native text from a native document to categorize each of a plurality of text entities in a respective text category of a plurality of predefined text categories;
use a native application interface to convert the native document to a document image, wherein the native application interface is determined based on a document type of the native document;
perform OCR on the document image to identify a plurality of text locations for the plurality of text entities, wherein the identifying of a first text location of a first text entity of the plurality of text entities comprises:
recognizing a plurality of words in the document image,
matching a given word of the plurality of words recognized by the OCR in the document image with the first text entity identified by the named-entity recognition on the native text from the native document,
generating a plurality of bounding coordinates for each of the plurality of words, wherein the plurality of bounding coordinates describe a bounding rectangle of a plurality of bounding rectangles that surrounds the given word of the plurality of words, and
using the boundary rectangle that surrounds the given word to identify the first text location of the first text entity; and
generate redaction boxes at the plurality of text locations in the document image to conceal the plurality of text entities.
Dependent claims: 11
Specification