Document alteration based on native text analysis and OCR

US 9,256,798 B2
Filed: 01/31/2013
Issued: 02/09/2016
Est. Priority Date: 01/31/2013
Status: Active Grant

Abstract

Example embodiments relate to document alteration based on native text analysis and optical character recognition (OCR). In example embodiments, a system analyzes native text obtained from a native document to identify a text entity in the native document. At this stage, the system may use a native application interface to convert the native document to a document image and perform OCR on the document image to identify a text location of the text entity. The system may then generate an alteration box (e.g., redaction box, highlight box) at the text location in the document image to alter a presentation of the text entity.
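
As an illustration of the last step of the abstract, the sketch below applies an alteration box to a page image. It is not taken from the patent: the image is modeled as a small grid of grayscale pixels, and the function name `apply_alteration_box`, the `mode` values, and the highlight brightness cap of 200 are all assumptions chosen for the example.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale rows; 0 is black, 255 is white
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def apply_alteration_box(image: Image, box: Box, mode: str = "redact") -> Image:
    """Return a copy of `image` with the pixels inside `box` altered.

    Hypothetical helper: "redact" paints the box solid black, "highlight"
    caps brightness so the background darkens while dark text stays legible.
    """
    if mode not in ("redact", "highlight"):
        raise ValueError(f"unknown alteration mode: {mode!r}")
    left, top, right, bottom = box
    altered = [row[:] for row in image]  # leave the input image untouched
    # Clamp the box to the image so an oversized box cannot index out of range.
    for y in range(max(top, 0), min(bottom, len(altered))):
        row = altered[y]
        for x in range(max(left, 0), min(right, len(row))):
            if mode == "redact":
                row[x] = 0
            else:
                row[x] = min(row[x], 200)
    return altered


page = [[255] * 4 for _ in range(3)]  # a blank 4x3 "page"
redacted = apply_alteration_box(page, (1, 0, 3, 2))
highlighted = apply_alteration_box(page, (1, 0, 3, 2), mode="highlight")
```

A production system would more plausibly draw on a real raster through an imaging library; the grid stands in for that here so the example stays self-contained.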

11 Claims

1. A system for document alteration based on native text analysis and optical character recognition (OCR), the system comprising:
- at least one processor to:
  
  analyze native text obtained from a native document to identify a text entity in the native document;
  
  use a native application interface to convert the native document to a document image, wherein the native application interface is determined based on a document type of the native document;
  
  perform OCR on the document image to identify a text location of the text entity, wherein the identifying of the text location of the text entity comprises:
  
  recognizing a plurality of words in the document image,
  
  matching a given word of the plurality of words recognized in the document image with the text entity identified by the analyzing of the native text obtained from the native document, wherein the matching comprises matching variations of a root portion of the text entity to the given word of the plurality of words,
  
  generating a plurality of bounding coordinates for each of the plurality of words, wherein the plurality of bounding coordinates describe a bounding rectangle of a plurality of bounding rectangles that surrounds the given word of the plurality of words, and
  
  using the boundary rectangle that surrounds the given word to identify the text location of the text entity; and
  
  generate a redaction box at the text location in the document image to conceal the text entity.
- 2. The system of claim 1, wherein the redaction box is generated using a surrounding threshold so that the redaction box further conceals a buffer area surrounding the text entity.
- 3. The system of claim 1, wherein the analyzing of the native text to identify the text entity comprises performing named-entity recognition to categorize the text entity in a predetermined text category, wherein the text entity is designated for redaction based on the predetermined text category.
- 4. The system of claim 1, wherein the processor is to select the used native application interface from a plurality of native application interfaces corresponding to respective different document types, the selecting of the used native application interface is based on the document type of the native document.
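
The word matching of claim 1 and the surrounding threshold of claim 2 can be sketched as follows. This is one possible reading, not the patented implementation: the `OcrWord` type, the suffix list used to form variations of the root, and the pixel-based `threshold` are assumptions made for illustration, and the OCR output is supplied by hand rather than produced by a real engine.

```python
import re
from dataclasses import dataclass
from typing import List, Set, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class OcrWord:
    """One word recognized by OCR together with its bounding rectangle."""
    text: str
    box: Box


def root_variations(entity: str) -> Set[str]:
    # Assumed suffix list; a real system might use a stemmer instead.
    root = entity.lower()
    return {root + suffix for suffix in ("", "s", "es", "'s")}


def normalize(word: str) -> str:
    # Drop surrounding punctuation such as commas but keep apostrophes.
    return re.sub(r"[^\w']", "", word).lower()


def pad_box(box: Box, threshold: int) -> Box:
    # Grow the rectangle by `threshold` pixels per side, clamped at the page origin.
    left, top, right, bottom = box
    return (max(left - threshold, 0), max(top - threshold, 0),
            right + threshold, bottom + threshold)


def find_redaction_boxes(entity: str, words: List[OcrWord], threshold: int = 0) -> List[Box]:
    """Return a padded box for every OCR word that matches a variation of the entity."""
    variations = root_variations(entity)
    return [pad_box(w.box, threshold) for w in words if normalize(w.text) in variations]


ocr_words = [
    OcrWord("Dear", (0, 0, 40, 10)),
    OcrWord("Smith's,", (50, 0, 110, 10)),
    OcrWord("blacksmith", (0, 20, 90, 30)),
]
boxes = find_redaction_boxes("Smith", ocr_words, threshold=2)
```

Matching on whole normalized words, rather than substrings, keeps "blacksmith" from being redacted when the entity is "Smith"; whether the patented system makes the same choice is not stated in the claims.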

5. A method for document alteration based on native text analysis and optical character recognition (OCR) on a computing device, the method comprising:
- performing, by the computing device, named-entity recognition on native text from a native document to categorize a text entity of the native text in a predefined text category, wherein the text entity is designated for redaction based on the predefined text category;
  
  using a native application interface to convert the native document to a document image, wherein the native application interface is determined based on a document type of the native document;
  
  performing OCR on the document image to identify a text location of the text entity, wherein performing the OCR on the document image to identify the text location of the text entity comprises:
  
  recognizing a plurality of words in the document image;
  
  generating a plurality of bounding coordinates for each of the plurality of words, wherein the plurality of bounding coordinates describe a bounding rectangle of a plurality of bounding rectangles that surrounds one of the plurality of words;
  
  using the plurality of words and the plurality of bounding rectangles to identify the text location of the text entity, wherein using the plurality of words and the plurality of bounding rectangles to identify the text location of the text entity comprises matching variations of a root portion of the text entity to a matching word of the plurality of words, wherein the matching word is associated with the text location; and
  
  generating a redaction box at the text location in the document image to conceal the text entity.
- 6. The method of claim 5, wherein the redaction box is generated using a surrounding threshold so that the redaction box further conceals a buffer area surrounding the text entity.
- 7. The method of claim 5, further comprising selecting, by the computing device, the used native application interface from a plurality of native application interfaces corresponding to respective different document types, the selecting of the used native application interface is based on the document type of the native document.
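
Claims 4 and 7 describe selecting a native application interface according to the document type. A minimal sketch of such a selection is a lookup table keyed by document type. Everything named here is hypothetical: the claims do not say how the document type is detected (a file extension is assumed), and the converter functions are stubs that return a description instead of driving a real application.

```python
from pathlib import Path
from typing import Callable, Dict

Converter = Callable[[str], str]  # takes a document path, returns a description of the rendering


def word_processor_interface(path: str) -> str:
    # Stub: a real interface would ask the word processor to render the pages.
    return f"image of {path} rendered by the word-processor interface"


def spreadsheet_interface(path: str) -> str:
    # Stub standing in for a spreadsheet application's rendering call.
    return f"image of {path} rendered by the spreadsheet interface"


def pdf_interface(path: str) -> str:
    # Stub standing in for a PDF renderer.
    return f"image of {path} rendered by the PDF interface"


# Assumed mapping from document type (here, file extension) to interface.
NATIVE_INTERFACES: Dict[str, Converter] = {
    ".docx": word_processor_interface,
    ".xlsx": spreadsheet_interface,
    ".pdf": pdf_interface,
}


def select_native_interface(path: str) -> Converter:
    """Pick the interface registered for the document's type."""
    doc_type = Path(path).suffix.lower()
    try:
        return NATIVE_INTERFACES[doc_type]
    except KeyError:
        raise ValueError(f"no native application interface for {doc_type!r}") from None


convert = select_native_interface("memo.DOCX")
description = convert("memo.DOCX")
```

Keeping the mapping in one table means a new document type needs only a new entry, which fits the claims' language of "a plurality of native application interfaces corresponding to respective different document types".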

8. A non-transitory machine-readable storage medium encoded with instructions executable by at least one processor to:
- perform named-entity recognition on native text from a native document to categorize each of a plurality of text entities in a respective text category of a plurality of predefined text categories;
  
  use a native application interface to convert the native document to a document image, wherein the native application interface is determined based on a document type of the native document;
  
  perform OCR on the document image to identify a plurality of text locations for the plurality of text entities, wherein the identifying of a first text location of a first text entity of the plurality of text entities comprises matching a given word of a plurality of words recognized by the OCR on the document image with the first text entity identified by the named-entity recognition on the native text from the native document, and using a location of the given word recognized by the OCR as the first text location;
  
  generate redaction boxes at the plurality of text locations in the document image to conceal the plurality of text entities;
  
  calculate a quantity of text entities of the plurality of text entities that are categorized in a first text category of the plurality of predefined text categories; and
  
  cause display of the quantity of text entities that are categorized in the first text category.
- 9. The machine-readable storage medium of claim 8, wherein the instructions are executable by the processor to further select the used native application interface from a plurality of native application interfaces corresponding to respective different document types, the selecting of the used native application interface is based on the document type of the native document.
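
The counting and display steps of claim 8 can be illustrated with a short sketch. The category labels, the sample entities, and the summary format are assumptions; the named-entity recognition itself is not performed here, and its output is supplied by hand as (text, category) pairs.

```python
from collections import Counter
from typing import Iterable, Tuple

Entity = Tuple[str, str]  # (entity text, category assigned by named-entity recognition)


def count_by_category(entities: Iterable[Entity]) -> Counter:
    """Tally how many entities fall into each category."""
    return Counter(category for _text, category in entities)


def format_summary(counts: Counter, category: str) -> str:
    # Hypothetical display string; Counter returns 0 for unseen categories.
    return f"{category}: {counts[category]} redacted"


entities = [
    ("Alice Smith", "PERSON"),
    ("Bob Jones", "PERSON"),
    ("Acme Corp", "ORGANIZATION"),
]
counts = count_by_category(entities)
summary = format_summary(counts, "PERSON")
```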

10. A non-transitory machine-readable storage medium encoded with instructions executable by at least one processor to:
- perform named-entity recognition on native text from a native document to categorize each of a plurality of text entities in a respective text category of a plurality of predefined text categories;
  
  use a native application interface to convert the native document to a document image, wherein the native application interface is determined based on a document type of the native document;
  
  perform OCR on the document image to identify a plurality of text locations for the plurality of text entities, wherein the identifying of a first text location of a first text entity of the plurality of text entities comprises:
  
  recognizing a plurality of words in the document image,
  
  matching a given word of the plurality of words recognized by the OCR in the document image with the first text entity identified by the named-entity recognition on the native text from the native document,
  
  generating a plurality of bounding coordinates for each of the plurality of words, wherein the plurality of bounding coordinates describe a bounding rectangle of a plurality of bounding rectangles that surrounds the given word of the plurality of words, and
  
  using the boundary rectangle that surrounds the given word to identify the first text location of the first text entity; and
  
  generate redaction boxes at the plurality of text locations in the document image to conceal the plurality of text entities.
- 11. The machine-readable storage medium of claim 10, wherein the matching comprises matching variations of a root portion of the first text entity to the given word of the plurality of words.

Current Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Original Assignee: Aurasma Limited (HP Inc.)
Inventors: Walker, James Richard; Burtoft, James Arthur
Primary Examiner(s): Koziol, Stephen R
Assistant Examiner(s): Thomas, Mia M
Application Number: US13/756,432
Publication Number: US 20140212040A1
Time in Patent Office: 1,104 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06V 30/10   Character recognition
G06V 30/1456   based on user interactions
G06V 30/15   Cutting or merging image el...
