Document Alteration Based on Native Text Analysis and OCR

US 20140212040A1
Filed: 01/31/2013
Published: 07/31/2014
Est. Priority Date: 01/31/2013
Status: Active Grant

Abstract

Example embodiments relate to document alteration based on native text analysis and optical character recognition (OCR). In example embodiments, a system analyzes native text obtained from a native document to identify a text entity in the native document. At this stage, the system may use a native application interface to convert the native document to a document image and perform OCR on the document image to identify a text location of the text entity. The system may then generate an alteration box (e.g., redaction box, highlight box) at the text location in the document image to alter a presentation of the text entity.
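
The four stages named in the abstract can be sketched as a small pipeline. This is only an illustration under stated assumptions: `alter_document`, `AlterationBox`, and the three caller-supplied callables (`find_entities`, `converters`, `locate`) are hypothetical names that do not come from the patent, and the file extension is used here as a stand-in for the "document type" that the claims say determines the native application interface.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels


@dataclass
class AlterationBox:
    box: Box
    kind: str  # e.g. "redaction" or "highlight"


def alter_document(
    path: str,
    native_text: str,
    find_entities: Callable[[str], List[str]],
    converters: Dict[str, Callable[[str], object]],
    locate: Callable[[object, str], List[Box]],
    kind: str = "redaction",
) -> List[AlterationBox]:
    """Chain the stages from the abstract using hypothetical, injected components."""
    # Stage 1: analyze the native text to identify text entities.
    entities = find_entities(native_text)

    # Stage 2: choose a converter by document type (assumed: file extension)
    # and render the native document to an image.
    doc_type = Path(path).suffix.lower()
    if doc_type not in converters:
        raise ValueError(f"no converter registered for {doc_type!r}")
    image = converters[doc_type](path)

    # Stages 3 and 4: locate each entity in the image (the OCR step is hidden
    # behind `locate`) and emit one alteration box per location found.
    boxes: List[AlterationBox] = []
    for entity in entities:
        for box in locate(image, entity):
            boxes.append(AlterationBox(box, kind))
    return boxes
```

Passing `kind="highlight"` instead of the default would model the highlight-box variant mentioned in the abstract; the real entity analysis, rendering, and OCR steps are left to whatever components the caller plugs in.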

15 Claims

1. A system for document alteration based on native text analysis and optical character recognition (OCR), the system comprising:
- a processor to:
  
  analyze native text obtained from a native document to identify a text entity in the native document;
  
  use a native application interface to convert the native document to a document image, wherein the native application interface is determined based on a document type of the native document;
  
  perform OCR on the document image to identify a text location of the text entity; and
  
  generate a redaction box at the text location in the document image to conceal the text entity.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The system of claim 1, wherein the processor performs OCR on the document image to identify the text location for the text entity by:
    - recognizing a plurality of characters in the document image;
      
      generating a plurality of bounding coordinates for each of the plurality of characters, wherein the plurality of bounding coordinates describes a bounding rectangle of a plurality of bounding rectangles that surrounds one of the plurality of characters; and
      
      using the plurality of characters and the plurality of bounding rectangles to identify the text location of the text entity.
  - 3. The system of claim 1, wherein the processor performs OCR on the document image to identify the text location for the text entity by:
    - recognizing a plurality of words in the document image;
      
      generating a plurality of bounding coordinates for each of the plurality of words, wherein the plurality of bounding coordinates describes a bounding rectangle of a plurality of bounding rectangles that surrounds one of the plurality of words; and
      
      using the plurality of words and the plurality of bounding rectangles to identify the text location of the text entity.
  - 4. The system of claim 3, wherein using the plurality of words and the plurality of bounding rectangles to identify the text location of the text entity comprises matching variations of a root portion of the text entity to a matching word of the plurality of words, wherein the matching word is associated with the text location.
  - 5. The system of claim 1, wherein the redaction box is generated using a surrounding threshold so that the redaction box further conceals a buffer area surrounding the text entity.
  - 6. The system of claim 1, wherein analyzing the native text to identify the text entity comprises performing named-entity recognition to categorize the text entity in a predetermined text category, wherein the text entity is designated for redaction based on the predetermined text category.
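
Claims 3 to 5 can be illustrated with a short sketch: OCR yields words with bounding rectangles, variations of a root portion of the entity are matched against those words, and the matched rectangle is enlarged by a surrounding threshold. The names `OcrWord`, `SUFFIXES`, `root_variations`, `locate_entity`, and `pad_box` are hypothetical, and the fixed suffix list is only one assumed way to generate "variations of a root portion"; the claims do not say how variations are produced.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels


@dataclass
class OcrWord:
    text: str  # word as recognized by OCR
    box: Box   # bounding rectangle surrounding that word


# Assumed suffix list standing in for "variations of a root portion".
SUFFIXES = ("", "s", "es", "'s", "ed", "ing")


def root_variations(root: str) -> Set[str]:
    """Return lowercase variations of the root, e.g. smith, smiths, smith's."""
    return {(root + suffix).lower() for suffix in SUFFIXES}


def locate_entity(root: str, words: List[OcrWord]) -> List[Box]:
    """Return the rectangle of every OCR word that matches a root variation."""
    variations = root_variations(root)
    matches = []
    for word in words:
        # Drop surrounding punctuation but keep inner apostrophes.
        cleaned = word.text.strip(".,;:!?\"()").lower()
        if cleaned in variations:
            matches.append(word.box)
    return matches


def pad_box(box: Box, threshold: int, width: int, height: int) -> Box:
    """Grow a box by `threshold` pixels per side, clamped to the image bounds."""
    left, top, right, bottom = box
    return (
        max(0, left - threshold),
        max(0, top - threshold),
        min(width, right + threshold),
        min(height, bottom + threshold),
    )
```

Clamping to the image size is a design choice of this sketch so that a padded redaction box near a page edge stays inside the image.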

7. A method for document alteration based on native text analysis and optical character recognition (OCR) on a computing device, the method comprising:
- performing, by the computing device, named-entity recognition on native text from a native document to categorize a text entity of the native text in a predefined text category, wherein the text entity is designated for redaction based on the predefined text category;
  
  using a native application interface to convert the native document to a document image, wherein the native application interface is determined based on a document type of the native document;
  
  performing OCR on the document image to identify a text location of the text entity; and
  
  generating a redaction box at the text location in the document image to conceal the text entity.
- Dependent claims (8, 9, 10, 11):
  - 8. The method of claim 7, wherein performing OCR on the document image to identify the text location of the text entity comprises:
    - recognizing a plurality of characters in the document image;
      
      generating a plurality of bounding coordinates for each of the plurality of characters, wherein the plurality of bounding coordinates describes a bounding rectangle of a plurality of bounding rectangles that surrounds one of the plurality of characters; and
      
      using the plurality of characters and the plurality of bounding rectangles to identify the text location of the text entity.
  - 9. The method of claim 7, wherein performing OCR on the document image to identify the text location of the text entity comprises:
    - recognizing a plurality of words in the document image;
      
      generating a plurality of bounding coordinates for each of the plurality of words, wherein the plurality of bounding coordinates describes a bounding rectangle of a plurality of bounding rectangles that surrounds one of the plurality of words; and
      
      using the plurality of words and the plurality of bounding rectangles to identify the text location of the text entity.
  - 10. The method of claim 9, wherein using the plurality of words and the plurality of bounding rectangles to identify the text location of the text entity comprises matching variations of a root portion of the text entity to a matching word of the plurality of words, wherein the matching word is associated with the text location.
  - 11. The method of claim 7, wherein the redaction box is generated using a surrounding threshold so that the redaction box further conceals a buffer area surrounding the text entity.
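
The character-level variant in claim 8 can be sketched in a similar way: each recognized character carries its own bounding rectangle, and the location of an entity is the union of the rectangles of the characters that spell it. The helper names `union` and `locate_by_characters` are hypothetical, and the sketch assumes that every OCR item holds exactly one character and that the entity sits on a single line, which the claim itself does not require.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels
OcrChar = Tuple[str, Box]        # one recognized character and its rectangle


def union(boxes: List[Box]) -> Box:
    """Smallest rectangle that contains every box in the list."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def locate_by_characters(entity: str, chars: List[OcrChar]) -> List[Box]:
    """Find every occurrence of `entity` in the OCR character stream."""
    if not entity:
        return []
    # One character per item, so string indices line up with list indices.
    text = "".join(ch for ch, _ in chars)
    locations = []
    start = text.find(entity)
    while start != -1:
        end = start + len(entity)
        locations.append(union([box for _, box in chars[start:end]]))
        start = text.find(entity, end)
    return locations
```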

12. A non-transitory machine-readable storage medium encoded with instructions executable by a processor, the machine-readable storage medium comprising:
- instructions for performing named-entity recognition on native text from a native document to categorize each of a plurality of text entities in one of a plurality of predefined text categories;
  
  instructions for using a native application interface to convert the native document to a document image, wherein the native application interface is determined based on a document type of the native document;
  
  instructions for performing OCR on the document image to identify a plurality of text locations for the plurality of text entities; and
  
  instructions for generating redaction boxes at the plurality of text locations in the document image to conceal the plurality of text entities.
- Dependent claims (13, 14, 15):
  - 13. The machine-readable storage medium of claim 12, further comprising:
    - instructions for calculating a quantity of the plurality of text entities that are categorized in a predefined text category of the plurality of predefined text categories; and
      
      instructions for displaying the quantity of the plurality of text entities that are categorized in the predefined text category.
  - 14. The machine-readable storage medium of claim 12, wherein instructions for performing OCR on the document image to identify the text location of the text entity comprises:
    - instructions for recognizing a plurality of words in the document image;
      
      instructions for generating a plurality of bounding coordinates for each of the plurality of words, wherein the plurality of bounding coordinates describes a bounding rectangle of a plurality of bounding rectangles that surrounds one of the plurality of words; and
      
      instructions for using the plurality of words and the plurality of bounding rectangles to identify the plurality of text locations of the plurality of text entities.
  - 15. The machine-readable storage medium of claim 14, wherein instructions for using the plurality of words and the plurality of bounding rectangles to identify the text location of the text entity comprises instructions for matching variations of a root portion of the text entity to a matching word of the plurality of words, wherein the matching word is associated with the text location.
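
The per-category count in claim 13 amounts to a tally over the categorized entities. The sketch below assumes the named-entity recognition step has already produced `(text, category)` pairs; the labels `PERSON` and `ORG` and the function names are illustrative assumptions, not categories or interfaces taken from the patent.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Entity = Tuple[str, str]  # (entity text, predefined category label)


def count_by_category(entities: Iterable[Entity]) -> Dict[str, int]:
    """Calculate how many entities fall into each predefined category."""
    return dict(Counter(category for _, category in entities))


def format_counts(counts: Dict[str, int]) -> str:
    """Render the counts for display, one category per line, sorted by name."""
    return "\n".join(f"{category}: {n}" for category, n in sorted(counts.items()))


def designated_for_redaction(
    entities: Iterable[Entity], redact_categories: Set[str]
) -> List[str]:
    """Keep only entities whose category is marked for redaction."""
    return [text for text, category in entities if category in redact_categories]
```

For example, `format_counts(count_by_category(entities))` could feed a summary panel, while `designated_for_redaction` mirrors the category-based designation described in claims 6 and 7.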

Current Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Original Assignee: Longsand Limited (Open Text Corporation)
Inventors: Walker, James Richard; Burtoft, James Arthur

Granted Patent: US 9,256,798 B2
US Class Current: 382/182
CPC Class Codes:
- G06V 30/10 Character recognition
- G06V 30/1456 based on user interactions
- G06V 30/15 Cutting or merging image el...
