SYSTEM FOR AUTOMATED TEXT AND HALFTONE SEGMENTATION

US 20150356740A1
Filed: 06/05/2014
Published: 12/10/2015
Est. Priority Date: 06/05/2014
Status: Active Grant

Abstract

A method and system for segmenting text from non-text portions of a digital image using the size, solidity, and run length characteristics of connected components within the image data. For a connected component comprising a rectangular group of pixels enclosing a set of connected pixels having the same binary state, the size characteristic may be based on a ratio of height to width of the connected component and the total number of pixels within the connected component, the solidity characteristic may be based on a ratio of pixels within a convex hull of the set of connected pixels to a total number of pixels within the connected component, and the run length characteristic may be based on a number of transitions within the connected component.
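
The connected-component notion used in the abstract can be illustrated with a minimal Python sketch. It assumes the binary image is a list of rows of 0/1 values, treats 1 as ON, and uses 8-connectivity; none of these choices is stated in the source, and the function name `find_connected_components` is hypothetical.

```python
from collections import deque

def find_connected_components(binary):
    """Return one dict per ON-pixel component: its bounding box and its pixels.

    Assumptions (not from the source): 1 means ON, 8-connectivity,
    bounding box given as (top, left, bottom, right), inclusive.
    """
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ON pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            components.append({
                "bbox": (min(ys), min(xs), max(ys), max(xs)),
                "pixels": pixels,
            })
    return components

# Two small blobs that do not touch, even diagonally.
image = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
components = find_connected_components(image)
```

Each returned bounding box plays the role of the rectangular group of pixels that encloses one set of connected pixels.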

19 Claims

1. A method for segmenting a text region from a pictorial region within a scanned image comprising:
- scanning a document to obtain scanned image data representing the document;
  
  generating a binary image from the scanned image data, wherein the binary image comprises a two dimensional array of pixels and where a value of the pixel corresponds to the pixel being one of an ON pixel and an OFF pixel;
  
  identifying a connected component within the binary image, the connected component comprises a group of pixels enclosing a set of connected pixels having the same value;
  
  determining at least two of a size label, a solidity label, and a run length label for the connected component, wherein each of the size label, the solidity label, and the run length label identifies the connected component as being either a text area or a non-text area, and where the connected component corresponds to a text component if the at least two of the size label, the solidity label, and the run length label identify the connected component as being a text component; and
  
  classifying the connected component as a text region within the scanned image when the connected component is identified as being a text component.
  - 2. The method of claim 1, wherein the size label is based on a ratio of a height and a width of the connected component and an area of the connected component, where the connected component is labeled as a text area if the ratio is between an upper and a lower ratio threshold and the area is between an upper and a lower area threshold.
  - 3. The method of claim 1, wherein the solidity label is based on a ratio of a total number of pixels within the connected component and a number of pixels within a convex hull of the connected pixels within the connected component and where the connected component is labeled as a text area if the ratio is greater than a solidity threshold.
  - 4. The method of claim 1, wherein the run length label is based on a number of transitions within a set of scanlines selected from across the connected component and where the connected component is labeled as a text area if the number of transitions is less than a transition threshold.
  - 5. The method of claim 4, wherein the set of scanlines selected from across the connected component comprises a set of horizontal scanlines.
  - 6. The method of claim 4, wherein the set of scanlines selected from across the connected component comprises a set of vertical scanlines.
  - 7. The method of claim 4, wherein the set of scanlines selected from across the connected component comprises a set of horizontal scanlines and a set of vertical scanlines.
  - 8. The method of claim 1, wherein the run length label is based on a function of a run length vector characteristic of each scanline within a set of scanlines and a ratio of a height and a width of the connected component;
    - wherein the set of scanlines is sampled from across the connected component and for each scanline within the set the run length characteristic is determined as a difference of a number of elements within a run length vector for the scanline that exceed a threshold length and a length of the run length vector.
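
The size test of claim 2, the transition test of claim 4, and the two-of-three vote of claim 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the threshold values in the example are made up, "between" is read as strictly between, the area is taken as bounding-box height times width, and the transition count is summed over the sampled horizontal scanlines.

```python
def size_label(height, width, area, ratio_range, area_range):
    """True (text) when the height/width ratio and the area both lie
    strictly between their lower and upper thresholds (an assumed reading)."""
    ratio = height / width
    return (ratio_range[0] < ratio < ratio_range[1]
            and area_range[0] < area < area_range[1])

def count_transitions(scanline):
    """Number of ON/OFF changes between neighboring pixels on one scanline."""
    return sum(1 for a, b in zip(scanline, scanline[1:]) if a != b)

def run_length_label(component, transition_threshold, step=1):
    """True (text) when the transitions summed over every `step`-th
    horizontal scanline stay below the threshold (summing is an assumption)."""
    total = sum(count_transitions(row) for row in component[::step])
    return total < transition_threshold

def is_text_component(labels, required=2):
    """Vote over the per-feature labels; claim 1 asks for at least two."""
    return sum(1 for label in labels if label) >= required

# A small glyph-like crop of a binary image (bounding box contents).
crop = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
]
height, width = len(crop), len(crop[0])

# Illustrative thresholds only; the source gives no numeric values.
size_vote = size_label(height, width, height * width,
                       ratio_range=(0.1, 10.0), area_range=(4, 5000))
run_vote = run_length_label(crop, transition_threshold=8)
solidity_vote = False  # stand-in for a solidity test computed elsewhere

result = is_text_component([size_vote, solidity_vote, run_vote])
```

Passing `required=3` would correspond to a variant in which all three labels must agree.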

9. A system for segmenting a text region from a pictorial region within a scanned image comprising:
- a scanner operable to scan a document and generate scanned image data representing the document; and
  
  a processor operable to generate a binary image from the scanned image data, wherein the binary image comprises a two dimensional array of pixels and where a value of the pixel corresponds to the pixel being one of an ON pixel and an OFF pixel;
  
  identify a connected component within the binary image, the connected component comprises a group of pixels enclosing a set of contiguous pixels having the same value;
  
  determine a size label, a solidity label, and a run length label for the connected component, wherein each of the size label, the solidity label, and the run length label identifies the connected component as being either a text area or a non-text area, and where the connected component corresponds to a text component if the size label, the solidity label, and the run length label identify the connected component as being a text component; and
  
  identify a text region within the scanned image as an area of the scanned image that corresponds to a text component.
  - 10. The system of claim 9, wherein the processor includes a multicore processor operable to determine the size label, the solidity label, and the run length label for multiple connected components in parallel.
  - 11. The system of claim 9, wherein the size label is based on a ratio of a height and a width of the connected component and an area of the connected component, where the connected component is labeled as a text area if the ratio is between an upper and a lower ratio threshold and the area is between an upper and a lower area threshold.
  - 12. The system of claim 9, wherein the solidity label is calculated as a ratio of a total number of pixels within the connected component to a number of pixels of a convex hull for the set of contiguous pixels within the connected component and where the connected component is labeled as a text area if the ratio is greater than a solidity threshold.
  - 13. The system of claim 9, wherein the run length label is based on a number of transitions within a set of scanlines selected from across the connected component and where the connected component is labeled as a text area if the number of transitions is less than a transition threshold.
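
The pixel counts behind the solidity label of claim 12 can be sketched as below. Claim 12 orders the ratio as total pixels to hull pixels, while the abstract states the inverse, so this sketch only returns the raw counts and leaves the ratio to the caller. It assumes pixels are given as integer `(x, y)` coordinates, builds the hull with Andrew's monotone chain (a technique chosen here, not named in the source), and counts a pixel as inside the hull when its coordinate lies inside or on the hull polygon. All function names are hypothetical.

```python
def _cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def solidity_counts(pixels):
    """Return (on_pixels, hull_pixels, bounding_box_pixels) for one component."""
    hull = convex_hull(pixels)
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    hull_pixels = 0
    for x in range(min(xs), max(xs) + 1):
        for y in range(min(ys), max(ys) + 1):
            # Inside or on the hull: never to the right of any CCW edge.
            if all(_cross(hull[i], hull[(i + 1) % len(hull)], (x, y)) >= 0
                   for i in range(len(hull))):
                hull_pixels += 1
    bbox_pixels = (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
    return len(set(pixels)), hull_pixels, bbox_pixels

# An L-shaped set of ON pixels inside a 3x3 bounding box.
l_shape = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
on_count, hull_count, bbox_count = solidity_counts(l_shape)
```

For the L shape the hull is a triangle covering six pixel positions, five of which are ON, inside a nine-pixel bounding box; whichever ratio of these counts is chosen would then be compared against a solidity threshold.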

14. A method of segmenting text from non-text portions of a digital image, comprising:
- locating a connected component within digital image data corresponding to a document having a text region and a non-text region, where the connected component comprises a group of pixels enclosing a set of connected ON pixels;
  
  identifying a size label based on a ratio of height to width of the connected component and an area of the connected component;
  
  identifying a solidity label based on a ratio of pixels within a convex hull of connected ON pixels to a total number of pixels within the connected component;
  
  identifying a run length label based on a number of transitions within the connected component; and
  
  classifying the connected component as the text region when at least two of the size label, the solidity label, and the run length label indicate that the connected component is a text area.
  - 15. The method of claim 14, wherein the size label indicates that the connected component is a text area when the ratio of the height and the width of the connected component is between an upper and a lower size threshold and the area is between an upper and a lower area threshold.
  - 16. The method of claim 14, wherein the solidity label indicates that the connected component is a text area when the ratio of connected ON pixels to the total number of pixels within the connected component is greater than a solidity threshold.
  - 17. The method of claim 14, wherein the run length label indicates that the connected component is a text area when the number of transitions within a set of scanlines selected from the connected component is less than a transition threshold.
  - 18. The method of claim 17, wherein the set of scanlines selected from the connected component comprises a set of horizontal scanlines and a set of vertical scanlines.
  - 19. The method of claim 14, wherein the run length label is determined from a comparison of a run length vector characteristic of each scanline within a set of scanlines with a function of the ratio of height to width of the connected component and the area of the connected component;
    - wherein the set of scanlines is sampled from across the connected component and for each scanline within the set of scanlines the run length characteristic is determined as a difference of a number of elements within a run length vector for the scanline that exceed a threshold length and a length of the run length vector and wherein the run length vector characteristic is compared to one of a product of the height to width ratio and the area and a quotient of the height to width ratio and the area.
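
The per-scanline run length characteristic described in claim 19 can be sketched as follows. The sketch assumes a run length vector is the list of lengths of consecutive same-valued pixel runs along a scanline, and that "exceed" means strictly greater than the threshold length; both readings, and the function names, are assumptions. The subsequent comparison against a product or quotient of the height-to-width ratio and the area is not covered here.

```python
from itertools import groupby

def run_length_vector(scanline):
    """Lengths of consecutive runs of equal pixel values along a scanline."""
    return [len(list(group)) for _, group in groupby(scanline)]

def run_length_characteristic(scanline, threshold_length):
    """Number of runs longer than the threshold minus the total number of runs."""
    runs = run_length_vector(scanline)
    long_runs = sum(1 for run in runs if run > threshold_length)
    return long_runs - len(runs)

scanline = [0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0]
runs = run_length_vector(scanline)                # [3, 2, 1, 1, 4]
value = run_length_characteristic(scanline, 2)    # 2 long runs - 5 runs = -3
```

Under this reading the value is zero when every run is long and becomes more negative as short runs dominate the scanline.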

Current Assignee
Xerox Corporation (Xerox Holdings Corp.)
Original Assignee
Xerox Corporation (Xerox Holdings Corp.)
Inventors
Gopalakrishnan, Sainarayanan; Li, Xing; Cuciurean-Zapan, Clara; Subbaian, Sudhagar

Granted Patent

US 9,842,281 B2
US Class Current

1/1
CPC Class Codes

G06F 18/24   Classification techniques

G06V 10/457   by analysing connectivity, ...

G06V 30/413   Classification of content, ...
