Method and system for preprocessing an image for optical character recognition

US 8,194,983 B2
Filed: 05/13/2010
Issued: 06/05/2012
Est. Priority Date: 05/13/2010
Status: Expired due to Fees

Abstract

The present invention provides a method and system for preprocessing an image including one or more of Arabic text and non-text items for Optical Character Recognition (OCR). The method includes determining a plurality of components associated with one or more of the Arabic text and the non-text items, wherein a component includes a set of connected pixels. A first set of characteristic parameters is then calculated for the plurality of components. The plurality of components are subsequently merged based on the first set of characteristic parameters to form one or more of one or more sub-words and one or more words.

14 Citations

20 Claims

1. A method of preprocessing an image for optical character recognition (OCR), wherein the image comprises Arabic text and non-text items, the method comprising:
- determining a plurality of components associated with at least one of the Arabic text and the non-text items, wherein a component comprises a set of connected pixels;
  
  calculating a first set of characteristic parameters associated with the plurality of components; and
  
  merging the plurality of components based on the first set of characteristic parameters to form at least one of at least one sub-word and at least one word;
  
  calculating a second set of characteristic parameters associated with the at least one of each sub-word and each word, wherein the second set of characteristic parameters is one of a line height, a word spacing, and a line spacing;
  
  grouping at least two sub-words based on the second set of characteristic parameters to form one of at least one sub-word and at least one word;
  
  segmenting the at least one sub word and the at least one word into at least one horizontal line based on at least one of a line height and a line spacing;
  
  identifying at least one component associated with the at least one horizontal line comprising a height greater than a factor of the line height;
  
  determining a center of each horizontal line of the at least one horizontal line, wherein the center is a mid point between a top edge and a bottom edge of each horizontal line;
  
  calculating a distance between at least one of the center and the top edge, and the center and the bottom edge; and
  
  determining orientation of the image based on the distance.
- - 2. The method of claim 1, wherein the image is obtained by converting at least one of a grayscale image and a color image into a binary image.
  - 3. The method of claim 1, wherein the image is obtained by filtering salt and pepper noise.
  - 4. The method of claim 1, wherein the image is obtained by correcting skew using a modified Hough transform, wherein the modified Hough transform is adapted for the Arabic text.
  - 5. The method of claim 1, wherein determining the plurality of components comprises:
    - performing a raster scan across the image;
      
      identifying a plurality of pixels associated with at least one of the plurality of components corresponding to at least one sweep of the raster scan; and
      
      aggregating the plurality of pixels based on an interconnection between the plurality of pixels to form at least one set of connected pixels.
  - 6. The method of claim 5, wherein a pixel is interconnected with at least one of 8 neighboring pixels of the pixel.
  - 7. The method of claim 1, wherein the first set of characteristic parameters is at least one of a line height, a word spacing, a line spacing, a number of pixels corresponding to each component, a width of each component, a height of each component, coordinates of each component, density of each component, and aspect ratio of each component.
  - 8. The method of claim 7, wherein calculating the line height comprises:
    - creating a histogram of heights corresponding to a height of each of the plurality of components;
      
      identifying a frequently occurring height from the histogram of heights; and
      
      computing line height based on the frequently occurring height.
  - 9. The method of claim 7, wherein calculating the word spacing comprises:
    - creating a histogram of spaces between consecutive components of the plurality of components;
      
      identifying a frequently occurring space from the histogram, wherein the frequently occurring space is within a threshold range determined by the line height; and
      
      computing the word spacing based on the frequently occurring space.
  - 10. The method of claim 9, wherein the consecutive components comprise at least one of vertically overlapping components and components separated by a predefined distance, wherein the vertically overlapping components share at least one coordinate along a vertical axis.
  - 11. The method of claim 7, wherein calculating the line spacing comprises:
    - creating a histogram of a plurality of horizontal projections of the plurality of components, wherein a horizontal projection of the plurality of horizontal projections indicates a number of pixels associated with the plurality of components corresponding to each sweep of the raster scan;
      
      calculating an average distance between two consecutive maximum horizontal projections; and
      
      computing the line spacing based on the average distance.
  - 12. The method of claim 9, wherein merging the plurality of components comprises:
    - combining the consecutive components based on the word spacing; and
      
      filtering at least one component of the plurality of components associated with the non-text items from the plurality of components associated with the Arabic text based on the first set of characteristic parameters.
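
Claims 5 and 6 describe component detection as a raster scan followed by grouping of 8-connected pixels. A minimal Python sketch of that idea follows, assuming the binary image is a list of rows of 0/1 values with 1 marking a foreground pixel. The names `label_components` and `bounding_box` are illustrative and do not come from the patent, and the breadth-first flood fill used here is one common way to aggregate connected pixels, not necessarily the claimed implementation.

```python
from collections import deque


def label_components(image):
    """Return a list of components, each a list of (row, col) foreground pixels."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    # Raster scan: top to bottom, left to right within each row.
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Unvisited foreground pixel: grow a new component from it.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # Visit all 8 neighbors (the offset (0, 0) is already seen).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (left, top, right, bottom) of a component, inclusive."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return (min(xs), min(ys), max(xs), max(ys))
```

The width, height, pixel count, density, and aspect ratio listed in claim 7 can all be derived from a component's pixel list and its bounding box.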

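The histogram steps of claims 8 and 9 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented procedure: the line height is taken to be the most frequent component height itself, the word spacing is taken to be the most frequent gap inside a range scaled by the line height, and the factors `lo_factor` and `hi_factor` are hypothetical values chosen only for the example. The function names are likewise illustrative.

```python
from collections import Counter


def estimate_line_height(heights):
    """Return the most frequent component height (mode of the height histogram)."""
    histogram = Counter(heights)
    return histogram.most_common(1)[0][0]


def estimate_word_spacing(extents, line_height, lo_factor=0.2, hi_factor=1.5):
    """Return the most frequent gap between consecutive components on one line.

    extents: (left, right) horizontal extents of the components.
    Only gaps inside [lo_factor * line_height, hi_factor * line_height]
    are counted; both factors are assumptions for illustration.
    Returns None when no gap falls inside that range.
    """
    ordered = sorted(extents)
    gaps = [b[0] - a[1] for a, b in zip(ordered, ordered[1:])]
    lo, hi = lo_factor * line_height, hi_factor * line_height
    histogram = Counter(g for g in gaps if lo <= g <= hi)
    if not histogram:
        return None
    return histogram.most_common(1)[0][0]
```

Because gaps are measured between sorted extents, the estimate does not depend on reading direction, which matters for right-to-left Arabic text.
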
13. A system for preprocessing an image for optical character recognition (OCR), wherein the image comprises Arabic text and non-text items, the system comprising:
- a memory; and
  
  a processor coupled to the memory, wherein the processor is configured to:
  
  determine a plurality of components associated with at least one of the Arabic text and the non-text items, wherein a component comprises a set of connected pixels;
  
  calculate a first set of characteristic parameters associated with the plurality of components;
  
  merge the plurality of components based on the first set of characteristic parameters to form at least one of at least one sub-word and at least one word;
  
  calculate a second set of characteristic parameters of the at least one of each subword and each word, wherein the second set of characteristic parameters is one of a line height, a word spacing, and a line spacing;
  
  group at least two sub-words based on the second set of characteristic parameters to form one of at least one sub-word and at least one word;
  
  segment the at least one sub word and the at least one word into at least one horizontal line based on at least one of a line height and a line spacing;
  
  identify at least one component associated with the at least one horizontal line comprising a height greater than a factor of the line height;
  
  determine a center of each horizontal line of the at least one horizontal line, wherein the center is a mid point between a top edge and a bottom edge of each horizontal line;
  
  calculate a distance between at least one of the center and the top edge, and the center and the bottom edge; and
  
  determine orientation of the image based on the distance.
- - 14. The system of claim 13, wherein the processor is further configured to perform at least one of:
    - converting at least one of a grayscale image and a color image into a binary image;
      
      filtering salt and pepper noise; and
      
      correcting skew using a modified Hough transform.
  - 15. The system of claim 13, wherein for determining the plurality of components the processor is further configured to:
    - perform a raster scan across the image;
      
      identify a plurality of pixels associated with at least one of the plurality of components corresponding to at least one sweep of the raster scan; and
      
      aggregate the plurality of pixels based on an interconnection between the plurality of pixels to form at least one set of connected pixels.
  - 16. The system of claim 13, wherein the first set of characteristic parameters is at least one of a line height, a word spacing, a line spacing, a number of pixels corresponding to each component, a width of each component, a height of each component, coordinates of each component, density of each component, and the aspect ratio of each component.
  - 17. The system of claim 16, wherein for calculating the line height the processor is further configured to:
    - create a histogram of heights corresponding to a height of each of the plurality of components;
      
      identify a frequently occurring height from the histogram of heights; and
      
      compute line height based on the frequently occurring height.
  - 18. The system of claim 16, wherein for calculating the word spacing the processor is further configured to:
    - create a histogram of spaces between consecutive components of the plurality of components;
      
      identify a frequently occurring space from the histogram, wherein the frequently occurring space is within a threshold range determined by the line height; and
      
      compute the word spacing based on the frequently occurring space.
  - 19. The system of claim 18, wherein the processor is further configured to:
    - combine the consecutive components based on the word spacing to form at least one of the at least one sub-word and the at least one word; and
      
      filter at least one component of the plurality of components associated with the non-text items from the plurality of components associated with the Arabic text based on the first set of characteristic parameters.
  - 20. The system of claim 16, wherein for calculating the line spacing the processor is further configured to:
    - create a histogram of a plurality of horizontal projections of the plurality of components, wherein a horizontal projection of the plurality of horizontal projections indicates a number of pixels associated with the plurality of components corresponding to each sweep of the raster scan;
      
      calculate an average distance between two consecutive maximum horizontal projections; and
      
      compute the line spacing based on the average distance.
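
The line-spacing computation of claims 11 and 20 relies on a horizontal projection, that is, a count of foreground pixels in each row of the raster scan. The Python sketch below is one possible reading, assuming a binary image given as a list of rows of 0/1 values. The peak-picking rule (one peak per band of rows whose projection reaches a fraction of the global maximum) and the `peak_fraction` parameter are assumptions made for the example and are not stated in the claims.

```python
def horizontal_projection(image):
    """Return the number of foreground pixels in each row."""
    return [sum(row) for row in image]


def estimate_line_spacing(image, peak_fraction=0.5):
    """Return the average distance between consecutive projection peaks.

    Returns None when fewer than two peaks are found.
    """
    profile = horizontal_projection(image)
    if not profile or max(profile) == 0:
        return None
    threshold = peak_fraction * max(profile)
    peaks = []
    band = []  # consecutive rows whose projection is at or above threshold
    # A trailing 0 acts as a sentinel that closes the last band.
    for y, value in enumerate(profile + [0]):
        if value >= threshold:
            band.append(y)
        elif band:
            # Keep the strongest row of the band as that text line's peak.
            peaks.append(max(band, key=lambda r: profile[r]))
            band = []
    if len(peaks) < 2:
        return None
    distances = [b - a for a, b in zip(peaks, peaks[1:])]
    return sum(distances) / len(distances)
```

For Arabic text the strongest row of a line usually coincides with the baseline, where most letters connect, so peak-to-peak distance is a natural proxy for line spacing.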

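The combining step of claims 12 and 19 can be illustrated with a short Python sketch. It assumes that components on the same text line are represented by bounding boxes `(left, top, right, bottom)` and that two consecutive components belong to the same sub-word or word when the horizontal gap between them is smaller than the estimated word spacing. That merge rule and the name `merge_by_spacing` are assumptions for illustration; the claims only state that components are combined "based on the word spacing".

```python
def merge_by_spacing(boxes, word_spacing):
    """Merge consecutive boxes on one line whose gap is below word_spacing.

    boxes: iterable of (left, top, right, bottom) tuples.
    Returns the merged boxes ordered by their left edge.
    """
    merged = []
    for box in sorted(boxes):
        if merged and box[0] - merged[-1][2] < word_spacing:
            # Close enough to the previous group: extend its bounding box.
            left, top, right, bottom = merged[-1]
            merged[-1] = (
                left,
                min(top, box[1]),
                max(right, box[2]),
                max(bottom, box[3]),
            )
        else:
            # Gap is at least one word spacing: start a new group.
            merged.append(box)
    return merged
```

Overlapping boxes, such as a letter body and its diacritic dots, produce a negative gap and are therefore always merged under this rule.
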
Current Assignee: King Abdul AZIZ City For Science and Technology
Original Assignee: King Abdul AZIZ City For Science and Technology
Inventors: Al-Omari, Hussein Khalid; Khorsheed, Mohammad Sulaiman
Primary Examiner(s): Bhatnagar, Anand
Assistant Examiner(s): PARK, SOO JIN
Application Number: US12/779,152
Publication Number: US 20110280477A1
Time in Patent Office: 754 Days
Field of Search: 382/171, 382/198, 382/296, 382/301, 382/186-189
US Class Current: 382/198
CPC Class Codes:
G06V 30/10   Character recognition
G06V 30/15   Cutting or merging image el...
G06V 30/414   Extracting the geometrical ...
