System and methods for Arabic text recognition based on effective Arabic text feature extraction
First Claim
1. A method for automatically recognizing Arabic text, comprising:
- acquiring a text image comprising one or more Arabic words each including one or more Arabic characters;
identifying a plurality of lines of Arabic text in the text image;
segmenting one of the plurality of lines of Arabic text into Arabic words;
digitizing at least one of the Arabic words to form a two-dimensional array of pixels each associated with a pixel value, wherein the pixel value is expressed in a binary number;
dividing the one of the Arabic words into a plurality of line images;
defining a plurality of cells in one of the plurality of line images, wherein each of the plurality of cells comprises a group of adjacent pixels;
serializing pixel values of pixels in each of the plurality of cells in one of the plurality of line images to form a binary cell number;
forming a text feature vector according to binary cell numbers obtained from the plurality of cells in one of the plurality of line images; and
feeding the text feature vector into a Hidden Markov Model to recognize the one or more Arabic words including the Arabic characters.
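The cell-serialization steps of this claim can be illustrated with a minimal sketch. It is not the patented implementation: it assumes that "line images" are vertical strips of the binarized word image, that cells are stacked blocks of rows within a strip, and that pixels are serialized row by row with the first pixel as the most significant bit. The names `split_into_line_images`, `cell_numbers`, `feature_vectors`, `strip_width`, and `cell_height` are hypothetical.

```python
from typing import List

Image = List[List[int]]  # rows of binary pixel values (0 or 1)


def split_into_line_images(word: Image, strip_width: int) -> List[Image]:
    """Cut a binarized word image into vertical strips (assumed "line images")."""
    width = len(word[0])
    return [
        [row[x:x + strip_width] for row in word]
        for x in range(0, width, strip_width)
    ]


def cell_numbers(line_image: Image, cell_height: int) -> List[int]:
    """Group adjacent rows into cells and serialize each cell into one integer."""
    numbers = []
    for top in range(0, len(line_image), cell_height):
        value = 0
        # Read the cell row by row; each pixel becomes the next lower bit.
        for row in line_image[top:top + cell_height]:
            for pixel in row:
                value = (value << 1) | pixel
        numbers.append(value)
    return numbers


def feature_vectors(word: Image, strip_width: int, cell_height: int) -> List[List[int]]:
    """One feature vector (list of binary cell numbers) per line image."""
    return [
        cell_numbers(strip, cell_height)
        for strip in split_into_line_images(word, strip_width)
    ]


word = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
print(feature_vectors(word, strip_width=2, cell_height=2))  # [[9, 12], [6, 3]]
```

The strips here are produced left to right; a real system for Arabic might order them right to left to follow the reading direction.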
Abstract
A method for automatically recognizing Arabic text includes building an Arabic corpus comprising Arabic text files written in different writing styles and ground truths corresponding to each of the Arabic text files, storing writing-style indices in association with the Arabic text files, digitizing an Arabic word to form an array of pixels, dividing the Arabic word into line images, forming a text feature vector from the line images, training a Hidden Markov Model using the Arabic text files and ground truths in the Arabic corpus in accordance with the writing-style indices, and feeding the text feature vector into a Hidden Markov Model to recognize the Arabic words.
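One way to picture the corpus described in the abstract is as a list of records, each pairing a text file with its ground truth and a writing-style index, grouped by style before training. The sketch below is only an assumed data layout; the record type `CorpusEntry`, the helper `group_by_style`, and the file names are hypothetical and do not come from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CorpusEntry:
    text_file: str      # path to an Arabic text file (hypothetical name)
    ground_truth: str   # transcription corresponding to that file
    style_index: int    # writing-style index stored with the file


def group_by_style(entries: List[CorpusEntry]) -> Dict[int, List[CorpusEntry]]:
    """Collect corpus entries per writing-style index, e.g. to train per style."""
    groups: Dict[int, List[CorpusEntry]] = defaultdict(list)
    for entry in entries:
        groups[entry.style_index].append(entry)
    return dict(groups)


corpus = [
    CorpusEntry("page_001.txt", "كتاب", style_index=0),
    CorpusEntry("page_002.txt", "قلم", style_index=1),
    CorpusEntry("page_003.txt", "باب", style_index=0),
]
by_style = group_by_style(corpus)
print({style: len(items) for style, items in by_style.items()})
```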
8 Citations
20 Claims
1. A method for automatically recognizing Arabic text, comprising:
acquiring a text image comprising one or more Arabic words each including one or more Arabic characters;
identifying a plurality of lines of Arabic text in the text image;
segmenting one of the plurality of lines of Arabic text into Arabic words;
digitizing at least one of the Arabic words to form a two-dimensional array of pixels each associated with a pixel value, wherein the pixel value is expressed in a binary number;
dividing the one of the Arabic words into a plurality of line images;
defining a plurality of cells in one of the plurality of line images, wherein each of the plurality of cells comprises a group of adjacent pixels;
serializing pixel values of pixels in each of the plurality of cells in one of the plurality of line images to form a binary cell number;
forming a text feature vector according to binary cell numbers obtained from the plurality of cells in one of the plurality of line images; and
feeding the text feature vector into a Hidden Markov Model to recognize the one or more Arabic words including the Arabic characters.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
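The final step of claim 1, feeding features into a Hidden Markov Model, can be illustrated with a toy Viterbi decoder. This is a generic textbook sketch under simplifying assumptions, not the model described in the patent: each observation is a single discrete symbol (for instance one binary cell number), whereas a practical recognizer would work on whole feature vectors with trained parameters. The state names and all probabilities below are made up for illustration.

```python
import math
from typing import Dict, Hashable, List, Sequence


def _log(p: float) -> float:
    """Logarithm that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(
    observations: Sequence[Hashable],
    states: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[Hashable, float]],
) -> List[str]:
    """Return the most likely hidden-state path for a discrete observation sequence."""
    scores = {
        s: _log(start_p[s]) + _log(emit_p[s].get(observations[0], 0.0))
        for s in states
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            prev = max(states, key=lambda r: scores[r] + _log(trans_p[r][s]))
            new_scores[s] = (
                scores[prev] + _log(trans_p[prev][s]) + _log(emit_p[s].get(obs, 0.0))
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical two-character model; observations are binary cell numbers.
states = ["alef", "ba"]
start_p = {"alef": 0.5, "ba": 0.5}
trans_p = {"alef": {"alef": 0.8, "ba": 0.2}, "ba": {"alef": 0.2, "ba": 0.8}}
emit_p = {"alef": {9: 0.9, 6: 0.1}, "ba": {9: 0.1, 6: 0.9}}
print(viterbi([9, 9, 6, 6], states, start_p, trans_p, emit_p))
# ['alef', 'alef', 'ba', 'ba']
```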
14. A method for automatically recognizing Arabic text, comprising:
acquiring a text image comprising one or more Arabic words each including one or more Arabic characters;
identifying a plurality of lines of Arabic text in the text image;
segmenting one of the plurality of lines of text into Arabic words;
digitizing at least one of the Arabic words to form a two-dimensional array of pixels each associated with a pixel value, wherein the pixel value is expressed in a binary number;
dividing the one of the Arabic words into a plurality of line images;
downsizing at least one of the plurality of line images to produce a downsized line image;
serializing pixel values of pixels in each column of the downsized line image to form a string of serialized numbers, wherein the string of serialized numbers forms a text feature vector; and
feeding the text feature vector into a Hidden Markov Model to recognize the one or more Arabic words including the Arabic characters.
Dependent claims: 15, 16, 17, 18, 19, 20
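The downsizing and column-serialization steps of claim 14 can be sketched as follows. The claim does not say how the line image is downsized, so this example assumes a simple block reduction in which an output pixel is 1 if any pixel in the corresponding block is 1. Columns are then read top to bottom, with the top pixel as the most significant bit. The names `downsize_line_image`, `serialize_columns`, and `factor` are hypothetical.

```python
from typing import List

Image = List[List[int]]  # rows of binary pixel values (0 or 1)


def downsize_line_image(image: Image, factor: int) -> Image:
    """Shrink by an integer factor; a block maps to 1 if it contains any 1 (assumed rule)."""
    height, width = len(image), len(image[0])
    return [
        [
            1 if any(
                image[y][x]
                for y in range(top, min(top + factor, height))
                for x in range(left, min(left + factor, width))
            ) else 0
            for left in range(0, width, factor)
        ]
        for top in range(0, height, factor)
    ]


def serialize_columns(image: Image) -> List[int]:
    """Serialize each column, top to bottom, into one integer per column."""
    serialized = []
    for x in range(len(image[0])):
        value = 0
        for row in image:
            value = (value << 1) | row[x]
        serialized.append(value)
    return serialized


line_image = [
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
small = downsize_line_image(line_image, factor=2)
print(small)                     # [[0, 1], [1, 0]]
print(serialize_columns(small))  # [1, 2]
```

The resulting list of column numbers plays the role of the text feature vector that the claim feeds into the Hidden Markov Model.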
Specification