SYSTEM AND METHODS FOR ARABIC TEXT RECOGNITION AND ARABIC CORPUS BUILDING

  • US 20130251247A1
  • Filed: 05/12/2013
  • Published: 09/26/2013
  • Est. Priority Date: 04/27/2009
  • Status: Active Grant
First Claim

1. A method for automatically recognizing Arabic text, comprising:

    building an Arabic corpus comprising Arabic text files and ground truths corresponding to each of the Arabic text files, wherein the Arabic text files include Arabic texts written in different writing styles;

    storing writing-style indices in association with the Arabic text files by a computer, wherein each of the writing-style indices indicates that one of the Arabic text files is written in one of the writing styles;

    acquiring a text image containing a line of Arabic characters;

    digitizing the line of the Arabic characters to form a two-dimensional array of pixels each associated with a pixel value, wherein the pixel value is expressed in a binary number;

    dividing the line of the Arabic characters into a plurality of line images;

    defining a plurality of cells in one of the plurality of line images, wherein each of the plurality of cells comprises a group of adjacent pixels;

    serializing pixel values of pixels in each of the plurality of cells in one of the plurality of line images to form a binary cell number;

    forming a text feature vector according to binary cell numbers obtained from the plurality of cells in one of the plurality of line images;

    training a Hidden Markov Model using the Arabic text files and ground truths in the Arabic corpus in accordance with the writing-style indices in association with the Arabic text files; and

    feeding the text feature vector into the Hidden Markov Model to recognize the line of Arabic characters.
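
The feature-extraction steps of the claim (digitizing, dividing the line into line images, defining cells, serializing cell pixels, and forming feature vectors) can be illustrated with a minimal sketch. This is not the patented implementation: the function names, the fixed threshold, the use of equal-width vertical strips as "line images", and the row-by-row serialization order are all assumptions made for illustration. Corpus building and Hidden Markov Model training are left out.

```python
from typing import List

Image = List[List[int]]  # rows of binary pixel values (0 or 1)


def binarize(gray: List[List[int]], threshold: int = 128) -> Image:
    """Digitize: map each grayscale pixel to 1 (ink) or 0 (background).

    Assumption: dark pixels are ink and a fixed threshold is good enough.
    """
    return [[1 if p < threshold else 0 for p in row] for row in gray]


def split_into_line_images(img: Image, width: int) -> List[Image]:
    """Divide the text line into vertical strips of `width` columns.

    Assumption: "line images" are equal-width, non-overlapping strips;
    leftover columns that do not fill a whole strip are dropped.
    """
    n_cols = len(img[0])
    return [
        [row[x:x + width] for row in img]
        for x in range(0, n_cols - width + 1, width)
    ]


def cell_numbers(line_image: Image, cell_h: int, cell_w: int) -> List[int]:
    """Serialize the pixels of each cell into one binary cell number.

    Assumption: pixels are read row by row, most significant bit first.
    """
    numbers = []
    for top in range(0, len(line_image) - cell_h + 1, cell_h):
        for left in range(0, len(line_image[0]) - cell_w + 1, cell_w):
            value = 0
            for r in range(top, top + cell_h):
                for c in range(left, left + cell_w):
                    value = (value << 1) | line_image[r][c]
            numbers.append(value)
    return numbers


def feature_vectors(img: Image, strip_w: int, cell_h: int, cell_w: int) -> List[List[int]]:
    """Form one feature vector of binary cell numbers per line image."""
    return [
        cell_numbers(strip, cell_h, cell_w)
        for strip in split_into_line_images(img, strip_w)
    ]


# Hypothetical 4x4 grayscale "text line" (0 = black ink, 255 = white).
gray = [
    [0, 255, 255, 0],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 0, 0, 255],
]
vectors = feature_vectors(binarize(gray), strip_w=2, cell_h=2, cell_w=2)
```

In this sketch each strip yields one feature vector, so the sequence of vectors along the line is what would be fed to a sequence model such as the Hidden Markov Model named in the claim.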
