Method for retrieval of Arabic historical manuscripts

US 9,075,846 B2
Filed: 12/12/2012
Issued: 07/07/2015
Est. Priority Date: 12/12/2012
Status: Expired due to Fees

Abstract

The method for retrieval of Arabic historical manuscripts using Latent Semantic Indexing approaches the problem of manuscripts indexing and retrieval by automatic indexing of Arabic historical manuscripts through word spotting, using “Text Image” similarity of keywords. The similarity is computed using Latent Semantic Indexing (LSI). The method involves a manuscript page preprocessing step, a segmentation step, and a feature extraction step. Feature extraction utilizes a circular polar grid feature set. Once the salient features have been extracted, indexing of historical Arabic manuscripts using LSI is performed in support of content-based image retrieval (CBIR).
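
As a rough illustration of the LSI step described above, the following Python sketch builds a rank-k approximation of a term-by-document matrix with a singular value decomposition and ranks the stored images by cosine similarity to a query vector. It assumes NumPy is available. The function names `reduce_rank` and `rank_by_cosine`, the value of `k`, and the dense-matrix representation are illustrative assumptions, not details taken from the patent.

```python
import numpy as np


def reduce_rank(term_doc, k):
    """Return the rank-k approximation of an m x n term-by-document matrix.

    Rows are features and columns are images. The result keeps the m x n
    shape but only the k largest singular values contribute to it.
    """
    term_doc = np.asarray(term_doc, dtype=float)
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Scale the first k left singular vectors by their singular values,
    # then project back with the first k right singular vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


def rank_by_cosine(query, reduced):
    """Rank document columns by cosine similarity to the query vector.

    Returns (order, sims): document indices from most to least similar,
    and the cosine similarity of each document to the query.
    """
    query = np.asarray(query, dtype=float)
    reduced = np.asarray(reduced, dtype=float)
    denom = np.linalg.norm(reduced, axis=0) * np.linalg.norm(query)
    # Columns (or a query) with zero norm get similarity 0 instead of NaN.
    sims = np.divide(query @ reduced, denom,
                     out=np.zeros_like(denom), where=denom > 0)
    order = np.argsort(-sims)
    return order, sims
```

A query image would first be converted into a feature vector with the same m features as the rows of the matrix. The entries of `order` then give the column indices of the stored images, best match first.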

13 Claims

1. A computer-implemented method for retrieval of Arabic historical manuscripts, comprising the steps of:
- entering Arabic historical manuscript images into a computer for processing;
- extracting circular polar grid features from the Arabic historical manuscript images stored in the computer, wherein the step of extracting circular polar grid features comprises;
  - building a circular polar grid from a multiline-axis including an intersection of a 0° line, a 45° line, a 90° line and a 135° line;
  - overlaying concentric circles centered about the intersection point of said multiline-axis, the concentric circles having radial values of r, 2r, 3r, . . . nr; and
  - centering said circular polar grid at a centroid of an image term to be indexed;
- constructing a Latent Semantic Index based on the extracted circular polar grid features, the Latent Semantic Index having a reduced dimension m×n Term-by-Document matrix obtained from a Singular Value Decomposition of a higher dimensional Term-by-Document matrix constructed by the computer from the extracted circular polar grid features, wherein m rows represent the features and n columns represent the images;
- accepting a user query against the stored Arabic historical manuscript images, the computer forming the user query as a query vector derived from features extraction of a query image supplied by the user;
- performing query matching based on comparison between the query vector and the Term-by-Document matrix;
- weighing each term of said Term-by-Document matrix by a value representing an occurrence frequency of a feature of said term in said document, wherein the step of weighing each term of said Term-by-Document matrix comprises;
  - picking a comprehensive training set of said document for each said feature;
  - calculating a mean μ_f and a standard deviation σ_f of the features f's value across the training set; and
  - for each image in the collection, defining an occurrence count O_fj of feature f according to the relation;
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The computer-implemented method according to claim 1, wherein said number of concentric circles is four, thereby defining 32 regions of intersection between said multilines and said concentric circles.
  - 3. The computer-implemented method according to claim 1, further comprising the step of, for each of said regions, normalizing said count.
  - 4. The computer-implemented method according to claim 1, further comprising the step of calculating said predetermined distance measurement as a cosine between said query vector, and said Term-by Document matrix.
  - 5. The computer-implemented method according to claim 1, further comprising preprocessing steps adapted for enhancing efficiency of said circular polar grid features extraction step.
  - 6. The computer-implemented method according to claim 5, wherein said preprocessing steps include an RGB conversion procedure comprising the steps of:
    - converting said Arabic historical manuscript images from an RGB color space to gray-scale images;
    - converting said gray-scale images to binary images by performing calculations characterized by the relation;
  - 7. The computer-implemented method according to claim 5, wherein said preprocessing steps include a smoothing and noise removal procedure comprising the steps of:
    - accepting as input binary versions of said Arabic historical manuscript images;
    - providing as output said binary versions of said Arabic historical manuscript images processed according to rules characterized by the relation if P₀=0 then;
  - 8. The computer-implemented method according to claim 5, wherein said preprocessing steps include a segmentation procedure comprising the steps of:
    - determining a baseline of text of said Arabic historical manuscripts images by calculating a horizontal projection profile, said horizontal projection profile calculation being characterized by the relation P_i = Σ_j Img(i,j), where P(i,j) is the horizontal projection of the image for row i, and the Img(i,j) is the pixel value at (i, j);
    - based on said baseline determination, segmenting a line image to connected component images comprised of subwords of said images; and
    - tagging each said subword with page number and line number information to facilitate storage and retrieval of said image subword.
  - 9. The computer-implemented method according to claim 8, wherein said returned document images display step further comprises displaying image thumbnails of said image subwords matched according to said query matching process.
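
The horizontal projection profile of claim 8 can be sketched in a few lines of Python. The sketch assumes a binary image stored as a list of rows in which ink pixels are 1 and background pixels are 0. The claim does not say how the baseline is derived from the profile, so taking the row with the largest projection value is an assumption made here for illustration, and the function names are hypothetical.

```python
def horizontal_projection(img):
    """Return P, where P[i] is the sum of pixel values Img(i, j) over row i."""
    return [sum(row) for row in img]


def estimate_baseline(img):
    """Return the index of the row with the largest projection value.

    Assumption: the text baseline is the row containing the most ink.
    On ties, the topmost such row is returned.
    """
    profile = horizontal_projection(img)
    return max(range(len(profile)), key=profile.__getitem__)
```

For a line image whose densest row is row 2, `estimate_baseline` returns 2. The connected-component segmentation and the page and line tagging that follow in claim 8 are not covered by this sketch.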

10. A computer software product embedded in a non-transitory storage medium readable by a processor, the non-transitory storage medium having stored thereon a set of instructions which, when executed by the processor, causes a computer to perform retrieval of Arabic historical manuscripts using Latent Semantic Indexing, comprising:
- (a) a first sequence of instructions which, when executed by the processor, causes said processor to accept in main memory storage Arabic historical manuscript images for processing;
- (b) a second sequence of instructions which, when executed by the processor, causes said processor to extract circular polar grid features from said Arabic historical manuscript images stored in said main memory storage;
- (c) a third sequence of instructions which, when executed by the processor, causes said processor to construct a Latent Semantic Index based on said extracted circular polar grid features, said Latent Semantic Index being comprised of a reduced dimension m×n Term-by-Document matrix obtained from a Singular Value Decomposition of a higher dimensional Term-by-Document matrix constructed by said computer from said extracted circular polar grid features, wherein m rows represent said features and n columns represent said images;
- (d) a fourth sequence of instructions which, when executed by the processor, causes said processor to accept a user query against said stored Arabic historical manuscript images, and to form said user query as a query vector derived from features extraction of a query image supplied by said user;
- (e) a fifth sequence of instructions which, when executed by the processor, causes said processor to perform query matching based on comparison between said query vector and said Term-by-Document matrix;
- (f) a sixth sequence of instructions which, when executed by the processor, causes said processor to display Arabic historical document images returned by said query matching process, said returned document images being ranked by similarity to said user query according to a predetermined distance measurement between said query vector and said Term-by-Document matrix;
- (g) a seventh sequence of instructions which, when executed by the processor, causes said processor to build said circular polar grid from a multiline-axis comprised of the intersection of a 0° line, a 45° line, a 90° line and a 135° line;
- (h) an eighth sequence of instructions which, when executed by the processor, causes said processor to overlay concentric circles centered about the intersection point of said multiline-axis, said concentric circles having radial values of r, 2r, 3r, . . . nr; and
- (i) a ninth sequence of instructions which, when executed by the processor, causes said processor to center said circular polar grid at a centroid of an image term to be indexed by said retrieval process;
- (j) a tenth sequence of instructions which, when executed by the processor, causes said processor to determine a plurality of image features defined by a count of black image pixels found in regions of intersection between said multilines and said concentric circles;
- (k) an eleventh sequence of instructions which, when executed by the processor, causes said processor to weigh each term of said Term-by-Document matrix by a value representing an occurrence frequency of a feature of said term in said document;
- (l) a twelfth sequence of instructions which, when executed by the processor, causes said processor to pick a comprehensive training set of said document for each said feature;
- (m) a thirteenth sequence of instructions which, when executed by the processor, causes said processor to calculate the mean μ_f and the standard deviation σ_f of the features f's value across the training set and, for each image in the collection, causes said processor to define the occurrence count O_fj of feature f according to the relation
- Dependent claims (11, 12, 13):
  - 11. The computer software product according to claim 10, further comprising a twentieth sequence of instructions which, when executed by the processor, causes said processor to use four concentric circles thereby defining 32 said regions of intersection between said multilines and said concentric circles.
  - 12. The computer software product according to claim 10, further comprising a twenty-first sequence of instructions which, when executed by the processor, causes said processor to normalize said count for each of said regions.
  - 13. The computer software product according to claim 10, further comprising a twenty-second sequence of instructions which, when executed by the processor, causes said processor to calculate said predetermined distance measurement as a cosine between said query vector, and said Term-by Document matrix.
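
The circular polar grid features recited in the claims (four axis lines through the centroid, n concentric circles, a count of black pixels per region) can be sketched as follows. This is a minimal Python illustration under several assumptions that the claims do not fix: the binary image is a list of rows with ink pixels equal to 1, the four lines are read as dividing the plane into eight 45° sectors, the radius step r defaults to the largest ink distance divided by n, and normalization divides each count by the total ink count. The function name `polar_grid_features` is hypothetical.

```python
import math


def polar_grid_features(img, n_circles=4, r=None, normalize=True):
    """Count ink pixels in each (ring, sector) region around the centroid.

    Returns a list of n_circles * 8 values, indexed as ring * 8 + sector.
    """
    n_regions = n_circles * 8
    ink = [(y, x) for y, row in enumerate(img)
           for x, v in enumerate(row) if v]
    if not ink:
        return [0.0] * n_regions

    # Centroid of the ink pixels: the grid is centered here.
    cy = sum(y for y, _ in ink) / len(ink)
    cx = sum(x for _, x in ink) / len(ink)

    if r is None:
        # Assumption: the outermost circle just encloses all ink pixels.
        max_d = max(math.hypot(y - cy, x - cx) for y, x in ink)
        r = max_d / n_circles if max_d > 0 else 1.0

    counts = [0] * n_regions
    for y, x in ink:
        d = math.hypot(y - cy, x - cx)
        if d > n_circles * r + 1e-9:
            continue  # pixel lies outside the outermost circle
        ring = min(int(d / r), n_circles - 1)
        # Image rows grow downward, so flip y to measure angles
        # counter-clockwise from the 0 degree line.
        theta = math.atan2(cy - y, x - cx) % (2 * math.pi)
        sector = int(theta / (math.pi / 4)) % 8
        counts[ring * 8 + sector] += 1

    if not normalize:
        return counts
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_regions
```

With the default of four circles the function returns 32 values, matching the 32 regions of claims 2 and 11. The resulting vector for one subword image would form one column of the term-by-document matrix, or the query vector when computed from a query image.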

Current Assignee: King Fahd University of Petroleum & Minerals (Government of Saudi Arabia)
Original Assignee: King Fahd University of Petroleum & Minerals (Government of Saudi Arabia)
Inventors: Yahia, Mohammad Husni Najib; Al-Khatib, Wasfi G.
Primary Examiner(s): Corrielus, Jean M
Application Number: US13/712,773
Publication Number: US 20140164370A1
Time in Patent Office: 937 Days
Field of Search: 707/730, 704 1- 10
US Class Current: 1/1
CPC Class Codes:
G06F 16/24575 using context
G06F 16/35 Clustering; Classification
