Method for retrieval of Arabic historical manuscripts
First Claim
1. A computer-implemented method for retrieval of Arabic historical manuscripts, comprising the steps of:
- entering Arabic historical manuscript images into a computer for processing;
- extracting circular polar grid features from the Arabic historical manuscript images stored in the computer, wherein the step of extracting circular polar grid features comprises;
  - building a circular polar grid from a multiline-axis including an intersection of a 0° line, a 45° line, a 90° line and a 135° line;
  - overlaying concentric circles centered about the intersection point of said multiline-axis, the concentric circles having radial values of r, 2r, 3r, . . . nr; and
  - centering said circular polar grid at a centroid of an image term to be indexed;
- constructing a Latent Semantic Index based on the extracted circular polar grid features, the Latent Semantic Index having a reduced dimension m×n Term-by-Document matrix obtained from a Singular Value Decomposition of a higher dimensional Term-by-Document matrix constructed by the computer from the extracted circular polar grid features, wherein m rows represent the features and n columns represent the images;
- accepting a user query against the stored Arabic historical manuscript images, the computer forming the user query as a query vector derived from features extraction of a query image supplied by the user;
- performing query matching based on comparison between the query vector and the Term-by-Document matrix;
- weighing each term of said Term-by-Document matrix by a value representing an occurrence frequency of a feature of said term in said document, wherein the step of weighing each term of said Term-by-Document matrix comprises;
  - picking a comprehensive training set of said document for each said feature;
  - calculating a mean μf and a standard deviation σf of the features f's value across the training set; and
  - for each image in the collection, defining an occurrence count Ofj of feature f according to the relation;
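The grid construction recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes a binary image given as rows of 0/1 values with 1 meaning a black pixel, takes the centroid to be the mean position of the black pixels, treats the four lines through the centroid as cutting the plane into eight 45° sectors, and uses the count of black pixels in each ring-and-sector cell as the feature. The function name `polar_grid_features` and the choice to ignore pixels beyond the outermost circle are assumptions made for this example.

```python
import math

def polar_grid_features(image, r, n_rings):
    """Count black pixels in each (ring, sector) cell of a circular polar grid.

    image   : list of rows, each a list of 0/1 values (1 = black; assumed encoding)
    r       : radius step, so the circles have radii r, 2r, ..., n_rings * r
    n_rings : number of concentric circles
    Returns a flat list of n_rings * 8 counts, ordered ring by ring.
    """
    black = [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v]
    counts = [0] * (n_rings * 8)
    if not black:
        return counts

    # Grid centre: centroid of the black pixels (assumption for "image term").
    cx = sum(x for x, _ in black) / len(black)
    cy = sum(y for _, y in black) / len(black)

    for x, y in black:
        dx, dy = x - cx, cy - y          # flip y so angles grow counter-clockwise
        ring = int(math.hypot(dx, dy) // r)
        if ring >= n_rings:
            continue                     # outside the outermost circle: ignored here
        angle = math.degrees(math.atan2(dy, dx)) % 360.0
        sector = int(angle // 45.0) % 8  # 0°, 45°, 90°, 135° lines give 8 sectors
        counts[ring * 8 + sector] += 1
    return counts
```

Each word image then yields one feature vector of length 8n, which can serve as one column of the Term-by-Document matrix.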
Abstract
The method for retrieval of Arabic historical manuscripts using Latent Semantic Indexing approaches the problem of manuscript indexing and retrieval by automatically indexing Arabic historical manuscripts through word spotting, using “Text Image” similarity of keywords. The similarity is computed using Latent Semantic Indexing (LSI). The method involves a manuscript page preprocessing step, a segmentation step, and a feature extraction step. Feature extraction utilizes a circular polar grid feature set. Once the salient features have been extracted, indexing of historical Arabic manuscripts using LSI is performed in support of content-based image retrieval (CBIR).
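The LSI step summarized in the abstract can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the patent's code: the helper names `build_lsi` and `rank_documents` are hypothetical, the query is folded into the latent space with the standard LSI projection (the query vector multiplied by the truncated left singular vectors and divided by the singular values), and cosine similarity stands in for the similarity measure, which the text above does not specify.

```python
import numpy as np

def build_lsi(A, k):
    """Rank-k SVD factors of an m x n Term-by-Document matrix A.

    Rows of A are features, columns are images. k must not exceed rank(A),
    otherwise a zero singular value would be divided by later.
    """
    U, s, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def rank_documents(q, Uk, sk, Vtk):
    """Rank images by cosine similarity to query feature vector q in latent space."""
    q_hat = (Uk.T @ np.asarray(q, dtype=float)) / sk   # fold the query into k dims
    docs = Vtk.T                                        # one row per image
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat)
    sims = (docs @ q_hat) / (norms + 1e-12)             # small constant avoids 0/0
    order = np.argsort(-sims)                           # most similar first
    return order, sims

# Toy matrix: 5 features x 4 images, with two visibly similar pairs of columns.
A = np.array([[5, 4, 0, 0],
              [4, 5, 0, 1],
              [0, 0, 5, 4],
              [0, 1, 4, 5],
              [1, 0, 0, 1]], dtype=float)
Uk, sk, Vtk = build_lsi(A, k=2)
order, sims = rank_documents(A[:, 2], Uk, sk, Vtk)
```

In this toy run the query is the third column itself, so that image and its near-duplicate in the fourth column should come out on top of the ranking.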
13 Claims
10. A computer software product embedded in a non-transitory storage medium readable by a processor, the non-transitory storage medium having stored thereon a set of instructions which, when executed by the processor, causes a computer to perform retrieval of Arabic historical manuscripts using Latent Semantic Indexing, comprising:
(a) a first sequence of instructions which, when executed by the processor, causes said processor to accept in main memory storage Arabic historical manuscript images for processing;
(b) a second sequence of instructions which, when executed by the processor, causes said processor to extract circular polar grid features from said Arabic historical manuscript images stored in said main memory storage;
(c) a third sequence of instructions which, when executed by the processor, causes said processor to construct a Latent Semantic Index based on said extracted circular polar grid features, said Latent Semantic Index being comprised of a reduced dimension m×n Term-by-Document matrix obtained from a Singular Value Decomposition of a higher dimensional Term-by-Document matrix constructed by said computer from said extracted circular polar grid features, wherein m rows represent said features and n columns represent said images;
(d) a fourth sequence of instructions which, when executed by the processor, causes said processor to accept a user query against said stored Arabic historical manuscript images, and to form said user query as a query vector derived from features extraction of a query image supplied by said user;
(e) a fifth sequence of instructions which, when executed by the processor, causes said processor to perform query matching based on comparison between said query vector and said Term-by-Document matrix;
(f) a sixth sequence of instructions which, when executed by the processor, causes said processor to display Arabic historical document images returned by said query matching process, said returned document images being ranked by similarity to said user query according to a predetermined distance measurement between said query vector and said Term-by-Document matrix;
(g) a seventh sequence of instructions which, when executed by the processor, causes said processor to build said circular polar grid from a multiline-axis comprised of the intersection of a 0° line, a 45° line, a 90° line and a 135° line;
(h) an eighth sequence of instructions which, when executed by the processor, causes said processor to overlay concentric circles centered about the intersection point of said multiline-axis, said concentric circles having radial values of r, 2r, 3r, . . . nr; and
(i) a ninth sequence of instructions which, when executed by the processor, causes said processor to center said circular polar grid at a centroid of an image term to be indexed by said retrieval process;
(j) a tenth sequence of instructions which, when executed by the processor, causes said processor to determine a plurality of image features defined by a count of black image pixels found in regions of intersection between said multilines and said concentric circles;
(k) an eleventh sequence of instructions which, when executed by the processor, causes said processor to weigh each term of said Term-by-Document matrix by a value representing an occurrence frequency of a feature of said term in said document;
(l) a twelfth sequence of instructions which, when executed by the processor, causes said processor to pick a comprehensive training set of said document for each said feature;
(m) a thirteenth sequence of instructions which, when executed by the processor, causes said processor to calculate the mean μf and the standard deviation σf of the features f's value across the training set and, for each image in the collection, causes said processor to define the occurrence count Ofj of feature f according to the relation
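The statistics named in elements (l) and (m) can be illustrated with a short Python sketch. It only computes the per-feature mean μf and standard deviation σf over a training set of feature vectors. The relation that turns these into the occurrence count Ofj is not reproduced in the text above, so the sketch deliberately stops before it. The function name `feature_statistics` and the use of the population standard deviation (rather than the sample one) are assumptions.

```python
from statistics import fmean, pstdev

def feature_statistics(training_vectors):
    """Per-feature mean and standard deviation across a training set.

    training_vectors : list of equal-length feature vectors, one per training image
    Returns (means, stds), each a list with one entry per feature f.
    """
    # zip(*...) regroups the values so that each tuple holds one feature's
    # values across all training images.
    columns = list(zip(*training_vectors))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]   # population std: an assumption
    return means, stds
```

A feature whose value never varies across the training set gets σf = 0, which any weighting built on these statistics would need to handle explicitly.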
Specification