Image-domain script and language identification
Abstract
Disclosed herein is a method, computer system and computer program product for identifying a writing system associated with a document image containing one or more words written in the writing system. Initially, a document image fragment is identified based on the document image, wherein the document image fragment contains one or more pixels from one or more of the words in the document image. A set of sequential features associated with the document image fragment is generated, wherein each sequential feature describes one dimensional graphic information derived from the one or more pixels in the document image fragment. A classification score for the document image fragment is generated responsive at least in part to the set of sequential features, the classification score indicating a likelihood that the document image fragment is written in the writing system. The writing system associated with the document image is identified based at least in part on the classification score for the document image fragment.
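As an illustration of what a "sequential feature" describing one-dimensional graphic information might look like, the sketch below walks a binarized fragment column by column and emits one quantized ink count per column. This section does not say which features the disclosed method actually computes, so the column ink count, the function name `column_features`, and the `levels` parameter are all assumptions made for the example.

```python
from typing import List, Sequence


def column_features(fragment: Sequence[Sequence[int]], levels: int = 4) -> List[int]:
    """Map each pixel column of a binary fragment to a quantized ink count.

    `fragment` is a list of rows; a nonzero value marks an inked pixel.
    This is a hypothetical feature, not the one defined by the source.
    """
    if not fragment:
        return []
    height = len(fragment)
    width = len(fragment[0])
    features = []
    for x in range(width):
        # Count inked pixels in column x (a vertical projection).
        ink = sum(1 for row in fragment if row[x])
        # Quantize the inked fraction into buckets 0 .. levels - 1.
        bucket = min(levels - 1, ink * levels // height)
        features.append(bucket)
    return features


# Toy 4x4 fragment, invented for the example.
fragment = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 1],
]
print(column_features(fragment))  # [0, 3, 2, 1]
```

Reading the fragment left to right yields an ordered sequence, which is what makes the later n-gram step meaningful: order is preserved, while the two-dimensional image is reduced to one value per position.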
40 Citations
28 Claims
1. A computer-implemented method of identifying a writing system associated with a document image containing one or more words written in the writing system, the method comprising:
identifying a document image fragment based on the document image, wherein the document image fragment contains one or more pixels from one or more of the words in the document image;
generating a set of sequential features associated with the document image fragment, wherein each sequential feature describes one dimensional graphic information derived from the one or more pixels in the document image fragment;
identifying a plurality of n-grams based on the set of sequential features, wherein each n-gram comprises an ordered subset of sequential features;
generating a classification score for the document image fragment based at least in part on a frequency of occurrence of the n-grams in sets of sequential features associated with known writing systems, the classification score indicating a likelihood that the document image fragment is written in the writing system; and
identifying the writing system associated with the document image based at least in part on the classification score for the document image fragment.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 26, 27, 28
11. A non-transitory computer-readable storage medium encoded with executable computer program code for identifying a writing system associated with a document image containing one or more words written in the writing system, the program code comprising:
program code for identifying a document image fragment based on the document image, wherein the document image fragment contains one or more pixels from one or more of the words in the document image;
program code for generating a set of sequential features associated with the document image fragment, wherein each sequential feature describes one dimensional graphic information derived from the one or more pixels in the document image fragment;
program code for identifying a plurality of n-grams based on the set of sequential features, wherein each n-gram comprises an ordered subset of sequential features;
program code for generating a classification score for the document image fragment based at least in part on a frequency of occurrence of the n-grams in sets of sequential features associated with known writing systems, the classification score indicating a likelihood that the document image fragment is written in the writing system; and
program code for identifying the writing system associated with the document image based at least in part on the classification score for the document image fragment.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
21. A computer system for identifying a writing system associated with a document image containing one or more words written in the writing system, the system comprising:
a non-transitory computer-readable storage medium encoded with executable computer program code comprising:
  a segmentation module adapted to identify a document image fragment based on the document image, wherein the document image fragment contains one or more pixels from one or more of the words in the document image;
  a feature extraction module adapted to generate a set of sequential features associated with the document image fragment, wherein each sequential feature describes one dimensional graphic information derived from the one or more pixels in the document image fragment;
  a feature classification module adapted to:
    identify a plurality of n-grams based on the set of sequential features, wherein each n-gram comprises an ordered subset of sequential features; and
    generate a classification score for the document image based at least in part on a frequency of occurrence of the n-grams in sets of sequential features associated with known writing systems, the classification score indicating a likelihood that the document image fragment is written in the writing system; and
  an optical character recognition module adapted to identify the writing system associated with the document image based at least in part on the classification score for the document image fragment; and
a processor for executing the computer program code.
Dependent claims: 22, 23, 24, 25
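The n-gram and classification-score steps shared by the independent claims can be sketched as follows. The claims only require that the score depend on how often the fragment's n-grams occur in feature sequences from known writing systems; the add-one smoothed log-likelihood used here is one plausible choice, not something the claims specify. The function names, the labels `script_a` and `script_b`, and the training sequences are hypothetical and invented for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def ngrams(features: Sequence[int], n: int) -> List[Tuple[int, ...]]:
    """Return every run of n consecutive features, in order."""
    return [tuple(features[i:i + n]) for i in range(len(features) - n + 1)]


def train(samples: Dict[str, List[Sequence[int]]], n: int) -> Dict[str, Counter]:
    """Count n-gram occurrences per known writing system."""
    models: Dict[str, Counter] = {}
    for system, sequences in samples.items():
        counts: Counter = Counter()
        for seq in sequences:
            counts.update(ngrams(seq, n))
        models[system] = counts
    return models


def identify(features: Sequence[int], models: Dict[str, Counter], n: int):
    """Score a fragment against each system and return the best label.

    The score is an add-one smoothed log-likelihood (an assumption).
    """
    vocab = set().union(*(m.keys() for m in models.values()))
    vocab_size = len(vocab) + 1  # one extra slot for unseen n-grams
    scores = {}
    for system, counts in models.items():
        total = sum(counts.values())
        scores[system] = sum(
            math.log((counts[g] + 1) / (total + vocab_size))
            for g in ngrams(features, n)
        )
    return max(scores, key=scores.get), scores


# Made-up training sequences for two hypothetical writing systems.
models = train(
    {
        "script_a": [[0, 3, 2, 1, 0, 3, 2, 1]],
        "script_b": [[1, 1, 0, 0, 1, 1, 0, 0]],
    },
    n=2,
)
best, scores = identify([0, 3, 2, 1], models, n=2)
print(best)  # script_a
```

A document-level decision could then combine the scores of several fragments, for example by summing log-likelihoods per writing system, though how the fragments are combined is left open by the text above.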
Specification