Image-domain script and language identification

  • US 8,233,726 B1
  • Filed: 11/27/2007
  • Issued: 07/31/2012
  • Est. Priority Date: 11/27/2007
  • Status: Expired due to Fees
First Claim

1. A computer-implemented method of identifying a writing system associated with a document image containing one or more words written in the writing system, the method comprising:

  • identifying a document image fragment based on the document image, wherein the document image fragment contains one or more pixels from one or more of the words in the document image;

  • generating a set of sequential features associated with the document image fragment, wherein each sequential feature describes one dimensional graphic information derived from the one or more pixels in the document image fragment;

  • identifying a plurality of n-grams based on the set of sequential features, wherein each n-gram comprises an ordered subset of sequential features;

  • generating a classification score for the document image fragment based at least in part on a frequency of occurrence of the n-grams in sets of sequential features associated with known writing systems, the classification score indicating a likelihood that the document image fragment is written in the writing system; and

  • identifying the writing system associated with the document image based at least in part on the classification score for the document image fragment.
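
The steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation; every concrete choice here is an assumption made for illustration: the fragment is a small binary pixel grid, each sequential feature is the ink count of one pixel column, n-grams are bigrams of those counts, and the classification score is an add-one-smoothed log-likelihood computed from n-gram frequencies in labeled training fragments. All function names (`sequential_features`, `ngrams`, `train`, `score`, `identify`) are hypothetical.

```python
from collections import Counter
import math


def sequential_features(fragment):
    """Assumed feature: ink count per pixel column (one 1-D value per column)."""
    return [sum(column) for column in zip(*fragment)]


def ngrams(features, n=2):
    """Ordered, overlapping subsets of n consecutive sequential features."""
    return [tuple(features[i:i + n]) for i in range(len(features) - n + 1)]


def train(samples, n=2):
    """Count n-gram frequencies per known writing system.

    samples maps a writing-system label to a list of fragments.
    """
    model = {}
    for system, fragments in samples.items():
        counts = Counter()
        for fragment in fragments:
            counts.update(ngrams(sequential_features(fragment), n))
        model[system] = counts
    return model


def score(fragment, counts, vocab_size, n=2):
    """Add-one-smoothed log-likelihood of the fragment's n-grams (an assumed scoring rule)."""
    total = sum(counts.values())
    return sum(
        math.log((counts[gram] + 1) / (total + vocab_size))
        for gram in ngrams(sequential_features(fragment), n)
    )


def identify(fragment, model, n=2):
    """Return the best-scoring writing system and all per-system scores."""
    vocabulary = set()
    for counts in model.values():
        vocabulary.update(counts)
    vocab_size = len(vocabulary) + 1  # +1 reserves mass for unseen n-grams
    scores = {
        system: score(fragment, counts, vocab_size, n)
        for system, counts in model.items()
    }
    return max(scores, key=scores.get), scores


# Toy data: two made-up "writing systems" with different column patterns.
training = {
    "striped": [[[1, 0, 1, 0], [1, 0, 1, 0]]],
    "flat": [[[1, 1, 1, 1], [0, 0, 0, 0]]],
}
model = train(training)
best, scores = identify([[1, 0, 1], [1, 0, 1]], model)
print(best, scores)
```

A real system would likely use richer one-dimensional features and aggregate scores over many fragments before deciding on the writing system for the whole document image; the sketch only shows how the claimed steps chain together.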

  • 2 Assignments