
Fast identification of text intensive pages from photographs

  • US 10,372,981 B1
  • Filed: 09/22/2016
  • Issued: 08/06/2019
  • Est. Priority Date: 09/23/2015
  • Status: Active Grant
First Claim

1. A method of determining if a document is a text page, comprising:

  • partitioning the document into a first plurality of cells;

    scaling each respective cell of the first plurality of cells to a standardized number of pixels to provide a first set of snippets, wherein the first set of snippets correspond to the first plurality of cells;

    using a classifier to examine the first set of snippets to determine which of the first plurality of cells are classified as text and which of the cells are not classified as text;

    determining a volume of text for the document based on a determined total amount of text in the document corresponding to a sum of an amount of text in each of the first plurality of cells classified as text;

    determining whether the determined total amount of text in the document meets a pre-determined threshold;

    in response to determining that the determined total amount of text in the document does not meet the pre-determined threshold and that criteria for a next partition level are met, for respective cells that are not classified as text:

    a) applying further partitioning to the respective cells that are not classified as text to generate a respective further plurality of cells;

    b) scaling each of the further partitioned cells to provide a respective further set of snippets that correspond to the further partitioned cells;

    c) using the classifier to examine the respective further set of snippets to determine which of the respective further partitioned cells are classified as text and which of the respective further partitioned cells are not classified as text;

    d) augmenting the volume of text for the document based on an updated determined total amount of text in the document corresponding to a sum of the amount of text in each of the cells of the plurality of cells and the respective further partitioned cells classified as text;

    in response to determining that the augmented volume of text for the document does not meet the pre-determined threshold and that criteria for a next partition level are met, repeating a)-d);

    and determining that the document is a text page in response to the determined total amount of text in the document exceeding the pre-determined threshold.
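
The steps recited in claim 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the claim does not specify how the "amount of text" is measured (here it is assumed to be the fraction of page area covered by text cells), what the "criteria for a next partition level" are (here, a maximum depth and a minimum cell size), how cells are scaled (here, nearest-neighbour sampling), or what the classifier is (here, a toy stand-in). All function and parameter names (`partition`, `make_snippet`, `is_text_page`, `all_dark`, `grid`, `snippet_size`, `max_levels`, `min_cell`) are hypothetical.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]            # grayscale page as rows of pixel values
Box = Tuple[int, int, int, int]    # (top, left, height, width)


def partition(box: Box, grid: int) -> List[Box]:
    """Split a box into a grid x grid set of cells (empty cells are skipped)."""
    top, left, h, w = box
    cells = []
    for r in range(grid):
        for c in range(grid):
            y0, y1 = top + r * h // grid, top + (r + 1) * h // grid
            x0, x1 = left + c * w // grid, left + (c + 1) * w // grid
            if y1 > y0 and x1 > x0:
                cells.append((y0, x0, y1 - y0, x1 - x0))
    return cells


def make_snippet(image: Image, box: Box, size: int) -> Image:
    """Scale a cell to size x size pixels (assumed: nearest-neighbour sampling)."""
    top, left, h, w = box
    return [[image[top + i * h // size][left + j * w // size]
             for j in range(size)] for i in range(size)]


def is_text_page(image: Image, classifier: Callable[[Image], bool],
                 threshold: float, grid: int = 2, snippet_size: int = 4,
                 max_levels: int = 3, min_cell: int = 2) -> bool:
    """Return True once the accumulated text volume meets the threshold."""
    height, width = len(image), len(image[0])
    page_area = height * width
    pending = partition((0, 0, height, width), grid)
    text_area = 0                                  # running volume of text
    for _ in range(max_levels):
        non_text = []
        for box in pending:
            if classifier(make_snippet(image, box, snippet_size)):
                text_area += box[2] * box[3]       # assumed: area as text amount
            else:
                non_text.append(box)
        if text_area / page_area >= threshold:
            return True
        # Assumed next-level criterion: a cell must be large enough to split.
        pending = [sub for box in non_text
                   if min(box[2], box[3]) >= grid * min_cell
                   for sub in partition(box, grid)]
        if not pending:
            break
    return False


def all_dark(snippet: Image) -> bool:
    """Toy stand-in for a trained classifier: 'text' if every pixel is 1."""
    return all(pixel == 1 for row in snippet for pixel in row)


# 8x8 page whose top six rows are "text" (1) and bottom two rows are blank (0).
page = [[1] * 8 for _ in range(6)] + [[0] * 8 for _ in range(2)]
print(is_text_page(page, all_dark, threshold=0.7))                # second level needed
print(is_text_page(page, all_dark, threshold=0.7, max_levels=1))  # first level only
```

In the example page, the first partition level classifies only the two top cells as text (half of the page area), so the sketch re-partitions the two bottom cells; the sub-cells covering rows 4 and 5 are then classified as text, which raises the accumulated volume to three quarters of the page. Only cells not classified as text are re-partitioned, which is what keeps the number of classifier calls small on pages that are clearly text at the coarsest level.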
