Fast identification of text intensive pages from photographs
Abstract
Determining if a document is a text page includes partitioning the document into a plurality of cells, scaling each of the cells to a standardized number of pixels to provide a corresponding snippet for each of the cells, using a classifier to examine the snippets to determine which of the cells are classified as text and which of the cells are not classified as text, determining a volume of text for the document based on a total amount of text in the document corresponding to a sum of an amount of text in each of the cells classified as text, and determining that the document is a text page in response to the total amount exceeding a pre-determined threshold. In response to the total amount being less than the pre-determined threshold, cells not classified as text may be examined further. The classifier may be provided by training a neural net.
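The single-pass flow summarized in the abstract can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the 8x8 snippet size, the 4x4 grid, the use of cell area as the "amount of text", and the names `partition`, `scale`, `toy_classifier` and `is_text_page` are all hypothetical. The abstract says the classifier may be provided by training a neural net; here a simple edge-counting heuristic stands in for it.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]           # grayscale rows, 0 = black, 255 = white
Rect = Tuple[int, int, int, int]  # (top, left, height, width)

SNIPPET = 8  # assumed standardized snippet size (8x8 pixels)


def partition(rect: Rect, rows: int, cols: int) -> List[Rect]:
    """Split a rectangle into a rows x cols grid of cells."""
    top, left, h, w = rect
    cells = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = top + r * h // rows, top + (r + 1) * h // rows
            x0, x1 = left + c * w // cols, left + (c + 1) * w // cols
            cells.append((y0, x0, y1 - y0, x1 - x0))
    return cells


def scale(img: Image, rect: Rect, size: int = SNIPPET) -> Image:
    """Nearest-neighbour resample of one cell to size x size pixels."""
    top, left, h, w = rect
    return [[img[top + y * h // size][left + x * w // size]
             for x in range(size)] for y in range(size)]


def toy_classifier(snippet: Image) -> bool:
    """Stand-in for a trained classifier: many dark/light flips => text."""
    flips = sum((a < 128) != (b < 128)
                for row in snippet for a, b in zip(row, row[1:]))
    return flips >= 2 * len(snippet)


def is_text_page(img: Image,
                 classifier: Callable[[Image], bool] = toy_classifier,
                 grid: Tuple[int, int] = (4, 4),
                 threshold: float = 0.5) -> bool:
    """Sum the area of cells classified as text and compare to a threshold.

    Assumes the image is at least as large as the grid in each dimension.
    """
    page = (0, 0, len(img), len(img[0]))
    text_area = 0
    for cell in partition(page, *grid):
        if classifier(scale(img, cell)):
            text_area += cell[2] * cell[3]  # assumed: amount of text = cell area
    return text_area / (page[2] * page[3]) > threshold


# Usage: a finely striped 32x32 image stands in for dense text.
striped = [[0 if x % 2 == 0 else 255 for x in range(32)] for _ in range(32)]
blank = [[255] * 32 for _ in range(32)]
half = [[0 if (x % 2 == 0 and x < 16) else 255 for x in range(32)]
        for _ in range(32)]
```

Scaling every cell to the same snippet size lets one classifier handle cells of any size, which is what allows the same classifier to be reused when cells are partitioned further.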
21 Claims
1. A method of determining if a document is a text page, comprising:
partitioning the document into a first plurality of cells;
scaling each respective cell of the first plurality of cells to a standardized number of pixels to provide a first set of snippets, wherein the first set of snippets correspond to the first plurality of cells;
using a classifier to examine the first set of snippets to determine which of the first plurality of cells are classified as text and which of the cells are not classified as text;
determining a volume of text for the document based on a determined total amount of text in the document corresponding to a sum of an amount of text in each of the first plurality of cells classified as text;
determining whether the determined total amount of text in the document meets a pre-determined threshold;
in response to determining that the determined total amount of text in the document does not meet the pre-determined threshold and that criteria for a next partition level are met, for respective cells that are not classified as text:
a) applying further partitioning to the respective cells that are not classified as text to generate a respective further plurality of cells;
b) scaling each of the further partitioned cells to provide a respective further set of snippets that correspond to the further partitioned cells;
c) using the classifier to examine the respective further set of snippets to determine which of the respective further partitioned cells are classified as text and which of the respective further partitioned cells are not classified as text;
d) augmenting the volume of text for the document based on an updated determined total amount of text in the document corresponding to a sum of the amount of text in each of the cells of the plurality of cells and the respective further partitioned cells classified as text;
in response to determining that the augmented volume of text for the document does not meet the pre-determined threshold and that criteria for a next partition level are met, repeating a)-d); and
determining that the document is a text page in response to the determined total amount of text in the document exceeding the pre-determined threshold.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

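The iterative refinement recited in steps a)-d) of claim 1 might look roughly like the sketch below. It is a hedged illustration only. The claim does not say what the "criteria for a next partition level" are, so a maximum level count and a minimum cell side are assumed here. The 2x2 split, the use of cell area as the amount of text, and the names `split`, `text_volume` and `is_text` are likewise hypothetical. `is_text` is assumed to wrap both the scaling of a cell to a snippet and the classifier call.

```python
from typing import Callable, List, Tuple

Rect = Tuple[int, int, int, int]  # (top, left, height, width)


def split(rect: Rect, n: int = 2) -> List[Rect]:
    """Partition a rectangle into an n x n grid of sub-cells."""
    top, left, h, w = rect
    return [(top + r * h // n, left + c * w // n,
             (r + 1) * h // n - r * h // n,
             (c + 1) * w // n - c * w // n)
            for r in range(n) for c in range(n)]


def text_volume(page: Rect,
                is_text: Callable[[Rect], bool],
                threshold: float,
                max_levels: int = 3,
                min_side: int = 8) -> Tuple[float, bool]:
    """Return (fraction of page area classified as text, is_text_page)."""
    page_area = page[2] * page[3]
    text_area = 0
    pending = [page]  # cells not classified as text so far
    for _ in range(max_levels):
        # a) partition only the cells that were not classified as text
        candidates = [sub for cell in pending for sub in split(cell)]
        pending = []
        for cell in candidates:
            # b) + c) scaling and classification are folded into is_text
            if is_text(cell):
                text_area += cell[2] * cell[3]  # d) augment the volume
            else:
                pending.append(cell)
        if text_area / page_area > threshold:
            return text_area / page_area, True
        # assumed criterion for another level: sub-cells stay >= min_side
        pending = [c for c in pending if min(c[2], c[3]) >= 2 * min_side]
        if not pending:
            break
    return text_area / page_area, text_area / page_area > threshold


# Usage: pretend text fills the top 48 rows of a 64x64 page, and a cell
# counts as text only when it lies entirely inside that band.
page = (0, 0, 64, 64)
inside_band = lambda r: r[0] + r[2] <= 48
```

In this example the first level finds text in only half of the page. Splitting the two rejected cells again recovers the lower part of the band and raises the volume to three quarters, so text already counted at a coarser level is never re-examined.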
12. A non-transitory computer readable medium containing software that determines if a document is a text page, the software comprising:
executable code that partitions the document into a first plurality of cells;
executable code that scales each respective cell of the first plurality of cells to a standardized number of pixels to provide a first set of snippets, wherein the first set of snippets correspond to the first plurality of cells;
executable code that uses a classifier to examine the first set of snippets to determine which of the first plurality of cells are classified as text and which of the cells are not classified as text;
executable code that determines a volume of text for the document based on a determined total amount of text in the document corresponding to a sum of an amount of text in each of the first plurality of cells classified as text;
executable code that determines whether the determined total amount of text in the document meets a pre-determined threshold;
executable code that, in response to determining that the determined total amount of text in the document does not meet the pre-determined threshold and that criteria for a next partition level are met, for respective cells that are not classified as text:
a) applies further partitioning to the respective cells that are not classified as text to generate a respective further plurality of cells;
b) scales each of the further partitioned cells to provide a respective further set of snippets that correspond to the further partitioned cells;
c) uses the classifier to examine the respective further set of snippets to determine which of the respective further partitioned cells are classified as text and which of the respective further partitioned cells are not classified as text;
d) augments the volume of text for the document based on an updated determined total amount of text in the document corresponding to a sum of the amount of text in each of the cells of the plurality of cells and the respective further partitioned cells classified as text;
executable code that, in response to determining that the augmented volume of text for the document does not meet the pre-determined threshold and that criteria for a next partition level are met, repeats a)-d); and
executable code that determines that the document is a text page in response to the total amount exceeding a pre-determined threshold.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21

Specification