FAST IDENTIFICATION OF TEXT INTENSIVE PAGES FROM PHOTOGRAPHS
First Claim
1. A method of training a neural network to distinguish between text documents and image documents, comprising:
obtaining a corpus of text and image documents;
for at least one text document of the corpus of text and image documents:
scanning at least one page of the at least one text document by shifting a text window to a plurality of locations on the at least one page of the at least one text document;
determining whether text in the window at a respective location of the plurality of locations meets text line criteria; and
in accordance with a determination that the text in the window at the respective location of the plurality of locations meets text line criteria, storing the text in the window as a respective text snippet; and
for at least one image document of the corpus of text and image documents:
superimposing a plurality of image windows over at least one page of the at least one image document;
determining whether the content of a respective image window meets image criteria; and
in accordance with a determination that the content of the respective image window meets the image criteria, storing the content of the respective image window as a respective image snippet; and
providing the respective text snippet and the respective image snippet to a classifier.
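The text-document branch of the claim (shift a window across the page, test each location, store what passes) can be sketched as follows. This is a minimal illustration, not the claimed implementation: the claim does not define the "text line criteria" here, so `meets_text_line_criteria` uses a hypothetical ink-coverage range. The page representation (a grid of grayscale pixel values), the step size, and all function names are likewise assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Assumed page representation: grayscale grid, 0 = black ink, 255 = white paper.
Page = List[List[int]]


def crop(page: Page, top: int, left: int, height: int, width: int) -> Page:
    """Return the part of the page covered by the window."""
    return [row[left:left + width] for row in page[top:top + height]]


def ink_ratio(window: Page, threshold: int = 128) -> float:
    """Fraction of pixels in the window that are darker than the threshold."""
    total = sum(len(row) for row in window)
    dark = sum(1 for row in window for px in row if px < threshold)
    return dark / total if total else 0.0


def meets_text_line_criteria(window: Page) -> bool:
    """Hypothetical stand-in criterion: moderate ink coverage.

    The actual criteria are not given in the claim text.
    """
    return 0.05 <= ink_ratio(window) <= 0.6


def scan_text_page(
    page: Page,
    win_h: int,
    win_w: int,
    step: int,
    criteria: Callable[[Page], bool] = meets_text_line_criteria,
) -> List[Tuple[Tuple[int, int], Page]]:
    """Shift a window over the page and keep each window that meets the criteria."""
    snippets = []
    rows = len(page)
    cols = len(page[0]) if page else 0
    for top in range(0, rows - win_h + 1, step):
        for left in range(0, cols - win_w + 1, step):
            window = crop(page, top, left, win_h, win_w)
            if criteria(window):
                # Store the window content together with its location.
                snippets.append(((top, left), window))
    return snippets
```

Passing `criteria` as a parameter keeps the scanning loop independent of whatever test is actually used, so a different text line test can be swapped in without touching the loop.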
Abstract
Methods and systems for training a neural network to distinguish between text documents and image documents are described. A corpus of text and image documents is obtained. A page of a text document is scanned by shifting a text window to a plurality of locations. In accordance with a determination that the text in the window at a respective location meets text line criteria, the text in the window is stored as a respective text snippet. A plurality of image windows are superimposed over at least one page of an image document. In accordance with a determination that the content of a respective image window meets image criteria, content of the image window is stored as a respective image snippet. The respective text snippet and the respective image snippet are provided to a classifier.
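The image-document branch and the final hand-off to the classifier can be sketched in the same spirit. This is an illustrative sketch under stated assumptions: the windows are laid out as a non-overlapping grid (one possible way to superimpose a plurality of windows), the "image criteria" are replaced by a hypothetical minimum pixel variance, and the label constants and function names are invented for the example. The abstract does not specify any of these details.

```python
from statistics import pvariance
from typing import Any, Iterator, List, Tuple

# Assumed page representation: grid of grayscale pixel values (0-255).
Grid = List[List[int]]

# Hypothetical class labels for the training pairs.
TEXT, IMAGE = 0, 1


def image_windows(page: Grid, win_h: int, win_w: int) -> Iterator[Grid]:
    """Yield non-overlapping windows tiled over the page."""
    if not page:
        return
    for top in range(0, len(page) - win_h + 1, win_h):
        for left in range(0, len(page[0]) - win_w + 1, win_w):
            yield [row[left:left + win_w] for row in page[top:top + win_h]]


def meets_image_criteria(window: Grid, min_variance: float = 100.0) -> bool:
    """Hypothetical stand-in criterion: reject near-uniform (blank) windows."""
    pixels = [px for row in window for px in row]
    return pvariance(pixels) >= min_variance


def collect_image_snippets(page: Grid, win_h: int, win_w: int) -> List[Grid]:
    """Keep the content of every window that meets the image criteria."""
    return [w for w in image_windows(page, win_h, win_w) if meets_image_criteria(w)]


def build_training_set(
    text_snippets: List[Any], image_snippets: List[Any]
) -> List[Tuple[Any, int]]:
    """Pair each snippet with its label, ready to be provided to a classifier."""
    return [(s, TEXT) for s in text_snippets] + [(s, IMAGE) for s in image_snippets]
```

The list returned by `build_training_set` is what a training routine would consume. Which classifier receives it, and how it is trained, is left open here because the abstract only states that the snippets are provided to a classifier.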
20 Claims
1. A method of training a neural network to distinguish between text documents and image documents, comprising:
obtaining a corpus of text and image documents;
for at least one text document of the corpus of text and image documents:
scanning at least one page of the at least one text document by shifting a text window to a plurality of locations on the at least one page of the at least one text document;
determining whether text in the window at a respective location of the plurality of locations meets text line criteria; and
in accordance with a determination that the text in the window at the respective location of the plurality of locations meets text line criteria, storing the text in the window as a respective text snippet; and
for at least one image document of the corpus of text and image documents:
superimposing a plurality of image windows over at least one page of the at least one image document;
determining whether the content of a respective image window meets image criteria; and
in accordance with a determination that the content of the respective image window meets the image criteria, storing the content of the respective image window as a respective image snippet; and
providing the respective text snippet and the respective image snippet to a classifier.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computing device for training a neural network to distinguish between text documents and image documents, comprising:
one or more processors;
memory;
a display; and
one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs comprising instructions for:
obtaining a corpus of text and image documents;
for at least one text document of the corpus of text and image documents:
scanning at least one page of the at least one text document by shifting a text window to a plurality of locations on the at least one page of the at least one text document;
determining whether text in the window at a respective location of the plurality of locations meets text line criteria; and
in accordance with a determination that the text in the window at the respective location of the plurality of locations meets text line criteria, storing the text in the window as a respective text snippet; and
for at least one image document of the corpus of text and image documents:
superimposing a plurality of image windows over at least one page of the at least one image document;
determining whether the content of a respective image window meets image criteria; and
in accordance with a determination that the content of the respective image window meets the image criteria, storing the content of the respective image window as a respective image snippet; and
providing the respective text snippet and the respective image snippet to a classifier.
Dependent claims: 13, 14, 15, 16, 17, 18, 19
20. A non-transitory computer readable medium containing software that trains a neural network to distinguish between text documents and image documents using a corpus of text and image documents, the software comprising executable code that:
obtains a corpus of text and image documents;
for at least one text document of the corpus of text and image documents:
scans at least one page of the at least one text document by shifting a text window to a plurality of locations on the at least one page of the at least one text document;
determines whether text in the window at a respective location of the plurality of locations meets text line criteria; and
in accordance with a determination that the text in the window at the respective location of the plurality of locations meets text line criteria, stores the text in the window as a respective text snippet; and
for at least one image document of the corpus of text and image documents:
superimposes a plurality of image windows over at least one page of the at least one image document;
determines whether the content of a respective image window meets image criteria; and
in accordance with a determination that the content of the respective image window meets the image criteria, stores the content of the respective image window as a respective image snippet; and
provides the respective text snippet and the respective image snippet to a classifier.
Specification