FAST IDENTIFICATION OF TEXT INTENSIVE PAGES FROM PHOTOGRAPHS

  • US 20190318163A1
  • Filed: 06/27/2019
  • Published: 10/17/2019
  • Est. Priority Date: 09/23/2015
  • Status: Active Grant
First Claim

1. A method of training a neural network to distinguish between text documents and image documents, comprising:

  • obtaining a corpus of text and image documents;

  • for at least one text document of the corpus of text and image documents:

    scanning at least one page of the at least one text document by shifting a text window to a plurality of locations on the at least one page of the at least one text document;

    determining whether text in the window at a respective location of the plurality of locations meets text line criteria; and

    in accordance with a determination that the text in the window at the respective location of the plurality of locations meets text line criteria, storing the text in the window as a respective text snippet; and

  • for at least one image document of the corpus of text and image documents:

    superimposing a plurality of image windows over at least one page of the at least one image document;

    determining whether the content of a respective image window meets image criteria; and

    in accordance with a determination that the content of the respective image window meets the image criteria, storing the content of the respective image window as a respective image snippet; and

  • providing the respective text snippet and the respective image snippet to a classifier.
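
The snippet-collection steps recited in the claim can be illustrated with a minimal Python sketch. This is not the patented implementation; it only shows the general shape of sliding a window over a page, testing each window against a criterion, and handing the kept snippets to a classifier as labeled examples. Everything concrete here is an assumption made for illustration: pages are modeled as grayscale pixel grids, the window size and stride are arbitrary, and `meets_text_line_criteria` and `meets_image_criteria` are hypothetical stand-ins, since the claim does not define the criteria.

```python
from typing import Callable, Iterator, List, Tuple

Page = List[List[int]]      # assumed model: rows of grayscale pixels, 0 (black) to 255 (white)
Snippet = List[List[int]]   # the pixels cropped from one window


def window_positions(page: Page, height: int, width: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield the (top, left) corner of every window that fits fully on the page."""
    rows, cols = len(page), len(page[0])
    for top in range(0, rows - height + 1, stride):
        for left in range(0, cols - width + 1, stride):
            yield top, left


def crop(page: Page, top: int, left: int, height: int, width: int) -> Snippet:
    """Return the window content at the given position."""
    return [row[left:left + width] for row in page[top:top + height]]


def meets_text_line_criteria(window: Snippet) -> bool:
    # Hypothetical criterion: some dark "ink", but not a solid block.
    pixels = [p for row in window for p in row]
    ink = sum(p < 128 for p in pixels) / len(pixels)
    return 0.05 <= ink <= 0.5


def meets_image_criteria(window: Snippet) -> bool:
    # Hypothetical criterion: enough distinct gray levels to rule out a blank region.
    return len({p for row in window for p in row}) >= 8


def collect_snippets(page: Page, criteria: Callable[[Snippet], bool],
                     height: int, width: int, stride: int) -> List[Snippet]:
    """Shift a window across the page and keep the contents that pass the criteria."""
    kept = []
    for top, left in window_positions(page, height, width, stride):
        window = crop(page, top, left, height, width)
        if criteria(window):
            kept.append(window)
    return kept


def build_training_set(text_pages: List[Page], image_pages: List[Page],
                       height: int, width: int, stride: int) -> List[Tuple[Snippet, str]]:
    """Label the kept snippets so they can be provided to a classifier."""
    examples = []
    for page in text_pages:
        for snippet in collect_snippets(page, meets_text_line_criteria, height, width, stride):
            examples.append((snippet, "text"))
    for page in image_pages:
        for snippet in collect_snippets(page, meets_image_criteria, height, width, stride):
            examples.append((snippet, "image"))
    return examples
```

Calling `build_training_set(text_pages, image_pages, height, width, stride)` returns `(snippet, label)` pairs that could then be passed to whatever classifier training routine is in use. The same window loop serves both branches with only the criterion swapped, which mirrors the parallel structure of the text and image steps in the claim.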
