Identification of layout and content flow of an unstructured document

  • US 9,063,911 B2
  • Filed: 06/07/2009
  • Issued: 06/23/2015
  • Est. Priority Date: 01/02/2009
  • Status: Active Grant
First Claim

1. A non-transitory machine readable medium storing a program which when executed by at least one processing unit analyzes a document, the program comprising sets of instructions for:

  • receiving a document comprising a plurality of glyphs, each glyph having a position in the document, wherein the glyphs are arranged in paragraphs and columns that are not defined in the received document;

  • based on positions of the glyphs in the document, creating associations between glyphs to identify different sets of glyphs as different words;

  • creating associations between words to identify different sets of words as different paragraphs of the received document;

  • generating, for a page of the document, a graph of the paragraphs on the page comprising (i) a node for each paragraph and (ii) connections between a plurality of the nodes that account for relative positions of the different paragraphs to each other according to the positions of the glyphs that are associated with the different paragraphs of the received document;

  • based on the connections between nodes, modifying the graph by merging sets of at least two connected nodes into single nodes, wherein each merged node represents a column of the received document; and

  • using the modified graph with the merged nodes to identify layouts of the received document and assign the columns to different layouts on the page.
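
The steps recited in the claim can be illustrated with a minimal Python sketch. This is not the patented implementation, only one plausible reading under stated assumptions: glyphs and paragraphs are reduced to axis-aligned bounding boxes with y growing downward, paragraph boxes are taken as already derived from words, and every name here (`Box`, `group_words`, `build_graph`, `merge_columns`, `assign_layouts`, `max_gap`) is hypothetical. The merge rule (collapse a chain where a node has exactly one node below it and that node has exactly one node above it) and the layout rule (columns whose vertical extents overlap share a layout) are assumptions chosen for simplicity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    x0: float  # left
    y0: float  # top (assumption: y grows downward)
    x1: float  # right
    y1: float  # bottom

def union(a, b):
    return Box(min(a.x0, b.x0), min(a.y0, b.y0), max(a.x1, b.x1), max(a.y1, b.y1))

def x_overlap(a, b):
    return min(a.x1, b.x1) - max(a.x0, b.x0) > 0

def group_words(glyphs, max_gap):
    """Associate glyphs on one text line into words by horizontal gap."""
    words = []
    for g in sorted(glyphs, key=lambda b: b.x0):
        if words and g.x0 - words[-1].x1 <= max_gap:
            words[-1] = union(words[-1], g)
        else:
            words.append(g)
    return words

def build_graph(paras):
    """One node per paragraph; edge i -> j when j sits directly below i."""
    edges = {i: set() for i in range(len(paras))}
    for i, p in enumerate(paras):
        for j, q in enumerate(paras):
            if i == j or q.y0 < p.y1 or not x_overlap(p, q):
                continue
            # Skip j if another paragraph lies between i and j.
            blocked = any(
                k not in (i, j) and r.y0 >= p.y1 and r.y1 <= q.y0
                and x_overlap(r, p) and x_overlap(r, q)
                for k, r in enumerate(paras))
            if not blocked:
                edges[i].add(j)
    return edges

def merge_columns(paras, edges):
    """Merge chains of singly connected nodes; each chain is one column."""
    parents = {i: set() for i in edges}
    for i, children in edges.items():
        for j in children:
            parents[j].add(i)
    columns = []
    for i in sorted(edges, key=lambda n: paras[n].y0):
        if len(parents[i]) == 1 and len(edges[next(iter(parents[i]))]) == 1:
            continue  # i continues a chain started by its only parent
        chain = [i]
        while len(edges[chain[-1]]) == 1:
            nxt = next(iter(edges[chain[-1]]))
            if len(parents[nxt]) != 1:
                break
            chain.append(nxt)
        columns.append(chain)
    return columns

def assign_layouts(paras, columns):
    """Group columns whose vertical extents overlap into one layout."""
    boxes = []
    for col in columns:
        b = paras[col[0]]
        for i in col[1:]:
            b = union(b, paras[i])
        boxes.append(b)
    layouts, bottom = [], None
    for c in sorted(range(len(columns)), key=lambda n: boxes[n].y0):
        if layouts and boxes[c].y0 < bottom:
            layouts[-1].append(c)
            bottom = max(bottom, boxes[c].y1)
        else:
            layouts.append([c])
            bottom = boxes[c].y1
    return layouts

# Hypothetical page: a full-width heading above two side-by-side columns.
paras = [
    Box(0, 0, 100, 10),                       # 0: heading
    Box(0, 20, 45, 50), Box(0, 55, 45, 90),   # 1, 2: left column
    Box(55, 20, 100, 60), Box(55, 65, 100, 90),  # 3, 4: right column
]
columns = merge_columns(paras, build_graph(paras))
print(columns)                         # [[0], [1, 2], [3, 4]]
print(assign_layouts(paras, columns))  # [[0], [1, 2]]
```

In this example the heading forms a single-column layout of its own, while the two text columns beneath it are merged from two paragraphs each and then assigned together to a second, two-column layout.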
