Identification of layout and content flow of an unstructured document
First Claim
1. A non-transitory machine readable medium storing a program which when executed by at least one processing unit analyzes a document, the program comprising sets of instructions for:
receiving a document comprising a plurality of glyphs, each glyph having a position in the document, wherein the glyphs are arranged in paragraphs and columns that are not defined in the received document;
based on positions of the glyphs in the document, creating associations between glyphs to identify different sets of glyphs as different words;
creating associations between words to identify different sets of words as different paragraphs of the received document;
generating, for a page of the document, a graph of the paragraphs on the page comprising (i) a node for each paragraph and (ii) connections between a plurality of the nodes that account for relative positions of the different paragraphs to each other according to the positions of the glyphs that are associated with the different paragraphs of the received document;
based on the connections between nodes, modifying the graph by merging sets of at least two connected nodes into single nodes, wherein each merged node represents a column of the received document; and
using the modified graph with the merged nodes to identify layouts of the received document and assign the columns to different layouts on the page.
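The graph steps recited above (one node per paragraph, connections based on relative positions, merging connected nodes into columns) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration and is not taken from the claim: the `Paragraph` bounding-box type, the `stacked` connection rule with its `max_gap` and `min_overlap` thresholds, and the use of union-find to perform the merge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paragraph:
    """Hypothetical paragraph bounding box; y is assumed to grow downward."""
    name: str
    left: float
    top: float
    right: float
    bottom: float


def stacked(a, b, max_gap=20.0, min_overlap=0.8):
    """Assumed connection rule: one paragraph sits directly above the other,
    close together, and they share most of their horizontal extent."""
    upper, lower = (a, b) if a.top <= b.top else (b, a)
    gap = lower.top - upper.bottom
    overlap = min(a.right, b.right) - max(a.left, b.left)
    narrower = min(a.right - a.left, b.right - b.left)
    return 0 <= gap <= max_gap and narrower > 0 and overlap >= min_overlap * narrower


def build_graph(paragraphs):
    """One node per paragraph (its index); one edge per stacked pair."""
    edges = set()
    for i in range(len(paragraphs)):
        for j in range(i + 1, len(paragraphs)):
            if stacked(paragraphs[i], paragraphs[j]):
                edges.add((i, j))
    return edges


def merge_into_columns(paragraphs, edges):
    """Merge connected nodes with union-find; each merged set is one column."""
    parent = list(range(len(paragraphs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for idx, paragraph in enumerate(paragraphs):
        groups.setdefault(find(idx), []).append(paragraph)

    # Order paragraphs top to bottom in each column, and columns left to right.
    columns = [sorted(group, key=lambda p: p.top) for group in groups.values()]
    return sorted(columns, key=lambda col: (col[0].left, col[0].top))


page = [
    Paragraph("A1", 0, 0, 100, 50),
    Paragraph("A2", 0, 60, 100, 120),
    Paragraph("B1", 120, 0, 220, 80),
    Paragraph("B2", 120, 90, 220, 120),
]
columns = merge_into_columns(page, build_graph(page))
print([[p.name for p in col] for col in columns])
```

In this sketch a paragraph with no connections stays as a single-node column; the claim only requires that merged nodes combine at least two connected nodes.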
Abstract
Some embodiments provide a method for analyzing an unstructured document that includes a number of glyphs, each of which has a position in the unstructured document. Based on positions of the glyphs in the unstructured document, the method creates associations between different sets of glyphs in order to identify different sets of glyphs as different words. The method creates associations between different sets of words in order to identify different sets of words as different paragraphs. The method defines associations between paragraphs that are not contiguous in order to define a reading order for the paragraphs.
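As a rough illustration of the first step in the abstract, grouping glyphs into words from their positions, the following Python sketch splits a run of glyphs wherever the horizontal gap is large or the baseline changes. The `Glyph` fields and the fixed `gap_threshold` and `line_tolerance` values are assumptions made for this example; the source does not say how the associations are computed, and a fixed threshold is only a stand-in.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph record: a character plus its position on the page."""
    char: str
    x: float         # left edge
    width: float
    baseline: float  # vertical position of the text line


def group_into_words(glyphs, gap_threshold=3.0, line_tolerance=1.0):
    """Group glyphs into words using only their positions.

    Assumes glyphs on the same text line share (nearly) the same baseline.
    """
    ordered = sorted(glyphs, key=lambda g: (g.baseline, g.x))
    words, current = [], []
    for glyph in ordered:
        if current:
            prev = current[-1]
            same_line = abs(glyph.baseline - prev.baseline) <= line_tolerance
            gap = glyph.x - (prev.x + prev.width)
            # A new line or a wide gap closes the current word.
            if not same_line or gap > gap_threshold:
                words.append(current)
                current = []
        current.append(glyph)
    if current:
        words.append(current)
    return ["".join(g.char for g in word) for word in words]


glyphs = [
    Glyph("H", 0, 5, 10), Glyph("i", 6, 5, 10),
    Glyph("y", 20, 5, 10), Glyph("o", 26, 5, 10),
    Glyph("o", 0, 5, 30), Glyph("k", 6, 5, 30),
]
print(group_into_words(glyphs))
```

The same idea could be repeated one level up, grouping words into lines and paragraphs by vertical gaps, though that is likewise an assumption rather than something the abstract specifies.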
132 Citations
24 Claims
1. A non-transitory machine readable medium storing a program which when executed by at least one processing unit analyzes a document, the program comprising sets of instructions for:
receiving a document comprising a plurality of glyphs, each glyph having a position in the document, wherein the glyphs are arranged in paragraphs and columns that are not defined in the received document;
based on positions of the glyphs in the document, creating associations between glyphs to identify different sets of glyphs as different words;
creating associations between words to identify different sets of words as different paragraphs of the received document;
generating, for a page of the document, a graph of the paragraphs on the page comprising (i) a node for each paragraph and (ii) connections between a plurality of the nodes that account for relative positions of the different paragraphs to each other according to the positions of the glyphs that are associated with the different paragraphs of the received document;
based on the connections between nodes, modifying the graph by merging sets of at least two connected nodes into single nodes, wherein each merged node represents a column of the received document; and
using the modified graph with the merged nodes to identify layouts of the received document and assign the columns to different layouts on the page.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A non-transitory machine readable medium storing a program which when executed by at least one processing unit analyzes a document comprising a plurality of glyphs, each glyph having a position in the document, the program comprising sets of instructions for:
based on the positions of the glyphs, identifying different sets of glyphs as different words;
identifying different sets of words as different paragraphs;
generating, for a plurality of paragraphs of the document, a graph of the paragraphs comprising (i) a node for each paragraph and (ii) connections between the nodes that account for relative positions of the paragraphs to each other;
based on the relative positions of the paragraphs in the plurality of paragraphs, partitioning the generated graph into at least two independent graphs that are each isolated from each other, each independent graph representing a different layout on the page, wherein nodes of each independent graph are only connected to other nodes that are within the independent graph; and
assigning paragraphs of each independent graph to the different layouts based on the independent graphs.
Dependent claims: 12, 13, 14, 15, 16, 17
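The partitioning step of claim 11 can be pictured as cutting some connections and then collecting the isolated subgraphs that remain. The Python sketch below is one possible reading and not the claimed method itself: the `keep_edge` predicate, the vertical-distance rule used in the example, and the breadth-first search for connected components are all assumptions introduced for illustration.

```python
from collections import deque


def partition(nodes, edges, keep_edge):
    """Split a paragraph graph into isolated subgraphs (connected components).

    keep_edge(a, b) is a hypothetical rule deciding whether the connection
    between paragraphs a and b survives the partitioning.
    """
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        if keep_edge(a, b):
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in sorted(adjacency[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


# Assumed example data: top coordinate of each paragraph on the page.
tops = {"title": 0, "intro": 40, "body1": 80, "sidebar": 400, "note": 440}
edges = [("title", "intro"), ("intro", "body1"),
         ("body1", "sidebar"), ("sidebar", "note")]

# Assumed rule: drop connections between paragraphs that are far apart vertically.
layouts = partition(list(tops), edges, lambda a, b: abs(tops[a] - tops[b]) <= 100)
print(layouts)
```

Each returned list plays the role of one independent graph: its nodes connect only to other nodes in the same list, so each list can be assigned to its own layout.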
18. A method for analyzing a document, the method comprising:
receiving a document comprising a plurality of glyphs, each glyph having a position in the document, wherein the glyphs are arranged in paragraphs and columns that are not defined in the received document;
based on positions of the glyphs in the document, creating associations between glyphs to identify different sets of glyphs as different words;
creating associations between words to identify different sets of words as different paragraphs of the received document;
generating, for a page of the document, a graph of the paragraphs on the page comprising (i) a node for each paragraph and (ii) connections between a plurality of the nodes that account for relative positions of the glyphs that are associated with the different paragraphs of the received document;
based on the connections between nodes, modifying the graph by merging sets of at least two connected nodes into single nodes, wherein each merged node represents a column of the received document; and
using the modified graph with the merged nodes to identify layouts of the received document and assign the columns to different layouts on the page.
Dependent claims: 19, 20, 21, 22, 23, 24
Specification