Efficient data structures for parsing and analyzing a document
First Claim
Patent Images
1. A non-transitory machine readable medium storing a program for execution by at least one processing unit, the program comprising sets of instructions for:
- parsing an unstructured document comprising a plurality of primitive elements, wherein the plurality of primitive elements comprises a plurality of glyphs;
storing the primitive elements in a random order in a first array;
storing, in a second array, references to the stored primitive elements, the references ordered in the second array based on locations of the primitive elements in the document, wherein each of the references refers to a single primitive element in the first array;
receiving instructions to perform a document reconstruction operation that associates a portion of the primitive elements into a structural element in order to generate a structured document from the unstructured document; and
performing the received instructions to define the structural element using the references stored in the second array without storing any new references to the primitive elements.
1 Assignment
0 Petitions
Accused Products
Abstract
Some embodiments provide a method that parses an unstructured document that includes a number of primitive elements. The method stores the primitive elements in a random order in a first storage. The method stores references to the primitive elements in a second storage in an order based on locations of the primitive elements in the unstructured document. The method receives instructions to perform a document reconstruction operation. The method performs the received instructions without storing any new references to the primitive elements.
-
Citations
27 Claims
-
1. A non-transitory machine readable medium storing a program for execution by at least one processing unit, the program comprising sets of instructions for:
-
parsing an unstructured document comprising a plurality of primitive elements, wherein the plurality of primitive elements comprises a plurality of glyphs; storing the primitive elements in a random order in a first array; storing, in a second array, references to the stored primitive elements, the references ordered in the second array based on locations of the primitive elements in the document, wherein each of the references refers to a single primitive element in the first array; receiving instructions to perform a document reconstruction operation that associates a portion of the primitive elements into a structural element in order to generate a structured document from the unstructured document; and performing the received instructions to define the structural element using the references stored in the second array without storing any new references to the primitive elements. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A system comprising:
-
a set of processing units for executing sets of instructions; a memory for storing a program for execution by at least one of the processing units, the program comprising; a first set of instructions for (i) parsing a document comprising a plurality of primitive elements and (ii) storing the primitive elements in a random order in a first storage, wherein the plurality of primitive elements comprises a plurality of glyphs; a second set of instructions for (i) allocating memory in a second storage for storing a single reference for each of the randomly-ordered primitive elements and (ii) storing the references in a particular order in the allocated memory, wherein the particular order is based on the arrangement of the plurality of primitive elements in the document; a third set of instructions for storing data structures for associating sets of the primitive elements with each other, each of the data structures comprising only a single reference to a first one of the ordered references in the second storage and a count value that indicates a number of the ordered references in the second storage associated with each other by the data structure; and a fourth set of instructions for (i) receiving instructions to perform document reconstruction operations that associate sets of the primitive elements with each other in a particular order and (ii) identifying which of the first, second, and third sets of instructions is required to perform the document reconstruction operations while minimizing memory and computation usage. - View Dependent Claims (7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
-
-
17. A non-transitory machine readable medium storing a program for execution by at least one processing unit, the program comprising sets of instructions for:
-
receiving a document comprising a plurality of primitive elements, wherein the primitive elements each have a particular location in the document and wherein the plurality of primitive elements comprises a plurality of glyphs; storing the primitive elements in a first storage in an order irrespective of their locations in the document; storing a single reference to each of the stored primitive elements in a second storage in a first order based on the locations of the primitive elements; performing a first document reconstruction operation that associates a first set of the primitive elements together as a first structural element of a structured document using the references stored in the second storage without storing any new references to the primitive elements; and performing a second document reconstruction operation that associates a second set of the primitive elements together as a second structural element of the structured document by storing a new set of references to the stored primitive elements in a second, different order in the second storage. - View Dependent Claims (18, 19, 20, 21, 22, 23)
-
-
24. A non-transitory machine readable medium storing a program for execution by at least one processing unit, the program for analyzing documents, the program comprising:
-
a first module for (i) parsing a document comprising a plurality of glyphs, and (ii) storing the glyphs in a random order in a first storage; a second module for (i) allocating memory in a second storage for storing references to the randomly-ordered glyphs and (ii) storing the references in a particular order in the allocated memory; a third module for (i) associating sets of the glyphs as lines of text by storing a first string that references a first portion of the ordered references, (ii) associating sets of the glyphs as words by storing a second string that references a second portion of the ordered references, and (iii) associating sets of the glyphs as lines of text by storing a third string that references a third portion of the ordered references, wherein each string comprises only a reference to a first one of the ordered references and a count value; and a fourth module for (i) receiving instructions to perform document reconstruction operations and (ii) identifying which of the first, second, and third modules is required to perform the document reconstruction operations while minimizing memory and computation usage. - View Dependent Claims (25, 26, 27)
-
Specification