Methods for efficient cluster analysis
First Claim
1. A non-transitory machine readable medium storing a program which when executed by at least one processor defines structure for a document comprising a plurality of primitive elements, the program comprising sets of instructions for:
- defining an indirectly sorted first array that stores sorted indices of a second array of difference values, wherein the difference values indicate differences between sorted attribute values of different primitive elements, the primitive elements being defined in terms of the attribute values;
using the indirectly sorted first array to generate a plurality of different partition sets at different distance scales for the plurality of primitive elements;
from the plurality of partition sets, selecting an optimal partition set based on a set of optimization measures; and
grouping the plurality of primitive elements using the optimal partition set in order to associate a subset of the primitive elements as a structured element in the document.
0 Assignments
0 Petitions
Accused Products
Abstract
Some embodiments provide a method for defining structure for an unstructured document that includes a number of primitive elements that are defined in terms of their position in the document. The method identifies a pairwise grouping of nearest primitive elements. The method sorts the pairwise primitive elements based on an order from the closest to the furthest pairs. The method stores a single value that identifies which of the pairwise primitive elements are sufficiently far apart to form a partition. The method uses the stored value to identify and analyze the partitions in order to define structural elements for the document.
-
Citations
20 Claims
-
1. A non-transitory machine readable medium storing a program which when executed by at least one processor defines structure for a document comprising a plurality of primitive elements, the program comprising sets of instructions for:
-
defining an indirectly sorted first array that stores sorted indices of a second array of difference values, wherein the difference values indicate differences between sorted attribute values of different primitive elements, the primitive elements being defined in terms of the attribute values; using the indirectly sorted first array to generate a plurality of different partition sets at different distance scales for the plurality of primitive elements; from the plurality of partition sets, selecting an optimal partition set based on a set of optimization measures; and grouping the plurality of primitive elements using the optimal partition set in order to associate a subset of the primitive elements as a structured element in the document. - View Dependent Claims (2, 3, 4, 5)
-
-
6. A method for defining structure for a document, the method comprising:
-
identifying pairs of nearest primitive elements in a document comprising a plurality of primitive elements that are defined in terms of their position in the document; sorting the pairs of nearest primitive elements based on an order from a pair of nearest primitive elements that are a closest distance apart to a pair of nearest primitive elements that are a furthest distance apart; storing a single value that identifies, from the sorted pairs of nearest primitive elements, a particular pair of nearest primitive elements whose primitive elements are a minimum distance apart to form a partition, wherein each pair of nearest primitive elements that are a greater distance apart than the minimum distance is after the particular pair in the order; identifying partitions between the primitive elements in the document based on the single value and the sorted pairs of nearest primitive elements; and using the partitions to define structural elements for the document. - View Dependent Claims (7, 8, 9, 10)
-
-
11. A method for defining a structure for a document comprising a plurality of primitive elements, the method comprising:
-
defining an indirectly sorted first array that stores sorted indices of a second array of difference values, wherein the difference values indicate differences between sorted attribute values of different primitive elements, the primitive elements being defined in terms of the attribute values; using the indirectly sorted first array to generate a plurality of different partition sets at different distance scales for the plurality of primitive elements; from the plurality of partition sets, selecting an optimal partition set based on a set of optimization measures; and grouping the plurality of primitive elements using the optimal partition set in order to associate a subset of the primitive elements as a structured element in the document. - View Dependent Claims (12, 13, 14, 15)
-
-
16. An apparatus comprising:
-
a set of processing units for executing sets of instructions; a machine readable medium storing a program which when executed by at least one of the processing units defines a program for defining structure for a document, the program comprising sets of instructions for; identifying pairs of nearest primitive elements in a document comprising a plurality of primitive elements that are defined in terms of their position in the document; sorting the pairs of nearest primitive elements based on an order from a pair of nearest primitive elements that are a closest distance apart to a pair of nearest primitive elements that are a furthest distance apart; storing a single value that identifies, from the sorted pairs of nearest primitive elements, a particular pair of nearest primitive elements whose primitive elements are a minimum distance apart to form a partition, wherein each pair of nearest primitive elements that are greater distance apart than the minimum distance is after the particular pair in the order; identifying partitions between the primitive elements in the document based on the single value and the sorted pairs of nearest primitive elements; and using the partitions to define structural elements for the document. - View Dependent Claims (17, 18, 19, 20)
-
Specification