METHODS FOR EFFICIENT CLUSTER ANALYSIS
Abstract
Some embodiments provide a method for defining structure for an unstructured document that includes a number of primitive elements that are defined in terms of their position in the document. The method identifies a pairwise grouping of nearest primitive elements. The method sorts the pairwise primitive elements based on an order from the closest to the furthest pairs. The method stores a single value that identifies which of the pairwise primitive elements are sufficiently far apart to form a partition. The method uses the stored value to identify and analyze the partitions in order to define structural elements for the document.
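The abstract's central idea, that one stored value is enough to recover a partition once the nearest-neighbor pairs are sorted, can be illustrated with a minimal sketch. It assumes each primitive element is reduced to a single 1-D coordinate; the class name `GapIndex` and its method `partitions` are hypothetical and do not come from the source.

```python
class GapIndex:
    """Sort nearest-neighbour gaps once; a single rank then selects a partition."""

    def __init__(self, positions):
        # Assumption: each primitive element is reduced to one coordinate.
        self.pts = sorted(positions)
        # gaps[i] is the distance between neighbours pts[i] and pts[i + 1].
        self.gaps = [b - a for a, b in zip(self.pts, self.pts[1:])]
        # Pair indices ordered from the closest pair to the furthest pair.
        self.order = sorted(range(len(self.gaps)), key=self.gaps.__getitem__)

    def partitions(self, cut):
        """Split at every pair whose rank in `order` is `cut` or higher."""
        boundaries = sorted(self.order[cut:])
        groups, start = [], 0
        for b in boundaries:
            groups.append(self.pts[start:b + 1])
            start = b + 1
        groups.append(self.pts[start:])
        return groups


idx = GapIndex([0, 1, 2, 10, 11, 30])
print(idx.partitions(3))  # [[0, 1, 2], [10, 11], [30]]
print(idx.partitions(4))  # [[0, 1, 2, 10, 11], [30]]
```

In this sketch the sort happens once; changing the single stored rank is all that is needed to move between finer and coarser partitions.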
47 Claims
1-23. (canceled)
24. A computer readable medium storing a computer program which when executed by at least one processor defines structure for a document comprising a plurality of primitive elements that are defined in terms of attribute values, the computer program comprising sets of instructions for:
defining an indirectly sorted array using sorted indices of an array of difference values that indicate differences between the attribute values of different primitive elements;
using the indirectly sorted array to generate a plurality of different partition sets at different distance scales for the plurality of primitive elements;
from the plurality of partition sets, selecting an optimal partition set based on a set of optimization measures; and
grouping the plurality of elements using the optimal partition set in order to associate a subset of the primitive elements as a structured element in the document.
Dependent claims: 25, 26, 27, 28
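One way to picture the steps of claim 24 is the sketch below. It is an illustration under stated assumptions, not the claimed implementation: attribute values are taken to be 1-D numbers, the function names are hypothetical, and the optimization measure (the ratio between the smallest splitting difference and the largest non-splitting difference) is an assumed stand-in, since the claim does not name the measures used.

```python
def candidate_partition_sets(values):
    """Build one partition set per cut point in the indirectly sorted array."""
    vals = sorted(values)
    diffs = [b - a for a, b in zip(vals, vals[1:])]
    # Indirectly sorted array: indices into `diffs`, ordered by difference value.
    order = sorted(range(len(diffs)), key=diffs.__getitem__)
    candidates = []
    for cut in range(len(order) + 1):
        # Differences ranked at or after `cut` become partition boundaries,
        # so a higher cut corresponds to a coarser distance scale.
        bounds = sorted(order[cut:])
        groups, start = [], 0
        for b in bounds:
            groups.append(vals[start:b + 1])
            start = b + 1
        groups.append(vals[start:])
        candidates.append((cut, groups))
    return diffs, order, candidates


def select_partition_set(values):
    """Pick the candidate with the best score under the assumed measure."""
    diffs, order, candidates = candidate_partition_sets(values)

    def score(cut):
        # Assumed measure: smallest splitting difference divided by the
        # largest non-splitting difference. The extremes score zero.
        if cut == 0 or cut == len(order):
            return 0.0
        lo, hi = diffs[order[cut - 1]], diffs[order[cut]]
        if lo == 0:
            return float("inf") if hi > 0 else 0.0
        return hi / lo

    best = max(range(len(candidates)), key=score)
    return candidates[best]


print(select_partition_set([0, 1, 2, 10, 11, 30]))
# (3, [[0, 1, 2], [10, 11], [30]])
```

Because every candidate is read off the same sorted index array, generating all scales costs one sort plus a pass per candidate; any other scoring function could be substituted for `score` without changing the rest of the sketch.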
29. A computer readable medium storing a computer program which when executed by at least one processor defines structure for a document comprising a plurality of primitive elements that are defined in terms of their positions in the document, the computer program comprising sets of instructions for:
calculating relative difference values between adjacent pairs of a particular set of primitive elements sorted according to an order in which the primitive elements appear in the document;
sorting the calculated relative difference values into a monotonic order; and
storing a single number from the monotonic order that identifies where to form partitions between subsets of primitive elements in the document.
Dependent claims: 30, 31, 32, 33, 34, 35, 36, 37, 38
39. A method for defining a program for defining structure for a document, the method comprising:
defining a module for identifying pairs of nearest primitive elements in a document comprising a plurality of primitive elements that are defined in terms of their position in the document;
defining a module for sorting the pairs of primitive elements based on an order from the closest pair of primitive elements to the furthest pair of primitive elements;
defining a module for storing a single value that identifies which of the pairs of primitive elements are sufficiently far apart to form a partition; and
defining a module for using the stored value to identify and analyze the partitions in order to define structural elements for the document.
Dependent claims: 40, 41, 42, 43
44. A computer readable medium storing a computer program which when executed by at least one processor partitions a plurality of primitive elements of a document into structural elements, the computer program comprising sets of instructions for:
calculating positional difference values for pairs of a particular set of primitive elements according to their order in the document;
sorting the positional difference values of the pairs of primitive elements into a numbered order from closest to furthest pairs;
based on a minimum gap size, storing a single number from the numbered order that identifies which of the pairs of primitive elements are sufficiently far apart to form partitions between subsets of primitive elements; and
using the stored single number to partition the particular set of primitive elements into structural elements for the document.
Dependent claims: 45, 46, 47
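The role of the minimum gap size in claim 44 can be sketched as follows. Because the difference values are already sorted, the stored number can be found with a binary search. This is a hedged illustration only: the function name `split_by_min_gap` is hypothetical, positions are assumed to be 1-D coordinates listed in document order, and the absolute difference is an assumed choice of positional difference.

```python
from bisect import bisect_left


def split_by_min_gap(positions, min_gap):
    """Partition elements, given in document order, at gaps of at least `min_gap`."""
    # Positional difference for each adjacent pair (i, i + 1) in document order.
    gaps = [abs(b - a) for a, b in zip(positions, positions[1:])]
    # Numbered order: pair indices from the closest pair to the furthest pair.
    order = sorted(range(len(gaps)), key=gaps.__getitem__)
    sorted_gaps = [gaps[i] for i in order]
    # The single stored number: rank of the first pair at least `min_gap` apart.
    stored = bisect_left(sorted_gaps, min_gap)
    # Every pair ranked at or after `stored` separates two structural elements.
    split_after = set(order[stored:])
    groups, current = [], []
    for i, p in enumerate(positions):
        current.append(p)
        if i in split_after:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return stored, groups


print(split_by_min_gap([5, 7, 9, 40, 42, 44, 90], min_gap=20))
# (4, [[5, 7, 9], [40, 42, 44], [90]])
```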
Specification