Clustering
Abstract
Systems and methods for performing clustering of a document image are disclosed. A property of a mark extracted from a document is compared to the properties of the existing clusters. If the property of the mark fails to match any of the properties of the existing clusters, the mark is added to the existing clusters as a new cluster. One property that can be utilized is x size and y size, that is, the width and height of the existing clusters. Another property that can be employed is ink size, which refers to the ratio of black pixels to total pixels in a cluster. Yet another property that can be utilized is a reduced mark or image, which is a version of the bitmap of the mark and/or cluster reduced in pixel size. These properties can be employed to identify mismatches early and so reduce the number of bit by bit comparisons performed.
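The screening properties named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the bitmap layout (lists of 0/1 rows), the function names, the tolerance values, and the reduction rule (a reduced cell is black when at least half of its source pixels are black) are all assumptions made for illustration.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # rows of pixels: 1 = black, 0 = white (assumed layout)


def box_size(bm: Bitmap) -> Tuple[int, int]:
    """Return (x size, y size), i.e. width and height in pixels."""
    return len(bm[0]), len(bm)


def ink_size(bm: Bitmap) -> float:
    """Return the ratio of black pixels to total pixels."""
    return sum(map(sum, bm)) / (len(bm) * len(bm[0]))


def reduce_bitmap(bm: Bitmap, out_w: int = 2, out_h: int = 2) -> Bitmap:
    """Shrink bm to out_w x out_h cells (assumed rule: black if at least half black)."""
    w, h = box_size(bm)
    reduced = []
    for r in range(out_h):
        y0 = r * h // out_h
        y1 = max((r + 1) * h // out_h, y0 + 1)
        row = []
        for c in range(out_w):
            x0 = c * w // out_w
            x1 = max((c + 1) * w // out_w, x0 + 1)
            cell = [bm[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(1 if 2 * sum(cell) >= len(cell) else 0)
        reduced.append(row)
    return reduced


def is_cheap_mismatch(mark, cluster, size_tol=1, ink_tol=0.1, reduced_tol=0):
    """Return True when a cheap property already rules out a match."""
    (mw, mh), (cw, ch) = box_size(mark), box_size(cluster)
    if abs(mw - cw) > size_tol or abs(mh - ch) > size_tol:
        return True  # x size / y size differ too much
    if abs(ink_size(mark) - ink_size(cluster)) > ink_tol:
        return True  # ink size differs too much
    rm, rc = reduce_bitmap(mark), reduce_bitmap(cluster)
    differing = sum(a != b for ra, rb in zip(rm, rc) for a, b in zip(ra, rb))
    return differing > reduced_tol  # reduced images differ too much
```

Only pairs for which `is_cheap_mismatch` returns `False` would need the more expensive bit by bit comparison.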
28 Claims
1. A clustering system comprising:
a mark extractor to extract a mark from a document;
a match component operative to compare one or more properties of the mark to one or more match properties of existing clusters of marks so as to identify matching existing clusters, the match properties comprising resized images for the existing clusters, and the match component operative to compute a range of acceptable values for the one or more properties using a threshold;
a two dimensional table that stores the existing clusters according to box size, wherein if no matches are identified, the mark is added to the existing clusters as a new cluster, and if a match is identified then bitmaps of the mark and the matching existing clusters are compared; and
a match symbol component operative to compare a bitmap of the mark to a bitmap of the matching existing clusters, the bitmaps are compared bit by bit to identify a matching cluster having a similar bitmap, wherein once an acceptable matching cluster is identified based on the bitmaps, the bitmap of the matched cluster is updated with an average based on the bitmap of the mark, and if no matching cluster is acceptable based on the bitmaps, the mark is added to the existing clusters as a new cluster.
Dependent claims: 2-17.
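As an illustration of the table element of claim 1, the following Python sketch stores clusters by box size and looks up candidates within a range computed from a threshold. A dictionary keyed by (width, height) stands in for the two dimensional table; the class name `ClusterTable`, its methods, and the symmetric plus-or-minus threshold are assumptions of this sketch rather than details given in the claim.

```python
from collections import defaultdict


class ClusterTable:
    """Hypothetical table of cluster ids indexed by box size (width, height)."""

    def __init__(self, threshold=1):
        self.threshold = threshold      # assumed: same tolerance for width and height
        self.cells = defaultdict(list)  # (width, height) -> list of cluster ids

    def add(self, width, height, cluster_id):
        """Store a cluster in the cell for its box size."""
        self.cells[(width, height)].append(cluster_id)

    def candidates(self, width, height):
        """Return ids of clusters whose box size lies in the acceptable range."""
        t = self.threshold
        found = []
        for w in range(width - t, width + t + 1):
            for h in range(height - t, height + t + 1):
                # .get avoids creating empty cells in the defaultdict
                found.extend(self.cells.get((w, h), []))
        return found


table = ClusterTable(threshold=1)
table.add(10, 12, "a")
table.add(11, 12, "b")
table.add(20, 20, "c")
near = table.candidates(10, 12)  # clusters "a" and "b" are within range
```

An empty result from `candidates` corresponds to the "no matches are identified" branch, in which the mark would be added as a new cluster.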
18. A method of clustering comprising:
locating a mark within a document;
comparing a first property of the mark with first properties of existing clusters to identify matching and mismatching clusters by computing an acceptable range of properties;
on a match of the first property, comparing a bitmap of the mark with bitmaps of the matching clusters, the bitmaps are compared bit by bit to identify a matching cluster having a similar bitmap, wherein once an acceptable matching cluster is identified based on the bitmaps, the bitmap of the matched cluster is updated with an average based on the bitmap of the mark, and if no matching cluster is acceptable based on the bitmaps, the mark is added to the existing clusters as a new cluster;
on a mismatch of the first property, adding the mark as the new cluster to the existing clusters; and
generating a resized mark image from the bitmap of the mark.
Dependent claims: 19-25.
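The match, update, and add steps of claim 18 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the first property is reduced to an exact box-size check, "similar" means at most `max_diff` differing pixels, and the cluster bitmap is kept as a per-pixel running mean that is thresholded at 0.5. The claim does not specify any of these choices, and the names `Cluster`, `bit_distance`, and `assign` are hypothetical.

```python
class Cluster:
    """Hypothetical cluster holding a per-pixel running mean of its member marks."""

    def __init__(self, bitmap):
        self.count = 1
        self.mean = [[float(p) for p in row] for row in bitmap]

    def bitmap(self):
        """Binarize the running mean (assumed cutoff: 0.5)."""
        return [[1 if v >= 0.5 else 0 for v in row] for row in self.mean]

    def update(self, bitmap):
        """Fold a newly matched mark into the running mean."""
        self.count += 1
        for mean_row, row in zip(self.mean, bitmap):
            for i, p in enumerate(row):
                mean_row[i] += (p - mean_row[i]) / self.count


def bit_distance(a, b):
    """Count differing pixels between two bitmaps of equal size."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def assign(mark, clusters, max_diff=1):
    """Return the index of the cluster the mark joins, creating one if needed."""
    for idx, cluster in enumerate(clusters):
        cb = cluster.bitmap()
        if (len(cb), len(cb[0])) != (len(mark), len(mark[0])):
            continue  # first-property mismatch: skip the bit by bit comparison
        if bit_distance(mark, cb) <= max_diff:
            cluster.update(mark)  # acceptable match: update the cluster average
            return idx
    clusters.append(Cluster(mark))  # no acceptable match: new cluster
    return len(clusters) - 1
```

For example, starting from an empty list, `assign([[1, 0], [0, 1]], clusters)` creates cluster 0, and a later mark `[[1, 0], [1, 1]]` differs from it in one pixel, so it joins cluster 0 and shifts that cluster's average.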
26. A document encoding system comprising:
a mask separator operative to generate a binary mask from a document image, the binary mask including textual information;
a background foreground segmenter operative to segment a foreground image and a background image from the document image according to the binary mask; and
a clustering system operative to identify clusters in the mask in a computationally efficient manner.
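The first two elements of claim 26 can be pictured with a small Python sketch. The claim does not say how the mask is generated or how the layers are represented, so the fixed grayscale threshold in `make_mask` and the use of `None` for pixels that belong to the other layer are assumptions chosen only to keep the example short; both function names are hypothetical.

```python
def make_mask(image, threshold=128):
    """Assumed mask rule: dark pixels (below threshold) are marked 1, others 0."""
    return [[1 if p < threshold else 0 for p in row] for row in image]


def segment(image, mask):
    """Split the image into foreground and background layers according to the mask."""
    foreground = [
        [p if m else None for p, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
    background = [
        [None if m else p for p, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
    return foreground, background


image = [[10, 200], [250, 30]]  # tiny grayscale image, 0 = black
mask = make_mask(image)
foreground, background = segment(image, mask)
```

In such a pipeline the binary mask, which carries the textual information, is what the clustering system would then process.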
27. A computer readable medium storing computer executable components operable to perform a method of clustering, comprising:
a component for locating a mark;
a component for comparing a first property of the mark with first properties of existing clusters to identify matching and mismatching clusters by computing an acceptable range of properties;
on a match of the first property, a component for comparing a bitmap of the mark with bitmaps of the matching clusters, the bitmaps are compared bit by bit to identify a matching cluster having a similar bitmap, wherein once an acceptable matching cluster is identified based on the bitmaps, the bitmap of the matched cluster is updated with an average based on the bitmap of the mark, and if no matching cluster is acceptable based on the bitmaps, the mark is added to the existing clusters as a new cluster;
on a mismatch of the first property, a component to add the mark as a new cluster to the existing clusters; and
a component for generating a resized mark image from the bitmap of the mark.
28. A computer readable medium storing computer executable instructions operable to perform a method of clustering, comprising:
for each page of at least one page of a document;
a component for finding at least one mark;
a component for comparing a first property of the at least one mark with first properties of existing clusters to identify matching and mismatching clusters by computing a range of acceptable values for the first property using a threshold;
on a match of the first property, a component for comparing a bitmap of the at least one mark with bitmaps of the matching clusters, the bitmaps are compared bit by bit to identify a matching cluster having a similar bitmap, wherein once an acceptable matching cluster is identified based on the bitmaps, the bitmap of the matched cluster is updated with an average based on the bitmap of the mark, and if no matching cluster is acceptable based on the bitmaps, the mark is added to the existing clusters as a new cluster; and
on a mismatch of the first property, a component for adding the at least one mark as a new cluster to the existing clusters;
a component for updating a global library; and
a component for generating a resized mark image from the bitmap of the mark.
Specification