Relabelling of tokenized symbols in fontless structured document image representations
Abstract
A processor is provided with a first set of digital information that includes a first structured representation of a document. From the first set of digital information, the processor produces a second set of digital information that includes a second structured representation of the document. The second structured representation is a lossless representation and includes a set of tokens and a set of positions. At least one token in the plurality of tokens has an associated semantic label which may be a character code associated with various font types in the second structured representation of the document. The semantic label may be obtained and stored in the second structured representation of the document by a computer program. The first and second representations may be resolution dependent structured representations and have, respectively, first and second characteristic resolutions. The first representation, but not the second, is provided in digital form to an untrusted recipient. A search for particular content of the second representation, including semantic labels, is requested by the recipient. A highlighted version of the first representation of the document is then provided to the recipient.
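The token-and-position structure described in the abstract can be illustrated with a minimal sketch. Everything named here (`Token`, `TokenizedPage`, `render`, `search`, and the tiny bitmaps) is hypothetical and chosen only for illustration; it is not the implementation described in the patent. The sketch assumes a bilevel image, stores each distinct subimage once as a token with an optional semantic label, and records the positions at which each token occurs.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Bitmap = Tuple[Tuple[int, ...], ...]  # rows of 0/1 pixels (assumed bilevel)


@dataclass(frozen=True)
class Token:
    pixels: Bitmap                 # pixel data for one subimage
    label: Optional[str] = None    # semantic label, e.g. a character code


@dataclass
class TokenizedPage:
    width: int
    height: int
    tokens: List[Token] = field(default_factory=list)
    # Each position is (token index, x, y) of the subimage's top-left corner.
    positions: List[Tuple[int, int, int]] = field(default_factory=list)


def render(page: TokenizedPage) -> List[List[int]]:
    """Rebuild the page image by painting every token at every position."""
    image = [[0] * page.width for _ in range(page.height)]
    for index, x, y in page.positions:
        for dy, row in enumerate(page.tokens[index].pixels):
            for dx, bit in enumerate(row):
                if bit:
                    image[y + dy][x + dx] = 1
    return image


def search(page: TokenizedPage, label: str) -> List[Tuple[int, int]]:
    """Return the (x, y) positions of every token carrying the given label."""
    return [(x, y) for index, x, y in page.positions
            if page.tokens[index].label == label]


# Two illustrative tokens: a vertical bar labelled "l" and a short dash.
bar = Token(pixels=((1,), (1,), (1,)), label="l")
dash = Token(pixels=((1, 1),), label="-")

# The bar token is stored once but occurs at two positions.
page = TokenizedPage(
    width=6,
    height=3,
    tokens=[bar, dash],
    positions=[(0, 0, 0), (1, 2, 1), (0, 5, 0)],
)

for row in render(page):
    print("".join("#" if bit else "." for bit in row))
print(search(page, "l"))
```

In this sketch the semantic labels make a content search possible without any font information: the search walks the positions and compares labels, and the matching coordinates could then be used to highlight regions of a separately supplied image.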
23 Claims
1. A method comprising the steps of:
providing a processor with a first set of digital information comprising a first structured representation of a document, a plurality of image collections being obtainable from the first structured representation, each such obtainable image collection comprising at least one image, each image in each such collection being an image of at least a portion of the document;
with a processor, producing from the first set of digital information a second set of digital information comprising a second structured representation of the document, the second structured representation being a lossless representation of a particular image collection, the particular image collection being one of the plurality of image collections obtainable from the first structured representation, the second structured representation including a plurality of tokens and a plurality of positions, wherein at least one token in the plurality of tokens has an associated semantic label, the second set of digital information being produced by extracting the plurality of tokens from the first structured representation, each token comprising a set of pixel data representing a subimage of the particular image collection, and determining the plurality of positions from the first structured representation, each position being a position of a token subimage in the particular image collection, a token subimage being one of the subimages from one of the tokens, at least one token subimage having a plurality of pixels and occurring at more than one position in the image collection;
and making the second set of digital information thus produced available for further use.
10. An article of manufacture comprising an information storage medium wherein is stored information comprising a computer program for facilitating production by a processor of a second set of digital information from a first set of digital information,
the first set of digital information comprising a first structured representation of a document, the first structured representation having a plurality of image collections, each such obtainable image collection comprising at least one image, each image in each such collection being an image of at least a portion of the document; the second set of digital information comprising a second structured representation of a document, the second structured representation being a lossless representation of a particular image collection, the particular image collection being one of the plurality of image collections obtainable from the first structured representation, the second structured representation including a plurality of tokens and a plurality of positions, wherein at least one token in the plurality of tokens has an associated semantic label, each token comprising a set of pixel data representing a subimage of the particular image collection, each position being a position of a token subimage in the particular image collection, a token subimage being one of the subimages from one of the tokens, at least one token subimage having a plurality of pixels and occurring at more than one position in the particular image collection.
18. A method comprising the steps of:
providing a processor with a first set of digital information comprising a first structured representation (hereinafter “the starting representation”) of a document, the starting representation being a resolution-independent representation, a plurality of image collections being obtainable from the starting representation, each such obtainable image collection comprising at least one image, each image in each such collection being an image of at least a portion of the document, each image in each such collection having a characteristic resolution;
with a processor, producing from the first set of digital information a second set of digital information comprising a second structured representation (hereinafter, “the low-resolution representation”) of the document, the low-resolution representation being a lossless representation of a particular image collection (hereinafter, “the low-resolution image collection”), the low-resolution image collection being one of the plurality of image collections obtainable from the starting representation, each image in the low-resolution image collection having a first characteristic resolution (hereinafter, “the low resolution”), the low-resolution representation including a plurality of tokens (hereinafter “the low-resolution tokens”) and a plurality of positions, the second set of digital information being produced by extracting the low-resolution tokens from the starting representation, each low-resolution token comprising a set of pixel data representing a subimage of the low-resolution image collection, and determining from the starting representation the plurality of positions of the low-resolution representation, each position of the low-resolution representation being a position of a subimage (hereinafter, “the low-resolution subimage”) in the low-resolution image collection, a low-resolution subimage being one of the subimages from one of the low-resolution tokens, at least one low-resolution subimage having a plurality of pixels and occurring at more than one position in the image collection;
with a processor, producing from the first set of digital information a third set of digital information comprising a third structured representation (hereinafter, “the high-resolution representation”) of the document, the high-resolution representation being a lossless representation of a particular image collection (hereinafter, “the high-resolution image collection”), the high-resolution image collection being one of the plurality of image collections obtainable from the starting representation, each image in the high-resolution image collection having a second characteristic resolution (hereinafter, “the high resolution”), the high resolution being greater than the low resolution, the high-resolution representation including a plurality of tokens (hereinafter “the high-resolution tokens”) and a plurality of positions wherein at least one high-resolution token in the plurality of tokens has an associated semantic label, the third set of digital information being produced by extracting the high-resolution tokens from the starting representation, each high-resolution token comprising a set of pixel data representing a subimage of the high-resolution image collection, and determining from the starting representation the plurality of positions of the high-resolution representation, each position of the high-resolution representation being a position of a subimage (hereinafter, “the high-resolution subimage”) in the high-resolution image collection, a high-resolution subimage being one of the subimages from one of the high-resolution tokens, at least one high-resolution subimage having a plurality of pixels and occurring at more than one position in the image collection; and
making the second and third sets of digital information thus produced available for further use.
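The two-resolution arrangement recited in claim 18 can be sketched in the same spirit. The sketch below is an assumption-laden illustration, not the claimed method: `SHAPES`, `rasterize`, `tokenize`, and `layout` are hypothetical names, the resolution-independent starting representation is reduced to a list of glyph placements on a unit grid, and rasterization is simple pixel replication. Semantic labels are attached only to the high-resolution tokens, since the claim requires labels only there.

```python
from typing import Dict, List, Optional, Tuple

Bitmap = Tuple[Tuple[int, ...], ...]

# Hypothetical resolution-independent glyph shapes on a unit grid.
SHAPES: Dict[str, Bitmap] = {
    "l": ((1,), (1,)),
    "-": ((1, 1),),
}


def rasterize(shape: Bitmap, scale: int) -> Bitmap:
    """Render a unit-grid shape at the given resolution by pixel replication."""
    return tuple(
        tuple(bit for bit in row for _ in range(scale))
        for row in shape
        for _ in range(scale)
    )


def tokenize(
    layout: List[Tuple[str, int, int]], scale: int, with_labels: bool
) -> Tuple[List[Tuple[Bitmap, Optional[str]]], List[Tuple[int, int, int]]]:
    """Extract distinct token bitmaps and their positions at one resolution."""
    tokens: List[Tuple[Bitmap, Optional[str]]] = []
    index_of: Dict[Bitmap, int] = {}
    positions: List[Tuple[int, int, int]] = []
    for char, x, y in layout:
        bitmap = rasterize(SHAPES[char], scale)
        if bitmap not in index_of:
            # First occurrence of this subimage: store it once as a token.
            index_of[bitmap] = len(tokens)
            tokens.append((bitmap, char if with_labels else None))
        positions.append((index_of[bitmap], x * scale, y * scale))
    return tokens, positions


# Starting representation (assumed form): glyph name plus unit-grid position.
layout = [("l", 0, 0), ("-", 2, 0), ("l", 5, 0)]

low_tokens, low_positions = tokenize(layout, scale=1, with_labels=False)
high_tokens, high_positions = tokenize(layout, scale=2, with_labels=True)

print(low_positions)
print(high_positions)
```

Both outputs are derived from the same starting representation, and in each the "l" token is stored once while occurring at two positions.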
Specification