Data format conversion
Abstract
Conversion of intermediate document data, representing document text derived from data in an image data format, into a semantically-meaningful tagged text data format may be provided. Intermediate document data derived from document image data may be input. The intermediate document data may comprise character data corresponding to characters in the document and attribute data corresponding to one or more attributes of characters in the document. The intermediate document data may then be processed according to attribute-dependent rules. Tagged text data may be generated comprising tagged sections of the document text. The tags may define semantically meaningful portions of the text determined according to the attribute data.
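The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Char` schema, the `font_size` and `bold` attributes, the thresholds in `tag_for`, and the tag names `title`, `heading` and `para` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from itertools import groupby
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class Char:
    """One recognized character plus its attributes (hypothetical schema)."""
    text: str
    font_size: float
    bold: bool = False


def tag_for(ch: Char) -> str:
    """Illustrative attribute-dependent rules: map a character's attributes to a tag."""
    if ch.font_size >= 18:
        return "title"
    if ch.bold:
        return "heading"
    return "para"


def to_tagged_text(chars: list[Char]) -> str:
    """Wrap each run of consecutive characters that map to the same tag."""
    parts = []
    for tag, run in groupby(chars, key=tag_for):
        text = escape("".join(c.text for c in run))  # escape &, <, > for markup
        parts.append(f"<{tag}>{text}</{tag}>")
    return "".join(parts)


def chars_of(text: str, font_size: float, bold: bool = False) -> list[Char]:
    """Helper that builds intermediate data for a string with uniform attributes."""
    return [Char(c, font_size, bold) for c in text]


doc = chars_of("Report", 24) + chars_of("Scope", 12, bold=True) + chars_of("A & B.", 12)
print(to_tagged_text(doc))
# <title>Report</title><heading>Scope</heading><para>A &amp; B.</para>
```

In this sketch the rules live in a single function; a template-driven design, as in the data carrier claims, would presumably load them from a separate rule set instead.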
23 Claims
1. A method of converting intermediate document data representing document text derived from data in an image data format into a semantically-meaningful tagged text data format, the method comprising:
inputting intermediate document data derived from document image data, said intermediate document data comprising character data corresponding to characters in the document and attribute data corresponding to one or more attributes of characters in the document;
processing the intermediate document data according to attribute-dependent rules; and
generating tagged text data comprising tagged sections of said document text, the tags defining semantically meaningful portions of said text determined according to said attribute data.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A method of converting a document from an image format into a semantically-meaningful format, the method comprising:
receiving document image data;
generating intermediate document data comprising words, lines, paragraphs and zones by generating character data corresponding to characters in the document and attribute data corresponding to one or more attributes of characters in the document and grouping said character data;
processing the intermediate document data according to format-identification rules and generating output data comprising tagged sections of text, the text corresponding to text in said document, the tags defining portions of said text determined by said attribute data; and
outputting said output data.
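The grouping step of claim 14 can be sketched as follows, covering only words and lines (paragraphs and zones are omitted). The claim does not specify how grouping is done, so the `Glyph` schema, the use of baseline and horizontal-gap positions, and the `y_tol` and `gap` thresholds are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Recognized character with position attributes (hypothetical schema)."""
    text: str
    x: float      # left edge
    y: float      # baseline
    width: float


def group_lines(glyphs, y_tol=2.0):
    """Start a new line whenever the baseline jumps by more than y_tol."""
    lines = []
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if lines and abs(g.y - lines[-1][-1].y) <= y_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    # Restore left-to-right reading order within each line.
    return [sorted(line, key=lambda g: g.x) for line in lines]


def group_words(line, gap=3.0):
    """Split a non-empty line into words wherever the horizontal gap exceeds `gap`."""
    words, current = [], [line[0]]
    for prev, g in zip(line, line[1:]):
        if g.x - (prev.x + prev.width) > gap:
            words.append(current)
            current = [g]
        else:
            current.append(g)
    words.append(current)
    return ["".join(g.text for g in w) for w in words]


glyphs = [
    Glyph("H", 0, 10, 5), Glyph("i", 5, 10.5, 2),   # slight baseline jitter
    Glyph("y", 12, 10, 5), Glyph("o", 17, 10, 5),
    Glyph("o", 0, 30, 5), Glyph("k", 5, 30, 5),
]
structure = [group_words(line) for line in group_lines(glyphs)]
print(structure)  # [['Hi', 'yo'], ['ok']]
```

Fixed thresholds keep the sketch short; a practical system would more likely scale them with the font size carried in the attribute data.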
15. A data carrier carrying a template data structure, said template data structure comprising data for a plurality of data format conversion rules for application to intermediate document data derived from document image data, said intermediate document data comprising character data corresponding to characters in the text of the document and attribute data corresponding to one or more attributes of characters in the document, a said rule comprising data for identifying a set of characters for a portion of document text from said character attributes and/or one or more variables defined by a said rule.
21. A data carrier carrying a list of at least one attribute-dependent rule for use in processing document data, the document data comprising character data corresponding to characters in a document and attribute data corresponding to attributes of characters in the document, the rule comprising:
a first portion in an imperative programming language for determining portions of said document delineated by said attribute data; and
a second portion in a document processing language for generating output data comprising tagged sections of text, the tags defining portions of said text delineated by said attribute data.
22. An apparatus for converting a document from an image format into a semantically-meaningful format, the apparatus comprising:
a data memory operable to store data to be processed;
an instruction memory storing processor implementable instructions; and
a processor operable to read and process the data in accordance with instructions stored in the instruction memory;
wherein the instructions stored in the instruction memory comprise instructions for controlling the processor to:
receive document image data;
generate from said image data intermediate document data comprising character data corresponding to characters in the document and attribute data corresponding to one or more attributes of characters in the document;
process the intermediate document data according to attribute-dependent rules and generate output data comprising tagged sections of text, the text corresponding to text in said document, the tags defining semantically meaningful portions of said text determined by said attribute data; and
output said output data.
Dependent claims: 23
Specification