Detection and reconstruction of East Asian layout features in a fixed format document
Abstract
Detection and reconstruction of East Asian layout features in a fixed format document are provided. Vertically written text in the fixed format document is detected and rotated for layout analysis. After layout analysis, the rotated text is rotated back and restructured in a flow format document. When a plurality of characters is written horizontally in a vertical line of text, vertically overlapping text runs are detected, designated as horizontal-in-vertical text, and restructured as horizontal-in-vertical text in the flow format document. Lines of text are analyzed for attributes of a ruby line; such lines are designated as ruby text, associated with corresponding text in a ruby base line, and restructured as ruby text in the flow format document. Text in the fixed format document is analyzed to detect a particular East Asian language so that a font for that language can be designated in the flow format document.
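The abstract does not say how the East Asian language is detected. As one possible illustration, the sketch below guesses the language from Unicode block membership. The function name `detect_east_asian_language`, the priority order (kana, then Hangul, then Han), and the returned language tags are all assumptions made for this example, not details taken from the patent.

```python
from typing import Optional


def detect_east_asian_language(text: str) -> Optional[str]:
    """Guess "ja", "ko", or "zh" from Unicode blocks (hypothetical heuristic)."""
    kana = hangul = han = 0
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:
            # Hiragana and Katakana blocks
            kana += 1
        elif 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
            # Hangul Syllables and Hangul Jamo blocks
            hangul += 1
        elif 0x4E00 <= cp <= 0x9FFF:
            # CJK Unified Ideographs block
            han += 1
    # Assumed priority: kana implies Japanese, Hangul implies Korean,
    # and Han characters alone are treated as Chinese.
    if kana:
        return "ja"
    if hangul:
        return "ko"
    if han:
        return "zh"
    return None
```

A serializer could then map the returned tag to a font for the flow format document; which font to use for each language is left open here.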
20 Claims
1. A method for detecting ruby text in a fixed format document, the method comprising:
receiving, at a parser, a fixed format document containing one or more lines of text on one or more pages;
detecting, by a line detection engine, one or more lines in the fixed format document containing one or more attributes of a ruby line;
retaining the one or more lines in the fixed format document containing one or more attributes of a ruby line as ruby line candidates and a line successive to the one or more lines as ruby base line candidates;
analyzing, by a document processor, the ruby line candidate for finding one or more ruby texts contained in the ruby line candidate;
matching the one or more ruby texts with a corresponding ruby base text in a successive ruby base line candidate for reconstruction in a flow format document; and
reconstructing, by a serializer, the fixed format document to a flow format document containing the matched one or more ruby texts and corresponding ruby base text.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
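The matching and reconstructing steps of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Run` type, the rule that a base character belongs to the ruby run covering its horizontal center, and the choice of HTML `<ruby>`/`<rt>` markup as the flow format are all assumptions introduced here.

```python
from dataclasses import dataclass
from html import escape
from typing import List, Optional, Tuple


@dataclass
class Run:
    """A piece of text with its horizontal extent (hypothetical type)."""
    text: str
    x0: float  # left edge
    x1: float  # right edge


def match_ruby(ruby_runs: List[Run],
               base_chars: List[Run]) -> List[Tuple[str, Optional[str]]]:
    """Pair each stretch of base text with the ruby text above it, if any."""
    groups: List[list] = []  # each entry: [base_text, ruby_index or None]
    for ch in base_chars:
        center = (ch.x0 + ch.x1) / 2
        # Assumed rule: the first ruby run spanning the character's center wins.
        idx = next((i for i, r in enumerate(ruby_runs)
                    if r.x0 <= center <= r.x1), None)
        if groups and groups[-1][1] == idx:
            groups[-1][0] += ch.text  # extend the current group
        else:
            groups.append([ch.text, idx])
    return [(text, ruby_runs[i].text if i is not None else None)
            for text, i in groups]


def to_html(pairs: List[Tuple[str, Optional[str]]]) -> str:
    """Serialize matched pairs as HTML ruby markup (assumed flow format)."""
    parts = []
    for base, ruby in pairs:
        if ruby is None:
            parts.append(escape(base))
        else:
            parts.append(f"<ruby>{escape(base)}<rt>{escape(ruby)}</rt></ruby>")
    return "".join(parts)
```

Given base characters laid out left to right and ruby runs positioned above some of them, `to_html(match_ruby(ruby_runs, base_chars))` yields a single flow-format string in which each annotated stretch carries its ruby text.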
11. A computing device for detecting ruby text in a fixed format document, comprising:
a processing unit; and
a memory including computer-readable instructions which when executed by the processor are operable to:
detect, at a parser, a fixed format document;
detect, at a line detection engine, one or more lines in the fixed format document containing one or more attributes of a ruby line;
retain the one or more lines in the fixed format document containing one or more attributes of a ruby line as ruby line candidates and a line successive to the one or more lines as ruby base line candidates;
analyze, by a document processor, the ruby line candidate for finding one or more ruby texts contained in the ruby line candidate;
match the one or more ruby texts with a corresponding ruby base text in a successive ruby base line candidate for reconstruction in a flow format document; and
reconstruct, by a serializer, the fixed format document as the flow format document containing the matched one or more ruby texts and the corresponding ruby base text.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
20. A computer readable storage device including instructions, which when executed by a processor, detect ruby text in a fixed format document by:
detecting, at a parser, a fixed format document;
detecting, at a line detection engine, one or more lines in the fixed format document containing one or more attributes of a ruby line, including:
  analyzing the one or more lines of text for finding an empty line or a line consisting of whitespace characters;
  if a line of text is empty or consists of whitespace characters, discarding the line of text as a ruby line candidate or as a ruby base line candidate;
  analyzing the one or more lines of text for determining if a line of text extends past a successive line of text more than a predetermined amount;
  if the line of text extends past the successive line of text more than the predetermined amount, discarding the line of text as a ruby line candidate;
  analyzing the one or more lines of text for determining if a line of text comprises more empty space between successive words than a successive line of text;
  if the line of text comprises more empty space between successive words than the successive line of text, discarding the line of text as a ruby line candidate;
  analyzing the one or more lines of text for determining if a font size of characters in a line of text is smaller than a font size of characters in a successive line of text;
  if the font size of the characters in the line of text is smaller than the font size of the characters in the successive line of text, retaining the line of text as a ruby line candidate and the successive line of text as a ruby base line candidate;
  analyzing the one or more lines of text for determining if a distance between a line of text and a successive line of text is less than a predetermined distance;
  if the distance between the line of text and the successive line of text is less than the predetermined distance, retaining the line of text as a ruby line candidate and the successive line of text as a ruby base line candidate;
  analyzing the ruby base line candidates for determining if the ruby base line candidate comprises Chinese, Japanese, or Korean characters; and
  if the ruby base line candidate comprises Chinese, Japanese, or Korean characters, retaining the line of text as a ruby base line candidate and a preceding line of text as a ruby line candidate;
retaining the one or more lines in the fixed format document containing one or more attributes of a ruby line as ruby line candidates and a line successive to the one or more lines as ruby base line candidates;
analyzing, by a document processor, the ruby line candidate for finding one or more ruby texts contained in the ruby line candidate;
matching the one or more ruby texts with a corresponding ruby base text in a successive ruby base line candidate for reconstruction in a flow format document; and
reconstructing, by a serializer, the fixed format document as the flow format document containing the matched one or more ruby texts and the corresponding ruby base text.
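The line tests recited in claim 20 can be sketched as a filter over pairs of adjacent lines. The claim lists each "retaining" test separately and does not state how the tests combine; the sketch below assumes a pair is kept only if it survives every discard test and passes all three retain tests. The `Line` type, its fields, the downward-growing y axis, and the default thresholds are likewise hypothetical, since the claim says only "predetermined".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Line:
    """A text line with layout measurements (hypothetical type)."""
    text: str
    x0: float         # left edge
    x1: float         # right edge
    top: float        # y of top edge, y grows downward (assumption)
    bottom: float     # y of bottom edge
    font_size: float
    word_gap: float   # empty space between successive words


def has_cjk(text: str) -> bool:
    """True if text contains Han, kana, or Hangul syllable characters."""
    return any(
        0x4E00 <= ord(c) <= 0x9FFF       # CJK Unified Ideographs
        or 0x3040 <= ord(c) <= 0x30FF    # Hiragana and Katakana
        or 0xAC00 <= ord(c) <= 0xD7A3    # Hangul Syllables
        for c in text
    )


def is_ruby_pair(line: Line, nxt: Line,
                 max_overhang: float = 5.0,
                 max_distance: float = 4.0) -> bool:
    """Decide whether (line, nxt) is a ruby line / ruby base line candidate pair."""
    # Discard: empty or whitespace-only lines.
    if not line.text.strip() or not nxt.text.strip():
        return False
    # Discard: line extends past the successive line by more than the threshold.
    overhang = max(nxt.x0 - line.x0, line.x1 - nxt.x1)
    if overhang > max_overhang:
        return False
    # Discard: more empty space between words than in the successive line.
    if line.word_gap > nxt.word_gap:
        return False
    # Retain (assumed to be combined with AND): smaller font,
    # small vertical distance, and CJK characters in the base line.
    return (line.font_size < nxt.font_size
            and nxt.top - line.bottom < max_distance
            and has_cjk(nxt.text))


def find_candidates(lines: List[Line]) -> List[Tuple[int, int]]:
    """Return (ruby_line_index, base_line_index) for each candidate pair."""
    return [(i, i + 1) for i in range(len(lines) - 1)
            if is_ruby_pair(lines[i], lines[i + 1])]
```

Combining the retain tests with OR instead would follow the claim wording just as closely; the AND combination is used here only because it keeps the example strict and easy to check.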
Specification