TEXT SEGMENTATION OF A DOCUMENT
First Claim
1. A system to segment text from a portable document format (PDF) document, the system comprising:
memory for storing computer executable instructions; and
a processing unit for accessing the memory and executing the computer executable instructions, the computer executable instructions comprising:
an engine to group line segments into text blocks using a homogeneity measure based on relative line space difference between line segments and a homogeneity measure based on difference in font size between line segments, wherein the line segments comprise text elements extracted from the PDF document.
Abstract
A system and method are provided for segmenting text from a portable document format (PDF) document. The system includes a memory for storing computer executable instructions and a processing unit for accessing the memory and executing the computer executable instructions. The computer executable instructions include an engine to group line segments into text blocks using a homogeneity measure based on relative line space difference between line segments and a homogeneity measure based on difference in font size between line segments, where the line segments comprise text elements extracted from the PDF document.
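The grouping described in the abstract can be illustrated with a minimal sketch. It is not the patented engine: the `LineSegment` fields, the greedy top-to-bottom strategy, the thresholds `max_space_diff` and `max_font_diff`, and the fallback rule for a block that has only one line are all assumptions chosen for illustration. The sketch starts a new text block whenever the font size differs too much from the previous line, or whenever the spacing to the previous line differs too much, in relative terms, from the spacing already established inside the block.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LineSegment:
    """Hypothetical record for one extracted line of text."""
    text: str
    y_center: float   # vertical position of the line, in points (assumed)
    font_size: float  # dominant font size of the line, in points (assumed)


def relative_difference(a: float, b: float) -> float:
    """Return |a - b| scaled by the larger value; 0.0 means identical."""
    largest = max(abs(a), abs(b))
    return 0.0 if largest == 0 else abs(a - b) / largest


def group_into_blocks(
    lines: List[LineSegment],
    max_space_diff: float = 0.2,  # assumed threshold on relative line space difference
    max_font_diff: float = 0.5,   # assumed threshold on font size difference, in points
) -> List[List[LineSegment]]:
    """Greedily group lines (given in reading order) into text blocks."""
    blocks: List[List[LineSegment]] = []
    for line in lines:
        if not blocks:
            blocks.append([line])
            continue
        block = blocks[-1]
        prev = block[-1]
        # Homogeneity measure 1: difference in font size.
        font_ok = abs(line.font_size - prev.font_size) <= max_font_diff
        # Homogeneity measure 2: relative difference in line space.
        gap = abs(line.y_center - prev.y_center)
        if len(block) >= 2:
            reference_gap = abs(block[-1].y_center - block[-2].y_center)
            space_ok = relative_difference(gap, reference_gap) <= max_space_diff
        else:
            # No spacing established yet: assumed fallback of two font sizes.
            space_ok = gap <= 2.0 * prev.font_size
        if font_ok and space_ok:
            block.append(line)
        else:
            blocks.append([line])
    return blocks
```

With a large-font title followed by two evenly spaced paragraphs separated by a wider gap, this sketch yields one block for the title and one block per paragraph. A production system would likely also consider horizontal alignment and multi-column layouts, which the sketch ignores.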
20 Claims
1. A system to segment text from a portable document format (PDF) document, the system comprising:
memory for storing computer executable instructions; and
a processing unit for accessing the memory and executing the computer executable instructions, the computer executable instructions comprising:
an engine to group line segments into text blocks using a homogeneity measure based on relative line space difference between line segments and a homogeneity measure based on difference in font size between line segments, wherein the line segments comprise text elements extracted from the PDF document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A method performed using at least one processor of a computer system, the method comprising:
determining, using at least one processor, line segments of a portable document format (PDF) document, wherein the line segments comprise text elements extracted from the PDF document;
grouping, using at least one processor, the line segments into text blocks using a homogeneity measure based on relative line space difference between line segments and a homogeneity measure based on difference in font size between line segments, wherein the line space is determined as a distance between vertical center lines, wherein each vertical center line is associated with a respective line segment, and wherein the vertical center line provides an indication of the position and extent of the respective line segment.
Dependent claims: 14, 15, 16, 17
18. A non-transitory computer-readable medium having code representing computer-executable instructions encoded thereon, the computer executable instructions comprising instructions executable to cause one or more processors to:
determine line segments of a portable document format (PDF) document, wherein the line segments comprise text elements extracted from the PDF document; and
group the line segments into text blocks using a homogeneity measure based on relative line space difference between line segments and a homogeneity measure based on difference in font size between line segments, wherein the line space is determined as a distance between vertical center lines, wherein each vertical center line is associated with a respective line segment, and wherein the vertical center line provides an indication of the position and extent of the respective line segment.
Dependent claims: 19, 20
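Claims 13 and 18 define line space as a distance between vertical center lines. One possible reading, sketched below, treats the vertical center line of a line segment as the horizontal line through the middle of its bounding box, so that its y value gives the position and its x range gives the extent. This interpretation, the `CenterLine` type, the bounding-box argument order, and the normalization used in `relative_line_space_difference` are assumptions; the claims themselves do not give a formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CenterLine:
    """Assumed representation: a horizontal line through the middle of a text line."""
    x_left: float   # left extent of the line segment
    x_right: float  # right extent of the line segment
    y: float        # vertical position of the center line


def center_line(x_left: float, y_bottom: float,
                x_right: float, y_top: float) -> CenterLine:
    """Derive the center line from a line segment's bounding box."""
    return CenterLine(x_left, x_right, (y_bottom + y_top) / 2.0)


def line_space(a: CenterLine, b: CenterLine) -> float:
    """Line space as the vertical distance between two center lines."""
    return abs(a.y - b.y)


def relative_line_space_difference(space_a: float, space_b: float) -> float:
    """Difference between two line spaces, scaled by the larger one (assumed)."""
    largest = max(space_a, space_b)
    return 0.0 if largest == 0 else abs(space_a - space_b) / largest


# Three line segments given as (x_left, y_bottom, x_right, y_top) boxes.
first = center_line(72.0, 700.0, 300.0, 712.0)
second = center_line(72.0, 686.0, 280.0, 698.0)
third = center_line(72.0, 650.0, 300.0, 662.0)

space_12 = line_space(first, second)  # spacing inside a paragraph
space_23 = line_space(second, third)  # wider spacing before the next paragraph
difference = relative_line_space_difference(space_12, space_23)
```

Measuring between center lines rather than between the bottom of one line and the top of the next makes the spacing less sensitive to ascenders and descenders that change the box height from line to line.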
Specification