Automatic table detection method and system
Abstract
A method for automatically detecting table data in a document that is described by a page definition language and converting the table data into a markup language representation. The document may have one or more pages. The page definition language description of the document provides a list of words, the position of each word on a page with respect to a predetermined reference point, and the size of each word. The present invention automatically identifies table data in the document by utilizing one or more table-identifying features. A first table-identifying feature may be the number of word clusters on a line. A second table-identifying feature may be the vertical alignment of word clusters between lines. A third table-identifying feature may be the changes in text density or space density between lines.
11 Claims
1. A computer-implemented method of identifying table data in a document comprising the steps of:
a) receiving a page description language representation of the document for providing a list of words in the document and position information for the words; and
b) automatically identifying table data in the document based on the page description language representation of the document and at least one table identifying feature, wherein the step of identifying includes the steps of,
b1) automatically determining a table bounding box for each table in the document, wherein the table bounding box includes a top edge and a bottom edge;
b2) expanding each table bounding box based on a text density feature, wherein the expanding step includes the steps of,
b2-1) for each line determining a text density measure;
b2-2) for each line determining a change of text density between the current line and the previous line;
b2-3) if the change in text density reaches a predetermined threshold, marking the current line with a text density tag;
b2-4) expanding the top edge of the table bounding box in a first direction to one of a line previously marked by a text density tag and a line with a single word cluster; and
b2-5) expanding the bottom edge of the table bounding box in a second direction to one of a line previously marked by a text density tag and a line with a single word cluster; and
b3) converting the table data encompassed by each table bounding box to a markup language representation.
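For illustration only, the text density steps of claim 1 (computing a density per line, tagging lines where the density changes sharply, and expanding the top and bottom edges of a bounding box) might be sketched in Python as follows. The claim does not define the density measure, the threshold value, or whether the stop line itself joins the table, so the coverage-based `text_density`, the `threshold` parameter, and the choice to place each edge on the stop line are assumptions, and all function names are hypothetical.

```python
from typing import List, Tuple

Word = Tuple[float, float]  # (left x, right x) of one word; an assumed representation


def text_density(words: List[Word], page_width: float) -> float:
    """Fraction of the page width covered by words (an assumed density measure)."""
    return sum(x1 - x0 for x0, x1 in words) / page_width


def density_tags(lines: List[List[Word]], page_width: float, threshold: float) -> List[bool]:
    """Tag a line when its density differs from the previous line's by at least threshold."""
    densities = [text_density(words, page_width) for words in lines]
    tags = [False] * len(lines)
    for i in range(1, len(lines)):
        if abs(densities[i] - densities[i - 1]) >= threshold:
            tags[i] = True
    return tags


def expand_box(top: int, bottom: int, tags: List[bool],
               cluster_counts: List[int]) -> Tuple[int, int]:
    """Move each edge outward to the nearest stop line, or to the page edge if none exists.

    A stop line carries a density tag or holds a single word cluster.
    Placing the edge on the stop line itself is an assumption of this sketch.
    """
    def is_stop(i: int) -> bool:
        return tags[i] or cluster_counts[i] == 1

    while top > 0:          # first direction: toward the top of the page
        top -= 1
        if is_stop(top):
            break
    last = len(tags) - 1
    while bottom < last:    # second direction: toward the bottom of the page
        bottom += 1
        if is_stop(bottom):
            break
    return top, bottom
```

Line indices here count downward from the top of a page, and `cluster_counts[i]` is the number of word clusters already found on line `i`.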
b1) dividing the document into one or more pages;
b2) dividing each page into a plurality of lines;
b3) for each line, clustering the words of the line into one or more word clusters;
b4) automatically identifying table data in the document based on the number of word clusters for each line and the alignment of the word clusters between lines.
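As a rough sketch of the clustering and identification steps just listed, the Python below groups the words of a line into word clusters by horizontal gap and then flags lines that have several clusters aligned with those of a neighbouring line. The gap rule, the `max_gap` parameter, the overlap-based alignment test, and the requirement of at least two aligned clusters are all assumptions made for illustration; the claims do not specify them, and the function names are hypothetical.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]  # (left x, right x, text); an assumed representation
Cluster = Tuple[float, float]    # (left x, right x) of one word cluster


def cluster_words(words: List[Word], max_gap: float) -> List[Cluster]:
    """Merge neighbouring words; a gap wider than max_gap starts a new cluster."""
    clusters: List[Cluster] = []
    for x0, x1, _text in sorted(words):
        if clusters and x0 - clusters[-1][1] <= max_gap:
            clusters[-1] = (clusters[-1][0], max(clusters[-1][1], x1))
        else:
            clusters.append((x0, x1))
    return clusters


def overlaps(a: Cluster, b: Cluster) -> bool:
    """True when two clusters share some horizontal extent (assumed alignment test)."""
    return a[0] < b[1] and b[0] < a[1]


def aligned(cur: List[Cluster], other: List[Cluster]) -> bool:
    """True when at least two clusters of cur overlap some cluster of other."""
    return sum(any(overlaps(c, o) for o in other) for c in cur) >= 2


def candidate_table_lines(lines: List[List[Cluster]]) -> List[bool]:
    """Flag lines with 2+ clusters that align with an adjacent line of 2+ clusters."""
    flags = []
    for i, cur in enumerate(lines):
        neighbours = lines[max(i - 1, 0):i] + lines[i + 1:i + 2]
        flags.append(
            len(cur) >= 2
            and any(len(n) >= 2 and aligned(cur, n) for n in neighbours)
        )
    return flags
```

With a `max_gap` of 10, the words `(0, 20, "Item")`, `(25, 40, "name")`, and `(100, 120, "Price")` form two clusters, because only the 60-unit gap exceeds the limit.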
3. The method of claim 2 wherein the step of automatically identifying table data in the document based on the number of word clusters of each line and the alignment of the word clusters between lines further comprises:
b4-1) using the word clusters to generate column position information; and
b4-2) updating the column position information by performing a union operation between the column position information of the previous line and the column position information of the current line.
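One possible reading of the union operation in claim 3, shown here only as a sketch, treats column position information as a list of horizontal spans and merges the spans of the current line into those accumulated from previous lines. Representing columns as `(left, right)` spans and merging spans that touch or overlap are assumptions; the claim states only that a union is performed, and the function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (left x, right x); an assumed column representation


def union_columns(previous: List[Span], current: List[Span]) -> List[Span]:
    """Union of two span lists, merging spans that touch or overlap."""
    merged: List[Span] = []
    for x0, x1 in sorted(previous + current):
        if merged and x0 <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], x1))
        else:
            merged.append((x0, x1))
    return merged


def column_positions(lines: List[List[Span]]) -> List[Span]:
    """Fold the word clusters of successive lines into one set of column spans."""
    columns: List[Span] = []
    for clusters in lines:
        columns = union_columns(columns, clusters)
    return columns
```

Under this reading, a column widens as later lines contribute clusters that overlap it, and a cluster that overlaps no existing span starts a new column.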
4. The method of claim 1 wherein receiving a page description language representation of the document for providing a list of words in the document and position information for the words includes receiving a PDF representation of the document, and wherein converting the table data encompassed by each table bounding box to a markup language representation includes converting the table data encompassed by each table bounding box to a HTML representation.
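The HTML output side of claim 4 can be illustrated with a minimal Python sketch that turns rows of cell text into an HTML table. It does not parse PDF; it assumes the words inside a bounding box have already been grouped into rows and cells, and the function name `table_to_html` is hypothetical.

```python
from html import escape
from typing import List


def table_to_html(rows: List[List[str]]) -> str:
    """Render rows of cell text as an HTML table, escaping special characters."""
    lines = ["<table>"]
    for row in rows:
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

Escaping each cell keeps characters such as `&` and `<` in the source text from being read as markup.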
5. The method of claim 1 wherein the step of automatically identifying table data in the document based on the page description language representation of the document and at least one table identifying feature further comprises:
b1) automatically identifying table data in the document based on one or more table headings.
6. The method of claim 1 wherein the step of automatically identifying table data in the document based on the page description language representation of the document and at least one table identifying feature further comprises:
b1) automatically identifying table data in the document based on one or more horizontal lines and vertical lines that separate rows or columns of the table.
7. A computer-readable medium having stored thereon sequences of instructions, said sequences of instructions including instructions which, when executed by a processor, cause said processor to perform the steps of:
a) receiving a page description language representation of a document for providing a list of words in the document and position information for the words; and
b) automatically identifying table data in the document based on the page description language representation of the document and at least one table identifying feature, wherein the identifying step includes the steps of,
b1) dividing the document into one or more pages;
b2) dividing each page into a plurality of lines, wherein the dividing step includes the steps of,
b2-1) for each line determining a text density measure;
b2-2) for each line determining a change of text density between the current line and the previous line;
b2-3) if the change in text density reaches a predetermined threshold, marking the current line with a text density tag;
b2-4) expanding the top edge of the table bounding box in a first direction to one of a line previously marked by a text density tag and a line with a single word cluster; and
b2-5) expanding the bottom edge of the table bounding box in a second direction to one of a line previously marked by a text density tag and a line with a single word cluster;
b3) for each line, clustering the words of the line into one or more word clusters; and
b4) automatically identifying table data in the document based on the number of word clusters for each line and the alignment of the word clusters between lines.
b4-1) using the word clusters to generate column position information; and
b4-2) updating the column position information by performing a union operation between the column position information of the previous line and the column position information of the current line.
9. The computer-readable medium of claim 7 further containing instructions which, when executed by said processor, would cause said processor to perform the steps of:
b1) automatically determining a table bounding box for each table in the document;
b2) expanding each table bounding box based on a text density feature; and
b3) converting the table data encompassed by each table bounding box to a markup language representation.
10. A document processing system comprising:
a) a processor for executing programs; and
b) a table identification program for receiving a page description language representation of a document, the page description language representation providing a list of words in the document and position information for the words, and for automatically identifying table data in the document based on the page description representation of the document and at least one table identifying feature, wherein the table identification program includes,
b1) a bounding box generation module for receiving the list of words and for automatically generating a table bounding box for each table in the document based on the number of word clusters in each line; and
b2) an expansion module coupled to the bounding box generation module for receiving the table bounding box for each table in the document, wherein each table bounding box has a first edge and a second edge;
the expansion module for expanding the first edge in a first direction to one of a line that has a single word cluster and a line that has been previously marked with a text density tag and for expanding the second edge in a second direction to one of a line that has a single word cluster and a line that has been previously marked with a text density tag.
b3) a conversion module coupled to the bounding box generation module for receiving the table bounding box for each table in the document, and for converting the words encompassed by the table bounding box into a markup language representation that maintains the table structure of each table.
Specification