Automatic table detection method and system

US 6,757,870 B1
Filed: 03/22/2000
Issued: 06/29/2004
Est. Priority Date: 03/22/2000
Status: Expired due to Fees

Abstract

A method for automatically detecting table data in a document that is described by a page definition language and converting the table data into a markup language representation. The document may have one or more pages. The page definition language description of the document provides a list of words, the position of each word on a page with respect to a predetermined reference point, and the size of each word. The present invention automatically identifies table data in the document by utilizing one or more table-identifying features. A first table-identifying feature may be the number of word clusters on a line. A second table-identifying feature may be the vertical alignment of word clusters between lines. A third table-identifying feature may be the changes in text density or space density between lines.
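
The first two features can be illustrated with a minimal Python sketch. It is not taken from the patent: the `Word` record, the `max_gap` threshold, and the overlap test used for alignment are all assumptions chosen for illustration, since the abstract does not say how words are clustered or how alignment is measured.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float      # left edge, measured from the page reference point (assumed unit: points)
    width: float  # horizontal size of the word


def cluster_words(line, max_gap=12.0):
    """Group the words of one line into clusters; a gap wider than max_gap starts a new cluster.

    max_gap is a hypothetical tuning parameter, not a value from the patent.
    """
    clusters = []
    for word in sorted(line, key=lambda w: w.x):
        if clusters:
            last = clusters[-1][-1]
            if word.x - (last.x + last.width) <= max_gap:
                clusters[-1].append(word)
                continue
        clusters.append([word])
    return clusters


def cluster_span(cluster):
    """Horizontal extent (left, right) of a cluster."""
    return (cluster[0].x, cluster[-1].x + cluster[-1].width)


def aligned(span_a, span_b):
    """Treat two clusters on different lines as vertically aligned if their spans overlap."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]


prose = [Word("The", 10, 18), Word("quick", 32, 28), Word("fox", 64, 16)]
header = [Word("Item", 10, 22), Word("Qty", 120, 18), Word("Price", 220, 26)]
row = [Word("Pen", 10, 18), Word("2", 124, 6), Word("1.50", 222, 22)]

prose_clusters = cluster_words(prose)
header_clusters = cluster_words(header)
row_clusters = cluster_words(row)
columns_line_up = all(
    aligned(cluster_span(a), cluster_span(b))
    for a, b in zip(header_clusters, row_clusters)
)
```

In this sketch the prose line collapses into a single cluster, while the two table lines each yield three clusters whose spans overlap column by column, which is the kind of signal the first two features describe.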

11 Claims

1. A computer-implemented method of identifying table data in a document comprising the steps of:
- a) receiving a page description language representation of the document for providing a list of words in the document and position information for the words; and
  
  b) automatically identifying table data in the document based on the page description language representation of the document and at least one table identifying feature, wherein the step of identifying includes the steps of,
  
  b1) automatically determining a table bounding box for each table in the document, wherein the table bounding box includes a top edge and a bottom edge;
  
  b2) expanding each table bounding box based on a text density feature, wherein the expanding step includes the steps of,
  
  b2_1) for each line determining a text density measure;
  
  b2_2) for each line determining a change of text density between the current line and the previous line;
  
  b2_3) if the change in text density reaches a predetermined threshold, marking the current line with a text density tag;
  
  b2_4) expanding the top edge of the table bounding box in a first direction to one of a line previously marked by a text density tag and a line with a single word cluster; and
  
  b2_5) expanding the bottom edge of the table bounding box in a second direction to one of a line previously marked by a text density tag and a line with a single word cluster; and
  
  b3) converting the table data encompassed by each table bounding box to a markup language representation.
2. The method of claim 1 wherein the step of automatically identifying table data in the document based on the page description language representation of the document and at least one table identifying feature further comprises:
3. The method of claim 2 wherein the step of automatically identifying table data in the document based on the number of word clusters of each line and the alignment of the word clusters between lines further comprises:
- b4_1) using the word clusters to generate column position information; and
  
  b4_2) updating the column position information by performing a union operation between the column position information of the previous line and the column position information of the current line.
4. The method of claim 1 wherein receiving a page description language representation of the document for providing a list of words in the document and position information for the words includes receiving a PDF representation of the document, and wherein converting the table data encompassed by each table bounding box to a markup language representation includes converting the table data encompassed by each table bounding box to a HTML representation.
5. The method of claim 1 wherein the step of automatically identifying table data in the document based on the page description language representation of the document and at least one table identifying feature further comprises:
- b1) automatically identifying table data in the document based on one or more table headings.
6. The method of claim 1 wherein the step of automatically identifying table data in the document based on the page description language representation of the document and at least one table identifying feature further comprises:
- b1) automatically identifying table data in the document based on one or more horizontal lines and vertical lines that separate rows or columns of the table.
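
The text density steps of claim 1 (b2_1 through b2_5) can be sketched roughly as follows. This is only an illustration under stated assumptions: the claim does not define the density measure, so `text_density` assumes it is the fraction of the page width covered by words; the `threshold` value is arbitrary; and `expand_edge` assumes that lines are indexed top to bottom, that the search starts at the line adjacent to the edge, and that the edge stays put when no stopping line is found.

```python
def text_density(word_spans, page_width):
    """Assumed measure: share of the page width covered by word extents on one line."""
    covered = sum(end - start for start, end in word_spans)
    return covered / page_width


def tag_density_changes(densities, threshold=0.25):
    """Mark each line whose density differs from the previous line by at least threshold."""
    tags = [False] * len(densities)
    for i in range(1, len(densities)):
        if abs(densities[i] - densities[i - 1]) >= threshold:
            tags[i] = True
    return tags


def expand_edge(edge, step, tags, cluster_counts):
    """Move an edge (a line index) in direction step (-1 up, +1 down).

    The edge moves to the nearest line that carries a text density tag or
    holds a single word cluster; it is returned unchanged if none exists.
    """
    i = edge + step
    while 0 <= i < len(tags):
        if tags[i] or cluster_counts[i] == 1:
            return i
        i += step
    return edge


# Hypothetical page of eight lines: prose, a table in lines 2-5, then prose.
densities = [0.8, 0.8, 0.3, 0.35, 0.3, 0.3, 0.8, 0.8]
cluster_counts = [1, 1, 2, 3, 3, 3, 1, 1]

tags = tag_density_changes(densities)
top = expand_edge(3, -1, tags, cluster_counts)     # initial box assumed to span lines 3-5
bottom = expand_edge(5, +1, tags, cluster_counts)
```

Here the initial box covers only the three-cluster lines; the expansion pulls the top edge up to the tagged two-cluster line and pushes the bottom edge down to the first tagged line below. Whether the stopping line itself belongs to the table is left open by this sketch.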

7. A computer-readable medium having stored thereon sequences of instructions, said sequences of instructions including instructions which, when executed by a processor, cause said processor to perform the steps of:
- a) receiving a page description language representation of a document for providing a list of words in the document and position information for the words; and
  
  b) automatically identifying table data in the document based on the page description language representation of the document and at least one table identifying feature, wherein the identifying step includes the steps of,
  
  b1) dividing the document into one or more pages;
  
  b2) dividing each page into a plurality of lines, wherein the dividing step includes the steps of,
  
  b2_1) for each line determining a text density measure;
  
  b2_2) for each line determining a change of text density between the current line and the previous line;
  
  b2_3) if the change in text density reaches a predetermined threshold, marking the current line with a text density tag;
  
  b2_4) expanding the top edge of the table bounding box in a first direction to one of a line previously marked by a text density tag and a line with a single word cluster; and
  
  b2_5) expanding the bottom edge of the table bounding box in a second direction to one of a line previously marked by a text density tag and a line with a single word cluster;
  
  b3) for each line, clustering the words of the line into one or more word clusters; and
  
  b4) automatically identifying table data in the document based on the number of word clusters for each line and the alignment of the word clusters between lines.
8. The computer-readable medium of claim 7 further containing instructions which, when executed by said processor, would cause said processor to perform the steps of:
9. The computer-readable medium of claim 7 further containing instructions which, when executed by said processor, would cause said processor to perform the steps of:
- b1) automatically determining a table bounding box for each table in the document;
  
  b2) expanding each table bounding box based on a text density feature; and
  
  b3) converting the table data encompassed by each table bounding box to a markup language representation.
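
The conversion step (b3) names only "a markup language representation", with claim 4 giving HTML as one case. A minimal sketch of such a conversion, assuming the table data has already been split into rows of cell strings (the `rows` structure and the `table_to_html` helper are hypothetical, not part of the claims):

```python
from html import escape


def table_to_html(rows):
    """Render rows of cell strings as a plain HTML table, escaping special characters."""
    lines = ["<table>"]
    for row in rows:
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


rows = [["Item", "Qty", "Price"], ["Pen & ink", "2", "1.50"]]
html_table = table_to_html(rows)
```

Escaping each cell with `html.escape` keeps characters such as `&` and `<` in the source text from being read as markup in the output.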

10. A document processing system comprising:
- a) a processor for executing programs; and
  
  b) a table identification program for receiving a page description language representation of a document, the page description language representation providing a list of words in the document and position information for the words, and for automatically identifying table data in the document based on the page description representation of the document and at least one table identifying feature, wherein the table identification program includes,
  
  b1) a bounding box generation module for receiving the list of words and for automatically generating a table bounding box for each table in the document based on the number of word clusters in each line; and
  
  b2) an expansion module coupled to the bounding box generation module for receiving the table bounding box for each table in the document, wherein each table bounding box has a first edge and a second edge;
  
  the expansion module for expanding the first edge in a first direction to one of a line that has a single word cluster and a line that has been previously marked with a text density tag and for expanding the second edge in a second direction to one of a line that has a single word cluster and a line that has been previously marked with a text density tag.
11. The document processing system of claim 10 wherein the table identification program further comprises:

Current Assignee: Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Stinger, James R.
Primary Examiner(s): Feild, Joseph H.
Assistant Examiner(s): Nguyen, Dang T
Application Number: US09/532,538
Time in Patent Office: 1,560 Days
Field of Search: 715/523, 715/513, 715/522, 707/104, 707/508
US Class Current: 715/234
CPC Class Codes:
G06F 40/143   Markup, e.g. Standard Gener...
G06F 40/151   Transformation
G06F 40/166   Editing, e.g. inserting or ...
G06F 40/177   of tables; using ruled lines
