Systems and methods for retrieving tabular data from textual sources

US 5,950,196 A
Filed: 07/25/1997
Issued: 09/07/1999
Est. Priority Date: 07/25/1997
Status: Expired due to Fees

Abstract

Tables form an important kind of data element in text retrieval. Often, the gist of an entire news article or other exposition can be concisely captured in tabular form. Information other than the key words in a digital document can be exploited to provide the users with more flexible and powerful query capabilities. More specifically, the structural information in a document is exploited to identify tables and their component fields and let the users query based on these fields. Component fields can include table lines, caption lines, row headings, column headings, or other table components. Empirical results have demonstrated that heuristic method based table extraction and component tagging can be performed effectively and efficiently. Moreover, experiments in retrieval using the system of the present invention strongly indicate that such structural decomposition can facilitate better representation of user's information needs and hence more effective retrieval of tables.

20 Claims

1. A method for identifying tables and their component fields, the tables embedded in a text document, the method comprising the steps of:
- (a) storing in a memory element a character alignment graph, the graph indicating the number of text characters appearing in a particular horizontal location for each of a predetermined number of contiguous lines of text in the text document;
  
  (b) identifying one of the predetermined number of contiguous lines as belonging to a table when the indication of the number of text characters appearing in a particular horizontal location for that predetermined number of contiguous lines fall below a predetermined threshold;
  
  (c) forming an extracted table from all of the identified predetermined numbers of contiguous lines; and
  
  (d) identifying one or more captions for the extracted table on the basis of structural patterns contained in the extracted table.
- Dependent claims (2, 3, 4, 5, 6, 8, 9):
  - 2. The method of claim 1 wherein step (d) further comprises identifying lines of the extracted table as table lines, and not caption lines, by examining an extracted table line for one or more large gaps present in the middle of the extracted table line.
  - 3. The method of claim 1 wherein step (d) further comprises identifying two lines of the extracted table as table lines, and not caption lines, by comparing the alignment of the gap structure present in each line.
  - 4. The method of claim 1 wherein step (d) further comprises identifying a caption for the extracted table on the basis of context regularity.
  - 5. The method of claim 1 wherein step (d) further comprises identifying a caption for the extracted table by identifying an extracted table line having a different number of columns than the remainder of the extracted table lines.
  - 6. The method of claim 1 wherein step (d) further comprises identifying a caption for the extracted table by identifying an extracted table line having a different gap structure than the remainder of the extracted table lines.
  - 8. The method of claim 1 wherein step (d) further comprises the steps of:
    - (d-a) identifying lines of the extracted table as table lines, and not caption lines, by examining an extracted table line for one or more large gaps present in the middle of the extracted table line; and
      
      (d-b) identifying caption lines from the identified table lines by determining if an identified table line has a different gap structure than the remainder of the extracted table lines.
  - 9. The method of claim 1 wherein step (d) further comprises the steps of:
    - (d-a) identifying lines of the extracted table as table lines, and not caption lines, by examining an extracted table line for one or more large gaps present in the middle of the extracted table line; and
      
      (d-b) identifying table lines from the identified caption lines by determining if an identified caption line is immediately preceded by an identified table line which is itself immediately preceded by a second identified caption line and the identified caption line is immediately followed by an identified table line which is itself immediately followed by a third identified caption line.
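
Steps (a) through (c) of claim 1 can be illustrated with a minimal Python sketch. It is one possible reading of the claim language, not the patented implementation: the function names, the window of 3 contiguous lines, the threshold of 1, and the `min_gap` width are all assumptions chosen for illustration. The sketch counts, per horizontal position, how many lines in a window have a non-blank character there, and treats the window as tabular when an interior run of positions has a count below the threshold.

```python
from typing import List


def alignment_graph(lines: List[str]) -> List[int]:
    """For each horizontal position, count the lines with a non-blank character there."""
    width = max((len(line) for line in lines), default=0)
    counts = [0] * width
    for line in lines:
        for col, ch in enumerate(line):
            if not ch.isspace():
                counts[col] += 1
    return counts


def window_is_tabular(lines: List[str], threshold: int = 1, min_gap: int = 2) -> bool:
    """True if at least min_gap adjacent interior positions have counts below threshold."""
    if not lines:
        return False
    counts = alignment_graph(lines)
    # Assumption: only look at positions covered by every line, so indentation
    # and ragged right margins of ordinary prose are not mistaken for gaps.
    left = max(len(l) - len(l.lstrip()) for l in lines)
    right = min(len(l.rstrip()) for l in lines)
    run = 0
    for col in range(left, right):
        if counts[col] < threshold:
            run += 1
            if run >= min_gap:
                return True
        else:
            run = 0
    return False


def extract_table_lines(text: List[str], window: int = 3,
                        threshold: int = 1, min_gap: int = 2) -> List[str]:
    """Slide a window over the text and collect every line of each tabular window."""
    flagged = set()
    for start in range(len(text) - window + 1):
        chunk = text[start:start + window]
        if all(l.strip() for l in chunk) and window_is_tabular(chunk, threshold, min_gap):
            flagged.update(range(start, start + window))
    return [text[i] for i in sorted(flagged)]


def row(a: str, b: str, c: str) -> str:
    """Hypothetical helper that lays out three left-aligned columns."""
    return a.ljust(12) + b.ljust(9) + c


doc = [
    "Quarterly results were mixed this year.",
    row("Region", "Sales", "Growth"),
    row("North", "1200", "4.5"),
    row("South", "950", "2.1"),
    "Analysts expect a stronger second half.",
]
print("\n".join(extract_table_lines(doc)))
```

On this small input only the three aligned rows are collected. Real text would need tuning of the window size and thresholds, and right-aligned or tab-separated columns are not handled here.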

7. A method for facilitating data retrieval from tables embedded in a text document, the method comprising the steps of:
- (a) storing in a memory element a character alignment graph, the graph indicating the number of text characters appearing in a particular horizontal location for each of a predetermined number of contiguous lines of text in the text document;
  
  (b) identifying one of the predetermined number of contiguous lines as belonging to a table when the indication of the number of text characters appearing in a particular horizontal location for that predetermined number of contiguous lines fall below a predetermined threshold;
  
  (c) forming an extracted table from all of the identified predetermined numbers of contiguous lines;
  
  (d) identifying one or more components of the extracted table on the basis of structural patterns contained in the extracted table;
  
  (e) indexing the extracted table on the basis of the component tags in order to allow database queries to be performed on the extracted table.
- Dependent claims (10, 11, 12, 13):
  - 10. The method of claim 7 wherein step (d) further comprises identifying one or more column headings of the extracted table on the basis of structural patterns contained in the extracted table.
  - 11. The method of claim 7 wherein step (d) further comprises identifying one or more row headings of the extracted table on the basis of structural patterns contained in the extracted table.
  - 12. The method of claim 7 wherein step (d) further comprises identifying one or more table lines of the extracted table on the basis of structural patterns contained in the extracted table.
  - 13. The method of claim 7 wherein step (d) further comprises identifying one or more caption lines of the extracted table on the basis of structural patterns contained in the extracted table.
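
Step (e) of claim 7, indexing on component tags so that queries can target specific parts of a table, might look roughly like the following. This is a hedged sketch: the inverted-index layout, the tag names `caption` and `table`, and the functions `build_index` and `search` are assumptions for illustration and do not come from the patent.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

TaggedLine = Tuple[str, str]  # (component tag, line text)


def build_index(tables: Dict[str, List[TaggedLine]]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each (component tag, term) pair to the ids of the tables containing it."""
    index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for table_id, tagged_lines in tables.items():
        for tag, text in tagged_lines:
            for token in text.lower().split():
                term = token.strip(".,:;()")
                if term:
                    index[(tag, term)].add(table_id)
    return index


def search(index: Dict[Tuple[str, str], Set[str]], **field_terms: str) -> List[str]:
    """Return ids of tables that satisfy every component=term constraint."""
    result = None
    for tag, term in field_terms.items():
        hits = index.get((tag, term.lower()), set())
        result = set(hits) if result is None else result & hits
    return sorted(result or [])


tables = {
    "t1": [("caption", "Regional sales, 1996"), ("table", "North 1200 4.5")],
    "t2": [("caption", "Sales growth by region"), ("table", "North 3.2")],
}
index = build_index(tables)
print(search(index, caption="sales", table="1200"))
```

Keying the index on the tag as well as the term is what lets a query distinguish a word in a caption from the same word in a table body.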

14. A system for performing data queries on tables embedded in text documents, the system comprising:
- a table extractor which retrieves a text document from a memory element and processes the text document to identify at least one table embedded in the text document;
  
  a component tagger which separates lines of the table identified by said table extractor into caption lines and table lines; and
  
  an indexing unit which indexes the tagged table so that a data query may be applied to the table.
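
The three units of claim 14 can be pictured as a simple pipeline. The sketch below only shows the wiring: the class name `TableQuerySystem`, its method names, and the toy extractor and tagger are hypothetical stand-ins so the example runs on its own, and they are far cruder than the heuristics the claims describe.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

TaggedLine = Tuple[str, str]  # (tag, text); tag is "caption" or "table"


@dataclass
class TableQuerySystem:
    extractor: Callable[[List[str]], List[List[str]]]  # document lines -> tables
    tagger: Callable[[List[str]], List[TaggedLine]]    # table lines -> tagged lines
    index: Dict[Tuple[str, str], Set[str]] = field(default_factory=dict)

    def add_document(self, doc_id: str, lines: List[str]) -> None:
        """Extract tables, tag their lines, and index terms under each tag."""
        for n, table in enumerate(self.extractor(lines)):
            table_id = f"{doc_id}#{n}"
            for tag, text in self.tagger(table):
                for term in text.lower().split():
                    self.index.setdefault((tag, term), set()).add(table_id)

    def query(self, tag: str, term: str) -> List[str]:
        """Return ids of tables whose lines with the given tag contain the term."""
        return sorted(self.index.get((tag, term.lower()), set()))


def toy_extractor(lines: List[str]) -> List[List[str]]:
    # Toy assumption: any line with a double space belongs to a single table.
    block = [l for l in lines if "  " in l.strip()]
    return [block] if block else []


def toy_tagger(table: List[str]) -> List[TaggedLine]:
    # Toy assumption: the first line is a caption, the rest are table lines.
    return [("caption" if i == 0 else "table", l) for i, l in enumerate(table)]


system = TableQuerySystem(toy_extractor, toy_tagger)
system.add_document("doc1", ["Some prose.", "Region  Sales", "North   1200"])
print(system.query("caption", "sales"))
```

Passing the extractor and tagger in as callables keeps the three units separable, which mirrors how the claim lists them as distinct elements.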

15. A method for facilitating data retrieval from tables embedded in a text document, the method comprising the steps of:
- (a) identifying one or more components of the table on the basis of structural patterns contained in the table;
  
  (b) indexing the table on the basis of the component tags in order to allow database queries to be performed on the table.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The method of claim 15 wherein step (a) further comprises identifying one or more column headings of the table on the basis of structural patterns contained in the table.
  - 17. The method of claim 15 wherein step (a) further comprises identifying one or more row headings of the table on the basis of structural patterns contained in the table.
  - 18. The method of claim 15 wherein step (a) further comprises identifying one or more table lines of the table on the basis of structural patterns contained in the table.
  - 19. The method of claim 15 wherein step (a) further comprises identifying one or more caption lines of the table on the basis of structural patterns contained in the table.
  - 20. The method of claim 15 wherein step (a) further comprises the steps of:
    - (a-a) identifying lines of the table as table lines, and not caption lines, by examining a table line for one or more large gaps present in the middle of the table line; and
      
      (a-b) identifying caption lines from the identified table lines by determining if an identified table line has a different gap structure than the remainder of the table lines.
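
The two-step tagging in claim 20 (and the parallel claim 8) can be sketched as follows. This is an illustrative reading only: treating a "large gap" as two or more consecutive spaces, representing a line's "gap structure" as the columns where its gaps end, and comparing each line against the most common structure are all assumptions, as are the function names.

```python
import re
from collections import Counter
from typing import List, Tuple


def gap_structure(line: str, min_gap: int = 2) -> Tuple[int, ...]:
    """Columns at which each interior run of at least min_gap spaces ends."""
    pattern = re.compile(r"(?<=\S) {%d,}(?=\S)" % min_gap)
    return tuple(m.end() for m in pattern.finditer(line))


def tag_lines(lines: List[str], min_gap: int = 2) -> List[str]:
    """Tag each line as 'table' or 'caption' from its gap structure."""
    structures = [gap_structure(l, min_gap) for l in lines]
    # Step (a-a): lines with at least one large interior gap are candidate table lines.
    candidates = [s for s in structures if s]
    dominant = Counter(candidates).most_common(1)[0][0] if candidates else None
    # Step (a-b): a candidate whose gap structure differs from the rest is a caption.
    return ["table" if s and s == dominant else "caption" for s in structures]


def row(a: str, b: str, c: str) -> str:
    """Hypothetical helper that lays out three left-aligned columns."""
    return a.ljust(12) + b.ljust(9) + c


extracted = [
    "Table 1:  Regional sales",
    row("Region", "Sales", "Growth"),
    row("North", "1200", "4.5"),
    row("South", "950", "2.1"),
    "Source: company filings",
]
for tag, line in zip(tag_lines(extracted), extracted):
    print(f"{tag:8}{line}")
```

Exact matching of gap positions is brittle for right-aligned or ragged columns; a tolerance on the positions would be a natural refinement.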

Current Assignee
Open TEXT SA (Open Text Corporation)
Original Assignee
Sovereign Hill Software, Inc. (LeadingSide, Inc.)
Inventors
Pyreddy, Pallavi; Croft, W. Bruce
Primary Examiner(s)
Amsbury, Wayne
Assistant Examiner(s)
HAVAN, THU THAO

Application Number

US08/901,234
Time in Patent Office

774 Days
Field of Search

707/6, 707/9, 707/102, 707/5, 707/2, 707/4, 707/104, 706/45, 706/47, 706/60, 345/333, 345/326, 455/4
US Class Current

715/227
CPC Class Codes

G06F 16/81   Indexing, e.g. XML tags; Da...

G06V 30/416   Extracting the logical stru...

Y10S 707/917   Text

Y10S 707/99934   Query formulation, input pr...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99936   Pattern matching access

Y10S 707/99945   Object-oriented database st...
