Systems and methods for retrieving tabular data from textual sources
Abstract
Tables form an important kind of data element in text retrieval. Often, the gist of an entire news article or other exposition can be concisely captured in tabular form. Information other than the key words in a digital document can be exploited to provide users with more flexible and powerful query capabilities. More specifically, the structural information in a document is exploited to identify tables and their component fields and let users query based on these fields. Component fields can include table lines, caption lines, row headings, column headings, or other table components. Empirical results have demonstrated that heuristic-method-based table extraction and component tagging can be performed effectively and efficiently. Moreover, experiments in retrieval using the system of the present invention strongly indicate that such structural decomposition can facilitate better representation of the user's information needs and hence more effective retrieval of tables.
20 Claims
1. A method for identifying tables and their component fields, the tables embedded in a text document, the method comprising the steps of:
(a) storing in a memory element a character alignment graph, the graph indicating the number of text characters appearing in a particular horizontal location for each of a predetermined number of contiguous lines of text in the text document;
(b) identifying one of the predetermined number of contiguous lines as belonging to a table when the indication of the number of text characters appearing in a particular horizontal location for that predetermined number of contiguous lines fall below a predetermined threshold;
(c) forming an extracted table from all of the identified predetermined numbers of contiguous lines; and
(d) identifying one or more captions for the extracted table on the basis of structural patterns contained in the extracted table.
Dependent claims: 2, 3, 4, 5, 6, 8, 9
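Steps (a) through (c) of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the window size of three lines, the threshold of one character, and the requirement that a low-count gap be at least two columns wide are all assumptions chosen for illustration. The sketch also assumes tabs have already been expanded to spaces.

```python
from typing import List


def alignment_graph(lines: List[str]) -> List[int]:
    """Count the non-blank characters at each horizontal position
    across a window of contiguous lines."""
    width = max((len(line) for line in lines), default=0)
    counts = [0] * width
    for line in lines:
        for col, ch in enumerate(line):
            if not ch.isspace():
                counts[col] += 1
    return counts


def has_gutter(counts: List[int], threshold: int = 1, min_width: int = 2) -> bool:
    """Return True when some interior run of at least `min_width` columns
    has a character count below `threshold` (min_width is an assumption,
    added so that a single shared space in prose does not count)."""
    occupied = [i for i, c in enumerate(counts) if c >= threshold]
    if not occupied:
        return False
    run = 0
    for c in counts[occupied[0]:occupied[-1] + 1]:
        run = run + 1 if c < threshold else 0
        if run >= min_width:
            return True
    return False


def table_line_flags(doc_lines: List[str], window: int = 3,
                     threshold: int = 1) -> List[bool]:
    """Slide a fixed-size window over the document and flag every line of
    a window whose alignment graph shows a gutter."""
    flags = [False] * len(doc_lines)
    for start in range(len(doc_lines) - window + 1):
        block = doc_lines[start:start + window]
        if all(line.strip() for line in block) and has_gutter(
                alignment_graph(block), threshold):
            for i in range(start, start + window):
                flags[i] = True
    return flags


def extract_table(doc_lines: List[str]) -> List[str]:
    """Form the extracted table from all flagged lines."""
    flags = table_line_flags(doc_lines)
    return [line for line, flagged in zip(doc_lines, flags) if flagged]
```

Calling `extract_table` on a list of document lines returns only the lines that fall inside at least one window with an aligned blank gap. Prose lines rarely share a multi-column gap with their neighbors, which is what makes the column counts a usable signal in this sketch.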
7. A method for facilitating data retrieval from tables embedded in a text document, the method comprising the steps of:
(a) storing in a memory element a character alignment graph, the graph indicating the number of text characters appearing in a particular horizontal location for each of a predetermined number of contiguous lines of text in the text document;
(b) identifying one of the predetermined number of contiguous lines as belonging to a table when the indication of the number of text characters appearing in a particular horizontal location for that predetermined number of contiguous lines fall below a predetermined threshold;
(c) forming an extracted table from all of the identified predetermined numbers of contiguous lines;
(d) identifying one or more components of the extracted table on the basis of structural patterns contained in the extracted table;
(e) indexing the extracted table on the basis of the component tags in order to allow database queries to be performed on the extracted table.
Dependent claims: 10, 11, 12, 13
14. A system for performing data queries on tables embedded in text documents, the system comprising:
a table extractor which retrieves a text document from a memory element and processes the text document to identify at least one table embedded in the text document;
a component tagger which separates lines of the table identified by said table extractor into caption lines and table lines; and
an indexing unit which indexes the tagged table so that a data query may be applied to the table.
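The component tagger of claim 14 separates the lines of an extracted table into caption lines and table lines. The claim does not say which structural patterns are used, so the Python sketch below substitutes one hypothetical rule: a line whose text splits into two or more cells on runs of two or more spaces is tagged as a table line, and any other line is tagged as a caption line. The names `CELL_GAP` and `tag_lines` are illustrative only.

```python
import re
from typing import List, Tuple

# Assumption: cells in a table line are separated by two or more spaces.
CELL_GAP = re.compile(r"\s{2,}")


def tag_lines(table_lines: List[str]) -> List[Tuple[str, str]]:
    """Tag each line of an extracted table as 'table' or 'caption'."""
    tagged = []
    for line in table_lines:
        cells = [c for c in CELL_GAP.split(line.strip()) if c]
        tag = "table" if len(cells) >= 2 else "caption"
        tagged.append((tag, line))
    return tagged
```

The output is a list of `(tag, line)` pairs, a shape that an indexing unit could consume directly. A fuller tagger would also need rules for row and column headings, which this sketch does not attempt.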
15. A method for facilitating data retrieval from tables embedded in a text document, the method comprising the steps of:
(a) identifying one or more components of the table on the basis of structural patterns contained in the table;
(b) indexing the table on the basis of the component tags in order to allow database queries to be performed on the table.
Dependent claims: 16, 17, 18, 19, 20
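Indexing on component tags, as recited in claim 15, can be pictured as an inverted index keyed by both the tag and the term, so that a query can be restricted to one kind of component. The Python sketch below is one possible reading, not the indexing scheme of the patent. The table identifiers, the whitespace tokenization, and the function names `build_index` and `query` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

TaggedTable = List[Tuple[str, str]]  # (component tag, line text) pairs


def build_index(tables: Dict[str, TaggedTable]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each (component tag, lowercase term) pair to the ids of the
    tables in which that term occurs inside that component."""
    index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for table_id, tagged_lines in tables.items():
        for tag, text in tagged_lines:
            for term in text.lower().split():
                index[(tag, term)].add(table_id)
    return index


def query(index: Dict[Tuple[str, str], Set[str]], tag: str, term: str) -> Set[str]:
    """Return the ids of tables whose `tag` component contains `term`."""
    return set(index.get((tag, term.lower()), set()))
```

Because the tag is part of the key, a search for a term in caption lines does not return tables that only mention the term in a data row. This matches the abstract's point that querying on component fields lets users express their information needs more precisely.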
Specification