Understanding tables for search
Abstract
The present invention extends to methods, systems, and computer program products for understanding tables for search. Aspects of the invention include identifying a subject column for a table, detecting a column header using other tables, and detecting a column header using a knowledge base. Implementations can be utilized in a structured data search system (SDSS) that indexes structured information, such as tables in a relational database or HTML tables extracted from web pages. The SDSS allows users to search over the structured information (tables) using different mechanisms, including keyword search and data finding data.
33 Claims
1. A method for detecting one or more subject columns of a table, the method comprising:
- selecting a specified number of columns from the table as candidate subject columns, each candidate subject column being a candidate for a true subject column of the table, each candidate subject column including a plurality of values;
- for each candidate subject column:
  - determining a co-occurrence for values in the candidate subject column, including determining how often values in the candidate subject column also occur in true subject columns in a plurality of other tables;
  - calculating a score for the candidate subject column based on the determined co-occurrence, the calculated score indicating a likelihood of the candidate subject column being a true subject column; and
  - classifying the candidate subject column as one of: a true subject column of the table or a non-subject column of the table based on the calculated score for the candidate subject column.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
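The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not say how candidates are selected, how co-occurrence is turned into a score, or what cutoff is used, so the choices here (take the first `k` columns, average the share of known subject columns that contain each value, compare against a fixed `threshold`) are assumptions made for illustration only. All function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A table is modeled as column name -> cell values (dict order = column order).
Table = Dict[str, List[str]]


def select_candidates(table: Table, k: int) -> List[str]:
    """Pick the first k columns as candidates (an assumed selection rule)."""
    return list(table)[:k]


def cooccurrence_score(values: List[str],
                       known_subject_columns: List[Set[str]]) -> float:
    """Average share of known subject columns that contain each value.

    known_subject_columns holds the (lower-cased) values of columns already
    accepted as true subject columns in other tables.
    """
    if not values or not known_subject_columns:
        return 0.0
    hits = 0
    for value in values:
        key = value.strip().lower()
        hits += sum(1 for column in known_subject_columns if key in column)
    return hits / (len(values) * len(known_subject_columns))


def classify_subject_columns(table: Table,
                             known_subject_columns: List[Set[str]],
                             k: int = 2,
                             threshold: float = 0.5) -> Dict[str, Tuple[float, bool]]:
    """Map each candidate column to (score, is_true_subject_column)."""
    result = {}
    for name in select_candidates(table, k):
        score = cooccurrence_score(table[name], known_subject_columns)
        # The fixed threshold is an assumption; the claim only requires that
        # classification be based on the calculated score.
        result[name] = (score, score >= threshold)
    return result


if __name__ == "__main__":
    table = {
        "Country": ["France", "Japan", "Brazil"],
        "Population": ["67M", "125M", "214M"],
        "Capital": ["Paris", "Tokyo", "Brasilia"],
    }
    known = [{"france", "japan", "germany"}, {"brazil", "japan", "france"}]
    print(classify_subject_columns(table, known))
```

In this toy data the "Country" values recur in the subject columns of other tables while the "Population" values never do, so only "Country" clears the threshold.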
9. At a computer system, a method for detecting a column header for a table including one or more rows, the method comprising:
- constructing a set of candidate column names for the table from data defining the table;
- for each candidate column name in the set of candidate column names:
  - calculating a candidate column name frequency for the candidate column name by identifying one or more other tables, from among a set of other tables, that also contain the candidate column name as a candidate column name; and
  - calculating a non-candidate column name frequency for the candidate column name by identifying a second one or more other tables, from among the set of other tables, that contain the candidate column name other than as a candidate column name; and
- selecting a row of the table as a column header when at least a specified threshold of candidate column names contained in the row have a candidate column name frequency that is greater than a non-candidate column name frequency.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
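The frequency comparison in claim 9 can be sketched as follows. This is a rough illustration under stated assumptions, not the claimed implementation: candidate column names are taken to be the cells of a table's first row, "other than as a candidate column name" is taken to mean "appears in any later row", and the threshold is expressed as a fraction of the row's names. The function names `name_frequencies` and `detect_header_row` are hypothetical.

```python
from typing import List, Optional, Tuple

Rows = List[List[str]]  # a table as a list of rows, each a list of cell strings


def norm(cell: str) -> str:
    """Normalize a cell for comparison across tables."""
    return cell.strip().lower()


def name_frequencies(name: str, other_tables: List[Rows]) -> Tuple[int, int]:
    """Return (candidate frequency, non-candidate frequency) for a name.

    Candidate frequency: number of other tables with the name in their first row.
    Non-candidate frequency: number of other tables with the name in a later row.
    """
    candidate = 0
    non_candidate = 0
    for rows in other_tables:
        if not rows:
            continue
        if name in {norm(c) for c in rows[0]}:
            candidate += 1
        if any(name == norm(c) for row in rows[1:] for c in row):
            non_candidate += 1
    return candidate, non_candidate


def detect_header_row(table: Rows, other_tables: List[Rows],
                      threshold: float = 0.5) -> Optional[int]:
    """Return 0 if the first row qualifies as a column header, else None."""
    if not table or not table[0]:
        return None
    names = [norm(c) for c in table[0]]
    wins = 0
    for name in names:
        candidate, non_candidate = name_frequencies(name, other_tables)
        if candidate > non_candidate:
            wins += 1
    # Assumed reading of "specified threshold": a minimum fraction of names.
    return 0 if wins / len(names) >= threshold else None


if __name__ == "__main__":
    others = [
        [["City", "Mayor"], ["Lyon", "X"]],
        [["Country", "Capital"], ["France", "Paris"]],
        [["Name", "City"], ["Alice", "Paris"]],
    ]
    print(detect_header_row([["City", "Country"], ["Paris", "France"]], others))
    print(detect_header_row([["Paris", "France"], ["Lyon", "France"]], others))
```

The second call shows the intended failure case: "Paris" and "France" appear in other tables only as body cells, so a first row made of those values is not selected as a header.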
17. At a computer system, a method for detecting a column header for a table including one or more rows, the method comprising:
- constructing a set of candidate column names for the table;
- for each candidate column name in the set of candidate column names:
  - calculating a candidate column name frequency for the candidate column name by identifying one or more other tables, from among a set of other tables, that also contain the candidate column name as a candidate column name;
- inferring that a column included in the set of candidate column names is a hypernym of the cell values contained in the column based on the cell values contained in the column; and
- selecting the row containing the column as a column header for the table based on the inference and the candidate column name frequencies for the candidate column names in the row.
Dependent claims: 18, 19, 20
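Claim 17 adds a hypernym inference to the frequency signal. The sketch below is one possible reading, with several assumptions: the knowledge base is a toy dictionary from an instance to its hypernyms, a name counts as a hypernym of its column when at least `min_support` of the cells below it list that name, and the row is selected when at least one column passes the hypernym test and at least one name reaches `min_ccnf`. The claim does not specify how the two signals are combined, so that rule, the parameters, and the function names are all hypothetical.

```python
from typing import Dict, List, Set

Rows = List[List[str]]


def is_hypernym_of_column(name: str, cell_values: List[str],
                          knowledge_base: Dict[str, Set[str]],
                          min_support: float = 0.5) -> bool:
    """True if enough cell values have `name` as a hypernym in the knowledge base."""
    if not cell_values:
        return False
    key = name.strip().lower()
    hits = sum(
        1 for value in cell_values
        if key in knowledge_base.get(value.strip().lower(), set())
    )
    return hits / len(cell_values) >= min_support


def select_header_by_hypernym(table: Rows,
                              knowledge_base: Dict[str, Set[str]],
                              ccnf: Dict[str, int],
                              min_ccnf: int = 1) -> bool:
    """Decide whether the first row should be treated as the column header.

    ccnf maps a lower-cased candidate column name to its candidate column
    name frequency, assumed to have been computed from other tables.
    """
    if len(table) < 2:
        return False
    header, body = table[0], table[1:]
    supported = any(
        is_hypernym_of_column(
            name, [row[i] for row in body if i < len(row)], knowledge_base
        )
        for i, name in enumerate(header)
    )
    frequent = any(ccnf.get(name.strip().lower(), 0) >= min_ccnf for name in header)
    # Assumed combination rule: both signals must agree.
    return supported and frequent


if __name__ == "__main__":
    kb = {"paris": {"city", "capital"}, "lyon": {"city"}, "france": {"country"}}
    ccnf = {"city": 2, "country": 1}
    with_header = [["City", "Country"], ["Paris", "France"], ["Lyon", "France"]]
    print(select_header_by_hypernym(with_header, kb, ccnf))
```

Requiring both signals is a conservative design choice for the sketch; a weighted combination would equally satisfy the wording "based on the inference and the candidate column name frequencies".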
21. A system, the system comprising:
- one or more processors;
- system memory coupled to the one or more hardware processors, the system memory storing instructions that are executable by the one or more hardware processors; and
- the one or more hardware processors executing the instructions stored in the system memory to detect one or more subject columns of a table, including the following:
  - select a specified number of columns from the table as candidate subject columns, each candidate subject column being a candidate for a true subject column of the table, each candidate subject column including a plurality of values;
  - for each subject candidate column:
    - determine a co-occurrence for values in the candidate subject column, including determining how often values in the candidate subject column also occur in true subject columns in a plurality of other tables; and
    - calculate a score for the candidate subject column based on the determined co-occurrence, the calculated score indicating a likelihood of the candidate subject column being a true subject column; and
  - classify the candidate subject column as one of: a true subject column of the table or a non-subject column of the table based on the calculated score for the candidate subject column.
Dependent claims: 22, 23, 24, 25
26. A system, the system comprising:
- one or more processors;
- system memory coupled to the one or more hardware processors, the system memory storing instructions that are executable by the one or more hardware processors; and
- the one or more hardware processors executing the instructions stored in the system memory to detect a column header for a table including one or more rows, including the following:
  - construct a set of candidate column names for the table from data defining the table;
  - for each candidate column name in the set of candidate column names:
    - calculate a candidate column name frequency for the candidate column name by identifying one or more other tables, from among a set of other tables, that also contain the candidate column name as a candidate column name; and
    - calculate a non-candidate column name frequency for the candidate column name by identifying a second one or more other tables, from among the set of other tables, that contain the candidate column name other than as a candidate column name; and
  - select a row of the table as a column header when at least a specified threshold of candidate column names contained in the row have a candidate column name frequency that is greater than a non-candidate column name frequency.
Dependent claims: 27, 28, 29
30. A system, the system comprising:
- one or more processors;
- system memory coupled to the one or more hardware processors, the system memory storing instructions that are executable by the one or more hardware processors; and
- the one or more hardware processors executing the instructions stored in the system memory to detect a column header for a table including one or more rows, including the following:
  - construct a set of candidate column names for the table;
  - for each candidate column name in the set of candidate column names:
    - calculate a candidate column name frequency for the candidate column name by identifying one or more other tables, from among a set of other tables, that also contain the candidate column name as a candidate column name;
  - infer that a column included in the set of candidate column names is a hypernym of the cell values contained in the column based on the cell values contained in the column; and
  - select the row containing the column as a column header for the table based on the inference and the candidate column name frequencies for the candidate column names in the row.
Dependent claims: 31, 32, 33
Specification