Text based schema discovery and information extraction
First Claim
1. A method for creating a database schema from unstructured documents comprising the steps of:
accessing an unstructured document;
extracting information from the unstructured document using text mining, the extracted information comprising terms, phrases and sentiments;
analyzing the extracted information to identify sections of the unstructured document;
storing statistics regarding an occurrence of items in the unstructured document, the items comprising the extracted information and identified sections;
repeating the accessing, extracting, analyzing, and storing steps for a plurality of unstructured documents;
creating a probabilistic model based on the statistics stored for the plurality of unstructured documents;
generating a database schema using the probabilistic model;
receiving user modifications to the probabilistic model;
updating the probabilistic model based upon the user modifications; and
generating a database based on the database schema generated using the probabilistic model.
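The steps recited above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: a "section" is assumed to be any line of the form "Label: content", the probabilistic model is assumed to be a simple relative frequency (the share of documents in which a section label occurs), and phrase and sentiment extraction are omitted. All function names, the 0.5 threshold, and the sample documents are hypothetical.

```python
import re
import sqlite3
from collections import Counter

def extract_items(text):
    """Return (sections, terms) for one document.

    Assumption: a section is a line of the form 'Label: content'.
    """
    sections = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Za-z][A-Za-z ]*):\s*(.+)", line)
        if match:
            sections[match.group(1).strip().lower()] = match.group(2).strip()
    terms = re.findall(r"[a-z]+", text.lower())
    return sections, terms

def collect_statistics(documents):
    """Count, for each item, the number of documents it occurs in."""
    section_counts, term_counts = Counter(), Counter()
    for text in documents:
        sections, terms = extract_items(text)
        section_counts.update(sections.keys())
        term_counts.update(set(terms))
    return section_counts, term_counts

def build_model(section_counts, n_docs):
    """Estimate P(section appears in a document) by relative frequency."""
    return {name: count / n_docs for name, count in section_counts.items()}

def apply_user_modifications(model, overrides):
    """Return a copy of the model with user-supplied probabilities applied."""
    updated = dict(model)
    updated.update(overrides)
    return updated

def generate_schema(model, threshold=0.5):
    """Keep sections whose probability meets the threshold as column names."""
    return sorted(name.replace(" ", "_") for name, p in model.items() if p >= threshold)

def create_database(columns, table="documents"):
    """Create an in-memory SQLite database with one TEXT column per section."""
    conn = sqlite3.connect(":memory:")
    definitions = ["id INTEGER PRIMARY KEY"] + [f'"{c}" TEXT' for c in columns]
    conn.execute(f'CREATE TABLE "{table}" ({", ".join(definitions)})')
    return conn

# Hypothetical sample documents.
DOCS = [
    "Name: Ada\nSkills: math\nHobbies: chess",
    "Name: Alan\nSkills: logic",
    "Name: Grace\nSkills: compilers\nAwards: many",
]

section_counts, term_counts = collect_statistics(DOCS)
model = build_model(section_counts, len(DOCS))
# A user promotes a rare section so that it is kept in the schema.
model = apply_user_modifications(model, {"awards": 1.0})
schema = generate_schema(model)
conn = create_database(schema)
```

In this sketch the user modification is applied before the schema is generated, so a section that occurs in only one of the three sample documents can still become a column.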
Abstract
Various technologies and techniques are disclosed for text based schema discovery and information extraction. Documents are analyzed to identify sections of the documents and a relationship between the sections. Statistics are stored regarding occurrences of items in the documents. A probabilistic model is generated based on the stored statistics. A database schema is generated with a plurality of tables based upon the probabilistic model. The documents are analyzed against the probabilistic model to determine how the documents map to the tables generated from the database schema. The tables are populated from the documents based on a result of the analysis against the probabilistic model.
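The mapping and population steps described in the abstract can be illustrated with a small Python sketch. It assumes, hypothetically, that each generated table has its own model giving the probability that a section label appears in a document belonging to that table, and that a document is mapped to the table with the highest Bernoulli log-likelihood. The table names, probabilities, and "Label: content" section format are invented for illustration; the source does not specify the scoring rule.

```python
import math
import re
import sqlite3

# Hypothetical per-table models: P(section label appears | document belongs to table).
TABLE_MODELS = {
    "resumes": {"name": 0.9, "skills": 0.8, "total": 0.05},
    "invoices": {"name": 0.1, "skills": 0.05, "total": 0.9},
}

def extract_sections(text):
    """Assumption: each section is a line of the form 'Label: content'."""
    pairs = re.findall(r"^\s*([A-Za-z]+):\s*(.+)$", text, flags=re.MULTILINE)
    return {label.lower(): value.strip() for label, value in pairs}

def log_likelihood(sections, model):
    """Bernoulli log-likelihood of the observed section labels under one table model."""
    score = 0.0
    for label, p in model.items():
        score += math.log(p if label in sections else 1.0 - p)
    return score

def best_table(sections):
    """Pick the table whose model best explains the document's sections."""
    return max(TABLE_MODELS, key=lambda t: log_likelihood(sections, TABLE_MODELS[t]))

def populate(conn, documents):
    """Create one table per model, then insert each document into its best table."""
    for table, model in TABLE_MODELS.items():
        cols = ", ".join(f'"{c}" TEXT' for c in model)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
    for text in documents:
        sections = extract_sections(text)
        table = best_table(sections)
        columns = list(TABLE_MODELS[table])
        placeholders = ", ".join("?" for _ in columns)
        conn.execute(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            [sections.get(c) for c in columns],  # missing sections become NULL
        )

conn = sqlite3.connect(":memory:")
populate(conn, ["Name: Ada\nSkills: math", "Total: 120\nName: Acme"])
```

Sections that a document lacks are stored as NULL, so documents with partial structure can still be loaded into the table they most resemble.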
14 Claims
1. A method for creating a database schema from unstructured documents (steps as listed in full under First Claim above).
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computer storage medium having computer-executable instructions for causing a computer to perform steps comprising:
accessing an unstructured document;
extracting information from the unstructured document using text mining, the extracted information comprising terms, phrases and sentiments;
analyzing the extracted information to identify sections of the unstructured document;
storing statistics regarding an occurrence of items in the unstructured document, the items comprising the extracted information and identified sections;
repeating the accessing, extracting, analyzing, and storing steps for a plurality of unstructured documents;
creating a probabilistic model based on the statistics stored for the plurality of unstructured documents;
generating a database schema using the probabilistic model;
receiving user modifications to the probabilistic model;
updating the probabilistic model based upon the user modifications; and
generating a database based on the database schema generated using the probabilistic model.
Dependent claims: 13, 14
Specification