Analysis and transformation tools for structured and unstructured data
Abstract
A system and method of making unstructured data available to structured data analysis tools. The system includes middleware software that can be used in combination with structured data tools to perform analysis on both structured and unstructured data. Data can be read from a wide variety of unstructured sources. The data may then be transformed with commercial data transformation products that may, for example, extract individual pieces of data and determine relationships between the extracted data. The transformed data and relationships may then be passed through an extraction/transform/load (ETL) layer and placed in a structured schema. The structured schema may then be made available to commercial or proprietary structured data analysis tools.
22 Claims
1. A section extractor comprising:
code that looks for specific document headers;
code that extracts the specific document headers;
code that stores the specific document header in a schema; and
code that extracts and stores a specific section of a document or a series of specific sections from a document in a schema.
Dependent claims: 2, 3, 4, 5, 6
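The four elements of claim 1 can be illustrated with a minimal Python sketch. It is not taken from the specification: the header pattern `HEADER_RE`, the `section` table layout, and the function names are assumptions made for illustration, and an in-memory SQLite table stands in for the schema.

```python
import re
import sqlite3

# Hypothetical header pattern: an all-caps line such as "ABSTRACT",
# or a numbered line such as "Item 1. Business".
HEADER_RE = re.compile(r"^(?:Item \d+\.\s+.+|[A-Z][A-Z ]{3,})$", re.MULTILINE)


def extract_sections(text, wanted):
    """Find headers, then return (header, body) pairs for the wanted ones."""
    matches = list(HEADER_RE.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        header = match.group(0).strip()
        # A section runs from the end of its header to the next header.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        if header in wanted:
            sections.append((header, text[match.end():end].strip()))
    return sections


def store_sections(conn, doc_id, sections):
    """Store each header and its section body in a simple table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS section (doc_id TEXT, header TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO section VALUES (?, ?, ?)",
        [(doc_id, header, body) for header, body in sections],
    )
    conn.commit()


doc = (
    "ABSTRACT\nA short summary.\n"
    "RISK FACTORS\nMarkets may move.\n"
    "SIGNATURES\nJ. Doe\n"
)
conn = sqlite3.connect(":memory:")
store_sections(conn, "doc-1", extract_sections(doc, {"ABSTRACT", "RISK FACTORS"}))
rows = conn.execute("SELECT header, body FROM section ORDER BY rowid").fetchall()
```

Here `rows` holds the two requested sections keyed by their headers, while the unrequested "SIGNATURES" section is skipped.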
7. A proximity transformer comprising:
code that looks for a first group of predetermined entities or relationship entries in an analysis schema; and
code that looks for the closest instance of a second predetermined entity for each matching entity or relationship entry in the first group of predetermined entities or relationship entries.
Dependent claims: 8, 9
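A minimal sketch of the proximity lookup in claim 7 follows. The claim does not define "closest", so this example assumes it means the smallest character-offset distance within the same source document. The `Entity` record and the `closest_pairs` function are hypothetical names, not part of the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    kind: str     # entity type, e.g. "company" or "date" (illustrative)
    text: str
    doc_id: str
    offset: int   # character offset in the source document (assumed measure)


def closest_pairs(entities, first_kinds, second_kind):
    """Pair each first-group entity with the nearest second-kind entity."""
    firsts = [e for e in entities if e.kind in first_kinds]
    seconds = [e for e in entities if e.kind == second_kind]
    pairs = []
    for first in firsts:
        candidates = [s for s in seconds if s.doc_id == first.doc_id]
        if candidates:
            nearest = min(candidates, key=lambda s: abs(s.offset - first.offset))
            pairs.append((first, nearest))
    return pairs


entities = [
    Entity("company", "Acme", "doc-1", 10),
    Entity("date", "2001", "doc-1", 40),
    Entity("date", "2003", "doc-1", 180),
    Entity("company", "Globex", "doc-1", 200),
]
pairs = closest_pairs(entities, {"company"}, "date")
paired_text = [(first.text, second.text) for first, second in pairs]
```

With these sample offsets, "Acme" is paired with "2001" and "Globex" with "2003". A production version would likely read the entities from the analysis schema and might measure distance in tokens or sentences instead.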
10. A table parser comprising:
code to identify a table in a source document, the code determining the columns and rows according to the amount of whitespace between characters or by reading HTML tags;
code to extract column headers, row headers, data points, and order of magnitude indicators; and
code to convert the table to structured rows, columns, cells, headers and order of magnitude multipliers, wherein the table parser can adapt dynamically to different formats and to a plurality of combinations of columns and rows.
Dependent claims: 11, 12, 13, 14, 15
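The whitespace-based branch of claim 10 can be sketched as below; the HTML-tag branch is not shown. This is an illustration under stated assumptions rather than the patented parser: columns are assumed to be separated by two or more spaces, the order of magnitude indicator is assumed to look like "(in thousands)", and `parse_table` and `MAGNITUDES` are hypothetical names.

```python
import re

# Assumed wording of order of magnitude indicators and their multipliers.
MAGNITUDES = {"thousands": 1_000, "millions": 1_000_000, "billions": 1_000_000_000}
MAGNITUDE_RE = re.compile(r"\(in (thousands|millions|billions)\)", re.IGNORECASE)


def parse_table(lines):
    """Convert whitespace-aligned text lines into structured cells."""
    multiplier = 1
    rows = []
    for line in lines:
        indicator = MAGNITUDE_RE.search(line)
        if indicator:
            multiplier = MAGNITUDES[indicator.group(1).lower()]
            continue
        # Two or more whitespace characters mark a column boundary.
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) > 1:
            rows.append(cells)
    header, body = rows[0], rows[1:]
    parsed = []
    for row in body:
        row_header, values = row[0], row[1:]
        # Align column headers with the data columns from the right.
        col_headers = header[-len(values):]
        for column, raw in zip(col_headers, values):
            value = float(raw.replace(",", "")) * multiplier
            parsed.append({"row": row_header, "column": column, "value": value})
    return parsed


table_lines = [
    "(in thousands)",
    "              2001    2002",
    "Revenue       1,200   1,500",
    "Net income      300     450",
]
cells = parse_table(table_lines)
```

Each entry in `cells` carries its row header, column header, and the value scaled by the detected multiplier, so "1,200" under "(in thousands)" becomes 1200000.0.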
16. A confidence analysis routine comprising:
code adapted to calculate a weighted confidence score for a data element, the code weighing
(i) a confidence score provided by a transformation tool used to generate the data element if provided by the transformation tool;
(ii) the number of relationships found in the source document per size of the source document, compared to the average number of relationships found per kilobyte or other size measure of a document;
(iii) the number of entities found to be associated with the relationship, compared to the average number of entities for relationships in the same hierarchy;
(iv) the number of times similar relationships have been found in the past;
(v) the number of entities that are grouped together to form a master entity;
(vi) the number of times the entity occurs in the document compared to the average number of occurrences for entities in the same hierarchy;
(vii) weighted confidences based on hierarchy of relationship or entity.
Dependent claims: 17
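Claim 16 lists the factors to be weighed but not how they are combined. The sketch below assumes one plausible scheme: each factor is first mapped to a score in [0, 1], and the result is the weighted mean of whichever factors are present, so a missing tool-provided score (factor i) is simply left out. The function names, factor keys, weights, and the `cap` parameter are all hypothetical.

```python
def ratio_score(observed, average, cap=2.0):
    """Map observed/average into [0, 1]; a ratio of `cap` or more scores 1."""
    if average <= 0:
        return 0.0
    return min(observed / average, cap) / cap


def weighted_confidence(factors, weights):
    """Weighted mean of the factor scores that are present (not None)."""
    present = {name: score for name, score in factors.items() if score is not None}
    total_weight = sum(weights[name] for name in present)
    if total_weight == 0:
        return 0.0
    return sum(weights[name] * score for name, score in present.items()) / total_weight


# Illustrative weights only; the claim does not specify any values.
weights = {"tool_score": 3, "relationship_density": 1, "entity_count": 1}

factors = {
    "tool_score": None,                            # (i) tool gave no score
    "relationship_density": ratio_score(3, 3),     # (ii) at the average
    "entity_count": ratio_score(4, 2),             # (iii) twice the average
}
score_without_tool = weighted_confidence(factors, weights)
score_with_tool = weighted_confidence({**factors, "tool_score": 0.9}, weights)
```

The remaining factors (iv) through (vii) would enter the same way, as additional keys in `factors` and `weights`.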
18. A search module comprising:
code to index data in an analysis schema, the index generated by creating data dump reports using a reporting tool that create a list of each entity, topic, or relationship discussed in a document along with a link back to the source document;
or code to periodically and/or automatically run analytical reports to be included in an indexing process;
or code to index metadata contained in a definition of a dimensional model of the analysis schema, definitions of facts, definitions of metrics, definitions of measures, data contained within the dimensions and measures.
Dependent claims: 19, 20, 21, 22
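The first alternative of claim 18, indexing from a data dump that lists each entity, topic, or relationship with a link back to its source document, can be sketched as a small inverted index. The dump is assumed here to be a list of (name, link) pairs; that format, the sample paths, and the function names are assumptions for illustration, and no particular reporting tool is modeled.

```python
from collections import defaultdict


def build_index(dump_rows):
    """Map each lowercase token of an entry name to its source links."""
    index = defaultdict(set)
    for name, link in dump_rows:
        for token in name.lower().split():
            index[token].add(link)
    return index


def search(index, query):
    """Return links for documents whose entries contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical data dump rows: (entity or topic name, link to source document).
dump_rows = [
    ("Acme Corp", "docs/a.txt"),
    ("Acme acquisition", "docs/b.txt"),
    ("Globex", "docs/b.txt"),
]
index = build_index(dump_rows)
acme_hits = search(index, "acme")
```

A query for "acme" returns both source links, while "acme corp" narrows the result to the first document only.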
Specification