System and method of making unstructured data available to structured data analysis tools
Abstract
A system and method of making unstructured data available to structured data analysis tools. The system includes middleware software that can be used in combination with structured data tools to perform analysis on both structured and unstructured data. Data can be read from a wide variety of unstructured sources. The data may then be transformed with commercial data transformation products that may, for example, extract individual pieces of data and determine relationships between the extracted data. The transformed data and relationships may then be passed through an extraction/transform/load (ETL) layer and placed in a structured schema. The structured schema may then be made available to commercial or proprietary structured data analysis tools.
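The flow described in the abstract (read unstructured text, transform it into entities and relationships, load the results into a structured schema, then query with ordinary structured tools) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the regular-expression sentence splitter and the capitalized-word matcher merely stand in for the commercial transformation products the abstract mentions, co-occurrence in a sentence stands in for linguistic relationship extraction, and the table and function names are hypothetical.

```python
import re
import sqlite3
from itertools import combinations


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_entities(sentence):
    # Stand-in for noun-phrase extraction: runs of capitalized words.
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", sentence)


def extract_relationships(entities):
    # Stand-in for linguistic processing: pair entities sharing a sentence.
    return list(combinations(entities, 2))


def load(text, conn):
    # Hypothetical structured schema: one table per kind of extracted item.
    conn.execute("CREATE TABLE entity (sentence_id INTEGER, name TEXT)")
    conn.execute(
        "CREATE TABLE relationship "
        "(sentence_id INTEGER, left_name TEXT, right_name TEXT)"
    )
    for sentence_id, sentence in enumerate(split_sentences(text)):
        entities = extract_entities(sentence)
        conn.executemany(
            "INSERT INTO entity VALUES (?, ?)",
            [(sentence_id, name) for name in entities],
        )
        conn.executemany(
            "INSERT INTO relationship VALUES (?, ?, ?)",
            [(sentence_id, a, b) for a, b in extract_relationships(entities)],
        )
    conn.commit()
```

Once `load` has run against a `sqlite3` connection, any SQL-capable tool can treat the result as ordinary structured data, for example with `SELECT name, COUNT(*) FROM entity GROUP BY name`.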
236 Citations
35 Claims
1. A system for making unstructured data available to structured data tools comprising:
a core server computer, wherein the core server computer performs steps comprising; accessing a source of unstructured data; reading the unstructured data from the source of unstructured data; sending the unstructured data to one or more transformation tools; parsing, via a natural-language processing transformation tool, the unstructured data to extract sentences from the unstructured data and then further extract from the extracted sentences sentence-level natural-language processed entities, wherein the sentence-level natural-language processed entities are at least noun phrases; extracting, via a linguistic processing transformation tool, sentence-level linguistically-processed relationships, wherein the sentence-level linguistically-processed relationships comprise associations between the sentence-level natural-language processed entities; sending the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships from the one or more transformation tools to a categorization tool; determining, via the categorization tool, categorization data elements present in each extracted sentence, wherein the categorization data elements are based on the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships, and are placed within predetermined categories, and a confidence level for each categorization data element, wherein the confidence level for each categorization data element combines one or more data points linked to the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships to create a statistically-oriented calculation of confidence assigned to the categorization data element; outputting the confidence level for at least one of the categorization data elements for use in structured data tools; and wherein the one or more data points are selected from the group consisting of;
confidence score of value provided by the one or more transformation tools, number of relationships found in the source of unstructured data compared to the size of the source of unstructured data, average number of relationships per kilobyte for relationships of the same type as a selected relationship, number of entities found to be associated with a relationship compared to an average number of entities for relationships in a same hierarchy, number of times similar relationships have been found in the past, number of entities that are grouped together to form a master entity, a number of times an entity occurred in the source of unstructured data compared to the average number of occurrences for entities in the same hierarchy, weighted confidences based on hierarchy of a relationship or entity, measures of data extraction confidence integrated with the system via an analysis schema, measures based on a fullness of a relationship's attributes, measures based on the confluence of a same finding by multiple transformation tools, measures based on the source of the unstructured data, and combinations thereof. - View Dependent Claims (2, 3, 4, 5, 24, 30, 31, 32, 33, 34, 35)
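Claim 1 recites a confidence level that combines one or more data points into a statistically-oriented calculation, but it does not fix a formula. The sketch below shows one possible reading, a weighted mean of data points normalized to the range 0 to 1. The weighted mean, the `DataPoint` type, the `ratio_score` helper, and the example weights are all assumptions chosen for illustration, not the claimed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    name: str            # which listed measure this data point represents
    value: float         # assumed to be normalized to the range [0, 1]
    weight: float = 1.0  # hypothetical relative importance


def ratio_score(observed, average):
    # Turn an "observed compared to average" measure into a value in [0, 1],
    # capping at 1 when the observation meets or exceeds the average.
    if average <= 0:
        return 0.0
    return min(observed / average, 1.0)


def combine_confidence(points):
    # Weighted mean of the data points (an assumed combination rule).
    total_weight = sum(p.weight for p in points)
    if total_weight <= 0:
        raise ValueError("at least one data point with positive weight is required")
    return sum(p.value * p.weight for p in points) / total_weight
```

For instance, a tool-provided score of 0.8 with weight 2 and an entity that occurs 3 times against a hierarchy average of 6 (a ratio score of 0.5, weight 1) combine to a confidence of about 0.7.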
6. A system for making unstructured data available to structured data tools comprising:
a core server computer executing at least one software module comprising; code to access a source of unstructured data; code to read the unstructured data from the source of unstructured data; code to write a copy of the unstructured data to a capture schema, wherein the capture schema comprises a set of tables to store the copy of the unstructured data and attributes of the unstructured data; code to send the copy of the unstructured data to one or more transformation tools; code to parse, via a natural language processing transformation tool, the copy of the unstructured data to extract sentences from the copy of the unstructured data and then further extract from the extracted sentences sentence-level natural-language processed entities, wherein the sentence-level natural-language processed entities are at least noun phrases; extracting, via a linguistic processing transformation tool, sentence-level linguistically-processed relationships, wherein the sentence-level linguistically-processed relationships comprise associations between the sentence-level natural-language processed entities; code to send the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships from the one or more transformation tools to a categorization tool; code to determine, via the categorization tool, categorization data elements present in each extracted sentence, wherein the categorization data elements based on to the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships, and are placed within predetermined categories, and a confidence level for each categorization data element, wherein the confidence level for each categorization data element combines one or more data points linked to the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships to create a statistically-oriented calculation of confidence assigned to 
the categorization data element; code to write the categorization data elements from the categorization tool and the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships from the one or more transformation tools in a structured database schema; code to output the confidence level for at least one of the categorization data elements for use in structured data tools; and wherein the one or more data points are selected from the group consisting of;
confidence score of value provided by the one or more transformation tools, number of relationships found in the source of unstructured data compared to the size of the source of unstructured data, average number of relationships per kilobyte for relationships of the same type as a selected relationship, number of entities found to be associated with a relationship compared to an average number of entities for relationships in a same hierarchy, number of times similar relationships have been found in the past, number of entities that are grouped together to form a master entity, a number of times an entity occurred in the source of unstructured data compared to the average number of occurrences for entities in the same hierarchy, weighted confidences based on hierarchy of a relationship or entity, measures of data extraction confidence integrated with the system via an analysis schema, measures based on a fullness of a relationship's attributes, measures based on the confluence of a same finding by multiple transformation tools, measures based on the source of the unstructured data, and combinations thereof. - View Dependent Claims (7, 8, 9, 10, 25)
11. A system for extracting unstructured data from a plurality of unstructured data sources and a plurality of formats comprising:
a core server computer comprising; a plurality of application program interfaces (APIs) to interface with the plurality of unstructured data sources; a single internal API that interfaces with a plurality of software components that allow structured data tools to operate on unstructured data; extraction connectors for processing unstructured data from the unstructured data sources and loading the processed unstructured data into a capture schema, wherein the capture schema comprises a set of tables to store a copy of the unstructured data and attributes of the unstructured data; wherein the copy of the unstructured data is sent to one or more transformation tools; wherein a natural-language processing transformation tool parses the copy of the unstructured data to extract sentences from the copy of the unstructured data and then further extract from the extracted sentences sentence-level natural-language processed entities, wherein the sentence-level natural-language processed entities are at least noun phrases; wherein a linguistic processing transformation tool extracts sentence-level linguistically-processed relationships, wherein the sentence-level linguistically-processed relationships comprise associations between the sentence-level natural-language processed entities; wherein the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships are sent from the one or more transformation tools to a categorization tool; wherein the categorization tool determines categorization data elements present in each extracted sentence, wherein the categorization data elements are based on the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships, and are placed within predetermined categories, and a confidence level for each categorization data element, wherein the confidence level for each categorization data element combines one or more data points linked to the 
sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships to create a statistically-oriented calculation of confidence assigned to the categorization data element; wherein the categorization data elements from the categorization tool and the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships from the one or more transformation tools are written in a structured database schema; wherein the confidence level for at least one of the categorization data elements is output for use in structured data tools; and wherein the one or more data points are selected from the group consisting of;
confidence score of value provided by the one or more transformation tools, number of relationships found in the source of unstructured data compared to the size of the source of unstructured data, average number of relationships per kilobyte for relationships of the same type as a selected relationship, number of entities found to be associated with a relationship compared to an average number of entities for relationships in a same hierarchy, number of times similar relationships have been found in the past, number of entities that are grouped together to form a master entity, a number of times an entity occurred in the source of unstructured data compared to the average number of occurrences for entities in the same hierarchy, weighted confidences based on hierarchy of a relationship or entity, measures of data extraction confidence integrated with the system via an analysis schema, measures based on a fullness of a relationship's attributes, measures based on the confluence of a same finding by multiple transformation tools, measures based on the source of the unstructured data, and combinations thereof. - View Dependent Claims (12, 26)
13. A system comprising:
a core server computer executing at least one software module comprising; code capable of understanding the format of data provided by a transformation tool; code to convert the data provided by a transformation tool to a data format that maps to a data capture schema, the data capture schema comprising a set of tables to store a copy of unstructured data and attributes of the unstructured data; code to send the copy of the unstructured data to one or more transformation tools; code to parse, via a natural-language processing transformation tool, the copy of the unstructured data to extract sentences from the copy of the unstructured data and then further extract from the extracted sentences sentence-level natural-language processed entities, wherein the sentence-level natural-language processed entities are at least noun phrases; code to extract, via a linguistic processing transformation tool, sentence-level linguistically-processed relationships, wherein the sentence-level linguistically-processed relationships comprise associations between the sentence-level natural-language processed entities; code to send the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships from the one or more transformation tools to a categorization tool; wherein the categorization tool determines categorization data elements present in each extracted sentence, wherein the categorization data elements are based on the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships, and are placed within predetermined categories, and a confidence level for each categorization data element, wherein the confidence level for each categorization data element combines one or more data points linked to the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships to create a statistically-oriented calculation of confidence assigned 
to the categorization data element; code to write the categorization data elements from the categorization tool and the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships from the one or more transformation tools in a structured database schema; code to output the confidence level for at least one of the categorization data elements for use in structured data tools; wherein each of a plurality of source documents are assigned a unique key that identifies an individual source document throughout a software system allowing (i) cross-analysis, (ii) linking of results for further analysis, (iii) drill-down from analytical reports back to the source document or (iv) drill-down from analytical reports back to transformation information stored in the data capture schema; and wherein the one or more data points are selected from the group consisting of;
confidence score of value provided by the one or more transformation tools, number of relationships found in the source of unstructured data compared to the size of the source of unstructured data, average number of relationships per kilobyte for relationships of the same type as a selected relationship, number of entities found to be associated with a relationship compared to an average number of entities for relationships in a same hierarchy, number of times similar relationships have been found in the past, number of entities that are grouped together to form a master entity, a number of times an entity occurred in the source of unstructured data compared to the average number of occurrences for entities in the same hierarchy, weighted confidences based on hierarchy of a relationship or entity, measures of data extraction confidence integrated with the system via an analysis schema, measures based on a fullness of a relationship's attributes, measures based on the confluence of a same finding by multiple transformation tools, measures based on the source of the unstructured data, and combinations thereof. - View Dependent Claims (14, 15, 27)
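Claim 13 recites a capture schema made of tables that hold a copy of the unstructured data and its attributes, together with a unique key per source document that supports drill-down from analytical results back to the document. A minimal sketch of that idea follows, using an in-memory SQLite database. The table names, column names, the `finding` table, and the use of a random UUID as the key are all hypothetical choices made for illustration; the claim does not prescribe them.

```python
import sqlite3
import uuid


def create_capture_schema(conn):
    # Hypothetical capture schema: the document copy, its attributes, and
    # categorized findings, all tied together by the document key.
    conn.executescript(
        """
        CREATE TABLE document (
            doc_key TEXT PRIMARY KEY,
            source  TEXT NOT NULL,
            body    TEXT NOT NULL
        );
        CREATE TABLE document_attribute (
            doc_key TEXT NOT NULL REFERENCES document(doc_key),
            name    TEXT NOT NULL,
            value   TEXT
        );
        CREATE TABLE finding (
            doc_key    TEXT NOT NULL REFERENCES document(doc_key),
            category   TEXT NOT NULL,
            confidence REAL NOT NULL
        );
        """
    )


def capture(conn, source, body, attributes):
    # Assign a unique key and store a copy of the document with its attributes.
    doc_key = uuid.uuid4().hex
    conn.execute("INSERT INTO document VALUES (?, ?, ?)", (doc_key, source, body))
    conn.executemany(
        "INSERT INTO document_attribute VALUES (?, ?, ?)",
        [(doc_key, name, value) for name, value in attributes.items()],
    )
    return doc_key


def drill_down(conn, category):
    # From an analytical category back to the source documents behind it.
    return conn.execute(
        "SELECT d.doc_key, d.source, d.body "
        "FROM finding f JOIN document d ON d.doc_key = f.doc_key "
        "WHERE f.category = ?",
        (category,),
    ).fetchall()
```

Because every table carries the same `doc_key`, a report row for a category can be joined back to the stored copy of the originating document.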
16. A system for allowing parallel processing of unstructured data on a continuous real-time basis, the system comprising:
a core server computer executing at least one software module comprising; code to configure unstructured source extractors and treat them as black boxes in a data workflow; code to read unstructured data from a plurality of data sources and source systems, wherein the unstructured data is available as input for further processing; code to configure end-to-end data flow from the plurality of data sources through one or more transformation components into a capture schema, wherein the capture schema comprises a set of tables to store a copy of the unstructured data and attributes of the unstructured data; code to send the copy of the unstructured data to one or more transformation tools; code to parse, via a natural-language processing transformation tool, the copy of the unstructured data to extract sentences from the copy of the unstructured data and then further extract from the extracted sentences sentence-level natural-language processed entities, wherein the sentence-level natural-language processed entities are at least noun phrases; code to extract, via a linguistic processing transformation tool, sentence-level linguistically-processed relationships, wherein the sentence-level linguistically-processed relationships comprise associations between the sentence-level natural-language processed entities; code to send the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships from the one or more transformation tools to a categorization tool; code to determine, via the categorization tool, categorization data elements present in each extracted sentence, wherein the categorization data elements are based on the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships, and are placed within predetermined categories, and a confidence level for each categorization data element, wherein the confidence level for each categorization data element combines one 
or more data points linked to the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships to create a statistically-oriented calculation of confidence assigned to the categorization data element; code to write the categorization data elements from the categorization tool and the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships from the one or more transformation tools in a structured database schema; code to output the confidence level for at least one of the categorization data elements for use in structured data tools; and wherein the one or more data points are selected from the group consisting of;
confidence score of value provided by the one or more transformation tools, number of relationships found in the source of unstructured data compared to the size of the source of unstructured data, average number of relationships per kilobyte for relationships of the same type as a selected relationship, number of entities found to be associated with a relationship compared to an average number of entities for relationships in a same hierarchy, number of times similar relationships have been found in the past, number of entities that are grouped together to form a master entity, a number of times an entity occurred in the source of unstructured data compared to the average number of occurrences for entities in the same hierarchy, weighted confidences based on hierarchy of a relationship or entity, measures of data extraction confidence integrated with the system via an analysis schema, measures based on a fullness of a relationship's attributes, measures based on the confluence of a same finding by multiple transformation tools, measures based on the source of the unstructured data and combinations thereof. - View Dependent Claims (17, 28)
18. A system that allows structured data analysis tools to analyze data in an analysis schema comprising:
a core server computer executing at least one software module comprising; ODBC code; JDBC code; code to pre-populate metadata of the structured data analysis tools with tables, columns, attributes, data and metrics from an analysis schema without performing tool customization or application specific setup; and wherein the analysis schema comprises a set of tables that provides structure to unstructured data, wherein at least one of the set of tables comprises master entities, the master entities comprising (i) a group of entities that appear in multiple documents that are the same actual entity, (ii) entities that are spelled differently that are the same actual entity, or (iii) entities that have multiple names that are the same actual entity; wherein unstructured data is read from a source of unstructured data; wherein a copy of the unstructured data is written to a capture schema, wherein the capture schema comprises a set of tables to store the copy of the unstructured data and attributes of the unstructured data; wherein the copy of the unstructured data is sent to one or more transformation tools; wherein a natural-language processing transformation tool parses the copy of the unstructured data to extract sentences from the copy of the unstructured data and then further extract from the extracted sentences sentence-level natural-language processed entities, wherein the sentence-level natural-language processed entities are at least noun phrases; wherein a linguistic processing transformation tool extracts sentence-level linguistically-processed relationships, wherein the sentence-level linguistically-processed relationships comprise associations between the sentence-level natural-language processed entities; wherein the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships are sent from the one or more transformation tools to a categorization tool; wherein the categorization tool determines 
categorization data elements present in each extracted sentence, wherein the categorization data elements are based on the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships, and are placed within predetermined categories, and a confidence level for each categorization data element, wherein the confidence level for each categorization data element combines one or more data points linked to the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships to create a statistically-oriented calculation of confidence assigned to the categorization data element; wherein the categorization data elements from the categorization tool and the sentence-level natural-language processed entities and the sentence-level linguistically-processed relationships from the one or more transformation tools are written in a structured database schema; wherein the confidence level for at least one of the categorization data elements is output for use in structured data tools; and wherein the one or more data points are selected from the group consisting of;
confidence score of value provided by the one or more transformation tools, number of relationships found in the source of unstructured data compared to the size of the source of unstructured data, average number of relationships per kilobyte for relationships of the same type as a selected relationship, number of entities found to be associated with a relationship compared to an average number of entities for relationships in a same hierarchy, number of times similar relationships have been found in the past, number of entities that are grouped together to form a master entity, a number of times an entity occurred in the source of unstructured data compared to the average number of occurrences for entities in the same hierarchy, weighted confidences based on hierarchy of a relationship or entity, measures of data extraction confidence integrated with the system via an analysis schema, measures based on a fullness of a relationship's attributes, measures based on the confluence of a same finding by multiple transformation tools, measures based on the source of the unstructured data, and combinations thereof. - View Dependent Claims (19, 20, 21, 22, 23, 29)
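Claim 18 describes master entities: groups of entities that appear in multiple documents, are spelled differently, or go by multiple names, yet refer to the same actual entity. One simple way to form such groups is to reduce each mention to a normalized key and consult an alias table. The sketch below assumes that approach; the normalization rules, the `ALIASES` table, and the function names are hypothetical and are not taken from the claims.

```python
import re
from collections import defaultdict

# Hypothetical alias table mapping alternate names to one canonical name.
ALIASES = {"ibm": "international business machines"}


def normalize(name):
    # Lowercase, drop punctuation, and collapse whitespace so that spelling
    # variants such as "I.B.M." and "IBM" reduce to the same key.
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())
    key = re.sub(r"\s+", " ", key).strip()
    return ALIASES.get(key, key)


def group_master_entities(mentions):
    # mentions: iterable of (document_id, entity_name) pairs.
    groups = defaultdict(list)
    for doc_id, name in mentions:
        groups[normalize(name)].append((doc_id, name))
    return dict(groups)
```

The size of each resulting group corresponds to one of the listed data points, the number of entities that are grouped together to form a master entity.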
Specification