Technique and tools for high-level rule-based customizable data extraction
Abstract
The present invention provides a method, system, and computer program product for extracting data from a data stream (including data streams that contain the presentation space for a legacy host screen) using a rule-based approach that does not require a user to write programming language statements. The disclosed techniques apply to presentation space data that is sent from a legacy host application to a workstation, as well as to other types of data streams (including data exchanged between applications, Web page data, etc.). Rules are defined using intuitive, interactive tools to specify the target patterns of data to be extracted. Tags in a markup language (such as the Extensible Markup Language, or “XML”) are defined, and are associated with the defined rules. Upon detecting a match between the data in an incoming data stream and a target rule, an output document (expressed in the markup language) is created. Use of the markup language document provides great flexibility, enabling the document to be translated or otherwise transformed for use in multiple different environments.
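The rule-to-markup idea in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the patent: the `Rule` and `RuleComponent` classes, the use of regular expressions as target patterns, the `build_document` helper, and the sample screen text. A rule matches only if all of its components match, and the output document copies the tag hierarchy of the associated template.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical rule format (not from the patent): each component names an
# output tag and a regex whose first capture group is the value to extract.
@dataclass
class RuleComponent:
    tag: str
    pattern: str

@dataclass
class Rule:
    name: str
    components: list

    def match(self, data: str):
        """Return {tag: value} if every component matches, else None."""
        values = {}
        for comp in self.components:
            m = re.search(comp.pattern, data)
            if m is None:
                return None
            values[comp.tag] = m.group(1)
        return values

def build_document(template: ET.Element, values: dict) -> ET.Element:
    """Copy the template's tag hierarchy, filling leaf tags from values."""
    out = ET.Element(template.tag, template.attrib)
    children = list(template)
    if not children:
        out.text = values.get(template.tag, "")
    for child in children:
        out.append(build_document(child, values))
    return out

# A made-up host screen and a rule/template pair associated with it.
screen = "ACCOUNT INQUIRY\nNAME: J SMITH\nBALANCE: 1024.50\n"
rule = Rule("account", [
    RuleComponent("name", r"NAME:\s*(.+)"),
    RuleComponent("balance", r"BALANCE:\s*([\d.]+)"),
])
template = ET.fromstring("<account><holder><name/></holder><balance/></account>")

values = rule.match(screen)
doc = build_document(template, values)
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

Because the result is ordinary XML, a downstream consumer could transform it further (for example with a stylesheet) without touching the rule definitions.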
28 Claims
1. A computer program product for efficiently extracting data from a data stream, the computer program product embodied on one or more computer-readable media and comprising:
computer-readable program code means for defining one or more data extraction rules, each of the rules comprising one or more rule components;
computer-readable program code means for defining one or more output document templates for storing extracted data, wherein each of the templates comprises one or more tags which are hierarchically structured and wherein each template is to be associated with one or more of the data extraction rules;
computer-readable program code means for associating at least one of the templates with at least one of the rules;
computer-readable program code means for storing the rules, the templates, and the associations;
computer-readable program code means for monitoring at least one data stream for arrival of incoming data;
computer-readable program code means for comparing the incoming data to selected ones of the stored rules until detecting a matching rule;
computer-readable program code means for extracting data from the incoming data, upon detecting the matching rule, according to the matching rule; and
computer-readable program code means for storing the extracted data in an extensible document which is created according to the tags and structure of a selected one of the templates that is associated with the matching rule.
9. A system for efficiently extracting data from a data stream, comprising:
means for defining one or more data extraction rules, each of the rules comprising one or more rule components;
means for defining one or more output document templates for storing extracted data, wherein each of the templates comprises one or more tags which are hierarchically structured and wherein each template is to be associated with one or more of the data extraction rules;
means for associating at least one of the templates with at least one of the rules;
means for storing the rules, the templates, and the associations;
means for monitoring at least one data stream for arrival of incoming data;
means for comparing the incoming data to selected ones of the stored rules until detecting a matching rule;
means for extracting data from the incoming data, upon detecting the matching rule, according to the matching rule; and
means for storing the extracted data in an extensible document which is created according to the tags and structure of a selected one of the templates that is associated with the matching rule.
17. A method for efficiently extracting data from a data stream, comprising the steps of:
defining one or more data extraction rules, each of the rules comprising one or more rule components;
defining one or more output document templates for storing extracted data, wherein each of the templates comprises one or more tags which are hierarchically structured and wherein each template is to be associated with one or more of the data extraction rules;
associating at least one of the templates with at least one of the rules;
storing the rules, the templates, and the associations;
monitoring at least one data stream for arrival of incoming data;
comparing the incoming data to selected ones of the stored rules until detecting a matching rule;
extracting data from the incoming data, upon detecting the matching rule, according to the matching rule; and
storing the extracted data in an extensible document which is created according to the tags and structure of a selected one of the templates that is associated with the matching rule.
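The sequence of steps in claim 17 can be sketched end to end as follows. This is only an illustrative reading under stated assumptions, not the patented implementation: `ExtractionStore`, `first_match`, and `process_stream` are hypothetical names, rules are modeled as lists of (tag, regex) pairs, a plain Python iterable stands in for the monitored data stream, and rules are compared in definition order until the first one matches.

```python
import re
import xml.etree.ElementTree as ET

class ExtractionStore:
    """Hypothetical store for rules, templates, and their associations."""
    def __init__(self):
        self.rules = {}         # rule name -> list of (tag, regex) components
        self.templates = {}     # template name -> template XML string
        self.associations = {}  # rule name -> template name

    def define_rule(self, name, components):
        self.rules[name] = components

    def define_template(self, name, xml):
        self.templates[name] = xml

    def associate(self, rule_name, template_name):
        self.associations[rule_name] = template_name

def first_match(store, data):
    """Compare data to stored rules in order; stop at the first match."""
    for name, components in store.rules.items():
        found = {}
        for tag, pattern in components:
            m = re.search(pattern, data)
            if m is None:
                break  # one component failed, so this rule does not match
            found[tag] = m.group(1)
        else:
            return name, found
    return None, {}

def process_stream(store, stream):
    """Yield one XML document per incoming chunk that matches a rule."""
    for data in stream:  # stands in for monitoring a live data stream
        name, found = first_match(store, data)
        if name is None:
            continue
        # Build the output from the template associated with the matching rule.
        doc = ET.fromstring(store.templates[store.associations[name]])
        for element in doc.iter():
            if element.tag in found:
                element.text = found[element.tag]
        yield ET.tostring(doc, encoding="unicode")

store = ExtractionStore()
store.define_rule("order", [("id", r"ORDER#(\d+)"), ("qty", r"QTY=(\d+)")])
store.define_rule("login", [("user", r"USER:\s*(\w+)")])
store.define_template("order_doc", "<order><id/><line><qty/></line></order>")
store.define_template("login_doc", "<session><user/></session>")
store.associate("order", "order_doc")
store.associate("login", "login_doc")

incoming = ["USER: alice", "heartbeat", "ORDER#77 QTY=3"]
documents = list(process_stream(store, incoming))
for document in documents:
    print(document)
```

In this sketch, unmatched input such as `"heartbeat"` is silently skipped, and a rule with no associated template would raise a `KeyError`; a fuller version would need to decide how to handle both cases.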
Specification