System and method for processing semi-structured business data using selected template designs
Abstract
A method for processing semi-structured data. The method includes receiving semi-structured data into a first format from a real business process. Preferably, the semi-structured data are machine generated. The method includes tokenizing the semi-structured data into a second format and storing the semi-structured data in the second format into one or more memories and clustering the tokenized data to form a plurality of clusters. The method also includes identifying a selected low frequency term in each of the clusters, and processing at least two of the clusters and the associated selected low frequency terms to form a single template for the at least two of the clusters. In a preferred embodiment, the method replaces the selected low frequency term with a wild card character.
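The flow described in the abstract (tokenize, cluster, replace low frequency terms with a wild card) can be illustrated with a minimal sketch. The names below (tokenize, cluster_by_shape, template_for_cluster, min_share) are hypothetical and do not come from the patent. Grouping records by token count and first token is only a simple stand-in for the clustering step, and the 0.5 frequency threshold is an assumption chosen for the example.

```python
from collections import Counter, defaultdict

WILDCARD = "*"  # assumed wild card character


def tokenize(record):
    """Convert a raw text record (first format) into a token list (second format)."""
    return record.split()


def cluster_by_shape(records):
    """Group tokenized records by (token count, first token).

    This is a deliberately simple stand-in for a real clustering algorithm.
    """
    clusters = defaultdict(list)
    for record in records:
        tokens = tokenize(record)
        key = (len(tokens), tokens[0] if tokens else "")
        clusters[key].append(tokens)
    return list(clusters.values())


def template_for_cluster(cluster, min_share=0.5):
    """Keep a token where it dominates its position; otherwise emit a wild card.

    A position whose most common token appears in at most `min_share` of the
    records is treated as holding low frequency (variable) terms.
    """
    template = []
    for column in zip(*cluster):
        token, count = Counter(column).most_common(1)[0]
        template.append(token if count / len(cluster) > min_share else WILDCARD)
    return " ".join(template)


records = [
    "login user alice from 10.0.0.1",
    "login user bob from 10.0.0.2",
    "login user carol from 10.0.0.3",
    "disk sda1 usage 91%",
    "disk sdb1 usage 40%",
]
templates = sorted(template_for_cluster(c) for c in cluster_by_shape(records))
print(templates)
```

With these sample records the sketch yields one template per cluster, with the user name, address, device and percentage positions replaced by the wild card.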
29 Claims
-
1. A method for processing semi-structured data for the support of business decisions, the method comprising:
-
inputting semi-structured data in a first format from a real business process;
converting the semi-structured data in a first format into a second format;
storing the semi-structured data in the second format into memory;
processing the stored information with the components of one or more business decisions to couple the business decision with the semi-structured data;
separating the semi-structured data in the second format into a fixed part, a structured data part and an unstructured data part;
identifying one or more relational class related to the business decision in the structured part;
replacing members of the relational class with special symbols to create a rewritten corpus;
determining a template from one or more aggregate patterns coupled to the identified relational class from the rewritten corpus;
displaying the relational class, the template, and the business decision; and
determining from variable fields within the template, an indicator of a business event.
-
2. The method in 1 wherein the semi-structured data is processed together with structured data or unstructured data or both.
-
3. The method in 1 wherein the separation of semi-structured data is done interactively.
-
4. The method in 1 further comprising detecting template drift by comparing the template to a second template of the structured data part to identify changed templates, new templates, or missing templates.
-
5. The method in 1 wherein the structured data are extracted and stored in a relational database.
-
6. The method in 5 wherein the structured data are further processed by mapping them into a set of data types commonly used in relational databases.
-
7. The method in 5 wherein the structured data are further processed by mapping them into a set of data types provided by a user.
-
8. The method in 5 wherein the structured data are further processed by mapping them into a set of data types used in the business.
-
10. The method in 1 wherein the semi-structured data consist of a set of text records.
-
11. The method in 10 wherein the records are separated into three parts based on parsing them into tokens and identifying each token as being fixed, structured or unstructured.
-
12. The method in 11 wherein the parsing into tokens is based on user-supplied, commonly used or statistically inferred delimiters.
-
13. The method in 11 wherein the fixed data is separated based on clustering records and identifying frequent tokens in a cluster as fixed.
-
14. The method in 13 wherein clustering is letter n-gram clustering.
-
15. The method in 13 wherein the clustering algorithm used is one of k-means, Expectation-Maximization clustering or hierarchical clustering.
-
16. The method in 11 wherein the separation of structured and unstructured parts is based on aligning records and identifying common tokens as fixed, and identifying non-common tokens as unstructured or structured.
-
17. The method in 16 wherein alignment is done via dynamic programming.
-
18. The method in 17 wherein a pair of different aligned tokens in an alignment of two templates is converted into a wildcard character in the merged template.
-
19. The method in 17 wherein a pair of different aligned tokens in an alignment of two templates is converted into a special symbol that is the union of values that match either the first or the second aligned token.
-
20. The method in 17 wherein a pair of different aligned tokens in an alignment of two templates is converted into a regular expression that matches both the first and the second aligned token.
-
21. The method in 11 wherein a variable part of a record is identified as structured if it can be mapped into a typical relational database numerical format.
-
22. The method in 11 wherein a variable part is identified as structured if it can be mapped into a relational database field commonly used in the business.
-
23. The method in 11 wherein a variable part is identified as unstructured if it did not pass any of the tests for structured data.
-
24. The method in 11 wherein separation is done iteratively.
-
25. The method in 24 wherein iterative separation is guided by the user.
-
26. The method in 1 wherein the special symbols are chosen to represent digits and textual data is regularized by replacing each digit with this special symbol.
-
27. The method in 1 wherein a post processing step examines all token sequences that match a specific wild card character in a template and replaces the wild card character with a regular expression that matches all matching token strings.
-
28. The method in 1 wherein a first set of templates is derived in a first step, all data items matching the first set of templates are removed from the collection, and template analysis is then applied to the remainder.
-
29. The method in 28 wherein the template induction and document deletion steps are applied in more than one iteration.
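
Two of the claimed steps lend themselves to a short illustration: the digit regularization of claim 26 and the dynamic programming alignment with wild card merging of claims 17 and 18. The sketch below is only one possible reading. The function names, the "#" and "*" symbols, and the use of a longest common subsequence table as the alignment are assumptions made for the example, not details taken from the claims.

```python
import re

WILDCARD = "*"  # assumed wild card character for merged templates


def regularize_digits(text, symbol="#"):
    """Replace every digit with one special symbol (one reading of claim 26)."""
    return re.sub(r"\d", symbol, text)


def merge_templates(a, b):
    """Align two token lists by dynamic programming and merge them.

    Tokens on the longest common subsequence are kept as fixed text; every
    run of non-matching tokens collapses into a single wild card.
    """
    n, m = len(a), len(b)
    # lcs[i][j] holds the LCS length of a[i:] and b[j:].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    merged = []

    def add_wildcard():
        # Avoid emitting several wild cards in a row.
        if not merged or merged[-1] != WILDCARD:
            merged.append(WILDCARD)

    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            merged.append(a[i])
            i += 1
            j += 1
        else:
            add_wildcard()
            # Skip the token whose removal preserves the longer alignment.
            if lcs[i + 1][j] >= lcs[i][j + 1]:
                i += 1
            else:
                j += 1
    if i < n or j < m:
        add_wildcard()  # leftover tokens on either side are variable
    return merged


print(regularize_digits("port 8080 at 10:42"))
print(merge_templates("user alice logged in".split(),
                      "user bob logged in".split()))
```

Collapsing each run of mismatches into one wild card is a design choice for this sketch. Claims 19 and 20 describe alternatives in which the differing pair becomes a union symbol or a regular expression instead.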
Specification