System and method for processing semi-structured business data using selected template designs
Abstract
A method for processing semi-structured data. The method includes receiving semi-structured data into a first format from a real business process. Preferably, the semi-structured data are machine generated. The method includes tokenizing the semi-structured data into a second format and storing the semi-structured data in the second format into one or more memories and clustering the tokenized data to form a plurality of clusters. The method also includes identifying a selected low frequency term in each of the clusters, and processing at least two of the clusters and the associated selected low frequency terms to form a single template for the at least two of the clusters. In a preferred embodiment, the method replaces the selected low frequency term with a wild card character.
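The sequence described in the abstract (tokenize, cluster, find low frequency terms, replace them with a wild card to form a template) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: whitespace tokenization, a naive grouping by token count and first token standing in for the clustering step, the `*` wild card symbol, the 0.6 frequency threshold, and the function names `tokenize`, `cluster_by_shape`, `induce_template`, and `templates_for`.

```python
from collections import Counter, defaultdict

WILDCARD = "*"  # assumed wild card symbol; the source does not name one


def tokenize(record):
    """Split a text record into tokens on whitespace (assumed delimiter)."""
    return record.split()


def cluster_by_shape(token_lists):
    """Group token lists by (token count, first token).

    This is a deliberately naive stand-in for a real clustering step.
    """
    clusters = defaultdict(list)
    for tokens in token_lists:
        key = (len(tokens), tokens[0] if tokens else "")
        clusters[key].append(tokens)
    return list(clusters.values())


def induce_template(cluster, min_fraction=0.6):
    """Form one template for a cluster of equal-length token lists.

    At each position, the most common token is kept if it occurs in at
    least min_fraction of the records; otherwise the position is treated
    as holding low frequency terms and becomes a wild card.
    """
    n = len(cluster)
    template = []
    for column in zip(*cluster):
        token, count = Counter(column).most_common(1)[0]
        template.append(token if count / n >= min_fraction else WILDCARD)
    return " ".join(template)


def templates_for(records):
    """Run the whole sketch: tokenize, cluster, and induce one template per cluster."""
    token_lists = [tokenize(r) for r in records]
    return [induce_template(c) for c in cluster_by_shape(token_lists)]


if __name__ == "__main__":
    log_lines = [
        "login user alice ok",
        "login user bob ok",
        "login user carol ok",
        "disk sda1 full",
        "disk sdb2 full",
    ]
    for template in templates_for(log_lines):
        print(template)
```

In this sketch the user names and device names occur only once each within their cluster, so they fall below the threshold and are replaced, while the repeated tokens survive as the fixed part of the template.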
37 Claims
-
1. A method for processing semi-structured data for the support of business decisions, the method comprising:
-
inputting semi-structured data in a first format from a real business process;
converting the semi-structured data in a first format into a second format;
storing the semi-structured data in the second format into memory;
processing the stored information with the components of one or more business decisions to couple the business decision with the semi-structured data;
separating the semi-structured data in the second format into a fixed part, a structured data part and an unstructured data part;
identifying one or more factors related to the business decision in the fixed part, the structured part or the unstructured part;
determining one or more aggregate patterns coupled to the identified factors from the separated data;
displaying the factor, the pattern, and the business decision.
-
-
2. The method in 1 wherein the semi-structured data is processed together with structured data or unstructured data or both.
-
3. The method in 1 wherein the separation of semi-structured data is done interactively.
-
4. The method in 1 wherein the analysis is compared to a previous analysis and changed templates, new templates, and missing templates in the new data set are identified (template drift detection).
-
5. The method in 1 wherein the structured data are extracted and stored in a relational database.
-
6. The method in 5 wherein the structured data are further processed by mapping them into a set of data types commonly used in relational databases.
-
7. The method in 5 wherein the structured data are further processed by mapping them into a set of data types provided by a user.
-
8. The method in 5 wherein the structured data are further processed by mapping them into a set of data types used in the business.
-
9. The method in 1 wherein the semi-structured data consist of a set of text records.
-
10. The method in 1 wherein the records are separated into three parts based on parsing them into tokens and identifying each token as being fixed, structured or unstructured.
-
11. The method in 1 wherein the tokenization is based on user-supplied, commonly used or statistically inferred delimiters.
-
12. The method in 1 wherein the identification as fixed is based on clustering records and identifying frequent tokens in a cluster as fixed.
-
13. The method in 1 wherein the identification of structured and unstructured parts is based on aligning records and identifying common tokens as fixed, and identifying non-common tokens as unstructured or structured.
-
14. The method in 1 wherein a variable part of a record is identified as structured if it can be mapped into a typical relational database numerical format.
-
15. The method in 1 wherein a variable part is identified as structured if it can be mapped into a relational database field commonly used in the business.
-
16. The method in 1 wherein a variable part is identified as unstructured if it did not pass any of the tests for structured data.
-
17. The method in 12 wherein clustering is letter n-gram clustering.
-
18. The method in 13 wherein alignment is done via dynamic programming.
-
19. The method in 10 wherein separation is done iteratively.
-
20. The method in 19 wherein iterative separation is guided by the user.
-
21. The method in 1 wherein a special symbol is chosen to represent digits and textual data is regularized by replacing each digit with this special symbol.
-
22. The method in 12 wherein the clustering algorithm used is one of k-means, Expectation-Maximization clustering or hierarchical clustering.
-
23. The method in 18 wherein a pair of different aligned tokens in an alignment of two templates is converted into a wildcard character in the merged template.
-
24. The method in 18 wherein a pair of different aligned tokens in an alignment of two templates is converted into a special symbol that is the union of values that match either the first or the second aligned token.
-
25. The method in 18 wherein a pair of different aligned tokens in an alignment of two templates is converted into a regular expression that matches both the first and the second aligned token.
-
26. The method in 1 wherein a post processing step examines all token sequences that match a specific wild card character in a template and replaces the wild card character with a regular expression that matches all matching token strings.
-
27. The method in 1 wherein a first set of templates is derived in a first step, all data items matching the first set of templates are removed from the collection, and template analysis is then applied to the remainder.
-
28. The method in 27 wherein the template induction and document deletion steps are applied in more than one iteration.
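Claims 18 and 23 refer to aligning templates by dynamic programming and turning each pair of differing aligned tokens into a wildcard. The following is a minimal sketch of one way this could be done, not the patented implementation. It assumes an edit-distance style global alignment with unit costs, whitespace-separated templates, and `*` as the wildcard; treating a token aligned against a gap as a wildcard is also an assumption. The names `align` and `merge_templates` are hypothetical.

```python
WILDCARD = "*"  # assumed wildcard symbol


def align(a, b):
    """Globally align two token lists with unit-cost dynamic programming.

    Returns a list of (token_from_a, token_from_b) pairs, where None
    marks a gap on that side.
    """
    n, m = len(a), len(b)
    # cost[i][j] is the minimal cost of aligning a[:i] with b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if a[i - 1] == b[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # align the two tokens
                cost[i - 1][j] + 1,             # gap in b
                cost[i][j - 1] + 1,             # gap in a
            )
    # Trace back from the bottom-right corner to recover the alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = 0 if a[i - 1] == b[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def merge_templates(t1, t2):
    """Merge two templates: equal aligned tokens are kept, all others become wildcards."""
    merged = [x if x == y else WILDCARD for x, y in align(t1.split(), t2.split())]
    return " ".join(merged)


if __name__ == "__main__":
    print(merge_templates("login user alice ok", "login user bob ok"))
    print(merge_templates("disk sda1 full", "disk sda1 nearly full"))
```

The variants in claims 24 and 25 would replace the wildcard with a union symbol or a regular expression covering both aligned tokens; the sketch does not attempt either.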
-
29. A method for processing semi-structured records for making computer based business decisions, the method comprising:
-
receiving a set of semi-structured records from a real business process, each of the records being in a first format;
storing the set of semi-structured records in the first format into one or more memories;
tokenizing the set of semi-structured records into one or more strings of token elements;
clustering the one or more strings of token elements associated with the set of semi-structured records into a plurality of clusters;
identifying one or more low frequency tokens in each of the clusters in the plurality of clusters;
replacing at least one of the low-frequency tokens with a predetermined wildcard character in at least one of the clusters to convert at least one of the records in the set of semi-structured records in the first format into a second format;
storing the set of semi-structured records in the second format into one or more memories;
selecting one or more of the semi-structured records in the second format;
outputting the selected one or more semi-structured records in the second format;
associating at least one of the semi-structured records in the first format with a respective semi-structured record in the second format;
identifying a fixed component and a variable component in at least one of the semi-structured records in the first format;
determining one or more patterns associated with the identified fixed component and variable component; and
displaying at least one of the patterns.
-
-
30. A method for processing semi-structured data, the method comprising:
-
receiving semi-structured data into a first format from a real business process, the semi-structured data being machine generated;
tokenizing the semi-structured data into a second format and storing the semi-structured data in the second format into one or more memories;
clustering the tokenized data to form a plurality of clusters;
identifying at least one selected low frequency term in at least one of the clusters; and
processing at least one of the clusters and the associated selected low frequency term to form at least two records in the second format to form a single template for at least the two records.
-
-
36. A computer based system for processing semi-structured data, the system comprising one or more memories, the one or more memories including:
-
one or more codes directed to receiving a set of semi-structured records into a first format from a real business process, the semi-structured records being machine generated;
one or more codes directed to tokenizing the set of semi-structured records into a second format and storing the set of semi-structured data in the second format into one or more memories;
one or more codes directed to clustering the tokenized data to form a plurality of clusters;
one or more codes directed to identifying a selected low frequency term in at least one of the clusters; and
one or more codes directed to processing at least two records in the second format and the selected low frequency term to form a single template for the at least two records.
-
-
37. A computer based system for processing semi-structured records for making computer based business decisions, the system comprising one or more memories, the one or more memories including:
-
one or more codes directed to receiving a set of semi-structured records from a real business process, each of the records being in a first format;
one or more codes directed to storing the set of semi-structured records in the first format into the one or more memories;
one or more codes directed to tokenizing the set of semi-structured records into one or more strings of token elements;
one or more codes directed to clustering the one or more strings of token elements associated with the set of semi-structured records into a plurality of clusters;
one or more codes directed to identifying one or more low frequency tokens in each of the clusters in the plurality of clusters;
one or more codes directed to replacing at least one of the low-frequency tokens with a predetermined wildcard character in at least one of the clusters to convert the semi-structured records in the first format into a second format;
one or more codes directed to storing the set of semi-structured records in the second format into one or more memories;
one or more codes directed to selecting one or more of the semi-structured records in the second format;
one or more codes directed to outputting the selected one or more semi-structured records in the second format;
one or more codes directed to associating at least one of the semi-structured records in the first format with a respective semi-structured record in the second format;
one or more codes directed to identifying a fixed component and a variable component in at least one of the semi-structured records in the first format;
one or more codes directed to determining one or more patterns associated with the identified fixed component and variable component; and
one or more codes directed to displaying at least one of the patterns.
-
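Claims 29 and 37 recite associating a record in the first format with its record in the second format and identifying a fixed component and a variable component. One possible way to do this, sketched here under stated assumptions, is to compile each template into a regular expression and capture whatever the wildcard positions match. The `*` wildcard, whitespace-separated tokens, one token per wildcard, and the names `template_to_regex` and `split_record` are all hypothetical choices for illustration.

```python
import re

WILDCARD = "*"  # assumed wildcard symbol


def template_to_regex(template):
    """Compile a template into a regex; each wildcard captures one token."""
    parts = []
    for token in template.split():
        if token == WILDCARD:
            parts.append(r"(\S+)")          # variable position: capture it
        else:
            parts.append(re.escape(token))  # fixed position: match literally
    return re.compile(r"\s+".join(parts))


def split_record(record, templates):
    """Associate a record with the first template that matches it entirely.

    Returns (template, fixed_tokens, variable_tokens), or None if no
    template in the list matches the record.
    """
    for template in templates:
        match = template_to_regex(template).fullmatch(record.strip())
        if match:
            fixed = [t for t in template.split() if t != WILDCARD]
            return template, fixed, list(match.groups())
    return None


if __name__ == "__main__":
    known_templates = ["disk * full", "login user * ok"]
    print(split_record("login user alice ok", known_templates))
    print(split_record("unrecognized line", known_templates))
```

Records that match no template are returned as None in this sketch, which corresponds loosely to the remainder that claims 27 and 28 feed back into another round of template analysis.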
Specification