System and method for processing semi-structured business data using selected template designs

US 20050065967A1
Filed: 07/20/2004
Published: 03/24/2005
Est. Priority Date: 07/25/2003
Status: Active Grant

Abstract

A method for processing semi-structured data. The method includes receiving semi-structured data into a first format from a real business process. Preferably, the semi-structured data are machine generated. The method includes tokenizing the semi-structured data into a second format and storing the semi-structured data in the second format into one or more memories and clustering the tokenized data to form a plurality of clusters. The method also includes identifying a selected low frequency term in each of the clusters, and processing at least two of the clusters and the associated selected low frequency terms to form a single template for the at least two of the clusters. In a preferred embodiment, the method replaces the selected low frequency term with a wild card character.
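
The template-forming step in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, a cluster that has already been formed, a `*` wildcard, and a document-frequency cutoff (`min_fraction`); none of these specifics come from the patent text, and the function names are hypothetical.

```python
from collections import Counter

WILDCARD = "*"  # assumed wildcard symbol, not specified by the source


def tokenize(record):
    """Split a record on whitespace (an assumed delimiter choice)."""
    return record.split()


def induce_templates(cluster, min_fraction=0.5):
    """Replace tokens that occur in fewer than min_fraction of the
    cluster's records with WILDCARD and return the distinct templates."""
    token_rows = [tokenize(r) for r in cluster]
    # Count in how many records of the cluster each token appears.
    doc_freq = Counter(tok for row in token_rows for tok in set(row))
    threshold = min_fraction * len(token_rows)
    return {
        " ".join(tok if doc_freq[tok] >= threshold else WILDCARD for tok in row)
        for row in token_rows
    }


cluster = [
    "user alice logged in",
    "user bob logged in",
    "user carol logged in",
]
templates = induce_templates(cluster)
```

Here the three records collapse into one template because the user names are the only low-frequency tokens in the cluster.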

37 Claims

1. A method for processing semi-structured data for the support of business decisions, the method comprising:
- inputting semi-structured data in a first format from a real business process;
- converting the semi-structured data in a first format into a second format;
- storing the semi-structured data in the second format into memory;
- processing the stored information with the components of one or more business decisions to couple the business decision with the semi-structured data;
- separating the semi-structured data in the second format into a fixed part, a structured data part and an unstructured data part;
- identifying one or more factors related to the business decision in the fixed part, the structured part or the unstructured part;
- determining one or more aggregate patterns coupled to the identified factors from the separated data;
- displaying the factor, the pattern, and the business decision.

2. The method in 1 wherein the semi-structured data is processed together with structured data or unstructured data or both.

3. The method in 1 wherein the separation of semi-structured data is done interactively.

4. The method in 1 wherein the analysis is compared to a previous analysis and changed templates, new templates, and missing templates in the new data set are identified (template drift detection).
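
The comparison in claim 4 can be sketched as a set difference between two analyses. This is only an illustration: it assumes each analysis is a dictionary from a template identifier to its template string, and the name `template_drift` is hypothetical.

```python
def template_drift(previous, current):
    """Compare two analyses, each a dict of template id -> template string
    (an assumed representation), and report new, missing and changed ids."""
    prev_ids, cur_ids = set(previous), set(current)
    return {
        "new": sorted(cur_ids - prev_ids),
        "missing": sorted(prev_ids - cur_ids),
        "changed": sorted(
            tid for tid in prev_ids & cur_ids if previous[tid] != current[tid]
        ),
    }


old = {"t1": "user * logged in", "t2": "disk full on *"}
new = {"t1": "user * logged in from *", "t3": "job * finished"}
drift = template_drift(old, new)
```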

5. The method in 1 wherein the structured data are extracted and stored in a relational database.

6. The method in 5 wherein the structured data are further processed by mapping them into a set of data types commonly used in relational databases.

7. The method in 5 wherein the structured data are further processed by mapping them into a set of data types provided by a user.

8. The method in 5 wherein the structured data are further processed by mapping them into a set of data types used in the business.

9. The method in 1 wherein the semi-structured data consist of a set of text records.

10. The method in 1 wherein the records are separated into three parts based on parsing them into tokens and identifying each token as being fixed, structured or unstructured.

11. The method in 1 wherein the tokenization is based on user-supplied, commonly used or statistically inferred delimiters.

12. The method in 1 wherein the identification as fixed is based on clustering records and identifying frequent tokens in a cluster as fixed.

13. The method in 1 wherein the identification of structured and unstructured parts is based on aligning records and identifying common tokens as fixed, and identifying non-common tokens as unstructured or structured.

14. The method in 1 wherein a variable part of a record is identified as structured if it can be mapped into a typical relational database numerical format.

15. The method in 1 wherein a variable part is identified as structured if it can be mapped into a relational database field commonly used in the business.

16. The method in 1 wherein a variable part is identified as unstructured if it did not pass any of the tests for structured data.
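
Claims 14 to 16 describe a sequence of tests with unstructured as the fallback. A minimal sketch of that idea follows. The specific tests (integer, decimal, two date formats) and the type labels are assumptions chosen for illustration, not the tests used in the patent.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def classify_variable_part(text, date_formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Return ("structured", type_name) if the text passes one of the
    assumed tests, otherwise ("unstructured", None)."""
    try:
        int(text)
        return "structured", "INTEGER"
    except ValueError:
        pass
    try:
        if Decimal(text).is_finite():  # rejects NaN and Infinity
            return "structured", "DECIMAL"
    except InvalidOperation:
        pass
    for fmt in date_formats:  # stand-in for business-specific field types
        try:
            datetime.strptime(text, fmt)
            return "structured", "DATE"
        except ValueError:
            pass
    return "unstructured", None
```

For example, `classify_variable_part("07/20/2004")` falls through the numeric tests and is accepted by the second date format, while free text fails every test and is labeled unstructured.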

17. The method in 12 wherein clustering is letter n-gram clustering.
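
To make the letter n-gram representation in claim 17 concrete, the sketch below builds character trigram sets and groups records whose Jaccard similarity exceeds a threshold. The greedy single-pass grouping is a simplification swapped in for brevity; it is not one of the clustering algorithms named in the claims, and the threshold value is an assumption.

```python
def letter_ngrams(text, n=3):
    """Return the set of character n-grams of a record."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(records, threshold=0.5, n=3):
    """Assign each record to the first cluster whose representative
    (the n-gram set of its first member) is similar enough."""
    clusters = []  # list of (representative n-gram set, member list)
    for rec in records:
        grams = letter_ngrams(rec, n)
        for rep, members in clusters:
            if jaccard(grams, rep) >= threshold:
                members.append(rec)
                break
        else:
            clusters.append((grams, [rec]))
    return [members for _, members in clusters]


groups = greedy_cluster(
    ["login ok user=al", "login ok user=bo", "disk full on /var"]
)
```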

18. The method in 13 wherein alignment is done via dynamic programming.

19. The method in 10 wherein separation is done iteratively.

20. The method in 19 wherein iterative separation is guided by the user.

21. The method in 1 wherein a special symbol is chosen to represent digits and textual data is regularized by replacing each digit with this special symbol.
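
The digit regularization in claim 21 amounts to a single substitution. In this sketch the special symbol `#` and the restriction to ASCII digits are assumptions.

```python
import re

DIGIT_SYMBOL = "#"  # assumed special symbol


def regularize_digits(text):
    """Replace every ASCII digit with DIGIT_SYMBOL."""
    return re.sub(r"[0-9]", DIGIT_SYMBOL, text)


regularized = regularize_digits("order 4711 shipped 2004-07-20")
```

After this step, records that differ only in numeric values become identical strings, which makes them easier to cluster.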

22. The method in 12 wherein the clustering algorithm used is one of k-means, Expectation-Maximization clustering or hierarchical clustering.

23. The method in 18 wherein a pair of different aligned tokens in an alignment of two templates is converted into a wildcard character in the merged template.

24. The method in 18 wherein a pair of different aligned tokens in an alignment of two templates is converted into a special symbol that is the union of values that match either the first or the second aligned token.

25. The method in 18 wherein a pair of different aligned tokens in an alignment of two templates is converted into a regular expression that matches both the first and the second aligned token.
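
Claims 18 and 23 together describe aligning two templates by dynamic programming and turning differing aligned tokens into a wildcard. The sketch below uses a unit-cost edit-distance alignment over whitespace tokens. The cost model, the `*` wildcard, and the choice to also map unmatched (gap) tokens to the wildcard are assumptions.

```python
WILDCARD = "*"  # assumed wildcard symbol


def align(a, b):
    """Align token lists a and b with unit-cost edit-distance dynamic
    programming. Returns (token_a, token_b) pairs; None marks a gap."""
    n, m = len(a), len(b)
    # cost[i][j] is the minimum edit cost of aligning a[:i] with b[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def merge_templates(t1, t2):
    """Keep tokens on which both templates agree; replace the rest with WILDCARD."""
    return " ".join(
        x if x == y else WILDCARD for x, y in align(t1.split(), t2.split())
    )


merged = merge_templates("user alice logged in", "user bob logged in")
```

The variants in claims 24 and 25 would change only the expression inside `merge_templates`, for instance emitting an alternation of the two tokens instead of the wildcard.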

26. The method in 1 wherein a post processing step examines all token sequences that match a specific wild card character in a template and replace the wild card character with a regular expression that matches all matching token strings.
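
One way to picture the post-processing step in claim 26 is a function that takes every string observed at a wildcard position and returns a tighter regular expression. The two rules below (a digit-run pattern when all values are numeric, an escaped alternation otherwise) are illustrative assumptions, and `generalize_wildcard` is a hypothetical name.

```python
import re


def generalize_wildcard(values):
    """Return a regular expression that matches every observed value."""
    values = sorted(set(values))
    if not values:
        raise ValueError("at least one observed value is required")
    if all(re.fullmatch(r"[0-9]+", v) for v in values):
        lengths = {len(v) for v in values}
        lo, hi = min(lengths), max(lengths)
        # Fixed-width digit run if all values share one length, else a range.
        return f"[0-9]{{{lo}}}" if lo == hi else f"[0-9]{{{lo},{hi}}}"
    # Fallback: literal alternation of the observed values.
    return "(?:" + "|".join(re.escape(v) for v in values) + ")"


order_pattern = generalize_wildcard(["4711", "0042"])
name_pattern = generalize_wildcard(["alice", "bob"])
```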

27. The method in 1 wherein a first set of templates is derived in a first step, all data items matching the first set of templates are removed from the collection, and template analysis is then applied to the remainder.

28. The method in 27 wherein the template induction and document deletion steps are applied in more than one iteration.
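
The loop described in claims 27 and 28 can be sketched as follows. The induction step is passed in as a function (`induce`, hypothetical), templates are assumed to be whitespace-separated tokens with `*` matching one token, and the round limit is an assumption added to guarantee termination.

```python
import re


def template_to_regex(template, wildcard="*"):
    """Compile a template into a regex where the wildcard matches one token."""
    parts = [
        r"\S+" if tok == wildcard else re.escape(tok) for tok in template.split()
    ]
    return re.compile(r"\s+".join(parts))


def iterative_induction(records, induce, max_rounds=5):
    """Repeatedly induce templates, drop the records they match, and
    re-run induction on the remainder. Returns (templates, unmatched)."""
    remaining = list(records)
    all_templates = []
    for _ in range(max_rounds):
        templates = induce(remaining)
        if not templates:
            break
        patterns = [template_to_regex(t) for t in templates]
        unmatched = [
            r for r in remaining if not any(p.fullmatch(r) for p in patterns)
        ]
        if len(unmatched) == len(remaining):
            break  # the new templates matched nothing, so stop
        all_templates.extend(templates)
        remaining = unmatched
        if not remaining:
            break
    return all_templates, remaining
```

Removing matched records between rounds lets rarer record types, which were drowned out by the dominant templates in the first pass, surface in later passes.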

29. A method for processing semi-structured records for making computer based business decisions, the method comprising:
- receiving a set of semi-structured records from a real business process, each of the records being in a first format;
- storing the set of semi-structured records in the first format into one or more memories;
- tokenizing the set of semi-structured records into one or more strings of token elements;
- clustering the one or more strings of token elements associated with the set of semi-structured records into a plurality of clusters;
- identifying one or more low frequency tokens in each of the clusters in the plurality of clusters;
- replacing at least one of the low-frequency tokens with a predetermined wildcard character in at least one of the clusters to convert at least one of the records in the set of semi-structured records in the first format into a second format;
- storing the set of semi-structured records in the second format into one or more memories;
- selecting one or more of the semi-structured records in the second format;
- outputting the selected one or more set of semi-structured records in the second format;
- associating at least one of the semi-structured records in the first format with a respective semi-structured record in the second format;
- identifying a fixed component and a variable component in at least one of the semi-structured records in the first format;
- determining one or more patterns associated with the identified fixed component and variable component; and
- displaying at least one of the patterns.

30. A method for processing semi-structured data, the method comprising:
- receiving semi-structured data into a first format from a real business process, the semi-structured data being machine generated;
- tokenizing the semi-structured data into a second format and storing the semi-structured data in the second format into one or more memories;
- clustering the tokenized data to form a plurality of clusters;
- identifying at least one selected low frequency term in at least one of the clusters; and
- processing at least one of the clusters and the associated selected low frequency term to form at least two records in the second format to form a single template for at least the two records.

31. The method of claim 30 wherein the processing the at least two records and the associated selected low frequency term replaces the selected low frequency term with a wild card character.

32. The method of claim 30 further comprising storing each of the clusters into one or more memories.

33. The method of claim 30 wherein the processing two or more records merges the two or more records into the single template.

34. The method of claim 31 wherein the wild card character represents a variable field in the template.

35. The method of claim 30 further comprising displaying each of the templates to a human user.

36. A computer based system for processing semi-structured data, the system comprising one or more memories, the one or more memories including:
- one or more codes directed to receiving a set of semi-structured records into a first format from a real business process, the semi-structured records being machine generated;
- one or more codes directed to tokenizing the set of semi-structured records into a second format and storing the set of semi-structured data in the second format into one or more memories;
- one or more codes directed to clustering the tokenized data to form a plurality of clusters;
- one or more codes directed to identifying a selected low frequency term in at least one of the clusters; and
- one or more codes directed to processing at least two records in the second format and the selected low frequency term to form a single template for the at least two records.

37. A computer based system for processing semi-structured records for making computer based business decisions, the system comprising one or more memories, the one or more memories including:
- one or more codes directed to receiving a set of semi-structured records from a real business process, each of the records being in a first format;
- one or more codes directed to storing the set of semi-structured records in the first format into the one or more memories;
- one or more codes directed to tokenizing the set of semi-structured records into one or more strings of token elements;
- one or more codes directed to clustering the one or more strings of token elements associated with the set of semi-structured records into a plurality of clusters;
- one or more codes directed to identifying one or more low frequency tokens in each of the clusters in the plurality of clusters;
- one or more codes directed to replacing at least one of the low-frequency tokens with a predetermined wildcard character in at least one of the clusters to convert the semi-structured records in the first format into a second format;
- one or more codes directed to storing the set of semi-structured records in the second format into one or more memories;
- one or more codes directed to selecting one or more of the semi-structured records in the second format;
- one or more codes directed to outputting the selected one or more semi-structured records in the second format;
- one or more codes directed to associating at least one of the semi-structured records in the first format with a respective semi-structured record in the second format;
- one or more codes directed to identifying a fixed component and a variable component in at least one of the semi-structured records in the first format;
- one or more codes directed to determining one or more patterns associated with the identified fixed component and variable component; and
- one or more codes directed to displaying at least one of the patterns.

Current Assignee: OpenSpan, Inc. (Pegasystems Incorporated)
Original Assignee: Enkatatechnologies Incorporated
Inventors: Yu, Chia-Hao; Velipasaoglu, Omer Emre; Stukov, Stan; Schuetze, Hinrich H.
Granted Patent: US 7,389,306 B2
US Class Current: 1/1

CPC Class Codes

G06F 16/88   Mark-up to mark-up conversi...
Y10S 707/944   Business related
Y10S 707/954   Relational
Y10S 707/99942   Manipulating data structure...
Y10S 707/99943   Generating database or data...
Y10S 707/99944   Object-oriented database st...
Y10S 707/99945   Object-oriented database st...
