System and method for processing semi-structured business data using selected template designs

US 7,389,306 B2
Filed: 07/20/2004
Issued: 06/17/2008
Est. Priority Date: 07/25/2003
Status: Expired due to Fees

Abstract

A method for processing semi-structured data. The method includes receiving semi-structured data into a first format from a real business process. Preferably, the semi-structured data are machine generated. The method includes tokenizing the semi-structured data into a second format and storing the semi-structured data in the second format into one or more memories and clustering the tokenized data to form a plurality of clusters. The method also includes identifying a selected low frequency term in each of the clusters, and processing at least two of the clusters and the associated selected low frequency terms to form a single template for the at least two of the clusters. In a preferred embodiment, the method replaces the selected low frequency term with a wild card character.
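The steps named in the abstract (tokenize, cluster, find low-frequency terms, substitute a wildcard) can be illustrated with a minimal Python sketch. It is not the patented implementation. Several choices here are assumptions made only for illustration: whitespace tokenization, grouping records by token count as a stand-in for a real clustering algorithm, the 0.5 frequency threshold, and all function names. Merging templates across clusters is not shown.

```python
from collections import Counter, defaultdict

WILDCARD = "*"  # assumed wildcard symbol


def tokenize(record):
    """Split a raw record into whitespace-delimited tokens."""
    return record.split()


def cluster_by_length(token_lists):
    """Stand-in clustering (assumption): group records by token count."""
    clusters = defaultdict(list)
    for tokens in token_lists:
        clusters[len(tokens)].append(tokens)
    return list(clusters.values())


def induce_template(cluster, min_share=0.5):
    """Keep the dominant token at each position; replace positions whose
    most common token is rare (share below min_share) with the wildcard."""
    n = len(cluster)
    template = []
    for position in zip(*cluster):
        token, count = Counter(position).most_common(1)[0]
        template.append(token if count / n >= min_share else WILDCARD)
    return template


records = [
    "user 1001 logged in",
    "user 1002 logged in",
    "user 1003 logged in",
    "disk full",
]
clusters = cluster_by_length([tokenize(r) for r in records])
templates = [induce_template(c) for c in clusters]
print(templates)  # [['user', '*', 'logged', 'in'], ['disk', 'full']]
```

In this toy input the user identifier varies from record to record, so it is the low-frequency position and becomes the wildcard, while the repeated tokens are kept as the fixed part of the template.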

52 Citations


29 Claims

1. A method for processing semi-structured data for the support of business decisions, the method comprising:
inputting semi-structured data in a first format from a real business process;

converting the semi-structured data in a first format into a second format;

storing the semi-structured data in the second format into memory;

processing the stored information with the components of one or more business decisions to couple the business decision with the semi-structured data;

separating the semi-structured data in the second format into a fixed part, a structured data part and an unstructured data part;

identifying one or more relational class related to the business decision in the structured part;

replacing members of the relational class with special symbols to create a rewritten corpus;

determining a template from one or more aggregate patterns coupled to the identified relational class from the rewritten corpus;

displaying the relational class, the template, and the business decision; and

determining from variable fields within the template, an indicator of a business event.

2. The method in 1 wherein the semi-structured data is processed together with structured data or unstructured data or both.

3. The method in 1 wherein the separation of semi-structured data is done interactively.

4. The method in 1 further comprising detecting template drift by comparing the template to a second template of the structured data part to identify changed templates, new templates, or missing templates.

5. The method in 1 wherein the structured data are extracted and stored in a relational database.

6. The method in 5 wherein the structured data are further processed by mapping them into a set of data types commonly used in relational databases.

7. The method in 5 wherein the structured data are further processed by mapping them into a set of data types provided by a user.

8. The method in 5 wherein the structured data are further processed by mapping them into a set of data types used in the business.

9. The method of claim 5 further comprising using the extracted relational data as input for text classification.

10. The method in 1 wherein the semi-structured data consist of a set of text records.

11. The method in 10 wherein the records are separated into three parts based on parsing them into tokens and identifying each token as being fixed, structured or unstructured.

12. The method in 11 wherein the parsing into tokens is based on user-supplied, commonly used or statistically inferred delimiters.

13. The method in 11 wherein the fixed data is separated based on clustering records and identifying frequent tokens in a cluster as fixed.

14. The method in 13 wherein clustering is letter n-gram clustering.

15. The method in 13 wherein the clustering algorithm used is one of k-means, Expectation-Maximization clustering or hierarchical clustering.

16. The method in 11 wherein the separation of structured and unstructured parts is based on aligning records and identifying common tokens as fixed, and identifying non-common tokens as unstructured or structured.

17. The method in 16 wherein alignment is done via dynamic programming.

18. The method in 17 wherein a pair of different aligned tokens in an alignment of two templates is converted into a wildcard character in the merged template.

19. The method in 17 wherein a pair of different aligned tokens in an alignment of two templates is converted into a special symbol that is the union of values that match either the first or the second aligned token.

20. The method in 17 wherein a pair of different aligned tokens in an alignment of two templates is converted into a regular expression that matches both the first and the second aligned token.

21. The method in 11 wherein a variable part of a record is identified as structured if it can be mapped into a typical relational database numerical format.

22. The method in 11 wherein a variable part is identified as structured if it can be mapped into a relational database field commonly used in the business.

23. The method in 11 wherein a variable part is identified as unstructured if it did not pass any of the tests for structured data.

24. The method in 11 wherein separation is done iteratively.

25. The method in 24 wherein iterative separation is guided by the user.

26. The method in 1 wherein the special symbols are chosen to represent digits and textual data is regularized by replacing each digit with this special symbol.

27. The method in 1 wherein a post processing step examines all token sequences that match a specific wild card character in a template and replace the wild card character with a regular expression that matches all matching token strings.

28. The method in 1 wherein a first set of templates is derived in a first step, all data items matching the first set of templates are removed from the collection, and template analysis is then applied to the remainder.

29. The method in 28 wherein the template induction and document deletion steps are applied in more than one iteration.
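Claims 17 and 18 describe aligning two templates by dynamic programming and turning each pair of differing aligned tokens into a wildcard. The sketch below is one possible reading, not the patented implementation. It assumes a standard edit-distance alignment with unit costs, and it also emits a wildcard where a token aligns with a gap, which the claims do not specify. The function name `merge_templates` and the `*` symbol are hypothetical.

```python
WILDCARD = "*"  # assumed wildcard symbol


def merge_templates(a, b):
    """Align token lists a and b with edit-distance dynamic programming,
    then merge: equal aligned tokens are kept, differing aligned tokens
    (and, by assumption, tokens aligned to a gap) become a wildcard."""
    m, n = len(a), len(b)
    # cost[i][j] = minimal edit cost of aligning a[:i] with b[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i
    for j in range(1, n + 1):
        cost[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitute = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(substitute, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    merged = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            merged.append(a[i - 1] if a[i - 1] == b[j - 1] else WILDCARD)
            i -= 1
            j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            merged.append(WILDCARD)  # token of a aligned to a gap
            i -= 1
        else:
            merged.append(WILDCARD)  # token of b aligned to a gap
            j -= 1
    merged.reverse()
    return merged


print(merge_templates("user alice logged in".split(),
                      "user bob logged in".split()))
# ['user', '*', 'logged', 'in']
```

Claims 19 and 20 describe alternatives in which the differing pair becomes a union symbol or a regular expression rather than a plain wildcard; in this sketch that would only change what is appended in the mismatch branch.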

Current Assignee: OpenSpan, Inc. (Pegasystems Incorporated)
Original Assignee: Enkata Technologies, Inc.
Inventors: Schuetze, Hinrich H.; Yu, Chia-Hao; Velipasaoglu, Omer Emre; Stukov, Stan
Primary Examiner(s): WOO, ISAAC M
Application Number: US10/895,624
Publication Number: US 20050065967A1
Time in Patent Office: 1,428 Days
Field of Search: 707/1-10, 707/100-104.1, 707/200-206
US Class Current: 707/602
CPC Class Codes:
G06F 16/88   Mark-up to mark-up conversi...
Y10S 707/944   Business related
Y10S 707/954   Relational
Y10S 707/99942   Manipulating data structure...
Y10S 707/99943   Generating database or data...
Y10S 707/99944   Object-oriented database st...
Y10S 707/99945   Object-oriented database st...
