System and method for continuous diagnosis of data streams

US 7,464,068 B2
Filed: 06/30/2004
Issued: 12/09/2008
Est. Priority Date: 06/30/2004
Status: Expired due to Fees

Abstract

In connection with the mining of time-evolving data streams, there is provided a general framework that mines changes and reconstructs models from a data stream with unlabeled instances or a limited number of labeled instances. In particular, there are defined herein statistical profiling methods that extend a classification tree in order to estimate the percentage of drifts in the data stream without any labeled data. The exact error can be estimated by actively sampling a small number of true labels. If the estimated error is significantly higher than empirical expectations, a small number of true labels is preferably re-sampled to reconstruct the decision tree from the leaf node level.
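The idea of estimating drift without labels can be illustrated with a minimal sketch. It assumes, purely for illustration, that the "statistical profile" is the fraction of examples reaching each leaf of the tree, and that drift is scored as half the L1 distance between the training-time and stream-time leaf distributions. The names `leaf_distribution`, `leaf_drift`, and `route`, and the features `age` and `income`, are hypothetical and do not come from the patent text.

```python
from collections import Counter


def leaf_distribution(leaf_ids):
    """Return the fraction of examples that landed in each leaf."""
    counts = Counter(leaf_ids)
    total = sum(counts.values())
    return {leaf: n / total for leaf, n in counts.items()}


def leaf_drift(train_leaf_ids, stream_leaf_ids):
    """Half the L1 distance between two leaf-occupancy distributions.

    The score lies in [0, 1]: 0 means identical occupancy, 1 means the
    stream reaches only leaves that the training data never reached.
    This particular statistic is an assumption made for illustration.
    """
    p = leaf_distribution(train_leaf_ids)
    q = leaf_distribution(stream_leaf_ids)
    leaves = set(p) | set(q)
    return sum(abs(p.get(leaf, 0.0) - q.get(leaf, 0.0)) for leaf in leaves) / 2


def route(x):
    """Hypothetical two-split tree: return the id of the leaf that x reaches."""
    if x["age"] < 30:
        return "young"
    return "low_income" if x["income"] < 50 else "high_income"


# No class labels are needed: only the leaf each example falls into.
train = [{"age": 25, "income": 40}] * 50 + [{"age": 45, "income": 80}] * 50
stream = [{"age": 25, "income": 40}] * 25 + [{"age": 45, "income": 80}] * 75
drift = leaf_drift([route(x) for x in train], [route(x) for x in stream])
```

Because the score uses only leaf membership, it can be recomputed on every window of the stream at the cost of one tree traversal per example.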

12 Citations

9 Claims

1. An apparatus for facilitating the mining of time-evolving data streams, said apparatus comprising:
- an input arrangement for accepting a data stream comprising unlabeled data; and
- an arrangement for determining an amount of drifts in the data stream comprising unlabeled data;
- said determining arrangement:
  - employs a signature profile of an inductive model in determining an amount of drifts in the data stream;
  - reconstructs the inductive model via actively acquiring true labels for a small sample of the unlabeled data in the data stream in order to estimate loss, wherein the inductive model is reconstructed if the estimated loss is more than an empirically determined threshold; and
  - employs statistical measures to estimate the error rate of the inductive model;
- wherein reconstruction of an original decision tree comprises at least one of:
  - updating a class probability distribution in leaf nodes in the tree; and
  - extending leaf nodes in the tree.

- 2. The apparatus according to claim 1, wherein said determining arrangement determines a percentage of drifts in the data stream.
- 3. The apparatus according to claim 1, wherein said determining arrangement employs a signature profile in reconstructing the inductive model via minor model replacement.
- 4. The apparatus according to claim 1, wherein said determining arrangement employs statistical measures to define the profile of the inductive model.
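The loss-estimation step recited in claims 1 to 4 (acquire true labels for a small sample, estimate loss, reconstruct if the loss exceeds a threshold) might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: it uses 0-1 loss, simple random sampling, and a binomial standard error, and the names `estimate_loss`, `needs_reconstruction`, `predict`, and `get_true_label` are hypothetical. The threshold value is a placeholder; the claims only say it is determined empirically.

```python
import math
import random


def estimate_loss(stream, predict, get_true_label, sample_size, rng):
    """Estimate 0-1 loss from a small, randomly chosen labeled sample.

    `get_true_label` stands in for whatever (possibly costly) process
    supplies true labels; it is called only on the sampled examples.
    Returns the estimated loss and its binomial standard error.
    """
    if not stream:
        raise ValueError("cannot estimate loss on an empty stream window")
    sample = rng.sample(stream, min(sample_size, len(stream)))
    errors = sum(predict(x) != get_true_label(x) for x in sample)
    n = len(sample)
    loss = errors / n
    std_err = math.sqrt(loss * (1 - loss) / n)
    return loss, std_err


def needs_reconstruction(loss, threshold):
    """Trigger reconstruction only when the estimate exceeds the threshold."""
    return loss > threshold


# Toy usage: a model that always predicts 0 on a stream where every
# fourth example is actually class 1.
window = list(range(100))
loss, std_err = estimate_loss(
    window,
    predict=lambda x: 0,
    get_true_label=lambda x: 1 if x % 4 == 0 else 0,
    sample_size=20,
    rng=random.Random(0),
)
rebuild = needs_reconstruction(loss, threshold=0.1)  # threshold is a placeholder
```

Reporting the standard error alongside the estimate makes it possible to judge whether a sample of the chosen size is large enough to trust the decision.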

5. A method of facilitating the mining of time-evolving data streams, said method comprising the steps of:
- accepting a data stream comprising unlabeled data; and
- determining an amount of drifts in the data stream comprising unlabeled data;
- said determining step comprising:
  - employing a signature profile of an inductive model in determining an amount of drifts in the data stream;
  - reconstructing the inductive model via actively acquiring true labels for a small sample of the unlabeled data in the data stream in order to estimate loss, wherein the inductive model is reconstructed if the estimated loss is more than an empirically determined threshold; and
  - employing statistical measures to estimate the error rate of the inductive model;
- wherein reconstruction of an original decision tree comprises at least one of:
  - updating a class probability distribution in leaf nodes in the tree; and
  - extending leaf nodes in the tree.

- 6. The method according to claim 5, wherein said determining step comprises determining a percentage of drifts in the data stream.
- 7. The method according to claim 5, wherein said employing step comprises employing a signature profile in reconstructing the inductive model via minor model replacement.
- 8. The method according to claim 5, wherein said determining step comprises employing statistical measures to define the profile of the inductive model.
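One of the two reconstruction options named in claim 5, updating the class probability distribution in leaf nodes, can be sketched as below. The sketch assumes the tree structure is kept fixed and that each leaf's distribution is simply replaced by the class frequencies of the newly labeled examples that reach it; leaves that receive no new labels keep their old distribution. The function `update_leaf_distributions`, the `route` callable, and the toy data are hypothetical. Extending leaf nodes, the other option, would additionally require choosing new splits and is not shown.

```python
from collections import Counter, defaultdict


def update_leaf_distributions(leaf_probs, labeled_sample, route):
    """Return a copy of `leaf_probs` refreshed from newly labeled examples.

    leaf_probs:     {leaf_id: {class_label: probability}}
    labeled_sample: iterable of (example, true_label) pairs
    route:          function mapping an example to the leaf it reaches
    """
    counts_by_leaf = defaultdict(Counter)
    for x, y in labeled_sample:
        counts_by_leaf[route(x)][y] += 1

    updated = dict(leaf_probs)  # leaves without new labels stay unchanged
    for leaf, counts in counts_by_leaf.items():
        total = sum(counts.values())
        updated[leaf] = {label: n / total for label, n in counts.items()}
    return updated


# Toy usage with a one-split tree over a single numeric feature.
old_probs = {
    "left": {"pos": 0.9, "neg": 0.1},
    "right": {"pos": 0.2, "neg": 0.8},
}
new_labels = [(1, "neg"), (2, "neg"), (3, "pos"), (4, "neg")]
new_probs = update_leaf_distributions(
    old_probs, new_labels, route=lambda x: "left" if x < 5 else "right"
)
```

Replacing the distribution outright discards the old evidence for that leaf; blending old and new counts would be an alternative design, which the claim language neither requires nor excludes.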

9. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for facilitating the mining of time-evolving data streams, said method comprising the steps of:
- accepting a data stream comprising unlabeled data; and
- determining an amount of drifts in the data stream comprising unlabeled data;
- said determining step comprising:
  - employing a signature profile of an inductive model in determining an amount of drifts in the data stream;
  - reconstructing the inductive model via actively acquiring true labels for a small sample of the unlabeled data in the data stream in order to estimate loss, wherein the inductive model is reconstructed if the estimated loss is more than an empirically determined threshold; and
  - employing statistical measures to estimate the error rate of the inductive model;
- wherein reconstruction of an original decision tree comprises at least one of:
  - updating a class probability distribution in leaf nodes in the tree; and
  - extending leaf nodes in the tree.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Yu, Philip S.; Fan, Wei; Wang, Haixun
Primary Examiner(s): Lee, Wilson
Assistant Examiner(s): Ho, Binh Van
Application Number: US10/880,913
Publication Number: US 20060010093A1
Time in Patent Office: 1,623 Days
Field of Search: 707/5, 707/6, 707/1
US Class Current: 1/1
CPC Class Codes:
G06F 16/40   of multimedia data, e.g. sl...
G06F 16/906   Clustering; Classification
G06F 2216/03   Data mining
Y10S 707/99931   Database or file accessing
Y10S 707/99935   Query augmenting and refini...
