System and method for continuous diagnosis of data streams
Abstract
In connection with the mining of time-evolving data streams, a general framework is provided that mines changes and reconstructs models from a data stream with unlabeled instances or a limited number of labeled instances. In particular, there are defined herein statistical profiling methods that extend a classification tree in order to estimate the percentage of drifts in the data stream without any labeled data. The exact error can then be estimated by actively sampling a small number of true labels. If the estimated error is significantly higher than empirical expectations, a small number of true labels is preferably re-sampled to reconstruct the decision tree from the leaf node level.
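The label-free drift estimate described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the "signature profile" is the fraction of instances routed to each leaf of the tree, and that drift is scored as half the L1 distance between the training-time and stream-time leaf distributions. The function names and the leaf identifiers are hypothetical and are not taken from the patent text.

```python
from collections import Counter


def leaf_distribution(leaf_ids):
    """Return the fraction of instances that fall into each leaf."""
    if not leaf_ids:
        raise ValueError("need at least one instance")
    counts = Counter(leaf_ids)
    total = len(leaf_ids)
    return {leaf: n / total for leaf, n in counts.items()}


def leaf_change_statistic(train_leaf_ids, stream_leaf_ids):
    """Half the L1 distance between two leaf-occupancy distributions.

    0.0 means the stream fills the leaves exactly as the training data did;
    1.0 means the two sets of instances share no leaves at all.
    No class labels are needed, only the leaf each instance reaches.
    """
    p_train = leaf_distribution(train_leaf_ids)
    p_stream = leaf_distribution(stream_leaf_ids)
    leaves = set(p_train) | set(p_stream)
    return sum(
        abs(p_train.get(leaf, 0.0) - p_stream.get(leaf, 0.0)) for leaf in leaves
    ) / 2


# Hypothetical leaf assignments produced by routing instances through the tree.
train_leaves = ["a", "a", "b", "c"]
stream_leaves = ["a", "b", "b", "b"]
drift_score = leaf_change_statistic(train_leaves, stream_leaves)
```

A score near zero suggests the stream still resembles the training data; a larger score is a cue to spend labeling effort on estimating the actual error.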
9 Claims
1. An apparatus for facilitating the mining of time-evolving data streams, said apparatus comprising:
an input arrangement for accepting a data stream comprising unlabeled data; and
an arrangement for determining an amount of drifts in the data stream comprising unlabeled data;
said determining arrangement:
employs a signature profile of an inductive model in determining an amount of drifts in the data stream;
reconstructs the inductive model via actively acquiring true labels for a small sample of the unlabeled data in the data stream in order to estimate loss, wherein the inductive model is reconstructed if the estimated loss is more than an empirically determined threshold; and
employs statistical measures to estimate the error rate of the inductive model;
wherein reconstruction of an original decision tree comprises at least one of:
updating a class probability distribution in leaf nodes in the tree; and
extending leaf nodes in the tree.
5. A method of facilitating the mining of time-evolving data streams, said method comprising the steps of:
accepting a data stream comprising unlabeled data; and
determining an amount of drifts in the data stream comprising unlabeled data;
said determining step comprising:
employing a signature profile of an inductive model in determining an amount of drifts in the data stream;
reconstructing the inductive model via actively acquiring true labels for a small sample of the unlabeled data in the data stream in order to estimate loss, wherein the inductive model is reconstructed if the estimated loss is more than an empirically determined threshold; and
employing statistical measures to estimate the error rate of the inductive model;
wherein reconstruction of an original decision tree comprises at least one of:
updating a class probability distribution in leaf nodes in the tree; and
extending leaf nodes in the tree.
9. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for facilitating the mining of time-evolving data streams, said method comprising the steps of:
accepting a data stream comprising unlabeled data; and
determining an amount of drifts in the data stream comprising unlabeled data;
said determining step comprising:
employing a signature profile of an inductive model in determining an amount of drifts in the data stream;
reconstructing the inductive model via actively acquiring true labels for a small sample of the unlabeled data in the data stream in order to estimate loss, wherein the inductive model is reconstructed if the estimated loss is more than an empirically determined threshold; and
employing statistical measures to estimate the error rate of the inductive model;
wherein reconstruction of an original decision tree comprises at least one of:
updating a class probability distribution in leaf nodes in the tree; and
extending leaf nodes in the tree.
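The loss-estimation and reconstruction steps recited in the claims can be sketched as follows. This is an illustrative reading only, under several assumptions: the loss is taken to be 0-1 loss, the "statistical measure" is a normal-approximation interval around the sampled error rate, and reconstruction is limited to recomputing class probabilities at each leaf from the newly labeled sample. The names `estimate_error`, `update_leaf_distributions`, and `diagnose` are hypothetical, and the small labeled sample is assumed to have been drawn already.

```python
import math
from collections import Counter, defaultdict


def estimate_error(predicted, true_labels, z=1.96):
    """Mean 0-1 loss on the labeled sample and a normal-approximation half-width."""
    n = len(true_labels)
    if n == 0 or n != len(predicted):
        raise ValueError("predicted and true_labels must be non-empty and aligned")
    err = sum(p != t for p, t in zip(predicted, true_labels)) / n
    half_width = z * math.sqrt(err * (1.0 - err) / n)
    return err, half_width


def update_leaf_distributions(leaf_ids, true_labels):
    """Recompute per-leaf class probabilities from the freshly labeled sample."""
    counts = defaultdict(Counter)
    for leaf, label in zip(leaf_ids, true_labels):
        counts[leaf][label] += 1
    return {
        leaf: {label: n / sum(c.values()) for label, n in c.items()}
        for leaf, c in counts.items()
    }


def diagnose(leaf_ids, predicted, true_labels, threshold):
    """Return (estimated error, new leaf distributions or None).

    The leaf distributions are rebuilt only when the estimated error
    exceeds the empirically chosen threshold (an assumed input here).
    """
    err, _ = estimate_error(predicted, true_labels)
    if err > threshold:
        return err, update_leaf_distributions(leaf_ids, true_labels)
    return err, None


# Hypothetical sample: the leaf reached, the tree's prediction, and the true label.
sample_leaves = ["a", "a", "b", "b"]
sample_predicted = [0, 0, 1, 1]
sample_true = [0, 1, 1, 1]
error, new_leaves = diagnose(sample_leaves, sample_predicted, sample_true, threshold=0.1)
```

Extending leaf nodes, the other reconstruction option named in the claims, would require growing new splits below a leaf and is not covered by this sketch.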
Specification