Method and system for linearly detecting data deviations in a large database
Abstract
A method for detecting deviations in a database is disclosed, comprising the steps of: determining respective frequencies of occurrence for the attribute values of the data items, and identifying any itemset whose similarity value satisfies a predetermined criterion as a deviation, based on the frequencies of occurrence. The determination of the frequencies of occurrence includes computing an overall similarity value for the database, and for each first itemset, computing a difference between the overall similarity value and the similarity value of a second itemset. The second itemset has all the data items except those of the first itemset. Preferably, a smoothing factor is used for indicating how much dissimilarity in an itemset can be reduced by removing a subset of items from the itemset. The smoothing factor is evaluated as each item is incrementally removed from the itemset, thereby allowing a data item to be identified as a deviation when the difference in similarity value is the highest.
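The idea of a smoothing factor described in the abstract can be illustrated with a small sketch. This is not the patented linear-time procedure: it is a brute-force illustration that assumes population variance as the dissimilarity measure and defines the smoothing factor simply as the drop in dissimilarity after a subset is removed. The function names `dissimilarity`, `smoothing_factor`, and `most_deviant_item` are hypothetical and do not come from the patent.

```python
from statistics import pvariance


def dissimilarity(values):
    """Assumed dissimilarity measure: population variance (0.0 for fewer than 2 values)."""
    return float(pvariance(values)) if len(values) > 1 else 0.0


def smoothing_factor(values, removed_indices):
    """Assumed definition: how much dissimilarity drops when the given items are removed."""
    removed = set(removed_indices)
    remaining = [v for i, v in enumerate(values) if i not in removed]
    return dissimilarity(values) - dissimilarity(remaining)


def most_deviant_item(values):
    """Index of the single item whose removal reduces dissimilarity the most.

    Brute force: one pass over the data per candidate item, so this is
    quadratic in the number of items, not linear.
    """
    return max(range(len(values)), key=lambda i: smoothing_factor(values, [i]))


if __name__ == "__main__":
    data = [1, 4, 4, 4]
    print(most_deviant_item(data))
```

For the data `[1, 4, 4, 4]`, removing the item `1` leaves three identical values, so that item yields the largest reduction in dissimilarity under these assumptions.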
68 Citations
15 Claims
1. A method for detecting deviations in a database having a plurality of data items, each data item being characterized by an attribute value, each subset of the data items being an itemset, and each itemset having a similarity value based on the attribute values of the data items in the itemset, the method comprising the steps of:
determining a frequency of occurrence for each attribute value;
identifying any itemset whose similarity value satisfies a predetermined deviation criterion as a deviation, based on relative frequencies of occurrence of the attribute values; and
computing a smoothing factor that represents a reduction in similarity between any two itemsets when a subset of the data items common to the two itemsets is disregarded.
Dependent claims: 2, 3, 4, 5
6. A computer program product for use with a computer system for detecting deviations in a database having a plurality of data items, each data item being characterized by an attribute value, each subset of the data items being an itemset, and each itemset having a similarity value based on the attribute values of the data items in the itemset, the computer program product comprising:
a computer-readable medium;
means, provided on the computer-readable medium, for directing the system to determine a frequency of occurrence for each attribute value;
means, provided on the computer-readable medium, for directing the system to identify any itemset whose similarity value satisfies a predetermined deviation criterion as a deviation, based on relative frequencies of occurrence of the attribute values; and
means, provided on the computer-readable medium, for directing the system to compute a smoothing factor that represents a reduction in similarity between any two itemsets when a subset of the data items common to the two itemsets is disregarded.
Dependent claims: 7, 8, 9, 10
11. A computer-based system for detecting deviations in a database having a plurality of data items, each data item being characterized by an attribute value, each subset of the data items being an itemset, and each itemset having a similarity value based on the attribute values of the data items in the itemset, the system comprising:
means for determining a frequency of occurrence for each attribute value;
means for identifying any itemset whose similarity value satisfies a predetermined deviation criterion as a deviation, based on relative frequencies of occurrence of the attribute values; and
means for computing a smoothing factor that represents a reduction in similarity between any two itemsets when a subset of the data items common to the two itemsets is disregarded.
Dependent claims: 12, 13, 14, 15
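The first two elements recited in the independent claims (determining frequencies of occurrence, then identifying deviations from relative frequencies) can be sketched as follows. The claims do not say what the "predetermined deviation criterion" is, so the threshold rule used here, along with the names `relative_frequencies`, `flag_deviations`, and `max_relative_frequency`, is a hypothetical choice made purely for illustration.

```python
from collections import Counter


def relative_frequencies(attribute_values):
    """Map each distinct attribute value to its share of all data items."""
    counts = Counter(attribute_values)
    total = len(attribute_values)
    return {value: count / total for value, count in counts.items()}


def flag_deviations(attribute_values, max_relative_frequency=0.2):
    """Return indices of items whose attribute value is rare.

    Assumed criterion (not taken from the claims): an item is a deviation
    when the relative frequency of its attribute value is at or below
    max_relative_frequency.
    """
    freqs = relative_frequencies(attribute_values)
    return [
        index
        for index, value in enumerate(attribute_values)
        if freqs[value] <= max_relative_frequency
    ]


if __name__ == "__main__":
    values = ["a"] * 9 + ["z"]
    print(flag_deviations(values))
```

Counting the attribute values takes a single pass over the data, and the flagging step takes a second pass, so this simplified criterion scales linearly with the number of data items.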
Specification