Detection of patterns in data records
First Claim
1. A computer-implemented method for processing data comprising character strings, the method comprising:
receiving an input from a user comprising positive examples and negative examples of a specified data type, the positive examples comprising first character strings that belong to the specified data type, and the negative examples comprising second character strings that do not belong to the specified data type, wherein receiving the input comprises displaying a collection of the character strings, and accepting from the user an indication of which of the character strings to include in the specified data type and which of the character strings to exclude from the specified data type;
processing the first and second character strings to create a set of attributes that characterize the positive examples, wherein the processing comprises:
assigning a different character code from character codes to each character type of characters forming the first and second character strings; and
encoding the first and second character strings as sequences of the character codes without distinguishing between sequential characters of the same type, wherein one character code is assigned to letters and another character code is assigned to digits;
building a decision tree, based on the attributes, which when applied to the first and second strings, distinguishes the positive examples from the negative examples; and
applying the decision tree to the data so as to classify a character string of the character strings as belonging to the specified data type.
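The encoding step recited above can be illustrated with a minimal Python sketch. The one-letter codes (`L` for letters, `D` for digits, `S` for whitespace, `P` for anything else) and the function names are assumptions made for illustration. The claim requires only that letters and digits receive different codes and that runs of same-type characters are not distinguished.

```python
from itertools import groupby


def char_type_code(ch: str) -> str:
    """Map one character to a type code (hypothetical code alphabet)."""
    if ch.isalpha():
        return "L"  # letters
    if ch.isdigit():
        return "D"  # digits
    if ch.isspace():
        return "S"  # whitespace
    return "P"      # punctuation and everything else


def encode(s: str) -> str:
    """Encode a string as a sequence of type codes, collapsing each run of
    consecutive same-type characters into a single code."""
    codes = (char_type_code(ch) for ch in s)
    return "".join(code for code, _run in groupby(codes))


print(encode("AB-1234"))     # LPD
print(encode("2023-01-15"))  # DPDPD
```

Under this sketch, "AB-1234" and "XYZ-9" both encode to `LPD`, so strings with the same layout but different run lengths share a single pattern.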
Abstract
A computer-implemented method for processing data includes receiving an input from a user including positive examples and negative examples of a specified data type. The positive examples include first character strings that belong to the specified data type, and the negative examples include second character strings that do not belong to the specified data type. The first and second character strings are processed to create a set of attributes that characterize the positive examples. A decision tree is built, based on the attributes, which when applied to the first and second strings, distinguishes the positive examples from the negative examples. The decision tree is applied to the data so as to identify occurrences of the specified data type.
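As a rough illustration of the flow the abstract describes, the following Python sketch derives a few boolean attributes from user-labeled examples, builds a small decision tree, and applies it to new strings. The attribute names, the Gini-based greedy split, and the sample strings are all assumptions chosen for illustration. The source does not specify which attributes or which tree-building algorithm are used.

```python
from itertools import groupby


def encode(s: str) -> str:
    """Run-collapsed type codes: L = letter, D = digit, P = other (assumed)."""
    def code(ch: str) -> str:
        return "L" if ch.isalpha() else "D" if ch.isdigit() else "P"
    return "".join(k for k, _ in groupby(map(code, s)))


def attributes(s: str) -> dict:
    """Hypothetical boolean attributes computed from the encoded string."""
    e = encode(s)
    return {
        "starts_with_letter": e.startswith("L"),
        "contains_digit": "D" in e,
        "contains_other": "P" in e,
        "single_run": len(e) == 1,
    }


def gini(labels: list) -> float:
    """Gini impurity of a list of boolean labels."""
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def build_tree(rows: list, labels: list):
    """Return a bool leaf or a tuple (attribute, yes_subtree, no_subtree)."""
    if len(set(labels)) == 1:
        return labels[0]  # pure leaf
    best = None
    for name in rows[0]:
        yes = [l for r, l in zip(rows, labels) if r[name]]
        no = [l for r, l in zip(rows, labels) if not r[name]]
        if not yes or not no:
            continue  # this attribute does not split the examples
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if best is None or score < best[0]:
            best = (score, name)
    if best is None:
        return 2 * sum(labels) >= len(labels)  # majority-vote leaf
    name = best[1]
    yes_part = [(r, l) for r, l in zip(rows, labels) if r[name]]
    no_part = [(r, l) for r, l in zip(rows, labels) if not r[name]]
    return (
        name,
        build_tree([r for r, _ in yes_part], [l for _, l in yes_part]),
        build_tree([r for r, _ in no_part], [l for _, l in no_part]),
    )


def classify(tree, s: str) -> bool:
    """Walk the tree using the attributes of s until a leaf is reached."""
    attrs = attributes(s)
    while isinstance(tree, tuple):
        name, yes_branch, no_branch = tree
        tree = yes_branch if attrs[name] else no_branch
    return tree


# Made-up positive and negative examples of a "part number" data type.
positives = ["AB-1234", "XY-9", "QRS-77"]
negatives = ["1234", "hello", "12-AB"]
examples = positives + negatives
labels = [True] * len(positives) + [False] * len(negatives)
tree = build_tree([attributes(s) for s in examples], labels)

print(classify(tree, "CD-55"))  # True
print(classify(tree, "55-CD"))  # False
```

With these examples the tree separates every positive from every negative. A real system would likely draw on a richer attribute set than the four used here.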
36 Claims
1. A computer-implemented method for processing data comprising character strings, recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. Apparatus for processing data comprising character strings, comprising:
a user interface, which is coupled to receive an input from a user comprising positive examples and negative examples of a specified data type, the positive examples comprising first character strings that belong to the specified data type, and the negative examples comprising second character strings that do not belong to the specified data type, wherein the user interface is arranged to display a collection of the character strings, and to accept from the user an indication of which of the character strings to include in the specified data type and which of the character strings to exclude from the specified data type; and
a pattern processor, which is arranged to:
process the first and second character strings to create a set of attributes that characterize the positive examples, wherein the processing comprises:
assigning a different character code from character codes to each character type of characters forming the first and second character strings, and
encoding the first and second character strings as sequences of the character codes without distinguishing between sequential characters of the same type, wherein one character code is assigned to letters and another character code is assigned to digits;
build a decision tree, based on the attributes, which when applied to the first and second strings, distinguishes the positive examples from the negative examples; and
apply the decision tree to the data so as to classify a character string of the character strings as belonging to the specified data type.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
25. A computer software product for processing data comprising character strings, the product comprising a computer-readable storage medium in which program instructions are stored, the program instructions, when executed by a computer, cause the computer to:
receive an input from a user comprising positive examples and negative examples of a specified data type, the positive examples comprising first character strings that belong to the specified data type, and the negative examples comprising second character strings that do not belong to the specified data type, wherein the instructions cause the computer to display a collection of the character strings, and to accept from the user an indication of which of the character strings to include in the specified data type and which of the character strings to exclude from the specified data type;
process the first and second character strings to create a set of attributes that characterize the positive examples, wherein the processing comprises:
assigning a different character code from character codes to each character type of characters forming the first and second character strings, and
encoding the first and second character strings as sequences of the character codes without distinguishing between sequential characters of the same type, wherein one character code is assigned to letters and another character code is assigned to digits;
build a decision tree, based on the attributes, which when applied to the first and second strings, distinguishes the positive examples from the negative examples; and
apply the decision tree to the data so as to classify a character string of the character strings as belonging to the specified data type.
Dependent claims: 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36
Specification