Classification of data records by comparison of records to a training database using probability weights
First Claim
1. A system for classifying natural language data, comprising:
means for storing a new record including a plurality of predictor data fields containing the natural language data expressed in natural language values, means for storing a plurality of training records, each training record including a plurality of predictor data fields, each predictor data field containing a feature, wherein each feature is a natural language term, and a target data field containing a target value representing a classification of a training record, and probability weight means for storing, for each feature, a probability weight value representing a probability that a new record will have the target value contained in the target data field if a feature contained in a corresponding predictor data field occurs in the new record, query means for extracting features from the new record and querying the training records with each feature extracted from the new record, the query means being responsive to a match between a feature extracted from the new record and a feature stored in said training record for providing the probability weight corresponding to the feature, and metric means for receiving the probability weights from the query means and accumulating for each training record a comparison score representing the probability that said training record matches the new record, and providing an output indicating said target field value of said training record as said target value of the new record.
11 Assignments
0 Petitions
Abstract
Classification of natural language data wherein the natural language data has an open-ended range of possible values or the data values do not have a relative order. A training database stores training records, wherein each training record includes predictor data fields, each containing a feature, wherein each feature is a natural language term, and a target data field containing a target value representing a classification of the record. Features may also include conjunctions of natural language terms and each feature may also be a member of a category subset of features. The training database stores, for each feature, a probability weight value representing the probability that a record will have the target value contained in the target data field if a feature contained in a corresponding predictor data field occurs in the record. Features are extracted from a new record and each feature from the new record is used to query the training records to determine the probability weights from the training records having matching features. The probability weights are accumulated for each training record to determine a comparison score representing the probability that the training record matches the new record and to provide an output indicating the training records most probably matching the new record.
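The classification scheme the abstract describes can be illustrated with a short serial sketch. This is a non-authoritative approximation only: the claims below cover data-parallel embodiments, and the toy records, field names, and the sum-of-weights comparison metric shown here are assumptions made for illustration.

```python
from collections import defaultdict

# Toy training records: predictor fields hold natural language terms
# (features); the target field holds the classification. All data invented.
training_records = [
    {"features": {"engine", "stalls", "cold"}, "target": "fuel_system"},
    {"features": {"engine", "noise", "idle"},  "target": "valve_train"},
    {"features": {"brakes", "squeal", "cold"}, "target": "brake_pads"},
]

# Probability weight for (feature, target): the probability that a record
# has the target value given that the feature occurs, estimated from counts.
feature_target_counts = defaultdict(int)
feature_counts = defaultdict(int)
for rec in training_records:
    for f in rec["features"]:
        feature_counts[f] += 1
        feature_target_counts[(f, rec["target"])] += 1

def weight(feature, target):
    return feature_target_counts[(feature, target)] / feature_counts[feature]

def classify(new_record_features):
    # Query each training record with each feature of the new record; on a
    # match, accumulate that feature's probability weight into the record's
    # comparison score, then report the best-scoring record's target value.
    scores = []
    for rec in training_records:
        score = sum(weight(f, rec["target"])
                    for f in new_record_features if f in rec["features"])
        scores.append((score, rec["target"]))
    return max(scores)  # (comparison score, predicted target value)

print(classify({"engine", "stalls", "cold"}))
```

With the toy data above, the first training record matches all three features of the new record and therefore accumulates the highest comparison score.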
269 Citations
37 Claims
1. A system for classifying natural language data, comprising:
means for storing a new record including a plurality of predictor data fields containing the natural language data expressed in natural language values, means for storing a plurality of training records, each training record including a plurality of predictor data fields, each predictor data field containing a feature, wherein each feature is a natural language term, and a target data field containing a target value representing a classification of a training record, and probability weight means for storing, for each feature, a probability weight value representing a probability that a new record will have the target value contained in the target data field if a feature contained in a corresponding predictor data field occurs in the new record, query means for extracting features from the new record and querying the training records with each feature extracted from the new record, the query means being responsive to a match between a feature extracted from the new record and a feature stored in said training record for providing the probability weight corresponding to the feature, and metric means for receiving the probability weights from the query means and accumulating for each training record a comparison score representing the probability that said training record matches the new record, and providing an output indicating said target field value of said training record as said target value of the new record. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
17. In a data parallel system, means for classifying natural language data, comprising:
a plurality of processing elements for storing a corresponding plurality of training records, each training record residing in a process in a memory associated with the corresponding processor element and including a plurality of predictor data fields, each predictor data field containing a feature, wherein each feature is a natural language term, and a target data field containing a target value representing a classification of a training record, a probability weight means for storing, for each feature, a probability weight value representing a probability that a new record will have said target value contained in the target data field if a feature contained in a corresponding predictor data field occurs in the new record, a query means for extracting features from the new record and querying the training records with each feature extracted from the new record, including a control means for storing the new record, extracting the features from the new record and transmitting the features to the processing elements, and the processing elements for reading the features from each associated training record and responsive to a match between a feature extracted from the new record and a feature stored in said training record for reading the probability weight corresponding to the feature, and a metric means for receiving the probability weights from the query means and accumulating for each training record a comparison score representing the probability that the training record matches the new record and providing an output indicating said training record most probably matching the new record, including the processing elements for receiving the probability weights, each processor being responsive to instructions from the control means for accumulating for each corresponding training record a comparison score representing the probability that the training record matches the new record, and a global combining means for providing an output indicating a training record most 
probably matching the new record.
18. A system for generating training records for use in classifying natural language data, comprising:
means for storing a plurality of historic records, each historic record including a target data field containing a target value representing a classification of a historic record, and a plurality of predictor data fields, each predictor data field containing a natural language term, and means for storing a plurality of training records, each training record including a plurality of predictor data fields, each predictor data field containing a feature, wherein each feature is a natural language term, and said target data field containing said target value representing a classification of a training record, and a probability weight memory for storing, for each feature, a probability weight value representing a probability that a new record will have the target value contained in the target data field if said feature contained in a corresponding predictor data field occurs in the new record, means for reading the natural language terms from each of the historic records and identifying each feature appearing in the historic records, probability weight generating means for selecting in turn each feature identified from the historic records, and for each feature, selecting in turn each historic record target value for which the feature appears, determining, for each historic record target value and for each feature, a probability weight, and generating, for each historic record, a corresponding training record and storing in the predictor data fields of the training record the features identified from the corresponding historic record, and in the target data field of the training record the target value from the historic record, and storing the probability weight generated for the feature and target in the probability weight memory.
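The training-record generation of claim 18 can be sketched serially: identify every feature in the historic records, then for each (feature, target) pair derive a weight as the fraction of that feature's occurrences carrying that target. The data and names below are illustrative assumptions, not from the patent.

```python
from collections import Counter

# Historic records: (target value, natural language terms). Toy data.
historic = [
    ("fuel_system", ["engine", "stalls"]),
    ("fuel_system", ["engine", "surges"]),
    ("valve_train", ["engine", "noise"]),
]

# Identify each feature appearing in the historic records, and count, for
# each feature and each target value under which it appears, co-occurrences.
n_feature = Counter()
n_feature_target = Counter()
for target, terms in historic:
    for feature in terms:
        n_feature[feature] += 1
        n_feature_target[(feature, target)] += 1

# Probability weight: fraction of the feature's occurrences with this target.
probability_weights = {
    (feature, target): n / n_feature[feature]
    for (feature, target), n in n_feature_target.items()
}

# Each historic record yields a corresponding training record: the same
# features in its predictor fields, the same target in its target field.
training_records = [{"features": terms, "target": target}
                    for target, terms in historic]

print(probability_weights[("engine", "fuel_system")])  # 2 of 3 "engine" rows
```

Here "engine" occurs in three historic records, two of them classified "fuel_system", so its weight for that target is 2/3.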
19. In a data parallel system, means for generating training records for use in classifying natural language data, comprising:
a first plurality of processing elements for storing a corresponding plurality of historic records, each historic record residing in a process in a memory of a corresponding processor element and including a target data field containing a target value representing a classification of a historic record, and a plurality of predictor data fields, each predictor data field containing a natural language term, and a second plurality of processing elements for storing a corresponding plurality of training records, each training record residing in said process in said memory of a corresponding processor element and including a plurality of predictor data fields, each predictor data field containing a feature, wherein each feature is a natural language term, and a target data field containing a target value representing a classification of a training record, a probability weight memory for storing, for each feature, a probability weight value representing a probability that a new record will have the target value contained in the target data field if said feature contained in a corresponding predictor data field occurs in the new record, a query means including a control means and the processing elements for reading the natural language terms from each of the historic records and identifying each feature appearing in the historic records, a probability weight generating means including the control means and the processing elements for selecting in turn each feature identified from the historic records, and for each feature, selecting in turn each historic record target value for which the feature appears, and, a global combining means for determining, for each historic record target value and for each feature a probability weight value representing the probability that a record will have the target value contained in the target data field if said feature contained in a corresponding predictor data field occurs in the record, and storing in the predictor data fields of each 
training record the features identified from the corresponding historic record, and in the target data fields of the corresponding training record the target value from the historic record, and storing the probability weight generated for the feature and target value in a probability weight memory.
20. In a data parallel system, means for classifying natural language data, comprising:
a plurality of processing elements for storing a corresponding plurality of training records, each training record including a plurality of predictor data fields, each predictor data field containing a feature, wherein each feature is a natural language term, and a target data field containing a target value representing a classification of a training record, control means for storing a new record containing features comprised of the natural language data, and for broadcasting the features of the new record to the processing elements storing the training records, each processing element being responsive to the broadcast features of the new record to construct, in a process associated with each training data record, a boolean comparison table for storing indications of matches between the broadcast new record features and the features of the corresponding training record, scanning means for reading the indications of matches in each boolean comparison table and determining, for each broadcast feature of the new record and each target value of the training records, a number of indications of matches between the broadcast feature and the features of the training record for each target value, and a number of indications of matches between the broadcast feature and the features of the training record over all target values, and for determining, for each broadcast feature and each target value, a probability weight representing a probability that the new sample will have the target value of the record if the new sample feature appears in the record. - View Dependent Claims (21, 22, 23)
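The boolean-comparison-table step of claim 20 can be approximated serially: each processing element's table is modeled as one dict of match indications per training record, and the per-target and over-all-targets match counts give the probability weight. Data and names are illustrative assumptions.

```python
# Toy training records; in the claim each would reside in its own
# processing element, with the new record's features broadcast to all.
training_records = [
    {"features": {"engine", "stalls"}, "target": "fuel_system"},
    {"features": {"engine", "noise"},  "target": "valve_train"},
    {"features": {"engine", "stalls"}, "target": "fuel_system"},
]
new_record_features = ["engine", "stalls"]

# One boolean comparison table per training record: a match indication
# for each broadcast feature of the new record.
tables = [
    {f: f in rec["features"] for f in new_record_features}
    for rec in training_records
]

# Scan the tables: for a broadcast feature, count matches for one target
# value and over all target values, then form the probability weight.
def weight(feature, target):
    per_target = sum(t[feature] for t, r in zip(tables, training_records)
                     if r["target"] == target)
    over_all = sum(t[feature] for t in tables)
    return per_target / over_all

print(weight("stalls", "fuel_system"))
```

"stalls" matches only the two "fuel_system" records, so its weight for that target is 1.0, while "engine" matches all three records and gets 2/3.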
24. In a data parallel system, means for constructing probability weight tables for storing probability weights of a plurality of features of each of a plurality of training records for use in classifying natural language data, wherein each training record includes a plurality of predictor data fields, each predictor data field containing a feature, wherein each feature is a natural language term, and a target data field containing a target value representing a classification of a record, wherein a probability weight is a probability that a new sample of a record containing natural language data will have the target value of a training record if a feature of the training record appears in the new sample of the record, comprising:
a plurality of processors for storing a corresponding plurality of record feature structures, each record feature structure corresponding to said training record and including a plurality of record feature structure entries, each record feature structure entry containing a feature or a possible conjunction of features of the corresponding training record, and wherein the record feature structure entries of each record feature structure are sorted according to the values of the keys formed by a concatenation of the values of the features of each entry and the target value of the corresponding training record, a plurality of processors for storing a corresponding plurality of probability weight tables, each probability weight table corresponding to said training record and containing an entry for each possible conjunction of features of the training record, a single feature of said training record being represented in the corresponding probability weight table as a conjunction of the single feature with itself, means for selecting sets of record feature structures, wherein each set of record feature structures has a common value for a feature portion of their keys, and selecting, within each set of record feature structures, a plurality of subsets of record feature structures, wherein the record feature structures of each subset of record feature structures have a common target field value, and wherein each subset of record feature structures corresponds to a feature of the record feature structure, means for determining, for each set of record feature structures, the number of record feature structures in the set of record feature structures, and for each subset of record feature structures in the set of record feature structures, the number of record feature structures in the subset, and for each subset in the set of record feature structures, dividing the number of record feature structures in the subset by the number of record feature structures in the set
of record feature structures to determine a conjunctive probability weight of the corresponding feature of the record feature structure, and writing the conjunctive probability weight for each feature of each record feature structure into a corresponding entry of the probability weight table of the corresponding training record.
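Claim 24's conjunctive-weight construction reduces to: group entries by their feature-conjunction key, sub-group by target value, and divide the subset count by the set count. The serial sketch below stands in for the sorted-key, data-parallel grouping of the claim; the entries are invented for illustration, and a single feature is keyed as the conjunction of the feature with itself.

```python
from collections import Counter

# One entry per record feature structure: (feature conjunction key, target).
# In the claim these are sorted by key so sets and subsets are contiguous;
# here Counter performs the equivalent serial grouping.
entries = [
    (("engine", "stalls"), "fuel_system"),
    (("engine", "stalls"), "fuel_system"),
    (("engine", "stalls"), "valve_train"),
    (("engine", "engine"), "fuel_system"),  # single feature with itself
]

set_sizes = Counter(key for key, _ in entries)  # per feature key (the set)
subset_sizes = Counter(entries)                 # per (key, target) subset

# Conjunctive probability weight: |subset with this target| / |set|.
conjunctive_weights = {
    (key, target): n / set_sizes[key]
    for (key, target), n in subset_sizes.items()
}

print(conjunctive_weights[(("engine", "stalls"), "fuel_system")])
```

The conjunction ("engine", "stalls") appears three times, twice with "fuel_system", so that entry of the weight table is 2/3.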
25. In a data parallel system including a means for storing a plurality of weight tables for storing probability weights of features of each of a plurality of training records, wherein a probability weight is a probability that a new sample of a record containing natural language data will have the target value of a training record if a feature of the training record appears in the new sample of the record, means for classifying new records containing natural language data, comprising:
a plurality of processors for storing a corresponding plurality of training records, each training record including a plurality of predictor data fields, each predictor data field containing a feature, wherein each feature is a natural language term, and a target data field containing a target value representing a classification of said training record, a plurality of processors for storing a corresponding plurality of probability weight tables, each probability weight table corresponding to said training record and containing an entry for each possible conjunction of features of the training record, a single feature of said training record being represented in the corresponding probability weight table as a conjunction of the single feature with itself, a control means for storing a new record containing features comprised of natural language data, and for broadcasting the features of the new sample to the processors storing the training records, each processor being responsive to the broadcast features of the new sample to construct, in a process associated with each training data record, a boolean comparison table for storing indications of matches between the broadcast new sample features and the features of the corresponding training record, scanning means for performing logical AND operations on each combination of the match indications in each boolean comparison table to find conjunctive feature matches, wherein each AND operation represents a conjunctive feature, and, for each conjunctive feature match found in a boolean comparison table, using the values of the conjunctive features resulting in the match as indices into the probability weight table of the corresponding training record and reading from the probability weight table a conjunctive probability weight of the conjunctive feature. - View Dependent Claims (26)
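Claim 25's conjunctive matching can be sketched for one training record: AND the match indications of each combination of broadcast features, and on a conjunctive match use the feature values to index the record's probability weight table. A single feature is stored as the conjunction of the feature with itself, so pairs drawn with replacement cover both cases. The record, the table, and its weights below are invented for illustration.

```python
from itertools import combinations_with_replacement

# One training record; in the claim each record and its weight table
# reside in a processor, with the new record's features broadcast to all.
training_record = {"features": {"engine", "stalls", "cold"},
                   "target": "fuel_system"}

# Probability weight table: one entry per possible conjunction of the
# record's features (keys sorted so lookup is order-independent).
weight_table = {
    ("engine", "engine"): 0.50, ("stalls", "stalls"): 1.00,
    ("cold", "cold"): 0.50,     ("engine", "stalls"): 0.90,
    ("cold", "engine"): 0.40,   ("cold", "stalls"): 0.60,
}

new_record_features = ["engine", "stalls"]
# Boolean comparison table: one match indication per broadcast feature.
matches = {f: f in training_record["features"] for f in new_record_features}

# AND each combination of match indications; on a conjunctive match, use
# the conjoined feature values as indices into the weight table.
found = {}
for a, b in combinations_with_replacement(new_record_features, 2):
    if matches[a] and matches[b]:          # logical AND of the indications
        key = tuple(sorted((a, b)))
        found[key] = weight_table[key]

print(found)
```

Both broadcast features match, so the single features and their pairwise conjunction all yield weight lookups.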
27. In a data parallel system including a plurality of processing elements, each processing element including a memory for storing data and an associated processor to operate on the memory for performing operations on the data residing in the memory, and control means for issuing instructions for directing operations of the system, each processor being responsive to the instructions for performing the operations in parallel on the data stored in the associated memory, a method for classifying natural language data, comprising steps of:
storing a plurality of training records in a corresponding plurality of processor elements, each training record residing in a process in the memory of the processor element and including a plurality of predictor data fields, each predictor data field containing a feature, wherein each feature is a natural language term, and a target data field containing a target value representing a classification of a training record, and storing in a probability weight memory, for each feature, a probability weight value representing a probability that a new record will have the target value contained in the target data field if a feature contained in a corresponding predictor data field occurs in the new record, querying the training records with each feature extracted from a new record, by storing the new record in the control means and, by operation of the control means, extracting the features from the new record and transmitting the features to the processors of the processing elements, and in the processors of the processing elements, reading the features from each associated training record, and responsive to each match between a feature extracted from the new record and a feature stored in a training record, reading the probability weight corresponding to the feature, and accumulating, in the processing element for each corresponding training record and according to a selected metric, a comparison score representing the probability that said training record matches the new record and providing an output indicating said training record most probably matching the new record, and selecting the target field value of a training record as a target value of the new record.
28. A method for implementing in a data parallel system which includes a plurality of processing elements, each processing element including a memory for storing data and an associated processor to operate on the memory for performing operations on the data residing in the memory, a global combining means for performing operations on outputs of the processing elements, and control means for issuing instructions for directing operations of the system, each processor being responsive to the instructions for performing the operations in parallel on the data stored in the associated memory, said method for generating training records for use in classifying natural language data, comprising steps of:
storing in a first plurality of processing elements a corresponding plurality of historic records, each historic record residing in a process in the memory of a corresponding processor element and including a target data field containing a target value representing a classification of a historic record, and a plurality of predictor data fields, each predictor data field containing a natural language term, and storing in a second plurality of processing elements a corresponding plurality of training records, each training record residing in said process in the memory of a corresponding processor element and including a plurality of predictor data fields, each predictor data field containing a feature, wherein each feature is a natural language term, and a target data field containing a target value representing a classification of a training record, and by operation of the control means and the processing elements, reading the natural language terms from each of the historic records and identifying each feature appearing in the historic records, generating a probability weight for each feature, including by operation of the control means and the processing elements, selecting in turn each feature identified from the historic records, and for each feature, selecting in turn each historic record target value for which the feature appears, and, by operation of the global combining means, determining, for each historic record target value and for each feature, a probability weight value representing a probability that a record will have the target value contained in the target data field if said feature contained in a corresponding predictor data field occurs in the record, and storing in the predictor data fields of each training record the features identified from the corresponding historic record, and in the target data fields of the corresponding training record the target value from the historic record, and storing in a probability weight memory, for each feature, the probability weight value generated for the feature and target value.
29. A method for implementing in a data parallel system which includes a plurality of processing elements, each processing element including a memory for storing data and an associated processor to operate on the memory for performing operations on the data residing in the memory, a global combining means for performing operations on outputs of the processing elements, and control means for issuing instructions for directing operations of the data parallel system, each processor of each processing element being responsive to the instructions for performing the operations in parallel on the data stored in the memory of the processing element, said method for classifying natural language data, comprising the steps of:
storing in the processing elements a plurality of training records, each training record including a plurality of predictor data fields, each predictor data field containing a feature, wherein each feature is a natural language term, and a target data field containing a target value representing a classification of a training record, storing a new record containing features comprised of the natural language data in the control means and, by operation of the control means, broadcasting the features of the new record to the processing elements storing the training records, in each processing element, and responsive to the broadcast features of the new record, constructing in a process associated with each training data record a boolean comparison table for storing indications of matches between the broadcast new record features and the features of the corresponding training record, in the processing elements, scanning each of the boolean comparison tables for indications of matches and determining, for each broadcast feature and each target value of the training records, a number of indications of matches between the broadcast feature and the features of the training record for each target value, and a number of indications of matches between the broadcast feature and the features of the training record over all target values, and determining, for each broadcast feature and each target value, a probability weight representing a probability that the new record will have the target value of a training record if a new sample feature appears in the training record. - View Dependent Claims (30, 31, 32)
33. In a data parallel system which includes a plurality of processing elements, each processing element including a memory for storing data and an associated processor to operate on the memory for performing operations on the data residing in the memory, a global combining means for performing operations on outputs of the processing elements, and control means for issuing instructions for directing operations of the data parallel system, each processor being responsive to the instructions for performing the operations in parallel on the data stored in the associated memory, a method for constructing probability weight tables for storing probability weights of a plurality of features of each of a plurality of training records for use in classifying natural language data, wherein each training record includes a plurality of predictor data fields, each predictor data field containing a feature, wherein each feature is a natural language term, and a target data field containing a target value representing a classification of a record, wherein a probability weight is a probability that a new sample of a record containing natural language data will have the target value of a training record if a feature of the training record appears in the new sample of the record, comprising the steps of:
storing in a first plurality of processors a corresponding plurality of record feature structures, each record feature structure corresponding to said training record and including a plurality of record feature structure entries, each record feature structure entry containing a feature or a possible conjunction of features of the corresponding training record, and wherein the record feature structure entries of each record feature structure are sorted according to the values of the keys formed by the concatenation of the values of the features of each entry and the target value of the corresponding training record, storing in a second plurality of processors a corresponding plurality of probability weight tables, each probability weight table corresponding to said training record and containing an entry for each possible conjunction of features of the training record, a single feature of said training record being represented in the corresponding probability weight table as a conjunction of the single feature with itself, in the processors, selecting sets of record feature structures, wherein each set of record feature structures has a common value for a feature portion of their keys, and selecting, within each set of record feature structures, a plurality of subsets of record feature structures, wherein the record feature structures of each subset of record feature structures have a common target field value, and wherein each subset of record feature structures corresponds to a feature of the record feature structure, in the processors, determining, for each set of record feature structures, a number of record feature structures in the set of record feature structures, and for each subset of record feature structures in the set of record feature structures, the number of record feature structures in the subset, and for each subset in the set of record feature structures, dividing the number of record feature structures in the subset by the number of record feature structures in the
set of record feature structures to determine a conjunctive probability weight of the corresponding feature of the record feature structure, and writing the conjunctive probability weight for each feature of each record feature structure into a corresponding entry of the probability weight table of the corresponding training record.
34. A method for implementing in a data parallel system which includes a plurality of processing elements, each processing element including a memory for storing data and an associated processor to operate on the memory for performing operations on the data residing in the memory, a global combining means for performing operations on outputs of the processing elements, and control means for issuing instructions for directing operations of the system, each processor of each processing element being responsive to the instructions for performing the operations in parallel on the data stored in the memory of the processing element, said method for classifying new records containing natural language data, comprising the steps of:
storing in a plurality of processors a corresponding plurality of training records, each training record including a plurality of predictor data fields, each predictor data field containing a feature, wherein said feature is a natural language term, and a target data field containing a target value representing a classification of a training record, storing in a plurality of processors a corresponding plurality of probability weight tables, each probability weight table corresponding to said training record and containing a probability weight entry for each possible conjunction of features of the training record, wherein a single feature of said training record is represented in the corresponding probability weight table as a conjunction of the single feature with itself, and wherein a probability weight is a probability that a new sample record containing natural language data will have the target value of said training record if said feature of the training record appears in the new record, storing a new record containing features comprised of natural language data and broadcasting the features of the new sample to the processors storing the training records, in each processor and responsive to the broadcast features of the new sample, constructing in a process associated with each training data record a boolean comparison table for storing indications of matches between the broadcast new sample features and the features of the corresponding training record, in the processors, performing logical AND scanning operations on each combination of the indications of matches in each boolean comparison table to find conjunctive feature matches, wherein each AND operation represents a conjunctive feature, and, for each indication of said conjunctive feature match found in a boolean comparison table, using the values of the conjunctive features resulting in the indication of said conjunctive feature match as indices into the probability weight table of the corresponding training record
and reading from the probability weight table the conjunctive probability weight of the conjunctive feature. - View Dependent Claims (35)
36. A system for classifying natural language data, comprising:
means for storing a new record including a plurality of predictor data fields containing the natural language data expressed in natural language values, means for storing a plurality of training records, each training record including a plurality of predictor data fields, each predictor data field containing a feature, wherein said feature is a natural language term, said feature is a member of one of a plurality of category subsets of features, and said feature appearing in identical form in a multiplicity of the category subsets comprises a corresponding multiplicity of separate and distinct features, and a target data field containing a target value representing a classification of said training record, and probability weight means for storing, for said feature, a probability weight value representing a probability that a new record will have the target value contained in the target data field if said feature contained in a corresponding predictor data field occurs in the new record, query means for extracting features from the new record and querying the training records with said feature extracted from the new record, the query means being responsive to a match between said feature extracted from the new record and said feature stored in a training record for providing the probability weight corresponding to the feature, and metric means for receiving the probability weights from the query means and accumulating for said training record a comparison score representing the probability that said training record matches the new record, and providing an output indicating a target field value of said training record as a target value of the new record.
37. A system for comparing a new data record to training data records, comprising:
means for storing a new record including a plurality of data fields containing new record data values, means for storing a plurality of training records, each training record including a plurality of data fields containing training record data values, probability weight means for storing a probability weight for each training record data value, each probability weight value representing a probability that a new record will have a match with said training record if a data value in a training record data field occurs in a new record data field, and comparison means for querying the training records with each data value of the new record, accumulating for said training record a comparison score of the probability weights for each match between a new record data value and a training record data value, and providing an output indicating a comparison score.
Specification