Method of translating free-format data records into a normalized format based on weighted attribute variants
Abstract
A facility is provided for normalizing the format of stored data records using a dictionary that is generated from a training set of data records having predefined formats.
7 Claims
1. A system for generating a normalized version of a free-formatted data record, said data record being characterized by a plurality of data words, said system comprising
a dictionary comprising sequences of data words associated with respective attribute fields corresponding to respective attribute fields defining a normalized data record and associated with respective weight values,
means for partitioning said free-formatted record into a plurality of n-word tuples, where n≧1,
means for associating each of said n-word tuples, that compares with a respective one of said sequences, with the associated one of said attribute fields and a respective score value, said score value being derived as a function of the associated one of said weight values, and for identifying for each of said attribute fields the associated n-word tuples having the largest score values, and
means for forming different combinations from individual ones of the identified n-word tuples and output as said normalized data record that one of said combinations formed from those of said identified n-word tuples having score values which produce the largest sum.
(Dependent claims: 2, 3)
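The elements of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The dictionary contents, the names `DICTIONARY`, `n_word_tuples`, `candidates_by_field` and `normalize`, the cap of three words per tuple, the use of the weight itself as the score, and the rule that chosen tuples may not share a word are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical dictionary: word sequence -> list of (attribute field, weight).
DICTIONARY = {
    ("john", "smith"): [("name", 5.0)],
    ("smith",): [("name", 2.0), ("street", 1.0)],
    ("main", "street"): [("street", 4.0)],
    ("springfield",): [("city", 3.0), ("name", 0.5)],
}

def n_word_tuples(words, max_n=3):
    """Yield (start, tuple) for every run of 1..max_n consecutive words."""
    for start in range(len(words)):
        for n in range(1, max_n + 1):
            if start + n <= len(words):
                yield start, tuple(words[start:start + n])

def candidates_by_field(words, dictionary, top_k=2):
    """For each attribute field, keep the top_k matching tuples by score."""
    found = {}
    for start, tup in n_word_tuples(words):
        for field, weight in dictionary.get(tup, []):
            # Assumption: the score is simply the dictionary weight.
            found.setdefault(field, []).append((weight, start, tup))
    return {f: sorted(c, reverse=True)[:top_k] for f, c in found.items()}

def overlaps(a, b):
    """True if two (score, start, tuple) candidates share a word position."""
    return a[1] < b[1] + len(b[2]) and b[1] < a[1] + len(a[2])

def normalize(record, dictionary=DICTIONARY):
    """Return (field -> text, score sum) for the highest-scoring combination."""
    words = record.lower().split()
    cands = candidates_by_field(words, dictionary)
    fields = sorted(cands)
    best, best_sum = {}, float("-inf")
    # Try every combination; a field may also be left empty (None).
    for combo in product(*[cands[f] + [None] for f in fields]):
        chosen = [c for c in combo if c is not None]
        # Assumption: a word may be used by at most one attribute field.
        if any(overlaps(a, b)
               for i, a in enumerate(chosen) for b in chosen[i + 1:]):
            continue
        total = sum(c[0] for c in chosen)
        if total > best_sum:
            best_sum = total
            best = {f: " ".join(c[2])
                    for f, c in zip(fields, combo) if c is not None}
    return best, best_sum
```

With this toy dictionary, `normalize("John Smith Main Street Springfield")` assigns "john smith" to `name`, "main street" to `street` and "springfield" to `city`. The exhaustive search over combinations is chosen for clarity only; it grows quickly with the number of fields and candidates.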
4. A system for converting a free-formatted data record into a normalized data record, said free-formatted data record being characterized by a plurality of data components, said system comprising
a plurality of test records each having a predefined format and each being formed from a plurality of data attributes corresponding to respective attribute fields defining a normalized data record,
means for partitioning the data components forming each of said test records into respective specific tokens and free matching tokens, for associating each of said specific and said free matching tokens with respective ones of said data attributes based on said predefined format and for storing each of said specific and free matching tokens and its associated one of said attributes in a dictionary if it has not yet been stored therein or for incrementing an associated counter if it has been so stored,
means, responsive to receipt of said free-formatted record from a source of information records, for partitioning the data components forming said record into a number of different sets of specific tokens and a number of different sets of free-matching tokens, for generating a score for each of the tokens of each of said sets that matches a corresponding token stored in said dictionary, in which said score is a function of the contents of the counter associated with the corresponding token, and
means for forming the tokens associated with scores into different combinations and for outputting as the normalized version of said record that one of said combinations that is formed from those tokens having scores which produce the largest sum.
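The dictionary-building step of claim 4 (store a token with its attribute on first sight, otherwise increment its counter) might be sketched as follows. The claim text does not define how a "free matching token" is derived, so the character-class pattern used here (digits become `#`, letters become `@`) is purely an assumption. The names `train` and `scores_for`, and the use of the raw counter value as the score, are likewise assumptions.

```python
import re
from collections import Counter

def specific_token(component):
    """The literal data component, lowercased."""
    return component.lower()

def free_matching_token(component):
    """Assumed generalization: each digit -> '#', each letter -> '@'."""
    return re.sub(r"[A-Za-z]", "@", re.sub(r"\d", "#", component))

def train(test_records):
    """Build the counter dictionary from pre-formatted test records.

    test_records: iterable of records, each a list of
    (attribute, component) pairs given by the predefined format.
    """
    counts = Counter()
    for record in test_records:
        for attribute, component in record:
            # Counter stores a new key at 1 or increments an existing one.
            counts[("specific", specific_token(component), attribute)] += 1
            counts[("free", free_matching_token(component), attribute)] += 1
    return counts

def scores_for(counts, component):
    """Return {attribute: score} for one component of a new record.

    Assumption: the score is the sum of the matching counter values.
    """
    wanted = {("specific", specific_token(component)),
              ("free", free_matching_token(component))}
    out = Counter()
    for (kind, token, attribute), n in counts.items():
        if (kind, token) in wanted:
            out[attribute] += n
    return dict(out)
```

For example, after training on two records that each pair a five-digit string with a hypothetical `zip` attribute, an unseen five-digit string still scores for `zip` through the shared `#####` pattern, even though its literal value was never stored.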
5. A method of converting data record having a free-formatted format into a data record having a normalized format, said free-formatted data record being characterized by a plurality of data words, said method comprising the steps of
generating a dictionary comprising sequences of data words associated with respective attribute fields and associated with respective weight values,
partitioning said free-formatted record into a plurality of ordered n-word tuples, where n≧1,
associating each of said n-word tuples, that compares with a respective one of said sequences, with the associated one of said attribute fields and a respective score value, said score value being derived as a function of the associated one of said weight values, and identifying for each of said attribute fields the associated n-word tuples having the largest score values, and
forming different combinations from individual ones of the identified n-word tuples and outputting as said normalized data record that one of said combinations formed from those of said identified n-word tuples having score values which produce the largest sum.
6. A method of translating a data record comprising a plurality of data words and having a free-formatted format into a data record having a normalized format, said method comprising the steps of
partitioning said free-formatted data record into a number of different combinations of data words,
forming different translated versions of said free-formatted data record from those of said combinations that compare with individual ones of a plurality of sequences of data words forming a dictionary of such sequences, in which each of said sequences is associated with (a) individual ones of a plurality of attribute fields defining the format of a normalized data record and (b) respective score values indicative of the likelihood of such association, and
outputting as the normalized format for said free-formatted data record that one of said different translated versions that is formed from those of said combinations that are associated with score values that produce the largest sum of score values.
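The selection rule in the final step of claim 6 can be written compactly. The notation is introduced here for illustration and does not appear in the claims: $\mathcal{T}$ is the set of candidate translated versions, each version $T$ is a set of pairs $(c, f)$ matching a word combination $c$ to an attribute field $f$, and $s(c, f)$ is the dictionary score value for that association.

```latex
T^{*} \;=\; \operatorname*{arg\,max}_{T \in \mathcal{T}} \; \sum_{(c,\, f) \in T} s(c, f)
```

Under this reading, the output $T^{*}$ is the translated version whose matched combinations have the largest total score. The claim does not say how ties are broken.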
7. A method of converting a data record formed from a plurality of data words and having a free-formatted format into a data record having a normalized format, said method comprising the steps of
storing in a memory a dictionary comprising sequences of data words associated with (a) respective attribute fields corresponding to the fields of a normalized data record and (b) respective weight values,
partitioning said free-formatted data record into a plurality of n-word tuples, where n≧1,
associating each of said n-word tuples that compares with at least a respective one said of sequences with the associated one of said attribute fields and a respective score value, said score value being derived as a function of the associated one of said weight values, and identifying for each of said attribute fields the associated n-word tuples having the largest score values, and
forming different combinations from individual ones of the identified n-word tuples and outputting as said normalized data record that one of said combinations formed from those of said identified n-word tuples having score values which produce the largest sum.
Specification