Document categorization and evaluation via cross-entropy
Abstract
A computerized data processing system for categorizing documents that applies candidate functions, such as entropy, cross-entropy, and KL-distance, to data classification is disclosed. A computerized method for categorizing documents employing the candidate functions is also disclosed. The computerized data processing system and method of this invention allow for the automatic categorization, retrieval, and filtration of documents based upon the degree and/or rate of divergence from a reference standard.
14 Claims
1. A computerized data processing system for categorizing documents by applying candidate functions to data classification comprising:
a. computer processor means for processing data;
b. storage means for storing data on a storage medium;
c. first means for creating a first fixed-size sample of data from a first document;
d. second means for creating a second fixed-size sample of data from a second document;
e. third means for determining a match length within said first document, wherein said match length comprises the longest string of consecutive characters of said second fixed-size sample of data that also appears as a string of consecutive characters in said first fixed-size sample of data;
f. fourth means for determining said match length at every successive character of said second fixed-size sample of data;
g. fifth means for determining a mean match length, wherein said mean match length comprises the total sum of said match lengths of said second fixed-size sample of data divided by the number of said characters in said second fixed-size sample of data;
h. sixth means for determining a cross-entropy between said first document and said second document, wherein said cross-entropy comprises the logarithm of the number of said characters in said first fixed-size sample of data divided by said mean match length, and wherein the number of said characters in said first fixed-size sample of data is equal to the number of said characters in said second fixed-size sample of data;
i. seventh means for determining a KL-distance from said first document to said second document, wherein said KL-distance comprises the difference between said cross-entropy of said first document and an entropy of said first document, wherein said entropy is the mean match length within said first document; and
j. eighth means for retrieving documents in a document retrieval system using at least one of the following selected from the group of said total sum of said match lengths, said mean match length, said cross-entropy, and said KL-distance.
Dependent claims: 2, 3, 4, 5, 6, 7
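Elements c through h of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names are hypothetical, the fixed-size sample is assumed to be the leading characters of each document, the phrase "the logarithm of the number of said characters ... divided by said mean match length" is read as log(n) divided by the mean match length, and the logarithm is assumed to be base 2; the claim fixes none of these choices.

```python
import math


def fixed_size_sample(text: str, size: int) -> str:
    """Assumed sampling rule: take the first `size` characters of the document."""
    if len(text) < size:
        raise ValueError("document is shorter than the requested sample size")
    return text[:size]


def match_length(first_sample: str, second_sample: str, start: int) -> int:
    """Longest run of characters of second_sample, beginning at `start`,
    that also appears as a run of consecutive characters in first_sample."""
    length = 0
    while (start + length < len(second_sample)
           and second_sample[start:start + length + 1] in first_sample):
        length += 1
    return length


def mean_match_length(first_sample: str, second_sample: str) -> float:
    """Sum of the match lengths at every character of second_sample,
    divided by the number of characters in second_sample."""
    total = sum(match_length(first_sample, second_sample, i)
                for i in range(len(second_sample)))
    return total / len(second_sample)


def cross_entropy(first_sample: str, second_sample: str) -> float:
    """Assumed reading of element h: log2(n) / mean match length,
    where n is the common size of the two samples."""
    if len(first_sample) != len(second_sample):
        raise ValueError("samples must contain the same number of characters")
    mean = mean_match_length(first_sample, second_sample)
    if mean == 0:
        return math.inf  # nothing in the second sample occurs in the first
    return math.log2(len(first_sample)) / mean


first = fixed_size_sample("abcabcab", 8)
second = fixed_size_sample("abcxabca", 8)
print(mean_match_length(first, second))  # 2.0
print(cross_entropy(first, second))      # 1.5
```

In this example the match lengths at the eight positions of the second sample are 3, 2, 1, 0, 4, 3, 2, 1, so the total sum is 16 and the mean match length is 2.0. The substring search is quadratic or worse in the sample size; a suffix-based index would be the usual choice for large samples.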
8. A computerized method for categorizing documents by applying candidate functions to data classification comprising:
a. providing a computer processor means for processing data;
b. providing a storage means for storing data on a storage medium;
c. determining a first fixed-size sample of data from a first document;
d. determining a second fixed-size sample of data from a second document;
e. determining the match length within said first document consisting of the longest string of consecutive characters in said second fixed-size sample of data that also appears as a string of consecutive characters in said first fixed-size sample of data;
f. determining said match length at every successive character of said second fixed-size sample;
g. determining a mean match length, wherein said mean match length comprises the total sum of said match lengths of said second fixed-size sample of data divided by the number of said characters in said second fixed-size sample of data;
h. determining the cross-entropy between said first document and said second document, wherein said cross-entropy comprises the logarithm of the number of said characters in said first fixed-size sample of data divided by said mean match length, wherein the number of said characters in said first fixed-size sample of data is equal to the number of said characters in said second fixed-size sample of data;
i. determining a KL-distance from said first document to said second document, wherein said KL-distance comprises the difference between said cross-entropy of said first document and an entropy of said first document, wherein said entropy is the mean match length within said first document; and
j. retrieving documents in a document retrieval system using at least one of the following selected from said total sum of said match lengths, said mean match length, said cross-entropy, or said KL-distance.
Dependent claims: 9, 10, 11, 12, 13, 14
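Steps i and j of claim 8 can be sketched in the same spirit. The sketch below rests on several assumptions that the claim does not state: all names are hypothetical; the entropy of the first document is estimated by matching a second, disjoint window of that same document against its first window, using the same log2(n) / mean-match-length form as the cross-entropy; and retrieval is modeled as sorting candidates by ascending KL-distance. It is one possible illustration, not the claimed procedure.

```python
import math


def mean_match_length(reference: str, query: str) -> float:
    """Average, over every position of query, of the longest run starting
    there that also occurs as consecutive characters in reference."""
    total = 0
    for i in range(len(query)):
        length = 0
        while i + length < len(query) and query[i:i + length + 1] in reference:
            length += 1
        total += length
    return total / len(query)


def entropy_estimate(reference: str, query: str) -> float:
    """Assumed estimator: log2(sample size) / mean match length."""
    mean = mean_match_length(reference, query)
    return math.inf if mean == 0 else math.log2(len(reference)) / mean


def kl_distance(first_doc: str, second_doc: str, size: int) -> float:
    """Cross-entropy of the pair minus the entropy estimate of the first document."""
    reference = first_doc[:size]
    own_query = first_doc[size:2 * size]   # assumption: disjoint window of the same document
    other_query = second_doc[:size]
    if min(len(reference), len(own_query), len(other_query)) < size:
        raise ValueError("documents are too short for the requested sample size")
    own = entropy_estimate(reference, own_query)
    if math.isinf(own):
        raise ValueError("first document has no self-matches at this sample size")
    return entropy_estimate(reference, other_query) - own


def rank_by_kl(first_doc: str, candidates: dict[str, str], size: int) -> list[str]:
    """Assumed retrieval rule: candidate names ordered from smallest to largest KL-distance."""
    scores = {name: kl_distance(first_doc, text, size)
              for name, text in candidates.items()}
    return sorted(scores, key=scores.get)


reference_doc = "abcabcab" * 2
candidates = {"far": "xyzxyzxy", "near": "abcxabca", "same": "abcabcab"}
print(rank_by_kl(reference_doc, candidates, 8))  # ['same', 'near', 'far']
```

A candidate that shares no characters with the reference sample receives an infinite distance and sorts last. Whether a smaller KL-distance should rank higher depends on the retrieval task; the claim only requires that at least one of the listed quantities be used.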
Specification