Document categorization and evaluation via cross-entropy

  • US 6,397,205 B1
  • Filed: 11/22/1999
  • Issued: 05/28/2002
  • Est. Priority Date: 11/24/1998
  • Status: Expired due to Fees
First Claim

1. A computerized data processing system for categorizing documents by applying candidate functions to data classification comprising:

    a. computer processor means for processing data;

    b. storage means for storing data on a storage medium;

    c. first means for creating a first fixed-size sample of data from a first document;

    d. second means for creating a second fixed-size sample of data from a second document;

    e. third means for determining a match length within said first document, wherein said match length comprises the longest string of consecutive characters of said second fixed-size sample of data that also appears as a string of consecutive characters in said first fixed-size sample of data;

    f. fourth means for determining said match length at every successive character of said second fixed-size sample of data;

    g. fifth means for determining a mean match length, wherein said mean match length comprises the total sum of said match lengths of said second fixed-size sample of data divided by the number of said characters in said second fixed-size sample of data;

    h. sixth means for determining a cross-entropy between said first document and said second document, wherein said cross-entropy comprises the logarithm of the number of said characters in said first fixed-size sample of data divided by said mean match length, and wherein the number of said characters in said first fixed-size sample of data is equal to the number of said characters in said second fixed-size sample of data;

    i. seventh means for determining a KL-distance from said first document to said second document, wherein said KL-distance comprises the difference between said cross-entropy of said first document and an entropy of said first document, wherein said entropy is the mean match length within said first document; and

    j. eighth means for retrieving documents in a document retrieval system using at least one of the following selected from the group of said total sum of said match lengths, said mean match length, said cross-entropy, and said KL-distance.
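
The quantities recited in elements c through i can be sketched in a few lines of Python. This is only an illustrative reading of the claim language, not the patentee's implementation. It rests on several assumptions: the fixed-size sample is taken as the first `size` characters of a document (the claim does not say how the sample is drawn); "the logarithm of the number of said characters ... divided by said mean match length" is read as log2(n) / mean match length; and the entropy of the first document is estimated by running the same calculation against a second sample drawn from that same document. All function names are hypothetical.

```python
import math


def fixed_size_sample(text: str, size: int) -> str:
    """Return a fixed-size sample of a document.

    Assumption: the sample is simply the first `size` characters.
    """
    if len(text) < size:
        raise ValueError("document is shorter than the requested sample size")
    return text[:size]


def match_length(reference: str, probe: str, start: int) -> int:
    """Longest run of characters beginning at probe[start] that also
    appears as consecutive characters somewhere in reference."""
    length = 0
    while (start + length < len(probe)
           and probe[start:start + length + 1] in reference):
        length += 1
    return length


def mean_match_length(reference: str, probe: str) -> float:
    """Sum of the match lengths at every character of probe,
    divided by the number of characters in probe."""
    total = sum(match_length(reference, probe, i) for i in range(len(probe)))
    return total / len(probe)


def cross_entropy(reference: str, probe: str) -> float:
    """log2(n) / mean match length, where n is the common sample size.

    Assumption: base-2 logarithm, and log(n) / L rather than log(n / L).
    """
    if len(reference) != len(probe):
        raise ValueError("both samples must have the same number of characters")
    mean = mean_match_length(reference, probe)
    if mean == 0:
        return math.inf  # the samples share no characters at all
    return math.log2(len(reference)) / mean


def kl_distance(first: str, second: str, first_other: str) -> float:
    """Cross-entropy of the first sample against the second, minus an
    entropy estimate for the first document.

    Assumption: the entropy is estimated with the same function, using
    another sample (first_other) drawn from the first document.
    """
    return cross_entropy(first, second) - cross_entropy(first, first_other)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog " * 4
    doc_b = "a quick brown dog jumps over the lazy fox " * 4
    sample_a = fixed_size_sample(doc_a, 64)
    sample_b = fixed_size_sample(doc_b, 64)
    sample_a2 = doc_a[64:128]  # second sample from the first document
    print(cross_entropy(sample_a, sample_b))
    print(kl_distance(sample_a, sample_b, sample_a2))
```

Under this reading, longer matches mean the second sample is well predicted by the first, so the cross-entropy falls as the documents become more alike. The retrieval step in element j could then rank candidate documents by any of these scores. The substring search used here is quadratic or worse in the sample size; a suffix-based index would be the natural choice for larger samples.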
