Document categorization and evaluation via cross-entrophy

US 6,397,205 B1
Filed: 11/22/1999
Issued: 05/28/2002
Est. Priority Date: 11/24/1998
Status: Expired due to Fees


Abstract

A computerized data processing system for categorizing documents that applies candidate functions, such as entropy, cross-entropy, and KL-distance, to data classification is disclosed. A computerized method for categorizing documents employing the candidate functions is also disclosed. The computerized data processing system and method of this invention allows for the automatic categorization, retrieval, and filtration of documents based upon the degree and/or rate of divergence from a reference standard.
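The candidate functions named in the abstract can be written compactly using the definitions recited in claim 1. The symbols below ($n$, $L_i$, $\bar{L}$, $\hat{H}$, $\hat{D}$, $X$, $Y$) are notation assumed here for illustration only; the patent text uses words rather than symbols, and the claims do not specify the base of the logarithm. Here $n$ is the common number of characters in the two fixed-size samples and $L_i$ is the match length found at the $i$-th character of the second sample.

```latex
\begin{align*}
\bar{L} &= \frac{1}{n}\sum_{i=1}^{n} L_i
  && \text{(mean match length over the second sample)} \\
\hat{H}(X, Y) &= \frac{\log n}{\bar{L}}
  && \text{(cross-entropy between first document $X$ and second document $Y$)} \\
\hat{D}(X \to Y) &= \hat{H}(X, Y) - \hat{H}(X)
  && \text{(KL-distance, with $\hat{H}(X)$ an entropy of $X$)}
\end{align*}
```

Claim 1 characterizes the entropy term $\hat{H}(X)$ through the mean match length within the first document; how that quantity is scaled relative to the cross-entropy is a matter of reading the claim and is not settled by this sketch.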

14 Claims

1. A computerized data processing system for categorizing documents by applying candidate functions to data classification comprising:
- a. computer processor means for processing data;
  
  b. storage means for storing data on a storage medium;
  
  c. first means for creating a first fixed-size sample of data from a first document;
  
  d. second means for creating a second fixed-size sample of data from a second document;
  
  e. third means for determining a match length within said first document, wherein said match length comprises the longest string of consecutive characters of said second fixed-size sample of data that also appears as a string of consecutive characters in said first fixed-size sample of data;
  
  f. fourth means for determining said match length at every successive character of said second fixed-size sample of data;
  
  g. fifth means for determining a mean match length, wherein said mean match length comprises the total sum of said match lengths of said second fixed-size sample of data divided by the number of said characters in said second fixed-size sample of data;
  
  h. sixth means for determining a cross-entropy between said first document and said second document, wherein said cross-entropy comprises the logarithm of the number of said characters in said first fixed-size sample of data divided by said mean match length, and wherein the number of said characters in said first fixed-size sample of data is equal to the number of said characters in said second fixed-size sample of data;
  
  i. seventh means for determining a KL-distance from said first document to said second document, wherein said KL-distance comprises the difference between said cross-entropy of said first document and an entropy of said first document, wherein said entropy is the mean match length within said first document; and
  
  j. eighth means for retrieving documents in a document retrieval system using at least one of the following selected from the group of said total sum of said match lengths, said mean match length, said cross-entropy, and said KL-distance.
  - 2. The data processing system of claim 1 further comprising categorization means for categorizing documents wherein said cross-entropy is determined between a plurality of said first documents, wherein said plurality of said first documents are reference documents, and said second document, wherein said second document is a novel document, and wherein one document selected from said first documents with a value of said cross-entropy closest to zero shall be categorized as the closest document to said second document, and wherein said document categorized as the closest document to said second document shall have its category assigned to said second document.
  - 3. The data processing system of claim 2 further comprising wherein said second document is a plurality of documents.
  - 4. The data processing system of claim 1 further comprising similarity detection means for filtering documents wherein said cross-entropy is determined between a plurality of said first documents and said second document, wherein said second document is a reference document, and wherein one document selected from said first documents with a value of said cross-entropy higher than a threshold value shall be filtered out.
  - 5. The data processing system of claim 4 further comprising wherein said second document is a plurality of documents.
  - 6. The data processing system of claim 1 further comprising similarity detection means for determining similarities in language style of a plurality of documents wherein said KL-distance is determined between a plurality of said first documents and a said second document, wherein said second document is a reference document, and wherein one document selected from said plurality of first documents having a KL-distance closest to zero is closest in similarity to said second document.
  - 7. The data processing system of claim 6 further comprising wherein said second document is a plurality of documents.
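
Elements c through h of claim 1, together with the categorization rule of claim 2, can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the brute-force substring search, the choice of the first `n` characters as the "fixed-size sample", and the base-2 logarithm are all assumptions made for the example.

```python
import math


def fixed_size_sample(text: str, n: int) -> str:
    """Assumption: the fixed-size sample is simply the first n characters."""
    if len(text) < n:
        raise ValueError("document is shorter than the sample size")
    return text[:n]


def match_lengths(first: str, second: str) -> list[int]:
    """For each position i in `second`, the length of the longest run of
    consecutive characters starting at i that also occurs in `first`."""
    lengths = []
    for i in range(len(second)):
        k = 0
        # Grow the candidate substring while it is still found in `first`.
        while i + k < len(second) and second[i:i + k + 1] in first:
            k += 1
        lengths.append(k)
    return lengths


def cross_entropy(first_doc: str, second_doc: str, n: int) -> float:
    """log2(n) divided by the mean match length (base 2 is an assumption)."""
    first = fixed_size_sample(first_doc, n)
    second = fixed_size_sample(second_doc, n)
    mean = sum(match_lengths(first, second)) / n
    if mean == 0:
        return math.inf  # no shared characters at all
    return math.log2(n) / mean


def categorize(novel_doc: str, references: list[tuple[str, str]], n: int) -> str:
    """Return the category of the reference document whose cross-entropy
    with the novel document is closest to zero (the rule in claim 2).
    `references` is a list of (category, text) pairs."""
    category, _ = min(
        references,
        key=lambda ref: abs(cross_entropy(ref[1], novel_doc, n)),
    )
    return category


if __name__ == "__main__":
    refs = [("A", "aaaabbbb"), ("B", "ccccdddd")]
    print(match_lengths("abcd", "bcxa"))
    print(cross_entropy("abcd", "bcxa", 4))
    print(categorize("aabbaabb", refs, 8))
```

The substring search here is quadratic or worse in the sample size; a suffix tree or suffix automaton over the first sample would be the usual choice for longer samples, but that is outside what the claims recite.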

8. A computerized method for categorizing documents by applying candidate functions to data classification comprising:
- a. providing a computer processor means for processing data;
  
  b. providing a storage means for storing data on a storage medium;
  
  c. determining a first fixed-size sample of data from a first document;
  
  d. determining a second fixed-size sample of data from a second document;
  
  e. determining the match length within said first document consisting of the longest string of consecutive characters in said second fixed-size sample of data that also appears as a string of consecutive characters in said first fixed-size sample of data;
  
  f. determining said match length at every successive character of said second fixed-size sample;
  
  g. determining a mean match length, wherein said mean match length comprises the total sum of said match lengths of said second fixed-size sample of data divided by the number of said characters in said second fixed-size sample of data;
  
  h. determining the cross-entropy between said first document and said second document, wherein said cross-entropy comprises the logarithm of the number of said characters in said first fixed-size sample of data divided by said mean match length, wherein the number of said characters in said first fixed-size sample of data is equal to the number of said characters in said second fixed-size sample of data;
  
  i. determining a KL-distance from said first document to said second document, wherein said KL-distance comprises the difference between said cross-entropy of said first document and an entropy of said first document, wherein said entropy is the mean match length within said first document; and
  
  j. retrieving documents in a document retrieval system using at least one of the following selected from said total sum of said match lengths, said mean match length, said cross-entropy, or said KL-distance.
  - 9. The computerized method for categorizing documents of claim 8 further comprising providing categorization of said first documents by determining said cross-entropy between a plurality of said first documents, wherein said plurality of said first documents are reference documents, and said second document, wherein said second document is a novel document, and wherein one document selected from said plurality of said first documents having a cross-entropy value closest to zero shall be categorized as the closest document to said second document, and wherein said document categorized as the closest document to said second document shall have its category assigned to said second document.
  - 10. The computerized method of claim 9 further including wherein said second document is a plurality of documents.
  - 11. The computerized method for filtering documents of claim 8 further comprising providing filtration of said first documents by determining said cross-entropy between a plurality of said first documents and said second document, wherein said second document is a reference document, and wherein one document selected from said plurality of said first documents having a cross-entropy value higher than a threshold value shall be filtered out.
  - 12. The computerized method of claim 11 further including wherein said second document is a plurality of documents.
  - 13. The computerized method for categorizing documents of claim 8 further comprising providing similarity judgment characterization of said first documents by determining said KL-distance between a plurality of said first documents and said second document, wherein said second document is a reference document, and wherein one document selected from said plurality of said first documents having a KL-distance closest to zero is closest in similarity to said second document.
  - 14. The computerized method of claim 13 further including wherein said second document is a plurality of documents.
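
The filtering step of claim 11 and the similarity judgment of claim 13 can be sketched as follows. This is a hedged illustration resting on several assumptions that the claims do not fix: samples are the first `n` characters of documents at least `n` characters long, logarithms are base 2, the match length "within" a document is measured against the text preceding each position, and the entropy is scaled as `log2(n)` over that mean match length so that it is commensurate with the cross-entropy. All function names are hypothetical.

```python
import math


def longest_match(pattern: str, reference: str) -> int:
    """Length of the longest prefix of `pattern` found anywhere in `reference`."""
    k = 0
    while k < len(pattern) and pattern[:k + 1] in reference:
        k += 1
    return k


def estimate(n: int, total_match: int) -> float:
    """log2(n) over the mean match length; infinite if nothing matched."""
    mean = total_match / n
    return math.log2(n) / mean if mean > 0 else math.inf


def cross_entropy(first: str, second: str, n: int) -> float:
    """Scan the second sample, matching each suffix against the first sample."""
    a, b = first[:n], second[:n]
    return estimate(n, sum(longest_match(b[i:], a) for i in range(n)))


def self_entropy(doc: str, n: int) -> float:
    # Assumption: "within" the document means matching each position
    # against the text that precedes it in the same sample.
    a = doc[:n]
    return estimate(n, sum(longest_match(a[i:], a[:i]) for i in range(n)))


def kl_distance(first: str, second: str, n: int) -> float:
    """Cross-entropy minus the entropy of the first document."""
    return cross_entropy(first, second, n) - self_entropy(first, n)


def filter_documents(candidates: list[str], reference: str,
                     n: int, threshold: float) -> list[str]:
    """Drop candidates whose cross-entropy with the reference exceeds threshold."""
    return [c for c in candidates if cross_entropy(c, reference, n) <= threshold]


def closest_in_style(candidates: list[str], reference: str, n: int) -> str:
    """Candidate whose KL-distance to the reference is closest to zero."""
    return min(candidates, key=lambda c: abs(kl_distance(c, reference, n)))


if __name__ == "__main__":
    print(filter_documents(["abcd", "wxyz"], "bcxa", 4, 3.0))
    print(closest_in_style(["abab", "abcd"], "abab", 4))
```

On samples this short the estimates are very noisy and the difference can even come out negative; the four-character strings are only there to make the arithmetic easy to follow by hand.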

Current Assignee: Duquesne University of The Holy Ghost
Original Assignee: Duquesne University of The Holy Ghost
Inventors: Juola, Patrick
Primary Examiner(s): Breene, John
Assistant Examiner(s): Black, Linh
Application Number: US09/444,588
Time in Patent Office: 918 Days
Field of Search: 707/1-10,100-104.1,200-206,500-542
US Class Current: 1/1
CPC Class Codes:
G06F 16/35   Clustering; Classification
G06F 16/93   Document management systems
Y10S 707/99932   Access augmentation or opti...
