Document categorization by word length distribution analysis
Abstract
A system and method for efficient document categorization are disclosed. In one embodiment, word length distribution information is used as a basis for categorization. Greater than 90% accuracy in classification may be achieved in, e.g., distinguishing newspaper articles from scientific journal articles. Word length distribution information may be developed without optical character recognition (OCR), permitting use of degraded document images.
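As a rough illustration of categorizing by word length distribution, the sketch below builds a normalized histogram of word lengths and assigns the category whose reference histogram is nearest. It assumes the per-word length estimates have already been obtained from the document image. The function names, the `max_len` cap, and the L1 nearest-reference rule are illustrative assumptions, not the classifier described in the patent.

```python
from collections import Counter


def word_length_histogram(word_lengths, max_len=15):
    """Normalized histogram of word lengths 1..max_len.

    Lengths above max_len share the last bin (an assumption of this sketch).
    """
    counts = Counter(min(n, max_len) for n in word_lengths if n > 0)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * max_len
    return [counts.get(k, 0) / total for k in range(1, max_len + 1)]


def l1_distance(p, q):
    """Sum of absolute differences between two equal-length histograms."""
    return sum(abs(a - b) for a, b in zip(p, q))


def categorize(doc_hist, category_hists):
    """Return the category name whose reference histogram is closest.

    category_hists maps a category name to a reference histogram, for
    example one averaged over known documents of that category.
    """
    return min(category_hists,
               key=lambda name: l1_distance(doc_hist, category_hists[name]))
```

A caller would compute one reference histogram per category from labeled documents, then pass a new document's histogram to `categorize`.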
23 Claims
1. A computer-implemented method for categorizing digitized documents comprising the steps of:
- providing an electronic representation of an image of a document;
- developing word length distribution information of said image from said electronic representation wherein said word length distribution information includes a document feature vector characterizing said document, said document feature vector comprises elements representative of distribution of estimates of word lengths, said elements comprise conditional probabilities of words of A characters proximate to words of B characters, for a plurality of values of A and B; and
- categorizing said document responsive to said word length distribution information and word length distribution information for representative categories of documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A computer program product for categorizing documents comprising:
- code for providing an electronic representation of an image of a document;
- code for developing word length distribution information of said image from said electronic representation wherein said word length distribution information includes a document feature vector characterizing said document, said document feature vector comprise elements representative of distribution of estimate of word lengths, said elements comprise conditional probabilities of words of A characters proximate to words of B characters, for a plurality of values of A and B;
- code for categorizing said document responsive to said word length distribution information; and
- a computer-readable storage medium for storing said codes.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
19. A computer-implemented method for categorizing digitized documents comprising the steps of:
- providing an electronic representation of an image of a document;
- developing word length information of said image from said electronic representation, wherein said word length information includes a document feature vector characterizing said document, said document feature vector comprises elements representative of word length estimates, said elements comprise statistics of word lengths of a plurality of words located within a predetermined proximity; and
- categorizing said document responsive to said word length information and word length information for representative categories of documents.
Dependent claims: 20, 21, 22, 23
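The feature vector recited in the independent claims, with elements that are conditional probabilities of words of A characters proximate to words of B characters, can be sketched as follows. This is a minimal reading under stated assumptions: "proximate" is taken to mean the next `window` words in reading order, word lengths are clipped to `1..max_len`, and the function name and flat row-major layout are hypothetical choices not taken from the patent.

```python
from collections import defaultdict


def conditional_length_features(word_lengths, max_len=10, window=1):
    """Estimate P(nearby word has A chars | word has B chars).

    Returns a flat list of max_len * max_len values; the entry for a
    given (B, A) pair sits at index (B - 1) * max_len + (A - 1).
    Pairs are formed between each word and the next `window` words.
    """
    # Clip every length into 1..max_len (an assumption of this sketch).
    lengths = [min(max(n, 1), max_len) for n in word_lengths]

    pair_counts = defaultdict(int)   # (A, B) -> co-occurrence count
    given_counts = defaultdict(int)  # B -> number of pairs conditioned on B

    for i, b in enumerate(lengths):
        last = min(i + window, len(lengths) - 1)
        for j in range(i + 1, last + 1):
            a = lengths[j]
            pair_counts[(a, b)] += 1
            given_counts[b] += 1

    features = []
    for b in range(1, max_len + 1):
        for a in range(1, max_len + 1):
            if given_counts[b]:
                features.append(pair_counts[(a, b)] / given_counts[b])
            else:
                features.append(0.0)  # no evidence for this B value
    return features
```

Raising `window` above 1 widens the neighborhood, which is one way to read the "predetermined proximity" of claim 19. Each row of the resulting matrix with at least one observation sums to 1.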
Specification