Method and system for document classification based on document structure and written style
Abstract
A document classification method and system based on document structure and written style. The method and system categorize a document's alphabetical words into complex and non-complex words, its linguistic sentences into subjective and non-subjective sentences, and its images into descriptive and non-descriptive images. These categorizations are then used to calculate complexity, subjectivity and descriptive-images classifications for the document. A web search engine can use these classifications to filter, sort or tag a set of document references based on user selection.
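As a rough illustration of the search-engine use described in the abstract, the sketch below filters, sorts and tags a small set of document references whose classifications have already been calculated. The `DocumentRef` type, the function names, the cutoff values and the example URLs are all hypothetical; the abstract does not prescribe any of them.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentRef:
    """Hypothetical search result carrying precomputed classifications."""
    url: str
    subjectivity: float      # subjective sentences / total sentences
    complexity: float        # complex words / total words
    descriptive_images: int  # count of descriptive images
    tags: list = field(default_factory=list)


def filter_refs(refs, max_subjectivity=None, max_complexity=None, min_images=None):
    """Keep only the references that satisfy every limit the user selected."""
    kept = []
    for ref in refs:
        if max_subjectivity is not None and ref.subjectivity > max_subjectivity:
            continue
        if max_complexity is not None and ref.complexity > max_complexity:
            continue
        if min_images is not None and ref.descriptive_images < min_images:
            continue
        kept.append(ref)
    return kept


def sort_refs(refs, key="complexity", descending=False):
    """Order references by one of the three classifications."""
    return sorted(refs, key=lambda r: getattr(r, key), reverse=descending)


def tag_refs(refs, subjectivity_cutoff=0.5, complexity_cutoff=0.2):
    """Attach labels; the cutoffs are illustrative assumptions."""
    for ref in refs:
        ref.tags = [
            "subjective" if ref.subjectivity >= subjectivity_cutoff else "non-subjective",
            "complex" if ref.complexity >= complexity_cutoff else "non-complex",
        ]
        if ref.descriptive_images > 0:
            ref.tags.append("descriptive-images")
    return refs


results = [
    DocumentRef("https://example.com/review", 0.70, 0.10, 0),
    DocumentRef("https://example.com/manual", 0.05, 0.30, 4),
    DocumentRef("https://example.com/news", 0.20, 0.15, 1),
]
objective_only = filter_refs(results, max_subjectivity=0.25)
easiest_first = sort_refs(objective_only, key="complexity")
tagged = tag_refs(results)
```

Here a user who asks for mostly non-subjective documents, simplest first, would see the news page ahead of the manual, and the review would be filtered out.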
15 Claims
1. A method of determining the classification of a document for the purpose of searching, comprising:
a) receiving the textual content of said document in the form of linguistic sentences and alphabetical words;
b) receiving meta-data on images of said document including each image size, title or description;
c) categorizing said linguistic sentences into subjective and non-subjective sentences;
d) categorizing said alphabetical words into complex and non-complex words;
e) categorizing said images into descriptive and non-descriptive images;
f) calculating the document subjectivity classification as the count of said subjective sentences or the ratio of subjective sentences to non-subjective sentences or total sentences in said document;
g) calculating the document complexity classification as the count of complex alphabetical words or the ratio of complex alphabetical words to non-complex words or total words; and
h) calculating the document descriptive-images classification as the count of descriptive-images.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
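Steps a) through h) of claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: claim 1 does not say how a sentence is judged subjective, a word complex, or an image descriptive, so the marker-word list, the word-length threshold and the minimum image dimensions below are hypothetical stand-ins. The sketch uses the ratio-to-total variant of steps f) and g).

```python
import re

# Hypothetical criteria; claim 1 leaves all three tests undefined.
SUBJECTIVE_MARKERS = {"i", "we", "my", "our", "think", "believe", "feel", "best", "worst"}
COMPLEX_WORD_MIN_LENGTH = 9
MIN_IMAGE_WIDTH = 200
MIN_IMAGE_HEIGHT = 200


def is_subjective(sentence):
    """Step c): a sentence is subjective if it contains any marker word."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(token in SUBJECTIVE_MARKERS for token in tokens)


def is_complex(word):
    """Step d): a word is complex if it is at least the threshold length."""
    return len(word) >= COMPLEX_WORD_MIN_LENGTH


def is_descriptive(image):
    """Step e): large enough and carrying a title or description."""
    large_enough = (image["width"] >= MIN_IMAGE_WIDTH
                    and image["height"] >= MIN_IMAGE_HEIGHT)
    has_text = bool(image.get("title") or image.get("description"))
    return large_enough and has_text


def classify_document(sentences, words, images):
    """Steps f) to h): two ratios to totals and one count."""
    subjective = sum(1 for s in sentences if is_subjective(s))
    complex_words = sum(1 for w in words if is_complex(w))
    descriptive = sum(1 for i in images if is_descriptive(i))
    return {
        "subjectivity": subjective / len(sentences) if sentences else 0.0,
        "complexity": complex_words / len(words) if words else 0.0,
        "descriptive_images": descriptive,
    }


# Steps a) and b): the received text and image meta-data (made-up sample).
sentences = [
    "I think this camera is the best in its class.",
    "The sensor has a resolution of 24 megapixels.",
]
words = re.findall(r"[A-Za-z]+", " ".join(sentences))  # alphabetical words only
images = [
    {"width": 800, "height": 600, "title": "Front view of the camera"},
    {"width": 16, "height": 16, "title": ""},
]
result = classify_document(sentences, words, images)
```

For this sample, one of the two sentences is subjective, two of the 17 alphabetical words ("resolution" and "megapixels") are complex, and only the first image counts as descriptive.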
9. A document classification system that determines the subjectivity, complexity and descriptive-images classification of a document, comprising:
a memory; and
a document classification circuit or routine that:
a) receives linguistic sentences, alphabetical words and images meta-data of said document;
b) categorizes said received linguistic sentences into subjective and non-subjective sentences and based on this categorization calculates the subjectivity classification of said document;
c) categorizes said received alphabetical words into complex and non-complex words and based on this categorization calculates the complexity classification of said document; and
d) categorizes said images into descriptive and non-descriptive images and based on this categorization calculates the descriptive-images classification.
Dependent claims: 10, 11, 12, 13, 14, 15
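One way to picture the system of claim 9 is a small class that pairs a memory with a classification routine. The sketch below is an assumption-laden illustration: the class name, the dictionary used as the memory, and the three pluggable categorizer callables are hypothetical, and the simple lambdas in the usage example are placeholders for whatever categorization rules a real system would use.

```python
from typing import Callable, Dict, List


class DocumentClassificationSystem:
    """Hypothetical pairing of a memory with a classification routine."""

    def __init__(self,
                 is_subjective: Callable[[str], bool],
                 is_complex: Callable[[str], bool],
                 is_descriptive: Callable[[dict], bool]) -> None:
        self.memory: Dict[str, dict] = {}  # stores results per document id
        self._is_subjective = is_subjective
        self._is_complex = is_complex
        self._is_descriptive = is_descriptive

    @staticmethod
    def _ratio(part: int, total: int) -> float:
        # Guard against empty documents.
        return part / total if total else 0.0

    def classify(self, doc_id: str, sentences: List[str],
                 words: List[str], images_meta: List[dict]) -> dict:
        """Elements a) to d): receive the inputs, categorize, calculate."""
        subjective = sum(1 for s in sentences if self._is_subjective(s))
        complex_words = sum(1 for w in words if self._is_complex(w))
        descriptive = sum(1 for m in images_meta if self._is_descriptive(m))
        result = {
            "subjectivity": self._ratio(subjective, len(sentences)),
            "complexity": self._ratio(complex_words, len(words)),
            "descriptive_images": descriptive,
        }
        self.memory[doc_id] = result
        return result


# Placeholder categorizers, chosen only to keep the example short.
system = DocumentClassificationSystem(
    is_subjective=lambda s: "i think" in s.lower(),
    is_complex=lambda w: len(w) >= 9,
    is_descriptive=lambda m: bool(m.get("description")),
)
scores = system.classify(
    "doc-1",
    sentences=["I think it works.", "It has two ports."],
    words=["I", "think", "it", "works", "It", "has", "two", "ports"],
    images_meta=[{"size": (640, 480), "description": "Rear ports"}],
)
```

Passing the categorizers in as callables keeps the calculation separate from the categorization rules, so the rules can be swapped without touching the routine.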
Specification