Method and system for document classification based on document structure and written style
First Claim
1. A method for determining the subjectivity, complexity and descriptive image classifications for a plurality of documents containing text sentences, alphabetic words and images having meta-data, comprising:
- for each document of a plurality of documents;
a) categorizing at least one sentence as a subjective sentence and at least one sentence as non-subjective sentence;
wherein the at least one subjective sentence is categorized as a subjective sentence because the sentence includes one or more pronouns of the group; I, we, he, she, they or you; and
wherein the at least one non-subjective sentence is categorized as a non-subjective sentence because the sentence does not include one or more pronouns of the group; I, we, he, she, they or you;
b) categorizing at least one alphabetical word as a complex word and at least one alphabetical word as a non-complex word;
wherein the at least one complex word is categorized as a complex word because the number of syllables is over a threshold number of syllables; and
wherein the at least one non-complex word is categorized as a non-complex word because the number of syllables is less than a threshold number of syllables;
c) categorizing at least one image as descriptive image and at least one image as non-descriptive;
wherein the at least one descriptive image is categorized as a descriptive image because the at least one descriptive image includes an image size greater than a designated image size and the image meta-data includes a title or description; and
wherein the at least one non-descriptive image is categorized as a non-descriptive image because the at least one non-descriptive image includes an image size less than a designated image size or the image meta-data does not include a title or a description;
d) designating a document subjectivity classification that is equal to the ratio of subjective sentences to non-subjective sentences; and
designating a document complexity classification that is equal to the ratio of complex alphabetical words to non-complex alphabetical words; and
designating a document descriptive image classification that is equal to the total number of the descriptive images in a document;
e) wherein the subjectivity classification, the complexity classification and the descriptive image classification are associated with the document and stored in a database;
designating at least one document of the plurality of documents as a subjective document because its subjectivity classification is higher than a predetermined value;
designating at least one document of the plurality of documents as a non-subjective document because its subjectivity classification is lower than a predetermined value;
designating at least one document of the plurality of documents as a complex document because its complexity classification is higher than a predetermined value;
designating at least one document of the plurality of documents as a non-complex document because its complexity classification is lower than a predetermined value.
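The subjectivity steps of the claim (step a), the ratio in step d), and the threshold designations) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the sentence splitter, the function names, the default threshold of 1.0, and the handling of a document with no non-subjective sentences are all assumptions that the claim does not specify.

```python
import re

# Pronoun group recited in step a) of the claim.
SUBJECTIVE_PRONOUNS = {"i", "we", "he", "she", "they", "you"}


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def is_subjective(sentence):
    """True if the sentence contains at least one pronoun of the group."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    return any(w in SUBJECTIVE_PRONOUNS for w in words)


def subjectivity_ratio(text):
    """Ratio of subjective to non-subjective sentences (step d)."""
    sentences = split_sentences(text)
    subjective = sum(1 for s in sentences if is_subjective(s))
    non_subjective = len(sentences) - subjective
    if non_subjective == 0:
        # The claim does not cover this case; infinity is an assumption.
        return float("inf") if subjective else 0.0
    return subjective / non_subjective


def designate_subjectivity(ratio, threshold=1.0):
    """Compare the ratio with a predetermined value (hypothetical default)."""
    if ratio > threshold:
        return "subjective"
    if ratio < threshold:
        return "non-subjective"
    return None  # equality is not addressed by the claim


sample = "I think this is great. The sky is blue. We went home. Water boils when heated."
ratio = subjectivity_ratio(sample)
label = designate_subjectivity(ratio, threshold=0.5)
```

In the sample, two of the four sentences contain a listed pronoun, so the ratio is 2/2 and the document is designated subjective under the hypothetical threshold of 0.5.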
Abstract
A classification method and system for documents containing text sentences and images having meta-data. The classification method and system categorize document sentences as subjective or non-subjective and document images as descriptive or non-descriptive. These categorizations are then used to calculate the subjectivity and descriptive-image classifications of a document. The classification system can be used by a web search engine to filter, sort or tag a set of document references based on user selection.
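As a rough illustration of how a search engine might filter and sort document references using stored classifications, consider the sketch below. The record layout, the function name `select_documents`, and the choice to sort by descriptive-image count are hypothetical; the abstract only states that filtering, sorting or tagging is based on user selection.

```python
from dataclasses import dataclass


@dataclass
class DocumentRecord:
    """Hypothetical stored classifications for one document reference."""
    url: str
    subjectivity: float
    complexity: float
    descriptive_images: int


def select_documents(records, max_subjectivity=None, max_complexity=None,
                     min_descriptive_images=0):
    """Filter by user-selected limits, then sort by descriptive-image count."""
    selected = [
        r for r in records
        if (max_subjectivity is None or r.subjectivity <= max_subjectivity)
        and (max_complexity is None or r.complexity <= max_complexity)
        and r.descriptive_images >= min_descriptive_images
    ]
    return sorted(selected, key=lambda r: r.descriptive_images, reverse=True)


records = [
    DocumentRecord("a.example", subjectivity=0.2, complexity=0.1, descriptive_images=3),
    DocumentRecord("b.example", subjectivity=1.5, complexity=0.1, descriptive_images=5),
    DocumentRecord("c.example", subjectivity=0.3, complexity=0.4, descriptive_images=1),
]

# A user asking for mostly non-subjective results.
objective_results = select_documents(records, max_subjectivity=0.5)
```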
2 Claims
1. A method for determining the subjectivity, complexity and descriptive image classifications for a plurality of documents containing text sentences, alphabetic words and images having meta-data, comprising:
for each document of a plurality of documents;
a) categorizing at least one sentence as a subjective sentence and at least one sentence as non-subjective sentence;
wherein the at least one subjective sentence is categorized as a subjective sentence because the sentence includes one or more pronouns of the group; I, we, he, she, they or you; and
wherein the at least one non-subjective sentence is categorized as a non-subjective sentence because the sentence does not include one or more pronouns of the group; I, we, he, she, they or you;
b) categorizing at least one alphabetical word as a complex word and at least one alphabetical word as a non-complex word;
wherein the at least one complex word is categorized as a complex word because the number of syllables is over a threshold number of syllables; and
wherein the at least one non-complex word is categorized as a non-complex word because the number of syllables is less than a threshold number of syllables;
c) categorizing at least one image as descriptive image and at least one image as non-descriptive;
wherein the at least one descriptive image is categorized as a descriptive image because the at least one descriptive image includes an image size greater than a designated image size and the image meta-data includes a title or description; and
wherein the at least one non-descriptive image is categorized as a non-descriptive image because the at least one non-descriptive image includes an image size less than a designated image size or the image meta-data does not include a title or a description;
d) designating a document subjectivity classification that is equal to the ratio of subjective sentences to non-subjective sentences; and
designating a document complexity classification that is equal to the ratio of complex alphabetical words to non-complex alphabetical words; and
designating a document descriptive image classification that is equal to the total number of the descriptive images in a document;
e) wherein the subjectivity classification, the complexity classification and the descriptive image classification are associated with the document and stored in a database;
designating at least one document of the plurality of documents as a subjective document because its subjectivity classification is higher than a predetermined value;
designating at least one document of the plurality of documents as a non-subjective document because its subjectivity classification is lower than a predetermined value;
designating at least one document of the plurality of documents as a complex document because its complexity classification is higher than a predetermined value;
designating at least one document of the plurality of documents as a non-complex document because its complexity classification is lower than a predetermined value.
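Step b) and the complexity ratio of step d) in claim 1 could be approximated as in the sketch below. The claim does not say how syllables are counted, so the vowel-group heuristic used here is an assumption, as are the function names and the default threshold of 2 syllables. The claim also leaves open the case where a word has exactly the threshold number of syllables; this sketch treats such words as non-complex.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(len(groups), 1)


def is_complex(word, threshold=2):
    """Complex if the syllable count is over the threshold (step b)."""
    return count_syllables(word) > threshold


def complexity_ratio(text, threshold=2):
    """Ratio of complex to non-complex alphabetical words (step d)."""
    words = re.findall(r"[A-Za-z]+", text)
    complex_count = sum(1 for w in words if is_complex(w, threshold))
    non_complex_count = len(words) - complex_count
    if non_complex_count == 0:
        # Not covered by the claim; infinity is an assumption.
        return float("inf") if complex_count else 0.0
    return complex_count / non_complex_count


text = "The classification method reads documents"
ratio = complexity_ratio(text)  # 2 complex words, 3 non-complex words
```

A vowel-group count miscounts some English words (silent "e", for example), so a production system would likely need a pronunciation dictionary or a better heuristic.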
2. A document classification system for determining the subjectivity, complexity and descriptive image classifications for a plurality of documents containing text sentences, alphabetic words and images having meta-data, comprising:
at least one non-transitory computer readable medium; and
at least one processor; and
instructions stored on the at least one non-transitory computer readable medium which when executed by the at least one processor are configured to perform the following steps;
for each document of a plurality of documents;
a) categorizing at least one sentence as a subjective sentence and at least one sentence as non-subjective sentence;
wherein the at least one subjective sentence is categorized as a subjective sentence because the sentence includes one or more pronouns of the group; I, we, he, she, they or you; and
wherein the at least one non-subjective sentence is categorized as a non-subjective sentence because the sentence does not include one or more pronouns of the group; I, we, he, she, they or you;
b) categorizing at least one alphabetical word as a complex word and a non-complex word;
wherein the at least one alphabetical word is categorized as complex if the number of syllables is over a threshold; and
wherein the at least one alphabetical word is categorized as non-complex if the number of syllables is less than a threshold value;
c) categorizing at least one image as descriptive image and at least one image as non-descriptive;
wherein the at least one descriptive image is categorized as a descriptive image because the at least one descriptive image includes an image size greater than a designated size and the image meta-data includes a title or description; and
wherein the at least one non-descriptive image is categorized as a non-descriptive image because the at least one non-descriptive image includes an image size less than a designated image size or the image meta-data does not include a title or a description; and
d) designating a document subjectivity classification that is equal to the ratio of subjective sentences to non-subjective sentences; and
designating a document complexity classification that is equal to the ratio of complex alphabetical words to non-complex alphabetical words; and
designating a document descriptive image classification that is equal to the total number of the descriptive images per document; and
e) wherein the subjectivity classification, the complexity classification and the descriptive image classification are associated with the document and stored in a database;
designating at least one document of the plurality of documents as a subjective document because its subjectivity classification is higher than a predetermined value; and
designating at least one document of the plurality of documents as a non-subjective document because its subjectivity classification is lower than a predetermined value; and
designating at least one document of the plurality of documents as a complex document because its complexity classification is higher than a predetermined value; and
designating for each document of the plurality of documents as a non-complex document because its complexity classification is lower than a predetermined value.
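Steps c) and e) of claim 2 (categorizing images and storing the classifications in a database) might look something like the following sketch. Several details are assumptions: "image size" is interpreted as pixel area, the designated size defaults to 10,000 pixels, the `Image` record and table schema are hypothetical, and SQLite stands in for whatever database an actual system would use. An image that is not descriptive is treated as non-descriptive, which also covers the equal-size case the claim does not address.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Image:
    """Hypothetical image record with the meta-data fields named in the claim."""
    width: int
    height: int
    title: str = ""
    description: str = ""


def is_descriptive(image, min_area=10_000):
    """Descriptive if larger than the designated size AND it has a title or description."""
    has_text = bool(image.title.strip() or image.description.strip())
    return image.width * image.height > min_area and has_text


def count_descriptive(images, min_area=10_000):
    """Descriptive image classification: total number of descriptive images."""
    return sum(1 for img in images if is_descriptive(img, min_area))


def store_classifications(conn, doc_id, subjectivity, complexity, descriptive_images):
    """Associate the three classifications with a document and persist them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS classifications ("
        "doc_id TEXT PRIMARY KEY, subjectivity REAL, "
        "complexity REAL, descriptive_images INTEGER)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO classifications VALUES (?, ?, ?, ?)",
        (doc_id, subjectivity, complexity, descriptive_images),
    )
    conn.commit()


images = [
    Image(800, 600, title="System diagram"),   # large with title: descriptive
    Image(800, 600),                           # large, no meta-data text
    Image(16, 16, description="Bullet icon"),  # has text but too small
]
n_descriptive = count_descriptive(images)

conn = sqlite3.connect(":memory:")
store_classifications(conn, "doc-1", 1.0, 0.5, n_descriptive)
row = conn.execute(
    "SELECT subjectivity, complexity, descriptive_images "
    "FROM classifications WHERE doc_id = ?", ("doc-1",)
).fetchone()
```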
Specification