Method and system for document classification based on document structure and written style

  • US 8,082,248 B2
  • Filed: 05/29/2008
  • Issued: 12/20/2011
  • Est. Priority Date: 05/29/2008
  • Status: Active Grant
First Claim

1. A method for determining the subjectivity, complexity and descriptive image classifications for a plurality of documents containing text sentences, alphabetic words and images having meta-data, comprising:

  • for each document of a plurality of documents:

    a) categorizing at least one sentence as a subjective sentence and at least one sentence as non-subjective sentence;

    wherein the at least one subjective sentence is categorized as a subjective sentence because the sentence includes one or more pronouns of the group: I, we, he, she, they or you; and

    wherein the at least one non-subjective sentence is categorized as a non-subjective sentence because the sentence does not include one or more pronouns of the group: I, we, he, she, they or you;

    b) categorizing at least one alphabetical word as a complex word and at least one alphabetical word as a non-complex word;

    wherein the at least one complex word is categorized as a complex word because the number of syllables is over a threshold number of syllables; and

    wherein the at least one non-complex word is categorized as a non-complex word because the number of syllables is less than a threshold number of syllables;

    c) categorizing at least one image as descriptive image and at least one image as non-descriptive;

    wherein the at least one descriptive image is categorized as a descriptive image because the at least one descriptive image includes an image size greater than a designated image size and the image meta-data includes a title or description; and

    wherein the at least one non-descriptive image is categorized as a non-descriptive image because the at least one non-descriptive image includes an image size less than a designated image size or the image meta-data does not include a title or a description;

    d) designating a document subjectivity classification that is equal to the ratio of subjective sentences to non-subjective sentences; and

    designating a document complexity classification that is equal to the ratio of complex alphabetical words to non-complex alphabetical words; and

    designating a document descriptive image classification that is equal to the total number of the descriptive images in a document;

    e) wherein the subjectivity classification, the complexity classification and the descriptive image classification are associated with the document and stored in a database;

    designating at least one document of the plurality of documents as a subjective document because its subjectivity classification is higher than a predetermined value;

    designating at least one document of the plurality of documents as a non-subjective document because its subjectivity classification is lower than a predetermined value;

    designating at least one document of the plurality of documents as a complex document because its complexity classification is higher than a predetermined value;

    designating at least one document of the plurality of documents as a non-complex document because its complexity classification is lower than a predetermined value.
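
Steps a) through e) of the claim can be read as a small scoring pipeline. The following is a minimal Python sketch of that reading, not the patentee's implementation. Every name in it (`Image`, `classify_document`, `designate`, `store`) is hypothetical. The vowel-run syllable counter, the use of pixel area as "image size", the threshold and cutoff values, and the in-memory dictionary standing in for the database are all assumptions chosen for illustration. The claim does not say how to treat a value exactly equal to a threshold; the sketch treats it as not exceeding the threshold.

```python
import re
from dataclasses import dataclass

PRONOUNS = {"i", "we", "he", "she", "they", "you"}  # group recited in step a)


@dataclass
class Image:
    """Hypothetical image record: pixel dimensions plus meta-data fields."""
    width: int
    height: int
    title: str = ""
    description: str = ""


def is_subjective(sentence: str) -> bool:
    """Step a): subjective if the sentence contains any listed pronoun."""
    return any(w in PRONOUNS for w in re.findall(r"[a-z]+", sentence.lower()))


def count_syllables(word: str) -> int:
    """Assumed heuristic (not from the claim): one syllable per vowel run."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def is_descriptive(image: Image, min_area: int) -> bool:
    """Step c): large enough AND has a title or description in its meta-data."""
    has_text = bool(image.title or image.description)
    return image.width * image.height > min_area and has_text


def ratio(numerator: int, denominator: int) -> float:
    """Ratio that avoids division by zero (edge case the claim leaves open)."""
    if denominator == 0:
        return float("inf") if numerator else 0.0
    return numerator / denominator


def classify_document(text, images, syllable_threshold=2, min_area=10_000):
    """Steps a) to d): compute the three classifications for one document."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z]+", text)  # alphabetic words only
    subjective = sum(is_subjective(s) for s in sentences)
    complex_words = sum(count_syllables(w) > syllable_threshold for w in words)
    return {
        "subjectivity": ratio(subjective, len(sentences) - subjective),
        "complexity": ratio(complex_words, len(words) - complex_words),
        "descriptive_images": sum(is_descriptive(i, min_area) for i in images),
    }


def designate(scores, subjectivity_cutoff=1.0, complexity_cutoff=0.5):
    """Step e): compare each ratio with an assumed predetermined value."""
    return {
        "subjective": scores["subjectivity"] > subjectivity_cutoff,
        "complex": scores["complexity"] > complexity_cutoff,
    }


database = {}  # in-memory stand-in for the database recited in step e)


def store(doc_id, text, images):
    """Associate the classifications and designations with the document."""
    scores = classify_document(text, images)
    database[doc_id] = {**scores, **designate(scores)}
    return database[doc_id]


if __name__ == "__main__":
    sample_text = "I like cats. The organization published documentation. We went home."
    sample_images = [Image(200, 200, title="Chart"), Image(200, 200), Image(10, 10, title="icon")]
    print(store("doc-1", sample_text, sample_images))
```

A production system would likely replace the vowel-run heuristic with a pronunciation dictionary and the regular-expression sentence splitter with a proper tokenizer; both were kept deliberately simple here so the mapping from claim step to function stays visible.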
