
Classification method and apparatus

  • US 8,276,067 B2
  • Filed: 09/10/2008
  • Issued: 09/25/2012
  • Est. Priority Date: 04/28/1999
  • Status: Expired due to Term
First Claim

1. A method for the computerized classification of an unclassified text document into one of a plurality of predefined classes based on a classification model obtained from the classification of a plurality of preclassified text documents which respectively have been classified as belonging to one of said plurality of classes, said document and said documents respectively comprising a plurality of terms which respectively comprise one or more symbols of a finite set of symbols;

  • wherein said method involves the computerized building of said classification model, comprising the following method;

    representing each of said plurality of text documents, which are digitally represented in an computer, by a vector of n dimensions, said n dimensions forming a vector space, whereas the value of each dimension of said vector corresponds to the frequency of occurrence of a certain term in the document corresponding to said vector, so that said n dimensions span up a vector space;

    representing the classification of said already classified documents into classes by separating said vector space into a plurality of subspaces by calculating one or more hyperplanes, such that each subspace comprises one or more documents as represented by their corresponding vectors in said vector space, so that said each subspace corresponds to a respective class;

    calculating a maximum margin surrounding said hyperplanes in said vector space such that said margin contains none of the vectors contained in the subspaces corresponding to said classification classes;

    wherein said method further involves, on basis of said classification model, the computerized classification of said unclassified text document as belonging to one of said plurality of classes, comprising the following method;

    representing said text document, which is digitally represented in a computer, by a vector of n dimensions, said n dimensions spanning up said vector space, whereas the value of each dimension of said vector corresponds to the frequency of occurrence of a certain term in the document corresponding to said vector;

    classifying said document into one of said plurality of classes by determining into which of said plurality of subspaces of said vector space said vector falls and identifying said document as belonging to a certain class which corresponds to the subspace into which said vector falls;

    calculating a confidence level for the classification of said document as belonging to said certain class based on the distances between the vector representing said document and all hyperplanes surrounding said subspace which corresponds to said certain class normalized by the corresponding margins such that a document which lies outside said margins is assigned a confidence level of “1” and a document which falls into said margins is assigned a value between “0” and “1”.
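
The classification and confidence steps recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation: the vocabulary, the class names, and the hyperplane (`WEIGHTS`, `BIAS`, `MARGIN`) are hypothetical and hand-picked rather than obtained by maximum-margin training. It also assumes that "margin" means the distance from a hyperplane to the edge of its margin band, and that distances to several bounding hyperplanes are combined by taking the smallest normalized value; the claim does not specify either point.

```python
import math
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Represent a document as an n-dimensional vector of raw term counts."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]


def signed_distance(vector, weights, bias):
    """Signed Euclidean distance from a vector to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return (sum(w * x for w, x in zip(weights, vector)) + bias) / norm


def classify(vector, weights, bias, positive_class, negative_class):
    """Pick the class whose subspace (side of the hyperplane) holds the vector."""
    return positive_class if signed_distance(vector, weights, bias) >= 0 else negative_class


def confidence(vector, bounding_hyperplanes):
    """Margin-normalized distance to the nearest bounding hyperplane, capped at 1.

    Taking the minimum over hyperplanes is an assumption of this sketch.
    """
    ratios = [
        abs(signed_distance(vector, weights, bias)) / margin
        for weights, bias, margin in bounding_hyperplanes
    ]
    return min(1.0, min(ratios))


# Hypothetical vocabulary and hand-picked (not trained) hyperplane.
VOCABULARY = ["invoice", "payment", "goal", "match"]
WEIGHTS, BIAS, MARGIN = [1.0, 1.0, -1.0, -1.0], 0.0, 1.0

for text in ("invoice payment invoice due", "payment goal invoice"):
    vec = term_frequency_vector(text, VOCABULARY)
    label = classify(vec, WEIGHTS, BIAS, "finance", "sports")
    print(text, "->", label, confidence(vec, [(WEIGHTS, BIAS, MARGIN)]))
```

Under these assumptions, the first document lies outside the margin band and receives a confidence of 1, while the second falls inside the band and receives a value between 0 and 1.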

  • 14 Assignments