Classification method and apparatus

US 8,276,067 B2
Filed: 09/10/2008
Issued: 09/25/2012
Est. Priority Date: 04/28/1999
Status: Expired due to Term

Abstract

A method for building a classification model for classifying unclassified documents based on the classification of a plurality of documents which respectively have been classified as belonging to one of a plurality of classes, said documents being digitally represented in a computer, said documents respectively comprising a plurality of terms which respectively comprise one or more symbols of a finite set of symbols, and said method comprising the following steps: representing each of said plurality of documents by a vector of n dimensions, said n dimensions forming a vector space, whereas the value of each dimension of said vector corresponds to the frequency of occurrence of a certain term in the document corresponding to said vector, so that said n dimensions span up a vector space; representing the classification of said already classified documents into classes by separating said vector space into a plurality of subspaces by one or more hyperplanes, such that each subspace comprises one or more documents as represented by their corresponding vectors in said vector space, so that said each subspace corresponds to a class.
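The vector representation described in the abstract can be illustrated with a minimal sketch. It assumes that a "term" is a whitespace-separated, lower-cased token and that terms not seen while building the model are ignored; the patent text prescribes neither choice. The function names `build_vocabulary` and `term_frequency_vector` are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign one vector dimension to each distinct term, in sorted order."""
    terms = sorted({term for doc in documents for term in doc.lower().split()})
    return {term: index for index, term in enumerate(terms)}


def term_frequency_vector(document, vocabulary):
    """Map a document to an n-dimensional vector of term occurrence counts."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for term, count in counts.items():
        index = vocabulary.get(term)
        if index is not None:  # assumption: unseen terms are simply dropped
            vector[index] = count
    return vector


# Two toy preclassified documents (illustrative data, not from the patent).
preclassified = ["invoice total due invoice", "meeting agenda notes"]
vocabulary = build_vocabulary(preclassified)
vectors = [term_frequency_vector(doc, vocabulary) for doc in preclassified]
```

Each document becomes a point in the same n-dimensional space, where n is the vocabulary size, so an unclassified document can later be mapped into that space with the same vocabulary.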

2 Claims

1. A method for the computerized classification of an unclassified text document into one of a plurality of predefined classes based on a classification model obtained from the classification of a plurality of preclassified text documents which respectively have been classified as belonging to one of said plurality of classes, said document and said documents respectively comprising a plurality of terms which respectively comprise one or more symbols of a finite set of symbols;
- wherein said method involves the computerized building of said classification model, comprising the following method;
  
  representing each of said plurality of text documents, which are digitally represented in an computer, by a vector of n dimensions, said n dimensions forming a vector space, whereas the value of each dimension of said vector corresponds to the frequency of occurrence of a certain term in the document corresponding to said vector, so that said n dimensions span up a vector space;
  
  representing the classification of said already classified documents into classes by separating said vector space into a plurality of subspaces by calculating one or more hyperplanes, such that each subspace comprises one or more documents as represented by their corresponding vectors in said vector space, so that said each subspace corresponds to a respective class;
  
  calculating a maximum margin surrounding said hyperplanes in said vector space such that said margin contains none of the vectors contained in the subspaces corresponding to said classification classes;
  
  wherein said method further involves, on basis of said classification model, the computerized classification of said unclassified text document as belonging to one of said plurality of classes, comprising the following method;
  
  representing said text document, which is digitally represented in a computer, by a vector of n dimensions, said n dimensions spanning up said vector space, whereas the value of each dimension of said vector corresponds to the frequency of occurrence of a certain term in the document corresponding to said vector;
  
  classifying said document into one of said plurality of classes by determining into which of said plurality of subspaces of said vector space said vector falls and identifying said document as belonging to a certain class which corresponds to the subspace into which said vector falls;
  
  calculating a confidence level for the classification of said document as belonging to said certain class based on the distances between the vector representing said document and all hyperplanes surrounding said subspace which corresponds to said certain class normalized by the corresponding margins such that a document which lies outside said margins is assigned a confidence level of “1” and a document which falls into said margins is assigned a value between “0” and “1”.
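The confidence step of claim 1 can be sketched as follows. The sketch makes several assumptions that the claim does not state: each hyperplane bounding the chosen subspace is given as a hypothetical `(normal, offset, margin)` triple, the margin is measured as a distance from the hyperplane, the normalized distance is clipped at 1, and the per-hyperplane values are combined by taking the minimum.

```python
import math


def signed_distance(vector, normal, offset):
    """Signed Euclidean distance from a point to the hyperplane normal.x + offset = 0."""
    norm = math.sqrt(sum(w * w for w in normal))
    return (sum(w * x for w, x in zip(normal, vector)) + offset) / norm


def confidence(vector, hyperplanes):
    """Margin-normalized confidence in [0, 1] for a vector inside a subspace.

    hyperplanes: list of (normal, offset, margin) triples bounding the subspace.
    A vector at least one margin away from every hyperplane scores 1.0;
    a vector inside any margin scores below 1.0.
    """
    ratios = []
    for normal, offset, margin in hyperplanes:
        distance = abs(signed_distance(vector, normal, offset))
        ratios.append(min(1.0, distance / margin))
    return min(ratios)  # assumption: the closest hyperplane decides the confidence


# One hyperplane x0 = 2 with a margin of 1 on either side (illustrative values).
planes = [((1.0, 0.0), -2.0, 1.0)]
outside_margin = confidence([4.0, 0.0], planes)  # 2 units away, beyond the margin
inside_margin = confidence([2.5, 0.0], planes)   # 0.5 units away, inside the margin
```

Under these assumptions `outside_margin` is 1.0 and `inside_margin` is 0.5, which matches the claimed behavior: full confidence outside the margins and a value between 0 and 1 inside them.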

2. An apparatus for the computerized classification of an unclassified text document into one of a plurality of predefined classes based on a classification model obtained from the classification of a plurality of preclassified text documents which respectively have been classified as belonging to one of said plurality of classes, said document and said documents respectively comprising a plurality of terms which respectively comprise one or more symbols of a finite set of symbols;
- wherein said apparatus involves the computerized building of said classification model, said apparatus comprising a processor responsive to a stored program of instructions for;
  
  representing each of said plurality of text documents, which are digitally represented in an computer, by a vector of n dimensions, said n dimensions forming a vector space, whereas the value of each dimension of said vector corresponds to the frequency of occurrence of a certain term in the document corresponding to said vector, so that said n dimensions span up a vector space;
  
  representing the classification of said already classified documents into classes by separating said vector space into a plurality of subspaces by calculating one or more hyperplanes, such that each subspace comprises one or more documents as represented by their corresponding vectors in said vector space, so that said each subspace corresponds to a respective class;
  
  calculating a maximum margin surrounding said hyperplanes in said vector space such that said margin contains none of the vectors contained in the subspaces corresponding to said classification classes;
  
  wherein said apparatus further involves, on basis of said classification model, the computerized classification of said unclassified text document as belonging to one of said plurality of classes, said apparatus comprising a processor responsive to a stored program of instructions for;
  
  representing said text document, which is digitally represented in a computer, by a vector of n dimensions, said n dimensions spanning up said vector space, whereas the value of each dimension of said vector corresponds to the frequency of occurrence of a certain term in the document corresponding to said vector;
  
  classifying said document into one of said plurality of classes by determining into which of said plurality of subspaces of said vector space said vector falls and identifying said document as belonging to a certain class which corresponds to the subspace into which said vector falls; and
  
  calculating a confidence level for the classification of said document as belonging to said certain class based on the distances between the vector representing said document and all hyperplanes surrounding said subspace which corresponds to said certain class normalized by the corresponding margins such that a document which lies outside said margins is assigned a confidence level of “1” and a document which falls into said margins is assigned a value between “0” and “1”.
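For the two-class case, the step of calculating a hyperplane with a maximum margin that contains none of the training vectors is commonly written as the hard-margin optimization problem below. This is the standard textbook formulation, offered as an illustration only; the claims do not prescribe it. The symbols are assumptions: \mathbf{x}_i is the term-frequency vector of the i-th preclassified document, y_i \in \{-1, +1\} is its class label, m is the number of preclassified documents, and \mathbf{w}, b define the hyperplane.

```latex
\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, m

% Margin half-width, and distance of a new vector x to the hyperplane:
\gamma = \frac{1}{\lVert \mathbf{w} \rVert},
\qquad
d(\mathbf{x}) = \frac{\lvert \mathbf{w} \cdot \mathbf{x} + b \rvert}{\lVert \mathbf{w} \rVert}

% Distance normalized by the margin, clipped at 1:
c(\mathbf{x}) = \min\!\left( 1,\ \frac{d(\mathbf{x})}{\gamma} \right)
             = \min\!\left( 1,\ \lvert \mathbf{w} \cdot \mathbf{x} + b \rvert \right)
```

In this formulation a vector outside the margin has c(\mathbf{x}) = 1 and a vector inside it has a value between 0 and 1, consistent with the confidence level recited in the claims.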

Current Assignee: Hyland Switzerland, Sarl (HSI Holdings II Incorporated)
Original Assignee: BDGB Enterprise Software SARL (Ninestar Corporation)
Inventors: Rujan, Pal; Urbschat, Harry
Primary Examiner(s): STORK, KYLE R
Application Number: US12/208,088
Publication Number: US 20090216693A1
Time in Patent Office: 1,476 Days
Field of Search: 715/234, 715/243, 715/254, 715/273
US Class Current: 715/273
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
G06F 18/2411   based on the proximity to a...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
