Method and apparatus for classification of high dimensional data

US 6,563,952 B1
Filed: 10/18/1999
Issued: 05/13/2003
Est. Priority Date: 10/18/1999
Status: Expired due to Fees

First Claim

Patent Images

1. A method of classifying high dimensional data, the method comprising the steps of:

(a) storing a raw data training set in a memory, said raw data including a multiplicity of entries each having a plurality of attributes;

(b) flattening said raw data training set by converting each raw data entry into a respective binary string;

(c) classifying a specific flattened data entry not in the training set based on said flattened training set by building a decision tree based on the attributes of the training set entries.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

The present invention is an apparatus and method for classifying high-dimensional sparse datasets. A raw data training set is flattened by converting it from categorical representation to a boolean representation. The flattened data is then used to build a class model on which new data not in the training set may be classified. In one embodiment, the class model takes the form of a decision tree, and large itemsets and cluster information are used as attributes for classification. In another embodiment, the class model is based on the nearest neighbors of the data to be classified. An advantage of the invention is that, by flattening the data, classification accuracy is increased by eliminating artificial ordering induced on the attributes. Another advantage is that the use of large itemsets and clustering increases classification accuracy.

Citations

15 Claims

1. A method of classifying high dimensional data, the method comprising the steps of:
- (a) storing a raw data training set in a memory, said raw data including a multiplicity of entries each having a plurality of attributes;
  
  (b) flattening said raw data training set by converting each raw data entry into a respective binary string;
  
  (c) classifying a specific flattened data entry not in the training set based on said flattened training set by building a decision tree based on the attributes of the training set entries.
- View Dependent Claims (2)
- - 2. The method of claim 1, wherein classification step (c) is further performed by:

3. A method of classifying high dimensional data, the method comprising the steps of:
- (a) storing a raw data training set in a memory, said raw data including a multiplicity of entries each having a plurality of attributes;
  
  (b) flattening said raw data training set by converting each raw data entry into a respective binary string;
  
  (c) classifying a specific flattened data entry not in the training set based on said flattened training set by;
  
  (i) identifying large itemsets with support above a predetermined threshold;
  
  (ii) building a decision tree based at least in part on said large itemsets.

4. A method of classifying high dimensional data, the method comprising the steps of:
- (a) storing a raw data training set in a memory, said raw data including a multiplicity of entries each having a plurality of attributes;
  
  (b) flattening said raw data training set by converting each raw data entry into a respective binary string;
  
  (c) classifying a specific flattened data entry not in the training set based on said flattened training set by;
  
  (i) determining m nearest neighbors of a specific entry to be classified, where m is a predetermined value;
  
  (ii) choosing a classification for said specific entry based on the classification of the m nearest neighbors.
- View Dependent Claims (5)
- - 5. The method of claim 4, wherein the classification for said specific entry is determined by:

6. A method of classifying high dimensional data, the method comprising the steps of:
- (a) storing a raw data training set in a memory, said raw data including a multiplicity of entries each having a plurality of attributes;
  
  (b) flattening said raw data training set by converting each raw data entry into a respective binary string;
  
  (c) classifying a specific flattened data entry not in the training set based on said flattened training set by;
  
  (i) determining m nearest neighbors of a specific entry to be classified, where m is a predetermined value;
  
  (ii) choosing a classification for said specific entry based on the classification of the m nearest neighbors, wherein the classification for said specific entry is determined by;
  
  when all m nearest neighbors belong to a single class, said single class is assigned to the specific entry;
  
  when a majority of the m nearest neighbors belong to a single class, said single class is assigned to the specific entry;
  
  when there exists no class majority among the m nearest neighbors, at least all classes with the highest occurrence among the m nearest neighbor are reported to a user.

7. A method of classifying high dimensional data, the method comprising the steps of:
- (a) storing a raw data training set in a memory, said raw data including a multiplicity of entries each having a plurality of attributes;
  
  (b) flattening said raw data training set by converting each raw data entry into a respective binary string;
  
  (c) classifying a specific flattened data entry not in the training set based on said flattened training set by;
  
  (i) determining m nearest neighbors of a specific entry to be classified, where m is a predetermined value;
  
  (ii) choosing a classification for said specific entry based on the classification of the m nearest neighbors, wherein the classification for said specific entry is determined by;
  
  when all m nearest neighbors belong to a single class, assigning the single class to the specific entry;
  
  when the m nearest neighbors belong to at least two classes, weighting by using an inverse frequency of the at least two classes in the training set, generating weighted votes, and assigning the class having the greatest weighted votes to the specific entry.

8. A computer system for classifying high-dimensional data comprising:
- a memory;
  
  data, including a training set comprised of a multiplicity of entries each having a plurality of attributes, and a specific data entry not in the training set;
  
  communications procedures to receive the data as input;
  
  control procedures to flatten the training set and the specific data entry, and to classify the flattened specific data entry based on the flattened training set; and
  
  a processor coupled to the memory and configured to execute said control and communications procedures, wherein the processor is further configured to classify the specific data entry using a decision tree based on the attributes.
- View Dependent Claims (9)
- - 9. The apparatus of claim 8, wherein the attributes include large itemsets.

10. A computer system for classifying high-dimensional data comprising:
- a memory;
  
  data, including a training set comprised of a, multiplicity of entries each having a plurality of attributes, and a specific data entry not in the training set;
  
  communications procedures to receive the data as input;
  
  control procedures to flatten the training set and the specific data entry, and to classify the flattened specific data entry based on the flattened training set; and
  
  a processor coupled to the memory and configured to execute said control and communications procedures, wherein the processor is further configured to classify the specific data entry using a nearest neighbor class model.
- View Dependent Claims (11)
- - 11. The apparatus of claim 10, wherein the processor is further configured to determine the classification of the specific entry by:

12. A computer system for classifying high-dimensional data comprising:
- a memory;
  
  data, including a training set comprised of a multiplicity of entries each having a plurality of attributes, and a specific data entry not in the training set;
  
  communications procedures to receive the data as input;
  
  control procedures to flatten the training set and the specific data entry, and to classify the flattened specific data entry based on the flattened training set; and
  
  a processor coupled to the memory and configured to execute said control and communications procedures, wherein the processor is further configured to classify the specific data entry using a nearest neighbor class model and to determine the classification of the specific entry by;
  
  when all m nearest neighbors belong to a single class, said single class is assigned to the specific entry;
  
  when a majority of the m nearest neighbors belong to a single class, said single class is assigned to the specific entry;
  
  when there exists no class majority among the m nearest neighbors, at least all classes with the highest occurrence among the m nearest neighbors are reported to a user.

13. A computer system for classifying high-dimensional data comprising:
- a memory;
  
  data, including a training set comprised of a multiplicity of entries each having a plurality of attributes, and a specific data entry not in the training set;
  
  communications procedures to receive the data as input;
  
  control procedures to flatten the training set and the specific data entry, and to classify the flattened specific data entry based on the flattened training set; and
  
  a processor coupled to the memory and configured to execute said control and communications procedures, wherein the processor is further configured to classify the specific data entry using a nearest neighbor class model and to determine the classification of the specific entry by;
  
  when all m nearest neighbors belong to a single class, assigning the single class to the specific entry;
  
  when the m nearest neighbors belong to at least two classes, weighting by using an inverse frequency of the at least two classes in the training set, generating weighted votes, and assigning the class having the greatest weighted votes to the specific entry.

14. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism, comprising:
- a program module that directs the computer system including a processor to function in a specified manner for classifying high-dimensional data, the program module including instructions for;
  
  (a) storing a raw data training set in a memory, said raw data including a multiplicity of entries each having a plurality of attributes;
  
  (b) flattening said raw data training set by converting each raw data entry into a respective binary string; and
  
  (c) classifying a specific flattened data entry not in the training set based on said flattened training set by;
  
  (i) identifying large item sets with support above a predetermined threshold; and
  
  (ii) building a decision tree based at least in part on said large item sets.

15. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism, comprising:
- a program module that directs the computer system including a processor to function in a specified manner for classifying high-dimensional data, the program module including instructions for;
  
  (a) storing a raw data training set in a memory, said raw data including a multiplicity of entries each having a plurality of attributes;
  
  (b) flattening said raw data training set by converting each raw data entry into a respective binary string;
  
  (c) classifying a specific flattened data entry not in the training set based on said flattened training set by;
  
  (i) determining m nearest neighbors of a specific entry to be classified, where m is a predetermined value;
  
  (ii) choosing a classification for said specific entry based on the classification of the m nearest neighbors.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Hitachi America Limited (Hitachi, Ltd.)
Original Assignee
Hitachi America Limited (Hitachi, Ltd.)
Inventors
Srivastava, Anurag, Ranka, Sanjay, Singh, Vineet, Ramkumar, G. D.
Primary Examiner(s)
CHANG, JON CARLTON

Application Number

US09/420,252
Time in Patent Office

1,303 Days
Field of Search

382/224, 382/225, 382/226, 382/227, 382/228
US Class Current

382/225
CPC Class Codes

G06F 18/24147 Distances to closest patter...

Method and apparatus for classification of high dimensional data

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

Citations

15 Claims

Specification

Solutions

Use Cases

Quick Links

Method and apparatus for classification of high dimensional data

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

15 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links