Discriminative feature selection for data sequences

US 20040153307A1
Filed: 03/22/2004
Published: 08/05/2004
Est. Priority Date: 03/30/2001
Status: Abandoned Application

Abstract

A discriminative feature selection method for selecting a set of features from a set of training data sequences is described. The training data sequences are generated by at least two data sources, and each data sequence consists of a sequence of data symbols taken from an alphabet. The method first builds a suffix tree from the training data. The suffix tree contains only those suffixes of the data sequences whose empirical probability of occurrence, from at least one of the sources, is greater than a first predetermined threshold. Next, the suffix tree is pruned of all suffixes for which there exists in the suffix tree a shorter suffix having equivalent predictive capability for all of the data sources.
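
The tree-building step can be illustrated with a short Python sketch. It is not taken from the application: the function name `build_suffix_set`, the flat dictionary standing in for a tree, and the definition of empirical probability as the relative frequency of a subsequence among all windows of the same length within one source are assumptions made for illustration. The threshold test uses "not less than", following the wording of claim 3.

```python
from collections import Counter, defaultdict

def build_suffix_set(training, max_len, p_min):
    """Level-wise candidate generation (hypothetical helper).

    training: dict mapping source name -> list of sequences (strings).
    Returns a dict: kept subsequence -> {source: empirical probability}.
    """
    # Start from the empty suffix, which is always retained.
    kept = {"": {src: 1.0 for src in training}}
    for length in range(1, max_len + 1):
        probs = defaultdict(dict)
        for src, seqs in training.items():
            # Count every window of the current length in this source.
            counts = Counter(
                seq[i:i + length]
                for seq in seqs
                for i in range(len(seq) - length + 1)
            )
            total = sum(counts.values())
            for s, c in counts.items():
                probs[s][src] = c / total
        for s, per_src in probs.items():
            # Keep s if it is frequent enough in at least one source.
            if max(per_src.values()) >= p_min:
                kept[s] = {src: per_src.get(src, 0.0) for src in training}
    return kept

if __name__ == "__main__":
    demo = {"A": ["abab"], "B": ["bbbb"]}
    print(sorted(build_suffix_set(demo, max_len=2, p_min=0.4)))
```

With the toy data above, `"ba"` is dropped because its relative frequency stays below the threshold in both sources, while `"ab"` and `"bb"` are each retained on the strength of a single source.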

15 Claims

1. A discriminative feature selection method for selecting a set of features from training data comprising a plurality of data sequences, said data sequences being generated from at least two data sources, and wherein each data sequence comprises a sequence of data symbols from an alphabet, said method comprising:
- building a suffix tree from said training data, said suffix tree comprising suffixes of said data sequences having an empirical probability of occurrence from at least one of said sources greater than a first predetermined threshold; and
  
  pruning from said suffix tree all suffixes for which there exists in said suffix tree a shorter suffix having equivalent predictive capability for all of said data sources.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. A discriminative feature selection method according to claim 1, comprising using said suffix tree to determine a source of a test sequence.
  - 3. A discriminative feature selection method according to claim 1, wherein building said suffix tree comprises:
    - initializing said tree to include an empty suffix;
      
      initializing a subsequence length to one; and
      
      for every data suffix of said length in said training data, performing a tree generation iteration comprising:
      
      for every suffix of said length within said training data, and for each of said sources, estimating an empirical probability of occurrence of said suffix given said source;
      
      if the empirical probability of occurrence of said suffix given one of said sources is not less than said first threshold, adding said suffix to said suffix tree;
      
      if said length is less than a predetermined maximum length, incrementing said length by one and performing a further tree generation iteration; and
      
      if said length equals a predetermined maximum length, discontinuing said suffix tree building process.
  - 4. A discriminative feature selection method according to claim 1, wherein a shorter suffix has equivalent predictive capability as a longer suffix if a Kullback-Leibler divergence between an empirical probability of said alphabet given said longer suffix and an empirical probability of said alphabet given said shorter suffix is less than a second predetermined threshold, for all of said sources.
  - 5. A discriminative feature selection method according to claim 1, wherein pruning said suffix tree comprises:
    - for all suffixes in said suffix tree estimating a conditional mutual information between said alphabet and said sources given said suffix;
      
      setting a length equal to a predetermined maximum length; and
      
      performing a pruning iteration, said iteration comprising the steps of:
      
      for every suffix of said length within said suffix tree, performing the steps of:
      
      selecting a spanned-tree spanned by said suffix;
      
      determining a maximum conditional information, comprising a maximum of conditional mutual information of all suffixes within said spanned-tree; and
      
      if a difference between said maximum conditional information and a conditional information of a suffix shorter than said length within said spanned-tree is no greater than a second predetermined threshold, removing said suffix from said suffix tree;
      
      if said length equals one, discontinuing said pruning process; and
      
      if said length is greater than one, decrementing said length and performing a further pruning iteration.
  - 6. A discriminative feature selection method according to claim 1, wherein said data sequences comprise sequences of amino acids, and wherein said data sources comprise protein families.
  - 7. A discriminative feature selection method according to claim 1, wherein said data sequences comprise sequences of nucleotides, and said data sources comprise a positive data source generating nucleotide sequences which indicate binding sites within a gene, and a negative data source generating random sequences of nucleotides.
  - 8. A discriminative feature selection method according to claim 7, wherein said suffix tree is built only for data sequences having a probability of occurrence from said positive source greater than said first predetermined threshold.
  - 9. A discriminative feature selection method according to claim 1, wherein said data sequences comprise sequences of text characters, and wherein said data sources comprise text datasets.
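
The equivalence test of claim 4 can be sketched in Python as follows. This is an illustrative reading, not the applicants' implementation: the helper names, the additive smoothing constant `pseudo` (used only to keep the divergence finite), and the choice to treat every trailing substring of a suffix, including the empty one, as a candidate "shorter suffix" are all assumptions.

```python
import math
from collections import Counter

def next_symbol_dist(seqs, context, alphabet, pseudo=1e-3):
    """Smoothed empirical P(next symbol | context) over the given sequences."""
    counts = Counter()
    n = len(context)
    for seq in seqs:
        for i in range(n, len(seq)):
            if seq[i - n:i] == context:
                counts[seq[i]] += 1
    total = sum(counts.values()) + pseudo * len(alphabet)
    return {a: (counts[a] + pseudo) / total for a in alphabet}

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p)

def prune(contexts, training, alphabet, eps):
    """Drop a context when some shorter retained suffix of it predicts
    the next symbol almost identically (KL < eps) for every source."""
    kept = set(contexts)
    for s in sorted(contexts, key=len, reverse=True):  # longest first
        shorter = [s[k:] for k in range(1, len(s) + 1) if s[k:] in kept]
        for t in shorter:
            if all(
                kl(next_symbol_dist(seqs, s, alphabet),
                   next_symbol_dist(seqs, t, alphabet)) < eps
                for seqs in training.values()
            ):
                kept.discard(s)
                break
    return kept

if __name__ == "__main__":
    training = {"A": ["abababab"], "B": ["aabaabaab"]}
    print(sorted(prune({"", "a", "b", "aa", "ab"}, training, "ab", eps=0.05)))
```

In this toy run the context `"ab"` is removed because `"b"` predicts the next symbol identically in both sources, whereas `"aa"` survives because in source B it predicts `b` far more sharply than `"a"` or the empty context does.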

10. A discriminative feature selector, for selecting a set of features from training data comprising a plurality of data sequences, said data sequences being generated from at least two data sources, and wherein each data sequence comprises a sequence of data symbols from an alphabet, the feature selector comprising:
- a tree generator for building a suffix tree from said training data, said suffix tree comprising suffixes of said data sequences having a probability of occurrence from at least one of said sources greater than a first predetermined threshold; and
  
  a pruner for pruning from said suffix tree all suffixes for which there exists in said suffix tree a shorter suffix having equivalent predictive capability.
- Dependent claims (11, 12, 13, 14, 15):
  - 11. A discriminative feature selector according to claim 10, further comprising a source determiner for using said suffix tree to determine a source of a test sequence.
  - 12. A discriminative feature selector according to claim 10, wherein said data sequences comprise sequences of amino acids, and wherein said data sources comprise protein families.
  - 13. A discriminative feature selector according to claim 10, wherein said data sequences comprise sequences of nucleotides, and said data sources comprise a positive data source generating nucleotide sequences which indicate binding sites within a gene, and a negative data source generating random sequences of nucleotides.
  - 14. A discriminative feature selector according to claim 13, wherein said suffix tree is built only for data sequences having a probability of occurrence from said positive source greater than said first predetermined threshold.
  - 15. A discriminative feature selector according to claim 10, wherein said data sequences comprise sequences of text characters, and wherein said data sources comprise text datasets.
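
The claims do not state how the source determiner of claims 2 and 11 uses the tree. One plausible reading, sketched below in Python purely as an assumption, scores a test sequence under each source by predicting every symbol from the longest retained context and then picks the source with the highest log-likelihood. The function names, the smoothing constant `pseudo`, and the requirement that the empty context be among the retained contexts are hypothetical choices, and every test symbol is assumed to belong to `alphabet`.

```python
import math
from collections import Counter

def train_models(training, contexts, alphabet, pseudo=0.5):
    """Per source: smoothed P(next symbol | context) for each retained context."""
    models = {}
    for src, seqs in training.items():
        counts = {c: Counter() for c in contexts}
        for seq in seqs:
            for i, sym in enumerate(seq):
                for c in contexts:
                    if i >= len(c) and seq[i - len(c):i] == c:
                        counts[c][sym] += 1
        models[src] = {
            c: {a: (cnt[a] + pseudo) / (sum(cnt.values()) + pseudo * len(alphabet))
                for a in alphabet}
            for c, cnt in counts.items()
        }
    return models

def longest_context(history, contexts, max_len):
    """Longest trailing substring of history that is a retained context."""
    for n in range(min(max_len, len(history)), -1, -1):
        c = history[len(history) - n:]
        if c in contexts:
            return c
    raise ValueError("contexts must contain the empty context ''")

def determine_source(test_seq, models, contexts):
    """Return (best source, per-source log-likelihoods) for test_seq."""
    max_len = max(len(c) for c in contexts)
    scores = {}
    for src, model in models.items():
        ll = 0.0
        for i, sym in enumerate(test_seq):
            c = longest_context(test_seq[:i], contexts, max_len)
            ll += math.log(model[c][sym])
        scores[src] = ll
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    training = {"A": ["abababab"], "B": ["aabaabaab"]}
    contexts = {"", "a", "b", "aa"}
    models = train_models(training, contexts, "ab")
    print(determine_source("ababab", models, contexts)[0])
    print(determine_source("aabaab", models, contexts)[0])
```

Keeping only a small set of discriminative contexts means each test symbol is scored with a single dictionary lookup per source, which is the practical benefit of pruning the tree before classification.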

Current Assignee: Yissum Research Development Company of the Hebrew University of Jerusalem Ltd. (Hebrew University of Jerusalem)
Original Assignee: Yissum Research Development Company of the Hebrew University of Jerusalem Ltd. (Hebrew University of Jerusalem)
Inventors: Slonim, Noam; Tishby, Naftali; Fine, Shai
Application Number: US10/471,757
Publication Number: US 20040153307A1
US Class Current: 704/4
CPC Class Codes: G06F 40/289 Phrasal analysis, e.g. fini...
