Method and apparatus for automatic comparison of data sequences using local and global relationships

US 8,271,403 B2
Filed: 12/08/2006
Issued: 09/18/2012
Est. Priority Date: 12/09/2005
Status: Expired due to Fees

Abstract

The invention concerns a method and an apparatus for automatic comparison of at least two data sequences, characterized by: an evaluation of a local relationship between any pair of subsequences in two or more sequences; and an evaluation of a global relationship by aggregation of the evaluations of said local relationships.

44 Citations

8 Claims

1. A method for automatic byte stream comparison of at least two data sequences, a first and second data sequence comprising one or more of symbols, images, text, ASCII characters, genetic data, protein data, bytes, binary data, or tokens as objects, the method comprising the steps of:
- performing an evaluation of:
  
  a) a local relationship between any pair of subsequences in two or more sequences received from a byte stream, wherein subsequences for evaluation of the local relationship are specified on a computerized detector by a subsequence selection mode comprising one of:
  
  words, wherein the words are subsequences separated by a given set of delimiters;
  
  n-grams, wherein the n-grams are overlapping subsequences of a given length n; and
  
  all possible subsequences of two or more sequences;
  
  b) performing an evaluation of a global relationship by aggregation of a plurality of evaluations of said local relationships, wherein evaluation of the global relationships is performed by one of the following data structures or a representation thereof:
  
  a hash table or indexed table;
  
  a trie or compacted trie;
  
  a suffix tree or suffix array; and
  
  a generalized suffix tree or generalized suffix array; and
  
  c) wherein the totality of local and global relationships comprises a measure s for similarity or dissimilarity of two or more sequences.
  - 2. The method according to claim 1, wherein the totality of local and global relationships comprises one of the following similarity or dissimilarity measures s:
    - Manhattan or taxicab distance;
      
      Euclidean distance;
      
      Minkowski distance;
      
      Canberra distance;
      
      Chi-Square distance;
      
      Chebyshev distance;
      
      Geodesic distance;
      
      Jensen or symmetric Kullback-Leibler divergence;
      
      Position-independent Hamming distance;
      
      1st and 2nd Kulczynski similarity coefficient;
      
      Czekanowski or Sorensen-Dice similarity coefficient;
      
      Jaccard similarity coefficient;
      
      Simpson similarity coefficient;
      
      Sokal-Sneath or Anderberg similarity coefficient;
      
      Otsuka or Ochiai similarity coefficient; and
      
      Braun-Blanquet similarity coefficient.
  - 3. The method according to claim 1, wherein the first and second data sequences (X, Y) each contain a plurality of objects to be detected;
    - a similarity measure s is automatically computed for subsequences in the first and second data sequences; and
      
      depending on the similarity measure s, further processing steps are taken.
  - 4. The method according to claim 3, wherein the first and second data sequences (X, Y) comprise data transmitted between computers in a computer network and depending on an on-line computation of the similarity measure s, at least one signal indicating an abnormal data stream or an intrusion is automatically generated.
  - 5. The method according to claim 4, wherein the computers are part of a network for transmission of monetary information.
  - 6. The method according to claim 1, wherein the first and second data sequences comprise one or more of genetic data, data exchanged between computers, text, image data, binary data, and symbols.
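
The method of claims 1 to 4 can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the hash table option (a `Counter`), treats the local relationship as the comparison of one subsequence's counts in both sequences, and treats the global relationship as the aggregation of those comparisons over all subsequences. The function names, the default delimiter set, the set-based form of the Jaccard coefficient, and the `threshold` value are all assumptions introduced for illustration.

```python
from collections import Counter


def ngram_counts(seq: bytes, n: int) -> Counter:
    """Count overlapping subsequences of length n in a hash table."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def word_counts(seq: bytes, delimiters: bytes = b" \t\r\n") -> Counter:
    """Count subsequences separated by any byte in the delimiter set."""
    table = Counter()
    start = 0
    for i, byte in enumerate(seq):
        if byte in delimiters:
            if i > start:
                table[seq[start:i]] += 1
            start = i + 1
    if start < len(seq):
        table[seq[start:]] += 1
    return table


def manhattan(cx: Counter, cy: Counter) -> int:
    """Aggregate the per-subsequence count differences (a dissimilarity)."""
    return sum(abs(cx[w] - cy[w]) for w in cx.keys() | cy.keys())


def jaccard(cx: Counter, cy: Counter) -> float:
    """Shared subsequences divided by all subsequences (a similarity)."""
    union = cx.keys() | cy.keys()
    if not union:
        return 1.0  # assumption: two empty sequences count as identical
    return len(cx.keys() & cy.keys()) / len(union)


def is_anomalous(payload: bytes, reference: Counter,
                 n: int = 3, threshold: float = 0.5) -> bool:
    """Signal an abnormal payload when similarity to a reference is low."""
    return jaccard(ngram_counts(payload, n), reference) < threshold


if __name__ == "__main__":
    x = ngram_counts(b"abab", 2)
    y = ngram_counts(b"abba", 2)
    print(manhattan(x, y), jaccard(x, y))
```

For the two example sequences, the 2-gram tables differ by one count of `ab` and one count of `bb`, so the Manhattan distance is 2, while two of the three distinct 2-grams are shared.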

7. An apparatus for the comparison of data sequences comprising:
- a computerized storage device programmed for representing data sequences in a data structure selected from one of:
  
  a hash table or indexed table;
  
  a trie or compacted trie;
  
  a suffix tree or suffix array; and
  
  a generalized suffix tree or generalized suffix array;
  
  a computer processor for performing an evaluation of a local relationship between any pair of subsequences in said data sequences;
  
  a computer processor for performing an evaluation of a global relationship by aggregation of a plurality of evaluations of said local relationships;
  
  a computer processor for computation of a totality of the local and global relationship, wherein at least one of the first and second data sequences comprise one or more of symbols, images, text, ASCII characters, genetic data, protein data, bytes, binary data, or tokens as objects for which the local relationship is evaluated; and
  
  a computer processor for generating from the totality of local and global relationships a measure s for similarity or dissimilarity of two or more sequences.
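
Claim 7 also lists a suffix array among the permitted data structures. The sketch below shows one way such a representation could answer the question a local evaluation needs, namely how often a given subsequence occurs in a sequence. It is an assumption-laden illustration rather than the claimed apparatus: the construction is the naive sort of all suffixes (simple but slow for long inputs), and the helper names are hypothetical.

```python
def suffix_array(seq: bytes) -> list:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction by sorting the suffixes directly; adequate for a
    short illustration, not for large inputs.
    """
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def count_occurrences(seq: bytes, sa: list, pattern: bytes) -> int:
    """Count occurrences of pattern using two binary searches over sa."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


if __name__ == "__main__":
    data = b"banana"
    sa = suffix_array(data)
    print(count_occurrences(data, sa, b"an"))
```

Because every subsequence is a prefix of some suffix, the same structure serves the "all possible subsequences" selection mode without fixing a length n in advance.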

8. A system for processing and analysis of data sequences comprising:
- means for input of data sequences comprising a data structure selected from one of:
  
  a hash table or indexed table;
  
  a trie or compacted trie;
  
  a suffix tree or suffix array; and
  
  a generalized suffix tree or generalized suffix array;
  
  means for comparison of data sequences comprising one of:
  
  Manhattan or taxicab distance;
  
  Euclidean distance;
  
  Minkowski distance;
  
  Canberra distance;
  
  Chi-Square distance;
  
  Chebyshev distance;
  
  Geodesic distance;
  
  Jensen or symmetric Kullback-Leibler divergence;
  
  Position-independent Hamming distance;
  
  1st and 2nd Kulczynski similarity coefficient;
  
  Czekanowski or Sorensen-Dice similarity coefficient;
  
  Jaccard similarity coefficient;
  
  Simpson similarity coefficient;
  
  Sokal-Sneath or Anderberg similarity coefficient;
  
  Otsuka or Ochiai similarity coefficient; and
  
  Braun-Blanquet similarity coefficient;
  
  means for analysis of data sequences including classification, regression, novelty detection, ranking, clustering, and structural inference;
  
  means for reporting of results of the analysis; and
  
  means for generating from the totality of local and global relationships a measure s for similarity or dissimilarity of two or more sequences.
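
The stages of claim 8 (input, comparison, analysis, reporting) can be sketched as a small pipeline. This is only one possible reading, under stated assumptions: n-gram counts in a hash table as the input representation, three of the listed distances in their textbook forms, and nearest-neighbour classification as the analysis step. The names `MEASURES` and `classify` are hypothetical, as is the choice of n = 2.

```python
import math
from collections import Counter


def ngram_counts(seq: bytes, n: int) -> Counter:
    """Input stage: represent a sequence as a hash table of n-gram counts."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def euclidean(cx: Counter, cy: Counter) -> float:
    return math.sqrt(sum((cx[w] - cy[w]) ** 2 for w in cx.keys() | cy.keys()))


def chebyshev(cx: Counter, cy: Counter) -> int:
    return max((abs(cx[w] - cy[w]) for w in cx.keys() | cy.keys()), default=0)


def canberra(cx: Counter, cy: Counter) -> float:
    # Every key in the union has a positive count in at least one table,
    # so the denominator is never zero.
    return sum(abs(cx[w] - cy[w]) / (cx[w] + cy[w])
               for w in cx.keys() | cy.keys())


# Comparison stage: measures selectable by name.
MEASURES = {
    "euclidean": euclidean,
    "chebyshev": chebyshev,
    "canberra": canberra,
}


def classify(query: bytes, labelled: list, measure: str = "euclidean",
             n: int = 2) -> str:
    """Analysis stage: return the label of the least dissimilar example."""
    dist = MEASURES[measure]
    q = ngram_counts(query, n)
    _, label = min(labelled, key=lambda ex: dist(q, ngram_counts(ex[0], n)))
    return label


if __name__ == "__main__":
    training = [(b"aaaa", "A"), (b"bbbb", "B")]
    # Reporting stage: print the predicted label.
    print(classify(b"aaab", training))
```

Other listed measures could be added to the dispatch table in the same way, since each one only needs the two count tables as input.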

Current Assignee
Fraunhofer Gesellschaft Zur Foerderung Der Angewandten Forschung E.V.
Original Assignee
Fraunhofer Gesellschaft Zur Foerderung Der Angewandten Forschung E.V.
Inventors
Rieck, Konrad; Laskov, Pavel; Mueller, Klaus-Robert; Duessel, Patrick
Primary Examiner(s)
VINCENT, DAVID ROBERT

Application Number

US12/096,126
Publication Number

US 20090024555A1
Time in Patent Office

2,111 Days
Field of Search

706/12, 706/45, 706/20
US Class Current

706/12
CPC Class Codes

G06F 18/22   Matching criteria, e.g. pro...

G06F 7/02   Comparing digital values G0...

G06N 20/10   using kernel methods, e.g. ...

G16B 30/00   ICT specially adapted for s...

G16B 30/10   Sequence alignment; Homolog...

G16B 40/00   ICT specially adapted for b...

G16B 40/20   Supervised data analysis

G16B 40/30   Unsupervised data analysis

H04L 63/1416   Event detection, e.g. attac...

H04L 63/1441   Countermeasures against mal...

H04L 9/3231   Biological data, e.g. finge...

H04L 9/3236   using cryptographic hash fu...
