Method and apparatus for automatic comparison of data sequences using local and global relationships
First Claim
1. A method for automatic byte stream comparison of at least two data sequences, a first and second data sequence comprising one or more of symbols, images, text, ASCII characters, genetic data, protein data, bytes, binary data, or tokens as objects are, the method comprising the steps of:
- performing an evaluation of;
a) a local relationship between any pair of subsequences in two or more sequences received from a byte stream, wherein subsequences for evaluation of the local relationship are specified on a computerized detector by a subsequence selection mode comprising one of;
words, wherein the words are subsequences separated by a given set of delimiters;
n-grams, wherein the n-grams are overlapping subsequences of a given length n; and
all possible subsequences of two or more sequences;
b) performing an evaluation of a global relationship by aggregation of a plurality of evaluations of said local relationships, wherein evaluation of the global relationships is performed by one of the following data structures or a representation thereof;
a hash table or indexed table;
a trie or compacted trie;
a suffix tree or suffix array; and
a generalized suffix tree or generalized suffix array; and
c) wherein the totality of local and global relationships comprises a measure s for similarity or dissimilarity of two or more sequences.
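The steps of claim 1 can be illustrated with a minimal sketch. It assumes the hash-table option from step b) (here a Python `Counter`), the words and n-grams selection modes from step a), and an absolute difference of occurrence counts as the local relationship, summed into a Manhattan-style measure s. The function names `words`, `ngrams`, and `dissimilarity` are hypothetical, and the choice of count difference is an assumption made for illustration, not the claimed method itself.

```python
from collections import Counter


def words(data: bytes, delimiters: bytes = b" \t\r\n") -> list[bytes]:
    """Selection mode 'words': subsequences separated by a set of delimiter bytes."""
    out, current = [], bytearray()
    for byte in data:
        if byte in delimiters:
            if current:
                out.append(bytes(current))
                current.clear()
        else:
            current.append(byte)
    if current:
        out.append(bytes(current))
    return out


def ngrams(data: bytes, n: int) -> list[bytes]:
    """Selection mode 'n-grams': overlapping subsequences of length n."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def dissimilarity(x: bytes, y: bytes, n: int = 2) -> int:
    """Aggregate local count differences into one global measure s.

    Assumption: the local relationship for a subsequence w is
    |count_x(w) - count_y(w)|, and the global relationship is the sum
    over all subsequences stored in the two hash tables.
    """
    cx, cy = Counter(ngrams(x, n)), Counter(ngrams(y, n))
    return sum(abs(cx[w] - cy[w]) for w in cx.keys() | cy.keys())
```

Under these assumptions, identical sequences give s = 0, and s grows as the n-gram counts of the two sequences diverge. The same aggregation would apply to the output of `words` in place of `ngrams`.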
Abstract
The invention is concerned with a method and an apparatus for automatic comparison of at least two data sequences, characterized by: an evaluation of a local relationship between any pair of subsequences in two or more sequences; and an evaluation of a global relationship by means of aggregation of the evaluations of said local relationships.
44 Citations
8 Claims
1. A method for automatic byte stream comparison of at least two data sequences, a first and second data sequence comprising one or more of symbols, images, text, ASCII characters, genetic data, protein data, bytes, binary data, or tokens as objects are, the method comprising the steps of:
- performing an evaluation of;
a) a local relationship between any pair of subsequences in two or more sequences received from a byte stream, wherein subsequences for evaluation of the local relationship are specified on a computerized detector by a subsequence selection mode comprising one of;
words, wherein the words are subsequences separated by a given set of delimiters;
n-grams, wherein the n-grams are overlapping subsequences of a given length n; and
all possible subsequences of two or more sequences;
b) performing an evaluation of a global relationship by aggregation of a plurality of evaluations of said local relationships, wherein evaluation of the global relationships is performed by one of the following data structures or a representation thereof;
a hash table or indexed table;
a trie or compacted trie;
a suffix tree or suffix array; and
a generalized suffix tree or generalized suffix array; and
c) wherein the totality of local and global relationships comprises a measure s for similarity or dissimilarity of two or more sequences.
Dependent claims: 2, 3, 4, 5, 6
7. An apparatus for the comparison of data sequences comprising:
- a computerized storage device programmed for representing data sequences in a data structure selected from one of;
a hash table or indexed table;
a trie or compacted trie;
a suffix tree or suffix array; and
a generalized suffix tree or generalized suffix array;
- a computer processor for performing an evaluation of a local relationship between any pair of subsequences in said data sequences;
- a computer processor for performing an evaluation of a global relationship by aggregation of a plurality of evaluations of said local relationships;
- a computer processor for computation of a totality of the local and global relationship, wherein at least one of the first and second data sequences comprise one or more of symbols, images, text, ASCII characters, genetic data, protein data, bytes, binary data, or tokens as objects for which the local relationship is evaluated; and
- a computer processor for generating from the totality of local and global relationships a measure s for similarity or dissimilarity of two or more sequences.
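As an illustration of the suffix array option named in claim 7, the following minimal sketch represents a byte sequence as a suffix array and uses it to count occurrences of a subsequence. The function names `build_suffix_array` and `count_occurrences` are hypothetical, and the naive sort-based construction is an assumption chosen for brevity; a practical apparatus would likely use a linear-time construction.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(data: bytes) -> list[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction by sorting the suffixes directly; simple but slow
    for long sequences.
    """
    return sorted(range(len(data)), key=lambda i: data[i:])


def count_occurrences(data: bytes, suffix_array: list[int], pattern: bytes) -> int:
    """Count how often `pattern` occurs in `data` using the suffix array.

    Truncating sorted suffixes to len(pattern) keeps them in non-decreasing
    order, so the matches form one contiguous run found by binary search.
    """
    m = len(pattern)
    prefixes = [data[i:i + m] for i in suffix_array]
    return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)
```

Counting the same subsequence in two stored sequences yields a pair of values that a local relationship could compare; how those values are compared and aggregated is left open here.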
8. A system for processing and analysis of data sequences comprising:
- means for input of data sequences comprising a data structure selected from one of;
a hash table or indexed table;
a trie or compacted trie;
a suffix tree or suffix array; and
a generalized suffix tree or generalized suffix array,
- means for comparison of data sequences comprising one of;
Manhattan or taxicab distance;
Euclidean distance;
Minkowski distance;
Canberra distance;
Chi-Square distance;
Chebyshev distance;
Geodesic distance;
Jensen or symmetric Kullback-Leibler divergence;
Position-independent Hamming distance;
1st and 2nd Kulczynski similarity coefficient;
Czekanowski or Sorensen-Dice similarity coefficient;
Jaccard similarity coefficient;
Simpson similarity coefficient;
Sokal-Sneath or Anderberg similarity coefficient;
Otsuka or Ochiai similarity coefficient; and
Braun-Blanquet similarity coefficient,
- means for analysis of data sequences including classification, regression, novelty detection, ranking, clustering, and structural inference;
- means for reporting of results of the analysis; and
- means for generating from the totality of local and global relationships a measure s for similarity or dissimilarity of two or more sequences.
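A few of the comparison measures listed in claim 8 can be sketched over subsequence counts held in hash tables. The sketch below uses the common textbook definitions: the distances operate on occurrence counts, and the Jaccard and Sorensen-Dice coefficients operate on the sets of distinct subsequences. These definitions and all function names are assumptions for illustration; the specification may define the measures differently.

```python
import math
from collections import Counter


def _pairs(cx: Counter, cy: Counter) -> list[tuple[int, int]]:
    """Align the two count tables over the union of their subsequences."""
    return [(cx[k], cy[k]) for k in cx.keys() | cy.keys()]


def manhattan(cx: Counter, cy: Counter) -> float:
    return sum(abs(a - b) for a, b in _pairs(cx, cy))


def euclidean(cx: Counter, cy: Counter) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in _pairs(cx, cy)))


def chebyshev(cx: Counter, cy: Counter) -> float:
    return max((abs(a - b) for a, b in _pairs(cx, cy)), default=0)


def canberra(cx: Counter, cy: Counter) -> float:
    # Terms with a + b == 0 are skipped to avoid division by zero.
    return sum(abs(a - b) / (a + b) for a, b in _pairs(cx, cy) if a + b > 0)


def jaccard(cx: Counter, cy: Counter) -> float:
    """Set-based similarity: shared subsequences over all subsequences."""
    union = cx.keys() | cy.keys()
    return len(cx.keys() & cy.keys()) / len(union) if union else 1.0


def sorensen_dice(cx: Counter, cy: Counter) -> float:
    """Set-based similarity: twice the shared count over the summed sizes."""
    total = len(cx) + len(cy)
    return 2 * len(cx.keys() & cy.keys()) / total if total else 1.0
```

The distances return 0 for identical count tables and grow with divergence, while the two coefficients return 1.0 for identical sets of subsequences and 0.0 for disjoint ones.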
Specification