Method for finding a reference token sequence in an original token string within a database of token strings using appended non-contiguous substrings

US 5,577,249 A
Filed: 08/08/1995
Issued: 11/19/1996
Est. Priority Date: 07/31/1992
Status: Expired due to Term

Abstract

This method non-sequentially compares a reference sequence of tokens to an original sequence of tokens to determine subsequences of tokens which exactly or similarly match. The method has a novel approach for creating a large number of indexes by partitioning strings of tokens into substrings, appending non-contiguous substrings together to form tuples, and creating indexes from the tuples. Indexes are created in this manner for both the original and reference strings. Techniques are also provided to approximately or exactly locate the substrings which were used to create the tuples and indexes from the original sequence of tokens. Original and reference indexes are compared and matches are tracked. Higher numbers of matches result in higher scores (votes) in a table and indicate a stronger similarity between the sequences on the original and reference strings. Using this method, the degree of similarity can also be determined. The method is useful when comparing a reference sequence of tokens to a large database of original strings of tokens. It has applications in the biological sciences (human genome mapping or analyzing proteins) and in image, speech, and music recognition.
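The flow described in the abstract can be illustrated with a minimal Python sketch. It is only one possible reading, not the patented implementation: the function names (`make_tuples`, `build_index`, `score`), the substring length `k`, the gap limit `max_gap`, and the choice to use the tuple text itself as the index are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def make_tuples(tokens, k=2, max_gap=3):
    """Yield tuples made by appending two length-k substrings of `tokens`.

    The second substring starts `gap` tokens after the first one ends, so
    gap == 0 gives contiguous substrings and gap > 0 non-contiguous ones.
    k and max_gap are illustrative parameters, not values from the patent.
    """
    n = len(tokens)
    for i in range(n - 2 * k + 1):
        for gap in range(max_gap + 1):
            j = i + k + gap
            if j + k > n:          # second substring would run off the end
                break
            yield tokens[i:i + k] + tokens[j:j + k]


def build_index(originals):
    """Map each tuple (used directly as the index) to the originals containing it."""
    index = defaultdict(set)
    for name, string in originals.items():
        for t in make_tuples(string):
            index[t].add(name)
    return index


def score(reference, index):
    """Count votes: one per distinct reference tuple found in each original."""
    votes = Counter()
    for t in set(make_tuples(reference)):
        for name in index.get(t, ()):
            votes[name] += 1
    return votes
```

Calling `score(reference, build_index(originals))` returns a vote count per original string; under these assumptions, a higher count suggests a stronger similarity to the reference.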

28 Claims

1. A method for finding a reference string of tokens in one or more original token strings within a database comprising the steps of:
- creating one or more original tuples for each of the original token strings in the database by;
  
  a. partitioning each original token string into three or more original substrings of contiguous tokens;
  
  b. appending together two or more original substrings of the original token string to form one or more original tuples associated with the original token string, at least one of the original tuples being formed by appending together two or more non-contiguous original substrings of the original token string;
  
  creating a unique original index for each original tuple created from the original token string by using an index algorithm, the original index being associated with the original token string from which the original tuple was created, each original index associated with information that is used to locate the original token string in the database containing the tuple from which the original index was derived and to determine the position of the matched reference sequence in the original token string;
  
  creating one or more reference tuples from the reference string of tokens by;
  
  c. partitioning the reference string of tokens into three or more reference substrings of contiguous tokens;
  
  d. appending together two or more reference substrings to form one or more reference tuples, at least one of the reference tuples being formed by appending together two or more non-contiguous reference substrings;
  
  creating a unique reference index for each reference tuple using the index algorithm;
  
  comparing at least one reference index to at least one original index;
  
  tracking the matches between the reference index and original index;
  
  selecting an original token string in the database based on the number of matches between one or more original indexes and one or more reference indexes.
- Dependent claims (2, 3, 4, 5, 6, 7, 8)
  - 2. A method of finding a reference string of tokens in one or more original token strings within a database, as in claim 1, where the original substrings of tokens are all of a fixed length.
  - 3. A method of finding a reference string of tokens in one or more original token strings within a database, as in claim 1, where the original substrings of tokens are of different lengths.
  - 4. A method of finding a reference string of tokens in one or more original token strings within a database, as in claim 1, where the original tuples formed are of a fixed length.
  - 5. A method of finding a reference string of tokens in one or more original token strings within a database, as in claim 1, where the original tuples formed are of different lengths.
  - 6. A method of finding a reference string of tokens in one or more original token strings within a database, as in claim 1, where the original tuples are formed by using the following rules:
    - each subsequent substring forming the original tuple has a starting position in tokens further away from a beginning token of the original token string than a prior starting position of all prior substrings forming the original tuple;
      
      no substring can be used to form a tuple if its starting position and length in tokens cause the tuple to exceed the length of the original token string;
      
      the starting tokens of two subsequent substrings in an original string must be between a minimum and a maximum radius of coherence of one another, the radius distances being measured in tokens.
  - 7. A method of finding a reference string of tokens in one or more original token strings within a database, as in claim 1, where the original index and the reference index are numbers.
  - 8. A method of finding a reference string of tokens in one or more original token strings within a database, as in claim 1, where the original index and the reference index are integers.
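
The three tuple-forming rules of claim 6 can be sketched as a small enumeration routine. This is an illustrative interpretation only: `enumerate_tuples` and its parameters (`sub_len`, `n_subs`, `r_min`, `r_max`) are hypothetical names, substrings are assumed to have a fixed length (as in claim 2), and the second rule is read as "every substring must end inside the string".

```python
def enumerate_tuples(tokens, sub_len, n_subs, r_min, r_max):
    """Return (start_positions, tuple_text) pairs for a string of tokens.

    Rule 1: each substring starts later than all earlier ones (r_min >= 1).
    Rule 2: a substring that would run past the end of the string is skipped.
    Rule 3: consecutive starts lie between r_min and r_max tokens apart
            (the minimum and maximum radius of coherence).
    """
    if r_min < 1:
        raise ValueError("r_min must be at least 1 so that starts strictly increase")
    n = len(tokens)
    results = []

    def extend(starts):
        if len(starts) == n_subs:
            text = "".join(tokens[s:s + sub_len] for s in starts)
            results.append((tuple(starts), text))
            return
        last = starts[-1]
        for nxt in range(last + r_min, last + r_max + 1):   # rules 1 and 3
            if nxt + sub_len > n:                           # rule 2
                break
            extend(starts + [nxt])

    for first in range(n - sub_len + 1):
        extend([first])
    return results
```

With `sub_len=2` and `r_min=3`, every tuple returned for a string is built from non-contiguous substrings, because consecutive substrings are always separated by at least one token.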

9. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database comprising the steps of:
- creating one or more original tuples for each of the original token strings in the database by;
  
  a. partitioning each original token string into three or more original substrings of contiguous tokens;
  
  b. appending together two or more original substrings of the original token string to form one or more original tuples associated with the original token string, at least one original tuple being formed by appending two or more non-contiguous original substrings of the original token string;
  
  creating a unique original index for each original tuple created from the original token string by using an index algorithm, the original index being associated with the original token string from which the original tuple was created;
  
  using the original index to point to a cell in a first memory look-up structure and storing in the cell an information record associated with the original string, the information record containing pointing information used to locate the original token string in the database containing the tuple from which the original index was derived and displacement information used to determine the position of the matched reference sequence in the original token string;
  
  creating one or more reference tuples from the reference string of tokens by;
  
  c. partitioning the reference string of tokens into three or more reference substrings of contiguous tokens;
  
  d. appending together two or more reference substrings to form one or more reference tuples, at least one of the reference tuples being formed by appending together two or more non-contiguous reference substrings;
  
  creating a unique reference index for each reference tuple using the index algorithm;
  
  comparing at least one reference index to at least one original index using the memory look-up structure;
  
  tracking the matches between the reference index and original index;
  
  storing the tracking results in a second memory look-up structure;
  
  selecting an original token string in the database based on the number of matches between one or more original indexes and one or more reference indexes.
- Dependent claims (10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22)
  - 10. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database, as in claim 9, where the cell contains a list of information records where the information records contain a reference to an original token string.
  - 11. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database, as in claim 9, where the cell contains a list of information records which include a reference to an original token string and associated displacement information.
  - 12. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database, as in claim 9, where the displacement information is computed based on a position in tokens of at least one substring of the original token string used to form the tuple of the original index.
  - 13. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database, as in claim 9, where the displacement information is a distance in tokens between an average of the position of each original substring used to form the tuple and a given token on the original token string.
  - 14. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database, as in claim 9, where the match between the original token string and the reference string is an exact match.
  - 15. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database, as in claim 9, where the match between the original token string and the reference string is a similar match.
  - 16. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database, as in claim 9, where information in the second memory structure includes a value which indicates the degree of similarity between the original token string and the reference string.
  - 17. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database, as in claim 9, where the first look-up structure is a data structure that includes structures like a vector, array, and hash table.
  - 18. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database, as in claim 17, where the first look-up structure is an array.
  - 19. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database, as in claim 9, where the second look-up structure is a static data structure that includes structures like a vector, array, and hash table.
  - 20. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database, as in claim 9, where the second look-up structure is a dynamic data structure that includes structures like a vector, array, and hash table.
  - 21. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database, as in claim 20, where the second look-up structure is a dynamic hash table.
  - 22. A method for recognizing and accessing a reference string of tokens in one or more original token strings within a database, as in claim 9, where the second memory look-up structure is updated every time a reference index matches an original index.
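
The two look-up structures of claim 9 and its dependents can be sketched roughly as follows. All names (`tuples_with_pos`, `build_first_table`, `search`) and the parameters `K` and `GAPS` are hypothetical. The sketch assumes a Python dict stands in for the hash table, the tuple text serves as the index, and the displacement of claim 13 is measured from the first token of each string using the (floored) average of the two substring start positions.

```python
from collections import defaultdict

K = 2              # substring length in tokens (illustrative)
GAPS = (1, 2, 3)   # tokens skipped between the two substrings (illustrative)


def tuples_with_pos(s):
    """Yield (tuple_text, displacement) for pairs of non-contiguous substrings.

    The displacement is the average start position of the two substrings,
    measured from token 0 of the string.
    """
    for i in range(len(s)):
        for g in GAPS:
            j = i + K + g
            if j + K > len(s):
                break
            yield s[i:i + K] + s[j:j + K], (i + j) // 2


def build_first_table(originals):
    """First structure: index -> list of (string id, displacement) records."""
    table = defaultdict(list)
    for sid, s in enumerate(originals):
        for t, disp in tuples_with_pos(s):
            table[t].append((sid, disp))
    return table


def search(reference, table):
    """Second structure: (string id, alignment offset) -> votes.

    Returns the best-voted ((string id, offset), votes) pair, or None when
    no reference index matches any original index.
    """
    votes = defaultdict(int)
    for t, ref_disp in tuples_with_pos(reference):
        for sid, disp in table.get(t, ()):
            votes[(sid, disp - ref_disp)] += 1   # updated on every match
    if not votes:
        return None
    return max(votes.items(), key=lambda kv: kv[1])
```

Voting on the pair (string id, offset) rather than on the string id alone is one way to use the displacement information: matches that agree on where the reference sits in the original reinforce each other, while scattered coincidental matches do not.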

23. A method for recognizing and accessing a reference string of nucleotides in one or more original DNA strings within a database comprising the steps of:
- creating one or more original tuples for each of the original DNA strings in the database by;
  
  a. partitioning each original DNA string into three or more substrings of contiguous nucleotides;
  
  b. appending together two or more original DNA substrings of the original DNA string to form one or more original tuples associated with each original DNA string;
  
  creating a unique original index for each original tuple created from the original DNA string using an index algorithm, the original index being associated with the original DNA string from which the original tuple was created;
  
  creating one or more reference tuples from the reference string of tokens by;
  
  c. partitioning the reference string of nucleotides into three or more reference substrings of contiguous nucleotides;
  
  d. appending together two or more reference substrings to form one or more reference tuples, at least one of the reference tuples being formed by appending together two or more non-contiguous reference substrings;
  
  creating a unique reference index for each reference tuple using the index algorithm;
  
  comparing at least one reference index to at least one original index to determine if the indexes match;
  
  tracking the matches between the reference index and original index;
  
  selecting an original DNA string in the database based on the number of matches between one or more original indexes and one or more reference indexes.
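
For DNA strings, one simple candidate for the index algorithm is to pack each nucleotide into two bits, which yields an integer index as in claims 7 and 8. This is an assumption for illustration; the claims do not prescribe a particular algorithm, and the names `CODE` and `dna_index` are hypothetical. The index is unique only among tuples of the same length, since leading `A` nucleotides encode as zero bits.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # two bits per nucleotide


def dna_index(tuple_text):
    """Pack a nucleotide tuple into a non-negative integer, base 4."""
    idx = 0
    for ch in tuple_text:
        idx = (idx << 2) | CODE[ch]
    return idx


# With a fixed tuple length L, indexes fall in range(4 ** L), so a plain
# list can serve as an array-style look-up structure with one cell per index.
TUPLE_LEN = 4
cells = [[] for _ in range(4 ** TUPLE_LEN)]
cells[dna_index("ACGT")].append(("original-string-id", 0))  # placeholder record
```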

24. A method for recognizing and accessing a reference string of amino acids in one or more original protein strings within a database comprising the steps of:
- creating one or more original tuples for each of the original protein strings in the database by;
  
  a. partitioning each original protein string into three or more substrings of contiguous amino acids;
  
  b. forming one or more original tuples associated with each original protein string by appending together two or more original amino acid substrings of the original string, one or more of the original tuples being formed by appending together at least two non contiguous original amino acid substrings;
  
  creating a unique original index for each original tuple created from the original protein string using an index algorithm, the original index being associated with the original protein string from which the original tuple was created;
  
  creating one or more reference tuples from the reference string of tokens by;
  
  c. partitioning the reference string of amino acids into three or more contiguous reference substrings of amino acids;
  
  d. forming two or more reference tuples by appending together two or more reference substrings, one or more of the reference tuples being formed by appending two or more non contiguous reference substrings;
  
  creating a unique reference index for each reference tuple using the index algorithm;
  
  comparing at least one reference index to at least one original index;
  
  tracking the matches between the reference index and original index;
  
  selecting an original protein string in the database based on the number of matches between one or more original indexes and one or more reference indexes.

25. A method for recognizing and accessing a reference string of characters in one or more original character strings within a database comprising the steps of:
- creating one or more original tuples for each of the original character strings in the database by;
  
  a. partitioning each original character string into three or more substrings of contiguous characters;
  
  b. forming one or more original tuples associated with each original character string by appending together two or more original character substrings of the original string, one or more of the original tuples being formed by appending together two or more non contiguous original character substrings;
  
  creating a unique original index for each original tuple created from the original character string using an index algorithm, the original index being associated with the original character string from which the original tuple was created;
  
  creating one or more reference tuples from the reference string of tokens by;
  
  c. partitioning the reference string of characters into three or more non contiguous reference substrings of characters;
  
  d. forming two or more reference tuples by appending together two or more reference substrings, one or more of the reference tuples being formed by appending two or more non contiguous reference substrings;
  
  creating a unique reference index for each reference tuple using the index algorithm;
  
  comparing at least one reference index to at least one original index;
  
  tracking the matches between the reference index and original index;
  
  selecting an original character string in the database based on the number of matches between one or more original indexes and one or more reference indexes.

26. A method for recognizing and accessing a reference string of phonemes in one or more original phoneme strings within a database comprising the steps of:
- creating one or more original tuples for each of the original phoneme strings in the database by;
  
  a. partitioning each original phoneme string into three or more original substrings of contiguous phonemes;
  
  b. forming one or more original tuples associated with each original phoneme string by appending together two or more original substrings of the original string, one or more of the original tuples being formed by appending together at least two non contiguous original substrings;
  
  creating a unique original index for each original tuple created from an original phoneme string using an index algorithm, the original index being associated with the original phoneme string from which the original tuple was created;
  
  creating one or more reference tuples from the reference string of phonemes by;
  
  c. partitioning the reference string of phonemes into three or more contiguous reference substrings of phonemes;
  
  d. forming two or more reference tuples by appending together two or more reference substrings, one or more of the reference tuples being formed by appending two or more non contiguous reference substrings;
  
  creating a unique reference index for each reference tuple using the index algorithm;
  
  comparing at least one reference index to at least one original index;
  
  tracking the matches between the reference index and original index;
  
  selecting an original phoneme string in the database based on the number of matches between one or more original indexes and one or more reference indexes.

27. A method for recognizing and accessing a reference string of notes in one or more original note strings within a database comprising the steps of:
- creating one or more original tuples for each of the original note strings in the database by;
  
  a. partitioning each original note string into three or more original substrings of contiguous notes;
  
  b. forming one or more original tuples associated with each original note string by appending together two or more original substrings of the original string, one or more of the original tuples being formed by appending together at least two non contiguous original substrings;
  
  creating a unique original index for each original tuple created from the original note string using an index algorithm, the original index being associated with the original note string from which the original tuple was created;
  
  creating one or more reference tuples from the reference string of notes by;
  
  c. partitioning the reference string of notes into three or more contiguous reference substrings of notes;
  
  d. forming two or more reference tuples by appending together two or more reference substrings, one or more of the reference tuples being formed by appending two or more non contiguous reference substrings;
  
  creating a unique reference index for each reference tuple using the index algorithm;
  
  comparing at least one reference index to at least one original index;
  
  tracking the matches between the reference index and original index;
  
  selecting an original note string in the database based on the number of matches between one or more original indexes and one or more reference indexes.

28. A computer system for recognizing and accessing a reference string of tokens in one or more original token strings within a database comprising:
- a database having a set of original token strings;
  
  a means for creating at least one original tuple for each of the original token strings in the database, the tuple formed by;
  
  a. partitioning each original token string into three or more contiguous original substrings of tokens;
  
  b. forming one or more original tuples associated with each original string by appending together two or more original substrings of the original string, one or more of the original tuples being formed by appending together at least two non contiguous original substrings;
  
  a unique original index for each original tuple created from the original string using an index algorithm, the original index being associated with the original string from which the original tuple was created;
  
  a first memory look-up structure with cells, the cells being accessed by the original index and containing information associated with the original string from which the original tuple was created;
  
  one or more reference tuples created from the reference string of tokens by;
  
  c. partitioning the reference string of tokens into three or more non contiguous reference substrings of tokens;
  
  d. forming the reference tuples by appending together at least two reference substrings, one or more of the reference tuples being formed by appending two or more non contiguous reference substrings;
  
  a unique reference index for each reference tuple created using the index algorithm, at least one reference index being compared to at least one original index;
  
  a second memory look-up structure for tracking matches between the reference index and original index, an original token string in the database being selected based on the number of matches between one or more original indexes and one or more reference indexes.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Califano, Andrea
Primary Examiner(s): Black, Thomas G.
Assistant Examiner(s): HOMERE, JEAN RAYMOND
Application Number: US08/512,794
Time in Patent Office: 469 Days
Field of Search: 395/600, 395/800, 395/13, 382/117, 382/209, 364/DIG. 1
US Class Current: 707/694
CPC Class Codes:
G06F 16/90344   by using string matching te...
Y02A 90/10   Information and communicati...
Y10S 707/968   Partitioning
