Minimization of surprisal data through application of hierarchy filter pattern

US 10,331,626 B2
Filed: 09/03/2014
Issued: 06/25/2019
Est. Priority Date: 05/18/2012
Status: Active Grant


Abstract

A computer product and system of minimizing surprisal data comprising: at a source, reading and identifying characteristics of an organism's background associated with a genetic sequence of the organism; receiving an input of rank of at least two identified characteristics of the genetic sequence; generating a hierarchy of ranked, identified characteristics based on the rank of the identified characteristics; comparing the hierarchy of ranked, identified characteristics to a repository of reference genomes; and if at least one reference genome from the repository matches the ranked characteristics, breaking the matched reference genomes into pieces, combining pieces associated with the identified characteristics from the matched reference genome to form a filter pattern to be compared to the nucleotides of the genetic sequence of the organism. The differences from the comparison are used to create surprisal data representing an entire genome of the organism.
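
As a rough illustration of the ranking and matching steps described in the abstract, the following Python sketch orders characteristics by rank and then scores reference genomes against that hierarchy. The text above does not state how a "match" is computed, so the `ReferenceGenome` type, the weight-by-rank score, and the example characteristic names are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceGenome:
    """Hypothetical repository entry: an ID plus the characteristics it is tagged with."""
    genome_id: str
    characteristics: frozenset


def build_hierarchy(ranks):
    """Order characteristic names from most to least important (rank 1 first)."""
    return [name for name, _ in sorted(ranks.items(), key=lambda kv: kv[1])]


def match_references(hierarchy, repository):
    """Return IDs of reference genomes sharing at least one ranked characteristic.

    Assumed scoring rule: a characteristic at position i in a hierarchy of
    length n contributes weight n - i, so higher-ranked characteristics
    count for more. Best score first; ties broken by genome ID.
    """
    n = len(hierarchy)
    weights = {name: n - i for i, name in enumerate(hierarchy)}
    scored = []
    for ref in repository:
        score = sum(w for name, w in weights.items() if name in ref.characteristics)
        if score > 0:
            scored.append((score, ref.genome_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [genome_id for _, genome_id in scored]
```

With ranks `{"type 2 diabetes": 1, "asthma": 2, "northern European ancestry": 3}`, a reference tagged with both "type 2 diabetes" and "northern European ancestry" scores 4 and is listed ahead of one tagged only with "asthma" (score 2); a reference sharing no ranked characteristic is not returned.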

13 Claims

1. A computer program product for minimizing surprisal data representing an entire genome of an organism for compression and transmission, comprising a source computer having one or more processors and one or more computer-readable memories coupled to the one or more processors, comprising:
- one or more computer-readable storage devices, and program instructions, stored on the one or more storage devices, the program instructions comprising;
  
  program instructions to, at a source computer, read and identify characteristics of the organism's medical history and background associated with a genetic sequence of an organism;
  
  program instructions to receive an input of rank of at least two identified characteristics of the organism's medical history and background associated with the genetic sequence of the organism;
  
  program instructions to generate a hierarchy of ranked, identified characteristics based on the rank of the at least two identified characteristics of the genetic sequence of the organism;
  
  program instructions to compare the hierarchy of ranked, identified characteristics to a repository of reference genomes; and
  
  program instructions that if at least one reference genome from the repository matches the hierarchy of ranked, identified characteristics, program instructions to;
  
  i) storing the at least one matched reference genome in a repository;
  
  ii) breaking the at least one matched reference genome into pieces comprising nucleotides of the genetic sequence which comprises at least one gene, at least some of the pieces being associated with the identified characteristics;
  
  iii) storing the pieces which are associated with the identified characteristics in the repository;
  
  iv) combining the stored pieces of the at least one matched reference genome into a filter pattern;
  
  v) comparing pieces of the nucleotides of the genetic sequence of the organism which comprises at least one gene which correspond to the stored pieces of the at least one matched reference genome to the nucleotides of the filter pattern of the pieces of the at least one matched reference genome, to find differences where nucleotides of the genetic sequence of the organism which are different from the nucleotides of the at least one matched reference genome;
  
  vi) using the differences to create surprisal data representing an entire genome of the organism and storing the surprisal data in the repository, the surprisal data comprising a starting location of the differences within the reference genome, how the reference genomes were broken into pieces, a count of a number of differences at the location within the at least one matched reference genome and the nucleotides from the genetic sequence of the organism which are different from the nucleotides of the reference genome; and
  
  vii) transmitting to a destination, a compressed, minimized genome representing an entire genome by sending the surprisal data, an indication of the at least one matched reference genome, and how the reference genome were broken into pieces and not sending sequences of nucleotides that are the same in the genetic sequence of the organism and the at least one matched reference genome.
  - 2. The computer program product of claim 1, further comprising a destination computer having one or more processors and one or more computer-readable memories coupled to the one or more processors performing program instructions to comprising:
    - receive the compressed genome from the source computer, the compressed genome comprising surprisal data, the indication of the at least one matched reference genome used to compress the genome, a count of a number of differences at the location within the at least one matched reference genome and how the reference genomes were broken into pieces;
      
      retrieve the at least one indicated matched reference genome from a repository; and
      
      break the at least one indicated matched reference genome from the repository into pieces associated with the identified characteristics and storing the pieces associated with the identified characteristic in the repository;
      
      combine the pieces into the filter pattern;
      
      alter the filter pattern comprised of pieces of at least one matched reference genome based on the surprisal data by replacing nucleotides at each location in the at least one matched reference genome specified by the surprisal data with the nucleotides from the genetic sequence of the organism in the surprisal data associated with the location;
      
      resulting in an entire genome of the organism.
  - 3. The computer program product of claim 1, wherein the organism is an animal.
  - 4. The computer program product of claim 1, wherein the organism is a microorganism.
  - 5. The computer program product of claim 1, wherein the organism is a plant.
  - 6. The computer program product of claim 1, wherein the organism is a human.
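
Steps v) and vi) of claim 1, together with the reconstruction recited in claim 2, can be sketched as a positional diff and its inverse. This is a minimal sketch under simplifying assumptions that the claims do not state: the filter pattern and the organism's sequence are equal-length strings, only substitutions occur (no insertions or deletions), and each surprisal record is a `(start, count, nucleotides)` tuple. The function names are hypothetical.

```python
def make_surprisal(filter_pattern, organism):
    """Record each run of positions where the organism differs from the filter pattern.

    Each record holds the starting location, the count of differing
    nucleotides, and the organism's nucleotides at those positions.
    """
    if len(filter_pattern) != len(organism):
        raise ValueError("this sketch assumes equal-length sequences")
    records = []
    i, n = 0, len(organism)
    while i < n:
        if organism[i] != filter_pattern[i]:
            start = i
            # Extend the run while the two sequences keep differing.
            while i < n and organism[i] != filter_pattern[i]:
                i += 1
            records.append((start, i - start, organism[start:i]))
        else:
            i += 1
    return records


def reconstruct(filter_pattern, records):
    """Rebuild the organism's sequence by overwriting the filter pattern at each record."""
    sequence = list(filter_pattern)
    for start, count, nucleotides in records:
        sequence[start:start + count] = nucleotides
    return "".join(sequence)
```

For a filter pattern `"ACGTACGTAC"` and an organism sequence `"ACGAACGTTT"`, `make_surprisal` returns `[(3, 1, "A"), (8, 2, "TT")]`, and `reconstruct` applied to the same filter pattern and those records returns the organism sequence. Only the records need to travel; the matching nucleotides are never sent.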

7. A computer system for minimizing surprisal data representing an entire genome of an organism for compression and transmission comprising:
- a source computer having one or more processors, one or more computer-readable memories coupled to the one or more processors and one or more computer-readable, storage devices coupled to the one or more processors, and program instructions, stored on at least one of the one or more storage devices for execution by at least one of the one or more processors via at least one of the one or more memories, the program instructions comprising;
  
  program instructions to compare nucleotides of the genetic sequence of the organism to nucleotides from a reference genome, to, at a source, read and identify characteristics of the organism's medical history and background associated with a genetic sequence of an organism;
  
  program instructions to receive an input of rank of at least two identified characteristics of the organism's medical history and background associated with the genetic sequence of the organism;
  
  program instructions to generate a hierarchy of ranked, identified characteristics based on the rank of the at least two identified characteristics associated with the genetic sequence of the organism;
  
  program instructions to compare the hierarchy of ranked, identified characteristics to a repository of reference genomes; and
  
  program instructions that if at least one reference genome from the repository matches the hierarchy of ranked, identified characteristics, program instructions to;
  
  i) storing the at least one matched reference genome in a repository;
  
  ii) breaking the at least one matched reference genome into pieces comprising nucleotides of the genetic sequence which comprises at least one gene, at least some of the pieces being associated with the identified characteristics;
  
  iii) storing the pieces which are associated with the identified characteristics in the repository;
  
  iv) combining the stored pieces of the at least one matched reference genome into a filter pattern;
  
  v) comparing pieces of the nucleotides of the genetic sequence of the organism which comprises at least one gene which correspond to the stored pieces of the at least one matched reference genome to the nucleotides of the filter pattern of the pieces of the at least one matched reference genome, to find differences where nucleotides of the genetic sequence of the organism which are different from the nucleotides of the at least one matched reference genome;
  
  vi) using the differences to create surprisal data and store the surprisal data in the repository, the surprisal data comprising a starting location of the differences within the reference genome, a count of a number of differences at the location within the at least one matched reference genome, how the reference genomes were broken into pieces and the nucleotides from the genetic sequence of the organism which are different from the nucleotides of the reference genome; and
  
  vii) program instructions for transmitting to a destination, a compressed, minimized genome representing an entire genome by sending the surprisal data and the indication of the at least one matched reference genome, and not sending sequences of nucleotides that are the same in the genetic sequence of the organism and the at least one matched reference genome.
  - 8. The system of claim 7, further comprising a destination computer having one or more processors, one or more computer-readable memories coupled to the one or more processors, and one or more storage devices coupled to the one or more processors, and program instructions, stored on at least one of the one or more storage devices for execution by at least one of the one or more processors via at least one of the one or more memories, the program instructions comprising:
    - program instructions to receive the compressed genome from the source computer, the compressed genome comprising surprisal data, the indication of the at least one matched reference genome used to compress the genome, a count of a number of differences at the location within the at least one matched reference genome and how the reference genomes were broken into pieces;
      
      program instructions to retrieve the at least one indicated matched reference genome from a repository;
      
      program instructions to break the at least one indicated matched reference genome from the repository into pieces associated with the identified characteristics and storing the pieces associated with the identified characteristic in the repository;
      
      program instructions to combine the pieces into the filter pattern; and
      
      program instructions to alter the filter pattern comprised of pieces of at least one matched reference genome based on the surprisal data by replacing nucleotides at each location in the at least one matched reference genome specified by the surprisal data with the nucleotides from the genetic sequence of the organism in the surprisal data associated with the location;
      
      resulting in an entire genome of the organism.
  - 9. The system of claim 7, in which the surprisal data further comprises a count of a number of differences at the location within the reference genome.
  - 10. The system of claim 7, wherein the organism is an animal.
  - 11. The system of claim 7, wherein the organism is a microorganism.
  - 12. The system of claim 7, wherein the organism is a plant.
  - 13. The system of claim 7, wherein the organism is a human.
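
The piece-and-filter-pattern steps (ii to iv) and the transmission step (vii) recited in the independent claims can be illustrated as follows. The claims do not say how pieces are delimited or how the transmitted data is encoded, so this sketch assumes that "how the reference genome was broken into pieces" is a list of half-open `[start, end)` index pairs and that the payload is JSON. All names and field keys are hypothetical.

```python
import json


def break_into_pieces(reference, layout):
    """Cut the reference genome at the given [start, end) index pairs."""
    return [reference[start:end] for start, end in layout]


def build_filter_pattern(pieces):
    """Combine the selected pieces, in order, into a single filter pattern."""
    return "".join(pieces)


def build_payload(reference_id, layout, surprisal):
    """Serialize what is sent to the destination.

    The payload carries the reference genome's identifier, the piece layout,
    and the surprisal records; nucleotides shared with the reference are omitted.
    """
    return json.dumps(
        {
            "reference_id": reference_id,
            "piece_layout": layout,
            "surprisal": surprisal,
        }
    )


def filter_pattern_from_payload(payload, repository):
    """Destination side: rebuild the filter pattern from the payload and a local repository.

    `repository` is assumed to be a dict mapping reference IDs to sequences.
    """
    message = json.loads(payload)
    reference = repository[message["reference_id"]]
    return build_filter_pattern(break_into_pieces(reference, message["piece_layout"]))
```

Because the destination holds the same repository, it can regenerate the identical filter pattern from the reference identifier and the piece layout alone, and then apply the surprisal records to it.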

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Friedlander, Robert R.; Kraemer, James R.
Primary Examiner(s): Zeman, Mary K
Application Number: US14/476,234
Publication Number: US 20150095293A1
Time in Patent Office: 1,756 Days

CPC Class Codes

G06F 16/1744   using compression, e.g. spa...

G06F 16/24578   using ranking

G16B 30/00   ICT specially adapted for s...

G16B 50/00   ICT programming tools or da...

G16B 50/50   Compression of genetic data
