Using lexical analysis and parsing in genome research

US 9,104,656 B2
Filed: 07/03/2012
Issued: 08/11/2015
Est. Priority Date: 07/03/2012
Status: Expired due to Fees

Abstract

Provided are techniques for locating one or more genome patterns. One or more lexical annotators that each identifies a sequence of nucleotides are created. One or more parsing rule annotators are created using at least one of (1) one or more of the lexical annotators, (2) one or more dictionary entries, and (3) one or more previously-defined parsing rule annotators. The one or more parsing rule annotators are used to discover the one or more genome patterns comprising a combination of the lexical annotators and the parsing rule annotators.
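
The composition described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a lexical annotator can be modeled as a named regular expression over A, C, G, and T, and that a parsing rule is an ordered list of element names compiled into one combined regular expression. The names `LEXICAL`, `RULES`, `expand`, and `locate`, and the example patterns, are hypothetical.

```python
import re

# Hypothetical lexical annotators: name -> regex over the bases A, C, G, T.
LEXICAL = {
    "START": r"ATG",
    "STOP": r"(?:TAA|TAG|TGA)",
    "CODON": r"[ACGT]{3}",
}

# Hypothetical parsing rules: name -> ordered elements. An element may be a
# lexical annotator, a previously defined rule, or a literal nucleotide.
RULES = {
    "BODY": ["CODON", "CODON"],
    "SHORT_ORF": ["START", "BODY", "STOP"],
}

def expand(element: str) -> str:
    """Resolve an element to a regex fragment, recursing through rules."""
    if element in LEXICAL:
        return LEXICAL[element]
    if element in RULES:
        return "".join(expand(e) for e in RULES[element])
    if len(element) == 1 and element in "ACGT":
        return element
    raise KeyError(f"unknown element: {element}")

def locate(rule: str, sequence: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of each non-overlapping match of the rule."""
    pattern = re.compile(expand(rule))
    return [m.span() for m in pattern.finditer(sequence)]
```

Under these assumptions, `locate("SHORT_ORF", "CCATGAAACCCTAAGG")` returns `[(2, 14)]`. The rule `SHORT_ORF` reuses the previously defined rule `BODY`, which is the kind of layering the abstract describes.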

14 Claims

1. A computer system for locating a genome pattern, comprising:
- a processor; and
  
  a storage device connected to the processor, wherein the storage device has stored thereon a program, and wherein the processor is configured to execute instructions of the program to perform operations, wherein the operations comprise:
  
  creating one or more lexical annotators that each identify a sequence of nucleotides of nucleotide bases selected from A, C, G, and T;
  
  providing (1) the one or more lexical annotators, (2) one or more dictionary entries, (3) one or more previously-defined parsing rule annotators, and (4) one or more characters that each represent a nucleotide;
  
  creating a parsing rule annotator that identifies an order of and a combination of at least two elements selected from (1) the one or more lexical annotators, (2) the one or more dictionary entries, (3) the one or more previously-defined parsing rule annotators, and (4) the one or more characters that each represent a nucleotide; and
  
  creating an Unstructured Information Management Architecture (UIMA) pipeline to locate the genome pattern using the parsing rule annotator by:
  
  in a first stage of the UIMA pipeline, parsing a genetic sequence that is found in a Common Analysis Structure (CAS) to determine a language used and to generate tokens that are added to the CAS with a start position and an end position for each of the tokens;
  
  in a second stage of the UIMA pipeline, executing the one or more lexical annotators against the genetic sequence to identify one or more lexical annotations that are added to the CAS with a start position and an end position for each of the one or more lexical annotations; and
  
  in a third stage of the UIMA pipeline, using the start position and the end position for each of the tokens and the start position and the end position for each of the one or more lexical annotations to identify a match to the parsing rule annotation and to form a new annotation that is added to the CAS.
  - 2. The computer system of claim 1, wherein the parsing rule annotator includes lexical annotators and no previously-defined parsing rule annotators.
  - 3. The computer system of claim 1, wherein the parsing rule annotator includes previously-defined parsing rule annotators and no lexical annotators.
  - 4. The computer system of claim 1, wherein a Software as a Service (SaaS) is provided to perform the system operations.
  - 5. The computer system of claim 1, wherein there are multiple occurrences of a same lexical annotation with different start positions and end positions.
  - 6. The computer system of claim 1, wherein the operations further comprise:
    - in response to determining that any token matches to a dictionary entry from the one or more dictionary entries, storing a new annotation in the CAS.
  - 7. The computer system of claim 1, wherein each of the one or more dictionary entries represents a feature of a human.
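
The three pipeline stages recited in claim 1 can be sketched in a few lines. This sketch does not use the Apache UIMA API: the `Cas` class is a toy stand-in for a Common Analysis Structure, and it assumes that stage 1 emits one token per nucleotide character and that a rule matches when its elements appear as contiguous spans in the given order. All names (`Cas`, `tokenize`, `run_lexical`, `apply_rule`) are hypothetical.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Cas:
    """Toy stand-in for a UIMA CAS: the text plus positioned annotations."""
    text: str
    annotations: list[tuple[str, int, int]] = field(default_factory=list)

    def add(self, kind: str, start: int, end: int) -> None:
        self.annotations.append((kind, start, end))

def tokenize(cas: Cas) -> None:
    # Stage 1 (assumption): one token per nucleotide character.
    for i, base in enumerate(cas.text):
        cas.add(f"token:{base}", i, i + 1)

def run_lexical(cas: Cas, annotators: dict[str, str]) -> None:
    # Stage 2: each annotator is a regex; record every non-overlapping hit
    # for that annotator with its start and end offsets.
    for name, pattern in annotators.items():
        for m in re.finditer(pattern, cas.text):
            cas.add(name, m.start(), m.end())

def apply_rule(cas: Cas, rule_name: str, elements: list[str]) -> None:
    # Stage 3: chain annotations whose spans are contiguous and in rule order.
    by_start: dict[int, list[tuple[str, int, int]]] = {}
    for ann in cas.annotations:
        by_start.setdefault(ann[1], []).append(ann)

    def ends_from(pos: int, remaining: list[str]) -> list[int]:
        # Return every end offset reachable by matching `remaining` from `pos`.
        if not remaining:
            return [pos]
        ends = []
        for kind, _, end in by_start.get(pos, []):
            if kind == remaining[0]:
                ends.extend(ends_from(end, remaining[1:]))
        return ends

    # Collect matches first so new annotations do not affect the search.
    matches = [(s, e) for s in sorted(by_start) for e in ends_from(s, elements)]
    for start, end in matches:
        cas.add(rule_name, start, end)
```

For example, given the text `GGATGCTAA`, the annotators `{"START": "ATG", "STOP": "TAA|TAG|TGA"}`, and the rule `["START", "token:C", "STOP"]` named `MINI_GENE`, running the three functions in order adds one `MINI_GENE` annotation spanning offsets 2 to 9. The rule mixes two lexical annotations with a single nucleotide character, matched purely by comparing start and end positions.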

8. A computer program product for locating a genome pattern, the computer program product comprising:
- a computer readable storage medium having computer readable program code embodied therewith, wherein the computer readable program code, when executed by a processor of a computer, is configured to perform:
  
  creating one or more lexical annotators that each identify a sequence of nucleotides of nucleotide bases selected from A, C, G, and T;
  
  providing (1) the one or more lexical annotators, (2) one or more dictionary entries, (3) one or more previously-defined parsing rule annotators, and (4) one or more characters that each represent a nucleotide;
  
  creating a parsing rule annotator that identifies an order of and a combination of at least two elements selected from (1) the one or more lexical annotators and (2) the one or more dictionary entries, (3) the one or more previously-defined parsing rule annotators, and (4) the one or more characters that each represent a nucleotide; and
  
  creating an Unstructured Information Management Architecture (UIMA) pipeline to locate the genome pattern using the parsing rule annotator by:
  
  in a first stage of the UIMA pipeline, parsing a genetic sequence that is found in a Common Analysis Structure (CAS) to determine a language used and to generate tokens that are added to the CAS with a start position and an end position for each of the tokens;
  
  in a second stage of the UIMA pipeline, executing the one or more lexical annotators against the genetic sequence to identify one or more lexical annotations that are added to the CAS with a start position and an end position for each of the one or more lexical annotations; and
  
  in a third stage of the UIMA pipeline, using the start position and the end position for each of the tokens and the start position and the end position for each of the one or more lexical annotations to identify a match to the parsing rule annotation and to form a new annotation that is added to the CAS.
  - 9. The computer program product of claim 8, wherein the parsing rule annotator includes lexical annotators and no previously-defined parsing rule annotators.
  - 10. The computer program product of claim 8, wherein the parsing rule annotator includes previously-defined parsing rule annotators and no lexical annotators.
  - 11. The computer program product of claim 8, wherein a Software as a Service (SaaS) is configured to perform the computer program product operations.
  - 12. The computer program product of claim 8, wherein there are multiple occurrences of a same lexical annotation with different start positions and end positions.
  - 13. The computer program product of claim 8, wherein the computer readable program code, when executed by the processor of the computer, is configured to perform:
    - in response to determining that any token matches to a dictionary entry from the one or more dictionary entries, storing a new annotation in the CAS.
  - 14. The computer program product of claim 8, wherein each of the one or more dictionary entries represents a feature of a human.
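
The dictionary step in claims 6 and 13, together with the repeated annotations of claims 5 and 12, can be illustrated as below. This is a sketch under stated assumptions: tokens are taken to be whitespace-delimited runs of text, the returned list stands in for annotations stored in the CAS, and the entries in `DICTIONARY` are invented placeholders with no biological meaning. The function names are hypothetical.

```python
import re
from typing import Iterator

# Hypothetical dictionary: token text -> label. The entries are placeholders
# for illustration only and do not describe real human features.
DICTIONARY = {
    "GATTACA": "feature_a",
    "CCGG": "feature_b",
}

def tokens_with_offsets(text: str) -> Iterator[tuple[str, int, int]]:
    """Yield (token, start, end) for whitespace-delimited tokens (an assumption)."""
    for m in re.finditer(r"\S+", text):
        yield m.group(), m.start(), m.end()

def dictionary_annotations(text: str) -> list[tuple[str, int, int]]:
    """Return one (label, start, end) annotation per token found in DICTIONARY."""
    return [
        (DICTIONARY[tok], start, end)
        for tok, start, end in tokens_with_offsets(text)
        if tok in DICTIONARY
    ]
```

With the input `"ACGT GATTACA CCGG GATTACA"`, the function returns `[("feature_a", 5, 12), ("feature_b", 13, 17), ("feature_a", 18, 25)]`, so the same label appears twice with different start and end positions.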

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Bowman, Stephen D.; Reddy, Dandala V.; Werts, David B.
Primary Examiner(s): Saeed, Usmaan
Assistant Examiner(s): Weinrich, Brian E.
Application Number: US13/541,475
Publication Number: US 20140012865A1
Time in Patent Office: 1,134 Days
Field of Search: 707/755, 702/20, 702/19
US Class Current: 1/1
CPC Class Codes:
G06F 16/33   Querying
G06F 17/30634   Querying
G06F 19/18   for functional genomics or ...
G06F 19/22   for sequence comparison inv...
G06F 19/24   for machine learning, data ...
G06F 40/205   Parsing
G16B 20/00   ICT specially adapted for f...
G16B 20/20   Allele or variant detection...
G16B 30/00   ICT specially adapted for s...
G16B 40/00   ICT specially adapted for b...
