Using lexical analysis and parsing in genome research
First Claim
1. A computer system for locating a genome pattern, comprising:
a processor; and
a storage device connected to the processor, wherein the storage device has stored thereon a program, and wherein the processor is configured to execute instructions of the program to perform operations, wherein the operations comprise:
creating one or more lexical annotators that each identify a sequence of nucleotides of nucleotide bases selected from A, C, G, and T;
providing (1) the one or more lexical annotators, (2) one or more dictionary entries, (3) one or more previously-defined parsing rule annotators, and (4) one or more characters that each represent a nucleotide;
creating a parsing rule annotator that identifies an order of and a combination of at least two elements selected from (1) the one or more lexical annotators, (2) the one or more dictionary entries, (3) the one or more previously-defined parsing rule annotators, and (4) the one or more characters that each represent a nucleotide; and
creating an Unstructured Information Management Architecture (UIMA) pipeline to locate the genome pattern using the parsing rule annotator by:
in a first stage of the UIMA pipeline, parsing a genetic sequence that is found in a Common Analysis Structure (CAS) to determine a language used and to generate tokens that are added to the CAS with a start position and an end position for each of the tokens;
in a second stage of the UIMA pipeline, executing the one or more lexical annotators against the genetic sequence to identify one or more lexical annotations that are added to the CAS with a start position and an end position for each of the one or more lexical annotations; and
in a third stage of the UIMA pipeline, using the start position and the end position for each of the tokens and the start position and the end position for each of the one or more lexical annotations to identify a match to the parsing rule annotation and to form a new annotation that is added to the CAS.
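The three pipeline stages recited in this claim can be illustrated with a minimal Python sketch. It does not use the UIMA framework or any of its APIs. The `Cas` and `Annotation` classes, the stage function names, the regular-expression form of the lexical annotators, and the example pattern names (`tata_box`, `start_codon`, `promoter_then_start`) are all assumptions made for illustration, not details taken from the patent.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    kind: str   # e.g. "token", or the name of a lexical annotator or rule
    start: int  # inclusive offset into the sequence
    end: int    # exclusive offset into the sequence


@dataclass
class Cas:
    """Hypothetical stand-in for a Common Analysis Structure."""
    text: str
    language: str = ""
    annotations: list = field(default_factory=list)


def stage_tokenize(cas):
    # Stage 1: decide the "language" and add one token per character.
    cas.language = "dna" if set(cas.text) <= set("ACGT") else "unknown"
    cas.annotations += [Annotation("token", i, i + 1) for i in range(len(cas.text))]


def stage_lexical(cas, lexical_annotators):
    # Stage 2: each lexical annotator is assumed to be a name -> regex pair.
    for name, pattern in lexical_annotators.items():
        for m in re.finditer(pattern, cas.text):
            cas.annotations.append(Annotation(name, m.start(), m.end()))


def stage_parse(cas, rule_name, ordered_kinds):
    # Stage 3: use start/end positions to find annotations of the listed
    # kinds that follow one another with no gap, then add a new annotation.
    by_kind_start = {}
    for a in cas.annotations:
        by_kind_start.setdefault((a.kind, a.start), []).append(a)

    found = []

    def extend(idx, pos, begin):
        if idx == len(ordered_kinds):
            found.append(Annotation(rule_name, begin, pos))
            return
        for nxt in by_kind_start.get((ordered_kinds[idx], pos), []):
            extend(idx + 1, nxt.end, begin)

    for a in list(cas.annotations):
        if a.kind == ordered_kinds[0]:
            extend(1, a.end, a.start)

    cas.annotations += found
    return found


def run_pipeline(text):
    cas = Cas(text)
    stage_tokenize(cas)
    stage_lexical(cas, {"tata_box": "TATAAA", "start_codon": "ATG"})
    # Rule: a TATA box, then exactly two single-nucleotide tokens, then ATG.
    return stage_parse(
        cas, "promoter_then_start", ["tata_box", "token", "token", "start_codon"]
    )
```

Calling `run_pipeline("GGTATAAACGATGCC")` in this sketch yields one `promoter_then_start` annotation spanning offsets 2 to 13, because the TATA box ends at offset 8, two tokens cover offsets 8 to 10, and the start codon begins at offset 10.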
1 Assignment
0 Petitions
Abstract
Provided are techniques for locating one or more genome patterns. One or more lexical annotators that each identifies a sequence of nucleotides are created. One or more parsing rule annotators are created using at least one of (1) one or more of the lexical annotators, (2) one or more dictionary entries, and (3) one or more previously-defined parsing rule annotators. The one or more parsing rule annotators are used to discover the one or more genome patterns comprising a combination of the lexical annotators and the parsing rule annotators.
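The way a parsing rule is built from lexical annotators, dictionary entries, and previously defined rules can be sketched as a small recursive lookup. This is a simplification that compiles each rule into a single regular expression rather than using annotators; the tables `LEXICAL`, `DICTIONARY`, and `RULES`, their entries, and the helper names are hypothetical and chosen only to show the composition.

```python
import re

# Hypothetical lexical annotators: name -> regular expression.
LEXICAL = {"start_codon": r"ATG", "codon": r"[ACGT]{3}"}

# Hypothetical dictionary entries: name -> list of literal sequences.
DICTIONARY = {"stop_codon": ["TAA", "TAG", "TGA"]}

# Parsing rules: name -> ordered elements. "mini_orf" reuses the
# previously defined rule "coding_body".
RULES = {
    "coding_body": ["codon", "codon"],
    "mini_orf": ["start_codon", "coding_body", "stop_codon"],
}


def to_regex(name):
    """Expand an element name into a regular expression, recursively."""
    if name in LEXICAL:
        return f"(?:{LEXICAL[name]})"
    if name in DICTIONARY:
        return "(?:" + "|".join(re.escape(w) for w in DICTIONARY[name]) + ")"
    if name in RULES:
        return "".join(to_regex(part) for part in RULES[name])
    if len(name) == 1 and name in "ACGT":
        return name  # a single character that represents a nucleotide
    raise KeyError(f"unknown rule element: {name}")


def find_pattern(rule, sequence):
    """Return (start, end) offsets of each non-overlapping match of the rule."""
    return [(m.start(), m.end()) for m in re.finditer(to_regex(rule), sequence)]
```

For example, `find_pattern("mini_orf", "CCATGGCTAAATAGGT")` returns `[(2, 14)]` in this sketch: `ATG` at offset 2, the two codons `GCT` and `AAA`, and the dictionary entry `TAG` ending at offset 14.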
18 Citations
14 Claims
1. A computer system for locating a genome pattern (independent claim; full text appears under First Claim above).
Dependent claims: 2, 3, 4, 5, 6, 7
8. A computer program product for locating a genome pattern, the computer program product comprising:
a computer readable storage medium having computer readable program code embodied therewith, wherein the computer readable program code, when executed by a processor of a computer, is configured to perform:
creating one or more lexical annotators that each identify a sequence of nucleotides of nucleotide bases selected from A, C, G, and T;
providing (1) the one or more lexical annotators, (2) one or more dictionary entries, (3) one or more previously-defined parsing rule annotators, and (4) one or more characters that each represent a nucleotide;
creating a parsing rule annotator that identifies an order of and a combination of at least two elements selected from (1) the one or more lexical annotators and (2) the one or more dictionary entries, (3) the one or more previously-defined parsing rule annotators, and (4) the one or more characters that each represent a nucleotide; and
creating an Unstructured Information Management Architecture (UIMA) pipeline to locate the genome pattern using the parsing rule annotator by:
in a first stage of the UIMA pipeline, parsing a genetic sequence that is found in a Common Analysis Structure (CAS) to determine a language used and to generate tokens that are added to the CAS with a start position and an end position for each of the tokens;
in a second stage of the UIMA pipeline, executing the one or more lexical annotators against the genetic sequence to identify one or more lexical annotations that are added to the CAS with a start position and an end position for each of the one or more lexical annotations; and
in a third stage of the UIMA pipeline, using the start position and the end position for each of the tokens and the start position and the end position for each of the one or more lexical annotations to identify a match to the parsing rule annotation and to form a new annotation that is added to the CAS.
Dependent claims: 9, 10, 11, 12, 13, 14
Specification