Domain specific representation of document text for accelerated natural language processing
First Claim
1. A computer program product, the computer program product comprising a computer readable storage medium having program code embodied therewith, the program code executable by one of a Central Processing Unit (CPU) processor and at least one Graphics Processing Unit (GPU) processor to perform:
selecting a document from a set of documents to be analyzed;
converting a character stream from the document into a token stream based on tokenization rules;
removing irrelevant tokens from the token stream;
converting tokens remaining in the token stream into an integer domain representation based on a domain specific ontology dictionary, wherein each of the tokens is mapped to an integer based on mappings in the domain specific ontology dictionary;
storing the integer domain representation to a Graphics Processing Unit (GPU) processing queue of each of one or more GPUs;
receiving a result set from the one or more GPUs, wherein the result set includes tuples of 1) a specific pattern and 2) an offset from a beginning of the integer domain representation, and wherein the result set is generated using a compiled super Regular Expression (REGEX) that is compiled using the domain specific ontology dictionary;
storing the result set into an index for use in processing a search query; and
persisting the integer domain representation once for the document for processing of the document with a different compiled super REGEX.
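The last three steps of the claim (result tuples, indexing, and reuse of the persisted representation) can be illustrated with a minimal CPU-only sketch. It does not reproduce the claimed GPU processing or the compiled super REGEX itself; it assumes, purely for illustration, that a pattern set can be modeled as named integer sequences derived from the dictionary. The names `compile_patterns`, `scan`, `ONTOLOGY`, and the sample patterns are hypothetical.

```python
# Hypothetical ontology dictionary (token -> integer); not taken from the patent.
ONTOLOGY = {"patient": 1, "diagnosed": 2, "diabetes": 3,
            "prescribed": 4, "metformin": 5}


def compile_patterns(named_phrases, ontology):
    """Translate each token phrase into an integer sequence via the dictionary.

    Assumption: a simple stand-in for a compiled pattern set, not the
    patent's compiled super REGEX.
    """
    return {name: tuple(ontology[token] for token in phrase.split())
            for name, phrase in named_phrases.items()}


def scan(int_repr, compiled):
    """Return (pattern name, offset) tuples.

    The offset is counted from the beginning of the integer representation.
    """
    results = []
    for offset in range(len(int_repr)):
        for name, seq in compiled.items():
            if tuple(int_repr[offset:offset + len(seq)]) == seq:
                results.append((name, offset))
    return results


# The integer domain representation is produced and stored once per document...
persisted = [1, 2, 3, 4, 5]

# ...and can then be scanned with different pattern sets without re-tokenizing.
diagnosis = compile_patterns({"DIAGNOSIS": "diagnosed diabetes"}, ONTOLOGY)
treatment = compile_patterns({"TREATMENT": "prescribed metformin"}, ONTOLOGY)

print(scan(persisted, diagnosis))  # [('DIAGNOSIS', 1)]
print(scan(persisted, treatment))  # [('TREATMENT', 3)]
```

In this sketch the second scan reuses `persisted` unchanged, which mirrors the idea of persisting the integer domain representation once and processing it again with a different pattern set.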
Abstract
Provided are techniques for a domain specific representation of document text for accelerated natural language processing. A document is selected from a set of documents to be analyzed. A character stream from the document is converted into a token stream based on tokenization rules. Irrelevant tokens are removed from the token stream. The tokens remaining in the token stream are converted into an integer domain representation based on a domain specific ontology dictionary. The integer domain representation is stored to a Graphics Processing Unit (GPU) processing queue of each of one or more GPUs. Then, a result set is received from the one or more GPUs.
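The front half of the pipeline described above (character stream to token stream, removal of irrelevant tokens, mapping to integers) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the tokenization rule, the list of irrelevant tokens, the dictionary contents, and the shared code `UNKNOWN` for out-of-dictionary tokens are all hypothetical, and the GPU queue is omitted.

```python
import re

# Hypothetical tokenization rule: each run of letters or digits is one token.
TOKEN_RULE = re.compile(r"[A-Za-z0-9]+")

# Hypothetical list standing in for "irrelevant tokens".
IRRELEVANT = {"the", "a", "an", "of", "was", "with", "and"}

# Hypothetical domain specific ontology dictionary (token -> integer).
ONTOLOGY = {"patient": 1, "diagnosed": 2, "diabetes": 3,
            "prescribed": 4, "metformin": 5}
UNKNOWN = 0  # assumption: tokens missing from the dictionary share one code


def tokenize(text):
    """Convert a character stream into a lower-cased token stream."""
    return [match.group(0).lower() for match in TOKEN_RULE.finditer(text)]


def remove_irrelevant(tokens):
    """Drop tokens that appear in the irrelevant-token list."""
    return [token for token in tokens if token not in IRRELEVANT]


def to_integer_domain(tokens):
    """Map each remaining token to its integer from the dictionary."""
    return [ONTOLOGY.get(token, UNKNOWN) for token in tokens]


def encode_document(text):
    """Run the three steps in order and return the integer representation."""
    return to_integer_domain(remove_irrelevant(tokenize(text)))


doc = "The patient was diagnosed with diabetes and prescribed Metformin."
print(encode_document(doc))  # [1, 2, 3, 4, 5]
```

The resulting list of integers is the kind of compact sequence that could then be handed to a GPU queue; how that hand-off is done is outside the scope of this sketch.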
16 Claims
1. A computer program product, the computer program product comprising a computer readable storage medium having program code embodied therewith, the program code executable by one of a Central Processing Unit (CPU) processor and at least one Graphics Processing Unit (GPU) processor to perform:
selecting a document from a set of documents to be analyzed;
converting a character stream from the document into a token stream based on tokenization rules;
removing irrelevant tokens from the token stream;
converting tokens remaining in the token stream into an integer domain representation based on a domain specific ontology dictionary, wherein each of the tokens is mapped to an integer based on mappings in the domain specific ontology dictionary;
storing the integer domain representation to a Graphics Processing Unit (GPU) processing queue of each of one or more GPUs;
receiving a result set from the one or more GPUs, wherein the result set includes tuples of 1) a specific pattern and 2) an offset from a beginning of the integer domain representation, and wherein the result set is generated using a compiled super Regular Expression (REGEX) that is compiled using the domain specific ontology dictionary;
storing the result set into an index for use in processing a search query; and
persisting the integer domain representation once for the document for processing of the document with a different compiled super REGEX.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.
9. A computer system, comprising:
one or more Central Processing Unit (CPU) processors and Graphics Processing Unit (GPU) processors, one or more computer-readable memories and one or more computer-readable, tangible storage devices; and
program instructions, stored on at least one of the one or more computer-readable, tangible storage devices for execution by at least one of the one or more CPU processors via at least one of the one or more memories, to perform operations comprising:
selecting a document from a set of documents to be analyzed;
converting a character stream from the document into a token stream based on tokenization rules;
removing irrelevant tokens from the token stream;
converting tokens remaining in the token stream into an integer domain representation based on a domain specific ontology dictionary, wherein each of the tokens is mapped to an integer based on mappings in the domain specific ontology dictionary;
storing the integer domain representation to a Graphics Processing Unit (GPU) processing queue of each of one or more GPUs;
receiving a result set from the one or more GPUs, wherein the result set includes tuples of 1) a specific pattern and 2) an offset from a beginning of the integer domain representation, and wherein the result set is generated using a compiled super Regular Expression (REGEX) that is compiled using the domain specific ontology dictionary;
storing the result set into an index for use in processing a search query; and
persisting the integer domain representation once for the document for processing of the document with a different compiled super REGEX.
Dependent claims: 10, 11, 12, 13, 14, 15, 16.
Specification