Domain specific representation of document text for accelerated natural language processing
First Claim
1. A method, comprising:
selecting a document from a set of documents to be analyzed;
converting a character stream from the document into a token stream based on tokenization rules;
removing irrelevant tokens from the token stream;
converting tokens remaining in the token stream into an integer domain representation based on a domain specific ontology dictionary, wherein each of the tokens is mapped to an integer based on mappings in the domain specific ontology dictionary;
storing the integer domain representation to a Graphics Processing Unit (GPU) processing queue of each of one or more GPUs;
receiving a result set from the one or more GPUs, wherein the result set includes tuples of
1) a specific pattern and
2) an offset from a beginning of the integer domain representation, and wherein the result set is generated using a compiled super Regular Expression (REGEX) that is compiled using the domain specific ontology dictionary;
storing the result set into an index for use in processing a search query; and
persisting the integer domain representation once for the document for processing of the document with a different compiled super REGEX.
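The steps of claim 1 can be illustrated with a minimal CPU-only sketch. Everything named here is an assumption made for illustration and is not taken from the patent: the tokenization rule, the list of irrelevant tokens, the contents of `ONTOLOGY`, the `UNKNOWN` id, and the trick of encoding each integer as one character so that Python's `re` module can stand in for the GPU-side compiled super REGEX.

```python
import re

# Hypothetical tokenization rule: runs of lowercase letters or digits.
TOKEN_RULE = re.compile(r"[a-z0-9]+")
# Hypothetical list of "irrelevant" tokens to drop.
IRRELEVANT = {"the", "a", "an", "and", "of"}
# Hypothetical domain specific ontology dictionary: token -> integer.
ONTOLOGY = {"patient": 1, "denies": 2, "chest": 3,
            "pain": 4, "reports": 5, "fever": 6}
UNKNOWN = 0   # assumed id for tokens missing from the dictionary
BASE = 0x100  # shift ids into code points that are not regex metacharacters


def to_integer_domain(text):
    """Character stream -> token stream -> filtered tokens -> integers."""
    tokens = TOKEN_RULE.findall(text.lower())
    kept = [t for t in tokens if t not in IRRELEVANT]
    return [ONTOLOGY.get(t, UNKNOWN) for t in kept]


def compile_super_regex(phrases):
    """Fold several phrases into one alternation over encoded integers."""
    names, parts = {}, []
    for i, phrase in enumerate(phrases):
        encoded = "".join(chr(BASE + ONTOLOGY[w]) for w in phrase.split())
        names[f"p{i}"] = phrase
        parts.append(f"(?P<p{i}>{re.escape(encoded)})")
    return re.compile("|".join(parts)), names


def scan(ids, compiled):
    """Return (pattern, offset) tuples; offsets count integers, not characters."""
    regex, names = compiled
    encoded = "".join(chr(BASE + i) for i in ids)
    return [(names[m.lastgroup], m.start()) for m in regex.finditer(encoded)]


# The integer representation is built once and scanned with two pattern sets.
ids = to_integer_domain("The patient denies chest pain and reports a fever.")
first = scan(ids, compile_super_regex(["chest pain", "fever"]))
second = scan(ids, compile_super_regex(["denies chest"]))
```

Because each integer occupies exactly one character in the encoded string, the match position doubles as the offset from the beginning of the integer domain representation. Note that `finditer` reports only non-overlapping matches, which a real implementation may not want.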
Abstract
Provided are techniques for a domain specific representation of document text for accelerated natural language processing. A document is selected from a set of documents to be analyzed. A character stream from the document is converted into a token stream based on tokenization rules. Irrelevant tokens are removed from the token stream. The tokens remaining in the token stream are converted into an integer domain representation based on a domain specific ontology dictionary. The integer domain representation is stored to a Graphics Processing Unit (GPU) processing queue of each of one or more GPUs. Then, a result set is received from the one or more GPUs.
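The queueing step in the abstract might be sketched as follows. This is a sketch under stated assumptions, not the patented implementation: worker threads stand in for GPUs, each simulated device is assumed to hold its own subset of patterns, the patterns are plain integer sequences rather than a compiled super REGEX, and a document id is added to each result tuple so results from several documents can be told apart. The names `worker` and `dispatch` are hypothetical.

```python
import queue
import threading


def worker(patterns, work_q, result_q):
    """Stand-in for one GPU: scan each queued representation for its patterns."""
    while True:
        item = work_q.get()
        if item is None:  # sentinel: no more documents
            break
        doc_id, ids = item
        for name, seq in patterns.items():
            n = len(seq)
            for offset in range(len(ids) - n + 1):
                if ids[offset:offset + n] == seq:
                    result_q.put((doc_id, name, offset))


def dispatch(documents, pattern_sets):
    """Put every representation on each device queue, then collect results."""
    result_q = queue.Queue()
    work_qs = [queue.Queue() for _ in pattern_sets]
    threads = [threading.Thread(target=worker, args=(p, q, result_q))
               for p, q in zip(pattern_sets, work_qs)]
    for t in threads:
        t.start()
    for doc_id, ids in documents.items():
        for q in work_qs:      # the same representation goes to every queue
            q.put((doc_id, ids))
    for q in work_qs:
        q.put(None)
    for t in threads:
        t.join()
    results = []
    while not result_q.empty():
        results.append(result_q.get())
    return sorted(results)     # sorted for a deterministic order


documents = {"doc1": [1, 2, 3, 4, 5, 6]}
pattern_sets = [{"chest pain": [3, 4]}, {"fever": [6]}]
result_set = dispatch(documents, pattern_sets)
```

Sorting the collected tuples is only a convenience for reproducible output, since the simulated devices may finish in any order.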
17 Citations
8 Claims
Specification