Techniques for Extracting Unstructured Data
First Claim
1. A method comprising:
receiving a plurality of extensible grammar expressions, wherein each extensible grammar expression includes a regular expression that searches for a set of information;
receiving a given document including unstructured data;
tokenizing the given document;
searching the tokenized given document using the regular expressions to determine if the unstructured data in the document matches one or more of the extensible grammar expressions;
extracting one or more sets of information from the unstructured data using one or more heuristics; and
outputting the one or more sets of extracted information.
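The steps of the claim can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `GrammarExpression` class, the whitespace tokenizer, the sample patterns, and the de-duplication heuristic are all hypothetical choices, since the claim does not specify any of them.

```python
import re
from dataclasses import dataclass


@dataclass
class GrammarExpression:
    """Hypothetical container: a named regular expression for one kind of information."""
    name: str
    pattern: re.Pattern


def tokenize(document: str) -> list[str]:
    # Assumption: plain whitespace tokenization; the claim names no tokenizer.
    return document.split()


def extract(document: str, grammar: list[GrammarExpression]) -> dict[str, list[str]]:
    """Tokenize, search with each expression, apply a heuristic, return the results."""
    tokens = tokenize(document)
    text = " ".join(tokens)  # search the normalized token stream
    results: dict[str, list[str]] = {}
    for expr in grammar:
        matches = [m.group(0) for m in expr.pattern.finditer(text)]
        # Illustrative heuristic (an assumption): drop repeats, keep first-seen order.
        unique = list(dict.fromkeys(matches))
        if unique:
            results[expr.name] = unique
    return results


# Sample expressions, chosen only for illustration.
GRAMMAR = [
    GrammarExpression("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    GrammarExpression("date", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]

if __name__ == "__main__":
    doc = "Contact  jane@example.com by 2024-05-01.\nBackup: jane@example.com"
    print(extract(doc, GRAMMAR))
    # {'email': ['jane@example.com'], 'date': ['2024-05-01']}
```

Keeping the expressions in a list separate from `extract` means new kinds of information can be searched for by appending to `GRAMMAR` without touching the search loop.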
Abstract
A technique for extracting unstructured data includes receiving a plurality of regular expressions and a given document. The regular expressions include a plurality of extensible grammar expressions for searching for a set of information. The regular expressions are then used to search the given document to determine if the unstructured data matches one or more of the extensible grammar expressions. If a match is determined, one or more sets of information are extracted from the unstructured data using one or more heuristics.
20 Claims
1. A method comprising:
receiving a plurality of extensible grammar expressions, wherein each extensible grammar expression includes a regular expression that searches for a set of information;
receiving a given document including unstructured data;
tokenizing the given document;
searching the tokenized given document using the regular expressions to determine if the unstructured data in the document matches one or more of the extensible grammar expressions;
extracting one or more sets of information from the unstructured data using one or more heuristics; and
outputting the one or more sets of extracted information.
Dependent claims: 2, 3, 4, 5, 6, 7, 12
8. One or more computing device readable media including a first plurality of computing device executable instructions that when executed by a processing unit implement a plurality of extensible grammar expressions, wherein each extensible grammar expression includes a regular expression to match corresponding unstructured data in a document.
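One way to read "extensible" is that new expressions can be built from previously defined ones. The sketch below illustrates that reading only; the claim text does not define the mechanism, and the `GrammarRegistry` class, its `{{name}}` placeholder syntax, and the sample fragments are all hypothetical.

```python
import re

# Placeholder syntax {{name}} is an assumption; it avoids clashing with
# regex quantifiers such as {4}.
_PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


class GrammarRegistry:
    """Hypothetical registry of named regular-expression fragments."""

    def __init__(self) -> None:
        self._fragments: dict[str, str] = {}

    def define(self, name: str, template: str) -> None:
        # Expand references to earlier fragments, wrapping each in a
        # non-capturing group so alternations stay self-contained.
        def expand(match: re.Match) -> str:
            return f"(?:{self._fragments[match.group(1)]})"

        self._fragments[name] = _PLACEHOLDER.sub(expand, template)

    def compile(self, name: str) -> re.Pattern:
        return re.compile(self._fragments[name])


registry = GrammarRegistry()
registry.define("year", r"\d{4}")
registry.define("month", r"0[1-9]|1[0-2]")
registry.define("day", r"0[1-9]|[12]\d|3[01]")
registry.define("iso_date", r"\b{{year}}-{{month}}-{{day}}\b")
# Extending the grammar: a new expression built from an existing one.
registry.define("date_range", r"{{iso_date}} to {{iso_date}}")

if __name__ == "__main__":
    text = "Filed 2009-03-17, granted 2012-13-01."
    print(registry.compile("iso_date").findall(text))
    # ['2009-03-17']
```

Referencing an undefined fragment raises `KeyError` at definition time, which surfaces grammar mistakes before any document is searched.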
13. One or more computing device readable media including a plurality of computing device executable instructions which when executed by a processing unit implement a method comprising:
receiving a plurality of extensible grammar expressions, wherein each extensible grammar expression includes a regular expression that searches for a set of information;
receiving a given document including unstructured data;
pre-processing the given document;
tokenizing the given document;
searching the pre-processed and tokenized document using the regular expressions to determine if the unstructured data in the document matches one or more of the extensible grammar expressions;
extracting one or more sets of information from the unstructured data using one or more heuristics; and
outputting the one or more sets of extracted information.
Dependent claims: 14, 15, 16, 17, 18, 19, 20
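Claim 13 adds a pre-processing step before tokenization but does not say here what it involves. The sketch below shows one plausible form, assuming the input may carry markup and inconsistent whitespace; the `preprocess` and `tokenize` functions and every cleanup rule in them are hypothetical.

```python
import html
import re
import unicodedata

_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def preprocess(document: str) -> str:
    """Assumed cleanup: strip tags, decode entities, normalize, collapse whitespace."""
    text = _TAG.sub(" ", document)              # replace markup tags with a space
    text = html.unescape(text)                  # e.g. &amp; becomes &
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    return _WHITESPACE.sub(" ", text).strip()


def tokenize(text: str) -> list[str]:
    # Assumption: words and individual punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


if __name__ == "__main__":
    raw = "<p>Total:&nbsp;42&amp;up</p>\n"
    clean = preprocess(raw)
    print(clean)            # Total: 42&up
    print(tokenize(clean))  # ['Total', ':', '42', '&', 'up']
```

Running the cleanup before tokenization keeps the regular expressions simpler, since they no longer need to account for tags, entities, or irregular spacing.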
Specification