Computer implemented example-based concept-oriented data extraction method
First Claim
1. A computer implemented example-based concept-oriented data extraction method, comprising:
a first procedure for labeling an exemplary data string, comprising the steps of:
capturing an exemplary data string;
tokenizing the exemplary data string into a plurality of tokens as an exemplary token sequence, each token having an index;
specifying the exemplary token sequence as a plurality of specific concepts, each being labeled to be a tuple and consisting of at least one token, the specific concept being selected from the group of a target concept and a filler concept, the target concept pointing to the targeted data of interest, the filler concept pointing to the contextual data of the targeted data, each tuple having a format including a concept type, a concept name, a beginning index of the first token in the specific concept, an ending index of the last token in the specific concept, and an associated concept recognizer of the specific concept, wherein the associated concept recognizer is provided to recognize the possible token sequence of the specific concept; and
constructing an exemplary concept graph of the exemplary data string according to the tuples; and
a second procedure for extracting targeted data from an untested data string, comprising the steps of:
capturing an untested data string;
tokenizing the untested data string into a plurality of tokens as an untested token sequence;
using the associated concept recognizers defined by the tuples for detecting a plurality of concept candidates, wherein each concept candidate has a format including the beginning index and the ending index of the corresponding token sequence, and the concept name of the concept candidate;
constructing a preliminary concept graph of the untested token sequence according to the concept candidates; and
determining an optimal hypothetical concept sequence by comparing the exemplary concept graph with the preliminary concept graph and capturing at least one matched target concept from the optimal hypothetical concept sequence for extracting the targeted data.
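The labeling tuples and the candidate-detection step of claim 1 can be illustrated with a minimal Python sketch. It is not taken from the patent: the whitespace tokenizer, the regex-based recognizers, the `max_len` span limit, the sample strings, and every name below (`ConceptTuple`, `Candidate`, `regex_recognizer`, `detect_candidates`) are assumptions made purely for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    # Assumption: plain whitespace tokenization; the claim does not fix a tokenizer.
    return text.split()


@dataclass(frozen=True)
class ConceptTuple:
    concept_type: str                        # "target" or "filler"
    name: str                                # concept name
    begin: int                               # index of the first token
    end: int                                 # index of the last token (inclusive)
    recognizer: Callable[[List[str]], bool]  # associated concept recognizer


@dataclass(frozen=True)
class Candidate:
    name: str
    begin: int
    end: int


def regex_recognizer(pattern: str) -> Callable[[List[str]], bool]:
    # Hypothetical recognizer: accepts a token span when the space-joined
    # span fully matches the pattern.
    compiled = re.compile(pattern)

    def recognize(tokens: List[str]) -> bool:
        return compiled.fullmatch(" ".join(tokens)) is not None

    return recognize


def detect_candidates(tokens: List[str], labels: List[ConceptTuple],
                      max_len: int = 3) -> List[Candidate]:
    # Try every recognizer on every token span of up to max_len tokens.
    found = []
    for label in labels:
        for b in range(len(tokens)):
            for e in range(b, min(b + max_len, len(tokens))):
                if label.recognizer(tokens[b:e + 1]):
                    found.append(Candidate(label.name, b, e))
    return found


# First procedure: label the exemplary string "Price : 25 USD".
example_tokens = tokenize("Price : 25 USD")
labels = [
    ConceptTuple("filler", "label", 0, 1, regex_recognizer(r"Price :")),
    ConceptTuple("target", "amount", 2, 2, regex_recognizer(r"\d+")),
    ConceptTuple("filler", "currency", 3, 3, regex_recognizer(r"[A-Z]{3}")),
]

# Second procedure (first half): detect concept candidates in an untested string.
untested_tokens = tokenize("Price : 980 EUR")
candidates = detect_candidates(untested_tokens, labels)
```

Each `Candidate` carries only a concept name and a begin and end index, mirroring the candidate format recited in the claim; building the concept graphs and selecting the optimal sequence are separate steps not shown here.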
Abstract
The present invention relates to an example-based concept-oriented data extraction method. In an example labeling phase, the exemplary data string is converted into an exemplary token sequence, in which the target concepts and filler concepts are labeled to be tuples for use as an example, and thus an exemplary concept graph is constructed. In the data extraction phase, the untested data string is converted into an untested token sequence to be processed, and, based on the associated concept recognizers defined by the tuples in the example labeling phase, it is able to detect the concept candidates and establish the composite concepts and aggregate concepts, thereby constructing a hypothetical concept graph. After comparing the exemplary concept graph with the hypothetical concept graph, the optimal hypothetical concept sequence in the hypothetical graph is determined, so as to extract the targeted data from the matched target concepts.
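The comparison step described in the abstract can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented procedure: the exemplary concept graph is reduced to a linear list of concept names, the hypothetical graph is explored by enumerating gap-free paths over the candidates, similarity is scored with the standard library's `difflib.SequenceMatcher`, and composite and aggregate concepts are not modeled. The function names and the sample candidates (including the competing `note` concept) are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Set, Tuple

Candidate = Tuple[str, int, int]  # (concept name, begin index, end index)


def concept_paths(candidates: List[Candidate], n_tokens: int) -> List[List[Candidate]]:
    # Enumerate candidate sequences that cover tokens 0..n_tokens-1 without gaps
    # or overlaps, i.e. the paths through a simple concept graph.
    by_begin: Dict[int, List[Candidate]] = {}
    for cand in candidates:
        by_begin.setdefault(cand[1], []).append(cand)

    def walk(pos: int):
        if pos == n_tokens:
            yield []
            return
        for cand in by_begin.get(pos, []):
            for rest in walk(cand[2] + 1):
                yield [cand] + rest

    return list(walk(0))


def best_sequence(example_names: List[str], candidates: List[Candidate],
                  n_tokens: int) -> List[Candidate]:
    # Pick the path whose concept-name sequence is most similar to the example.
    def score(path: List[Candidate]) -> float:
        return SequenceMatcher(None, example_names, [c[0] for c in path]).ratio()

    paths = concept_paths(candidates, n_tokens)
    return max(paths, key=score) if paths else []


def extract_targets(tokens: List[str], path: List[Candidate],
                    target_names: Set[str]) -> Dict[str, str]:
    # Read the targeted data out of the matched target concepts.
    return {c[0]: " ".join(tokens[c[1]:c[2] + 1]) for c in path if c[0] in target_names}


tokens = ["Price", ":", "980", "EUR"]
candidates = [
    ("label", 0, 1),
    ("amount", 2, 2),
    ("currency", 3, 3),
    ("note", 2, 3),  # a competing candidate spanning the last two tokens
]
example_names = ["label", "amount", "currency"]

path = best_sequence(example_names, candidates, len(tokens))
extracted = extract_targets(tokens, path, {"amount"})
```

In this toy input two paths cover the token sequence, and the one matching the exemplary order of concepts is selected, so the `amount` target yields the token `980`. A real system would likely need a scoring scheme that tolerates missing or reordered filler concepts, which this sketch does not attempt.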
18 Claims
Specification