Systems and methods for modular information extraction
Abstract
Embodiments of the present invention include a computer-implemented method of extracting information. In one embodiment, the present invention comprises defining a plurality of reusable operators, wherein each operator performs a predefined information extraction task different from the other operators. Composite annotators may be created by specifying a composition of the reusable operators. Each operator may receive a searchable item, such as a web page or an annotation, and may generate one or more output annotations. The output annotations may be further processed by other reusable operators and the annotations may be stored in a repository for use during a search.
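The composition idea in the abstract can be illustrated with a minimal sketch. The `Annotation` class, the `compose` helper, the two toy operators, and the in-memory `repository` list are all hypothetical names chosen for illustration; the source does not specify any data model or API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Annotation:
    type: str  # label assigned by the operator that produced the annotation
    text: str  # the annotated text


# Assumption: an operator maps one searchable item to zero or more annotations.
Operator = Callable[[Annotation], List[Annotation]]


def compose(*operators: Operator) -> Operator:
    """Chain operators so each one processes every output of the previous one."""
    def composite(item: Annotation) -> List[Annotation]:
        current = [item]
        for op in operators:
            current = [out for ann in current for out in op(ann)]
        return current
    return composite


def split_sentences(item: Annotation) -> List[Annotation]:
    """Toy operator: one annotation per period-delimited sentence."""
    return [Annotation("sentence", s.strip()) for s in item.text.split(".") if s.strip()]


def keep_with_digit(item: Annotation) -> List[Annotation]:
    """Toy operator: pass through only text that contains a digit."""
    return [Annotation("has_number", item.text)] if any(c.isdigit() for c in item.text) else []


# Assumption: a plain list stands in for the annotation repository.
repository: List[Annotation] = []
annotator = compose(split_sentences, keep_with_digit)
page = Annotation("web_page", "Call us today. Price is 42 dollars. Thanks.")
repository.extend(annotator(page))
```

Because every operator has the same input and output shape in this sketch, the output of one operator can feed another, which is what makes the operators reusable across composite annotators.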
29 Claims
1. A computer-implemented method of extracting information comprising:
defining a plurality of reusable operators, wherein each operator performs a predefined information extraction task different from the other operators;
specifying a composition of said reusable operators to form a composite annotator, wherein each operator receives a searchable item and generates one or more output annotations; and
storing the output annotations for use during a search,
wherein the plurality of reusable operators include an extraction operator, wherein the extraction operator identifies features based on predefined criteria and generates one or more output annotations comprising the features extracted from one or more searchable items, the searchable items comprising text,
wherein the extraction operator extracts specified text by matching text in each of said searchable items against a first rule and a second rule, wherein if text satisfies the first rule and the second rule, text between the text satisfying the first rule and text satisfying the second rule is stored in an output annotation, and
wherein the extraction operator assigns a specified type to each of the one or more output annotations.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
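As a rough illustration of the two-rule extraction recited in claim 1, the sketch below assumes that both rules are regular expressions and that an annotation is a simple type/text pair. The claim does not say how rules are expressed, so `extract_between`, `Annotation`, and the sample text are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    type: str
    text: str


def extract_between(item_text: str, first_rule: str, second_rule: str,
                    out_type: str) -> List[Annotation]:
    """Store the text lying between a first-rule match and the next second-rule match."""
    first = re.compile(first_rule)
    second = re.compile(second_rule)
    results: List[Annotation] = []
    pos = 0
    while pos <= len(item_text):
        start = first.search(item_text, pos)
        if start is None:
            break
        end = second.search(item_text, start.end())
        if end is None:
            break
        # The output annotation carries the caller-specified type.
        results.append(Annotation(out_type, item_text[start.end():end.start()].strip()))
        # Always advance, even if a rule matched an empty string.
        pos = max(end.end(), start.start() + 1)
    return results


page = "Title: Modular Extraction; Author: A. Example; Title: Second Item;"
titles = extract_between(page, r"Title:", r";", "title")
```

Here the first rule marks where the wanted text begins and the second marks where it ends; only the text in between is stored, typed as `title`.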
13. A computer-implemented method of extracting information comprising:
defining a plurality of reusable operators, wherein each operator performs a predefined information extraction task different from the other operators;
specifying a composition of said reusable operators to form a composite annotator, wherein each operator receives a searchable item and generates one or more output annotations; and
storing the output annotations for use during a search,
wherein the plurality of reusable operators include a composition operator, the searchable items comprising text, and
wherein the composition operator receives an input annotation and two reference annotations and generates one or more output annotations comprising text between the two reference annotations.
Dependent claims: 14, 15.
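One way to picture the composition operator of claim 13 is with annotations stored as character offsets into the source text. This is a sketch under that assumption; the `Span` class, the `between` function, and the requirement that both references lie inside the input annotation are illustrative choices, not details taken from the claim.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    type: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def contains(outer: Span, inner: Span) -> bool:
    return outer.start <= inner.start and inner.end <= outer.end


def between(input_ann: Span, left_ref: Span, right_ref: Span,
            out_type: str = "between") -> List[Span]:
    """Return the span between two reference annotations inside the input annotation."""
    if not (contains(input_ann, left_ref) and contains(input_ann, right_ref)):
        return []
    if left_ref.end > right_ref.start:
        return []  # references overlap or are in the wrong order
    return [Span(out_type, left_ref.end, right_ref.start)]


text = "Price: 42 USD per unit"
sentence = Span("sentence", 0, len(text))
label = Span("label", text.find("Price:"), text.find("Price:") + len("Price:"))
currency = Span("currency", text.find("USD"), text.find("USD") + len("USD"))
amounts = between(sentence, label, currency, out_type="amount")
amount_text = [text[s.start:s.end].strip() for s in amounts]
```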
16. A computer-implemented method of extracting information comprising:
defining a plurality of reusable operators, wherein each operator performs a predefined information extraction task different from the other operators;
specifying a composition of said reusable operators to form a composite annotator, wherein each operator receives a searchable item and generates one or more output annotations; and
storing the output annotations for use during a search,
wherein the plurality of reusable operators include a composition operator, the searchable items comprising text, and
wherein the composition operator receives an input annotation and two reference annotation types and generates one or more output annotations comprising text between the two reference annotation types.
Dependent claims: 17.
18. A computer-implemented method of extracting information comprising:
defining a plurality of reusable operators, wherein each operator performs a predefined information extraction task different from the other operators;
specifying a composition of said reusable operators to form a composite annotator, wherein each operator receives a searchable item and generates one or more output annotations; and
storing the output annotations for use during a search,
wherein the plurality of reusable operators include a composition operator, the searchable items comprising text, and
wherein the composition operator receives an input annotation, two reference annotations, and a rule, and generates one or more output annotations comprising text between the two reference annotations that satisfy the rule.
19. A computer-implemented method of extracting information comprising:
defining a plurality of reusable operators, wherein each operator performs a predefined information extraction task different from the other operators;
specifying a composition of said reusable operators to form a composite annotator, wherein each operator receives a searchable item and generates one or more output annotations; and
storing the output annotations for use during a search,
wherein the plurality of reusable operators include a composition operator, the searchable items comprising text, and
wherein the composition operator receives an input annotation, two reference annotation types, and a rule, and generates one or more output annotations comprising text between the two reference annotation types that satisfy the rule.
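The type-based, rule-filtered variant in claim 19 might look like the following sketch. It assumes the rule is a regular expression applied to the text between the references (the claim wording leaves open what exactly must satisfy the rule), and it pairs each left-type annotation with the nearest following right-type annotation. `Span`, `spans_of`, and `between_types` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    type: str
    start: int
    end: int


def spans_of(text: str, pattern: str, type_: str) -> List[Span]:
    """Helper for the example: annotate every regex match with the given type."""
    return [Span(type_, m.start(), m.end()) for m in re.finditer(pattern, text)]


def between_types(text: str, input_ann: Span, annotations: List[Span],
                  left_type: str, right_type: str, rule: str,
                  out_type: str = "between") -> List[Span]:
    """Spans between annotations of two types whose enclosed text matches `rule`."""
    inside = sorted(
        (a for a in annotations
         if input_ann.start <= a.start and a.end <= input_ann.end),
        key=lambda a: a.start,
    )
    rights = [a for a in inside if a.type == right_type]
    pattern = re.compile(rule)
    out: List[Span] = []
    for left in (a for a in inside if a.type == left_type):
        # Assumption: pair with the nearest right-type annotation after `left`.
        right = next((r for r in rights if r.start >= left.end), None)
        if right is not None and pattern.fullmatch(text[left.end:right.start].strip()):
            out.append(Span(out_type, left.end, right.start))
    return out


text = "from 10 to 20; from abc to 30"
annotations = spans_of(text, r"\bfrom\b", "from_kw") + spans_of(text, r"\bto\b", "to_kw")
whole = Span("document", 0, len(text))
numbers = between_types(text, whole, annotations, "from_kw", "to_kw", r"\d+",
                        out_type="lower_bound")
number_text = [text[s.start:s.end].strip() for s in numbers]
```

Passing annotation types instead of specific annotations lets one call cover every matching pair in the input, while the rule discards pairs whose enclosed text is not of interest (here, `abc`).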
20. A computer-implemented method of extracting information comprising:
defining a plurality of reusable operators, wherein each operator performs a predefined information extraction task different from the other operators;
specifying a composition of said reusable operators to form a composite annotator, wherein each operator receives a searchable item and generates one or more output annotations; and
storing the output annotations for use during a search,
wherein the plurality of reusable operators include a composition operator, and
wherein the composition operator receives an input annotation, first and second reference annotations, and an expression specifying a semantic relationship between the first reference annotation and the second reference annotation, and
wherein the composition operator checks if the first and second reference annotations are within the input annotation and selects annotations matching the expression.
Dependent claims: 21, 22, 23.
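The containment check and expression match of claim 20 can be sketched as below. The claim does not define the expression language, so this example assumes the expression is a Python predicate over two spans; `select_related` and `precedes_within` are hypothetical, and the operator accepts lists of candidate references for convenience.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Span:
    type: str
    start: int
    end: int


# Assumption: the "expression" is modeled as a predicate over two annotations.
Expression = Callable[[Span, Span], bool]


def contains(outer: Span, inner: Span) -> bool:
    return outer.start <= inner.start and inner.end <= outer.end


def select_related(input_ann: Span, firsts: List[Span], seconds: List[Span],
                   expression: Expression) -> List[Tuple[Span, Span]]:
    """Keep pairs that both lie within the input annotation and match the expression."""
    return [
        (a, b)
        for a in firsts
        for b in seconds
        if contains(input_ann, a) and contains(input_ann, b) and expression(a, b)
    ]


def precedes_within(max_gap: int) -> Expression:
    """Example expression: the first annotation ends at most `max_gap` chars before the second."""
    return lambda a, b: 0 <= b.start - a.end <= max_gap


sentence = Span("sentence", 0, 40)
person = Span("person", 0, 5)
near_company = Span("company", 15, 20)
far_company = Span("company", 50, 55)  # lies outside the sentence
pairs = select_related(sentence, [person], [near_company, far_company], precedes_within(12))
```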
24. A computer-implemented method of extracting information comprising:
defining a plurality of reusable operators, wherein each operator performs a predefined information extraction task different from the other operators;
specifying a composition of said reusable operators to form a composite annotator, wherein each operator receives a searchable item and generates one or more output annotations; and
storing the output annotations for use during a search,
wherein the plurality of reusable operators include a composition operator, and
wherein the composition operator receives a first input annotation, a second input annotation, a first parameter for specifying the output annotation type, and a second parameter for specifying a type of relationship that the output annotation will express,
wherein the second parameter specifies the type of relationship as one of (i) a hierarchy, wherein the output annotation was created from the first input annotation and the second input annotation, or (ii) an attribute, wherein the output annotation is a composition of the first input annotation and the second input annotation.
Dependent claims: 25.
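The two relationship types in claim 24 are open to interpretation. The sketch below takes one reading: a hierarchy output records its two inputs as parents (provenance), while an attribute output embeds the two inputs as named fields. The `Relationship` enum, the `parents` and `attributes` fields, and the `combine` function are hypothetical and are not drawn from the source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Relationship(Enum):
    HIERARCHY = "hierarchy"
    ATTRIBUTE = "attribute"


@dataclass(frozen=True)
class Annotation:
    type: str
    text: str
    parents: Tuple["Annotation", ...] = ()        # filled for HIERARCHY outputs
    attributes: Tuple[Tuple[str, str], ...] = ()  # filled for ATTRIBUTE outputs


def combine(first: Annotation, second: Annotation, out_type: str,
            relationship: Relationship) -> Annotation:
    """Build one output annotation from two inputs, typed by the first parameter."""
    text = f"{first.text} {second.text}"
    if relationship is Relationship.HIERARCHY:
        # Assumed reading: the output remembers which annotations it was created from.
        return Annotation(out_type, text, parents=(first, second))
    if relationship is Relationship.ATTRIBUTE:
        # Assumed reading: the inputs become (type, text) attributes of the output.
        return Annotation(out_type, text,
                          attributes=((first.type, first.text), (second.type, second.text)))
    raise ValueError(f"unsupported relationship: {relationship!r}")


name = Annotation("name", "Widget")
price = Annotation("price", "$5")
product = combine(name, price, "product", Relationship.ATTRIBUTE)
derived = combine(name, price, "product", Relationship.HIERARCHY)
```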
26. A computer-readable medium containing instructions for controlling a computer system to perform a method of extracting information comprising:
receiving a plurality of first annotations, the first annotations comprising text;
processing the plurality of first annotations using a reusable extraction operator, and in accordance therewith, generating a plurality of second annotations, wherein the extraction operator identifies text in the first annotations that satisfy one or more rules or match text in a predefined list of text, and generates one or more output annotations comprising the identified text; and
processing the plurality of second annotations using a reusable context operator, and in accordance therewith, generating a plurality of third annotations, the second annotations comprising text, wherein the context operator receives each of the second annotations and a reference annotation and generates output annotations comprising a plurality of text adjacent to the reference annotation;
processing the plurality of third annotations using a reusable composition operator, wherein the composition operator receives each of the third annotations, one or more first reference annotations, and one or more second reference annotations, and generates output annotations comprising text between the first and second reference annotations.
Dependent claims: 27, 28.
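The three-stage flow of claim 26 (extraction, then context, then composition) could be chained roughly as follows. This is a simplified sketch with several assumptions: annotations are character spans over one document, the extraction stage matches a predefined list of terms, each extracted term serves as its own reference for the context stage, the context window is measured in characters and includes the reference itself, and the first and second references for composition are given as regular expressions. All function names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    type: str
    start: int
    end: int


def extract_terms(text: str, scope: Span, terms: List[str]) -> List[Span]:
    """Extraction: spans of listed terms found inside `scope`."""
    out: List[Span] = []
    for term in terms:
        for m in re.finditer(re.escape(term), text[scope.start:scope.end]):
            out.append(Span("term", scope.start + m.start(), scope.start + m.end()))
    return sorted(out, key=lambda s: s.start)


def context_window(text: str, ref: Span, window: int) -> Span:
    """Context: the span reaching `window` characters on either side of `ref`."""
    return Span("context", max(0, ref.start - window), min(len(text), ref.end + window))


def text_between(text: str, scope: Span, first_ref: str, second_ref: str) -> List[str]:
    """Composition: text between the first match of each reference pattern in `scope`."""
    chunk = text[scope.start:scope.end]
    a = re.search(first_ref, chunk)
    if a is None:
        return []
    b = re.search(second_ref, chunk[a.end():])
    if b is None:
        return []
    return [chunk[a.end():a.end() + b.start()].strip()]


doc = "Notes. Flight departs from Boston to Denver at noon. End."
first = [Span("document", 0, len(doc))]
second = [s for f in first for s in extract_terms(doc, f, ["Flight"])]
third = [context_window(doc, s, 40) for s in second]
origins = [r for t in third for r in text_between(doc, t, r"\bfrom\b", r"\bto\b")]
```

Each stage narrows the search area for the next one: the term match locates a region of interest, the context window bounds it, and the composition step pulls out the text between the two references.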
29. A computer-readable medium containing instructions for controlling a computer system to perform a method of extracting information comprising:
receiving a plurality of first annotations;
processing the plurality of first annotations using a reusable extraction operator, and in accordance therewith, generating a plurality of second annotations; and
processing the plurality of second annotations using a reusable composition operator, and in accordance therewith, generating a plurality of third annotations, the second annotations comprising text, wherein the composition operator receives each of the second annotations, one or more first reference annotations, and one or more second reference annotations, and generates output annotations comprising text between the first and second reference annotations.
Specification