Method and apparatus for creating extractors, field information objects and inheritance hierarchies in a framework for retrieving semistructured information
Abstract
According to the invention, a system and method are provided for extracting information from a semistructured information source. The system includes a listing stack for holding extracted information. A means for matching at least one extractor to the semistructured information to return a list of potential matches is also included. The system can also include a means for iterating through the list of potential matches and a means for retrieving information from a particular match in the list of potential matches. A means for adding a particular match into the listing stack can also be part of the system.
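The flow summarized in the abstract (match an extractor, iterate over the potential matches, retrieve information from each match, add it to the listing stack) can be sketched roughly as follows. This is only an illustration under stated assumptions: the extractor is taken to be a regular expression (as in the dependent claims), the listing stack is modeled as a plain Python list of (field, value) pairs, and the names `extract_to_stack`, `sample`, and `extractors` are hypothetical and do not come from the patent.

```python
import re
from typing import List, Tuple

# Hypothetical extractor representation: (field name, compiled regex).
Extractor = Tuple[str, re.Pattern]


def extract_to_stack(text: str, extractors: List[Extractor]) -> List[Tuple[str, str]]:
    """Match each extractor against the text and push results onto a listing stack."""
    listing_stack: List[Tuple[str, str]] = []
    for field_name, pattern in extractors:
        # Matching returns a list of potential matches.
        potential_matches = list(pattern.finditer(text))
        # Iterate through the list of potential matches.
        for match in potential_matches:
            # Retrieve information from this particular match:
            # the first capture group if present, else the whole match.
            value = match.group(1) if match.groups() else match.group(0)
            # Add the match to the listing stack.
            listing_stack.append((field_name, value.strip()))
    return listing_stack


# Assumed sample input resembling real estate listings.
sample = "3 bed house, $250,000, Palo Alto. 2 bed condo, $180,000, Menlo Park."
extractors = [
    ("price", re.compile(r"\$(\d{1,3}(?:,\d{3})*)")),
    ("beds", re.compile(r"(\d+) bed")),
]
stack = extract_to_stack(sample, extractors)
```

Here the stack simply records matches in extractor order; how the patented system orders or groups entries is not specified in this section.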
30 Claims
-
1. A system for extracting information from a semistructured information source comprising:
-
a listing stack for holding extracted information;
a means for matching at least one extractor to said semistructured information to return a list of potential matches;
a means for iterating through said list of potential matches;
a means for retrieving information from a particular match in said list of potential matches;
a means for adding a particular match into said listing stack;
means for computing a cross product of fields in a first row from said listing stack;
means for adding said cross product of the fields in said first row to a list of accepted rows;
means for computing a selective cross product from a remaining row r and the list of accepted rows, for each remaining row r in a plurality of remaining rows in said listing stack; and
means for removing from the list of accepted rows at least one of a plurality of rows having identical fields.
-
2. The system of claim 1 further comprising:
a means for indicating the start of a group of associated information in said listing stack.
-
3. The system of claim 2 further comprising:
a means for indicating the end of a group of associated information in said listing stack.
-
4. The system of claim 1 further comprising:
a table for holding information;
a means for converting information stored in said listing stack to a plurality of rows in said table.
-
5. The system of claim 1 further comprising:
a means for returning a string for a URL.
-
6. The system of claim 1 wherein said means for matching further comprises a mask for controlling the matching process.
-
7. The system of claim 1 wherein said semistructured information comprises real estate listings.
-
8. The system of claim 1 wherein said semistructured information comprises job listings.
-
9. The system of claim 1 wherein said semistructured information comprises items for purchase or sale.
-
28. The system of claim 1 wherein the at least one extractor is a regular expression.
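The row-building elements of claim 1 (cross product of the first row, selective cross product with each remaining row, removal of rows having identical fields) can be illustrated with a minimal sketch. The claims do not define what makes the cross product "selective", so the rule used here is purely an assumption: a remaining row is combined with an accepted row only when the two agree on every field they share. The row layout (field name mapped to a list of candidate values) and all function names are likewise hypothetical.

```python
from itertools import product
from typing import Dict, List

Row = Dict[str, List[str]]   # assumed layout: field name -> candidate values
FlatRow = Dict[str, str]     # one value per field


def cross_product(row: Row) -> List[FlatRow]:
    """Expand a row with multi-valued fields into one flat row per combination."""
    fields = sorted(row)
    return [dict(zip(fields, combo)) for combo in product(*(row[f] for f in fields))]


def selective_cross_product(accepted: List[FlatRow], row: Row) -> List[FlatRow]:
    """Assumed rule: merge only pairs that agree on all shared fields."""
    merged: List[FlatRow] = []
    for a in accepted:
        for b in cross_product(row):
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                merged.append({**a, **b})
    return merged


def remove_identical(rows: List[FlatRow]) -> List[FlatRow]:
    """Keep the first of any rows whose fields are identical."""
    seen = set()
    unique: List[FlatRow] = []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def stack_to_rows(stack: List[Row]) -> List[FlatRow]:
    if not stack:
        return []
    accepted = cross_product(stack[0])          # first row
    for r in stack[1:]:                         # each remaining row r
        accepted = selective_cross_product(accepted, r)
    return remove_identical(accepted)


listing_stack = [
    {"city": ["Palo Alto"], "beds": ["2", "3"]},
    {"price": ["$250,000"]},
    {"beds": ["3"], "agent": ["Lee"]},
]
rows = stack_to_rows(listing_stack)
```

Under the assumed rule, the third stack row conflicts with the `beds = 2` combination, so only the `beds = 3` row survives. A different reading of "selective" would change `selective_cross_product` but leave the overall loop intact.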
-
10. A method for extracting information from a semistructured information source into a listing stack comprising:
-
examining said semistructured information to identify patterns of interest;
examining the patterns of interest to identify attributes that correspond to fields of a relational database schema;
generating a wrapper based upon the attributes;
matching at least one extractor to said semistructured information to return a list of potential matches using the wrapper;
iterating through said list of potential matches;
retrieving information from a particular match in said list of potential matches; and
adding a particular match into said listing stack.
-
11. The method of claim 10 further comprising:
indicating the start of a group of associated information in said listing stack.
-
12. The method of claim 10 further comprising:
indicating the end of a group of associated information in said listing stack.
-
13. The method of claim 10 further comprising:
converting information stored in said listing stack to a plurality of rows in a table.
-
14. The method of claim 10 further comprising:
returning a string for a URL.
-
15. The method of claim 10 wherein said matching further comprises:
controlling the matching process using a mask.
-
16. The method of claim 10 wherein said semistructured information comprises:
real estate listings.
-
17. The method of claim 10 wherein said semistructured information comprises job listings.
-
18. The method of claim 10 wherein said semistructured information comprises items for purchase or sale.
-
29. The system of claim 10 wherein the at least one extractor is a regular expression.
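The dependent method claims on indicating the start and end of a group of associated information, and on converting the listing stack to table rows, might be realized along the following lines. This is a sketch under assumptions, not the patent's implementation: group boundaries are modeled as sentinel entries on the stack, each group becomes one table row, and the names `Marker` and `stack_to_table` are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Tuple, Union


class Marker(Enum):
    """Hypothetical sentinels marking group boundaries on the listing stack."""
    START = "start"
    END = "end"


Entry = Union[Marker, Tuple[str, str]]


def stack_to_table(stack: List[Entry]) -> List[Dict[str, str]]:
    """Convert a listing stack with group markers into a list of table rows."""
    table: List[Dict[str, str]] = []
    current = None
    for entry in stack:
        if entry is Marker.START:
            current = {}                      # begin a new group
        elif entry is Marker.END:
            if current is not None:
                table.append(current)         # close the group as one row
            current = None
        elif current is not None:
            field_name, value = entry
            current[field_name] = value       # field belongs to the open group
        # Entries outside any group are ignored in this sketch.
    return table


job_stack: List[Entry] = [
    Marker.START, ("title", "Engineer"), ("city", "Austin"), Marker.END,
    Marker.START, ("title", "Analyst"), Marker.END,
]
table = stack_to_table(job_stack)
```

Using explicit markers keeps fields from adjacent listings from bleeding into one another when a listing omits a field, as the second group above does with `city`.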
-
19. A computer programming product for extracting information from a semistructured information source and storing said information so extracted into a listing stack comprising:
-
code for examining said semistructured information to identify patterns of interest;
code for examining the patterns of interest to identify attributes that correspond to fields of a relational database schema;
code for generating a wrapper based upon the attributes;
code for matching at least one extractor to said semistructured information using the wrapper to return a list of potential matches;
code for iterating through said list of potential matches;
code for retrieving information from a particular match in said list of potential matches;
code for adding a particular match into said listing stack; and
a computer readable storage medium for holding said codes.
-
20. The computer programming product of claim 19 further comprising:
code for indicating the start of a group of associated information in said listing stack.
-
21. The computer programming product of claim 20 further comprising:
code for indicating the end of a group of associated information in said listing stack.
-
22. The computer programming product of claim 19 further comprising:
code for converting information stored in said listing stack to a plurality of rows in a table.
-
23. The computer programming product of claim 19 further comprising:
code for returning a string for a URL.
-
24. The computer programming product of claim 19 wherein said code for matching further comprises code for controlling the matching process under a mask.
-
25. The computer programming product of claim 19 wherein said semistructured information comprises real estate listings.
-
26. The computer programming product of claim 19 wherein said semistructured information comprises job listings.
-
27. The computer programming product of claim 19 wherein said semistructured information comprises items for purchase or sale.
-
30. The system of claim 19 wherein the at least one extractor is a regular expression.
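The wrapper-related steps of claims 10 and 19 (identify attributes that correspond to relational schema fields, generate a wrapper, match extractors using the wrapper) together with the mask of claims 6, 15, and 24 can be sketched as below. Every detail here is an assumption made for illustration: the wrapper is modeled as a mapping from schema field names to regular expressions, and the mask as a set of field names whose extractors are allowed to run. The claims do not specify either representation, and `Wrapper` and `generate_wrapper` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Wrapper:
    """Hypothetical wrapper: one regex extractor per schema field."""
    extractors: Dict[str, re.Pattern]

    def match(self, text: str, mask: Optional[Set[str]] = None) -> List[Tuple[str, str]]:
        """Run the extractors; if a mask is given, only masked-in fields are matched."""
        listing_stack: List[Tuple[str, str]] = []
        for field_name, pattern in self.extractors.items():
            if mask is not None and field_name not in mask:
                continue
            for m in pattern.finditer(text):
                listing_stack.append((field_name, m.group(1)))
        return listing_stack


def generate_wrapper(attribute_patterns: Dict[str, str], schema_fields: List[str]) -> Wrapper:
    """Keep only the patterns whose attribute corresponds to a schema field."""
    return Wrapper({
        name: re.compile(pattern)
        for name, pattern in attribute_patterns.items()
        if name in schema_fields
    })


# Assumed patterns of interest found in a job-listing source.
patterns = {
    "title": r"Title:\s*(\w[\w ]*)",
    "salary": r"Salary:\s*\$([\d,]+)",
    "logo": r"Logo:\s*(\S+)",          # no matching schema field, so it is dropped
}
wrapper = generate_wrapper(patterns, schema_fields=["title", "salary"])
text = "Title: Engineer\nSalary: $90,000\nLogo: acme.png\n"
all_fields = wrapper.match(text)
salary_only = wrapper.match(text, mask={"salary"})
```

Filtering patterns against the schema at wrapper-generation time means the matching loop never produces fields the table cannot hold; the mask then narrows matching further on a per-call basis.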
Specification