Automated data extraction and reformatting
Abstract
A system and method for automated browsing and data extraction from Internet Web sites. A preferred method and system selects various data elements within the Web site during a design phase and extracts data from the Web site based on the matching of the selected data elements at the Web site during a playback phase. Another preferred method and system extracts XML data based on matching previously selected XML data elements during a design phase with XML data elements present during a playback phase, and reformats the extracted XML data into a relational format.
33 Claims
1. A computer-implemented method for automated data extraction from a Web site, comprising:
(a) navigating to a Web site during a design phase;
(b) extracting data elements associated with said Web site and producing a visible display corresponding to said extracted data elements;
(c) selecting and storing at least one Page ID data element in said display from said data elements;
(d) selecting and storing one or more Extraction data elements in said display;
(e) selecting and storing at least one Base ID data element having an offset distance from said Extraction elements;
(f) setting a tolerance for possible deviation from said offset distance; and
(g) renavigating to said Web site during a playback phase and extracting data from said Extraction data elements if said Page ID data element is located in said Web site and if said offset distance of said Base ID data element has not changed by more than said adjustable tolerance.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
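For illustration only, the playback check of step (g) might be sketched as below. The sketch assumes that the page has already been flattened into an ordered list of text elements, that an "offset distance" is a difference in list position, and that the tolerance limits how far the Base ID element may have moved since the design phase. The names `Recipe` and `play_back` are hypothetical and do not appear in the claims.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Recipe:
    """Selections stored during the design phase (hypothetical structure)."""
    page_id: str        # text that identifies the expected page
    base_id: str        # text of the Base ID (anchor) element
    base_index: int     # list position of the Base ID element at design time
    offsets: List[int]  # distance from the Base ID to each Extraction element
    tolerance: int      # allowed shift of the Base ID position


def play_back(recipe: Recipe, elements: List[str]) -> Optional[List[str]]:
    """Return the extracted values, or None if the page fails a check."""
    if recipe.page_id not in elements:
        return None  # Page ID not found: not the expected page
    if recipe.base_id not in elements:
        return None  # anchor element is missing
    new_index = elements.index(recipe.base_id)
    if abs(new_index - recipe.base_index) > recipe.tolerance:
        return None  # anchor moved further than the tolerance allows
    targets = [new_index + off for off in recipe.offsets]
    if any(t < 0 or t >= len(elements) for t in targets):
        return None  # an Extraction element would fall outside the page
    return [elements[t] for t in targets]
```

Under these assumptions, a page that gains one extra element above the anchor still yields data when the tolerance is 1, while a shift of two positions is rejected.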
9. A computer-implemented method for automated data extraction from a Web site, comprising:
(a) navigating to a Web site during a design phase;
(b) extracting data elements associated with said Web site and producing a visible current display grid corresponding to said extracted data elements;
(c) selecting and storing at least one Page ID data element in said current display from said data elements;
(d) selecting and storing one or more Extraction data elements in said current display;
(e) selecting and storing at least one Base ID data element in said current display having an offset distance from said Extraction elements;
(f) entering a tolerance in said current display for possible deviation from said offset distance;
(g) displaying a playback display grid during a playback phase with said selected Page ID data element, said Extraction data elements, and said Base ID data element;
(h) renavigating to said Web site;
(i) extracting data elements associated with said Web site to said visible current display grid;
(j) comparing said extracted data elements in said current display grid with said playback display grid and extracting data from said Extraction data elements if said Page ID data element is found in said current display grid and if said offset distance of said Base ID data element has not changed by more than said tolerance; and
(k) adjusting said tolerance based on said offset distance of said Extraction elements found during renavigation.
Dependent claims: 10
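The claim does not state how the tolerance is adjusted in step (k). As one possible reading, the sketch below measures how many rows the Base ID row moved between the playback grid and the current grid, then keeps a small headroom above that shift. The row layout `(tag, text)`, the function names, and the `margin` policy are all assumptions made for illustration.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # (HTML tag, text) for one grid row; an assumed layout


def observed_shift(playback: List[Row], current: List[Row], base: Row) -> int:
    """Number of rows the Base ID row moved between the two grids.

    Raises ValueError if the Base ID row is absent from either grid.
    """
    return current.index(base) - playback.index(base)


def adjust_tolerance(tolerance: int, shift: int, margin: int = 1) -> int:
    """Assumed policy: never shrink, and keep `margin` rows above the last shift."""
    return max(tolerance, abs(shift) + margin)
```

With this policy the tolerance grows only when a page drifts close to the current limit, so a single unusually stable visit does not tighten it.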
11. A computer-implemented method for automated browsing of Web sites on a global communications network and for extracting usable data, comprising:
(a) accessing at least one Web site page containing data, wherein said data comprises a plurality of data formats;
(b) transforming said data in a plurality of formats into a computer-readable list;
(c) identifying a base data element from said list;
(d) identifying an offset from said base data element to the usable data; and
(e) extracting the usable data for use by a user regardless of changes to the Web site, provided that said offset between said base data element and the usable data does not change.
Dependent claims: 12
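Steps (b) through (e) can be pictured with a small sketch that flattens an HTML page into a list of text elements and then reads the value at a fixed offset from a base element. It uses Python's standard `html.parser` module; the class `ElementLister` and the function `extract_by_offset` are hypothetical names, and treating each non-empty text node as one list entry is an assumption about what the "computer-readable list" contains.

```python
from html.parser import HTMLParser
from typing import List


class ElementLister(HTMLParser):
    """Collects the non-empty text nodes of a page in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.items: List[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.items.append(text)


def extract_by_offset(html: str, base_text: str, offset: int) -> str:
    """Return the list entry located `offset` positions after the base element."""
    lister = ElementLister()
    lister.feed(html)
    lister.close()
    base = lister.items.index(base_text)  # ValueError if the base element is gone
    target = base + offset
    if not 0 <= target < len(lister.items):
        raise IndexError("offset points outside the page")
    return lister.items[target]
```

Because the lookup depends only on the base element and the offset, the surrounding markup can change (a table replaced by `div` and `span` tags, for instance) without breaking the extraction, as long as the two elements stay the same distance apart.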
13. A computer-implemented method for automated browsing Web sites and for extracting usable data, comprising:
(a) filling a current display grid with rows of HTML data elements from at least one Web site page currently selected by a Web browser;
(b) displaying in a playback display grid previously-stored HTML data elements;
(c) examining said rows of said playback grid to locate an HTML data element previously selected as a Page ID data element;
(d) comparing said rows of said current grid to locate an HTML element that matches said Page ID data element;
(e) examining said rows of said playback grid to locate HTML data elements previously selected as Extraction data elements and a Base ID data element used as a reference for locating said Extraction data elements;
(f) comparing said rows of said current grid to locate HTML elements that match said Extraction data elements and match said Base ID data element;
(g) extracting data from said Extraction data elements regardless of changes to said Web site, provided that said Page ID elements match and any offset between said Base ID elements is within a predetermined tolerance; and
(h) resetting said tolerance based on said offset of said Base ID elements.
14. A computer-based system for automatically browsing Web sites, comprising a client computer and a server computer for receiving requests from said client computers over a network connecting said client and server computers, said client computer running an application to:
(a) navigate to a Web site during a design phase;
(b) extract data elements associated with said Web site and produce a visible display corresponding to said extracted data elements;
(c) select and store at least one Page ID data element in said display from said data elements;
(d) select and store one or more Extraction data elements in said display;
(e) select and store at least one Base ID data element having an offset distance from said Extraction elements;
(f) set an adjustable tolerance for possible deviation from said offset distance;
(g) renavigate to said Web site during a playback phase and extract data from said Extraction data elements if said Page ID data element is located in said Web site and if said offset distance of said Base ID data element has not changed by more than said tolerance; and
(h) reset said tolerance based on changes to said Web site found during renavigation.
15. A computer-implemented method for automated data extraction, comprising:
(a) identifying selections of data elements in one of a plurality of data formats for extraction from a source of data comprising data stored in one of said plurality of formats;
(b) storing information related to said identified selections of data elements in XML format for subsequent use;
(c) acquiring said source of data and retrieving said data elements;
(d) comparing said retrieved XML data elements to said identified selections and extracting only the data from said data elements that correspond to said identified selections; and
(e) reformatting said extracted XML data into a relational format.
Dependent claims: 16, 17, 18, 19
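As a rough illustration of steps (d) and (e), the sketch below keeps only the selected child fields of each record element and loads them into a relational table. It relies on Python's standard `xml.etree.ElementTree` and `sqlite3` modules. The function names, the idea of one table row per record element, and the use of SQLite as the relational target are assumptions; the claim does not prescribe any of them.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import List, Tuple


def xml_to_rows(xml_text: str, record_tag: str,
                fields: List[str]) -> List[Tuple[str, ...]]:
    """One tuple per <record_tag> element, keeping only the selected fields."""
    root = ET.fromstring(xml_text)
    return [
        tuple(record.findtext(name, default="") for name in fields)
        for record in root.iter(record_tag)
    ]


def load_rows(conn: sqlite3.Connection, table: str, fields: List[str],
              rows: List[Tuple[str, ...]]) -> None:
    """Create the table if needed and insert the extracted rows."""
    cols = ", ".join(f'"{name}"' for name in fields)
    marks = ", ".join("?" for _ in fields)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
```

Child elements that were not selected, such as a free-text note, are simply never read, which mirrors the "extracting only the data ... that correspond to said identified selections" language.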
20. A computer-implemented method for automated XML data extraction, comprising:
(a) navigating to a Web site including a plurality of web pages containing XML data;
(b) identifying selections of XML data elements for extraction from said Web site from said plurality of pages, said XML data comprising data elements containing said data stored in XML format;
(c) storing information related to said identified selections of XML data elements for subsequent use;
(d) re-navigating to said Web site and retrieving said XML data elements from said plurality of web pages;
(e) comparing said retrieved XML data elements to said identified selections and extracting only the data from said XML data elements that correspond to said identified selections; and
(f) reformatting said extracted XML data into a relational format.
Dependent claims: 21
22. A computer-implemented method for automated XML data extraction, comprising:
(a) navigating a client computer to a Web site including a plurality of web pages, said Web site containing XML data;
(b) generating a graphical tree structure on said client computer to display XML nodes and subnodes representing said XML data at said plurality of web pages on said Web site;
(c) selecting one or more of said nodes and/or subnodes from said tree structure associated with the data to be extracted;
(d) storing information related to said selected nodes and/or subnodes;
(e) renavigating said client computer to said Web site and retrieving said XML data using said information;
(f) comparing said retrieved XML data with said selected nodes and/or subnodes and extracting only the data corresponding to said selected nodes and/or subnodes; and
(g) reformatting said extracted XML data into a relational format.
Dependent claims: 23
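The node selection of steps (b) through (f) can be approximated in text form. The sketch below walks an XML document, produces an indented outline as a stand-in for the graphical tree, and then extracts only the nodes whose paths were selected. Identifying a node by a slash-separated path of tag names is an assumption, and `walk`, `outline`, and `extract_selected` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, List, Tuple


def walk(node: ET.Element, path: str = "",
         depth: int = 0) -> Iterator[Tuple[int, str, ET.Element]]:
    """Yield (depth, path, element) for the node and all of its subnodes."""
    here = f"{path}/{node.tag}"
    yield depth, here, node
    for child in node:
        yield from walk(child, here, depth + 1)


def outline(xml_text: str) -> List[str]:
    """Indented outline with one line per distinct node path."""
    seen, lines = set(), []
    for depth, path, node in walk(ET.fromstring(xml_text)):
        if path not in seen:
            seen.add(path)
            lines.append("  " * depth + node.tag)
    return lines


def extract_selected(xml_text: str,
                     selected: List[str]) -> List[Tuple[str, str]]:
    """Return (path, text) pairs for nodes whose path was selected."""
    wanted = set(selected)
    return [
        (path, (node.text or "").strip())
        for _, path, node in walk(ET.fromstring(xml_text))
        if path in wanted
    ]
```

Storing the selected paths is enough to repeat the extraction on a later visit, since the comparison in `extract_selected` needs only the paths and the freshly retrieved document.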
24. A computer readable medium storing a set of instructions for controlling a computer to automatically extract desired XML data from a source of data in a plurality of formats, said medium comprising a set of instructions for causing said computer to:
(a) identify selections of data elements for extraction from a source of data comprising data stored in a plurality of formats;
(b) store information related to said identified selections of data elements for subsequent use;
(c) acquire said source of data and retrieve said data elements in XML format;
(d) compare said retrieved XML data elements to said identified selections and extract only the data from said data elements that correspond to said identified selections; and
(e) reformat said extracted XML data into a relational format.
25. A computer-based system for automated XML data extraction, comprising a client computer and server computer for receiving requests from said client computer over a network connecting said client and server computers, said client computer running an application to:
(a) identify selections of XML data elements for extraction from a plurality of sources of XML data contained at said server computer;
(b) store information related to said identified selections of XML data elements for subsequent use;
(c) acquire said plurality of sources of XML data and retrieve said XML data elements from said plurality of sources;
(d) compare said retrieved XML data elements to said identified selections and extract only the data from said XML data elements that correspond to said identified selections; and
(e) reformat said extracted XML data into a relational format.
26. A computer-implemented method for automated data extraction from a Web site, comprising:
(a) navigating to a Web site during a design phase;
(b) extracting data elements associated with said Web site and producing a visible display corresponding to said extracted data elements;
(c) selecting and storing at least one Page ID data element in said display from said data elements;
(d) selecting and storing one or more Extraction data elements in said display;
(e) selecting and storing at least one Base ID data element having an offset distance from said Extraction elements;
(f) setting an adjustable tolerance for possible deviation from said offset distance; and
(g) renavigating to said Web site during a playback phase and extracting data from said Extraction data elements if said Page ID data element is located in said Web site and adjusting said tolerance based on said offset distance of said Base ID data element.
Dependent claims: 27, 28, 29, 30, 31, 32, 33
Specification