System for providing database functions for multiple internet sources
6 Assignments
0 Petitions
Abstract
A system for automatically extracting data from at least one electronic document accessible through the Internet or other computer network. The system records a sequence of actions operable to electronically navigate to a target page of the electronic document, the target page including a plurality of elements each having contents and a structural definition wherein the structural definitions interrelate the plurality of elements to specify a target pattern for a select subset of the plurality of elements. After recording the navigation path and the target pattern, the system automatically accesses the target page according to the recorded sequence. When the target page is accessed, the system automatically identifies, copies and processes selections from the plurality of elements dependent upon the target pattern.
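The record-then-replay flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the in-memory `SITE` dictionary, the `Recording` class, and the `replay` function are hypothetical names, and the target pattern is reduced to a list of element positions.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory "site" standing in for documents on a network.
# Each page maps link labels to other pages and holds (tag, text) elements.
SITE = {
    "home": {"links": {"Quotes": "quotes"}, "elements": [("h1", "Welcome")]},
    "quotes": {
        "links": {},
        "elements": [("h1", "Quotes"), ("td", "ACME"), ("td", "12.50")],
    },
}


@dataclass
class Recording:
    """What is captured on the initial visit (assumed structure)."""
    start: str
    actions: list = field(default_factory=list)  # link labels followed, in order
    pattern: list = field(default_factory=list)  # positions of selected elements


def replay(recording, site):
    """Follow the recorded link labels, then copy the elements named by the pattern."""
    page = recording.start
    for label in recording.actions:
        page = site[page]["links"][label]
    elements = site[page]["elements"]
    return [elements[i][1] for i in recording.pattern]


rec = Recording(start="home", actions=["Quotes"], pattern=[1, 2])
result = replay(rec, SITE)
```

Replaying the same recording after the page contents change returns the new values without any further user input, which is the point of separating the recording step from the playback step.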
293 Citations
43 Claims
1. A system for automatically extracting data from at least one electronic document in any of a plurality of formats, said at least one electronic document including a target page being accessible over a computer network, said target page comprising a plurality of elements each having a contents or structural definition, wherein said structural definition interrelates said plurality of elements, said system comprising:
a navigation module to record a sequence of actions associated with an initial visit by a user to said target page operable to navigate to said target page of said electronic document;
an extraction recording module to receive user inputs from said user defining information of interest to said user to be extracted from said plurality of elements of said target page and generating a target pattern for automatically extracting said information of interest to said user from said target page;
a navigation playback module to automatically access said target page according to said recorded sequence for at least one subsequent visit to said target page; and
an extraction playback module to automatically identify and scrape select ones of said plurality of elements dependent upon said target pattern for each said at least one subsequent visit to said target page;
said extraction recording module remapping said target page by re-identifying any modified structural definitions of said plurality of elements thereby to enable access to an altered target page;
said extraction playback module identifies and scrapes said select ones of said plurality of elements dependent upon said target pattern and said re-identified structural definitions to thereby automatically identify and scrape said select ones of said plurality of elements from said altered target page dependent upon said target pattern;
wherein information of interest to said user is automatically extracted from said target page for each said at least one subsequent visit to said target page.
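One way to picture the remapping step of claim 1 is sketched here. The claim does not say how modified structural definitions are re-identified, so this sketch assumes a simple strategy: the pattern stores a structural path plus a neighbouring anchor text, and when the path no longer exists the anchor is used to find the new path. The names `remap` and `extract`, and the path strings, are hypothetical.

```python
def remap(page, pattern):
    """Return a copy of the pattern whose path points at the element after the anchor text.

    page is a list of (structural_path, text) pairs in document order.
    """
    for i, (_path, text) in enumerate(page[:-1]):
        if text == pattern["anchor"]:
            return {**pattern, "path": page[i + 1][0]}
    raise LookupError("anchor text not found; pattern cannot be remapped")


def extract(page, pattern):
    """Scrape the target element, remapping first if the stored path is gone."""
    by_path = dict(page)
    if pattern["path"] not in by_path:
        pattern = remap(page, pattern)  # assumed fallback, not taken from the claims
    return by_path[pattern["path"]], pattern


original = [("/table/tr[1]/td[1]", "Price:"), ("/table/tr[1]/td[2]", "12.50")]
altered = [("/div/span[1]", "Price:"), ("/div/span[2]", "13.10")]
pattern = {"anchor": "Price:", "path": "/table/tr[1]/td[2]"}

value_before, _ = extract(original, pattern)
value_after, new_pattern = extract(altered, pattern)
```

This sketch only detects a path that has disappeared. A path that still exists but now points at a different element would need an additional check, such as comparing the scraped value against predicted results.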
a script for altering said sequence of actions, thereby generating a second sequence of actions; and
wherein said navigation playback module automatically executes said second sequence of actions, thereby accessing a second target page and automatically identifies and scrapes select ones of said plurality of elements from said second target page dependent upon said target pattern.
6. The system of claim 1 wherein:
said extraction recording module automatically accesses said target page according to said recorded sequence and remaps said target page by re-identifying said structural definitions of said plurality of elements;
said extraction playback module automatically identifies and scrapes select ones of said plurality of elements dependent upon said target pattern and said re-identified structural definitions; and
said select ones of said plurality of elements are compared to predicted results to determine whether said mapping is functioning properly.
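The comparison against predicted results in claim 6 could look like the sketch below. The claim does not define the form of the predicted results, so regular expressions describing each expected value are an assumption made here for illustration, and `mapping_ok` is a hypothetical name.

```python
import re


def mapping_ok(scraped, predicted):
    """Return True when every scraped value fully matches its predicted shape.

    predicted is a list of regular expressions, one per scraped value
    (an assumed representation of "predicted results").
    """
    if len(scraped) != len(predicted):
        return False
    return all(
        re.fullmatch(shape, value) is not None
        for shape, value in zip(predicted, scraped)
    )


healthy = mapping_ok(["ACME", "12.50"], [r"[A-Z]+", r"\d+\.\d{2}"])
broken = mapping_ok(["Login required", "12.50"], [r"[A-Z]+", r"\d+\.\d{2}"])
```

A failed check is a signal that the page has changed and that the target page may need to be remapped before the scraped data is trusted.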
7. The system of claim 1, further comprising:
means for generating at least one file including data indicative of said actions and said target pattern; and
means for storing said at least one file so as to be accessible over said computer network.
8. The system of claim 7 wherein said file is in XML format.
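A file of the kind described in claims 7 and 8, holding both the recorded actions and the target pattern in XML, might be produced as sketched here. The element and attribute names (`recording`, `action`, `pattern`, `type`, `target`, `path`) are invented for this example and are not taken from the patent.

```python
import xml.etree.ElementTree as ET


def recording_to_xml(actions, pattern):
    """Serialize (kind, target) action pairs and a pattern path to an XML string."""
    root = ET.Element("recording")
    actions_el = ET.SubElement(root, "actions")
    for kind, target in actions:
        ET.SubElement(actions_el, "action", {"type": kind, "target": target})
    ET.SubElement(root, "pattern", {"path": pattern})
    return ET.tostring(root, encoding="unicode")


def recording_from_xml(text):
    """Parse the XML string back into the actions list and the pattern path."""
    root = ET.fromstring(text)
    actions = [(a.get("type"), a.get("target")) for a in root.iter("action")]
    pattern = root.find("pattern").get("path")
    return actions, pattern


xml_text = recording_to_xml(
    [("link", "Quotes"), ("form", "search")], "/table/tr[1]/td[2]"
)
restored = recording_from_xml(xml_text)
```

Because the file round-trips cleanly, it can be stored on a server and loaded later by a separate playback process.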
9. The system of claim 1 wherein said computer network includes a global interconnection of computer networks.
10. The system of claim 1 wherein said extraction recording module ignores some of said structural definitions in identifying said target pattern.
11. The system of claim 1 wherein said modules are plug-ins in a browser.
12. A method for automatically extracting data from a target page of at least one electronic document being accessible over a computer network, said target page comprising a plurality of elements each having a contents or structural definition wherein said structural definitions interrelate said plurality of elements, said method comprising:
recording a sequence of actions associated with an initial visit to said target page by a user operable to electronically navigate to said target page of said electronic document;
receiving user inputs defining a user selected subset of said plurality of elements to be extracted based on at least one of said contents and structural definitions;
generating a target pattern to identify said user selected subset of said plurality of elements to be extracted in subsequent visits to said target page;
automatically accessing said target page according to said recorded sequence for at least one subsequent visit to said target page;
automatically identifying and scraping a subset of select ones of said plurality of elements dependent upon said target pattern for each said at least one subsequent visit;
remapping said target page by re-identifying any modified structural definitions of said plurality of elements thereby to enable access to an altered target page; and
identifying and scraping said select ones of said plurality of elements dependent upon said target pattern and said re-identified structural definitions to thereby automatically identify and scrape said select ones of said plurality of elements from said altered target page dependent upon said target pattern;
wherein information of interest to said user is automatically extracted from said target page for each said at least one subsequent visit to said target page.
in a first mode, said target pattern is dependent upon said interrelation of said structural definitions for said select subset;
in a second mode, said target pattern is dependent upon contents of said select subset;
in a third mode, said target pattern is dependent upon said structural definitions for and contents of said select subset;
in a fourth mode, said pattern is dependent upon formatting of said select subset; and
in a fifth mode, said pattern is dependent upon said structural definitions for, contents of and formatting of said select subset.
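The five modes listed above differ only in which properties of an element the pattern is compared against. A minimal sketch, assuming each element is described by three string properties and that matching is plain equality (both are simplifications made for this example; `MODES`, `matches`, and `select` are hypothetical names):

```python
# Which element properties each mode compares (mode numbers follow the text above).
MODES = {
    1: ("structure",),
    2: ("contents",),
    3: ("structure", "contents"),
    4: ("formatting",),
    5: ("structure", "contents", "formatting"),
}


def matches(element, pattern, mode):
    """True when the element agrees with the pattern on every property the mode uses."""
    return all(element[key] == pattern[key] for key in MODES[mode])


def select(elements, pattern, mode):
    """Return the elements that match the pattern under the given mode."""
    return [e for e in elements if matches(e, pattern, mode)]


elements = [
    {"structure": "table/tr/td", "contents": "ACME", "formatting": "bold"},
    {"structure": "table/tr/td", "contents": "12.50", "formatting": "plain"},
    {"structure": "p", "contents": "ACME", "formatting": "plain"},
]
pattern = {"structure": "table/tr/td", "contents": "ACME", "formatting": "bold"}

by_structure = select(elements, pattern, 1)
by_contents = select(elements, pattern, 2)
by_all = select(elements, pattern, 5)
```

Looser modes select more elements and tolerate more page changes, while stricter modes select fewer elements with less risk of a false match.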
14. The method of claim 12 wherein said actions include user interaction with a plurality of electronic documents.
15. The method of claim 14 wherein each of said plurality of electronic documents comprises a web page or other web-accessible electronic document.
16. The method of claim 15 wherein said actions include activating HTTP links and electronically filling in and submitting forms.
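The two action types named in claim 16, activating links and submitting filled-in forms, can be illustrated by turning recorded actions into HTTP requests. This sketch only builds the request objects with the Python standard library and sends nothing; the action dictionaries, the `to_request` helper, and the example URL are assumptions made for illustration.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request


def to_request(base_url, action):
    """Build (but do not send) the HTTP request for one recorded action."""
    if action["type"] == "link":
        # Activating a link is a GET of the resolved href.
        return Request(urljoin(base_url, action["href"]))
    if action["type"] == "form":
        # Submitting a form is modelled here as a POST of the encoded fields.
        body = urlencode(action["fields"]).encode("ascii")
        return Request(urljoin(base_url, action["action"]), data=body, method="POST")
    raise ValueError(f"unknown action type: {action['type']!r}")


link_req = to_request("http://example.com/", {"type": "link", "href": "quotes"})
form_req = to_request(
    "http://example.com/",
    {"type": "form", "action": "search", "fields": {"symbol": "ACME"}},
)
```

Passing each request to `urllib.request.urlopen` in recorded order would replay the navigation against a live site.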
17. The method of claim 12 wherein said format comprises a format consisting of one of the group of search engine results, web pages, other web-accessible documents, e-mail, text feeds in any format, HTML, .txt, pdf, Word, Excel, ppt, ftp text feeds, databases and XML.
18. The method of claim 12, further comprising the step of:
applying XML tags to said scraped subset of select ones of said plurality of elements.
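Applying XML tags to scraped values, as in claim 18, might be sketched as below. The tag names and the `tag_scraped` helper are hypothetical, and the claim does not prescribe any particular schema.

```python
import xml.etree.ElementTree as ET


def tag_scraped(values, tags, root_tag="record"):
    """Wrap each scraped value in its own XML tag under a single root element."""
    root = ET.Element(root_tag)
    for tag, value in zip(tags, values):
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


tagged = tag_scraped(["ACME", "12.50"], ["symbol", "price"])
# tagged == '<record><symbol>ACME</symbol><price>12.50</price></record>'
```

Building the tree with `ElementTree` rather than string concatenation means characters such as `&` in scraped text are escaped automatically.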
19. A computerized system for automatically scraping select data from a web site, data associated with said web site including a plurality of elements each having contents or structural data associated therewith, and being stored on a server being accessible through the Internet or other computer network, said contents and structural data and elements defining a select web page or other web-accessible document of said web site, said system comprising:
a navigation module being operable on a microprocessor-based device electronically coupled to the Internet or other computer network, said navigation module being operable to:
record a sequence of actions of a user operable to electronically navigate to said select web page or other web-accessible document of said web site using the Internet or other computer network; and
automatically access said select web page or other web-accessible document according to said recorded sequence for at least one subsequent visit to said select web page or other web-accessible document of said web site; and
an extraction module being operable on said microprocessor-based device, said extraction module being operable to:
receive user inputs identifying information of interest and generating a pattern for a select subset of said plurality of elements on said select web page or other web-accessible document for extracting said information of interest to said user;
automatically identify and scrape select ones of said plurality of elements of said select web page or other web-accessible document dependent upon said pattern for each said at least one subsequent visit to said select web page or other web-accessible document;
remap said web page or other web-accessible document by re-identifying any modified structural definitions of said plurality of elements thereby to enable access to an altered web page or other web-accessible document; and
identify and scrape said select ones of said plurality of elements dependent upon said pattern and said re-identified structural definitions to thereby automatically identify and scrape said select ones of said plurality of elements from said altered web page or other web-accessible document dependent upon said pattern;
wherein information of interest to said user is automatically extracted in each said at least one subsequent visit to said target page.
in a first mode, said pattern is dependent upon said interrelation of said structural definitions for said select subset;
in a second mode, said pattern is dependent upon contents of said select subset;
in a third mode, said pattern is dependent upon said structural definitions for and contents of said select subset;
in a fourth mode, said pattern is dependent upon formatting of said select subset; and
in a fifth mode, said pattern is dependent upon said structural definitions for, contents of and formatting of said select subset.
23. The system of claim 19 wherein:
said navigation module is adapted to automatically alter said sequence of actions according to predetermined criteria and automatically access other web pages or other web-accessible documents according to said altered sequence; and
said extraction module is adapted to automatically alter said pattern according to predetermined criteria and automatically identify and scrape other select ones of said plurality of elements of said other web pages or other web-accessible documents dependent upon said altered pattern.
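Altering a recorded sequence according to predetermined criteria, as in claim 23, could be as simple as substituting one form field with each value from a list. The sketch below assumes actions are stored as dictionaries and that the criteria are a list of replacement values; both are illustrative choices, and `alter_sequence` is a hypothetical name.

```python
import copy


def alter_sequence(actions, field_name, values):
    """Yield one altered copy of the recorded actions per replacement value."""
    for value in values:
        altered = copy.deepcopy(actions)  # leave the original recording untouched
        for action in altered:
            if action["type"] == "form" and field_name in action["fields"]:
                action["fields"][field_name] = value
        yield altered


recorded = [
    {"type": "link", "href": "quotes"},
    {"type": "form", "action": "search", "fields": {"symbol": "ACME"}},
]
variants = list(alter_sequence(recorded, "symbol", ["INIT", "GLOBEX"]))
```

Each variant can then be replayed in turn, so a single recorded visit yields data from several related pages.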
24. A system for automatically extracting data from at least one electronic document accessible over a computer network, comprising:
a navigation module to record a navigation path to a target page selected by a user, said target page comprising a plurality of elements each having a contents or structural definition, wherein said structural definition interrelates said plurality of elements;
an extraction recording module receiving at least one user input for identifying information of interest on said target page to be extracted and generating a target pattern for extracting said information of interest;
a navigation playback module to automatically access said target page according to said navigation path for at least one subsequent visit to said target page;
an extraction playback module using said target pattern to extract said information of interest from said target page for each said at least one subsequent visit to said target page;
said extraction recording module remapping said target page by re-identifying any modified structural definitions of said plurality of elements thereby to enable access to an altered target page; and
said extraction playback module identifies and scrapes said select ones of said plurality of elements dependent upon said target pattern and said re-identified structural definitions to thereby automatically identify and scrape said select ones of said plurality of elements from said altered target page dependent upon said target pattern;
wherein information of interest to said user is automatically extracted from said target page in each said at least one subsequent visit to said target page.
said target page comprises a plurality of elements each having a contents or structural definition, wherein said structural definition interrelates said plurality of elements;
said extraction recording module remaps said target page by re-identifying any modified structural definitions of said plurality of elements thereby to enable access to an altered target page; and
said extraction playback module identifies and scrapes said select ones of said plurality of elements dependent upon said target pattern and said re-identified structural definitions to thereby automatically identify and scrape said select ones of said plurality of elements from said altered target page dependent upon said target pattern.
27. The system of claim 24 wherein:
said target page comprises a plurality of elements each having a contents or structural definition, wherein said structural definition interrelates said plurality of elements;
said extraction recording module automatically accesses said target page according to said navigation path and remaps said target page by re-identifying said structural definitions of said plurality of elements;
said extraction playback module automatically identifies and scrapes select ones of said plurality of elements dependent upon said target pattern and said re-identified structural definitions; and
said select ones of said plurality of elements are compared to predicted results to determine whether said mapping is functioning properly.
28. A method for automatically extracting data from an electronic document being accessible over a computer network, comprising:
in an initial visit by a user to a target page of said electronic document, recording a navigation path to said target page and receiving at least one user input defining information of interest to said user in said target page to be extracted, said target page comprising a plurality of elements each having a contents or structural definition, wherein said structural definition interrelates said plurality of elements;
generating a target pattern for extracting said information of interest from said target page;
automatically accessing said target page according to said navigation path to return to said target page for at least one subsequent visit;
for each subsequent visit to said target page, extracting information from said target page based on said target pattern, said extracting information including remapping said target page by re-identifying any modified structural definitions of said plurality of elements thereby to enable access to an altered target page and identifying and scraping said select ones of said plurality of elements dependent upon said target pattern and said re-identified structural definitions to thereby automatically identify and scrape said select ones of said plurality of elements from said altered target page dependent upon said target pattern;
wherein information of interest to said user is automatically extracted from said target page for each said at least one subsequent visit.
39. A computer implemented method for automatically extracting data from an electronic document being accessible over a computer network, comprising:
providing a graphical user interface for a user to identify a web page as a target page, said target page comprising a plurality of elements each having a contents or structural definition, wherein said structural definition interrelates said plurality of elements;
recording a navigation path to said target page;
providing a user interface for said user to identify a subset of a plurality of elements of said target page as being information of interest to said user to be extracted;
generating a target pattern for extracting said information of interest from said target page;
automatically accessing said target page according to said navigation path for at least one return visit to said target page;
for each return visit to said target page, extracting information from said target page based on said target pattern, said extracting information including remapping said target page by re-identifying any modified structural definitions of said plurality of elements thereby to enable access to an altered target page and identifying and scraping said select ones of said plurality of elements dependent upon said target pattern and said re-identified structural definitions to thereby automatically identify and scrape said select ones of said plurality of elements from said altered target page dependent upon said target pattern; and
transforming said extracted information into a standard format for further processing;
wherein information of interest to said user is automatically extracted from said target page and transformed into a format suitable for use by at least one other software application for each said at least one return visit to said target page.
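The transforming step of claim 39 is sketched here with JSON as the target. The claim names no specific format, so JSON is purely an assumption, as are the field names and the `to_standard` helper.

```python
import json


def to_standard(values, field_names, source):
    """Pair scraped values with field names and serialize them as a JSON string."""
    record = dict(zip(field_names, values))
    return json.dumps({"source": source, "record": record}, sort_keys=True)


payload = to_standard(["ACME", "12.50"], ["symbol", "price"], source="quotes")
decoded = json.loads(payload)
```

Any other application that reads JSON can consume `payload` without knowing how the page was navigated or scraped.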
Specification