Method and system for extraction and organizing selected data from sources on a network
Abstract
Described is a system and method for employing user-created database-structured queries and data extraction engines to crawl through websites, extracting and organizing data from selected sources on a network, such as the Internet. The structure of a query processed by a data extraction engine enables a user to treat the network as a searchable database. The database-structured queries provide a user with tools to match patterns on selected sites on the network. A user may automate database-structured queries to be executed at a regular frequency. Output of the database-structured queries may be placed into a data log, displayed on a user display screen, or optionally reshaped for use by a plurality of data analysis tools. Additionally, an optional graphical user interface is provided.
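For illustration only, the idea of treating retrieved page text as a searchable source can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation: the `ExtractionQuery` class, the `run_query` and `to_tsv` helpers, the sample page text, and the `example.com` address are all hypothetical, and the page is supplied as a string rather than fetched from a network.

```python
import re
from dataclasses import dataclass


@dataclass
class ExtractionQuery:
    """Hypothetical stand-in for a database-structured query."""
    source: str   # web address the text is assumed to come from
    pattern: str  # regular expression; each capture group becomes a column


def run_query(query: ExtractionQuery, page_text: str) -> list[tuple[str, ...]]:
    """Return one row per regex match found in the unstructured page text."""
    rows = []
    for match in re.finditer(query.pattern, page_text):
        rows.append(match.groups())
    return rows


def to_tsv(rows: list[tuple[str, ...]]) -> str:
    """Reshape rows into tab-delimited text, one row per line."""
    return "\n".join("\t".join(row) for row in rows)


# Made-up page content; a real page would be retrieved over the network.
page = "<li>Widget - $4.50</li><li>Gadget - $12.00</li>"
query = ExtractionQuery(
    source="http://example.com/catalog",
    pattern=r"<li>(\w+) - \$([\d.]+)</li>",
)
rows = run_query(query, page)
tsv = to_tsv(rows)
```

Here `rows` holds one tuple per matched list item, and `tsv` is the same data reshaped into tab-delimited text, loosely mirroring the "match patterns, then reshape the output" flow the abstract describes.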
32 Claims
1. A method for extracting data from a network by a server, comprising:
(a) creating a database-structured query with at least one fundamental clause including a web domain address used for locating the data, based, in part, on a user input, wherein the database-structured query includes a regular expression used to determine the data to extract, and a conditional expression describing a number where to start searching and another number where to stop searching the data at the web domain address, wherein the number where to start searching, the another number where to stop searching and an amount to increment are defined by a user;
(b) creating a template of the regular expression used to extract the data;
(c) providing authentication data to the web domain address;
(d) determining the web domain address on the network from which to extract the data;
(e) extracting the data from the web domain address directly by retrieving a non-database structured arrangement of data from the determined web domain address and performing the database-structured query upon the retrieved non-database structured arrangement of data, wherein the extracting data from the web domain further comprises matching a plurality of patterns contained within the regular expression to retrieved data to determine the data to extract;
(f) repeating steps (d) and (e) in an iterative manner based on the at least one fundamental clause;
(g) reshaping the extracted data to a predetermined format; and
(h) providing the extracted data from the determined web domain address, wherein the extracted data is provided in a tab delimited data file, and wherein the tab delimited data file is provided directly to the user.
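As an illustration of the start, stop, and increment numbers recited in claim 1, the repeat-and-extract loop might look roughly like the following Python sketch. Everything named here is an assumption made for the example: the `page_addresses` and `extract_range` helpers, the `{n}` placeholder in the address template, the inclusive treatment of the stop number, and the in-memory `fake_site` dictionary that stands in for network retrieval. The claim itself does not specify any of these details.

```python
import re
from typing import Callable, Iterator


def page_addresses(template: str, start: int, stop: int, step: int) -> Iterator[str]:
    """Yield one address per counter value from start to stop (assumed inclusive)."""
    if step <= 0:
        raise ValueError("step must be positive")
    for n in range(start, stop + 1, step):
        yield template.format(n=n)


def extract_range(fetch: Callable[[str], str], template: str, pattern: str,
                  start: int, stop: int, step: int) -> list[tuple[str, ...]]:
    """Retrieve each address in turn and collect the regex capture groups as rows."""
    compiled = re.compile(pattern)
    rows: list[tuple[str, ...]] = []
    for address in page_addresses(template, start, stop, step):
        text = fetch(address)  # unstructured text retrieved for this address
        rows.extend(m.groups() for m in compiled.finditer(text))
    return rows


# Hypothetical pages keyed by address; this replaces real network access.
fake_site = {
    "http://example.com/list?page=1": "id=101 name=alpha",
    "http://example.com/list?page=3": "id=103 name=gamma",
    "http://example.com/list?page=5": "id=105 name=epsilon",
}

rows = extract_range(
    fake_site.__getitem__,
    "http://example.com/list?page={n}",
    r"id=(\d+) name=(\w+)",
    start=1, stop=5, step=2,
)
```

Passing the retrieval function in as `fetch` keeps the sketch runnable without a network; a real retrieval routine, including any authentication step, would be substituted there.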
10. A computer-readable storage device having computer-executable instructions for extracting data from a network, the computer-executable instructions enabling actions comprising:
(a) creating a database-structured query with at least one fundamental clause including a web domain address used for locating the data, based, in part, on a user input, wherein the database-structured query includes a regular expression used to determine the data to extract, and a conditional expression describing a number where to start searching and another number where to stop searching the data at the web domain address, wherein the number where to start searching, the another number where to stop searching and an amount to increment are defined by a user;
(b) creating a template of the regular expression used to extract the data;
(c) providing authentication data to the web domain address;
(d) determining the web domain address on the network from which to extract the data;
(e) extracting the data from the web domain address directly by retrieving a non-database structured arrangement of data from the determined web domain address and performing the database-structured query upon the retrieved non-database structured arrangement of data, wherein the extracting data from the web domain further comprises matching a plurality of patterns contained within the regular expression to retrieved data to determine the data to extract;
(f) repeating steps (d) and (e) in an iterative manner based on the at least one fundamental clause;
(g) reshaping the extracted data to a predetermined format; and
(h) providing the extracted data from the determined web domain address, wherein the extracted data is provided in a tab delimited data file, and wherein the tab delimited data file is provided directly to the user.
15. A system for extracting data from a network comprising:
a client computer system having a hardware computing device and a client network connection to the network and communicating with a server computer system, the client computer system creating a database-structured query with at least one fundamental clause, based, in part, on a user input, wherein the database-structured query includes a regular expression used to determine the data to extract and a conditional expression describing a number where to start searching and another number where to stop searching the data at a web domain address, wherein the number where to start searching, the another number where to stop searching and an amount to increment are defined by a user;
an editor for creating a template of the regular expression used to extract the data;
the server computer system having a server network connection to the network and communicating with the client computer system, the server computer system further performs actions, comprising:
receiving the database-structured query from the client computer system;
determining the web domain address on the network from which to extract at least a portion of the data relevant to the query, wherein the determined web domain address is provided by the database-structured query;
providing authentication data to the web domain address;
extracting directly at least the portion of the data from the web domain address by retrieving a non-database structured arrangement of data from the determined web domain address and performing the database-structured query upon the retrieved non-database structured arrangement of data, wherein the extracting the portion of data from the web domain further comprises matching a plurality of patterns contained within the regular expression to retrieved data to determine the data to extract;
repeating, based on the at least one fundamental clause, actions of the determining a web domain address and the extracting the portion of data from the web domain address in an iterative manner;
reshaping the extracted data to a predetermined format; and
providing the extracted data from the web domain address, wherein the extracted data is provided in a tab delimited data file, and wherein the tab delimited data file is provided directly to the user.
21. A method of extracting data from a network by a server, comprising:
(a) creating a database-structured query with at least one fundamental clause including a web domain address at the server based, in part, on a user input, wherein the database-structured query includes a regular expression used to determine the data to extract, creating a template of the regular expression used to extract the data, providing authentication data to the web domain address;
(b) determining a website to search based in part on the database-structured query and wherein the database-structured query further includes a conditional expression describing a number where to start searching and another number where to stop searching the data at the website, wherein the number where to start searching, the another number where to stop searching and an amount to increment are defined by a user;
(c) extracting the data at the website directly by retrieving a non-database structured arrangement of data from the web domain address and performing the database-structured query upon the retrieved non-database structured arrangement of data, wherein the extracting data from the web domain address further comprises matching a plurality of patterns contained within the regular expression to retrieved data to determine the data to extract, wherein the website is processed as a searchable database;
(d) repeating steps (b) and (c) in an iterative manner based on the at least one fundamental clause; and
(e) reshaping the extracted data to a predetermined format;
(f) providing the extracted data from the website, wherein the extracted data is provided in a data log, and wherein the data log is provided directly to the user.
28. A method of extracting data within at least one webpage, comprising:
(a) generating a database-structured query with at least one fundamental clause including a web domain address based, in part, on a user's input, wherein the database-structured query further includes a regular expression used to determine the data to extract and a conditional expression describing a number where to start searching and another number where to stop searching the data at the at least one webpage, wherein the number where to start searching, the another number where to stop searching and an amount to increment are defined by a user;
(b) creating a template of the regular expression used to extract the data and providing authentication data to the web domain address;
(c) determining the at least one webpage with the data, wherein the determination of the at least one webpage is provided by the database-structured query;
(d) extracting the data at the at least one webpage directly by retrieving a non-database structured arrangement of data from the web domain address and performing the database-structured query upon the retrieved non-database structured arrangement of data, wherein the extracting data from the web domain further comprises matching a plurality of patterns contained within the regular expression to retrieved data to determine the data to extract, wherein the extracted data that satisfies a query condition includes at least one binary file;
(e) repeating steps (c) and (d) in an iterative manner based on the at least one fundamental clause;
(f) reshaping the extracted data to a predetermined format; and
(g) providing the extracted data from the at least one webpage, wherein the extracted data is provided in a data log, and wherein the data log is provided directly to the user.
Specification