Automated data extraction and reformatting

US 6,732,102 B1
Filed: 11/16/2000
Issued: 05/04/2004
Est. Priority Date: 11/18/1999
Status: Active Grant

Abstract

A system and method for automated browsing and data extraction from Internet Web sites. One preferred method and system selects various data elements within the Web site during a design phase and extracts data from the Web site based on the matching of the selected data elements at the Web site during a playback phase. Another preferred method and system extracts XML data based on matching previously selected XML data elements during a design phase with XML data elements present during a playback phase, and reformats the extracted XML data into a relational format.

33 Claims

1. A computer-implemented method for automated data extraction from a Web site, comprising:
- (a) navigating to a Web site during a design phase;
  
  (b) extracting data elements associated with said Web site and producing a visible display corresponding to said extracted data elements;
  
  (c) selecting and storing at least one Page ID data element in said display from said data elements;
  
  (d) selecting and storing one or more Extraction data elements in said display;
  
  (e) selecting and storing at least one Base ID data element having an offset distance from said Extraction elements;
  
  (f) setting a tolerance for possible deviation from said offset distance; and
  
  (g) renavigating to said Web site during a playback phase and extracting data from said Extraction data elements if said Page ID data element is located in said Web site and if said offset distance of said Base ID data element has not changed by more than said adjustable tolerance.
  - 2. A method as claimed in claim 1, wherein user-specific information is entered into said Web site and used in connection with producing the data to be extracted from said Extraction data elements.
  - 3. A method as claimed in claim 1, wherein said data elements comprise HTML elements.
  - 4. A method as claimed in claim 1, wherein said visible display comprises a grid containing rows and columns including information about each said data elements extracted.
  - 5. A method as claimed in claim 4, wherein said information comprises, for each said data element, fixed information of grid row number, HTML tag number and visible text, and user-selected information of Page ID, Base ID, Extract and tolerance.
  - 6. A method as claimed in claim 1, wherein a position of said Page ID data element within said Web site is stored and said extracting occurs during said playback phase if said Page ID data element has not changed said position.
  - 7. A method as claimed in claim 1, wherein said Page ID data element is selected as a data element that is unlikely to change position upon reformatting of said Web site.
  - 8. A method as claimed in claim 1, wherein said display contains data desired to be extracted.
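
The design-phase and playback-phase steps of claim 1 can be illustrated with a minimal sketch. It is one possible reading of the claim language, not the patented implementation: the page is modeled as a flat list of visible-text rows, the tolerance is treated as the number of rows the Base ID element may drift from its design-time row, and the names `Recording` and `playback` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Recording:
    """Selections stored during the design phase (hypothetical structure)."""
    page_id_text: str   # text of the Page ID element
    base_id_text: str   # text of the Base ID element
    base_row: int       # row where the Base ID element sat at design time
    offset: int         # rows from the Base ID element to the Extraction element
    tolerance: int      # how far the Base ID element may drift, in rows


def playback(rows: List[str], rec: Recording) -> Optional[str]:
    """Return the extracted text, or None if the page no longer matches."""
    # The page is accepted only if the Page ID element is still present.
    if rec.page_id_text not in rows:
        return None
    # Look for the Base ID element within the tolerance window around its
    # design-time row (assumption: the first match in the window is used).
    lo = max(0, rec.base_row - rec.tolerance)
    hi = min(len(rows) - 1, rec.base_row + rec.tolerance)
    for row in range(lo, hi + 1):
        if rows[row] == rec.base_id_text:
            target = row + rec.offset
            return rows[target] if 0 <= target < len(rows) else None
    return None


rec = Recording("Acme Quotes", "Price", base_row=2, offset=1, tolerance=2)
# The site inserted one extra row, so "Price" moved from row 2 to row 3.
current = ["Acme Quotes", "Ad banner", "Symbol", "Price", "13.75"]
print(playback(current, rec))
```

With a tolerance of 2 rows, a one-row shift still allows extraction; a shift beyond the window, or a missing Page ID element, yields no result.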

9. A computer-implemented method for automated data extraction from a Web site, comprising:
- (a) navigating to a Web site during a design phase;
  
  (b) extracting data elements associated with said Web site and producing a visible current display grid corresponding to said extracted data elements;
  
  (c) selecting and storing at least one Page ID data element in said current display from said data elements;
  
  (d) selecting and storing one or more Extraction data elements in said current display;
  
  (e) selecting and storing at least one Base ID data element in said current display having an offset distance from said Extraction elements;
  
  (f) entering a tolerance in said current display for possible deviation from said offset distance;
  
  (g) displaying a playback display grid during a playback phase with said selected Page ID data element, said Extraction data elements, and said Base ID data element;
  
  (h) renavigating to said Web site;
  
  (i) extracting data elements associated with said Web site to said visible current display grid;
  
  (j) comparing said extracted data elements in said current display grid with said playback display grid and extracting data from said Extraction data elements if said Page ID data element is found in said current display grid and if said offset distance of said Base ID data element has not changed by more than said tolerance; and
  
  (k) adjusting said tolerance based on said offset distance of said Extraction elements found during renavigation.
  - 10. A method as claimed in claim 9, wherein said tolerance comprises a forward and backward tolerance.
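
A forward and backward tolerance, together with the adjustment in step (k) of claim 9, might look like the following sketch. The claims do not state how the tolerance is adjusted, so the widening policy here (cover the observed drift plus a margin) is purely an assumption, as are the function names `find_base` and `adjust_tolerance`.

```python
from typing import List, Optional, Tuple


def find_base(rows: List[str], base_text: str, base_row: int,
              backward: int, forward: int) -> Optional[int]:
    """Find the Base ID row within [base_row - backward, base_row + forward]."""
    lo = max(0, base_row - backward)
    hi = min(len(rows) - 1, base_row + forward)
    candidates = [i for i in range(lo, hi + 1) if rows[i] == base_text]
    if not candidates:
        return None
    # Prefer the candidate closest to the design-time row.
    return min(candidates, key=lambda i: abs(i - base_row))


def adjust_tolerance(base_row: int, found_row: int, backward: int,
                     forward: int, margin: int = 1) -> Tuple[int, int]:
    """Widen the tolerance on the side of the drift (assumed policy)."""
    drift = found_row - base_row
    if drift > 0:
        forward = max(forward, drift + margin)
    elif drift < 0:
        backward = max(backward, -drift + margin)
    return backward, forward


rows = ["x", "Price", "y", "z", "Price"]
found = find_base(rows, "Price", base_row=3, backward=1, forward=1)
print(found, adjust_tolerance(3, found, backward=1, forward=1))
```

Keeping the two directions separate lets a design tolerate, for example, content being inserted above the data without also accepting matches far above the original position.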

11. A computer-implemented method for automated browsing of Web sites on a global communications network and for extracting usable data, comprising:
- (a) accessing at least one Web site page containing data, wherein said data comprises a plurality of data formats;
  
  (b) transforming said data in a plurality of formats into a computer-readable list;
  
  (c) identifying a base data element from said list;
  
  (d) identifying an offset from said base data element to the usable data; and
  
  (e) extracting the usable data for use by a user regardless of changes to the Web site, provided that said offset between said base data element and the usable data does not change.
  - 12. The method of claim 11, wherein identifying said offset comprises identifying said offset during a design phase and saving said offset for use in a run time phase including said extracting of said usable data.

13. A computer-implemented method for automated browsing Web sites and for extracting usable data, comprising:
- (a) filling a current display grid with rows of HTML data elements from at least one Web site page currently selected by a Web browser;
  
  (b) displaying in a playback display grid previously-stored HTML data elements;
  
  (c) examining said rows of said playback grid to locate an HTML data element previously selected as a Page ID data element;
  
  (d) comparing said rows of said current grid to locate an HTML element that matches said Page ID data element;
  
  (e) examining said rows of said playback grid to locate HTML data elements previously selected as Extraction data elements and a Base ID data element used as a reference for locating said Extraction data elements;
  
  (f) comparing said rows of said current grid to locate HTML elements that match said Extraction data elements and match said Base ID data element;
  
  (g) extracting data from said Extraction data elements regardless of changes to said Web site, provided that said Page ID elements match and any offset between said Base ID elements is within a predetermined tolerance; and
  
  (h) resetting said tolerance based on said offset of said Base ID elements.

14. A computer-based system for automatically browsing Web sites, comprising a client computer and a server computer for receiving requests from said client computers over a network connecting said client and server computers, said client computer running an application to:
- (a) navigate to a Web site during a design phase;
  
  (b) extract data elements associated with said Web site and produce a visible display corresponding to said extracted data elements;
  
  (c) select and store at least one Page ID data element in said display from said data elements;
  
  (d) select and store one or more Extraction data elements in said display;
  
  (e) select and store at least one Base ID data element having an offset distance from said Extraction elements;
  
  (f) set an adjustable tolerance for possible deviation from said offset distance;
  
  (g) renavigate to said Web site during a playback phase and extract data from said Extraction data elements if said Page ID data element is located in said Web site and if said offset distance of said Base ID data element has not changed by more than said tolerance; and
  
  (h) reset said tolerance based on changes to said Web site found during renavigation.

15. A computer-implemented method for automated data extraction, comprising:
- (a) identifying selections of data elements in one of a plurality of data formats for extraction from a source of data comprising data stored in one of said plurality of formats;
  
  (b) storing information related to said identified selections of data elements in XML format for subsequent use;
  
  (c) acquiring said source of data and retrieving said data elements;
  
  (d) comparing said retrieved XML data elements to said identified selections and extracting only the data from said data elements that correspond to said identified selections; and
  
  (e) reformatting said extracted XML data into a relational format.
  - 16. A method as claimed in claim 15, wherein said source of said data is a Web site.
  - 17. A method as claimed in claim 15, wherein said source of said data is a file.
  - 18. A method as claimed in claim 15, including saving said extracted data into a relational data table.
  - 19. A method as claimed in claim 15, wherein said reformatted extracted data is passed to a calling application.
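
The XML path of claims 15 and 18 (compare retrieved elements to stored selections, extract only the matches, reformat into relational rows, save to a table) can be sketched with the Python standard library. The sample document, the `record_tag` parameter, and the helper names are assumptions made for illustration; the claims do not prescribe a storage engine, and SQLite is used here only because it is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET
from typing import List, Sequence, Tuple

SAMPLE = """<catalog>
  <book><title>A</title><price>10</price><isbn>1</isbn></book>
  <book><title>B</title><price>12</price><isbn>2</isbn></book>
</catalog>"""


def extract_rows(xml_text: str, record_tag: str,
                 selected: Sequence[str]) -> List[Tuple]:
    """Keep only the selected child elements of each record, one row per record."""
    root = ET.fromstring(xml_text)
    return [tuple(record.findtext(name) for name in selected)
            for record in root.iter(record_tag)]


def save_rows(conn: sqlite3.Connection, table: str,
              columns: Sequence[str], rows: List[Tuple]) -> None:
    """Store the rows in a relational table with one column per selection."""
    cols = ", ".join(f'"{c}"' for c in columns)
    marks = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)


selected = ["title", "price"]          # stored design-phase selections
rows = extract_rows(SAMPLE, "book", selected)
conn = sqlite3.connect(":memory:")
save_rows(conn, "books", selected, rows)
print(conn.execute("SELECT title, price FROM books").fetchall())
```

The unselected `isbn` element is ignored, which corresponds to extracting only the data that matches the identified selections.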

20. A computer-implemented method for automated XML data extraction, comprising:
- (a) navigating to a Web site including a plurality of web pages containing XML data;
  
  (b) identifying selections of XML data elements for extraction from said Web site from said plurality of pages, said XML data comprising data elements containing said data stored in XML format;
  
  (c) storing information related to said identified selections of XML data elements for subsequent use;
  
  (d) re-navigating to said Web site and retrieving said XML data elements from said plurality of web pages;
  
  (e) comparing said retrieved XML data elements to said identified selections and extracting only the data from said XML data elements that correspond to said identified selections; and
  
  (f) reformatting said extracted XML data into a relational format.
  - 21. A method as claimed in claim 20, including saving said extracted data into a relational data table.

22. A computer-implemented method for automated XML data extraction, comprising:
- (a) navigating a client computer to a Web site including a plurality of web pages, said Web site containing XML data;
  
  (b) generating a graphical tree structure on said client computer to display XML nodes and subnodes representing said XML data at said plurality of web pages on said Web site;
  
  (c) selecting one or more of said nodes and/or subnodes from said tree structure associated with the data to be extracted;
  
  (d) storing information related to said selected nodes and/or subnodes;
  
  (e) renavigating said client computer to said Web site and retrieving said XML data using said information;
  
  (f) comparing said retrieved XML data with said selected nodes and/or subnodes and extracting only the data corresponding to said selected nodes and/or subnodes; and
  
  (g) reformatting said extracted XML data into a relational format.
  - 23. A method as claimed in claim 22, wherein selecting one subnode under a parent node automatically selects all subnodes under said parent node.
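
The selection behavior of claim 23 can be illustrated without a graphical tree. The sketch below assumes that "all subnodes under said parent node" means the parent's direct children; the claim could also be read to include deeper descendants. The function name `select_with_siblings` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def select_with_siblings(root: ET.Element, selected: ET.Element) -> List[ET.Element]:
    """Selecting one subnode selects every subnode under the same parent."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(selected)
    if parent is None:
        # The root has no parent; only the node itself is selected.
        return [selected]
    return list(parent)


root = ET.fromstring(
    "<order><item><sku>1</sku><qty>2</qty></item><note>x</note></order>"
)
sku = root.find("item/sku")
print([e.tag for e in select_with_siblings(root, sku)])
```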

24. A computer readable medium storing a set of instructions for controlling a computer to automatically extract desired XML data from a source of data in a plurality of formats, said medium comprising a set of instructions for causing said computer to:
- (a) identify selections of data elements for extraction from a source of data comprising data stored in a plurality of formats;
  
  (b) store information related to said identified selections of data elements for subsequent use;
  
  (c) acquire said source of data and retrieve said data elements in XML format;
  
  (d) compare said retrieved XML data elements to said identified selections and extract only the data from said data elements that correspond to said identified selections; and
  
  (e) reformat said extracted XML data into a relational format.

25. A computer-based system for automated XML data extraction, comprising a client computer and server computer for receiving requests from said client computer over a network connecting said client and server computers, said client computer running an application to:
- (a) identify selections of XML data elements for extraction from a plurality of sources of XML data contained at said server computer;
  
  (b) store information related to said identified selections of XML data elements for subsequent use;
  
  (c) acquire said plurality of sources of XML data and retrieve said XML data elements from said plurality of sources;
  
  (d) compare said retrieved XML data elements to said identified selections and extract only the data from said XML data elements that correspond to said identified selections; and
  
  (e) reformat said extracted XML data into a relational format.

26. A computer-implemented method for automated data extraction from a Web site, comprising:
- (a) navigating to a Web site during a design phase;
  
  (b) extracting data elements associated with said Web site and producing a visible display corresponding to said extracted data elements;
  
  (c) selecting and storing at least one Page ID data element in said display from said data elements;
  
  (d) selecting and storing one or more Extraction data elements in said display;
  
  (e) selecting and storing at least one Base ID data element having an offset distance from said Extraction elements;
  
  (f) setting an adjustable tolerance for possible deviation from said offset distance; and
  
  (g) renavigating to said Web site during a playback phase and extracting data from said Extraction data elements if said Page ID data element is located in said Web site and adjusting said tolerance based on said offset distance of said Base ID data element.
  - 27. A method as claimed in claim 26, wherein user-specific information is entered into said Web site based on said adjustable tolerance and said offset.
  - 28. A method as claimed in claim 26, wherein said adjustable tolerance is reset based on renavigation of said Web site during said playback phase.
  - 29. A method as claimed in claim 26, wherein user-specific information is entered into said Web site and used in connection with producing the data to be extracted from said Extraction data elements.
  - 30. A method as claimed in claim 26, wherein said visible display comprises a grid containing rows and columns including information about each said data elements extracted.
  - 31. A method as claimed in claim 29, wherein said information comprises, for each said data element, fixed information of grid row number, HTML tag number and visible text, and user-selected information of Page ID, Base ID, Extract and tolerance.
  - 32. A method as claimed in claim 26, wherein a position of said Page ID data element within said Web site is stored and said extracting occurs during said playback phase if said Page ID data element has not changed said position.
  - 33. A method as claimed in claim 26, wherein said data elements are extracted from a Web page embedding at least one of the following formats: XML, PDF, Word, and Excel.

Current Assignee: Instaknow.com Incorporated
Original Assignee: Instaknow.com Incorporated
Inventors: Khandekar, Pramod
Primary Examiner(s): Breene, John
Assistant Examiner(s): Ali, Mohammad
Application Number: US09/714,644
Time in Patent Office: 1,265 Days
Field of Search: 707/1, 707/3-6, 707/100-154, 707/9-10, 715/513, 715/501.1, 345/760, 345/764, 345/854
US Class Current: 1/1
CPC Class Codes:
- G06F 16/30 of unstructured textual dat...
- G06F 16/86 Mapping to a database
