System for searching, collecting and organizing data elements from electronic documents
Abstract
A system for automatically or manually collecting data from electronic documents that comprises a combination of functionalities which include in particular a one-click automation system to navigate through the electronic documents, a query system to locate data through other systems on the network (if present) which may have already performed similar searches, filtered views of the electronic documents or pages, an automatic structure recognition system and a multi-purpose collection basket, which is a user database accepting polymorphic data. The collected data is stored in the user's basket either by a manual drag and drop or automatically, as the user (or the program) navigates from document to document or page to page. If the collected data includes links to other documents, these associated documents can be automatically downloaded by the system and saved to storage devices.
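The "multi-purpose collection basket" described in the abstract, a store that accepts data of any structure and automatically queues embedded links for download, can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation; the `Basket` class, its method names, and the URL pattern are all assumptions introduced here.

```python
import re
from dataclasses import dataclass, field
from typing import Any

# Crude link detector; a real system would also parse href attributes, etc.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

@dataclass
class Basket:
    """Hypothetical polymorphic basket: items of any shape are accepted,
    and any links found in them are queued for automatic download."""
    items: list = field(default_factory=list)
    download_queue: list = field(default_factory=list)

    def collect(self, item: Any) -> None:
        self.items.append(item)  # no schema imposed on the item
        for url in URL_PATTERN.findall(str(item)):
            self.download_queue.append(url)

basket = Basket()
basket.collect({"name": "ACME", "site": "https://example.com/catalog"})
basket.collect("plain text snippet, no links")
```

Because `collect` accepts any object, structured records and free text can share the same basket, matching the abstract's "polymorphic data" language.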
4 Claims
1. A data collection system requiring no preliminary set-up and scripting tasks, characterized by the combination of:
- a one-click automation module to browse through the sources,
- one-click filters to view directly the type of data users are looking for within the pages,
- a non-volatile, multi-purpose repository to collect and prioritize the data they find while surfing, whatever its structure,
- an automatic system to check, on the user's own machine and amongst their peers, whether a similar query was performed recently, in order to reuse successful extraction processes, or the results themselves if they have not changed, and
- an easy way to structure and export their collections for other applications.
View Dependent Claims (2, 3)
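The reuse check recited in claim 1, consulting a record of recent queries before re-running an extraction, can be sketched as a freshness-bounded cache keyed by source and query. This is a hypothetical local-only sketch; the `QueryCache` name, the SHA-256 key, and the one-hour default are illustrative assumptions (peers on the network would expose the same lookup remotely).

```python
import hashlib
import time

class QueryCache:
    """Hypothetical store of recent query results, reused if fresh enough."""

    def __init__(self, max_age_seconds: float = 3600.0):
        self.max_age = max_age_seconds
        self._store = {}  # key -> (timestamp, result)

    @staticmethod
    def _key(source_url: str, query: str) -> str:
        # Hash source + query so equivalent requests collide on purpose.
        return hashlib.sha256(f"{source_url}\n{query}".encode()).hexdigest()

    def put(self, source_url: str, query: str, result) -> None:
        self._store[self._key(source_url, query)] = (time.time(), result)

    def get(self, source_url: str, query: str):
        entry = self._store.get(self._key(source_url, query))
        if entry and time.time() - entry[0] <= self.max_age:
            return entry[1]  # recent enough: reuse instead of re-extracting
        return None  # miss or stale: perform the extraction anew

cache = QueryCache()
cache.put("https://example.com/list", "phone numbers", ["555-0100"])
hit = cache.get("https://example.com/list", "phone numbers")
```

A hit returns the stored result directly; a miss (or a stale entry) signals that the extraction process should run again, after which `put` refreshes the entry.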
4. A structure recognition process characterized by five main steps:
- constitution of a work dictionary of marker candidates for different types of markers (label markers, labels, field delimiters, record delimiters, list markers, etc.), using all available tags (XML, HTML, etc.), punctuation or layout description strings, an original dictionary of pre-set marker candidates being augmented with strings recurring frequently in the document, as well as with characters or strings consistently located, in the current document, between easily recognizable patterns such as phone numbers or email addresses;
- combination of the markers of the dictionary to generate regular expression patterns; the number of occurrences of each pattern is added to arrays on which a series of statistical computations is then performed to extract possible numbers of records in the document, and reliability marks are given to the different solutions;
- selection, as the result of this analysis, of a series of regular expressions (or masks) as the best way to scrape the data in the document; this automatically generated set of scraping patterns is saved for future use (by the user or another peer on the network, which could have the same need for scraping this source) and associated with the URL of the current HTML page or document;
- extraction of data from the current page by applying the generated scraper, the data being presented in a table where the recognized records are displayed as rows, the fields as columns, and the labels, if present, are used as column headings; applying the scraper consists of parsing the document record by record and field by field (or item by item, in the case of a single-column list), using the delimiters and masks of the scraper; if several fields of the same record have the same label, they will be presented in two columns with the same heading (possibly suffixed with an incremented index); and
- post-processing of the whole table once all the data is placed in rows and columns, cell by cell, to clean the text of possible noise, de-duplicate redundant data, arrange the layout, optimize column sizes, etc.
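The pipeline of claim 4 can be sketched end to end in Python: candidate record delimiters are scored by how consistently they split the document into records (a crude stand-in for the claim's statistical reliability marks), the best-scoring one is selected as the scraper, and the scraper is applied to produce table rows. The delimiter list, the field pattern, and the scoring rule are simplified assumptions, not the patent's actual computations.

```python
import re

# Step 1 (simplified): a tiny work dictionary of record-delimiter candidates.
DELIMITER_CANDIDATES = ["<br>", "\n\n", "; "]
# Labelled fields of the form "label: value".
FIELD_PATTERN = re.compile(r"(\w[\w ]*):\s*([^;<\n]+)")

def score_delimiter(document: str, delim: str) -> int:
    """Step 2 (simplified): a plausible record delimiter splits the document
    into records that all contain the same number of labelled fields; the
    number of such consistent records serves as the reliability mark."""
    records = [r for r in document.split(delim) if r.strip()]
    field_counts = {len(FIELD_PATTERN.findall(r)) for r in records}
    if len(field_counts) == 1 and field_counts != {0}:
        return len(records)
    return 0

def build_scraper(document: str) -> str:
    """Step 3: select the best-scoring delimiter as the scraper
    (a real system would save it, keyed by the page URL, for reuse)."""
    return max(DELIMITER_CANDIDATES, key=lambda d: score_delimiter(document, d))

def extract_table(document: str, delim: str) -> list:
    """Step 4: parse record by record, field by field, into table rows;
    labels become the column names."""
    rows = []
    for record in document.split(delim):
        fields = dict(FIELD_PATTERN.findall(record))
        if fields:
            rows.append(fields)
    return rows

doc = "name: Ada; phone: 555-0100<br>name: Bob; phone: 555-0199"
scraper = build_scraper(doc)
table = extract_table(doc, scraper)
```

Step 5 (post-processing) is omitted here; in this sketch it would amount to trimming whitespace and de-duplicating identical rows before export.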
Specification