System for searching, collecting and organizing data elements from electronic documents
Abstract
A system for automatically or manually collecting data from electronic documents that comprises a combination of functionalities which include in particular a one-click automation system to navigate through the electronic documents, a query system to locate data through other systems on the network (if present) which may have already performed similar searches, filtered views of the electronic documents or pages, an automatic structure recognition system and a multi-purpose collection basket, which is a user database accepting polymorphic data. The collected data is stored in the user's basket either by a manual drag and drop or automatically, as the user (or the program) navigates from document to document or page to page. If the collected data includes links to other documents, these associated documents can be automatically downloaded by the system and saved to storage devices.
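The "multi-purpose collection basket" described in the abstract, a store that accepts data of any structure and automatically queues embedded links for download, can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation; the `Basket` class, its method names, and the URL pattern are all assumptions introduced here.

```python
import re
from dataclasses import dataclass, field
from typing import Any

# Crude link detector; a real system would also parse href attributes, etc.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

@dataclass
class Basket:
    """Hypothetical polymorphic basket: items of any shape are accepted,
    and any links found in them are queued for automatic download."""
    items: list = field(default_factory=list)
    download_queue: list = field(default_factory=list)

    def collect(self, item: Any) -> None:
        self.items.append(item)  # no schema imposed on the item
        for url in URL_PATTERN.findall(str(item)):
            self.download_queue.append(url)

basket = Basket()
basket.collect({"name": "ACME", "site": "https://example.com/catalog"})
basket.collect("plain text snippet, no links")
```

Because `collect` accepts any object, structured records and free text can share the same basket, matching the abstract's "polymorphic data" language.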
4 Claims
1. A data collection system requiring no preliminary set-up and scripting tasks, characterized by the combination of:
- a one-click automation module to browse through the sources,
- one-click filters to view directly the type of data users are looking for within the pages,
- a non-volatile, multi-purpose repository to collect and prioritize the data they find while surfing, whatever its structure,
- an automatic system to check, on the user's own machine and amongst their peers, whether a similar query was performed recently, in order to reuse successful extraction processes, or the results themselves if they have not changed, and
- an easy way to structure and export their collections for other applications.
View Dependent Claims (2, 3)
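The reuse check recited in claim 1, consulting a record of recent queries before re-running an extraction, can be sketched as a freshness-bounded cache keyed by source and query. This is a hypothetical local-only sketch; the `QueryCache` name, the SHA-256 key, and the one-hour default are illustrative assumptions (peers on the network would expose the same lookup remotely).

```python
import hashlib
import time

class QueryCache:
    """Hypothetical store of recent query results, reused if fresh enough."""

    def __init__(self, max_age_seconds: float = 3600.0):
        self.max_age = max_age_seconds
        self._store = {}  # key -> (timestamp, result)

    @staticmethod
    def _key(source_url: str, query: str) -> str:
        # Hash source + query so equivalent requests collide on purpose.
        return hashlib.sha256(f"{source_url}\n{query}".encode()).hexdigest()

    def put(self, source_url: str, query: str, result) -> None:
        self._store[self._key(source_url, query)] = (time.time(), result)

    def get(self, source_url: str, query: str):
        entry = self._store.get(self._key(source_url, query))
        if entry and time.time() - entry[0] <= self.max_age:
            return entry[1]  # recent enough: reuse instead of re-extracting
        return None  # miss or stale: perform the extraction anew

cache = QueryCache()
cache.put("https://example.com/list", "phone numbers", ["555-0100"])
hit = cache.get("https://example.com/list", "phone numbers")
```

A hit returns the stored result directly; a miss (or a stale entry) signals that the extraction process should run again, after which `put` refreshes the entry.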
4. A structure recognition process characterized by five main steps:
- constitution of a work dictionary of marker candidates for different types of markers (label markers, labels, field delimiters, record delimiters, list markers, etc.), using all available tags (XML, HTML, etc.), punctuation or layout description strings, an original dictionary of pre-set marker candidates being augmented with strings recurring frequently in the document, as well as with characters or strings consistently located, in the current document, between easily recognizable patterns such as phone numbers or email addresses;
- combination of the markers of the dictionary to generate regular expression patterns; the number of occurrences of each pattern is added to arrays on which a series of statistical computations is then performed to extract possible numbers of records in the document, and reliability marks are given to the different solutions;
- selection, as the result of this analysis, of a series of regular expressions (or masks) as the best way to scrape the data in the document; this automatically generated set of scraping patterns is saved for future use (by the user or another peer on the network, which could have the same need for scraping this source) and associated with the URL of the current HTML page or document;
- extraction of data from the current page by applying the generated scraper, the data being presented in a table where the recognized records are displayed as rows, the fields as columns, and the labels, if present, are used as column headings; applying the scraper consists of parsing the document record by record and field by field (or item by item, in the case of a single-column list), using the delimiters and masks of the scraper; if several fields of the same record have the same label, they will be presented in two columns with the same heading (possibly suffixed with an incremented index); and
- post-processing of the whole table once all the data is placed in rows and columns, cell by cell, to clean the text of possible noise, de-duplicate redundant data, arrange the layout, optimize column sizes, etc.
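The pipeline of claim 4 can be sketched end to end in Python: candidate record delimiters are scored by how consistently they split the document into records (a crude stand-in for the claim's statistical reliability marks), the best-scoring one is selected as the scraper, and the scraper is applied to produce table rows. The delimiter list, the field pattern, and the scoring rule are simplified assumptions, not the patent's actual computations.

```python
import re

# Step 1 (simplified): a tiny work dictionary of record-delimiter candidates.
DELIMITER_CANDIDATES = ["<br>", "\n\n", "; "]
# Labelled fields of the form "label: value".
FIELD_PATTERN = re.compile(r"(\w[\w ]*):\s*([^;<\n]+)")

def score_delimiter(document: str, delim: str) -> int:
    """Step 2 (simplified): a plausible record delimiter splits the document
    into records that all contain the same number of labelled fields; the
    number of such consistent records serves as the reliability mark."""
    records = [r for r in document.split(delim) if r.strip()]
    field_counts = {len(FIELD_PATTERN.findall(r)) for r in records}
    if len(field_counts) == 1 and field_counts != {0}:
        return len(records)
    return 0

def build_scraper(document: str) -> str:
    """Step 3: select the best-scoring delimiter as the scraper
    (a real system would save it, keyed by the page URL, for reuse)."""
    return max(DELIMITER_CANDIDATES, key=lambda d: score_delimiter(document, d))

def extract_table(document: str, delim: str) -> list:
    """Step 4: parse record by record, field by field, into table rows;
    labels become the column names."""
    rows = []
    for record in document.split(delim):
        fields = dict(FIELD_PATTERN.findall(record))
        if fields:
            rows.append(fields)
    return rows

doc = "name: Ada; phone: 555-0100<br>name: Bob; phone: 555-0199"
scraper = build_scraper(doc)
table = extract_table(doc, scraper)
```

Step 5 (post-processing) is omitted here; in this sketch it would amount to trimming whitespace and de-duplicating identical rows before export.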
Specification