State Deduplication for Automated and Semi-Automated Crawling Architecture

  • US 20160335286A1
  • Filed: 09/29/2015
  • Published: 11/17/2016
  • Est. Priority Date: 05/13/2015
  • Status: Abandoned Application
First Claim

1. A system for automated acquisition of content from an application, the system comprising:

    a state storage module configured to store application state records, wherein each record of the application state records includes a representation of content of a corresponding application state of the application;

    a link tracking module configured to control an executing instance of the application to navigate to a first application state of the application;

    a duplicate content detector configured to (i) calculate a representation of content of the first application state, (ii) compare the calculated representation to the stored representations of content in the application state records, and (iii) generate a comparison signal based on the comparison, wherein:

      the comparison signal indicates whether a match is found between the calculated representation and any of the stored representations of content in the application state records,

      the link tracking module is configured to create a new application state record in the state storage module only in response to the comparison signal indicating that no match is found, and

      the calculated representation is stored in the new application state record; and

    a scraper module configured to, for each of the application state records in the state storage module, extract text and metadata from the corresponding application state, wherein information based on the extracted text and metadata is stored in a data store.
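
The flow recited in the claim can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the claim does not say how the "representation of content" is calculated, so a SHA-256 hash of whitespace-normalized, lowercased text is assumed here, and the comparison signal is modeled as a plain boolean. All names (`StateRecord`, `StateStore`, `content_representation`, `visit_state`, `scrape`) are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StateRecord:
    """One application state record: an identifier plus its content representation."""
    state_id: str
    content_repr: str


@dataclass
class StateStore:
    """Stand-in for the state storage module (in-memory, for illustration only)."""
    records: list = field(default_factory=list)

    def has_match(self, content_repr: str) -> bool:
        # Comparison signal: True if any stored representation matches.
        return any(r.content_repr == content_repr for r in self.records)


def content_representation(text: str) -> str:
    """Assumed representation: SHA-256 of whitespace-normalized, lowercased text."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def visit_state(store: StateStore, state_id: str, content: str) -> bool:
    """Create a new record only when no stored representation matches.

    Returns True if a record was created, False if the state was a duplicate.
    """
    rep = content_representation(content)
    if store.has_match(rep):
        return False
    store.records.append(StateRecord(state_id, rep))
    return True


def scrape(store: StateStore, contents: dict) -> dict:
    """Stand-in for the scraper module: extract text and simple metadata per record.

    `contents` maps state_id to raw content; a real crawler would re-visit the state.
    """
    data_store = {}
    for record in store.records:
        text = contents[record.state_id]
        data_store[record.state_id] = {
            "text": text.strip(),
            "metadata": {"length": len(text), "content_repr": record.content_repr},
        }
    return data_store


if __name__ == "__main__":
    store = StateStore()
    pages = {"home": "Welcome  to the App", "home_alt": "welcome to the app", "menu": "Menu"}
    for state_id, content in pages.items():
        created = visit_state(store, state_id, content)
        print(state_id, "new record" if created else "duplicate, skipped")
    print(sorted(scrape(store, pages)))
```

In this sketch the second state is skipped because its normalized content hashes to the same value as the first, so only deduplicated states reach the scraper. An exact-hash comparison is only one possible choice; a similarity measure could be substituted in `has_match` without changing the surrounding flow.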

  • 4 Assignments