State Deduplication for Automated and Semi-Automated Crawling Architecture
First Claim
1. A system for automated acquisition of content from an application, the system comprising:
a state storage module configured to store application state records, wherein each record of the application state records includes a representation of content of a corresponding application state of the application;
a link tracking module configured to control an executing instance of the application to navigate to a first application state of the application;
a duplicate content detector configured to (i) calculate a representation of content of the first application state, (ii) compare the calculated representation to the stored representations of content in the application state records, and (iii) generate a comparison signal based on the comparison, wherein:
the comparison signal indicates whether a match is found between the calculated representation and any of the stored representations of content in the application state records,
the link tracking module is configured to create a new application state record in the state storage module only in response to the comparison signal indicating that no match is found, and
the calculated representation is stored in the new application state record; and
a scraper module configured to, for each of the application state records in the state storage module, extract text and metadata from the corresponding application state, wherein information based on the extracted text and metadata is stored in a data store.
Abstract
A system for automated acquisition of content from an application includes state storage for storing state records. Each record includes a representation of content of a corresponding state of the application. A link tracking module controls the application to navigate to a first state of the application. A duplicate content detector calculates a representation of content of the first state and generates a comparison signal indicating whether the calculated representation matches any of the stored representations of content in the state records. The link tracking module creates a new state record in the state storage only in response to the comparison signal indicating that no match is found. The calculated representation is stored in the new state record. A scraper module, for each of the state records in the state storage, extracts text and metadata. Information based on the extracted text and metadata is stored in a data store.
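The deduplication flow summarized in the abstract can be illustrated with a minimal sketch. This is not the patented implementation; it is one possible reading under stated assumptions. The names `StateRecord`, `StateStore`, `calculate_representation`, `visit_state`, and `scrape` are hypothetical, the "representation of content" is assumed to be a SHA-256 hash of whitespace-normalized text, and the "comparison signal" is modeled as a simple boolean. Navigation of a live application is omitted; each state's content is passed in as a string.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StateRecord:
    # Hypothetical record: the calculated representation plus the raw content.
    representation: str
    content: str


@dataclass
class StateStore:
    # Hypothetical state storage, keyed by representation for fast lookup.
    records: dict = field(default_factory=dict)

    def has_match(self, representation: str) -> bool:
        # Plays the role of the comparison signal: True when a match is found.
        return representation in self.records

    def add(self, record: StateRecord) -> None:
        self.records[record.representation] = record


def calculate_representation(content: str) -> str:
    # Assumption: hash of whitespace-normalized text stands in for the
    # "representation of content"; the source does not specify the method.
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def visit_state(store: StateStore, content: str) -> bool:
    # Create a new record only when no stored representation matches.
    representation = calculate_representation(content)
    if store.has_match(representation):
        return False
    store.add(StateRecord(representation, content))
    return True


def scrape(store: StateStore) -> list:
    # For each stored record, extract text and (illustrative) metadata.
    return [
        {"text": record.content, "word_count": len(record.content.split())}
        for record in store.records.values()
    ]
```

In this sketch, two states whose content differs only in whitespace produce the same hash, so the second visit creates no record. A real detector might use a looser similarity measure; that choice is outside what the abstract states.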
28 Claims
1. A system for automated acquisition of content from an application, the system comprising:
a state storage module configured to store application state records, wherein each record of the application state records includes a representation of content of a corresponding application state of the application;
a link tracking module configured to control an executing instance of the application to navigate to a first application state of the application;
a duplicate content detector configured to (i) calculate a representation of content of the first application state, (ii) compare the calculated representation to the stored representations of content in the application state records, and (iii) generate a comparison signal based on the comparison, wherein:
the comparison signal indicates whether a match is found between the calculated representation and any of the stored representations of content in the application state records,
the link tracking module is configured to create a new application state record in the state storage module only in response to the comparison signal indicating that no match is found, and
the calculated representation is stored in the new application state record; and
a scraper module configured to, for each of the application state records in the state storage module, extract text and metadata from the corresponding application state, wherein information based on the extracted text and metadata is stored in a data store.
Dependent claims: 2-15
16. A method for automated acquisition of content from an application, the method comprising:
storing application state records, wherein each record of the application state records includes a representation of content of a corresponding application state of the application;
controlling an executing instance of the application to navigate to a first application state of the application;
calculating a representation of content of the first application state;
comparing the calculated representation to the stored representations of content in the application state records;
generating a comparison signal based on the comparison, wherein the comparison signal indicates whether a match is found between the calculated representation and any of the stored representations of content in the application state records;
only in response to the comparison signal indicating that no match is found, creating and storing a new application state record for the first application state, wherein the calculated representation is stored in the new application state record; and
for each of the application state records, extracting text and metadata from the corresponding application state, wherein information based on the extracted text and metadata is stored in a data store.
Dependent claims: 17-28
Specification