Document crawling systems and methods
First Claim
1. A non-transitory computer-readable medium encoded with a data management application comprising modules executable by a processor to crawl documents, the data management application comprising:
a scheduling module to retrieve a plurality of job modules from a data store, the plurality of job modules each comprising corresponding crawling instructions and corresponding priority data for crawling documents in a data storage system;
a priority queue to receive the plurality of job modules from the scheduling module and to store each job module in a sequence according to the corresponding priority data;
an execution module to assign each job module to one of a plurality of processing modules according to the sequence for processing, wherein each assigned job module is configured to:
identify a step for processing based on the corresponding crawling instructions, the step comprising crawling a group of the documents;
process the step to crawl the group of the documents in the data storage system;
determine if at least one additional step for processing is required based on the corresponding crawling instructions, the at least one additional step comprising crawling another group of the documents; and
reschedule the job module to the scheduling module for insertion into the priority queue.
Abstract
Systems and methods are provided for crawling and indexing documents stored in a data storage system. A crawler system processes multiple jobs that each correspond to crawling documents in the data storage system. Each job includes priority data and crawling instructions. The crawler system stores each job in a priority queue in a sequence based on the priority data. The crawler system assigns each job in the priority queue to a next available processing module for processing based on the stored sequence. Before processing each job, the crawler system determines whether to segment the job into smaller steps based on the corresponding crawling instructions. If the job is segmented, one of the smaller steps is processed to crawl a group of the documents in the data storage system. The remaining steps are stored in the priority queue to wait for processing.
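The queueing behavior described in the abstract can be illustrated with a minimal single-threaded sketch. It is not the patented implementation: the `Job` class, the `step_size` field standing in for the crawling instructions, the convention that a lower priority value is served first, and the logging that stands in for actual crawling are all assumptions made for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int        # assumption: lower value is crawled sooner
    documents: list      # documents this job must crawl
    step_size: int = 2   # hypothetical crawling instruction: documents per step
    cursor: int = 0      # index of the next uncrawled document


def run(jobs):
    """Process jobs in priority order, one step at a time; return the crawl log."""
    counter = itertools.count()  # tie-breaker so equal priorities stay first-in, first-out
    queue = []
    for job in jobs:
        heapq.heappush(queue, (job.priority, next(counter), job))

    log = []
    while queue:
        _, _, job = heapq.heappop(queue)
        # Identify the next step: one group of documents.
        group = job.documents[job.cursor:job.cursor + job.step_size]
        log.append((job.name, group))  # stand-in for crawling the group
        job.cursor += len(group)
        # If more steps remain, the job goes back into the priority queue.
        if job.cursor < len(job.documents):
            heapq.heappush(queue, (job.priority, next(counter), job))
    return log
```

With two jobs of equal priority, the steps interleave, because each job re-enters the queue behind the other after every step. With different priorities, the higher-priority job finishes all of its steps first.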
31 Claims
1. A non-transitory computer-readable medium encoded with a data management application comprising modules executable by a processor to crawl documents, the data management application comprising:
a scheduling module to retrieve a plurality of job modules from a data store, the plurality of job modules each comprising corresponding crawling instructions and corresponding priority data for crawling documents in a data storage system;
a priority queue to receive the plurality of job modules from the scheduling module and to store each job module in a sequence according to the corresponding priority data;
an execution module to assign each job module to one of a plurality of processing modules according to the sequence for processing, wherein each assigned job module is configured to:
identify a step for processing based on the corresponding crawling instructions, the step comprising crawling a group of the documents;
process the step to crawl the group of the documents in the data storage system;
determine if at least one additional step for processing is required based on the corresponding crawling instructions, the at least one additional step comprising crawling another group of the documents; and
reschedule the job module to the scheduling module for insertion into the priority queue.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A system for crawling documents, the system comprising:
a data store to store a plurality of job modules, each of the plurality of job modules comprising corresponding crawling instructions and corresponding priority data for crawling documents in a data storage system;
a processing device comprising a data management application comprising modules executable by the processing device to crawl the documents, the data management application comprising:
a scheduling module to retrieve the plurality of job modules from the data store; and
a priority queue to receive the plurality of job modules from the scheduling module and to store each job module in a sequence according to the corresponding priority data; and
an execution module to assign each job module to one of a plurality of processing modules according to the sequence for processing, wherein each assigned job module is configured to:
identify a step for processing based on the corresponding crawling instructions, the step comprising crawling a group of the documents;
enable the processing module to process the step to crawl the group of the documents in the data storage system;
determine if at least one additional step for processing is required based on the corresponding crawling instructions, the at least one additional step comprising crawling another group of the documents; and
reschedule the job module to the scheduling module for insertion into the priority queue.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21
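As a rough illustration of an execution module handing queued jobs to a plurality of processing modules, the sketch below uses worker threads that pull from a shared priority queue. Everything here is an assumption for illustration: the function name `crawl_with_workers`, the `(priority, name, groups)` job tuples, the thread-per-processing-module mapping, and the recording of a group in place of real crawling.

```python
import itertools
import queue
import threading


def crawl_with_workers(jobs, num_workers=3):
    """jobs: list of (priority, name, groups), where groups is a non-empty
    list of document groups. Returns {job name: groups crawled, in order}."""
    pq = queue.PriorityQueue()
    counter = itertools.count()  # tie-breaker for jobs with equal priority
    crawled = {name: [] for _, name, _ in jobs}
    for priority, name, groups in jobs:
        pq.put((priority, next(counter), name, list(groups)))

    def worker():
        # Each worker plays the role of one processing module.
        while True:
            priority, _, name, groups = pq.get()
            try:
                crawled[name].append(groups[0])  # stand-in for crawling one group
                if len(groups) > 1:
                    # More steps remain: reschedule the job before marking this
                    # step done, so join() cannot return early.
                    pq.put((priority, next(counter), name, groups[1:]))
            finally:
                pq.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    pq.join()  # returns once every queued step, including rescheduled ones, is done
    return crawled
```

Because a job sits in the queue at most once at any moment, its steps are processed in order even though different workers may handle successive steps.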
22. A method for crawling documents, the method comprising:
executing a data management application on a processing device to crawl documents in a data storage system, the data management application comprising a scheduling module, a priority queue, a plurality of processing modules, and an execution module;
retrieving a plurality of job modules from a data store at the scheduling module, the plurality of job modules each comprising corresponding crawling instructions and corresponding priority data for crawling documents;
transferring the plurality of job modules from the scheduling module to the priority queue;
storing each job module in a sequence in the priority queue according to the corresponding priority data;
assigning, at the execution module, each job module to one of a plurality of processing modules according to the sequence for processing;
identifying a step for processing at an assigned processing module based on the corresponding crawling instructions, the step comprising crawling a group of the documents;
processing the step at the assigned processing module to crawl the group of the documents in the data storage system;
determining if at least one additional step for processing is required based on the corresponding crawling instructions, the at least one additional step comprising crawling another group of the documents; and
rescheduling the assigned job module to the scheduling module for insertion into the priority queue.
Dependent claims: 23, 24, 25, 26, 27, 28, 29, 30
31. A non-transitory computer-readable medium encoded with a data management application comprising modules executable by a processor to crawl documents, the data management application comprising:
a scheduling module to retrieve a plurality of job modules from a data store, the plurality of job modules each comprising corresponding crawling instructions, corresponding status data, and corresponding priority data for crawling documents in a data storage system;
a priority queue to receive the plurality of job modules from the scheduling module and to store each job module in a sequence according to the corresponding priority data; and
an execution module to assign each job module to one of a plurality of processing modules according to the sequence for processing, wherein each assigned job module is configured to:
identify a step for processing based on the corresponding crawling instructions and the corresponding status data, the step comprising crawling a group of the documents, and the status data indicating whether the step has been processed;
process the step to crawl the group of the documents in the data storage system;
determine if the at least one additional step for processing is required based on the corresponding crawling instructions, the at least one additional step comprising crawling another group of the documents; and
reschedule the assigned job module to the scheduling module for insertion into the priority queue.
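Claim 31 adds status data that records whether a step has been processed. One way to picture this, purely as a sketch under stated assumptions, is a job object that keeps a per-step flag and picks the first unprocessed step. The `JobModule` class, its method names, and the representation of crawling instructions as a list of document groups are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobModule:
    steps: list                                 # hypothetical crawling instructions: one document group per step
    status: list = field(default_factory=list)  # status[i] is True once step i has been processed

    def __post_init__(self):
        if not self.status:
            self.status = [False] * len(self.steps)

    def identify_step(self) -> Optional[int]:
        """Return the index of the first unprocessed step, or None if all are done."""
        for i, done in enumerate(self.status):
            if not done:
                return i
        return None

    def process_step(self, i: int) -> list:
        """Stand-in for crawling: return the group and mark the step processed."""
        group = list(self.steps[i])
        self.status[i] = True
        return group

    def needs_reschedule(self) -> bool:
        """True while at least one additional step remains."""
        return self.identify_step() is not None
```

A caller would process the step returned by `identify_step()` and, while `needs_reschedule()` is true, hand the job back for reinsertion into the priority queue.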
Specification