Systems and methods for facilitating data discovery
Abstract
A system for facilitating data discovery on a network, wherein the network has one or more data storage devices. The system may include a crawler program configured to select at least a first set of files and a second set of files, each of the first set of files and the second set of files being stored in at least one of the one or more data storage devices. The system may also include a data fetcher program configured to obtain a copy of the first set of files, the data fetcher program being further configured to resist obtaining a copy of the second set of files. The system may also include circuit hardware implementing one or more functions of one or more of the crawler program and the data fetcher program.
22 Claims
1. A system for facilitating data discovery on a network, the network having one or more data storage devices, the system comprising:
a crawler program configured to scan files stored on the one or more data storage devices, and identify a first set of files and a second set of files as being relevant, the scanning and the identifying being performed at a crawler operating speed, the crawler program being further configured to delay scanning based on at least one of the following conditions:
(1) a file path length associated with one or more files in the first set of files exceeds a file length threshold; and
(2) one or more filer lengths associated with one or more files of the first set of files exceed a filer length threshold;
a data fetcher program configured to receive a location of the first set of files identified by the crawler program, the location being on the one or more data storage devices, and copy the first set of files from the received location at a data fetcher operating speed, the data fetcher program being further configured to delay copying the second set of files, thereby causing the crawler program to adjust the crawler operating speed of the scanning and the identifying according to the data fetcher operating speed based on at least one of the following conditions:
(1) a file size associated with a file in the second set of files is smaller than a file size threshold,
(2) a quantity of files in one of the first set and the second set of files exceeds a file quantity threshold,
(3) a file format associated with a file of one of the first set and the second set of files does not belong to a predetermined set of file formats, and
(4) an amount of text to index in the first set of files exceeds a text amount threshold; and
circuit hardware implementing one or more functions of one or more of the crawler program and the data fetcher program.
Dependent claims: 2, 3, 4.
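The delay conditions recited in claim 1 can be read as simple predicates over file metadata. The following is a minimal Python sketch under stated assumptions, not the patented implementation: the names `FileInfo`, `Thresholds`, `crawler_should_delay`, and `fetcher_should_delay`, and every threshold value, are hypothetical. Only the file path length condition is modeled for the crawler, because the claim text does not define "filer length".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical metadata record for one file found by the crawler."""
    path: str
    size: int            # size in bytes
    fmt: str             # file format, e.g. "pdf"
    text_bytes: int = 0  # amount of text that would need indexing


@dataclass(frozen=True)
class Thresholds:
    """Illustrative values only; the claim does not specify any numbers."""
    path_length: int = 260
    file_size: int = 1024
    file_quantity: int = 10_000
    text_amount: int = 50_000_000
    formats: frozenset = frozenset({"pdf", "docx", "txt"})


def crawler_should_delay(first_set, t):
    # Crawler condition (1): a file path length in the first set
    # exceeds the file length threshold.
    return any(len(f.path) > t.path_length for f in first_set)


def fetcher_should_delay(first_set, second_set, t):
    both = list(first_set) + list(second_set)
    return (
        # (1) a file in the second set is smaller than the size threshold
        any(f.size < t.file_size for f in second_set)
        # (2) the quantity of files in either set exceeds the threshold
        or len(first_set) > t.file_quantity
        or len(second_set) > t.file_quantity
        # (3) a file in either set has a format outside the allowed set
        or any(f.fmt not in t.formats for f in both)
        # (4) the amount of text to index in the first set is too large
        or sum(f.text_bytes for f in first_set) > t.text_amount
    )
```

Because the claim requires only "at least one" of the conditions, the sketch combines them with a logical OR.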
5. A system for facilitating data discovery on a network, the network having one or more data storage devices, the system comprising:
a crawler program configured to scan files stored on the one or more data storage devices, and identify a first set of files, a second set of files, a third set of files, and a fourth set of files as being relevant, the scanning and the identifying being performed at a crawler operating speed, the crawler program being further configured to delay scanning the fourth set of files based on at least one of the following conditions:
(1) a file path length associated with one or more files in the third set of files exceeds a file length threshold; and
(2) one or more filer lengths associated with one or more files of the third set of files exceed a filer length threshold;
a data fetcher program configured to receive a location of the first set of files identified by the crawler program, the location being on the one or more data storage devices, and copy the first set of files, a copy of the second set of files, and a copy of the third set of files from the received location at a data fetcher operating speed, the data fetcher program being further configured to delay copying the fourth set of files, thereby causing the crawler program to adjust crawler operating speed of the scanning and the identifying according to the data fetcher operating speed based on at least one of the following conditions:
(1) a file size associated with one or more files of the first set, the second set and the third set of files is smaller than a file size threshold,
(2) one or more quantities of files in one of the first set, second set and third set of files exceed a file quantity threshold,
(3) a file format associated with a file of one of the first set, second set and third set of files does not belong to a predetermined set of file formats, and
(4) an amount of text to index in one of the first set, second set and third set of files exceeds a text amount threshold;
a processing program configured to perform one or more services on the copy of the first set of files and the copy of the second set of files, the processing program being further configured to delay performing any services on the copy of the third set of files;
a search indexing program configured to generate at least a search index using the copy of the first set of files, the search indexing program being further configured to delay generating any search index from the copy of the second set of files; and
circuit hardware implementing one or more functions of one or more of the crawler program, the data fetcher program, the processing program, and the search indexing program.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13, 14.
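Claim 5 describes a funnel: each later program handles one fewer set of files promptly and delays the next one. The table below is one hedged way to picture that arrangement in Python. `PIPELINE`, `STAGE_ORDER`, and `furthest_stage` are hypothetical names, and treating the crawler as promptly handling sets 1 to 3 is an interpretation of the claim (it identifies all four sets but delays scanning the fourth).

```python
# Order in which the four programs of claim 5 act on a file set.
STAGE_ORDER = ("crawl", "fetch", "process", "index")

# For each stage: the sets it handles promptly and the set it delays.
PIPELINE = {
    "crawl":   {"handles": {1, 2, 3}, "delays": {4}},
    "fetch":   {"handles": {1, 2, 3}, "delays": {4}},
    "process": {"handles": {1, 2},    "delays": {3}},
    "index":   {"handles": {1},       "delays": {2}},
}


def furthest_stage(set_number):
    """Return the last stage that handles the given set without delay,
    or None if even the crawler delays it."""
    reached = None
    for stage in STAGE_ORDER:
        if set_number not in PIPELINE[stage]["handles"]:
            break
        reached = stage
    return reached
```

Under this reading, only the first set is carried all the way to search indexing without delay, while the fourth set is held back at the very first stage.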
15. A method for facilitating data discovery on a network, the network having one or more data storage devices, the method comprising:
scanning files stored on the one or more data storage devices, and identifying a first set of files and a second set of files as being relevant, the scanning and the identifying being performed by a crawler program at a crawler program operating speed, the crawler program being configured to delay scanning based on at least one of the following conditions:
(1) a file path length associated with one or more files in the first set of files exceeds a file length threshold; and
(2) one or more filer lengths associated with one or more files of the first set of files exceed a filer length threshold;
receiving, by a data fetcher program, a location of the first set of files identified by the crawler program, the location being on the one or more data storage devices; and
copying the first set of files from the received location at a data fetcher operating speed, the data fetcher program being configured to delay copying the second set of files, thereby causing the crawler program to adjust the crawler operating speed of the scanning and the identifying according to the data fetcher operating speed based on at least one of the following conditions:
(1) a file size associated with a file in the second set of files is smaller than a file size threshold,
(2) a quantity of files in one of the first set and the second set of files exceeds a file quantity threshold,
(3) a file format associated with a file of one of the first set and the second set of files does not belong to a predetermined set of file formats, and
(4) an amount of text to index in the first set of files exceeds a text amount threshold.
Dependent claims: 16, 17, 18, 22.
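The method of claim 15 has the crawler adjust its operating speed according to the data fetcher's speed. The claim does not say how this coupling is achieved; one common technique that produces the same effect is a bounded hand-off queue, sketched below as a tick-based simulation. The function `simulate` and its parameters `queue_capacity` and `fetch_every` are assumptions made for illustration, and both parameters must be at least 1.

```python
from collections import deque


def simulate(paths, queue_capacity=2, fetch_every=3):
    """Tick-based simulation of a crawler feeding a slower fetcher.

    Returns (copied, crawler_stalls): the files in the order they were
    copied, and the number of ticks on which the crawler had to wait.
    """
    pending = deque(paths)  # files the crawler has not scanned yet
    handoff = deque()       # locations waiting for the fetcher
    copied = []
    stalls = 0
    tick = 0
    while pending or handoff:
        tick += 1
        # Crawler: scan one file per tick unless the hand-off queue is
        # full, in which case it waits (its effective speed drops).
        if pending:
            if len(handoff) < queue_capacity:
                handoff.append(pending.popleft())
            else:
                stalls += 1
        # Fetcher: copy one file every `fetch_every` ticks.
        if handoff and tick % fetch_every == 0:
            copied.append(handoff.popleft())
    return copied, stalls
```

When the fetcher keeps up (`fetch_every=1`), the crawler never stalls; when the fetcher is slower, the crawler's effective rate falls to the fetcher's rate once the queue fills.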
19. A computer program product comprising computer-readable program code to be executed by one or more processors when retrieved from a non-transitory computer-readable medium, the program code including instructions to:
scan files stored on the one or more data storage devices, and identify, by the crawler program, a first set of files and a second set of files as being relevant, the scanning and the identifying being performed by a crawler program at a crawler program operating speed, the crawler program being configured to delay scanning based on at least one of the following conditions:
(1) a file path length associated with one or more files in the first set of files exceeds a file length threshold; and
(2) one or more filer lengths associated with one or more files of the first set of files exceed a filer length threshold; and
receive, by a data fetcher program, a location of the first set of files identified by the crawler program, the location being on the one or more data storage devices;
copying, by the data fetcher program, the first set of files from the received location at a data fetcher operating speed, the data fetcher program being configured to delay copying the second set of files, thereby causing the crawler program to adjust the crawler operating speed of the scanning and the identifying according to the data fetcher operating speed based on at least one of the following conditions:
(1) a file size associated with a file in the second set of files is smaller than a file size threshold,
(2) a quantity of files in one of the first set and the second set of files exceeds a file quantity threshold,
(3) a file format associated with a file of one of the first set and the second set of files does not belong to a predetermined set of file formats, and
(4) an amount of text to index in the first set of files exceeds a text amount threshold.
Dependent claims: 20, 21.
Specification