Systems and methods for probabilistic data classification
Abstract
A system for performing data classification operations. In one embodiment, the system comprises a file system configured to store a plurality of computer files and a scanning agent configured to traverse the file system and compile data regarding the attributes and content of the plurality of computer files. The system also comprises an index configured to store the data regarding attributes and content of the plurality of computer files and a file classifier configured to analyze the data regarding the attributes and content of the plurality of computer files and to classify the plurality of computer files into one or more categories based on the data regarding the attributes and content of the plurality of computer files. Results of the file classification operations can be used to set appropriate security permissions on files which include sensitive information or to control the way that a file is backed up or the schedule according to which it is archived.
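The flow described in the abstract (a scanning agent compiles attribute and content data into an index, and a classifier then works from that index) can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the `IndexEntry` record, the `scan` and `classify` functions, and the example rules are all hypothetical names chosen for the illustration, and the index is a plain in-memory dictionary rather than a networked store.

```python
import os
import re
import tempfile
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    """Hypothetical index record: file attributes plus extracted content terms."""
    path: str
    size: int
    extension: str
    terms: set = field(default_factory=set)


def scan(root):
    """Scanning agent (assumed design): walk the tree, compile one entry per file."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                text = fh.read()
            index[path] = IndexEntry(
                path=path,
                size=os.path.getsize(path),
                extension=os.path.splitext(name)[1].lower(),
                terms=set(re.findall(r"[a-z0-9]+", text.lower())),
            )
    return index


# Hypothetical user-defined rules: label -> predicate over an IndexEntry.
RULES = {
    "sensitive": lambda e: bool({"ssn", "salary", "confidential"} & e.terms),
    "spreadsheet": lambda e: e.extension in {".csv", ".xls", ".xlsx"},
}


def classify(index, rules=RULES):
    """Classifier: reads only the index entries, never the files themselves."""
    return {
        path: sorted(label for label, rule in rules.items() if rule(entry))
        for path, entry in index.items()
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "payroll.csv"), "w", encoding="utf-8") as fh:
            fh.write("name,salary\nA,1\n")
        for path, labels in classify(scan(root)).items():
            print(os.path.basename(path), labels)
```

The labels produced this way could then drive the downstream actions the abstract mentions, such as choosing security permissions or a backup schedule; that mapping is left out of the sketch.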
227 Citations
18 Claims
1. A computer system comprising:
a file system configured to store electronic files;
a plurality of file system scanning agents configured to access the electronic files, the file system scanning agents comprising computer hardware with one or more processors and configured to compile, based on the electronic files, index data usable for classifying the electronic files, and transmit the index data over a network to be stored in one or more indexes stored separately from the file system; and
a file classification server configured to, without directly accessing the content of the electronic files, access the index data previously compiled by the plurality of file system scanning agents from the one or more indexes stored separately from the file system, and classify the electronic files based on the index data,
wherein classifying the electronic files comprises assigning one or more labels to the electronic files based at least in part on a set of user-defined rules and the index data previously compiled by the plurality of file system scanning agents, and
wherein the file classification server is further configured to determine a probability that one or more of the electronic files should be classified as members of a category, determine that the probability is within a threshold amount from a probability threshold for classifying the one or more of the electronic files as the members of the category, and mark the one or more of the electronic files as being questionable members of the category.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
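The final limitation of claim 1 (determine a probability, check whether it lies within a threshold amount of the classification threshold, and mark the file as a questionable member) can be illustrated with a small sketch. The claim does not say how the probability is computed, so `keyword_probability` below is purely an assumed stand-in (the fraction of category keywords found in a file's indexed terms). The default `threshold` and `margin` values, and the choice to flag near-threshold results on both sides of the threshold, are likewise assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    member: bool        # probability met the classification threshold
    questionable: bool  # probability fell within the margin around the threshold


def keyword_probability(terms, keywords):
    """Assumed scoring: share of category keywords present in the indexed terms."""
    keywords = set(keywords)
    if not keywords:
        raise ValueError("keywords must not be empty")
    return len(set(terms) & keywords) / len(keywords)


def decide(probability, threshold=0.5, margin=0.1):
    """Classify against `threshold` and flag results within `margin` of it."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    member = probability >= threshold
    questionable = abs(probability - threshold) <= margin
    return Decision(member=member, questionable=questionable)


if __name__ == "__main__":
    indexed_terms = {"quarterly", "salary", "report", "bonus"}
    category_keywords = {"salary", "bonus", "ssn", "payroll"}
    p = keyword_probability(indexed_terms, category_keywords)
    print(p, decide(p))
```

Files flagged as questionable in this way could, for example, be queued for manual review instead of being labeled automatically; the claim itself only requires that they be marked.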
10. A method comprising:
with a plurality of file system scanning agents,
accessing electronic files stored in a file system;
compiling index data usable for classifying the electronic files; and
transmitting the index data over a network to be stored in one or more indexes stored separately from the file system; and
with a file classification server separate from the plurality of file system scanning agents,
classifying the electronic files without directly accessing the content of the electronic files by assigning one or more labels to the electronic files based at least in part on a set of user-defined rules and the index data previously compiled by the plurality of file system scanning agents;
determining a probability that one or more of the electronic files should be classified as members of a category;
determining that the probability is within a threshold amount from a probability threshold for classifying the one or more of the electronic files as the members of the category; and
marking the one or more of the electronic files as being questionable members of the category.
Dependent claims: 11, 12, 13, 14, 15, 16
17. A computer system comprising:
means for accessing electronic files stored in a file system;
means for compiling index data usable for classifying the electronic files;
means for transmitting the index data over a network to be stored in one or more indexes stored separately from the file system;
means for classifying the electronic files without directly accessing the content of the electronic files by assigning one or more labels to the electronic files based at least in part on a set of user-defined rules and the index data previously compiled and transmitted to the one or more indexes;
means for determining a probability that one or more of the electronic files should be classified as members of a category;
means for determining that the probability is within a threshold amount from a probability threshold for classifying the one or more of the electronic files as the members of the category; and
means for marking the one or more of the electronic files as being questionable members of the category.
Dependent claims: 18
Specification