Method and apparatus for harvesting file system metadata
First Claim
1. A method for harvesting file system metadata, comprising:
interacting with a file system abstraction layer/protocol adaptor to access managed files and directories across file systems that operate under various file system protocols at one or more physical locations;
collecting raw metadata of the managed files and directories;
filtering the raw metadata in real time;
placing the filtered raw metadata in one or more volume metadata caches;
synthesizing synthetic metadata from the filtered raw metadata;
generating content-based metadata, wherein generating content-based metadata for a managed file comprises processing content of the file according to a type of the file to determine one or more content-specific entities within the file;
transforming the filtered raw metadata, the synthetic metadata, and the content-based metadata into metadata records having a common representation, wherein each of the metadata records comprises a set of attributes associated with a file or directory residing on the file systems;
processing the metadata records; and
placing processed metadata records in volume clusters, wherein each of the volume clusters comprises one or more node data tables and one or more attribute tables, wherein each of the one or more node data tables represents denormalized dense attribute space common to the file systems and is timestamped by an epoch corresponding to a definition of freshness of data contained therein, and wherein each of the one or more attribute tables corresponds to a sparse attribute-volume-epoch combination.
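The sequence of steps recited in the claim can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the in-memory RAW_ENTRIES list stands in for whatever a protocol adaptor would return, the filter rule (drop temp files), the e-mail regex, and all function and field names are hypothetical, and "real time" filtering is approximated by filtering each entry as it streams past. The SHA-256 hash follows the abstract's example of a synthetic attribute.

```python
import hashlib
import re

# Hypothetical raw entries, shaped as a protocol adaptor might return them.
RAW_ENTRIES = [
    {"volume": "vol1", "path": "/docs/a.txt", "type": "text",
     "owner": "alice", "content": b"Contact bob@example.com"},
    {"volume": "vol1", "path": "/tmp/x.tmp", "type": "temp",
     "owner": "alice", "content": b""},
]

# Assumed content-specific entity: anything that looks like an e-mail address.
EMAIL = re.compile(rb"[\w.]+@[\w.]+")


def keep(raw):
    """Filter applied to each entry as it arrives (assumed rule: drop temp files)."""
    return raw["type"] != "temp"


def content_entities(raw):
    """Type-specific content processing; only plain text is handled here."""
    if raw["type"] == "text":
        return [m.decode("ascii") for m in EMAIL.findall(raw["content"])]
    return []


def harvest(entries):
    cache = {}    # volume -> filtered raw metadata (stand-in for a volume metadata cache)
    records = []  # metadata records sharing one common representation
    for raw in entries:
        if not keep(raw):
            continue
        # Raw metadata: everything except the file body, plus its size.
        meta = {k: v for k, v in raw.items() if k != "content"}
        meta["size"] = len(raw["content"])
        cache.setdefault(raw["volume"], []).append(meta)
        # Synthetic metadata: a hash of the file contents.
        synthetic = {"sha256": hashlib.sha256(raw["content"]).hexdigest()}
        # Content-based metadata: entities found by type-specific processing.
        content_based = {"entities": content_entities(raw)}
        # Transform all three kinds into a single flat record.
        records.append({**meta, **synthetic, **content_based})
    return cache, records


cache, records = harvest(RAW_ENTRIES)
```

Here the temp file is dropped by the filter, so only one record is produced, and that record carries raw, synthetic, and content-based attributes side by side.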
Abstract
A harvester is disclosed for harvesting metadata of managed objects (files and directories) across file systems which are generally not interoperable in an enterprise environment. Harvested metadata may include 1) file system attributes such as size, owner, recency; 2) content-specific attributes such as the presence or absence of various keywords (or combinations of keywords) within documents as well as concepts comprised of natural language entities; 3) synthetic attributes such as mathematical checksums or hashes of file contents; and 4) high-level semantic attributes that serve to classify and categorize files and documents. The classification itself can trigger an action in compliance with a policy rule. Harvested metadata are stored in a metadata repository to facilitate the automated or semi-automated application of policies.
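The four kinds of harvested metadata listed in the abstract can be made concrete with a short Python sketch. It is only an illustration: the function name harvest_attributes, the keyword list, and the rule that any keyword hit classifies a file as "sensitive" are assumptions introduced here, not details taken from the patent.

```python
import hashlib
import os
import tempfile


def harvest_attributes(path, keywords):
    """Return one illustrative attribute group per category in the abstract."""
    st = os.stat(path)
    with open(path, "rb") as fh:
        data = fh.read()
    text = data.decode("utf-8", errors="replace").lower()

    # 1) File system attributes: size, owner, recency.
    file_system = {"size": st.st_size, "owner_uid": st.st_uid, "mtime": st.st_mtime}
    # 2) Content-specific attributes: presence or absence of keywords.
    content = {kw: kw.lower() in text for kw in keywords}
    # 3) Synthetic attributes: a hash of the file contents.
    synthetic = {"sha256": hashlib.sha256(data).hexdigest()}
    # 4) High-level semantic attribute (hypothetical rule: any hit means "sensitive").
    semantic = {"class": "sensitive" if any(content.values()) else "general"}
    return {"file_system": file_system, "content": content,
            "synthetic": synthetic, "semantic": semantic}


# Usage on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "memo.txt")
    with open(sample, "wb") as fh:
        fh.write(b"Confidential salary report")
    attrs = harvest_attributes(sample, ["salary", "patent"])
```

A policy engine of the kind the abstract mentions could then act on attrs["semantic"]["class"], for example by triggering a retention rule for files classified as sensitive.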
20 Claims
9. A computer program product comprising one or more computer readable storage media storing instructions translatable by one or more processors to perform:
interacting with a file system abstraction layer/protocol adaptor to access managed files and directories across file systems that operate under various file system protocols at one or more physical locations;
collecting raw metadata of the managed files and directories;
filtering the raw metadata in real time;
placing the filtered raw metadata in one or more volume metadata caches;
synthesizing synthetic metadata from the filtered raw metadata;
generating content-based metadata, wherein generating content-based metadata for a managed file comprises processing content of the file according to a type of the file to determine one or more content-specific entities within the file;
transforming the filtered raw metadata, the synthetic metadata, and the content-based metadata into metadata records having a common representation, wherein each of the metadata records comprises a set of attributes associated with a file or directory residing on the file systems;
processing the metadata records; and
placing processed metadata records in volume clusters, wherein each of the volume clusters comprises one or more node data tables and one or more attribute tables, wherein each of the one or more node data tables represents denormalized dense attribute space common to the file systems and is timestamped by an epoch corresponding to a definition of freshness of data contained therein, and wherein each of the one or more attribute tables corresponds to a sparse attribute-volume-epoch combination.
17. A system for harvesting file system metadata, comprising:
an appliance coupled to file systems over a network, wherein the file systems operate under various file system protocols at one or more physical locations and wherein the appliance comprises volume metadata caches and volume clusters, wherein each of the volume clusters comprises one or more node data tables and one or more attribute tables, wherein each of the one or more node data tables represents denormalized dense attribute space common to the file systems and is timestamped by an epoch corresponding to a definition of freshness of data contained therein, and wherein each of the one or more attribute tables corresponds to a sparse attribute-volume-epoch combination;
one or more processors; and
one or more computer readable storage media storing instructions translatable by the one or more processors to perform:
interacting with a file system abstraction layer/protocol adaptor to access managed files and directories across the file systems;
collecting raw metadata of the managed files and directories;
filtering the raw metadata in real time;
placing the filtered raw metadata in one or more of the volume metadata caches;
synthesizing synthetic metadata from the filtered raw metadata;
generating content-based metadata, wherein generating content-based metadata for a managed file comprises processing content of the file according to a type of the file to determine one or more content-specific entities within the file;
transforming the filtered raw metadata, the synthetic metadata, and the content-based metadata into metadata records having a common representation, wherein each of the metadata records comprises a set of attributes associated with a file or directory residing on the file systems;
processing the metadata records; and
placing processed metadata records in one or more of the volume clusters.
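The volume cluster layout recited in the claims (dense node data tables stamped with an epoch, plus sparse attribute tables keyed by attribute, volume, and epoch) can be sketched as a small Python data structure. This is one possible reading, offered as an assumption: the class name VolumeCluster, the choice of path, size, and owner as the dense attributes, and the use of plain dictionaries as tables are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class VolumeCluster:
    # epoch -> {node_id: dense row}; one node data table per epoch.
    node_tables: dict = field(default_factory=lambda: defaultdict(dict))
    # (attribute, volume, epoch) -> {node_id: value}; one sparse table per combination.
    attribute_tables: dict = field(default_factory=lambda: defaultdict(dict))

    # Assumed dense attributes, i.e. ones every file system can supply.
    DENSE = ("path", "size", "owner")

    def place(self, record, volume, epoch):
        """Split a metadata record into its dense row and its sparse attributes."""
        node_id = (volume, record["path"])
        self.node_tables[epoch][node_id] = {k: record.get(k) for k in self.DENSE}
        for attr, value in record.items():
            if attr not in self.DENSE:
                self.attribute_tables[(attr, volume, epoch)][node_id] = value


cluster = VolumeCluster()
cluster.place({"path": "/a.txt", "size": 10, "owner": "alice"}, "vol1", epoch=1)
cluster.place({"path": "/b.doc", "size": 20, "owner": "bob", "acl": "rw-r--"},
              "vol1", epoch=1)
```

Only the second record carries an acl value, so the sparse table for ("acl", "vol1", 1) holds a single entry while the dense node data table for epoch 1 holds a row for both files. Keeping rarely present attributes out of the dense table is the design point this sketch is meant to show.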
Specification