Object classification and indexing of very large name spaces using grid technology
First Claim
1. A method of storage or retrieval of computer data, the computer data being contained in files of a file system in electronic data storage, the file system further including a hierarchy of directories of the files, said method comprising:
- a) concurrently executing respective instances of a utility program on multiple host processors to traverse respective subdirectory trees of the hierarchy of directories of the files in order to collect, in at least one log, file-specific information of files in the file system; and
then b) when storing or recalling computer data of a file in the file system, accessing the file-specific information of the file from said at least one log to facilitate the storing or recalling of the computer data of the file in the file system;
wherein the file system is subdivided into disjoint subdirectory trees, and each instance of the utility program is executed by a respective one of the host processors to record, in a respective log of the respective one of the host processors, file-specific information of files in a respective one of the disjoint subdirectory trees; and
wherein the inode numbers of the files in the file system are subdivided into disjoint inode number ranges, each disjoint inode number range is assigned to a respective one of the host processors, and the file-specific information for the files having inode numbers within said each disjoint inode number range is transferred to a respective database of said respective one of the host processors so that said respective one of the host processors accesses the file-specific information in its respective database to facilitate storing or recalling of computer data of the files having inode numbers within said each disjoint inode number range.
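The inode-range assignment recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation: the function names (`make_ranges`, `owner_of`, `route_logs`), the equal-width contiguous ranges, and the use of in-memory dictionaries as stand-ins for the per-host databases are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def make_ranges(max_inode, num_hosts):
    """Return the lower bounds of num_hosts disjoint, contiguous inode ranges.

    Assumption: equal-width ranges covering 0..max_inode. The claim only
    requires that the ranges be disjoint.
    """
    size = (max_inode + num_hosts) // num_hosts  # ceil((max_inode + 1) / num_hosts)
    return [i * size for i in range(num_hosts)]


def owner_of(inode, lower_bounds):
    """Index of the host whose range contains the given inode number."""
    return bisect_right(lower_bounds, inode) - 1


def route_logs(logs, lower_bounds):
    """Transfer records from the per-host traversal logs to per-host databases.

    Each log is a list of dicts with at least an "inode" key. A record goes
    to the database of the host that owns its inode range, which need not be
    the host that wrote the log.
    """
    databases = defaultdict(dict)
    for log in logs:
        for record in log:
            host = owner_of(record["inode"], lower_bounds)
            databases[host][record["inode"]] = record
    return dict(databases)
```

Note that the two partitions are independent: logs are split by subdirectory tree during traversal, while databases are split by inode range during transfer, so every host may contribute records to every database.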
Abstract
For migration or de-duplication of a file system having a large number of files, a utility program traverses the file system to create a log of file-specific information about the file system. For identification of duplicates, the utility program produces a signature for each file. Respective instances of the utility program are started on multiple nodes upon which the file system is mounted. A fully qualified pathname is compiled during transfer of the log to a database. Multiple databases can be produced for the file system such that each database contains the file-specific information for a specified range of inode numbers. The database also maintains classification state for each file. For example, for a migration or replication process, the classification state identifies whether or not the file has been untouched, copied, linked, secondary-ized, source deleted, or modified.
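The traversal step described in the abstract can be sketched as follows. This is a hedged illustration only: worker threads stand in for the multiple nodes, SHA-256 is an assumed choice of signature, and the record fields and the names `signature`, `traverse`, and `traverse_all` are hypothetical rather than taken from the patent.

```python
import hashlib
import os
import stat
from concurrent.futures import ThreadPoolExecutor


def signature(path, chunk_size=1 << 20):
    """Content signature of one file (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def traverse(subtree):
    """Walk one subdirectory tree and return its log of file-specific records."""
    log = []
    for dirpath, _dirnames, filenames in os.walk(subtree):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)
            if not stat.S_ISREG(info.st_mode):
                continue  # skip symlinks, sockets, and other special files
            log.append({
                "inode": info.st_ino,
                "path": path,
                "size": info.st_size,
                "signature": signature(path),
                "state": "untouched",  # initial classification state
            })
    return log


def traverse_all(subtrees):
    """Run one traversal per disjoint subtree concurrently, one log per subtree.

    Threads here stand in for utility-program instances on separate nodes.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(subtrees))) as pool:
        return list(pool.map(traverse, subtrees))
```

For brevity this sketch stores the full path in each log record, whereas the abstract describes compiling the fully qualified pathname later, during transfer of the log to the database.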
103 Citations
15 Claims
1. A method of storage or retrieval of computer data, the computer data being contained in files of a file system in electronic data storage, the file system further including a hierarchy of directories of the files, said method comprising:
a) concurrently executing respective instances of a utility program on multiple host processors to traverse respective subdirectory trees of the hierarchy of directories of the files in order to collect, in at least one log, file-specific information of files in the file system; and
then b) when storing or recalling computer data of a file in the file system, accessing the file-specific information of the file from said at least one log to facilitate the storing or recalling of the computer data of the file in the file system;
wherein the file system is subdivided into disjoint subdirectory trees, and each instance of the utility program is executed by a respective one of the host processors to record, in a respective log of the respective one of the host processors, file-specific information of files in a respective one of the disjoint subdirectory trees; and
wherein the inode numbers of the files in the file system are subdivided into disjoint inode number ranges, each disjoint inode number range is assigned to a respective one of the host processors, and the file-specific information for the files having inode numbers within said each disjoint inode number range is transferred to a respective database of said respective one of the host processors so that said respective one of the host processors accesses the file-specific information in its respective database to facilitate storing or recalling of computer data of the files having inode numbers within said each disjoint inode number range.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A method of storage or retrieval of computer data, the computer data being contained in files of a file system in electronic data storage, the file system further including a hierarchy of directories of the files, said method comprising:
a) executing a utility program with a first data processor to traverse at least a subdirectory tree of the hierarchy of directories of the files in order to collect, in a log, file-specific information about files in the subdirectory tree; and
then b) transferring the file-specific information for a specified range of inode numbers from the log to a database; and
then c) accessing the database with a second data processor when storing or recalling computer data of a file having an inode number in the specified range of inode numbers;
which includes storing, in the database, signatures of the files of the file system having inode numbers within the specified range of inode numbers, and wherein the inode numbers are primary keys to records of the file-specific information in the database, and the signatures are secondary keys to the records of the file-specific information in the database; and
which includes the second data processor searching the database for a given signature in order to find at least two files having duplicate data content, in order to eliminate more than one copy of the duplicate data content from the electronic data storage.
Dependent claims: 11
12. A data processing system comprising:
electronic data storage containing a file system, the file system including files and a hierarchy of directories of the files; and
multiple data processors coupled to the electronic data storage for access to the file system;
wherein each of the data processors is programmed for executing a respective instance of a utility program to traverse an assigned subdirectory tree of the hierarchy of directories of the files in order to collect, in a respective log, file-specific information about files in the subdirectory tree;
wherein each of the data processors is programmed for transferring, from the logs to a respective database, the file-specific information for a specified range of inode numbers assigned to said each of the data processors; and
wherein said each of the data processors is programmed for accessing, from the respective database, the file-specific information for the specified range of inode numbers in order to facilitate storage or retrieval of computer data of files of the file system having inode numbers within the specified range of inode numbers assigned to said each of the data processors;
wherein the multiple data processors are programmed for storing, in the respective databases, signatures of the files of the file system having inode numbers within the specified ranges of inode numbers, wherein the multiple processors are programmed for searching the databases for a given signature in order to find at least two files having duplicate data content, in order to eliminate more than one copy of the duplicate data content from the electronic data storage.
Dependent claims: 13, 14, 15
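The signature search recited in claims 10 and 12 can be sketched with a small relational table in which the inode number is the primary key and the signature is a secondary, indexed key. This is an illustrative sketch under stated assumptions, not the claimed system: SQLite, the table and index names, and the helper functions `open_database`, `load`, `files_with_signature`, and `duplicate_groups` are all hypothetical choices.

```python
import sqlite3


def open_database(path=":memory:"):
    """Create one per-range database: inode is the primary key, signature a secondary key."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " inode INTEGER PRIMARY KEY,"
        " path TEXT NOT NULL,"
        " signature TEXT NOT NULL)"
    )
    # A secondary index makes lookups by signature efficient.
    db.execute("CREATE INDEX IF NOT EXISTS files_by_signature ON files (signature)")
    return db


def load(db, records):
    """Transfer log records (dicts with inode, path, signature) into the database."""
    db.executemany(
        "INSERT OR REPLACE INTO files (inode, path, signature)"
        " VALUES (:inode, :path, :signature)",
        records,
    )
    db.commit()


def files_with_signature(db, sig):
    """Search the database for a given signature."""
    rows = db.execute(
        "SELECT inode, path FROM files WHERE signature = ? ORDER BY inode", (sig,)
    )
    return rows.fetchall()


def duplicate_groups(db):
    """Map each signature shared by at least two files to those files."""
    sigs = db.execute(
        "SELECT signature FROM files GROUP BY signature"
        " HAVING COUNT(*) > 1 ORDER BY signature"
    ).fetchall()
    return {sig: files_with_signature(db, sig) for (sig,) in sigs}
```

A matching signature only marks files as candidate duplicates. A cautious de-duplication process would compare the file contents before eliminating any copy.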
Specification