HPSS for GPFS
HPSS:
The High Performance Storage System (HPSS) is IBM's highly scalable
Hierarchical Storage Management (HSM) system. HPSS is intended for
IBM's high-end HPC customers, with storage requirements in the tens of millions,
hundreds of millions, and even the billion-file range. HPSS is capable of
concurrently accessing hundreds of tapes for extremely high aggregate data transfer
rates, and can meet total storage bandwidth and capacity requirements that would
otherwise be unachievable. HPSS can stripe files across multiple tapes, which
ensures high-bandwidth data transfers of huge files. HPSS provides for
stewardship of, and access to, many petabytes of data stored on robotic tape libraries.
|
|
|
GPFS:
The IBM General Parallel File System (GPFS) is a true distributed,
clustered file system. Multiple servers are used to manage the data and
metadata of a single file system. Individual files are broken into
multiple blocks and striped across multiple disks and multiple servers, which
eliminates bottlenecks. Information Lifecycle Management (ILM) policy
scans are also distributed across multiple servers, which allows GPFS to quickly
scan the entire file system and identify files that match specific criteria.
Shortly after we showed the Billion File Demo at the
international conference on high performance computing (SC07), the Almaden Research Center
showed that a pre-GA version of GPFS is capable of scanning a single GPFS file system,
containing a billion files, in less than 15 minutes!
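As a rough check on what that figure implies, the arithmetic can be sketched as follows. The only inputs are the two numbers quoted above; since the scan took less than 15 minutes, the result is a lower bound on the sustained scan rate.

```python
# Back-of-the-envelope scan rate implied by the figures quoted above.
files_scanned = 1_000_000_000      # one billion files
scan_seconds = 15 * 60             # upper bound on scan time: 15 minutes

# Lower bound on the sustained rate, since the scan finished in under 15 minutes.
min_files_per_second = files_scanned / scan_seconds
print(f"more than {min_files_per_second:,.0f} files per second")
```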
|
GPFS/HPSS Interface (GHI):
HPSS can now automatically manage GPFS disk resources as an HSM. GPFS
customers can now store petabytes of data on a file system with terabytes of high
performance disk. HPSS can also be used to back up your GPFS file system,
and in the event of a catastrophic failure, HPSS can be used to restore your
cluster and file systems. The GPFS high performance ILM policy scans are used to:
- Identify new files, or files that changed, so the data can be migrated to tape;
- Identify older, unused files that no longer need to remain on disk;
- Identify files that users need to bulk-stage back to GPFS for future processing;
- Capture GPFS cluster information; and
- Capture GPFS file system structure and file attributes.
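The first two selections in the list above can be illustrated with a minimal sketch. This is plain Python, not GPFS policy rule syntax. The `FileRecord` type, its field names, and the 30-day idle threshold are all hypothetical, chosen only to show the kind of criteria a scan might apply.

```python
from dataclasses import dataclass
from typing import Optional

DAY = 86_400  # seconds per day


@dataclass
class FileRecord:                      # hypothetical record, not a GPFS structure
    path: str
    modified: float                    # last modification time (epoch seconds)
    accessed: float                    # last access time (epoch seconds)
    migrated: Optional[float] = None   # time of last copy to tape, if any


def needs_migration(f: FileRecord) -> bool:
    """True for a new file, or one changed since its last copy to tape."""
    return f.migrated is None or f.modified > f.migrated


def is_purge_candidate(f: FileRecord, now: float, idle_days: int = 30) -> bool:
    """True if the file is already on tape and has been idle for idle_days."""
    return (not needs_migration(f)) and (now - f.accessed) > idle_days * DAY


now = 100 * DAY
files = [
    FileRecord("/gpfs/new.dat", modified=99 * DAY, accessed=99 * DAY),
    FileRecord("/gpfs/old.dat", modified=10 * DAY, accessed=20 * DAY, migrated=11 * DAY),
    FileRecord("/gpfs/hot.dat", modified=10 * DAY, accessed=99 * DAY, migrated=11 * DAY),
]
to_migrate = [f.path for f in files if needs_migration(f)]
to_purge = [f.path for f in files if is_purge_candidate(f, now)]
```

Here only the never-migrated file is selected for migration, and only the file that is both on tape and long idle is selected for removal from disk.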
|
|
GPFS + HPSS:
The ILM policy scan results are sent to the Scheduler. The Scheduler distributes
the work to the I/O Managers (IOM), and the GPFS data is moved to HPSS in
parallel. For those files that are no longer active, holes are then punched into
GPFS files to free up GPFS disk resources. The continuous movement of GPFS files
to HPSS tape and the freeing of GPFS disk resources are automated and
transparent to the GPFS user. If a GPFS user accesses a file that is only
on HPSS tape, the file is automatically staged back to GPFS so the user can access it.
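The way a scheduler might spread scan results across I/O Managers can be sketched as below. The round-robin assignment is an assumption made for illustration; the text does not say how the Scheduler actually divides the work. The function name and the IOM names are hypothetical.

```python
from collections import defaultdict


def distribute(paths, iom_names):
    """Assign each path to an I/O Manager in round-robin order (an assumed policy)."""
    work = defaultdict(list)
    for i, path in enumerate(paths):
        work[iom_names[i % len(iom_names)]].append(path)
    return dict(work)


# Five files from a scan result, shared between two hypothetical IOMs.
plan = distribute([f"/gpfs/f{i}" for i in range(5)], ["iom0", "iom1"])
```

Each IOM can then move its own list to HPSS independently, which is what allows the transfers to run in parallel.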
The GPFS/HPSS Interface continuously moves GPFS file data to HPSS. When the time
comes to perform a backup, only those files that have not yet been moved to HPSS tape
are migrated. Therefore, it is NOT necessary to recapture all of the file data
at each backup.
|
Small File Aggregation:
Most GPFS file systems are made up of small files: about 90% of the files use 10%
of the disk resources. Traditionally, moving small files to tape diminishes
your tape drive performance. The GPFS/HPSS Interface moves small files from
GPFS to HPSS by grouping many small files into much larger aggregates. Small
file aggregation is completely configurable, but 10,000 GPFS files were placed into
each HPSS aggregate at the SC07 Billion File Demo. Large
aggregates allow data to stream to the tape drive, which yields higher tape transfer
rates.
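The grouping step can be sketched as simple batching. The function name and file paths below are hypothetical. The default of 10,000 files per aggregate is the setting quoted for the SC07 demo. The final line assumes every one of the billion files went into an aggregate, which the text does not state outright.

```python
def aggregate(paths, files_per_aggregate=10_000):
    """Yield successive batches of at most files_per_aggregate paths."""
    for start in range(0, len(paths), files_per_aggregate):
        yield paths[start:start + files_per_aggregate]


# 25,000 small files fall into two full aggregates and one partial one.
paths = [f"/gpfs/small/{i}" for i in range(25_000)]
batches = list(aggregate(paths))
sizes = [len(b) for b in batches]

# At the demo setting, a billion files would map to this many aggregates
# (ceiling division, assuming every file is aggregated).
aggregates_for_billion = -(-1_000_000_000 // 10_000)
```

Writing one large aggregate instead of thousands of tiny objects is what lets the tape drive stream rather than stop and start between files.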
|
|
That's why we say...
GPFS + HPSS = Extreme Storage Scalability!
|
For questions about HPSS for GPFS, contact
Jim Gerry