Does HPSS have the ability to designate backup copies of data for storage in an external vault or off-site facility?
HPSS provides the capability to designate a set of tape volumes to be removed from the robotic tape library and stored in another facility.
Tape volumes containing backup copies of bitfiles can be selected for removal to a vault or off-site facility. The tape volumes must be in the End of Media (EOM) state in order to be a candidate for removal. HPSS provides a utility that allows the system administrator to specify the volumes to be removed. The utility then causes the tape library to eject the tapes.
Can HPSS support a mix of 3590 tapes (regular and extended length) and tape drives (3590B and 3590E)? If you do not currently support this mix of devices and media, what are HPSS's future plans?
Yes, these media and drive types are supported in any combination. Tape volumes and drives can be defined as any of the four possible combinations of density (3590 & 3590E) and tape length (single & double). Which drives are enabled for which types of volumes depends on the variety of drives configured at the site. For example, if only 3590E/double length drives are available (our current understanding is that existing 3590 & 3590E drives will require an upgrade to support the longer tapes), then those drives will be used for all 3590 and 3590E volumes (both single and double length). If both double length enabled 3590 and double length enabled 3590E drives are available, the 3590 drives will handle all 3590 volumes and the 3590E drives will handle all 3590E volumes (i.e., HPSS will not attempt to mount 3590 volumes in 3590E drives - even though the 3590E drives can read the 3590 tapes - to avoid possible deadlock scenarios). Of course, other combinations are also supported.
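The drive-selection rules described above can be sketched as a small function. This is a minimal illustration, not HPSS code: the `DriveType` and `VolumeType` records and both function names are hypothetical. The sketch also reflects one reading of the answer: eligibility is decided from the drive types configured at the site, not from which drives happen to be free at the moment. Under that reading, a 3590E drive takes 3590 volumes only when no configured 3590 drive can handle them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriveType:
    density: str          # "3590" or "3590E"
    double_length: bool   # True if the drive is enabled for double-length tapes


@dataclass(frozen=True)
class VolumeType:
    density: str          # "3590" or "3590E"
    double_length: bool   # True for extended (double) length media


def physically_compatible(drive: DriveType, vol: VolumeType) -> bool:
    """The drive must support the tape length and either match the density
    or be a 3590E drive handling an older 3590 tape."""
    if vol.double_length and not drive.double_length:
        return False
    return drive.density == vol.density or (
        drive.density == "3590E" and vol.density == "3590"
    )


def eligible_drive_types(configured: set[DriveType], vol: VolumeType) -> set[DriveType]:
    """Prefer drive types whose density matches the volume; fall back to
    3590E drives for 3590 volumes only if no matching type can take them."""
    compatible = {d for d in configured if physically_compatible(d, vol)}
    exact = {d for d in compatible if d.density == vol.density}
    return exact if exact else compatible
```

With only double-length 3590E drives configured, every volume type maps to them; once double-length 3590 drives are also configured, 3590 volumes stay on the 3590 drives, matching the deadlock-avoidance rule in the answer.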
How does HPSS allocate tape drives? Is it possible to share a robotic tape library?
The robotic library system can be shared with other management systems. Drives assigned to HPSS are assumed to be used exclusively by HPSS. However, a drive can be locked through the HPSS administrative interface and made available to other applications. Sites use this technique to free HPSS drives for tasks such as off-hours backups.
When allocating tape drives, the PVL will first attempt to acquire the cartridges required for the tape mounts. If a required cartridge is on the shelf, a check-in of the tape will be required. A check will be made to determine if the cartridges are associated with a deferred dismount job. If so, the cartridges will be transferred to this job. If it is not possible to reserve all the cartridges for the job, but a cartridge is in the deferred dismount state, then an unmount of the deferred activities will be performed. This is done so that drives will not be tied up. A wait will be performed until cartridges are freed. Whenever a cartridge is freed or a job cancel occurs, the wait thread will be signaled.
If drives from deferred dismounts satisfy the mount request, no further drive allocation / mount activity is required. Otherwise, further drive allocation logic is required. Drive allocation logic will be triggered when a job is committed or a drive is released by another job. If a required cartridge is not mounted, there are no available drives of the required type, and there are deferred dismount activities for this drive type, the oldest deferred dismount activity is dismounted. If there are still no drives available, we will have to wait for a drive to be released. If a cartridge is already mounted from a deferred dismount job, the job activity will be transferred to this job. Otherwise, if a drive is available, a thread will be spawned to mount the cartridge and verify the label. Once the cartridge is mounted and verified, the Storage Server (which initiated the mount) will be notified.
Once a PVL job gets any drive, it will acquire any additional required drives as they become available.
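The per-cartridge drive allocation steps above can be sketched as a single-threaded function. This is a simplified model under stated assumptions, not the PVL implementation: the `Drive` record, its three states, and `allocate` are hypothetical. The sketch omits the cartridge reservation phase, check-ins, and the wait/signal threads, and returns `None` where the PVL would wait for a drive to be released.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Drive:
    name: str
    drive_type: str
    state: str = "free"              # "free", "deferred" (idle, dismount pending), or "busy"
    cartridge: Optional[str] = None  # cartridge currently loaded, if any
    idle_since: float = 0.0          # when the deferred dismount began


def allocate(drives: list[Drive], cartridge: str, drive_type: str) -> Optional[Drive]:
    """Choose a drive for one cartridge mount; None means the job must wait."""
    # 1. Cartridge still loaded from a deferred dismount: transfer it to this job.
    for d in drives:
        if d.state == "deferred" and d.cartridge == cartridge:
            d.state = "busy"
            return d
    candidates = [d for d in drives if d.drive_type == drive_type]
    # 2. Prefer a free drive of the required type.
    free = [d for d in candidates if d.state == "free"]
    if not free:
        # 3. Otherwise dismount the oldest deferred-dismount activity of this type.
        deferred = [d for d in candidates if d.state == "deferred"]
        if not deferred:
            return None  # 4. Nothing to reclaim: wait for a drive to be released.
        oldest = min(deferred, key=lambda d: d.idle_since)
        oldest.cartridge, oldest.state = None, "free"
        free = [oldest]
    drive = free[0]
    # Mount and label verification would happen here, in a spawned thread.
    drive.cartridge, drive.state = cartridge, "busy"
    return drive
```

Reusing a drive that already holds the requested cartridge avoids a dismount/mount cycle entirely, which is the point of keeping deferred dismounts around.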
How does HPSS accommodate replacing an old generation of tape drives and media with new generation technology?
Tape cartridges in a storage class can be changed from one media to another. Volumes with the old media can be switched to End of Media (EOM). Cartridges with the new media can then be added, and new writes will go to the new media.
As more advanced storage technology becomes available or old storage technology becomes obsolete, there may be a need to replace the existing tape technology used by HPSS. HPSS provides a capability to replace the currently used tape technology with another technology. With this capability, the old technology volumes are marked "Retired" and new technology volumes are then created in the same storage class. Files written to this storage class will be written to the new technology volumes while the old technology volumes are treated by HPSS as read-only. Attrition and "repacking" will move all of the files from the old technology volumes to the new. When the old technology volumes are empty they can be removed from HPSS.
The repacking of old technology cartridges can be done in the background. The repack process can be run continuously, and will not impact day to day operations of HPSS.
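The retire-and-repack cycle described above can be modeled in a few lines. This is a hypothetical sketch of the bookkeeping only: `Volume`, `StorageClass`, `retire_media`, and `repack` are illustrative names, media labels are placeholders, and the repack here runs synchronously rather than as a continuous background process.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    name: str
    media: str
    retired: bool = False                       # retired volumes are read-only
    files: list[str] = field(default_factory=list)


class StorageClass:
    def __init__(self, volumes: list[Volume]):
        self.volumes = volumes

    def retire_media(self, media: str) -> None:
        # Old-technology volumes stop receiving new writes.
        for v in self.volumes:
            if v.media == media:
                v.retired = True

    def write(self, name: str) -> Volume:
        # New data lands only on non-retired (new-technology) volumes.
        # Raises StopIteration if no writable volume exists.
        target = next(v for v in self.volumes if not v.retired)
        target.files.append(name)
        return target

    def repack(self) -> list[Volume]:
        """Move every file off retired volumes and return the retired
        volumes that are now empty and could be removed."""
        for v in self.volumes:
            if v.retired:
                while v.files:
                    self.write(v.files.pop(0))
        return [v for v in self.volumes if v.retired and not v.files]
```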
Please describe how a client is made aware of error conditions taking place in the HPSS system (e.g., data to be retrieved is inaccessible). How does HPSS handle error situations, such as a communication abort or a crash during the writing of a new bitfile?
Errors generated during reads or writes are returned to the user's client program. It is up to the client program to retry such operations. In the case of tape-only writes, the tape will eventually be marked End of Media (EOM) and a new tape will be used for writing data.
After a mount timeout, a tape will be marked End of Media and a drive error count is incremented. After a configurable number of drive errors, the drive will be disabled. This processing prevents continued failures associated with a bad tape volume or bad drives.
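The mount-timeout handling above amounts to a counter with a threshold. A minimal sketch follows; the `Drive`/`Tape` records, `max_errors`, and `handle_mount_timeout` are hypothetical names standing in for HPSS's internal state and its configurable drive error limit.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    max_errors: int          # site-configurable threshold (illustrative name)
    error_count: int = 0
    enabled: bool = True


@dataclass
class Tape:
    label: str
    end_of_media: bool = False


def handle_mount_timeout(drive: Drive, tape: Tape) -> None:
    """Mark the tape End of Media, charge the error to the drive, and
    disable the drive once the configured error count is reached."""
    tape.end_of_media = True       # no new data will be written to this tape
    drive.error_count += 1
    if drive.error_count >= drive.max_errors:
        drive.enabled = False      # stop scheduling mounts on a likely-bad drive
```

Charging errors to both sides lets a bad tape be retired after one failure while a drive is only disabled after repeated failures, so one bad cartridge does not take a healthy drive out of service.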
Errors generated during migration to tape are hidden from the user; the migration process repeats the operation until it succeeds. For read errors on staged files, the user receives an error indicating the problem after a number of retries.
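The two retry policies differ only in whether the loop is bounded. The sketch below is illustrative, not HPSS code: the function names, the `delay` and `attempts` parameters, and the use of `OSError` as the failure signal are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def migrate_until_success(copy_to_tape: Callable[[], T], delay: float = 0.0) -> T:
    """Migration: errors are never surfaced to the user; retry until the copy succeeds."""
    while True:
        try:
            return copy_to_tape()
        except OSError:
            time.sleep(delay)  # pause before the next attempt


def stage_with_retries(read_from_tape: Callable[[], T], attempts: int = 3) -> T:
    """Staging: retry a bounded number of times, then let the error reach the user."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return read_from_tape()
        except OSError:
            if attempt == attempts:
                raise  # surface the last error to the caller
```

Migration can afford to retry indefinitely because the data is still safe on disk, whereas a user waiting on a stage needs a definite answer.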
For problems with a drive or volume failure, the HPSS system will alert the administrator with messages to the Alarm and Event window.
What are the built-in limits in HPSS?
The following Core Server limits are imposed in HPSS Release 6.2:
- Maximum number of HPSS subsystems: Unlimited
Storage Policy per
- Total Accounting Policies: 1
- Total Migration Policies: 64
- Total Purge Policies: 64
- Total Storage Classes: 192
- Total Storage Hierarchies: 64
- Total Classes Of Service: 64
- Maximum storage levels per hierarchy: 5
- Maximum storage classes per level: 2
- Maximum number of File Families: Unlimited
- Maximum number of copies of a bitfile: 4
- Maximum number of disk storage segments per bitfile: 10,000
- Minimum number of disk storage segments per bitfile: 1
- Maximum number of disk storage segments per Virtual Volume: 16,384
- Maximum number of bitfiles on one disk Virtual Volume: 16,384
Mover Device/PVL Drive
- Maximum Devices/Drives: Unlimited
- Total Devices per Mover: 64
(Note: It is possible to configure more than one software Mover per Mover node.)
- Import: 10,000 cartridges per SSM import request
- Export: 10,000 cartridges per SSM export request
- Create: 10,000 virtual volumes per SSM create request
- Delete: 10,000 physical volumes per SSM delete request
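A few of these limits can be checked before a configuration or bulk operation is attempted. The helper below is a hypothetical planning aid, not an HPSS or SSM interface: it validates a proposed hierarchy layout against the level and per-level limits, and splits a large cartridge import into requests of at most 10,000 cartridges each.

```python
# Selected Release 6.2 Core Server limits from the list above.
LIMITS = {
    "levels_per_hierarchy": 5,
    "classes_per_level": 2,
    "cartridges_per_import_request": 10_000,
}


def hierarchy_problems(levels: list[list[str]]) -> list[str]:
    """levels[i] lists the storage class names at hierarchy level i (hypothetical layout)."""
    problems = []
    if len(levels) > LIMITS["levels_per_hierarchy"]:
        problems.append(f"{len(levels)} levels exceeds {LIMITS['levels_per_hierarchy']}")
    for i, classes in enumerate(levels):
        if len(classes) > LIMITS["classes_per_level"]:
            problems.append(f"level {i} has {len(classes)} storage classes")
    return problems


def import_batches(cartridges: list[str]) -> list[list[str]]:
    """Split a large import into requests of at most 10,000 cartridges."""
    size = LIMITS["cartridges_per_import_request"]
    return [cartridges[i:i + size] for i in range(0, len(cartridges), size)]
```

For example, importing 25,000 cartridges would take three SSM import requests of 10,000, 10,000, and 5,000 cartridges.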
What happens when the Core Server maximum number of requests is exceeded? Do the requests fail or are they queued?
HPSS will queue requests if ALL Core Server request-handling threads are currently in use. If all of the configured I/O threads are busy, however, the Core Server currently returns HPSS_EBUSY. The Client API will retry requests that are rejected because the maximum number of I/O threads is in use.
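The answer states that the Client API retries requests rejected with HPSS_EBUSY but does not describe the retry policy. The sketch below assumes exponential backoff purely for illustration; `BusyError` is a hypothetical stand-in for an HPSS_EBUSY result, and `call_with_busy_retry` is not an HPSS function.

```python
import time


class BusyError(Exception):
    """Hypothetical stand-in for a request rejected with HPSS_EBUSY."""


def call_with_busy_retry(request, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry a request rejected as busy, doubling the delay each time;
    re-raise after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return request()
        except BusyError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 0.1 s, 0.2 s, 0.4 s, ...
```

Backing off gives busy I/O threads time to drain instead of immediately resubmitting into a saturated Core Server.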
What are the theoretical and practical performance limitations with a single Core Server using DB2?
There is a theoretical limit of 128 × 10^10 (1.28 × 10^12) files in the HPSS name space for a single HPSS Core Server (the practical limit is on the order of 2 billion name space objects). In Release 4, multiple Core Servers are supported. The practical performance limitations are bound by the amount of attached disk space and the size of the DB2 memory cache. Data movement should not be adversely affected as the number of files increases.
In an environment where millions of purge records have been generated (for example, because millions of files are kept in the disk cache), is there a large overhead in tracking the files' last access times in the purge records?
We do not expect a significant performance impact in managing large numbers of purge records.
Is there a way to restrict the number of concurrent writes into a storage class (different bitfiles, not striping)?
Currently, there is no general way to strictly limit the number of writes into a storage class. For tape storage classes, there is a configurable limit on the number of active tape virtual volumes, which provides an upper bound on the number of concurrent writes that can be active in that storage class (since each write needs exclusive access to a tape virtual volume during the actual write operation). However, the same capability is not supported for disk storage classes.
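Because each tape write holds one tape virtual volume exclusively, the active-VV limit behaves like a counting semaphore. A minimal sketch, assuming a hypothetical `TapeStorageClass` wrapper whose `max_active_vvs` stands in for the configurable limit (HPSS enforces the real limit inside the Core Server):

```python
import threading


class TapeStorageClass:
    """Hypothetical model: at most max_active_vvs writes run at once."""

    def __init__(self, max_active_vvs: int):
        self._vvs = threading.BoundedSemaphore(max_active_vvs)

    def write(self, do_write):
        # Each write holds one tape VV for its duration; extra writers block here.
        with self._vvs:
            return do_write()
```

No equivalent bound exists for disk storage classes, which is why the answer notes the capability is tape-only.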
What are the limitations on the size of a bitfile? What are the restrictions on the number of tape volumes that a bitfile may span?
HPSS has no practical limitation on the size of a bitfile, the number of bitfiles, or the size of a data set, since it uses unsigned 64-bit numbers to represent bitfile sizes and offsets, allowing for bitfiles up to one Exabyte in size.
HPSS limits the number of disk segments per bitfile to 10,000, although in a properly configured HPSS system this limit should never be reached. There is no such hard limit on HPSS tape bitfiles, but experience has shown that problems arise in the underlying database when the number of segments exceeds 10,000 to 15,000. Assuming 20 GB tape cartridges and one tape segment per cartridge, this implies a maximum HPSS tape bitfile size of 200 to 300 TB.
HPSS also allows up to one Exabyte of HPSS virtual volumes within a single Storage Class. The number of bitfiles that can be contained within a disk Storage Class is limited to 16,000 bitfiles per disk times the number of disks in the Storage Class. There is no such limitation on the number of bitfiles in a tape Storage Class.
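The size and count limits above are simple products. This worked arithmetic uses only figures from the answer (20 GB cartridges, one segment per cartridge, 10,000 to 15,000 segments, 16,000 bitfiles per disk); the function names are illustrative.

```python
def max_tape_bitfile_tb(cartridge_gb: float, max_segments: int) -> float:
    """One segment per cartridge: bitfile size = cartridge size x segment count (GB -> TB)."""
    return cartridge_gb * max_segments / 1000


def max_disk_bitfiles(num_disks: int, per_disk: int = 16_000) -> int:
    """Disk storage class bitfile limit = per-disk limit x number of disks."""
    return per_disk * num_disks


print(max_tape_bitfile_tb(20, 10_000))  # 200.0 TB
print(max_tape_bitfile_tb(20, 15_000))  # 300.0 TB
print(max_disk_bitfiles(8))             # 128000 bitfiles for an 8-disk class
```

With larger cartridges the same segment ceiling scales the tape limit proportionally; the 20 GB figure reflects the cartridges assumed in the answer.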
Is there any mechanism that allows the number of requests performed by a given user or group of users to be monitored and limited, preferably dynamically?
HPSS provides facilities to monitor and limit the number of requests performed. HPSS supports a Gatekeeper Service, implemented by a gatekeeper server, which receives RPCs from the HPSS client library in response to client bitfile creates, opens, closes and explicit stages. These RPCs contain information on the identity of the end user, which allows the administrator to provide policy modules for monitoring and limiting resource usage. This requires the site to develop code to implement these policies.
The FTP access file can also be used to limit the number of concurrent FTP sessions.
How many removable media families does HPSS support?
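As an illustration of the kind of logic a site might put in a Gatekeeper policy module, the sketch below caps concurrent opens per user and allows the cap to change at run time. It is hypothetical throughout: the actual Gatekeeper policy-module interface is site-developed code against HPSS's own API and is not reproduced here; `PerUserOpenLimit` and its methods are illustrative names.

```python
from collections import Counter


class PerUserOpenLimit:
    """Hypothetical site policy: cap concurrent opens per user, adjustable at run time."""

    def __init__(self, max_open_per_user: int):
        self.max_open = max_open_per_user
        self.open_counts = Counter()  # user -> currently open bitfiles

    def set_limit(self, max_open_per_user: int) -> None:
        # Dynamic adjustment: applies to the next open request.
        self.max_open = max_open_per_user

    def on_open(self, user: str) -> bool:
        """Return True to admit the open, False to reject it."""
        if self.open_counts[user] >= self.max_open:
            return False
        self.open_counts[user] += 1
        return True

    def on_close(self, user: str) -> None:
        if self.open_counts[user] > 0:
            self.open_counts[user] -= 1
```

The counts double as a monitoring record, since the same per-user tallies that drive admission can be reported to the administrator.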
In HPSS, "File Family" is the term used for tape media family. HPSS supports an unlimited number of File Families.
Please explain the advantages of configuring two or more storage subsystems in a single HPSS instance versus multiple instances of HPSS in order to achieve scalability, separation, and growth.
There are advantages in implementing two or more subsystems within one HPSS instance:
- No performance degradation compared to two separate instances - There should be no performance impact of using subsystems versus two separate instances of HPSS. Subsystems allow the single instance to scale by adding processing resources as transaction performance requirements increase.
- Single management interface - The complete system can be managed by one set of GUI screens and command line utilities. This will simplify the on-going operation, maintenance and administration of the system.
- Use of a single core server machine - Both subsystems' core servers can be initially installed on the same processor machine, with provision to move to separate machines in the future as the system grows. We would not recommend that two separate production instances of HPSS be installed on the same core server machine.
- Sharing of tape drives between subsystems - Subsystems provide flexibility in sharing storage resources. Storage classes (disk and tape), tape drives, and tape libraries can be shared across subsystems or dedicated to a subsystem, as best fits the requirements.
How does HPSS select the optimum network path for data transfers to/from the clients?
HPSS always allows the system to send data transfers over the optimum delivery path. HPSS separates the control path from the data path. In selecting network interfaces for data transfers, HPSS allows the client to provide a network address that is used to determine which interface will be used for the data transfer. The client's network address and Mover routing information are used to determine the path over which data transfers are directed. Additional network and routing information can be specified in the HPSS.conf file.
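One common way to turn a client address plus routing information into an interface choice is longest-prefix matching. The sketch below assumes that technique for illustration only; the answer does not say HPSS uses it, and the interface names, networks, and `select_interface` function are hypothetical (this is not the HPSS.conf format).

```python
import ipaddress


def select_interface(client_addr: str, mover_interfaces: dict[str, str], default: str) -> str:
    """mover_interfaces maps interface name -> network in CIDR form (hypothetical).
    Pick the most specific network containing the client, else the default."""
    client = ipaddress.ip_address(client_addr)
    matches = [
        (ipaddress.ip_network(net), name)
        for name, net in mover_interfaces.items()
        if client in ipaddress.ip_network(net)
    ]
    if not matches:
        return default
    return max(matches, key=lambda m: m[0].prefixlen)[1]


interfaces = {"hs0": "10.10.0.0/16", "hs1": "10.10.5.0/24", "eth0": "192.168.1.0/24"}
print(select_interface("10.10.5.7", interfaces, default="eth0"))  # hs1
```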
How is the structure and content of the metadata used by HPSS made visible from the client point of view? Describe the utilities or mechanisms used to query and modify the metadata.
HPSS provides a rich set of storage metadata and a robust set of APIs and administrative displays for viewing and modifying it. Attributes for the following metadata managed object types may be retrieved:
Bitfiles, Storage Segments, Storage Maps, Virtual Volumes, Physical Volumes, Mover Devices / Drives, Drives, Volumes, Cartridges, Filesets, Storage Classes, PVL Jobs / Queues, Log Files, and Server Configuration information.
In addition, APIs to set selected attributes of these managed objects are also supported.
Administrative displays for getting and setting attributes for the managed objects listed above are also provided.
In addition to the API and GUI interfaces described above, a set of utilities is provided for accessing metadata. Examples of these utilities are:
- lshpss for listing the configuration information for Classes of Service, Hierarchies, Storage Classes, Migration Policies, Purge Policies, Physical Volumes, Mover Devices, PVL Drives, HPSS Servers, Movers, Accounting Policies, and Log Policies.
- dump_sspvs to list the physical volumes known to an HPSS Storage Server.
- dumpbf to display the storage characteristics of a particular bitfile.
- lsvol to list the bitfiles that have storage segments on a particular volume.
- dumppv_pvl to list the physical volumes managed by the Physical Volume Library (PVL).
- dumppv_pvr to list the physical volumes managed by the Physical Volume Repository (PVR).
How does HPSS handle the failure of single drive, volume or other storage device? If there is a failure of one drive or volume, are other drives/volumes affected?
The failure of any single drive, volume or other storage device only affects the data and bitfiles on that device, and any requests directly related to it. An alarm is raised, but other operations continue as normal.
The administrator may also lock drives or volumes from being accessed. When HPSS detects specific errors, it will automatically set a tape to End of Media or, potentially, lock a drive. For example, after a mount timeout, a tape will be marked End of Media and the drive error count is incremented. After a configurable number of drive errors, the drive will be disabled. This process prevents continued failures associated with a bad tape volume or bad drives.
Alarms are sent to the Alarm and Event window, and are also written to the HPSS log.
Please describe how HPSS isolates corrupted portions of the system for repair work, while normal operations continue on the remainder.
The following HPSS provisions allow corrupted portions of the system to be isolated and repaired while normal operation continues on the remainder.
An HPSS Mover can be taken offline when repairs are required, while normal operations continue on the remainder.
The Storage Map can be locked to prevent storage space from being allocated, thus making it "read-only." All subsequent attempts to allocate space on the Virtual Volume will fail (another VV in the same storage class will be selected, if available).
Locking a drive disallows its use by HPSS. Changing a drive's state to locked will ensure that the drive will not be used for new mounts, but it will not cause the dismount of any cartridges currently on the drive. The drive will be unloaded when the current client using the drive completes and dismounts.
A drive that currently has a cartridge mounted can also be locked. Locking the drive does not affect the active job using the mounted volume. Once that job has completed, the drive is dismounted and no further mounts take place. This is useful when preventive maintenance is required on an operating drive.
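The lock semantics described above (no new mounts, no interruption of the active job, unload when that job ends) can be captured in a small state model. `LockableDrive` and its methods are hypothetical names; the unlocked case keeps the cartridge loaded after a job, loosely mirroring a deferred dismount.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LockableDrive:
    name: str
    locked: bool = False
    cartridge: Optional[str] = None  # cartridge currently in the drive
    job_active: bool = False         # a job is using the mounted cartridge

    def lock(self) -> None:
        # Locking only blocks future mounts; it does not touch the active job.
        self.locked = True

    def request_mount(self, cartridge: str) -> bool:
        """Return True if the mount is accepted (replacing any idle cartridge)."""
        if self.locked or self.job_active:
            return False
        self.cartridge, self.job_active = cartridge, True
        return True

    def job_complete(self) -> None:
        self.job_active = False
        if self.locked:
            self.cartridge = None  # a locked drive is unloaded once its job finishes
```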
Please describe the HPSS features that provide for system availability in event of failures, for example, continuing to provide the full complement of its services, possibly at reduced performance, after hardware or software failure has been experienced.
A number of features are provided by HPSS for high availability.
Servers may be configured to automatically respawn upon failure. HPSS servers register their address information in the Startup Daemon and connection contexts are maintained between HPSS servers. Should a server fail, any other server with connections to that server will re-establish its connection when the failed server restarts.
For reliability, metadata is mirrored. In addition, backups of metadata can be taken while the system is running.
Storage Subsystems allow the system to be partitioned into sub-units according to name space. A separate set of HPSS core software processes is associated with each Storage Subsystem. As a result, a failure in one Storage Subsystem should not impact the others.
Administrative interfaces for locking devices are supported. This allows devices with hardware errors to be taken offline. In addition, in response to selected errors, the HPSS software will automatically lock drives.
HPSS also supports a High Availability option to accommodate failures in processors or system disks on the core server.
How does HPSS notify the operations staff that manual intervention is required?
Although HPSS provides several windows that display the status of different components of the system, the Alarms and Events window displays the most information when manual intervention is required. The messages on this window are color-coded to indicate the severity level of the messages (red indicating the highest severity).
Events displayed in the window indicate that a significant event has occurred that may be of interest to the operator, while alarms indicate that a server has detected an abnormal condition. If the operator left-clicks a message, another window appears with complete information on that message. Using the message number, the operator can locate additional information on each alarm in the HPSS Error Message Reference Manual.