
13.2 HPSS Server Problems

The paragraphs below discuss server problems common to all servers; problems with the Name Server, Bitfile Server, Storage Server, Migration/Purge Server, PVL, PVR, GK, LS, NDCG, and Mover; and problems with Logging Services, NFS, Startup Daemon, and SSM.

HPSS Servers are started in separate directories to prevent the overwriting of core files in the event of a server terminating abnormally. The parent of these directories is controlled by the HPSS_PATH_CORE environment variable, with the default being /var/hpss/adm/core . Within that directory, subdirectories are created based on the Server descriptive name (with spaces replaced by underscores and parentheses dropped - e.g., for the Mover with descriptive name "Mover (hpss)", the directory created will be "Mover_hpss"). Core files detected during server startup will be renamed based on the date and time that the core file was written (of the form core.YYYY_MMDD_hhmmss, where YYYY is the year, MMDD the month and day, and hhmmss is the hour, minute and second).
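The naming rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the text, not HPSS code: the function names `core_directory` and `renamed_core` are hypothetical, and the handling of anything other than spaces and parentheses is not specified by the text.

```python
import posixpath
from datetime import datetime

DEFAULT_CORE_ROOT = "/var/hpss/adm/core"  # default named in the text


def core_directory(descriptive_name, env):
    """Return the per-server core directory for a server descriptive name.

    `env` is a mapping such as os.environ; HPSS_PATH_CORE overrides the default.
    """
    root = env.get("HPSS_PATH_CORE", DEFAULT_CORE_ROOT)
    # Drop parentheses first, then replace spaces with underscores.
    cleaned = descriptive_name.replace("(", "").replace(")", "")
    return posixpath.join(root, cleaned.replace(" ", "_"))


def renamed_core(written_at):
    """Return the core.YYYY_MMDD_hhmmss name for the time a core was written."""
    return written_at.strftime("core.%Y_%m%d_%H%M%S")
```

For example, `core_directory("Mover (hpss)", {})` yields `/var/hpss/adm/core/Mover_hpss`, which matches the example in the paragraph above.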

In addition, a log is maintained of servers that terminate with an abnormal termination code. The parent directory of the log file is controlled by the HPSS_PATH_ADM environment variable, with the default being /var/hpss/adm . The file name is hpssd.failed_server .
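A small helper for locating and reading this log might look like the sketch below. The helper names are hypothetical, and because the text does not describe the format of the file, the sketch treats it as opaque text and only returns its last lines.

```python
import posixpath

DEFAULT_ADM_ROOT = "/var/hpss/adm"        # default named in the text
FAILED_SERVER_LOG = "hpssd.failed_server"  # file name named in the text


def failed_server_log(env):
    """Return the log path; `env` is a mapping such as os.environ."""
    return posixpath.join(env.get("HPSS_PATH_ADM", DEFAULT_ADM_ROOT),
                          FAILED_SERVER_LOG)


def last_lines(path, count=20):
    """Return up to `count` trailing lines, or [] if the file does not exist."""
    try:
        with open(path, "r", errors="replace") as handle:
            return handle.readlines()[-count:]
    except FileNotFoundError:
        return []
```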

If an HPSS server terminates abnormally, save the appropriate core file. In addition, perform an HPSS delog to capture the messages logged by that server, and by the other servers that interface with it, around the time of the abnormal termination.

13.2.1 General Problems

13.2.1.1 Servers cannot be started

Diagnosis 1 : The Executable flag for a particular server is not set in the server's configuration file.

Resolution: If the Executable flag for a server is not set, SSM will refuse to start that server. To fix the problem, set the flag.

Diagnosis 2 : The Startup Daemon on the server's host is not running. SSM cannot start the Startup Daemon.

Resolution : The Startup Daemon is normally started automatically at system boot time. If it is not running on the affected host, start it manually by executing the rc.hpss script there. Once the Startup Daemon is up and SSM can connect to it, try starting the target server again.

Diagnosis 3 : The Startup Daemon on the server's host is running, but SSM cannot connect to it.

Resolution : The only way SSM can start servers is by asking the Startup Daemon to do it, so it is essential that SSM is able to connect to the Startup Daemon. See whether SSM is connected to any other HPSS servers on that host, or whether any non-HPSS program (such as ping or telnet ) can communicate between the two hosts. If the network itself is all right, try the Force Connect button from the Server List window (Figure 1-1 HPSS Servers Window) to get SSM to connect to the Startup Daemon. If this does not work, the SSM System Manager and/or the Startup Daemon may have to be restarted.

Diagnosis 4 : The Startup Daemon on the server's host is running and reachable, but SSM refuses to try to connect to it because the Startup Daemon is marked non-executable.

Resolution : SSM spends a good deal of time pinging each server to make sure it is still connected to it (and trying to reconnect if it is not). To avoid wasting resources on a server that is not going to be running anyway, SSM ignores servers whose Executable flag is not set. If the Startup Daemon itself does not have the Executable flag set, SSM will never try to connect to it, and so will not be able to start any servers on that host. Therefore, even though SSM does not start the Startup Daemon itself, make sure the Startup Daemon's Executable flag is set.

13.2.1.2 Server performs sluggishly

Diagnosis 1 : The size of the thread pool or the maximum number of connections allowed for a server may be too large.

Resolution: Decreasing the number of concurrent requests that a server can process may improve performance. Consider reducing the size of the thread pool or the number of connections to investigate performance implications. To reduce the size of the thread pool or to decrease the number of connections, call up the Server Configuration window (Figure 6-6 Basic Server Configuration Window in the HPSS Installation Guide) and select the server in question. Modify the Thread Pool Size or Maximum Connections configuration variables as necessary. The server must be reinitialized or else stopped and restarted to pick up the new configuration.

Diagnosis 2 : The Domain Name Service (DNS) is not reachable.

Resolution : Add all necessary entries to the /etc/hosts file. Terminate all HPSS servers, Encina SFS servers, and DCE. Restart the system without DNS support, then fix DNS.
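Before restarting without DNS, it can help to confirm that every host the system depends on has an entry in the hosts file. The following is a minimal sketch that parses hosts-file text and reports missing names. The function names are hypothetical, and which host names a site must list is site-specific and not given in the text.

```python
def hosts_entries(text):
    """Map each host name and alias in hosts-file text to its address."""
    names = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        address, *hostnames = line.split()
        for name in hostnames:
            names.setdefault(name.lower(), address)  # first entry wins
    return names


def missing_hosts(required, text):
    """Return the names in `required` that have no entry in the hosts text."""
    known = hosts_entries(text)
    return [name for name in required if name.lower() not in known]
```

A typical use would read the file with `open("/etc/hosts").read()` and pass the list of hosts that run HPSS, SFS, and DCE services.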

Diagnosis 3 : Either one or more network components is unusable or the network interface is unavailable.

Resolution : Repair the network. Implement the RPC_UNSUPPORTED_NETIFS environment variable for the disrupted network interface and restart the system.

Diagnosis 4 : The RPC port mapper has incorrect, invalid, or extra mappings.

Resolution: Use the rpccp show mapping command to observe RPC mappings. Use the rpccp command to remove invalid mappings. Contact your site DCE administrator or reference the DCE V.1.3 for the AIX Administration Guide for syntax of commands.

13.2.2 Name Server (NS) Problems

13.2.2.1 NS is unable to create new directory entries

Diagnosis 1 : The NS may have exhausted its allocation of SFS records.

Resolution : If Diagnosis 1 is correct, the NS will have issued alarms indicating that it is low on space. If SFS has sufficient disk space available, increasing the number of SFS records allocated to the Name Server may solve the problem. 10.7.3 Name Server Space Shortage describes how the Maximum Records parameter can be increased. If SFS does not have sufficient disk space, it may be possible to increase the size of the SFS volume. 10.7.2 SFS Space Shortage describes how the SFS volume size can be increased.

Diagnosis 2 : The CDS ACL for the NS may not be set correctly.

Resolution : In this scenario, bitfiles cannot be added to directories. The creation of new directories, hard links, or symbolic links will not be affected. The CDS ACL for the Name Server might not have "control" permission set for the BFS. Table 6-3 Basic Server Configuration Variables in the HPSS Installation Guide under Advice for Server CDS Name describes how to properly set the CDS ACL for the Name Server.

13.2.3 Bitfile Server (BFS) Problems

13.2.3.1 BFS cannot connect to Storage Servers

Diagnosis 1 : One or more of the Storage Servers is not up.

Resolution : Restart the Storage Server in question. The BFS will attempt to connect to this Storage Server the next time it is needed. This should happen fairly quickly because the background thread that monitors storage space usage statistics for Storage Servers will attempt to contact the Storage Servers within a few seconds.

13.2.3.2 BFS cannot connect to SSM

Diagnosis : The SSM System Manager is not running or not responding.

Resolution : If the SSM window is responding, check the status of SSM's connection to the BFS at the Server Status window and reconnect if necessary. For additional information that may help with this problem, see Sections 13.1.1.1 and 13.1.1.4.

13.2.3.3 Errors reading, writing, creating, or deleting entries in metadata (SFS) files

Diagnosis : Error codes indicate that SFS is not running or not responding.

Resolution : Check the status of SFS with Encina administration tools, and restart if necessary.

13.2.3.4 BFS takes a long time starting

Diagnosis 1 : The server is taking a long time to close outstanding SFS Open File Descriptors (OFDs).

Resolution : The BFS closes any OFDs that were left open from a prior execution of the BFS that crashed. This is to be expected if there was a large amount of activity when the BFS crashed and is a normal recovery operation.

Diagnosis 2 : When the BFS starts, it attempts to connect to all of the Storage Servers and SSM at initialization time, which can lead to a delay if one or more of those servers is not operational.

Resolution : When the servers that were not up are started, BFS will automatically connect to them. To minimize the delays, start all of the servers that BFS connects to before starting BFS.

13.2.3.5 Service parameters have been changed and BFS does not recognize them

Diagnosis : The BFS has not been recycled since the changes were made. (Potentially includes Class of Service, Storage Class, Storage Hierarchy, and Migration Policy definitions.)

Resolution : The BFS does not pick up this information automatically. Recycle the BFS.

13.2.3.6 Unable to write new records to the SFS BFS storage segment unlink file

Diagnosis 1 : One or more of the Storage Servers is down.

Resolution : If one or more of the Storage Servers is down, the BFS will not be able to successfully delete storage segments targeted for that server, and the associated storage segment unlink records cannot be eliminated. Restart the needed Storage Server. If Storage Servers are to be down for substantial periods of time, it may be necessary to expand the storage segment unlink file so that it can hold more records.

Diagnosis 2 : The storage segment unlink file is too small.

Resolution : A heavy load of delete activity on the system can cause unlink records to be created faster than the BFS is able to unlink storage segments. Increase the size of the storage segment unlink file.

13.2.3.7 Receiving messages from BFS indicating inconsistencies in account summary records

Diagnosis 1 : Account summary record has been corrupted due to an incomplete SFS recovery.

Resolution : Specialized procedures are provided to deal with this problem. Contact IBM support.

Diagnosis 2 : Software problem in HPSS has resulted in the inconsistency.

Resolution : Contact IBM support.

13.2.3.8 BFS cannot connect to Gatekeeper

Diagnosis 1 : The Gatekeeper is not up.

Resolution : Restart the Gatekeeper in question. The BFS will attempt to connect to this Gatekeeper the next time it is needed. If the site policy increased the types of requests being monitored, then the BFS will not find this out until one of the types of requests previously being monitored is issued. For example, if the BFS was monitoring open requests and the site policy was changed to monitor open and create requests then the BFS won't know about the change until it attempts to issue an open to the DOWN Gatekeeper.

Diagnosis 2 : The Gatekeeper is not configured into the Storage Subsystem.

Resolution : Configure the Gatekeeper into the Storage Subsystem corresponding to the BFS. Recycle the BFS.

13.2.4 Storage Server (SS) Problems

13.2.4.1 SS cannot connect to PVL

Diagnosis : The PVL is not running, or is not responding.

Resolution : Restart the PVL. The Storage Server will try to connect to the PVL for up to 5 minutes before returning an error to a request to mount a disk or tape. Each additional mount request will try for 5 minutes before returning an error. For additional information that may help with this problem, see Sections 13.1.1.1 and 13.1.1.4.
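The bounded retry described above can be pictured as a loop with a deadline. The sketch below is illustrative only: the 5-minute limit comes from the text, while the 5-second retry interval, the function name, and the use of `ConnectionError` are assumptions. The clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time


def connect_with_deadline(connect, timeout=300.0, interval=5.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Call connect() until it succeeds or `timeout` seconds have elapsed.

    The last ConnectionError is re-raised once another retry would pass
    the deadline, which mirrors "returning an error" to the mount request.
    """
    deadline = clock() + timeout
    while True:
        try:
            return connect()
        except ConnectionError:
            if clock() + interval > deadline:
                raise
            sleep(interval)
```

Because each mount request starts its own deadline, a PVL that stays down delays every request by the full timeout, which is why restarting the PVL promptly matters.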

13.2.4.2 SS cannot connect to SSM

Diagnosis : The SSM System Manager is not running or not responding.

Resolution : If the SSM window is responding, check the status of SSM's connection to the SS in the Status field on the Server List window (Figure 1-1 HPSS Servers Window) and reconnect if necessary. The SS will attempt to connect to SSM indefinitely, but retains a limited number of messages to send to SSM in a queue. When the queue becomes full, additional messages are discarded until the connection is re-established. For additional information that may help with this problem, see Sections 13.1.1.1 and 13.1.1.4.
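The queueing behavior described above, where new messages are dropped once the queue is full, can be modeled as follows. This is a sketch of the described behavior, not the server's implementation; the class name, the capacity value, and the `discarded` counter are hypothetical.

```python
from collections import deque


class PendingMessages:
    """Bounded queue that discards new messages while it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._queue = deque()
        self.discarded = 0  # messages lost while SSM was unreachable

    def put(self, message):
        """Queue a message; return False if it had to be discarded."""
        if len(self._queue) >= self.capacity:
            self.discarded += 1
            return False
        self._queue.append(message)
        return True

    def drain(self, send):
        """After reconnecting, send queued messages oldest first."""
        sent = 0
        while self._queue:
            send(self._queue.popleft())
            sent += 1
        return sent
```

The practical consequence is that after a long SSM outage only the oldest messages survive, so the server's own log is the more complete record of that period.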

13.2.4.3 Errors reading, writing, creating, or deleting entries in metadata (SFS) files

Diagnosis 1 : Error codes indicate that SFS is not running or not responding.

Resolution : Check the status of SFS with Encina administration tools, and restart if necessary. The Storage Server will retry timed-out SFS I/O operations a fixed number of times before generating an error.

Diagnosis 2 : Error codes indicate that an invalid OFD was found.

Resolution : The Storage Servers will discard an invalid OFD and obtain a new one and retry the operation in most cases. In other cases, an error will be returned immediately to the client. If no error was returned to the client, no action need be taken (the server recovered the error). If an error was returned, log messages should be noted, but little can be done unless the problem becomes persistent.

13.2.4.4 Server takes a long time to come up

Diagnosis 1 : The server is taking a long time to close outstanding SFS OFDs.

Resolution : If the problem does not repeat, a large number of OFDs were successfully closed. This is the intended operation of the system and no action need be taken if the reason for the large number of outstanding OFDs is known. If the problem repeats each time the server is started, investigate the outstanding OFDs (using sfsadmin, for instance) to find out why the OFDs are not being closed when the server starts. If the OFDs were created by another server, SS restarts can be affected because all of the OFDs at the SFS server must be examined by the SS to find OFDs to close. Restarting the server that created the OFDs may correct the problem.

Diagnosis 2 : (for Disk SS only): The Disk SS is taking a long time to mount each disk physical volume via the PVL.

Resolution : Check the status of the PVL and the status of the connection to the PVL from the Disk SS. Each disk physical volume must be mounted in the PVL before the Disk SS can complete its initialization. The server will "hang" until these steps are complete, or will produce a fatal error message and halt.

13.2.4.5 Disk storage server reports "no space"

Diagnosis : The disk VVs are fragmented.

Resolution : The Bitfile Server makes requests to the Disk Storage Server to create one or more disk storage segments in which a file will be recorded. The Bitfile Server determines the size of these storage segments according to the size of the file to be recorded. The Disk Storage Server attempts to find free space on the disk virtual volumes (VVs) it manages in which to create the disk storage segments. The free space for each segment must be made of contiguous disk VV blocks.

If all of the disk VVs are fragmented to a point where a storage segment cannot be created at the requested length, the Disk Storage Server will report that it is out of disk space and return an error to the Bitfile Server. An alarm is sent to the "Alarms and Events" display.

The total free disk space in the system may exceed the requested storage segment size, but if the disks are sufficiently fragmented, it may not be possible to create one or more of the storage segments.

Two solutions are available for this problem. Disk VVs could be repacked, which will create large blocks of free space, and purge parameters can be changed to increase the amount of free space in the VVs, which may increase the sizes of the largest free blocks.
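The difference between total free space and usable contiguous free space can be shown with a toy model of a single disk VV. This is only an illustration: the block map representation and function names are hypothetical and say nothing about how the Disk Storage Server actually tracks space.

```python
def largest_free_run(block_map):
    """Length of the longest run of contiguous free blocks (True = free)."""
    best = run = 0
    for free in block_map:
        run = run + 1 if free else 0
        best = max(best, run)
    return best


def can_allocate(block_map, segment_blocks):
    """A segment fits only if one contiguous free run is long enough."""
    return largest_free_run(block_map) >= segment_blocks


# Six blocks are free in total, but no run is longer than two blocks.
fragmented = [True, True, False, True, True, False, True, True]
# Repacking is modeled here as moving the used blocks together.
repacked = sorted(fragmented)
```

With this map, a four-block segment cannot be created on `fragmented` even though six blocks are free, while the same request succeeds on `repacked`.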

13.2.4.6 SS cannot be started

Diagnosis : SS died at initialization with " Invalid COS " error.

Resolution : The reference for the deleted COS was not removed from the Storage Subsystem configuration to which the Storage Server belongs. Bring up the SSM Storage Subsystem Configuration window for the appropriate storage subsystem. Search for the reference to the deleted COS in the Allowed COS list and set it to " No ".

13.2.5 Migration/Purge Server (MPS) Problems

13.2.5.1 No storage class information reported on Storage Class List window (Chapter 6, Figure 6-5)

Diagnosis : One or more of the MPS or SS is not running or cannot be connected.

Resolution : Start the MPS or SS. Resolve the connection problem.

13.2.5.2 A storage class does not show up in the Storage Class List window

Diagnosis 1 : The storage class has been added or updated after the MPS startup is completed.

Resolution : Shut down and restart the MPS.

Diagnosis 2 : No SS resources have been created in the storage class.

Resolution : Create the missing resource. From the HPSS Health and Status window (Figure 7-1 HPSS Health and Status Window), select the Operation pull down menu and then select Create SS Resources .

Diagnosis 3 : The SS controlling the missing storage class has not been started.

Resolution : Start the SS.

Diagnosis 4 : The storage class is not used in any hierarchies.

Resolution : Once the storage class is added to at least one hierarchy, and MPS is restarted, MPS will start reporting usage statistics for that storage class.

Diagnosis 5 : The storage class is not active in a given subsystem.

Resolution : Enable a class of service which references a hierarchy which utilizes the given storage class. This is done in the storage subsystem configuration window. Once MPS is restarted it will begin reporting statistics for those storage class resources within its assigned subsystem.

13.2.5.3 MPS is not migrating or purging data

Diagnosis 1 : Either the BFS or one of the storage servers in the subsystem is down.

Resolution : Start the BFS and/or SS.

Diagnosis 2 : No storage space is available in one of the migration target storage classes.

Resolution : Add or reclaim resources in the given storage class within the given subsystem.

Diagnosis 3 : Migration or purge is encountering errors and aborting.

Resolution :

Make sure that a bad piece of magnetic media (disk or tape) is not causing errors in either the source or target storage classes.

Make sure that the target files do not reside on a volume which is locked. In the case of tape migration, remember that the whole file option may involve more than one source volume.

Diagnosis 4 : Migration or purge appears to be hung.

Resolution :

Make sure that a tape mount is not hung up or failing and being retried for either the source or target storage classes (applies to migration only).

Use Real Time Monitoring to identify a potential deadlock in one of the servers which participates in migration or purge and contact your customer service representative.

13.2.5.4 The purging runs occur more frequently or less frequently than desired

Diagnosis : The Start Used Percent and/or the Target Free Percent parameter(s) are set incorrectly in the purge policy.

Resolution : Correct the Start Used Percent and/or Target Free Percent parameter(s) in the purge policy. Shut down and restart the MPS.

13.2.5.5 The migration runs occur more frequently or less frequently than desired

Diagnosis 1 : The parameters are set incorrectly in the migration policy.

Resolution : Correct the Runtime Interval, Last Read Interval, Last Update Interval and/or Free Space Target parameter(s) in the migration policy. Shut down and restart the MPS.

Diagnosis 2 : The Storage Class Update Interval parameter is set incorrectly in the Migration/Purge Specific Configuration.

Resolution : If the Storage Class Update Interval parameter is set too large, the MPS will sample SS statistics too infrequently and act on the information too late. Set this parameter to a smaller value. Shut down and restart the MPS.

13.2.5.6 A purge is tried for a tape storage class

Diagnosis : (SSM issues an invalid argument message following a force purge command.) Force Purge is not supported for tape storage classes.

Resolution : Delete the purge policy from the tape storage class. The MPS purges all the storage segments from a virtual volume as part of the tape migration. There is no specific purge for tapes.

13.2.6 Physical Volume Library (PVL) Problems

13.2.6.1 Tape mount requests are not being satisfied

Diagnosis 1 : A connectivity failure exists between the PVL and either the MVR, PVR, or SS.

Resolution : Loss of inter-server connectivity should be evident via the HPSS Alarms and Events window (Figure 1-5 HPSS Alarms and Events Window). Determine where connectivity problems exist and proceed to the appropriate problem diagnosis below.

Diagnosis 2 : Mount requests are queued in the PVL waiting for resources.

Resolution: Check PVL job queues and devices for resource shortages such as all devices being in use, multiple requests for the same cartridge, or drives being disabled. This examination should reveal the resource shortage as being drive or cartridge related. If the shortage is caused because a drive is in a disabled state, enable the drive (if appropriate) using the Drive Information window. If a true resource shortage exists, wait for resources to become available or cancel appropriate PVL jobs to free the required resource. If no resource shortage exists, then proceed to Diagnosis 3.

Diagnosis 3 : An internal PVL job queue error has occurred.

Resolution : Use the PVL Job Queue window (Figure 7-2 PVL Job Queue Window) to select the job in question. Cancel the job and retry it. If problems exist for all PVL mounts, restart the PVL.

13.2.6.2 A PVL job cannot be canceled

Diagnosis 1 : A connectivity failure exists between the PVL and the MVR or PVR.

Resolution : Loss of inter-server connectivity should be evident via the HPSS Alarms and Events window (Figure 1-5 HPSS Alarms and Events Window). Determine where connectivity problems exist and proceed to the problem diagnosis below.

Diagnosis 2 : An internal PVL queue inconsistency exists.

Resolution : Restart the PVL. If the problem persists, check to see if the job involves a tape mount by clicking the Job Info button on the PVL Job Queue window (Figure 7-2 PVL Job Queue Window). If the job is a tape mount, determine the PVR involved by using the Device/Drive List window (Figure 10-1 Device/Drive List Window) to select the Drive ID and using the Info ... button to activate the Drive Information window. Shut down and restart the PVR in question.

Diagnosis 3 : A storage server has I/O requests outstanding which reserve the device causing the PVL to issue rewind_and_elevate errors.

Resolution : If data is moving to the device (this can be detected via one of the following SSM screens: Mover Device Information , Storage Map Information , or Mover Information ), wait for the I/O to complete. If the I/O appears hung, use a utility such as lsof , and grep for Mover processes which have the device open. Kill the Mover processes and restart if necessary.

Diagnosis 4 : The Platform which hosts the Mover processes and/or device files is down. This is causing the PVL to issue rewind_and_elevate errors.

Resolution : Lock the drive via the HPSS Devices and Drives screen. This will cause the PVL to exit the drive unload loop and issue a dismount to the PVR. You may have to manually unload the drive in order for the dismount to occur.

13.2.6.3 SS mount requests are not appearing in PVL job queues

Diagnosis 1 : A connectivity failure exists between the PVL and the SS.

Resolution : Loss of PVL-to-SS connectivity should be evident via the HPSS Alarms and Events window (Figure 1-5 HPSS Alarms and Events Window). Proceed to the SS connectivity failure problem given below in Section 13.2.6.7.

Diagnosis 2 : PVL and SS queues are not synchronized.

Resolution : After ensuring that the requests in question do not exist, restart the PVL. If this fails to correct the problem, restart the appropriate tape SS.

13.2.6.4 A tape cartridge is physically mounted in a drive but is not recognized by the system as being mounted

Diagnosis 1 : A connectivity failure exists between the PVL and an MVR.

Resolution : Loss of PVL-to-MVR connectivity should be evident via the HPSS Alarms and Events window. Proceed to the MVR connectivity failure problem description given below in Section 13.2.6.8.

Diagnosis 2 : Drive polling is not enabled for operator PVR.

Resolution : If the drive in question is in an operator PVR (is mounted by hand), polling may not have been enabled for the drive in question. Enable polling for the appropriate drive using the PVL Drive Information window.

13.2.6.5 A drive has been added to the PVL but is not being used by the system

Diagnosis 1 : The drive in question has not been enabled.

Resolution : Use the Drive Information window (Figure 5-3 PVL Drive Information Window) to enable the drive in question for reading and/or writing.

Diagnosis 2 : The PVL and MVR were not restarted after reconfiguration.

Resolution : Restart both the appropriate MVR and PVL for the modified configuration to take effect.

13.2.6.6 Imports of cartridges fail due to improper labeling

Diagnosis : The Import Type field was used incorrectly.

Resolution : The value of the Import Type field can be either Default or Scratch . For disk, always use the Scratch import type.

Specifying Scratch will cause a label to be written on to the media no matter what is currently on it, potentially causing any data on the media to be lost. Because of the potential danger of importing media as Scratch , a dialog box will appear to confirm the choice.

Specifying Default will cause an action depending on how the media is labeled (i.e., tape or disk). For tape media, the action taken is based on the current volume label type:

HPSS: The media is imported. The media has an ANSI label (it starts with an 80-byte block beginning with the characters VOL1), and the owner field of the ANSI label is set to HPSS.

Foreign: The media is imported. The media has an ANSI label, but the owner field is not HPSS.

Non-ANSI: The import fails. The media starts with an 80-byte block that does not begin with the characters VOL1.

No label, but data found: The import fails.

No label or data: The cartridge is labeled and imported.
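The tape cases for a Default import amount to a small decision procedure, sketched below. The function name is hypothetical, and the position of the owner field (bytes 38 to 51 of the VOL1 label, following the usual ANSI tape label layout) is an assumption that is not stated in this section. A first block that is not 80 bytes long is treated here as "data found", which is also an interpretation.

```python
OWNER_FIELD = slice(37, 51)  # assumed ANSI VOL1 owner identifier position


def default_tape_import(first_block):
    """Return (label_type, imported) for a Default import of a tape.

    `first_block` is the first block read from the tape as bytes,
    or None if the tape holds neither a label nor data.
    """
    if first_block is None:
        return ("No label or data", True)   # cartridge is labeled, then imported
    if len(first_block) != 80:
        return ("No label, but data found", False)
    if first_block[:4] != b"VOL1":
        return ("Non-ANSI", False)
    owner = first_block[OWNER_FIELD].decode("ascii", "replace").strip()
    if owner == "HPSS":
        return ("HPSS", True)
    return ("Foreign", True)
```

Only the two failing cases protect existing contents; a Scratch import bypasses this check entirely and relabels the media regardless of what is on it.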

For disk media, the current volume label is read and if the volume identifier matches the identifier specified in the import request, the label is rewritten. This is done in case the volume is being re-imported with either a different block size or number of blocks, because these values are placed in the disk volume label. The MVR can then verify that the label matches the device configuration metadata. If the current volume identifier does not match the identifier specified in the import request, the import will fail.

13.2.6.7 PVL cannot connect to the SS

Diagnosis : The SS is not running or is not responding.

Resolution : Restart the SS. For additional information that may help with this problem, see Sections 13.1.1.1 and 13.1.1.4.

13.2.6.8 PVL cannot connect to an MVR

Diagnosis : The involved MVR is not running or is not responding.

Resolution : Restart the MVR in question. For additional information that may help with this problem, see Sections 13.1.1.1 and 13.1.1.4.

13.2.6.9 PVL cannot connect to the PVR

Diagnosis : The PVR is not running, or is not responding.

Resolution : Restart the PVR. For additional information that may help with this problem, see Sections 13.1.1.1 and 13.1.1.4.

13.2.6.10 PVL cannot connect to SSM

Diagnosis : The SSM System Manager is not running or is not responding.

Resolution : If the SSM window is responding, check the status of SSM's connection to the PVL in the Status field on the Server List window (Figure 1-1 HPSS Servers Window) and reconnect if necessary. The PVL will attempt to connect to SSM indefinitely. For additional information that may help with this problem, see Sections 13.1.1.1 and 13.1.1.4.

13.2.6.11 Errors occur while reading, writing, creating, or deleting entries in metadata (SFS) files

Diagnosis : Error codes indicate that SFS is not running or is not responding.

Resolution : Check the status of SFS with Encina administration tools, and restart if necessary. The PVL will retry timed-out SFS I/O operations a fixed number of times before generating an error.

13.2.7 Physical Volume Repository (PVR) Problems

13.2.7.1 PVR is unable to communicate with a robot

Diagnosis : These errors are usually caused by configuration problems outside the control of HPSS.

Resolution: Verify that a non-HPSS process is able to talk to the robot. It is best to use the robot's own control software.

For a 3494 robot, try to mount and dismount a tape using the mtlib command on the workstation that is running the PVR.

For an STK robot, try to mount and dismount a tape from the Automated Cartridge System Library Software (ACSLS) console. Also make sure the Storage Server Interface (SSI) process is running on the same workstation as the PVR. The SSI process must have been started before the PVR.

For an ADIC AML robot, try to mount and dismount a tape using dasadmin commands. Note that the user must use the command " mt -f /dev/rmtxx rewoffl " to rewind and elevate the tape before issuing the dasadmin dismount command.

For LTO libraries, shut down the PVR and try to talk to the robot through the tapeutil tool. Using tapeutil, open /dev/smc0 (or whatever the device-specific file is called) and issue mount , dismount , and move volume commands.

If the non-HPSS processes are able to mount and dismount tapes, check the PVR configuration.

For LTO, if you cannot open the /dev/smc* file, another process may have control over the library. Remember that only one process can talk to the library at a time, so any other process with an open SMC special device file descriptor will have to be terminated.

For 3494/3495 PVRs, verify that the Command and Async devices (generally /dev/lmcp0 ) are valid and available. Verify that the HPSS_3494_COMMAND_DEVICE and HPSS_3494_ASYNC_DEVICE HPSS environment variables are not defined.

For STK robots, check that the packet version used by the PVR is the same as the packet version used by the SSI and ACSLS. Note that the packet version number is usually one less than the ACSLS software version number. Be sure that this number agrees with the HPSS environment variable ACSAPI_PACKET_VERSION .

For ADIC AML, check the configured Server Name and Client Name fields in the PVR Type Specific Configuration entry. Make sure that these are the same as in the OS/2 PC configuration file. Also monitor the OS/2 log file for additional information. The error messages are described in detail in the EMASS Storage Systems AMU Reference Guide .

13.2.7.2 PVR operational state is Major

Diagnosis 1 : A cartridge has failed to mount.

Resolution: If a cartridge is supposed to be mounted by a human operator and the mount has been outstanding for about 20 minutes, the Operational State will be set to Major to signify that the mount is taking too long. If a cartridge is supposed to be mounted by a robot, but the robot is unable to mount the cartridge, a message will be logged indicating the problem and the Operational State will be set to Major .

Correct the problem indicated in the log and then force the mount to retry by setting the PVR's Administrative State to Repaired . If the mount fails again, the Operational State will remain set to Major . All mounts that have failed will be retried when the PVR is repaired. They will also be retried every 5 minutes. If any mount fails, the Operational State will be set to Major .

It is possible for the Operational State to be set to Major even if there are no mounts currently pending. If a mount fails due to a transient condition, the Operational State will be set to Major . If the automatic retry successfully mounts the cartridge later, the Operational State will remain set to Major . This allows the operator to identify and correct the transient condition. Set the Administrative State to Repaired to clear the Operational State.
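The behavior described in this resolution, in which the Operational State latches at Major until an operator marks the PVR as repaired, can be summarized as a small state model. The sketch below is a reading of the text only; the class and method names are hypothetical, and the 5-minute automatic retry is represented by a constant rather than a timer.

```python
class PvrState:
    """Toy model of the PVR Operational State described in the text."""

    RETRY_SECONDS = 5 * 60  # failed mounts are also retried every 5 minutes

    def __init__(self):
        self.operational_state = "Normal"
        self.failed_mounts = set()

    def mount_result(self, cartridge, succeeded):
        """Record the outcome of a mount attempt or of an automatic retry."""
        if succeeded:
            # A later success does NOT clear Major; only a repair does.
            self.failed_mounts.discard(cartridge)
        else:
            self.failed_mounts.add(cartridge)
            self.operational_state = "Major"

    def mark_repaired(self, try_mount):
        """Administrative State set to Repaired: clear, then retry failures."""
        self.operational_state = "Normal"
        for cartridge in sorted(self.failed_mounts):
            self.mount_result(cartridge, try_mount(cartridge))
```

The latch is the point of the design: a transient failure that the automatic retry later recovers from still leaves a visible Major state for the operator to investigate.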

Diagnosis 2 : For IBM 349x robots, insufficient drives are available to honor a mount request.

Resolution : This problem may have occurred due to the injection of a cleaning cartridge into a drive. The PVL is responsible for maintaining the available drive count; however, the PVL has no ability to know when a cleaning cartridge is injected. The PVR is very persistent and the problem will correct itself usually within 5 minutes. It is necessary to notify the PVR of repair in order to reset the Operational State to Normal . The PVR will continue to recheck for an available drive at 5-minute intervals until the problem is resolved.

13.2.7.3 3494 PVR fails to shut down

Diagnosis : The 3494 PVR spawns a child process. If that process fails to shut down, the lmcp daemon has hung the process.

Resolution: To correct this problem in the short term, use kill -9 as the root user to terminate the lmcp daemon. When the daemon dies, the PVR processes will halt. This problem is indicative of an out-of-date lmcp daemon. To permanently fix the problem, install the latest version of the lmcp daemon.

13.2.8 Gatekeeper (GK) Problems

13.2.8.1 The BFS is not calling the Gatekeeper

Diagnosis 1: The Gatekeeper is not up.

Resolution : Restart the Gatekeeper in question. The BFS will attempt to connect to this Gatekeeper the next time it is needed. If the site policy increased the types of requests being monitored, then the BFS will not find this out until one of the types of requests previously being monitored is issued. For example, if the BFS was monitoring open requests and the site policy was changed to monitor open and create requests then the BFS won't know about the change until it attempts to issue an open to the DOWN Gatekeeper. As a general rule, it is recommended that the BFS be recycled whenever the GK site policy changes the types of requests to be monitored.

Diagnosis 2 : The Gatekeeper is not configured into the Storage Subsystem.

Resolution : Configure the Gatekeeper into the Storage Subsystem corresponding to the BFS. Recycle the BFS.

Diagnosis 3 : The site policy increased the types of requests being monitored.

Resolution : As a general rule, it is recommended that the BFS be recycled whenever the GK site policy changes the types of requests to be monitored.

Diagnosis 4 : The CDS ACL for the GK may not be set correctly.

Resolution : Fix the GK security object to contain the following ACL entry:

{user hpss_bfs rw---}

13.2.8.2 The wrong request types are being monitored

Diagnosis : The types of requests being monitored have changed.

Resolution : Recycle the GK and the BFS.

13.2.8.3 The Gatekeeper is not doing any gatekeeping

Diagnosis : The default gatekeeping policy is to do NO gatekeeping.

Resolution : Write the site customizable gatekeeping policy module. See 2.6.6 in the HPSS Installation Guide and the HPSS Programmer's Reference, Vol 1.

13.2.8.4 The SSM cannot contact the Gatekeeper

Diagnosis : The CDS ACL for the GK may not be set correctly.

Resolution : Fix the GK security object to contain the following ACL entry:

{user hpss_ssm rw--c}

13.2.8.5 The Gatekeeper won't start/load

Diagnosis 1 : The shared libraries have been moved/deleted.

Resolution : Issue the " ldd /opt/hpss/bin/hpss_gk " command on Solaris or the " dump -H /opt/hpss/bin/hpss_gk " command on AIX to list the dynamic dependencies of the Gatekeeper dynamic executable. Check that the libgksite.* and libacctsite.* shared libraries are actually located in the pathname displayed by the ldd or dump command. If they differ, then rebuild the Gatekeeper (see 4.5.2 Remake HPSS of the HPSS Installation Guide for more information on rebuilding HPSS).
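When the ldd output is long, the unresolved libraries can be picked out with a small script. The Python sketch below assumes that an unresolved dependency is printed on a line containing "=>" and the words "not found", which is the usual ldd convention; it does not parse AIX dump -H output. The function name and the library file names in any sample input are hypothetical.

```python
def missing_libraries(ldd_output):
    """Return the names of libraries that ldd reported as not found.

    ldd_output is the captured text of an ldd run, for example the
    output of: ldd /opt/hpss/bin/hpss_gk
    """
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like "libname => not found" or
        # "libname => (file not found)" depending on the platform.
        if "=>" in line and "not found" in line:
            missing.append(line.split("=>", 1)[0].strip())
    return missing
```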

Diagnosis 2: The shared libraries have the wrong permission.

Resolution: Verify that the Gatekeeper Server process has read permission for the libraries it loads (e.g., libgksite.* and libacctsite.*).

Diagnosis 3 : The Site Policy Path Name in the Gatekeeper server configuration is bad.

Resolution : The Site Policy Path Name is passed to the gk_site_Init() routine, which is written by the site. If the site implements this routine to return an error (for example, because the site policy path name is invalid), the Gatekeeper will crash. Verify that the configured Site Policy Path Name is one that the site's gk_site_Init() routine accepts.

13.2.8.6 Account Validation Fails to Initialize

Diagnosis 1 : The BFS or NS terminates during startup, complaining that it cannot initialize Account Validation.

Resolution : Examine the logged error code as well as any Account Validation errors logged recently. Make sure the Accounting Policy has been created and initialized properly. Make sure the Global Configuration metadata has been set up. Make sure the local cell id has been set up properly in the trusted cell table. If Account Validation is enabled, make sure at least one Gatekeeper has been defined and is marked executable.

Diagnosis 2 : The Gatekeeper terminates during startup complaining that it cannot initialize Account Validation.

Resolution : Make sure an Accounting Policy has been defined. If you have written a site policy module, make sure it is working properly.

13.2.9 Mover (MVR) Problems

13.2.9.1 MVR performs poorly

Diagnosis 1 : A problem exists with the MVR internal buffer size.

Resolution: If the MVR buffer size is too small, the MVR will perform numerous separate requests when a single request could be made to perform the same input or output operation. If the MVR buffer size is too large, the MVR could reserve too much system virtual memory, requiring frequent paging of MVR and other process memory (which will also decrease performance). Also, if the buffer size is too large, transfers may be completed without the MVR receiving any benefit from double buffering. (For example, if the MVR buffer size is 4 MB but a majority of client requests are 4 MB or less, the Mover will complete the transfer using one buffer, thus not allowing any client and device I/O time to be overlapped.)
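The trade-off can be made concrete with a little arithmetic. The Python sketch below is a simplified model, not the Mover's actual algorithm: it assumes one device request per buffer fill and assumes that client and device I/O can overlap only when a transfer needs more than one buffer. The function names are hypothetical.

```python
MB = 1024 * 1024


def device_requests(transfer_bytes, buffer_bytes):
    """Number of buffer-sized requests needed for one transfer (ceiling)."""
    return -(-transfer_bytes // buffer_bytes)


def can_overlap_io(transfer_bytes, buffer_bytes):
    """True if the transfer spans more than one buffer, so that double
    buffering can overlap client and device I/O."""
    return transfer_bytes > buffer_bytes
```

Under this model, a 16 MB transfer with a 1 MB buffer takes 16 requests, while a 4 MB transfer with a 4 MB buffer takes one request and gets no overlap, which matches the example in the text.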

Diagnosis 2 : A disk device is configured to use the block special file.

Resolution: If the device is configured to use the block special file, the data will be buffered by the operating system, which could cause additional overhead during read (primarily) and write operations. Also, data that the MVR believes has been written to disk may in fact only be stored in system memory, waiting to be flushed to disk.

Diagnosis 3 : A disk device is not configured to use multiple MVR tasks.

Resolution: If the device is not configured to use multiple MVR tasks, the MVR will single-thread I/O requests for that device. Modify the device configuration to enable multiple MVR tasks.

13.2.9.2 MVR cannot be started

Diagnosis 1 : The MVR could not bind to the TCP/IP port number specified in the MVR specific configuration file.

Resolution: Verify that the hostname specified in the MVR specific configuration file relates to a valid network interface for the machine on which the MVR is running, and that the port number specified is a valid port number and one that is not in use by another process (possibly another HPSS MVR that was previously started on the same machine).
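One quick way to test whether a host and port combination is usable is to try binding to it from a script. The Python sketch below is a generic check and not part of HPSS; the function name is hypothetical. It returns False when the address is not a valid local interface or when the port is already in use.

```python
import socket


def port_is_free(host, port):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Address not local, port in use, or insufficient privilege.
            return False
    return True
```

Run the check as the same UNIX user that runs the MVR, using the hostname and port from the MVR specific configuration, while the MVR itself is stopped.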

Diagnosis 2 : The MVR could not bind to a UNIX domain socket used for intra-MVR communication.

Resolution: The MVR uses a set of UNIX domain sockets that are placed in /var/hpss/tmp while the MVR is running. If an MVR was previously running under a different UNIX user ID and was not cleanly shut down, the sockets may be left in the file system, and the newly started MVR may not be able to remove them. If this is the case, a user with sufficient privilege must remove the socket files in /var/hpss/tmp before the MVR can be run by the second user. The socket file names all begin with the prefix Mvr .
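Leftover socket files owned by another user can be located before they are removed. The Python sketch below lists entries in a directory whose names begin with the Mvr prefix and whose owner differs from a given UID. The function name is hypothetical, and the script only reports files; removing them is left to the administrator.

```python
import os


def mover_files_owned_by_others(directory, current_uid, prefix="Mvr"):
    """Return sorted (name, owner_uid) pairs for entries in directory whose
    names start with prefix and that are not owned by current_uid."""
    leftovers = []
    for entry in os.scandir(directory):
        if entry.name.startswith(prefix):
            owner = entry.stat(follow_symlinks=False).st_uid
            if owner != current_uid:
                leftovers.append((entry.name, owner))
    return sorted(leftovers)
```

For example, mover_files_owned_by_others("/var/hpss/tmp", uid) with the UID of the user that will run the MVR shows which files that user may be unable to remove.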

Diagnosis 3 : The MVR cannot start either the DCE request process or the TCP/IP request process.

Resolution: The MVR DCE program must be located in the same directory as the main MVR process (or else the environment for the Startup Daemon must contain the directory in its executable path). Since the DCE process reads the MVR configuration metadata, it must be started before any configured values are known. The MVR TCP/IP program pathname is contained in the MVR specific configuration file, and may be verified by examining that information.

Diagnosis 4 : The Mover is configured for non-DCE mode and the non-DCE node inetd configuration is incorrect.

Resolution: This diagnosis is likely to be correct if the Parent Mover process (on the DCE/Encina node) generated an alarm message indicating that it cannot establish a connection to the remote (non-DCE/Encina) node. To correct the problem, verify the /etc/services and /etc/inetd.conf configuration are correct (see 6.8.8.2 MVR Configuration to Support Non-DCE Execution Mode in the HPSS Installation Guide). Also verify (typically via netstat ) that there is a listen waiting on the appropriate TCP port.

Diagnosis 5 : The Mover is configured for non-DCE mode and the encryption key is out of sync between the two Mover nodes.

Resolution: In this case, the Mover should generate an alarm message indicating that there is an encryption key mismatch with the non-DCE/Encina node. To resolve the problem, verify that the encryption key file (referenced in the /etc/inetd.conf file) on the non-DCE/Encina node contains the same value as is configured in the Mover's type specific configuration.

Diagnosis 6 : The Mover is configured for non-DCE mode and the clock skew between the DCE/Encina node and the non-DCE/Encina node (the two nodes that this Mover is executing across) is greater than the maximum allowable difference (currently 5 minutes).

Resolution: In this case, the Mover should generate an alarm message indicating that the clock skew is too great between the DCE/Encina node and the non-DCE/Encina node. To resolve the problem, one or both of the nodes' clocks must be adjusted so that they are within the allowable difference.
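The skew test itself is simple arithmetic. The Python sketch below assumes that both clock readings are available as seconds since the epoch (for example, collected by running date on each node); the function and constant names are hypothetical, and the 5-minute limit is the value stated above.

```python
MAX_SKEW_SECONDS = 5 * 60  # maximum allowable difference stated in the text


def skew_exceeds_limit(dce_node_time, non_dce_node_time, limit=MAX_SKEW_SECONDS):
    """Return True if the two clock readings (in seconds) differ by more
    than the allowed limit, in either direction."""
    return abs(dce_node_time - non_dce_node_time) > limit
```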

13.2.9.3 MVR cannot write a label to a tape

Diagnosis 1 : The MVR does not have the required privilege to access the device.

Resolution: Verify that the tape device special file is defined such that the user under which the MVR is running is able to access the file for both reading and writing. If the device is a DD-2 tape device, the MVR must be run under the root user id to allow appropriate access to the device.

Diagnosis 2 : The tape device is not configured to support reading and writing variable size blocks.

Resolution: Verify that the tape device is defined such that it will support variable size blocks. This involves defining the block size of the device to be zero. Consult the platform and device driver documentation on how to set the block size for the device.

13.2.9.4 MVR cannot read the label from a previously labeled tape

Diagnosis 1 : The tape device is configured as being able to support using the no delay flag on open, but in fact the device driver does not support issuing tape operations if the device was opened using the no delay flag.

Resolution: Change the device configuration to turn off the no delay support flag.

Diagnosis 2 : The tape device is not configured to support variable block sizes (either because the device was reconfigured or the tape was read on a device other than the one that was used to write the label).

Resolution: See the Resolution for Diagnosis 2 in Section 13.2.9.3.

13.2.9.5 Tape positioning operations are performing poorly

Diagnosis 1 : The tape device is not configured to support absolute positioning (fast locate).

Resolution: Change the device configuration to turn on the fast locate support flag if the device and driver interface support fast locate .

Diagnosis 2 : The MVR was not built with the compilation flag to include code for the device specific device driver interface, which would allow absolute positioning (fast locate) to be used.

Resolution: Rebuild the MVR to include support for the specific device driver interface being used, and modify the device configuration to turn on the fast locate support flag (if necessary).

13.2.9.6 Network transfers are performing poorly

Diagnosis 1 : The routing tables on the node on which the Mover is running are incorrect.

Resolution: Verify that the system routes defined are causing the Mover to use the expected network connectivity when communicating with a remote client.

Diagnosis 2 : The networking options defined in the HPSS network option file ( hpss_netopt.conf ) are not optimally set for the utilized networks.

Resolution: Verify the correctness of the network configuration file hpss_netopt.conf on the mover machine for the utilized networks. See Section 5.8.6 for further details.

13.2.9.7 MVR cannot perform an LFT data transfer

Diagnosis 1 : The MVR reports an access error while performing an LFT data transfer and logs an alarm message indicating that the local file could not be opened.

Resolution : Verify that the file system containing the specified file is mounted locally on both the MVR node and the client node.

Diagnosis 2 : The MVR reports an access error while performing an LFT data transfer and logs an alarm message indicating that the specified path is not configured for LFT data transfer.

Resolution : Verify that the LFT configuration file contains a path that matches the base of the requested file path. See 6.8.8.4 in the HPSS Installation Guide for more details on configuring Local File Transfer.

Diagnosis 3 : The MVR reports an access error while performing an LFT data transfer.

Resolution : Verify that the MVR is running as the ' root ' user. Because an MVR using LFT must read and write files with varying ownership and permissions, it must be run as the ' root ' user.

Diagnosis 4 : An MVR managing a tape device, on the same machine as an LFT MVR, reports a shared memory access error during migration and stage.

Resolution : Since an LFT MVR must run as the ' root ' user, any other MVR on the same machine must also run as the ' root ' user. This is because shared memory is used for data transfer between two MVRs on the same machine, and the shared memory segment is created with permissions that allow only user access.

13.2.10 Non-DCE Client Gateway/Non-DCE Client API problems

13.2.10.1 The Non-DCE Client Gateway will not start

Diagnosis 1 : The NDCG could not bind to the TCP/IP port number specified in the NDCG specific configuration file.

Resolution: Verify that the port number specified is a valid port number and one that is not in use by another process (possibly another NDCG that was previously started on the same machine).

Diagnosis 2 : Changes have been made to the /etc/inetd.conf and/or /etc/services files without recycling the inetd .

Resolution : Recycle the inetd .

13.2.10.2 Non-DCE client applications cannot communicate with the NDCG

Diagnosis : The Non-DCE Client API is attempting to connect to the Non-DCE Client Gateway through the wrong TCP/IP port.

Resolution : Verify that the port number specified in the Non-DCE Client Gateway's specific configuration is the same as the port specified with the HPSS_NDCG_SERVERS and HPSS_NDCG_TCP_PORT environment variables in the environment in which the non-DCE client application is running.

13.2.10.3 Non-DCE client applications occasionally have failed API calls

Diagnosis 1 : The Non-DCE Client Gateway has more concurrent requests than it is configured to handle from a single client.

Resolution : Increase the values of the Maximum Request Queue Size and/or Maximum Thread Pool size in the Non-DCE Client Gateway server-specific configuration.

Diagnosis 2 : The Non-DCE Client Gateway has established its maximum number of client connections, and all subsequent connection attempts fail.

Resolution : Increase the value of the Maximum Processes field in the Non-DCE Client Gateway server-specific configuration. Alternatively, a better solution may be to configure an additional Non-DCE Client Gateway on another node in order to distribute the work load between multiple machines.

13.2.11 Logging Services Problems

13.2.11.1 Logging performance is sluggish

Diagnosis : A large number of messages are being generated.

Resolution: Change the Logging Policy to filter out unneeded messages (the recommended record types to filter out first are Trace, Request and Debug messages). To set or modify Logging Policy, call up the Log Policy window and select Logging Policies . Modify metadata and reinitialize logging service components as necessary. The logging policy may also be modified by selecting Log Policy... from the HPSS Servers window or from the Basic Server Configuration window. Refer to 6.6.4 in the HPSS Installation Guide for additional details.

13.2.11.2 Log Daemon (or Log Client) will not start/restart

Diagnosis 1 : A Log Daemon (or Log Client) may already be executing.

Resolution: If the process is already executing and must be recycled, terminate the existing process and attempt to restart.

Diagnosis 2 : The Log Client and/or Log Daemon may not have sufficient group permissions to access the log files.

Resolution: The Logging processes should be started from the same group as the Logging processes that originally created the files.

Diagnosis 3 : The Log Daemon will not initialize if both the primary and secondary log files ( logfile01 / logfile02 ) are marked current, if neither is marked current, or if either file is marked invalid.

Resolution: This error is not likely to occur, but if either log file does become corrupted, logfile01 and logfile02 should be deleted. Restart the Log Daemon and new files will be created.
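The initialization rule in Diagnosis 3 can be written as a small predicate. In the Python sketch below, the state labels "current", "not current", and "invalid" are a hypothetical encoding chosen for illustration; the log file headers themselves are not parsed.

```python
def log_daemon_can_initialize(states):
    """Return True if the log file markings allow initialization.

    states holds one label per log file (logfile01, logfile02), each
    being "current", "not current", or "invalid". Initialization is
    possible only if no file is invalid and exactly one is current.
    """
    if "invalid" in states:
        return False
    return sum(1 for state in states if state == "current") == 1
```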

13.2.12 Network File System (NFS) Problems

13.2.12.1 NFS Daemon will not start

This problem can occur for several reasons. Review the HPSS log for messages that explain the nature of the problem and then proceed as discussed below.

Diagnosis 1 : NFS daemon was unable to read the HPSS exports file.

Resolution : Check for file existence, permissions, and correctness.

Diagnosis 2 : The HPSS root uid does not have an account in the DCE security registry.

Resolution : Use dcecp to verify whether an account exists.

Diagnosis 3 : The NFS Daemon was unable to read the credentials map file.

Resolution : Check for file existence, permissions, and correctness.

Diagnosis 4 : The NFS Daemon was unable to read the data cache and checkpoint files.

Resolution : Check for file existence and permissions.

Diagnosis 5 : The NFS Daemon was unable to open its SFS configuration file.

Resolution : Check for file existence and permissions.

Diagnosis 6 : Security initialization failed.

Resolution : Review the HPSS log for security subsystem messages and take appropriate action.

Diagnosis 7 : Errors occurred while initializing DCE/Encina.

Resolution : Turn on Trace and Debug record types if they are not set and attempt to restart the NFS server. If the server is having trouble registering its service, make sure the NFS CDS directory exists and that it has a proper ACL.

Diagnosis 8 : The NFS Daemon will not remain up.

Resolution : Make sure the Bitfile Server (BFS) is up.

Diagnosis 9 : The NFS Daemon died with "Binding Address In Use..." error.

Resolution : Determine whether another application is using port 2049, then free the port for NFS. Public domain tools, such as lsof , can be used to identify the application currently using port 2049.

13.2.12.2 Mount Daemon will not start

Review the HPSS log for messages that help explain the nature of the problem and then proceed as discussed below.

Diagnosis 1 : The Mount Daemon was unable to read the HPSS exports file. This problem is similar to the NFS Daemon not starting.

Resolution: Check for file existence, permissions, and correctness.

Diagnosis 2 : The HPSS root uid does not have an account in the DCE security registry.

Resolution: Use dcecp to verify whether an account exists.

Diagnosis 3 : The Mount Daemon was unable to open its SFS configuration file.

Resolution: Check for file existence and permissions.

Diagnosis 4 : Security initialization failed.

Resolution: Review the HPSS log for security subsystem messages and take appropriate action.

Diagnosis 5 : Errors occurred while initializing DCE/Encina.

Resolution: Turn on Trace and Debug record types if they are not set and attempt to restart the Mount Daemon. If the server is having trouble registering its service, make sure the Mount Daemon CDS directory exists and that it has a proper ACL.

Diagnosis 6 : The Mount Daemon was unable to open the NFS Server SFS configuration file.

Resolution: Check for the file's existence and determine whether the Mount Daemon DCE principal has permission to read. If not using the default NFS descriptive name, set the HPSS_NFS_DESC_NAME environment variable in the /opt/hpss/config/hpss_env file.

Diagnosis 7 : The NFS Mount Daemon will not remain up.

Resolution : Make sure the Bitfile Server (BFS) is up.

Diagnosis 8 : In some situations, if one of the HPSS servers is down, the NFS daemon will not be able to process the exports file, and so will shut itself down.

Resolution : If possible, start the missing HPSS server. If that cannot be done, remove the fileset that is exported by that server from the exports file.

13.2.12.3 NFS server is unable to recover data cache

Diagnosis 1 : During recovery, if a bitfile is noticed to have been removed through other means, such as FTP, it is skipped and the remaining files are recovered. Errors during recovery are of two types: retryable and fatal. Retryable errors include communication errors, out of memory conditions, single bitfile errors, invalid NFS Daemon configuration parameters, and most other errors.

Resolution : When in doubt, assume an error is retryable. The underlying cause of the error should be fixed, if it is not transient, and the NFS Daemon should be restarted. For example, if a particular bitfile always returns an error during recovery, it should be removed, for example with ftp . Usually if a single bitfile is returning an error, the remaining files in the cache will be recovered properly before the NFS Daemon terminates.

Diagnosis 2 : Fatal errors include invalid checksums on either the checkpoint or cache file headers, missing checkpoint and/or cache files, and any retryable error that does not respond to the Diagnosis 1 Resolution above.

Resolution : For fatal errors, remove any remaining checkpoint and cache files and restart the NFS Daemon. Any unrecoverable information will be lost.

13.2.12.4 NFS server is unable to recover credentials map

Diagnosis : The credentials map file specified in the NFS configuration does not exist, does not allow access by the NFS UNIX UID, or is corrupted.

Resolution : Check the NFS configuration for a correct credentials dump file. If the configuration is correct and the file exists, view the file contents for corruption. If the file looks corrupted, delete the file and use the /usr/bin/touch binary to re-create the file. Then restart the NFS server.

13.2.12.5 Data transfer performance is slow

Several warning messages may be logged by the NFS Daemon that may indicate that the daemon's data cache configuration needs modification. Normally these messages only appear under heavy read/write load. If they start appearing often, follow the directions below.

Diagnosis 1 : "No free memory buffers available" message appears.

Resolution : This message indicates that all of the available memory allocated for handling read/write requests to and from HPSS is in use. While this temporary condition exists, requests will return a "busy" result and will be retried later, degrading performance. If this message appears often, increase the Memory buffers field. Since the total memory allocated is the number of memory buffers multiplied by the Buffer Size field, you might have to decrease the Buffer Size field if not enough real memory is available. It is a good idea to keep the Buffer Size field a power of two, with a minimum of half a megabyte.
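The sizing rules above reduce to two short calculations. The Python sketch below assumes the Buffer Size field is expressed in bytes, which the text does not state; the function names are hypothetical.

```python
HALF_MEGABYTE = 512 * 1024


def total_buffer_memory(memory_buffers, buffer_size):
    """Total memory allocated: number of buffers times the buffer size."""
    return memory_buffers * buffer_size


def buffer_size_follows_guideline(buffer_size):
    """True if buffer_size is a power of two and at least half a megabyte."""
    is_power_of_two = buffer_size > 0 and (buffer_size & (buffer_size - 1)) == 0
    return is_power_of_two and buffer_size >= HALF_MEGABYTE
```

For example, 64 buffers of 1 MB each require 64 MB of memory; if that much real memory is not available, halving the buffer size to 512 KB still satisfies the guideline.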

Diagnosis 2 : "No free cache entries available" message appears.

Resolution : This message (a symptom that the entire cache is filled with information that needs to be written to HPSS) indicates that the NFS Daemon is having trouble writing information to HPSS faster than it is coming in from the client. While this temporary condition exists, requests will return a "busy" result and will be retried later, degrading performance. If this message appears often, there are several things that may be tried.

Increase the Cache Entries field. If no more disk space is available, it might be necessary to also decrease the Buffer Size field. It is a good idea to keep the Buffer Size field a power of two, with a minimum of half a megabyte.

Decrease the Thread Interval field, but do not make it too small, because performance will degrade as the system searches more often for entries to write to HPSS. The average period between cache entries written to HPSS is the Thread Interval field value divided by the number in the Cleanup Threads field.

Increment the Cleanup Threads field by one. The average period between cache entries written to HPSS is the Thread Interval field value divided by the number in the Cleanup Threads field.
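The relationship between the two fields can be checked before changing the configuration. The Python sketch below simply applies the division described above; the function name is hypothetical, and the result is in whatever time unit the Thread Interval field uses.

```python
def average_write_period(thread_interval, cleanup_threads):
    """Average period between cache entries written to HPSS:
    the Thread Interval divided by the number of Cleanup Threads."""
    if cleanup_threads < 1:
        raise ValueError("cleanup_threads must be at least 1")
    return thread_interval / cleanup_threads
```

With a Thread Interval of 60 and 4 Cleanup Threads the average period is 15; adding a fifth thread lowers it to 12.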

Diagnosis 3 : "Having trouble creating free data cache entries" message appears.

Resolution : The resolution is the same as for Diagnosis 2 above, except that it may also be beneficial to increase the Memory Buffers field. If the data cache configuration does not appear to be the problem, check the network configuration. Check for socket buffer overflows in the output of the netstat -s command. If overflows are occurring, check the maximum socket buffer size by looking at the sb_max parameter in the output of the no -a command. Increasing this parameter should reduce the socket buffer overflows.

13.2.12.6 ls/stat/pwd performance is slow

Diagnosis : This problem can be caused by inadequate memory storage for the attribute cache or inadequate cache hold time.

Resolution : Look at the NFS statistics Directory Cache and Header Cache blocks for the Hits and Faults fields. If the Faults field is constantly increasing and the NFS requests are not creating objects (for example, through create , link , mkdir , or symlink requests), adjust the values for Directory Size , Table Entries , LRU Max Len , and Holdtime . Refer to Chapter 5, HPSS Configuration, for suggested values. If reconfiguring the attribute cache does not help performance, check the network performance and SFS configuration.

13.2.12.7 Bitfiles written using NFS are not flushed to HPSS

Diagnosis : This problem can occur if the recover data cache option is not in the NFS configuration.

Resolution : Check the NFS configuration to verify whether the Recover Cached Data field is set. If it is not set, set it and restart the NFS server. If the field is set, check the status of the BFS and SS.

13.2.12.8 NFS top-level storage class is running out of space and migrate/purge does not reclaim space

Diagnosis : This problem can occur when the NFS server is shut down without flushing the data cache and files that were being read/written are deleted when the NFS server is restarted.

Resolution : Look at the BFS segment checkpoint SFS file. If it contains records, cycle the BFS.

13.2.12.9 Changes to the exports file aren't being recognized

Diagnosis : The NFS and Mount daemons do not automatically recognize when the exports file has been changed, and there is currently no mechanism to signal the daemons to reread the file.

Resolution : Restart the NFS and Mount daemons.

13.2.12.10 NFS seems unreliable

Diagnosis : If NFS has been mounted soft and the default options are used, NFS may time out too quickly.

Resolution : Tell users to increase the timeo and retrans parameters on their mount commands. Values of 30 and 100 respectively are not unreasonable. Coordinate these values with the disk and memory cache parameters for the NFS daemon. Users could also use hard mounts.

13.2.12.11 NFS clients can't mount file systems

Diagnosis 1 : If the user has never been able to mount the file system successfully, the exports file is probably wrong.

Resolution : Make sure the exports file is correct. In particular, the name of the file system and fileset must be spelled correctly. Also the syntax for the entry must be correct. The daemon will start OK with errors like this, but will not be able to process mount requests.

Diagnosis 2 : If the user used to be able to mount the file system, but it has stopped working, the daemon may need to be restarted.

Resolution : On rare occasions, the NFS daemon may decide that a file system does not exist when in fact it does. Make sure the file system is still listed in the exports file. Make sure there are no duplicate entries for the same directory. If the exports file is OK, restart the daemon and see if the problem goes away.

Diagnosis 3 : The exports file has an entry with the root option set, but the user has not set up a credential map entry for the root account.

Resolution : Have the user run nfsmap to create a credential map entry for the root account. Be sure to understand the security implications of doing this.

13.2.12.12 Unable to access/manipulate NFS-mounted files above 2GB

Diagnosis: ulimit settings are incorrect

Resolution: Check the UNIX ulimit setting. The user must have authority at the UNIX level to access/create files over 2GB in size. See the UNIX man page on ulimit for specific details as they relate to your system.
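The file size limit that applies to a process can also be read programmatically. The Python sketch below uses the standard resource module (available on UNIX platforms); the function names are hypothetical, and only the soft limit of the current process is examined.

```python
import resource

TWO_GB = 2 * 1024 ** 3


def file_size_limit_allows(limit_bytes, needed_bytes=TWO_GB):
    """True if a file size limit permits files larger than needed_bytes."""
    return limit_bytes == resource.RLIM_INFINITY or limit_bytes > needed_bytes


def current_process_allows_large_files():
    """Check the soft RLIMIT_FSIZE of the current process against 2 GB."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    return file_size_limit_allows(soft)
```

Run the check as the affected user, since limits are set per user session and are inherited by child processes.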

13.2.13 Startup Daemon Problems

13.2.13.1 A server refuses to start when requested from SSM

Diagnosis : If a server refuses to start up and the log message indicates that the problem is in the InitServer function, check for lockfile contention.

Resolution : The InitServer function creates a lockfile on the server's host in the /var/hpss/tmp directory. The lockfile name is of the form hpssd.NNNN.AAAA, where NNNN is a hexadecimal number and AAAA is the descriptive name of the server. If the server's descriptive name contains embedded spaces, the lockfile name substitutes underscores for the spaces. The lockfile contains the process ID of the server. The Startup Daemon on each host uses the lockfile to determine whether the server is currently running.

If the server was previously run under a different UNIX UID, it is possible that the lockfile already exists but the current server does not have permission to overwrite it. Check the owner and permissions on the file and the UNIX UID for the server in the generic system configuration file. If you are certain the server is not already running on that host, remove the lockfile and retry starting the server.
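Before removing a lockfile, it helps to confirm that the process it names is really gone. The Python sketch below assumes the lockfile stores the process ID as decimal text, which the description above does not specify; the function names are hypothetical, and the NNNN portion of the lockfile name is not computed because its derivation is not described here.

```python
import os


def lockfile_name_suffix(descriptive_name):
    """The AAAA part of the lockfile name: spaces become underscores."""
    return descriptive_name.replace(" ", "_")


def pid_is_running(pid):
    """Return True if a process with this PID exists on the local host."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lockfile_points_at_live_process(path):
    """Read a PID from the lockfile (assumed decimal text) and test it."""
    with open(path) as handle:
        pid = int(handle.read().strip())
    return pid_is_running(pid)
```

A result of False suggests the lockfile is stale. Because PIDs are reused, a result of True does not prove that the running process is the HPSS server; confirm with ps before deciding.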

13.2.13.2 Cannot start a server

Diagnosis 1 : The server may already be running. The Startup Daemon has determined that an identical copy of the target server is already running.

Resolution : Make sure that there are not two servers with the same descriptive name (this should not be possible). If you force the server to run anyway, you may damage the HPSS system, so stop the old server first.

Diagnosis 2 : A lock file name collision may exist.

Resolution : On very rare occasions, two servers with different descriptive names will share the same lock file name. This can only happen if the first 22 characters of the descriptive names are identical, and even then the problem occurs only rarely. To fix the problem, change the descriptive name of one of the servers.
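Candidate collisions can be found by comparing the first 22 characters of each descriptive name. The Python sketch below flags pairs that meet this necessary condition; as noted above, such pairs collide only rarely, so a flagged pair is a candidate, not a confirmed collision. The function name and the sample server names are hypothetical.

```python
from itertools import combinations

PREFIX_LENGTH = 22  # a collision requires identical first 22 characters


def possible_lockfile_collisions(descriptive_names):
    """Return pairs of distinct names that share their first 22 characters."""
    return [
        (first, second)
        for first, second in combinations(descriptive_names, 2)
        if first != second
        and first[:PREFIX_LENGTH] == second[:PREFIX_LENGTH]
    ]
```

For example, "Storage Server Tape Pool A1" and "Storage Server Tape Pool A2" are flagged because both begin with "Storage Server Tape Po".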

Diagnosis 3 : The HPSS executable may not exist or may not be accessible.

Resolution : Make sure the path to the executable is specified correctly, that the executable exists, and that the UNIX user under which the server will be running has permission to access the executable.

Diagnosis 4 : The UNIX user may not exist. The Startup Daemon issues the "Cannot start server; no such unix user <userid>" error message. The SSM System Manager issues the "SSMSM unauthorized to access hpssd_Start_Server API..." error message.

Resolution : Make sure the server's user name exists in the passwd file on the computer where the server will be running.

Diagnosis 5 : The Startup Daemon may not be running.

Resolution : The daemon must be started before HPSS servers can be brought up. To start the daemon, run the script rc.hpss .

Diagnosis 6 : There may be a problem with a Startup Daemon lock file.

Resolution : See the discussion below in Section 13.2.13.4 on how to check lock files.

Diagnosis 7 : The Startup Daemon may not be responding to requests.

Resolution : Kill the daemon using the kill -9 command and then restart it using the script rc.hpss . Do this only as a last resort because it causes the daemon to lose some of the information it has about which servers are running.

Diagnosis 8 : The Data Server will not start and possible Java errors are reported.

Resolution : Verify that the Java prerequisites have been satisfied. This includes:

1. The required Java software is installed and located in the correct directory

2. The security provider is listed in the Java security file

3. A public/private key pair and a certificate have been created and stored in a keystore

4. A Data Server policy file exists with the proper permissions

5. The Data Server client authorization file exists (even if no authorized users are listed in it).

6. If the Data Server is to be executed in Low Security mode, the password to the keystore file created in step 3 must be stored in cleartext form in the proper file. If, instead, the Data Server is executed in Normal Security Mode (which is the recommended mode), it will prompt during initialization for this password, which must then be typed in manually. See 3.8 in the HPSS Installation Guide for more information.


13.2.13.3 Cannot stop a server

Diagnosis 1 : The target server may not be able to shut down gracefully.

Resolution : A server may have received a request to shut down, but cannot complete the request for some reason. To fix the problem, use the force halt button to force the server to shut down. This should only be done as a last resort when it is clear that the server will never complete a graceful shutdown.

Diagnosis 2 : The halt request may depend on the Startup Daemon, which may not be available. This diagnosis applies only to stopping a server with the force halt button; it is not an issue for the shutdown button.

Resolution : To halt a server, SSM issues two requests: one to the specified server directing the server to halt immediately, and one to the Startup Daemon on the server's host directing the Startup Daemon to kill the server. Either request alone should be sufficient, but both are issued in case either fails. If SSM cannot communicate with the server, or if the server ignores the halt request, the only way the halt can succeed is if the Startup Daemon kills the server. In this case, if the Startup Daemon is not executing and communicating with SSM, the halt request will fail.

Diagnosis 3 : The Startup Daemon may not be running.

Resolution : The daemon must be started before HPSS servers can be stopped. To start the daemon, run the script rc.hpss .

Diagnosis 4 : There may be a problem with a Startup Daemon lock file.

Resolution : See the discussion in Section 13.2.13.4 below on how to check lock files.

Diagnosis 5 : The Startup Daemon may not be responding to requests.

Resolution : Kill the daemon using the kill -9 command and then restart it using the script rc.hpss . Do this only as a last resort because it causes the daemon to lose some of the information it has about which servers are running.

13.2.13.4 A problem exists with a Startup Daemon lock file

Diagnosis 1 : The file may be empty.

Resolution : Delete the file. The name of the file that is causing the problem can usually be found in the log. You can also use the ls -l command to look for empty files in /var/hpss/tmp .
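As an alternative to scanning ls -l output by eye, the search for empty files can be sketched with find. The function name list_empty_lock_files is hypothetical, and the hpssd.* name pattern is an assumption based on the example lock file name shown later in this section. The function only lists files; it deletes nothing.

```shell
# Hypothetical helper, not an HPSS utility.
# Lists zero-length files named hpssd.* under the given directory
# (default /var/hpss/tmp). Nothing is deleted.
list_empty_lock_files() {
    dir=${1:-/var/hpss/tmp}
    find "$dir" -type f -name 'hpssd.*' -size 0
}

# Example use:
#   list_empty_lock_files /var/hpss/tmp
```

Review the list against the log before deleting any file.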

Diagnosis 2 : The server may not be able to access the file.

Resolution : This might happen if you have started the server manually under your own UID and then restarted it from the SSM. To fix the immediate problem, use the chown command to give ownership of the lock file to the UNIX user under which the server runs. To prevent this problem from happening in the future, ensure that all accounts that run the server belong to the same group.

Diagnosis 3 : The file may contain invalid information.

Resolution : To better understand the situation, use the cat command to view the contents of the lock file. It should look similar to this:

DescName: Mount Daemon LockNum: 0 PID: 20016

The descriptive name must match the name of the file (in this example, /var/hpss/tmp/hpssd.4302.Mount_Daemon ). If the names do not match, change one of the daemon's descriptive names to avoid a name collision. To avoid further trouble, delete the lock file.
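The comparison between the DescName entry and the file name can be sketched as follows. The function name check_lock_file is hypothetical. The naming rule it assumes (hpssd.<number>.<descriptive name with spaces replaced by underscores>) is inferred only from the single example above, so treat a reported mismatch as a prompt to inspect the file by hand rather than as proof of a problem.

```shell
# Hypothetical helper, not an HPSS utility.
# Compares the DescName stored in a lock file with the name of the file.
# Assumes the file name has the form hpssd.<number>.<Desc_Name>.
check_lock_file() {
    file=$1
    # Join lines so the parse works whether the fields share one line or not.
    content=$(tr '\n' ' ' < "$file")
    desc=$(printf '%s\n' "$content" |
        sed -e 's/.*DescName: *//' -e 's/ *LockNum:.*//')
    expected=$(printf '%s\n' "$desc" | tr ' ' '_')
    # Part of the file name that follows "hpssd.<number>."
    suffix=$(basename "$file" | sed -e 's/^hpssd\.[0-9]*\.//')
    if [ "$expected" = "$suffix" ]; then
        echo "match: $desc"
    else
        echo "mismatch: file says '$desc', name says '$suffix'"
        return 1
    fi
}

# Example use:
#   check_lock_file /var/hpss/tmp/hpssd.4302.Mount_Daemon
```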

13.2.14 SSM Problems

13.2.14.1 The SSM windows have incorrect colors, sometimes to the point of being unusable

Diagnosis : Sammi is not able to obtain all the X color resources that it needs.

Resolution : Check to see what other applications have active windows on the workstation where SSM is being run. Some applications which use a large number of colors prevent SSM from allocating the colors it needs, causing the SSM windows to appear all black or all white, for example. The Netscape Web browser is one application known to cause this problem.

If suspect applications are found, try shutting them down and restarting SSM to see if the color problem goes away. Sometimes SSM can coexist with these applications if SSM is started first, and the other application second.

13.2.14.2 Objects on the SSM windows are missing or not positioned properly

Diagnosis : X resources required by Sammi are not being set correctly.

Resolution : Make sure you have a copy of the file named SAMMI (from /opt/hpss/config/templates/SAMMI.template ) in your home directory on the host where the Sammi Runtime is executing. If you have a private X app-defaults area, the file should go there instead of in your home directory. If Sammi is currently running, shut it down and restart it.

13.2.14.3 Sammi and the Data Server refuse to connect to one another on SSM session startup

Diagnosis 1 : Problems exist in the Sammi configuration file or the API configuration file.

Resolution : The Sammi configuration file is usually named ssm_console.dat , and each user should have a private copy customized for a unique console ID. This file is usually located in the /opt/hpss/sammi/ssmuser/<userid> directory. Near the bottom of this file are several lines beginning with logical_server . For the two vital Sammi processes ( s2_evtsvr and s2_stream ), note the hexadecimal RPC addresses in field 3 of each line. Also make sure that the RPC version numbers (field 4) are equal to the console ID, which is set in the console_id line near the top of the file. There should also be a line for the Data Server ( ssm_ds ). Note this RPC address as well, and make sure that its RPC version number is 1.

Now check the API configuration file named api_config.dat . There is only one copy, located in /opt/hpss/bin . This file contains multiple sets of two logical_server lines, one set for each Sammi console ID that has been set up by the HPSS administrator. The two lines in each set refer to the same two Sammi processes ( s2_evtsvr and s2_stream ) configured in the Sammi configuration file. There should be one set of lines for which the RPC version field (field 4) matches the console ID from the Sammi configuration file. The RPC addresses (field 3) in these lines must be identical to the RPC addresses for the corresponding Sammi processes in the Sammi configuration file.

Diagnosis 2 : Startup script problems exist.

Resolution : If both configuration files are correct (see Diagnosis 1 above), check the HPSS_PORT_SSMDS environment variable defined in the /opt/hpss/config/hpss_env file. The HPSS_PORT_SSMDS variable defines the hexadecimal RPC address of the Data Server. This address must match the RPC address for ssm_ds defined in the Sammi configuration file (ssm_console.dat ).

13.2.14.4 Sammi and the Data Server connect to one another, but they generate many error messages and the Data Server cannot write to most fields

Diagnosis : Hostname resolution problems exist.

Resolution : The Sammi configuration file and the API configuration file (see Section 13.2.14.3 above) both contain hostname information in field 5 of their logical_server entries. The Sammi process lines in both files must reference the host where the Sammi runtime is executing. The Data Server ( ssm_ds ) line in the Sammi configuration file must reference the host where the Data Server is executing. Normally these host names should be localhost or the name of a remote host, as appropriate. Using localhost where possible can make Sammi operation more efficient.

If, however, your site uses Domain Name Services (DNS) for hostname resolution, you may encounter Sammi communication problems related to these hostname fields. Certain quirks in Sammi and/or DNS sometimes make it impossible for Sammi to resolve a hostname unless it is in a specific form. This problem usually becomes apparent when Sammi comes up and appears to connect normally to the Data Server, but then the Data Server finds it impossible to write data to fields on the SSM windows. Error messages generated by Sammi may also appear on the TTY used to start Sammi.

Unfortunately, there is no clear-cut method for fixing the problem; the technique involves some trial and error. You may have to tinker with hostnames in the configuration files. In some cases, you may have to change localhost to the actual fully-qualified hostname (for example, node1.mysite.gov ). In other cases, the full name may not work, but the abbreviated hostname may work (using the same example, node1 ). You may find that one hostname format will work in the Sammi configuration file, but something else is required in the API configuration file.

The only way to find the right combination is to edit one or both configuration files and then restart Sammi and see what happens. If the problem still exists, try a different hostname format in one or both files, and then shut down Sammi and restart again. Repeat until the Data Server is able to write data to fields normally.

13.2.14.5 Sammi and the Data Server start up normally, but get disconnected in the middle of operations

Diagnosis 1 : The network or execution hosts are overloaded. The usual symptom is that data fields in windows are suddenly overwritten with red error messages reading "?unconnected dfd?".

Resolution : If the network or SSM host machines are overloaded, communications problems may occur between Sammi and the Data Server in the middle of an SSM session. If this is truly the cause of the problem, you may need to take steps to free up machine or network resources.

Diagnosis 2 : RPC timeouts are too small. As in Diagnosis 1, data fields in windows can be overwritten with error messages.

Resolution : The Sammi configuration file and the API configuration file (see Section 13.2.14.3 above) both contain RPC timeouts. These appear as the last two fields of the logical_server lines. The numbers are timeouts in seconds, and the second number on each line should always be at least 10 seconds larger than the first number.

The timeouts in the Sammi configuration file define the maximum time Sammi will wait with no response from a server before deciding that the server is disconnected. The timeouts in the API configuration file define the maximum time the Data Server will wait with no response from Sammi before deciding that Sammi is disconnected. Increasing these timeouts can help to avoid too-hasty disconnections in overloaded environments. Increasing them too much, however, will delay reporting of hard communications failures.
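The rule that the second timeout should be at least 10 seconds larger than the first can be checked with a short awk sketch. The function name check_rpc_timeouts is hypothetical, and the sketch assumes that the fields on each logical_server line are separated by whitespace, which should be confirmed against the actual files before relying on it.

```shell
# Hypothetical helper, not an HPSS or Sammi utility.
# Assumes whitespace-separated fields on lines beginning with logical_server.
# Prints each line whose last field is less than 10 larger than the
# next-to-last field, and returns non-zero if any such line is found.
check_rpc_timeouts() {
    awk '$1 == "logical_server" {
        first = $(NF - 1)
        second = $NF
        if (second + 0 < first + 10) {
            printf "line %d: %s %s\n", NR, first, second
            bad = 1
        }
    }
    END { exit bad + 0 }' "$1"
}

# Example use:
#   check_rpc_timeouts /opt/hpss/bin/api_config.dat
```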

13.2.14.6 Communications problems exist between the Data Server and the System Manager

Diagnosis 1 : The number of servers running is incorrect.

Resolution : There must be one and only one System Manager in execution at any one time per HPSS installation. Several Data Servers may execute concurrently, but only if they are using different CDS names. Note that the start_ssm script will refuse to start either program if it finds a copy already running on the same host; however, the administrator may create customized start scripts to start more than one Data Server on the same host as long as the administrator makes sure each server uses different CDS names and different ports to talk to Sammi.

The ps command can be used on each host to determine whether there are more copies of the System Manager or Data Server executing than intended.

Diagnosis 2 : CDS entries are incorrect.

Resolution : The CDS names are set by environment variables in the /opt/hpss/config/hpss_env file. The CDS directory specified for the Data Server must exist and must contain a security object with a proper ACL before the Data Server is executed. The directory and security object will have been created by the HPSS Infrastructure Configuration script ( mkhpss ). To configure a second Data Server, you must create its CDS directory and its security object and set the proper ACLs on them using regular DCE administration tools ( cdscp and acl_edit ). The directory should be accessible by all HPSS servers; the security object should give all permissions to the SSM DCE principal and all permissions except control to the other HPSS servers. By default, the DCE registry group for all HPSS servers is hpss_server and the DCE principal for SSM is hpss_ssm .

For example, if you wish to configure a new Data Server to use CDS directory /.:/hpss/ssmds_2 , use the following commands:

 

cdscp create directory /.:/hpss/ssmds_2

acl_edit /.:/hpss/ssmds_2 -m group:hpss_server:rwdtcia

acl_edit -e /.:/hpss/ssmds_2 -m group:hpss_server:rwdtc

acl_edit /.:/hpss/ssmds_2 -io -m group:hpss_server:rwdtc

acl_edit /.:/hpss/ssmds_2 -ic -m group:hpss_server:rwdtcia

cdscp create object /.:/hpss/ssmds_2/Security

acl_edit /.:/hpss/ssmds_2/Security -m user:hpss_ssm:rwdtc

acl_edit /.:/hpss/ssmds_2/Security -m group:hpss_server:rwdt

The System Manager will create its own CDS directory and security object, if they are not already there, using the CDS directory specified in the hpss_env file. This is how the System Manager bootstraps itself the first time it is executed after initial installation.

Diagnosis 3 : The programs are not running or have not registered.

Resolution : Each program logs a message if it cannot register. In addition, on successful registration the System Manager logs a message stating the CDS name and Descriptive Name it is using. The rpccp show mapping command, used on the host where each program is executing, will show whether the program is registered in the DCE rpc endpoint map there and, if so, which CDS name it is using. If either program has not registered with its endpoint mapper, the other program will not be able to contact it.

The ps command can be used on each host to determine whether the servers are running.

Diagnosis 4 : Servers are not finding each other.

Resolution : When the System Manager can connect to the Data Server, it logs a message saying "Network communications established," and specifies the CDS name of the Data Server. When the System Manager cannot reach the Data Server, it logs an error message specifying the CDS name it thinks the Data Server is using. If the System Manager is logging these error messages, make sure the CDS name it is trying is the same one under which the Data Server is actually registered.

When the Data Server has successfully connected to the System Manager, it logs a message saying "SSM Data Server has completed DCE bind to System Manager." See Diagnosis 2 above for more information.

Diagnosis 5 : The Data Server has been automatically checked out.

Resolution : The System Manager will automatically check out any Data Server which it cannot establish contact with within a certain interval. This interval is hard coded as 5 minutes. After checking out the Data Server, the System Manager will no longer send it any notifications. If the screens appear to be stale and are not receiving any updates at all, try forcing a Ping SSM System Manager from the Admin menu. If this does not remedy the situation, check the log to see whether the System Manager is trying to check the Data Server out or has already done so.

Note that just because the Data Server can contact the System Manager, this does not mean that the System Manager can contact the Data Server. Two different DCE interfaces are involved here. One is advertised by the Data Server as a service for the System Manager. This interface allows the System Manager to send alarms, events, status messages, data change notifications, server list changes, drive list changes, and storage class list changes to the Data Server. The other interface is advertised by the System Manager as a service for the Data Server. This interface allows the Data Server to request reads and modifications of SFS files, current copies of managed objects from servers, startup and shutdown of servers, and most other administrative functions the Data Server performs on behalf of the user. It is when the System Manager cannot contact the Data Server on the first interface for a specified interval that it automatically checks it out. The other interface may be working correctly; for those functions, the Data Server appears responsive.

If the Data Server is automatically checked out, try to check it in again using the Ping SSM System Manager function under the Admin Menu on the HPSS Health and Status window (Figure 7-1 HPSS Health and Status Window). You may need to repeat this process several times. If this still does not correct the problem, you may need to restart the Data Server and/or System Manager.

13.2.14.7 Communications problems exist between the System Manager and other HPSS servers

Diagnosis : Servers cannot find the System Manager.

Resolution : If the basic server configuration file contains more than one entry for the SSM System Manager, or if it contains an entry that does not match the values set by the environment variables in the /opt/hpss/config/hpss_env file, other servers may not be able to find the System Manager and send it any notifications.

There should be exactly one entry for the SSM in the server list. Use the HPSS Servers window to check that there is only one executable server of type SSMSM. Also check that the descriptive name and the CDS Name field defined in the System Manager's basic configuration match the H