online documentation (CCVU.HLP) included on the CD.

Compaq Insight Manager

Compaq Insight Manager, loaded from the Compaq Management CD that is shipped with each ProLiant server, is an easy-to-use, console-based software utility for collecting server and cluster information. Compaq Insight Manager performs the following functions:

- Monitors fault conditions and system status
- Monitors shared storage and interconnect adapters
- Forwards server alert fault conditions
- Remotely controls servers

The Integrated Management Log collects and feeds data to Compaq Insight Manager. This log is used with the Insight Management Desktop (IMD), Remote Insight (optional controller), and SmartStart.

In Compaq servers, each hardware subsystem, such as disk storage, system memory, and system processor, has a robust set of management capabilities. Compaq Full Spectrum Fault Management notifies you of impending fault conditions and keeps the server up and running in the unlikely event of a hardware failure.

For information concerning Compaq Insight Manager, refer to the Compaq Server Setup and Management pack shipped with each ProLiant server.

Compaq Confidential Need to Know Required Writer: Bryan Hicks Project: Compaq ProLiant Clusters HA/F100 and HA/F200 Administrator Guide Comments: Part Number: 380362-003 File Name: b-ch1 Architecture of the Compaq ProLiant Clusters HAF100 and HAF200.doc Last Saved On: 8/24/00 11:58 AM

1-20 Compaq ProLiant Clusters HA/F100 and HA/F200 Administrator Guide

Compaq Insight Manager XE

Compaq Insight Manager XE is a Web-based management system and is located on the Compaq Management CD shipped with each ProLiant server. It can be used in conjunction with Compaq Insight Manager agents as well as its own Web-enabled agents. This browser-based utility provides increased flexibility and efficiency for the administrator.
It extends the functionality of Compaq Insight Manager and works in conjunction with the Cluster Monitor subsystem, providing a common data repository and control point for enterprise servers and clusters, desktops, and other devices using either SNMP- or DMI-based messaging.

Cluster Monitor

Cluster Monitor is a Web-based monitoring subsystem of Compaq Insight Manager XE. With Cluster Monitor, you can view all clusters from a single browser and configure monitor points and specific operational performance thresholds that will alert you when these thresholds have been met or exceeded on your application systems. Cluster Monitor relies heavily on the Compaq Insight Manager agents for basic information about system health. It also has custom agents that are designed specifically for monitoring cluster health. Cluster Monitor provides access to the Compaq Insight Manager alarm, device, and configuration information.

Cluster Monitor allows the administrator to view some or all of the clusters, depending on administrative controls that are specified when clusters are discovered by Compaq Insight Manager XE.

Compaq Intelligent Cluster Administrator

Compaq Intelligent Cluster Administrator extends Compaq Insight Manager and Cluster Monitor by enabling the administrator to configure and manage ProLiant clusters from a Web browser. With Compaq Intelligent Cluster Administrator, you can copy, modify, and dynamically install a cluster configuration on the same physical cluster or on any physical cluster anywhere in the system, through the Web.

Compaq Intelligent Cluster Administrator checks for any cluster-destabilizing conditions, such as disk thresholds or application slowdowns, and reallocates cluster resources to meet processing demands. This software can also dynamically reallocate cluster resources that may be failing, without causing the cluster to fail over.
Compaq Intelligent Cluster Administrator also provides initialized cluster configurations that allow rapid cluster generation as well as cluster configuration builder wizards for extending the Compaq initialized configurations.

Compaq Intelligent Cluster Administrator is included with the HA/F200 cluster kit and can be purchased as a stand-alone component for the HA/F100 cluster. Intelligent Cluster Administrator is licensed on a per-cluster basis.

Resources for Application Installation

The client/server software applications are among the key components of any cluster. Compaq is working with its key software partners to ensure that cluster-aware applications are available and that the applications work seamlessly on Compaq ProLiant clusters.

Compaq provides a number of Integration TechNotes and White Papers to assist you with installing these applications in a Compaq ProLiant Cluster environment. Visit the Compaq High Availability website (http://www.compaq.com/highavailability) to download current versions of these TechNotes and other technical documents.

IMPORTANT: Your software applications may need to be updated to take full advantage of clustering. Contact your software vendors to check whether their software supports MSCS and to ask whether any patches or updates are available for MSCS operation.
Chapter 2

Designing the Compaq ProLiant Clusters HA/F100 and HA/F200

Before connecting any cables or powering up any machines, it is important to understand how all of the cluster components and concepts fit together to meet your information system needs. The major topics discussed in this chapter are:

- Planning Considerations
- Capacity Planning
- Network Considerations
- Failover/Failback Planning

In addition to reading this chapter, read the planning chapter in the Microsoft documentation that came with your operating system.

Compaq Confidential Need to Know Required Writer: Bryan Hicks Project: Compaq ProLiant Clusters HA/F100 and HA/F200 Administrator Guide Comments: Part Number: 380362-003 File Name: c-ch2 Designing the Compaq ProLiant Clusters HAF100 and HAF200.doc Last Saved On: 8/24/00 12:00 PM

Planning Considerations

To correctly assess capacity, network, and failover needs in your business environment, it is important to understand clustering and the things that affect the availability of clusters. The items detailed in this section will help you design your Compaq ProLiant Cluster so that it addresses your specific availability needs.

- Cluster configuration design is addressed in "Cluster Configurations."
- A step-by-step approach to creating cluster groups is discussed in "Cluster Groups."
- Recommendations regarding how to reduce or eliminate single points of failure are contained in the "Reducing Single Points of Failure in the HA/F100 Configuration" section of this chapter.

By definition, a highly available system is not continuously available and therefore may have single points of failure.
NOTE: The discussion in this chapter relating to single points of failure applies only to the Compaq ProLiant Cluster HA/F100. The HA/F200 includes dual redundant loops that eliminate certain single points of failure contained in the HA/F100.

Cluster Configurations

Although there are many ways to set up clusters, most configurations fall into two categories: active/active and active/standby.

Active/Active Configuration

The core definition of an active/active configuration is that each node is actively processing data when the cluster is in a normal operating state. Both the first and second nodes are "active." Because both nodes are processing client requests, an active/active design maximizes the use of all hardware in both nodes.

An active/active configuration has two primary designs:

- The first design uses MSCS failover capabilities on both nodes, enabling Node 1 to fail over clustered applications to Node 2 and enabling Node 2 to fail over clustered applications to Node 1. This design optimizes availability since both nodes can fail over applications to each other.
- The second design is a one-way failover. For example, the Microsoft clustering software may be set up to allow Node 1 to fail over clustered applications to Node 2, but not to allow Node 2 to fail over clustered applications to Node 1. While this design increases availability, it does not maximize availability since failover is configured on only one node.
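The two designs above differ only in the directions in which failover is allowed. As a rough illustration (this is not MSCS configuration syntax), the following Python sketch models each design as a set of allowed (from, to) node pairs and reports what happens to a node's clustered applications when that node fails. The function name, the node labels, and the data layout are assumptions made for this example.

```python
# Illustrative model only; this is not MSCS configuration syntax.
# Each design is a set of allowed failover directions (from_node, to_node).
TWO_WAY = {("Node 1", "Node 2"), ("Node 2", "Node 1")}
ONE_WAY = {("Node 1", "Node 2")}


def failover_target(design, failed_node):
    """Return the node that takes over for failed_node, or None if the
    design does not allow failover from that node."""
    for source, target in sorted(design):
        if source == failed_node:
            return target
    return None


for name, design in (("two-way", TWO_WAY), ("one-way", ONE_WAY)):
    for node in ("Node 1", "Node 2"):
        target = failover_target(design, node)
        outcome = f"fail over to {target}" if target else "stay offline"
        print(f"{name}: if {node} fails, its clustered applications {outcome}")
```

In the one-way design, a failure of Node 2 leaves its applications offline until the node is repaired, which is why that design increases availability without maximizing it.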
When designing cluster nodes to fail over to each other, ensure that each server has enough capacity, memory, and processor power to run all applications (all applications running on the first node plus all clustered applications running on the other node).

When designing your cluster so that only one node (Node 1) fails over to the other (Node 2), ensure that Node 2 has enough capacity, memory, and CPU power not only to execute its own applications but also to run the clustered applications that can fail over from Node 1.

Another consideration when determining your servers' hardware is understanding your clustered applications' required level of performance when the cluster is in a degraded state (when one or more clustered applications are running on a secondary node). If Node 2 is running near peak performance when the cluster is in a normal operating state, and if several clustered applications are failed over from Node 1, Node 2 will likely execute the clustered applications more slowly than when they were executed on Node 1. Some level of performance degradation may be acceptable. Determining how much degradation is acceptable depends on the company.

Example 1: File & Print/File & Print

An example business scenario (Figure 2-1) involves two file and print servers. The Human Resources (HR) department uses one server, and the Marketing department uses the other. Both servers actively run their own file shares and print spoolers while the cluster is in its normal state (an active/active design). If the HR server encounters a failure, it fails over its file and print services to the Marketing server.
HR clients experience a slight disruption of service while the file shares and print spooler fail over to their secondary server. Any jobs that were in the print spooler before the failure event will now print from the Marketing server.

[Diagram labels: File and Print Marketing, Capacity Human Resources (Marketing node); File and Print Human Resources, Capacity Marketing (Human Resources node); Shared Storage]

Figure 2-1. Active/active example 1

When failover is complete, all of the HR clients have full access to their file shares and print spooler. Marketing clients do not experience any disruption of service. All clients may experience slowed performance while the cluster runs in a degraded state.

Example 2: Database/Database

Another scenario (Figure 2-2) has two distinct database applications running on two separate cluster nodes. One database application maintains Human Resources records, and its primary node is set to the HR database node. The other database application is used for market research, and its primary node is set to the Marketing database node.

[Diagram labels: Order Entry Database, Node 1 (Order Entry); Order Entry Database, Node 2 (Order Entry); Shared Storage]

Figure 2-2. Active/active example 2

While in a normal state, both cluster nodes run at expected performance levels. If the Marketing server encounters a failure, the market research application and associated data resources fail over to their secondary node, the HR database server. The Marketing clients experience a slight disruption of service while the database resources are failed over, the database transaction log is rolled back, and the information in the database is validated.
When the database validation is complete, the market research application is brought online on the HR database node and the Marketing clients can reconnect to it. While the Marketing database validation is occurring, the HR clients do not experience any disruption of service.

Example 3: File & Print/Database

In this example (Figure 2-3), a business uses a single server to run its order entry department. The same department has a file and print server. While order entry is business-critical and requires maximum availability, the file and print server can be unavailable for several hours without impacting revenue. In this scenario, the order entry database is configured to use the file and print server as its secondary node. However, the file and print server will not be configured to fail over applications to the order entry server.

[Diagram labels: File and Print Services, Capacity of Order Entry Database, Node1 (File and Print); Order Entry Database, Node2 (Order Entry); Shared Storage]

Figure 2-3. Active/active example 3

If the node running the order entry database encounters a failure, the database fails over to its secondary node.
The order entry clients experience a slight disruption of service while the database resources are failed over, the database transaction log is rolled back, and the information in the database is validated. When the database validation is complete, the order entry application is brought online on the file and print server and the clients can reconnect to it. While the database validation is occurring, file and print activities continue without disruption.

If the file and print server encounters a failure, those services are not failed over to the order entry server. File and print services are offline until the problem is resolved and the node is brought back online.

Active/Standby Configuration

The primary difference between an active/active configuration and an active/standby configuration is the number of servers actively processing data. In active/standby, only one server is processing data (active) while the other (the standby server) is in an idle state.

The standby server must be logged in to the Windows NT or Windows 2000 domain, and the Microsoft clustering software must be up and running. However, no applications are running. The standby server's only purpose is to take over failed clustered applications from its partner. The standby server is not a preferred node for any clustered applications and, therefore, does not fail over any applications to its partner server.

Because the standby server does not process data until it accepts failed-over applications, the limited use of the server may not justify the cost of the server. However, the cost of standby servers is justified when performance and availability are paramount to a business's operations.

The standby server should be designed to run all of the clustered applications with little or no performance degradation.
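The sizing guidance in this chapter (the node that accepts failed-over applications must be able to run them alongside whatever it already runs) comes down to simple addition. The following Python sketch is a hypothetical illustration: the load figures, the "percent of one node" unit, and the function names are assumptions, not measurements or Compaq tooling.

```python
# Hypothetical sizing check. Loads are expressed as a percentage of one
# node's capacity; all numbers below are made up for illustration.

def load_after_failover(own_load, incoming_load):
    """Total load on the surviving node once failover completes."""
    return own_load + incoming_load


def describe(total_load):
    """Classify a total load against a single node's capacity (100%)."""
    if total_load <= 100:
        return "runs without exceeding capacity"
    return "exceeds capacity; expect degraded performance"


# Active/standby: the standby node runs no applications in normal operation.
standby_total = load_after_failover(own_load=0, incoming_load=70)

# Active/active: the surviving node already carries its own workload.
active_total = load_after_failover(own_load=60, incoming_load=70)

print("standby node:", standby_total, "% ->", describe(standby_total))
print("active node:", active_total, "% ->", describe(active_total))
```

With these made-up figures, the standby node absorbs the failed-over workload with headroom to spare, while the already busy active node would be pushed past its capacity.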
Since the standby server is not running any applications while the cluster is in a normal operating state, a failed-over clustered application will likely execute with the same speed and response time as if it were executing on the primary server.

Example 4: Database/Standby Server

An example business scenario describes a mail order business whose competitive edge is quick product delivery (Figure 2-4). If the product is not delivered on time, the order is void and the sale is terminated. The business uses a single server to perform queries and calculations on order entry information, translating sales orders into packaging and distribution instructions for the warehouse. With an estimated downtime cost of $1,000/hour, the company determines that the cost of a standby server is justified.

This mission-critical (active) server is clustered with a standby server. If the active server encounters a failure, this critical application and all its resources fail over to the standby server, which validates the database and brings it online. The standby server now becomes active and the application executes at an acceptable level of performance.

[Diagram labels: Capacity (Mail Order System), Node1 (Standby); Mail Order System, Node2 (Mail Order Database); Shared Storage]

Figure 2-4. Active/standby server example

Cluster Groups

Understanding the relationship between your company's business functions and cluster groups is essential to getting the most from your cluster. Business functions rely on computer systems to support activities such as transaction processing, information distribution, and information retrieval. Each computer activity relies on applications or services, and each application depends on software and hardware subsystems. For example, most applications need a storage subsystem to hold their data files.

This section is designed to help you understand which subsystems, or resources, must be available for either cluster node to run a clustered application properly.

Creating a Cluster Group

The easiest approach to creating a cluster group is to start by designing a resource dependency tree. A resource dependency tree has as its top level the business function for which cluster groups are created. Each cluster group has branches that indicate the resources upon which the group is dependent.

Resource Dependency Tree

The following steps describe the process of creating a resource dependency tree. Each step is illustrated by adding information to a sample resource dependency tree.
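Before walking through the steps, it may help to see the shape of the structure just described: a cluster group at the top, with branches for the resources it depends on, some of which have dependencies of their own. The following Python sketch is illustrative only. The class names, field names, and the `flatten` helper are assumptions made for this example, and the group and resource names are borrowed from the Web Sales Order sample used in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    depends_on: list = field(default_factory=list)  # resources this one needs


@dataclass
class ClusterGroup:
    name: str
    resources: list = field(default_factory=list)


def flatten(resource):
    """List a resource's dependencies first, then the resource itself."""
    order = []
    for dep in resource.depends_on:
        order.extend(flatten(dep))
    order.append(resource.name)
    return order


# Names taken from the Web Sales Order sample; the structure is a sketch.
web_group = ClusterGroup(
    "Web Server Service (Cluster Group #1)",
    [
        Resource("Network Name", [Resource("IP Address")]),
        Resource("Physical Disk (web pages and web scripts)"),
        Resource("Web Server Service"),
    ],
)

for res in web_group.resources:
    print(" -> ".join(flatten(res)))
```

Writing the tree down in a structured form like this makes it easier to transfer the result into the Cluster Group Definition worksheet later in this section.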
The sample is for a hypothetical Web Sales Order business function, which consists of two cluster groups: a database server (a Windows NT or Windows 2000 application) and a Web server (a Windows NT or Windows 2000 service).

NOTE: For this example, it is assumed that each cluster group can communicate with the other even if they are not executing on the same node, for example, by means of an IP address. With this assumption, one cluster group can fail over to the other node, while the remaining cluster group continues to execute on its primary node.

1. List each business function that requires a clustered application or service (Figure 2-5).

[Diagram labels: Web Sales Order Business Function; Cluster Group #1; Cluster Group #2]

Figure 2-5. Resource dependency tree: step 1

2. List each application or service required for each business function (Figure 2-6).

[Diagram labels: Web Sales Order Business Function; Web Server Service (Cluster Group #1) with Resource #1 through Resource #3; Database Server Application (Cluster Group #2) with Resource #1 through Resource #4; Dependent-Resource #1 shown in each group]

Figure 2-6. Resource dependency tree: step 2

3. List the immediate dependencies for each application or service (Figure 2-7).
Web Sales Order Business Function
  Web Server Service (Cluster Group #1)
    Network Name
      IP Address
    Physical Disk Resource - contains web pages and web scripts
    Web Server Service
  Database Server Application (Cluster Group #2)
    Network Name
      IP Address
    Physical Disk Resource - contains DB log file(s)
    Physical Disk Resource - contains DB data file(s)
    Database Application

Figure 2-7. Resource dependency tree: step 3

4. Transfer the resource dependency tree into a Cluster Group Definition worksheet.

Figure 2-8 illustrates the worksheet for the Web Sales Order business function. A blank copy of the worksheet is provided in Appendix A.

Cluster Group Definition Worksheet

Cluster Function:  Web Sales Order
Group #1:          Web Server Service
Group #2:          Database Server Application

Resource Definitions

Group #1 (Web Server Service)

Resource                                                                 Sub Resource 1  Sub Resource 2  Sub Resource 3  Sub Resource 4
Resource #1: Network Name                                                IP Address
Resource #2: Physical Disk Resource - contains Web pages and Web scripts
Resource #3: Web Server Service
Resource #4: N/A

Group #2 (Database Server Application)

Resource                                                                 Sub Resource 1  Sub Resource 2  Sub Resource 3  Sub Resource 4
Resource #1: Network Name                                                IP Address
Resource #2: Physical Disk Resource - contains database log files
Resource #3: Physical Disk Resource - contains database data files
Resource #4: Database Application

Figure 2-8. Cluster Group Definition Worksheet (example)

Use the resource dependency tree concept to review your company's availability needs. It is a useful exercise, directing you to record the exact design and definition of each cluster group.

Reducing Single Points of Failure in the HA/F100 Configuration

The final planning consideration is reducing single points of failure. Depending on your needs, you may leave all vulnerable areas alone, accepting the risk associated with a potential failure. Or, if the risk of failure is unacceptable for a given area, you may elect to use a redundant component to minimize, or remove, the single point of failure.

NOTE: Although not specifically covered in this section, redundant server components (such as power supplies and processor modules) should be used wherever possible. These features will vary based upon your specific server model.

The single points of failure described in this section are:

- Cluster interconnect
- Fibre Channel data paths
- Non-shared disk drives
- Shared disk drives

NOTE: The Compaq ProLiant Cluster HA/F200 addresses the single points of failure listed above with its dual redundant loop configuration. For more information, refer to the "Enhanced High Availability Features of the HA/F200" section of this chapter.

Cluster Interconnect

The interconnect is the primary means for the cluster nodes to communicate. Intracluster communication is crucial to the health of the cluster.
If communication between the cluster nodes ceases, the Microsoft clustering software must determine the state of the cluster and take action, in most cases bringing the cluster groups offline on one of the nodes and failing over all cluster groups to the other node.

Following are two strategies for increasing the availability of intracluster communication. Combined, these strategies provide even more redundancy.

Microsoft Clustering Software Configuration

Microsoft Cluster Server for Windows NTS/E and Cluster Service for Windows 2000 Advanced Server (MSCS) allow you to configure a primary and backup path for intracluster communication, which will reduce the possibility of an intracluster communication disruption. Any network interface card (NIC) in the nodes can be configured to serve as a backup path for node-to-node communication. When the primary path is disrupted, the transfer of communication responsibilities goes undetected by applications running on the cluster. Whether a dedicated or public interconnect has been set up, a separate NIC should be configured to act as a redundant interconnect. This is an easy and inexpensive way to add redundancy to intracluster communication.

Redundant Interconnect Card

Another strategy to increase availability is to use a redundant interconnect card. This may be done for either the dedicated intracluster communication path or the client LAN. If you are using a dedicated, direct-connection interconnect configuration, you can install a second dedicated, direct-connection interconnect.
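The primary-and-backup arrangement described above amounts to an ordered list of paths in which the first working one carries intracluster traffic. The following Python sketch illustrates the idea only. It does not call any MSCS interface, the cluster software handles the actual switchover, and the path names and the `pick_path` helper are hypothetical.

```python
# Conceptual illustration only; the cluster software handles the real
# switchover. Paths are listed in priority order: primary first.
PATHS = ["dedicated interconnect", "LAN NIC (backup path)"]


def pick_path(paths, is_up):
    """Return the first path in priority order that is up, or None."""
    for path in paths:
        if is_up.get(path, False):
            return path
    return None


normal = {"dedicated interconnect": True, "LAN NIC (backup path)": True}
primary_down = {"dedicated interconnect": False, "LAN NIC (backup path)": True}
all_down = {"dedicated interconnect": False, "LAN NIC (backup path)": False}

print(pick_path(PATHS, normal))
print(pick_path(PATHS, primary_down))
print(pick_path(PATHS, all_down))
```

The last case, in which no path is up, corresponds to the situation in which the nodes can no longer communicate and the clustering software must determine the state of the cluster and take action.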
NOTE: If you are using the ServerNet option as the interconnect, the card itself has a built-in level of redundancy. Each ServerNet PCI adapter has two data ports, thereby allowing two separate cables to be run to and from each cluster node. If the ServerNet adapter determines that data is being sent from one adapter but not received by the other, it will automatically route the information through its other port.

There are two implementations that provide identical redundant NIC capability. The implementation you choose will depend on your hardware. The Compaq TLAN Teaming and Configuration Utility is supported on all Compaq TI-based Ethernet and Fast Ethernet NICs, such as NetFlex-3 and Netelligent 10/100 TX PCI Ethernet NICs. The Compaq Network Teaming and Configuration Utility is designed to operate with the Compaq Intel-based 10/100 NICs. Combining these utilities with the appropriate NICs will enable a seamless, undetectable failover of the primary interconnect to the redundant interconnect.

NOTE: These two methods of NIC redundancy cannot be combined in a single redundant NIC pair: TI-based NICs may not be paired with Intel-based NICs to create a redundant pair.

For more information, refer to the Compaq White Paper, "High Availability Options Supported by Compaq Network Interface Controllers," available at the Compaq High Availability website (http://www.compaq.com/).

Because the purpose of the redundant interconnect is to increase the availability of the cluster, it is important to monitor the status of your redundant NICs.
Compaq Insight Manager and Compaq Insight Manager XE simplify management of the interconnect by monitoring the state of the NIC. You can view status information and alert conditions for all cards in each node. If a failover event occurs due to a disruption in the heartbeat, you can use the Compaq Insight Manager tools to determine where the disruption originated.

Cluster-to-LAN Communication

Each cluster node must have at least one NIC that connects to the LAN. Through this connection, network clients can access applications and data on the cluster. If the LAN NIC fails in one of the nodes, any clients connected directly to the cluster node by means of the computer name, cluster node IP address, or MAC address of the NIC will no longer have access to their applications. Clients connected to a virtual server on the cluster (via the IP address or network name of a cluster group) reconnect to the cluster through the surviving cluster node.

Failure of a LAN NIC in a cluster node may have serious repercussions. If your cluster is configured with a dedicated interconnect and a single LAN NIC, the failure of a LAN NIC will prevent network clients from accessing cluster groups running on that node. If the interconnect path is not disrupted, it is possible that a failover will not occur. The applications will continue to run on the node with the failed NIC; however, clients will be unable to access them.

Install redundant NICs and use the proper redundant NIC utility to reduce the possibility of LAN NIC failure. When your cluster nodes are configured with the utility, the redundant NIC automatically takes over operation if the primary NIC fails. Clients maintain their connection with their primary node and, without disruption, continue to have access to their applications.

Compaq offers a dual-port NIC that can utilize the Compaq Redundant NIC Utility. This also reduces the possibility of the failure scenario described above.
However, if the entire NIC or the node slot into which the NIC is placed fails, the same failure scenario will occur.

Compaq Insight Manager and Compaq Insight Manager XE monitor the health of any network cards used for the LAN. If any of the cards experience a fault, the Compaq Insight Manager tools mark the card as "Offline" and change its condition to the appropriate status.

Recommended Cluster Communication Strategy

The previous two sections discussed the redundancy of intracluster and cluster-to-LAN communication. However, to obtain the most benefit while minimizing cost and complexity, view cluster communications as a single entity.

To create redundancy for both intracluster and cluster-to-LAN communication, first employ physical hardware redundancy for the LAN NICs. Second, configure the Microsoft clustering software to use both the primary and redundant LAN NIC as backup for intracluster communication.

With this strategy, your cluster can continue normal operations (without a failover event) when any one of the following points of failure is encountered:

- Failure of the interconnect card
- Failure of the interconnect cable
- Failure of the port on the LAN NIC
- Failure of the LAN NIC (if redundant NICs, as opposed to dual-ported NICs, are used)
- Failure of the Ethernet cable running from a cluster node to the Ethernet hub (which connects to the LAN)

The following examples describe how to physically set up your cluster nodes to employ the Compaq-recommended strategy.
Example 1

A Compaq dual-port NIC and a single-port NIC are used in this example (Figure 2-9). The first port of the dual-port NIC is a dedicated interconnect, and the second port is the backup path for the cluster-to-LAN network. The single-port NIC is configured as the primary network path for cluster-to-LAN communication.

The TLAN Teaming and Configuration Utility (for ThunderLAN NICs) and the Network Teaming and Configuration Utility (for Intel NICs) are used to configure the second port on the dual-port NIC as the backup port of a redundant pair. The single port on the other NIC is configured to be the primary port for cluster-to-LAN communication.

The interconnect retains its fully redundant status when MSCS is configured to use the other network ports as interconnect backup. Failure of the primary interconnect path results in intracluster communications occurring over the single-port NIC, since the single-port NIC was configured in MSCS as the backup for intracluster communication. If the entire dual-port NIC fails, the cluster nodes still have a working communication path over the single-port NIC.

With this configuration, even a failure of the dual-port NIC results in the transfer of the cluster-to-LAN communication to the single-port NIC. Other than a failure of the network hub, the failure of any cluster network component will be resolved by the redundancy of this configuration.

Figure 2-9. Use of dual-port NICs to increase redundancy
(Diagram labels: Primary Interconnect Path; Node 1; Node 2; Primary Cluster to LAN and Backup Interconnect Path; Backup Cluster to LAN and Backup Interconnect Path; Hub; Clients)

Example 2

The second example configuration consists of three single-port NICs (Figure 2-10). One NIC is dedicated to intracluster communication. The other two NICs are used for cluster-to-LAN communication. The Compaq Advanced Network Control Utility is used to configure two of the NICs as a redundant pair, one as the primary and one as the standby.

The interconnect is fully redundant when the Microsoft clustering software is configured to use the other network cards as backups for the interconnect. Failure of the primary interconnect path results in intracluster communications occurring over the primary NIC of the redundant pair. If the entire interconnect card fails, the cluster nodes will still have a working communication path.

The cluster-to-LAN communication is fully redundant up to the network hub. With this configuration, even a failure of the primary NIC results only in the transfer of the network path to the standby NIC. Other than a failure of the network hub, any failure of any cluster network component will be resolved by the redundancy of this configuration.

The primary disadvantage of this configuration as compared to Example 1 is that an additional card slot is used by the third NIC.

Figure 2-10. Use of three NICs to increase redundancy
(Diagram labels: Primary Interconnect Path; Node 1; Node 2; Primary Cluster to LAN and Backup Interconnect Path; Backup Cluster to LAN and Backup Interconnect Path; Hub; Clients)

HA/F100 Fibre Channel Data Paths

The Compaq StorageWorks RAID Array 4000 or Compaq StorageWorks RAID Array 4100 storage system is the mechanism with which ProLiant Clusters implement shared storage. Generally, the storage system consists of a host bus adapter in each server, a storage hub or switch, a Compaq StorageWorks RA4000 Controller, and a Compaq StorageWorks RAID Array 4000 or Compaq StorageWorks RAID Array 4100 (RA4000/4100) into which the SCSI disks are placed.

The RA4000/4100 storage system has two distinct data paths, separated by the Fibre Channel storage hub or FC-AL switch:

• The first data path runs from the host bus adapters in the servers to the Fibre Channel storage hub or FC-AL switch.
• The second data path runs from the Fibre Channel storage hub or FC-AL switch to the RA4000/4100.

The effects of a failure will vary depending on whether the failure occurred on the first or second data path.

Failure of the Host Bus Adapter-to-Storage Hub Data Path

If the host bus adapter-to-storage hub path fails (Figure 2-11), it results in a failover of all applications.
For instance, if one server can no longer access the storage hub (and by extension the shared storage), all of the cluster groups that depend on shared storage will fail over to the second server. The cost of this failure is relatively minor: the downtime experienced by users while the failover event occurs.

Figure 2-11. Host bus adapter-to-storage hub data path
(Diagram labels: RA4000/4100; storage hub or switch; Interconnect; ProLiant Server; ProLiant Server; Corporate LAN)

Note that the Compaq Insight Manager tools monitor the health of the RA4000/4100 storage system. If any part of the Fibre Channel data path disrupts a server's access to the RA4000/4100, the array controller status changes to "Failed" and the condition is red. The red condition bubbles up to higher-level Compaq Insight Manager screens and eventually to the device list.

NOTE: The Compaq Insight Manager tools display a failure of physical hardware through the Mass Storage button on the View screen, marking the hardware "Failed." A logical drive in the cluster is reported on the Cluster Shared Resources screen as a logical disk resource. Compaq Insight Manager and Compaq Insight Manager XE do not associate the logical drive with the physical hardware.

Failure of the Hub-to-RA4000/4100 Data Path

The second data path (Figure 2-12), from the storage hub to the RA4000/4100, has more severe implications when it fails. If this data path fails, all clustered applications become inoperable. Even attempting to fail the applications to another cluster node will not gain access to the RA4000/4100.
NOTE: This failure scenario can be avoided by deploying the redundant Fibre Channel loop configuration of the Compaq ProLiant Cluster HA/F200.

Figure 2-12. Hub-to-RA4000/4100 data path
(Diagram labels: RA4000/4100; storage hub or switch; Interconnect; ProLiant Server; ProLiant Server; Corporate LAN)

Without access to shared storage, clustered applications cannot reach their data or log files. The data, however, is unharmed and remains safely stored on the physical disks inside the RA4000/4100. If a database application was running when this failure occurred, some in-progress transactions will be lost. The database will need to be rolled back and the in-progress transactions re-entered.

As with a failure of the server-to-storage hub data path, the Compaq Insight Manager tools detect this fault, change the RA4000/4100 status to "Failed," and change its condition to red. The red condition bubbles up through Compaq Insight Manager screens, eventually to the device list.

Nonshared Disk Drives

Nonshared disk drives, or local storage, operate the same way in a cluster as they do in a single-server environment. These drives can be in the server drive bays or in an external storage cabinet. As long as they are not accessible by both servers, they are considered nonshared.

Treat nonshared drives in a clustered environment as you would in a nonclustered environment. Most likely, some form of RAID is used to protect the drives and restore a failed drive. Since the operating system is stored on these drives, use either hardware or software RAID to protect the information. Hardware RAID is available with the Compaq SMART-2 Controller or by using a nonshared storage system.
Shared Disk Drives

Shared disk drives are contained in the RA4000/4100, which is accessible by each cluster node. Employ hardware RAID 1 or 5 on all of your shared disk drives, and configure it using the Compaq Array Configuration Utility. If RAID 1 or 5 is not used, failure of a shared disk drive will disrupt service to all clustered applications and services that depend on the drive. Failover of a cluster node will not resolve this failure, since neither server can read from a failed drive.

NOTE: Windows NTS/E software RAID is not available for shared drives when using MSCS. Hardware RAID is the only available RAID option for shared storage.

As with other system failures, Compaq Insight Manager monitors the health of disk drives and will mark a failed drive as "Failed."

Enhanced High Availability Features of the HA/F200

A single point of failure refers to any component in the system that, should it fail, prevents the system from functioning. Single points of failure in hardware can be minimized, and in some cases eliminated, by using redundant components. The most effective way of accomplishing this is by clustering. The Compaq ProLiant Cluster HA/F100 reduces the single points of failure that exist in a single-server environment by allowing two servers to share storage and take over for each other in the event that one server fails. The Compaq ProLiant Cluster HA/F200 goes one step further by implementing a dual redundant Fibre Channel Arbitrated Loop configuration.
The Compaq ProLiant Cluster HA/F200 further enhances high availability through the use of additional redundant components in the server-to-storage connection and in the shared storage system itself. In the event of a failure, processing is switched to an alternate path without affecting applications and end users. In fact, this path switch is transparent even to the Windows NT and Windows 2000 file system (NTFS). The combination of multiple paths and redundant hardware components provided by the HA/F200 offers significantly enhanced high availability over non-redundant configurations.

A single component failure in the HA/F200 will result in an automatic failover to an alternate component, allowing end users to continue accessing their applications without interruption. Some typical failures and associated responses in an HA/F200 configuration are:

• A server failure will cause the Microsoft clustering software to fail application processing over to the second server.
• A host bus adapter failure will cause I/O requests intended for the failed adapter to be rerouted through the remaining adapter.
• A storage hub, switch, or cable failure is treated like a host bus adapter failure: I/O fails over to the second host bus adapter, which uses a different storage hub and cables.
• An array controller failure will cause the redundant array controller to take over for the failed controller.

In all of the above examples, end users will experience minimal interruptions while the failover occurs. In some cases, the interruptions may not even be noticeable.

The following illustration depicts the HA/F200 configuration components.

Figure 2-13. HA/F200 configuration
(Diagram labels: Node 1; Node 2; Dedicated Interconnect; two storage hubs or switches; RA4000/4100; LAN)

HA/F200 Fibre Channel Data Paths

The Compaq StorageWorks RAID Array 4000/4100 storage system is the mechanism with which the HA/F200 cluster implements shared storage. The Compaq ProLiant Cluster HA/F200 minimum configuration consists of two host bus adapters in each server, two Fibre Channel storage hubs or FC-AL switches, two array controllers per RA4000/4100, and one or more RA4000/4100s.

The RA4000/4100 storage system has active data paths and standby data paths, separated by two Fibre Channel storage hubs or FC-AL switches.
Figure 2-14 and Figure 2-15 detail the active and standby paths of the minimum HA/F200 configuration.

Figure 2-14. Active host bus adapter-to-storage data paths
(Diagram labels: A; S; Server; Server; storage hub or switch; storage hub or switch; Active; Standby; RA4000/4100)

The active data paths run from the active host bus adapters in the servers to the active storage hub. If this path fails, the applications can seamlessly fail over to the standby host bus adapter-to-storage hub data paths (Figure 2-15).

Figure 2-15. Active hub-to-storage data path
(Diagram labels: A; S; Server; Server; storage hub or switch; storage hub or switch; Active; Standby; RA4000/4100)

The second active data path runs from the active hub or switch to the RA4000/4100. If this path fails, the applications can seamlessly fail over to the standby hub-to-RA4000/4100 data path.

The dual redundant loop feature of the Compaq ProLiant Cluster HA/F200 increases the level of availability over clusters that have only one path to the shared storage. In addition, the second path in the HA/F200 provides for improved performance through static load balancing. Static load balancing considerations are discussed in the "Static Load Balancing" section of this chapter.
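The difference between the single-loop HA/F100 and the dual-loop HA/F200 can be summarized with a small model. The Python sketch below is a simplification under stated assumptions, not a description of how the Compaq software decides anything: segment names are hypothetical, each loop is treated as one independent end-to-end path made of a host bus adapter-to-hub segment and a hub-to-array segment, and the first loop is assumed to be the active one.

```python
# Simplified model of the Fibre Channel data paths described in this chapter.
# Segment names are hypothetical; each loop is treated as an independent path.

def storage_paths(redundant_loops):
    """Map each node to its list of paths; each path is a set of segments."""
    loops = ("a", "b") if redundant_loops else ("a",)
    return {
        node: [
            {f"{node}_hba_to_hub_{loop}", f"hub_to_array_{loop}"}
            for loop in loops
        ]
        for node in ("node1", "node2")
    }


def effect_of_failure(redundant_loops, failed, active_node="node1"):
    """Classify the effect of a set of failed segments on clustered applications."""
    failed = set(failed)
    paths = storage_paths(redundant_loops)
    # Paths that contain no failed segment, per node.
    intact = {
        node: [p for p in node_paths if not (p & failed)]
        for node, node_paths in paths.items()
    }
    if intact[active_node]:
        # The first path in the list is assumed to be the active path.
        if not (paths[active_node][0] & failed):
            return "no effect"
        return "path switch, no failover"
    if any(intact[node] for node in intact if node != active_node):
        return "failover to partner node"
    return "clustered applications inoperable"


if __name__ == "__main__":
    # Single loop (HA/F100-style): the outcome depends on which segment fails.
    print(effect_of_failure(False, {"node1_hba_to_hub_a"}))
    print(effect_of_failure(False, {"hub_to_array_a"}))
    # Dual loops (HA/F200-style): a single failure only switches paths.
    print(effect_of_failure(True, {"node1_hba_to_hub_a"}))
    print(effect_of_failure(True, {"hub_to_array_a"}))
```

With a single loop, losing the host bus adapter-to-hub segment forces a failover and losing the hub-to-array segment leaves the applications inoperable. With two loops, either single failure is absorbed by a path switch.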
Capacity Planning

Capacity planning determines how much computer hardware is needed to support the applications and data on your clustered servers. Unlike conventional, single-server capacity planning, clustered configurations must ensure that each node is capable of running any applications or services that may fail over from its partner node.

To simplify the following discussion, the software running on each of the clustered nodes is divided into three generic categories:

• Operating system
• Nonclustered applications and services
• Clustered applications and services

Figure 2-16 illustrates these categories in the cluster.

Figure 2-16. File locations in a Compaq ProLiant Cluster
(Diagram labels: Shared Storage, holding Data for Node1 Clustered Applications & Services and Data for Node2 Clustered Applications & Services; Node1 and Node2, each holding Operating System, Clustered Applications & Services, and Non-Clustered Applications & Services)

For each server, determine the processor, memory, and disk storage requirements needed to support its operating system and nonclustered applications and services.
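The per-node sizing described above amounts to summing the requirements of each software category and, where a partner node can fail over to this node, adding the partner's clustered workload. The Python sketch below illustrates that arithmetic only. The `Load` type, its units, and every number in it are hypothetical placeholders, not sizing guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Load:
    """Resource requirement of one software category (hypothetical units)."""
    cpu_mhz: int
    memory_mb: int
    local_disk_mb: int

    def __add__(self, other):
        return Load(
            self.cpu_mhz + other.cpu_mhz,
            self.memory_mb + other.memory_mb,
            self.local_disk_mb + other.local_disk_mb,
        )


def required_capacity(operating_system, nonclustered, own_clustered,
                      partner_clustered=None):
    """Sum the categories a node must support.

    Pass partner_clustered only if the partner node is set up to fail its
    clustered applications and services over to this node.
    """
    total = operating_system + nonclustered + own_clustered
    if partner_clustered is not None:
        total = total + partner_clustered
    return total


if __name__ == "__main__":
    # Illustrative figures only.
    os_load = Load(cpu_mhz=200, memory_mb=128, local_disk_mb=2000)
    nonclustered = Load(cpu_mhz=100, memory_mb=64, local_disk_mb=500)
    own = Load(cpu_mhz=300, memory_mb=256, local_disk_mb=1000)
    partner = Load(cpu_mhz=250, memory_mb=192, local_disk_mb=800)

    print(required_capacity(os_load, nonclustered, own))
    print(required_capacity(os_load, nonclustered, own, partner))
```

Keeping the partner workload as an optional argument mirrors the design choice a planner faces: a node that never receives failed-over applications can be sized for its own categories alone.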
Determine the processor and memory requirements needed to support the clustered applications and services that will run on each node while the cluster is in a normal operating state.

If the program files of a clustered application and/or service will reside on local storage, remember to add that capacity to the amount of local storage needed on each node.

For all files that will reside on shared storage, see "Shared Storage Capacity" later in this chapter.

Server Capacity

The capacity needed in each server depends on whether you design your cluster as an active/active configuration or as an active/standby configuration. Capacity planning for each configuration is discussed in the following sections.

Active/Active Configuration

As described earlier in this chapter, an active/active configuration can be designed in two ways:

• Applications and services may be configured to fail over from each node to its partner node.
• Applications and services may be configured to fail over from just one node to its partner node.

The following table details the capacity requirements that can be applied to either active/active design.
Table 2-1
Server Capacity* Requirements for Active/Active Configuration

Node 1                                        Node 2
Operating system (with MSCS)                  Operating system (with MSCS)
Nonclustered applications and services        Nonclustered applications and services
Server1 clustered applications and services   Server2 clustered applications and services
Server2 clustered applications and services   Server1 clustered applications and services
(if Server2 is set up to fail applications    (if Server1 is set up to fail applications
and services to Server1)                      and services to Server2)

* Processing power, memory, and nonshared storage

Active/Standby Configuration

In an active/standby configuration, only one node actively runs applications and services. The other node is in an idle, or standby, state. Assume Node 1 is the active node and Node 2 is the standby node.

Table 2-2
Server Capacity* Requirements for Active/Standby Configuration

Node 1                                        Node 2
Operating system (with MSCS)                  Operating system (with MSCS)
Nonclustered applications and services
Server1 clustered applications and services   Server1 clustered applications and services

* Processing power, memory, and nonshared storage

Shared Storage Capacity

Each server is connected to shared storage (the Compaq StorageWorks RAID Array 4000/4100 storage system), which mainly stores data files of clustered applications and services. Follow the guidelines below to determine how much capacity is needed for your shared storage.

NOTE: For some clustered applications, it may make sense to store the application program files on shared storage. If the application allows customization and the customized information is stored in program files, the program files should be placed on shared storage. When a failover event occurs, the secondary node will launch the application from shared storage. The application will execute with the same customizations that existed when executed on the primary node.

Two factors help to determine the required amount of shared storage disk space:

• The amount of space required for all clustered applications and their dependencies.
• The level of data protection (RAID) required for each type of data used by each clustered application. Two factors driving RAID requirements are:
  • The performance required for each drive volume
  • The recovery time required for each drive volume

IMPORTANT: Windows software RAID is not available for shared drives when using MSCS. Hardware RAID is the only available RAID option for shared storage.

For more information about hardware RAID, see the following:

• Compaq StorageWorks Fibre Channel RAID Array 4000 User Guide
• Compaq StorageWorks Fibre Channel RAID Array 4100 User Guide

In the "Cluster Groups" section of this chapter, you created a resource dependency tree, then transferred that information into a Cluster Group Definition Worksheet (Figure 2-8). Under the resource dependencies in the worksheet, you listed at least one physical disk resource. For each physical disk resource, determine the capacity and level of protection required for the data to be stored on it.

For example, the Web Sales Order Database group depends on a log file, data files, and program files.
It might be important for the log file and program files to have a quick recovery time, while performance would be a secondary concern. Together, the files do not take up much capacity; therefore, mirroring