vSphere Availability Update 1 ESXi 6.0 vCenter Server 6.0 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition. To check for more recent editions of this document, see http://www.vmware.com/support/pubs.
You can find the most up-to-date technical documentation on the VMware Web site at: http://www.vmware.com/support/ The VMware Web site also provides the latest product updates. If you have comments about this documentation, submit your feedback to: docfeedback@vmware.com
Copyright © 2009–2016 VMware, Inc. All rights reserved. Copyright and trademark information.
VMware, Inc. 3401 Hillview Ave. Palo Alto, CA 94304 www.vmware.com
Contents

About vSphere Availability
Updated Information
1 Business Continuity and Minimizing Downtime
    Reducing Planned Downtime
    Preventing Unplanned Downtime
    vSphere HA Provides Rapid Recovery from Outages
    vSphere Fault Tolerance Provides Continuous Availability
2 Creating and Using vSphere HA Clusters
    How vSphere HA Works
    vSphere HA Admission Control
    vSphere HA Interoperability
    Creating and Configuring a vSphere HA Cluster
    Best Practices for vSphere HA Clusters
About vSphere Availability

vSphere Availability describes solutions that provide business continuity, including how to establish vSphere® High Availability (HA) and vSphere Fault Tolerance.

Intended Audience

This information is for anyone who wants to provide business continuity through the vSphere HA and Fault Tolerance solutions. The information in this book is for experienced Windows or Linux system administrators who are familiar with virtual machine technology and data center operations.
Updated Information

This vSphere Availability document is updated with each release of the product or when necessary. This table provides the update history of vSphere Availability.

Revision        Description
EN-001810-02    Change to wording about dedicated FT network under Fault Tolerance Requirements. See “Fault Tolerance Requirements, Limits, and Licensing.”
EN-001810-01    New note about ESXi host version needed for VM Component Protection feature. See “VM Component Protection.”
Chapter 1 Business Continuity and Minimizing Downtime

Downtime, whether planned or unplanned, brings with it considerable costs. However, solutions to ensure higher levels of availability have traditionally been costly, hard to implement, and difficult to manage. VMware software makes it simpler and less expensive to provide higher levels of availability for important applications.
The vSphere vMotion® and Storage vMotion functionality in vSphere makes it possible for organizations to reduce planned downtime because workloads in a VMware environment can be dynamically moved to different physical servers or to different underlying storage without service interruption. Administrators can perform faster and completely transparent maintenance operations, without being forced to schedule inconvenient maintenance windows.
vSphere HA has several advantages over traditional failover solutions:

Minimal setup
    After a vSphere HA cluster is set up, all virtual machines in the cluster get failover support without additional configuration.

Reduced hardware cost and setup
    The virtual machine acts as a portable container for the applications and it can be moved among hosts. Administrators avoid duplicate configurations on multiple machines.
Chapter 2 Creating and Using vSphere HA Clusters

vSphere HA clusters enable a collection of ESXi hosts to work together so that, as a group, they provide higher levels of availability for virtual machines than each ESXi host can provide individually. When you plan the creation and usage of a new vSphere HA cluster, the options you select affect the way that cluster responds to failures of hosts or virtual machines.
Master and Slave Hosts

When you add a host to a vSphere HA cluster, an agent is uploaded to the host and configured to communicate with other agents in the cluster. Each host in the cluster functions as a master host or a slave host. When vSphere HA is enabled for a cluster, all active hosts (those not in standby or maintenance mode, or not disconnected) participate in an election to choose the cluster's master host.
If a master host cannot communicate directly with the agent on a slave host, the slave host does not respond to ICMP pings, and the agent is not issuing heartbeats, the slave host is considered to have failed. The host's virtual machines are restarted on alternate hosts.
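The failure test described above combines three independent signals, and a host is declared failed only when all three are absent. The following is a minimal sketch of that rule, not the vSphere HA agent's actual logic; the class and field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SlaveHostStatus:
    """What the master host observes about one slave host (hypothetical names)."""
    agent_reachable: bool     # master can communicate directly with the host's agent
    responds_to_ping: bool    # host answers ICMP pings
    issuing_heartbeats: bool  # the agent is still issuing heartbeats


def is_considered_failed(status: SlaveHostStatus) -> bool:
    """Return True only when all three signs of life are missing."""
    return not (
        status.agent_reachable
        or status.responds_to_ping
        or status.issuing_heartbeats
    )


# A host that still answers pings is not treated as failed.
print(is_considered_failed(SlaveHostStatus(False, True, False)))
# A host with no signal at all is treated as failed.
print(is_considered_failed(SlaveHostStatus(False, False, False)))
```

Requiring every signal to be missing keeps a single lost channel from triggering a restart of the host's virtual machines.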
• Medium. Application servers that consume data in the database and provide results on web pages.
• Low. Web servers that receive user requests, pass queries to application servers, and return results to users.

If a host fails, vSphere HA attempts to register to an active host the affected virtual machines that were powered on and have a restart priority setting of Disabled, or that were powered off.
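The restart priority behavior above can be sketched as a simple planning step: virtual machines that were powered on are restarted in priority order, while those set to Disabled or already powered off are only registered on an active host. This is an illustrative sketch, not vSphere HA's implementation. The High level is assumed to precede the Medium and Low levels listed above, and the function name and tuple layout are hypothetical.

```python
# Lower number means restarted earlier. "High" is an assumed level that
# precedes the Medium and Low levels listed in the text.
RESTART_ORDER = {"High": 0, "Medium": 1, "Low": 2}


def plan_failover(vms):
    """Split affected VMs into a restart sequence and a register-only list.

    vms is a list of (name, restart_priority, was_powered_on) tuples.
    """
    to_restart = sorted(
        (vm for vm in vms if vm[2] and vm[1] in RESTART_ORDER),
        key=lambda vm: RESTART_ORDER[vm[1]],
    )
    register_only = [vm for vm in vms if not vm[2] or vm[1] == "Disabled"]
    return [vm[0] for vm in to_restart], [vm[0] for vm in register_only]


affected = [
    ("web", "Low", True),
    ("db", "High", True),
    ("app", "Medium", True),
    ("test", "Disabled", True),
    ("archive", "High", False),
]
restart_sequence, registered = plan_failover(affected)
print(restart_sequence)
print(registered)
```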
Factors Considered for Virtual Machine Restarts

After a failure, the cluster's master host attempts to restart affected virtual machines by identifying a host that can power them on. When choosing such a host, the master host considers a number of factors.
Virtual Machine Restart Notifications

vSphere HA generates a cluster event when a failover operation is in progress for virtual machines in the cluster. The event also displays a configuration issue in the Cluster Summary tab, which reports the number of virtual machines that are being restarted. There are four different categories of such VMs.
The default settings for monitoring sensitivity are described in Table 2-1. You can also specify custom values for both monitoring sensitivity and the I/O stats interval by selecting the Custom checkbox.

Table 2‑1. VM Monitoring Settings

Setting   Failure Interval (seconds)   Reset Period
High      30                           1 hour
Medium    60                           24 hours
Low       120                          7 days

After failures are detected, vSphere HA resets virtual machines. The reset ensures that services remain available.
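The defaults in Table 2-1 can be written as a small lookup table. The sketch below assumes that a failure is detected once a virtual machine has been silent for at least the failure interval. That interpretation, and the names used, are assumptions for illustration and are not stated in this section.

```python
from datetime import timedelta

# Default VM monitoring settings from Table 2-1:
# sensitivity -> (failure interval in seconds, reset period)
VM_MONITORING_DEFAULTS = {
    "High": (30, timedelta(hours=1)),
    "Medium": (60, timedelta(hours=24)),
    "Low": (120, timedelta(days=7)),
}


def failure_detected(sensitivity: str, seconds_without_heartbeat: int) -> bool:
    """Assumed rule: silence for at least the failure interval counts as a failure."""
    failure_interval, _reset_period = VM_MONITORING_DEFAULTS[sensitivity]
    return seconds_without_heartbeat >= failure_interval


# 45 seconds of silence exceeds the High interval but not the Low interval.
print(failure_detected("High", 45))
print(failure_detected("Low", 45))
```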
Configuring VMCP

VM Component Protection is enabled and configured in the vSphere Web Client. To enable this feature, you must select the Protect against Storage Connectivity Loss checkbox in the edit cluster settings wizard. The storage protection levels you can choose and the virtual machine remediation actions available differ depending on the type of datastore accessibility failure.
• T=320s: vSphere HA now starts the APD recovery response after the Delay for VM failover for APD elapses (3 minutes after the APD Timeout is reached).

Network Partitions

When a management network failure occurs for a vSphere HA cluster, a subset of the cluster's hosts might be unable to communicate over the management network with the other hosts. Multiple partitions can occur in a cluster.
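Returning to the T=320s entry in the APD timeline, the figure can be checked with simple arithmetic. The sketch below assumes the APD condition begins at T=0 and infers a 140-second APD Timeout from the two numbers given (320 seconds minus the 3-minute delay). The timeout value is not stated in this excerpt, and the names are hypothetical.

```python
APD_TIMEOUT_S = 140          # inferred: 320 s minus the 3-minute failover delay
VM_FAILOVER_DELAY_S = 3 * 60  # Delay for VM failover for APD (3 minutes)


def apd_recovery_start(apd_start_s: int = 0,
                       apd_timeout_s: int = APD_TIMEOUT_S,
                       delay_s: int = VM_FAILOVER_DELAY_S) -> int:
    """Time at which the APD recovery response begins, in seconds."""
    return apd_start_s + apd_timeout_s + delay_s


print(apd_recovery_start())
```

With these values the function returns 320, matching the timeline entry. Changing either the timeout or the delay shifts the start of the recovery response by the same amount.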
vSphere HA Security

vSphere HA is enhanced by several security features.

Select firewall ports opened
    vSphere HA uses TCP and UDP port 8182 for agent-to-agent communication. The firewall ports open and close automatically to ensure they are open only when needed.

Configuration files protected using file system permissions
    vSphere HA stores configuration information on the local storage or on ramdisk if there is no local datastore.
vSphere HA Admission Control

vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected. Three types of admission control are available.

Host
    Ensures that a host has sufficient resources to satisfy the reservations of all virtual machines running on it.
3 Determines the Current Failover Capacity of the cluster. This is the number of hosts that can fail and still leave enough slots to satisfy all of the powered-on virtual machines.
4 Determines whether the Current Failover Capacity is less than the Configured Failover Capacity (provided by the user). If it is, admission control disallows the operation.
Advanced Runtime Info

When you select the Host Failures Cluster Tolerates admission control policy, the Advanced Runtime Info pane appears in the vSphere HA section of the cluster's Monitor tab in the vSphere Web Client. This pane displays the following information about the cluster:
• Slot size.
• Total slots in cluster. The sum of the slots supported by the good hosts in the cluster.
• Used slots.
Figure 2‑1. Admission Control Example with Host Failures Cluster Tolerates Policy

[Figure: five powered-on virtual machines and three hosts. The values shown in the figure are listed below.]

Virtual machine   CPU    Memory
VM1               2GHz   1GB
VM2               2GHz   1GB
VM3               1GHz   2GB
VM4               1GHz   1GB
VM5               1GHz   1GB

Slot size: 2GHz, 2GB

Host   CPU    Memory   Slots
H1     9GHz   9GB      4
H2     9GHz   6GB      3
H3     6GHz   6GB      3

6 slots remaining if H1 fails

1 Slot size is calculated by comparing both the CPU and memory requirements of the virtual machines and selecting the largest.
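The numbers in Figure 2-1 can be reproduced with a short calculation. This is a sketch under stated assumptions rather than vCenter Server's implementation. Resources are whole GHz and GB as in the figure, each powered-on virtual machine needs one slot, and hosts are assumed to fail largest first when the failover capacity is counted. The function names are hypothetical.

```python
# (CPU GHz, memory GB) requirements of the powered-on VMs in Figure 2-1.
VMS = [(2, 1), (2, 1), (1, 2), (1, 1), (1, 1)]
# Host name -> (CPU GHz, memory GB) available for virtual machines.
HOSTS = {"H1": (9, 9), "H2": (9, 6), "H3": (6, 6)}


def slot_size(vms):
    """Largest CPU requirement and largest memory requirement across all VMs."""
    return max(cpu for cpu, _ in vms), max(mem for _, mem in vms)


def slots_per_host(hosts, slot):
    """A host supports as many slots as its scarcer resource allows."""
    slot_cpu, slot_mem = slot
    return {
        name: min(cpu // slot_cpu, mem // slot_mem)
        for name, (cpu, mem) in hosts.items()
    }


def current_failover_capacity(slots, slots_needed):
    """Count host failures (largest host first, an assumption) that still
    leave enough slots for every powered-on VM."""
    remaining = sorted(slots.values(), reverse=True)
    failures = 0
    while remaining and sum(remaining[1:]) >= slots_needed:
        remaining.pop(0)
        failures += 1
    return failures


slot = slot_size(VMS)
slots = slots_per_host(HOSTS, slot)
capacity = current_failover_capacity(slots, slots_needed=len(VMS))
print(slot, slots, capacity)
```

With the figure's values, the slot size is 2GHz and 2GB, the hosts support 4, 3, and 3 slots, and one host failure can be tolerated because 6 slots remain for the 5 virtual machines if H1 fails.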
vSphere HA uses the actual reservations of the virtual machines. If a virtual machine does not have reservations, meaning that the reservation is 0, a default of 0MB memory and 32MHz CPU is applied.

NOTE The Percentage of Cluster Resources Reserved admission control policy also checks that there are at least two vSphere HA-enabled hosts in the cluster (excluding hosts that are entering maintenance mode).
Figure 2‑2. Admission Control Example with Percentage of Cluster Resources Reserved Policy

[Figure: five powered-on virtual machines and three hosts. The values shown in the figure are listed below.]

Virtual machine   CPU    Memory
VM1               2GHz   1GB
VM2               2GHz   1GB
VM3               1GHz   2GB
VM4               1GHz   1GB
VM5               1GHz   1GB

Total resource requirements: 7GHz, 6GB

Host   CPU    Memory
H1     9GHz   9GB
H2     9GHz   6GB
H3     6GHz   6GB

Total host resources: 24GHz, 21GB

The total resource requirements for the powered-on virtual machines are 7GHz and 6GB. The total host resources available for virtual machines are 24GHz and 21GB.
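The totals above are enough to work out a failover capacity percentage for each resource. The sketch below assumes that the current failover capacity is the share of total host resources not needed by powered-on virtual machines, rounded down to a whole percent. That formula is not stated in this excerpt, and the 25% configured value is a hypothetical setting used only to show the comparison.

```python
def failover_capacity_percent(total: int, required: int) -> int:
    """Share of total resources left after VM requirements, rounded down."""
    return (100 * (total - required)) // total


def operation_allowed(current_pct: int, configured_pct: int) -> bool:
    """Assumed rule: the operation is allowed while the current capacity
    is not below the configured capacity."""
    return current_pct >= configured_pct


cpu_pct = failover_capacity_percent(total=24, required=7)   # GHz
mem_pct = failover_capacity_percent(total=21, required=6)   # GB
print(cpu_pct, mem_pct)

CONFIGURED_PCT = 25  # hypothetical configured failover capacity
print(operation_allowed(cpu_pct, CONFIGURED_PCT)
      and operation_allowed(mem_pct, CONFIGURED_PCT))
```

Under these assumptions the example cluster has 70% of its CPU and 71% of its memory available for failover.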
Choosing an Admission Control Policy

You should choose a vSphere HA admission control policy based on your availability needs and the characteristics of your cluster. When choosing an admission control policy, you should consider a number of factors.

Avoiding Resource Fragmentation

Resource fragmentation occurs when there are enough resources in aggregate for a virtual machine to be failed over.
• The cluster must have a minimum of three ESXi hosts.

Networking Differences

Virtual SAN has its own network. When Virtual SAN and vSphere HA are enabled for the same cluster, the HA interagent traffic flows over this storage network rather than the management network. The management network is used by vSphere HA only when Virtual SAN is disabled. vCenter Server chooses the appropriate network when vSphere HA is configured on a host.
Using vSphere HA and DRS Together

Using vSphere HA with Distributed Resource Scheduler (DRS) combines automatic failover with load balancing. This combination can result in a more balanced cluster after vSphere HA has moved virtual machines to different hosts. When vSphere HA performs failover and restarts virtual machines on different hosts, its first priority is the immediate availability of all virtual machines.
• HA should respect VM to Host affinity rules during failover: vSphere HA attempts to place VMs with this rule on the specified hosts if at all possible.

NOTE vSphere HA can restart a VM in a DRS-disabled cluster, overriding a VM-Host affinity rules mapping if the host failure happens soon (by default, within 5 minutes) after setting the rule.

Other vSphere HA Interoperability Issues

To use vSphere HA, you must be aware of the following additional interoperability issues.
You can enable and configure vSphere HA before you add host nodes to the cluster. However, until the hosts are added, your cluster is not fully operational and some of the cluster settings are unavailable. For example, the Specify a Failover Host admission control policy is unavailable until there is a host that can be designated as the failover host.
Create a vSphere HA Cluster

To enable your cluster for vSphere HA, you must first create an empty cluster. After you plan the resources and networking architecture of your cluster, use the vSphere Web Client to add hosts to the cluster and specify the cluster's vSphere HA settings. A vSphere HA-enabled cluster is a prerequisite for Fault Tolerance.

Prerequisites
• Verify that all virtual machines and their configuration files reside on shared storage.
What to do next

Configure the vSphere HA settings as appropriate for your cluster.
• Failure conditions and VM response
• Admission Control
• Datastore for Heartbeating
• Advanced Options

See “Configuring vSphere HA Cluster Settings.”

Configuring vSphere HA Cluster Settings

When you create a vSphere HA cluster or configure an existing cluster, you must configure settings that determine how the feature works.
Option: Response for Datastore with Permanent Device Loss (PDL)
Description: This setting determines what VMCP does in the case of a PDL failure. You can choose to have it Issue Events or Power off and restart VMs.

Option: Response for Datastore with All Paths Down (APD)
Description: This setting determines what VMCP does in the case of an APD failure. You can choose to have it Issue Events or Power off and restart VMs conservatively or aggressively.
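The two VMCP response settings above accept different choices, which can be captured in a small validation table. This is an illustrative sketch only. The response strings mirror the wording of this section, the dictionary and function names are hypothetical, and the vSphere Web Client may offer further choices not listed here.

```python
# Responses named in this section for each datastore failure type.
ALLOWED_VMCP_RESPONSES = {
    "PDL": {"Issue Events", "Power off and restart VMs"},
    "APD": {
        "Issue Events",
        "Power off and restart VMs (conservative)",
        "Power off and restart VMs (aggressive)",
    },
}


def validate_vmcp_response(failure_type: str, response: str) -> str:
    """Return the response if it is valid for the failure type, else raise."""
    allowed = ALLOWED_VMCP_RESPONSES[failure_type]
    if response not in allowed:
        raise ValueError(
            f"{response!r} is not a valid {failure_type} response; "
            f"choose one of {sorted(allowed)}"
        )
    return response


print(validate_vmcp_response("APD", "Power off and restart VMs (aggressive)"))
```

The conservative and aggressive variants apply only to APD in this sketch, so passing either one for a PDL failure raises a ValueError.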
Procedure
1 In the vSphere Web Client, browse to the vSphere HA cluster.
2 Click the Manage tab and click Settings.
3 Under Settings, select vSphere HA and click Edit.
4 Expand Datastore for Heartbeating to display the configuration options for datastore heartbeating.
5 To instruct vSphere HA about how to select the datastores and how to treat your preferences, choose from the following options:

Table 2‑3.
vSphere HA Advanced Options

You can set advanced options that affect the behavior of your vSphere HA cluster.

Table 2‑4. vSphere HA Advanced Options

Option: das.isolationaddress[...]
Description: Sets the address to ping to determine if a host is isolated from the network. This address is pinged only when heartbeats are not received from any other host in the cluster. If not specified, the default gateway of the management network is used.
Table 2‑4. vSphere HA Advanced Options (Continued)

Option: fdm.isolationpolicydelaysec
Description: The number of seconds the system waits before executing the isolation policy once it is determined that a host is isolated. The minimum value is 30. If set to a value less than 30, the delay will be 30 seconds.

Option: das.respectvmvmantiaffinityrules
Description: Determines if vSphere HA enforces VM-VM anti-affinity rules.
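The minimum-value rule for fdm.isolationpolicydelaysec amounts to clamping the configured value at 30 seconds. A minimal sketch of that rule follows; the function and constant names are hypothetical.

```python
MIN_ISOLATION_POLICY_DELAY_S = 30  # minimum stated for fdm.isolationpolicydelaysec


def effective_isolation_policy_delay(configured_s: int) -> int:
    """Values below the minimum are raised to 30 seconds; others are kept."""
    return max(configured_s, MIN_ISOLATION_POLICY_DELAY_S)


print(effective_isolation_policy_delay(10))
print(effective_isolation_policy_delay(45))
```

A configured value of 10 therefore behaves as 30, while 45 is used unchanged.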
Customize an Individual Virtual Machine

Each virtual machine in a vSphere HA cluster is assigned the cluster default settings for VM Restart Priority, Host Isolation Response, VM Component Protection, and VM Monitoring. You can specify different behavior for an individual virtual machine by changing these defaults. If the virtual machine leaves the cluster, these settings are lost.

Procedure
1 In the vSphere Web Client, browse to the vSphere HA cluster.
Networks Used for vSphere HA Communications

To identify which network operations might disrupt the functioning of vSphere HA, you should know which management networks are being used for heartbeating and other vSphere HA communications.
• On legacy ESX hosts in the cluster, vSphere HA communications travel over all networks that are designated as service console networks. VMkernel networks are not used by these hosts for vSphere HA communications.
In most implementations, NIC teaming provides sufficient heartbeat redundancy, but as an alternative you can create a second management network connection attached to a separate virtual switch. Redundant management networking allows the reliable detection of failures and prevents isolation or partition conditions from occurring, because heartbeats can be sent over multiple networks. The original management network connection is used for network and management purposes.
Best Practices for Admission Control

Observe the following best practices for the configuration and usage of admission control for vSphere HA.
• Select the Percentage of Cluster Resources Reserved admission control policy. This policy offers the most flexibility in terms of host and virtual machine sizing.
A cluster enabled for vSphere HA becomes invalid when the number of virtual machines powered on exceeds the failover requirements, that is, the current failover capacity is smaller than the configured failover capacity. If admission control is disabled, clusters do not become invalid. In the vSphere Web Client, select vSphere HA from the cluster's Monitor tab and then select Configuration Issues. A list of current vSphere HA issues appears.
Chapter 3 Providing Fault Tolerance for Virtual Machines

You can utilize vSphere Fault Tolerance for your virtual machines to ensure business continuity with higher levels of availability and data protection than is offered by vSphere HA. Fault Tolerance is built on the ESXi host platform, and it provides continuous availability by having identical virtual machines run on separate hosts.
A fault tolerant virtual machine and its secondary copy are not allowed to run on the same host. This restriction ensures that a host failure cannot result in the loss of both VMs.

NOTE You can also use VM-Host affinity rules to dictate which hosts designated virtual machines can run on. If you use these rules, be aware that for any Primary VM that is affected by such a rule, its associated Secondary VM is also affected by that rule.
CPUs that are used in host machines for fault tolerant VMs must be compatible with vSphere vMotion or improved with Enhanced vMotion Compatibility. Also, CPUs that support Hardware MMU virtualization (Intel EPT or AMD RVI) are required. The following CPUs are supported.
• Intel Sandy Bridge or later. Avoton is not supported.
• AMD Bulldozer or later.

Use a 10-Gbit logging network for FT and verify that the network is low latency.
• Linked clones. You cannot use Fault Tolerance on a virtual machine that is a linked clone, nor can you create a linked clone from an FT-enabled virtual machine.
• VM Component Protection (VMCP). If your cluster has VMCP enabled, overrides are created for fault tolerant virtual machines that turn this feature off.
• Virtual Volume datastores.
• Storage-based policy management.
• I/O filters.
Using Fault Tolerance with DRS

You can use vSphere Fault Tolerance with vSphere Distributed Resource Scheduler (DRS) only when the Enhanced vMotion Compatibility (EVC) feature is enabled. This process allows fault tolerant virtual machines to benefit from better initial placement.
Host Requirements for Fault Tolerance

You must meet the following host requirements before you use Fault Tolerance.
• Hosts must use supported processors.
• Hosts must be licensed for Fault Tolerance.
• Hosts must be certified for Fault Tolerance. See http://www.vmware.com/resources/compatibility/search.php and select Search by Fault Tolerant Compatible Sets to determine if your hosts are certified.
Prerequisites

Multiple gigabit Network Interface Cards (NICs) are required. For each host supporting Fault Tolerance, a minimum of two physical NICs is recommended. For example, you need one dedicated to Fault Tolerance logging and one dedicated to vMotion. Use three or more NICs to ensure availability.

NOTE The vMotion and FT logging NICs must be on different subnets. If you are using legacy FT, IPv6 is not supported on the FT logging NIC.
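The subnet requirement in the note above can be checked before you configure the NICs. The following sketch uses Python's standard ipaddress module. The function name and the example addresses are hypothetical, and the check covers only the subnet rule, not the other networking prerequisites.

```python
import ipaddress


def on_different_subnets(vmotion_cidr: str, ft_logging_cidr: str) -> bool:
    """Return True when the two interface addresses fall in non-overlapping subnets.

    Each argument is an interface address with prefix length, such as
    "192.168.10.5/24".
    """
    vmotion_net = ipaddress.ip_interface(vmotion_cidr).network
    ft_net = ipaddress.ip_interface(ft_logging_cidr).network
    return not vmotion_net.overlaps(ft_net)


# Separate /24 subnets satisfy the requirement.
print(on_different_subnets("192.168.10.5/24", "192.168.20.5/24"))
# Two addresses in the same /24 subnet do not.
print(on_different_subnets("192.168.10.5/24", "192.168.10.99/24"))
```

Using an overlap test rather than an equality test also rejects the case where one subnet is contained in the other.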
Validation Checks for Turning On Fault Tolerance

If the option to turn on Fault Tolerance is available, this task still must be validated and can fail if certain requirements are not met. Several validation checks are performed on a virtual machine before Fault Tolerance can be turned on.
• SSL certificate checking must be enabled in the vCenter Server settings.
• The host must be in a vSphere HA cluster or a mixed vSphere HA and DRS cluster.
• The host must have ESXi 6.
Turn On Fault Tolerance

You can turn on vSphere Fault Tolerance through the vSphere Web Client. When Fault Tolerance is turned on, vCenter Server resets the virtual machine's memory limit and sets the memory reservation to the memory size of the virtual machine. While Fault Tolerance remains turned on, you cannot change the memory reservation, size, limit, number of vCPUs, or shares. You also cannot add or remove disks for the VM.
Fault Tolerance is turned off for the selected virtual machine. The history and the secondary virtual machine for the selected virtual machine are deleted.

Suspend Fault Tolerance

Suspending vSphere Fault Tolerance for a virtual machine suspends its Fault Tolerance protection, but preserves the Secondary VM, its configuration, and all history. Use this option to resume Fault Tolerance protection in the future.
Test Restart Secondary

You can induce the failure of a Secondary VM to test the Fault Tolerance protection provided for a selected Primary VM. This option is unavailable (dimmed) if the virtual machine is powered off.

Procedure
1 In the vSphere Web Client, browse to the Primary VM for which you want to conduct the test.
2 Right-click the virtual machine and select Fault Tolerance > Test Restart Secondary.
Host Configuration

Hosts running the Primary and Secondary VMs should operate at approximately the same processor frequencies, otherwise the Secondary VM might be restarted more frequently. Platform power management features that do not adjust based on workload (for example, power capping and enforced low frequency modes to save power) can cause processor frequencies to vary greatly.
For virtual machines with Fault Tolerance enabled, you might use ISO images that are accessible only to the Primary VM. In such a case, the Primary VM can access the ISO, but if a failover occurs, the CD-ROM reports errors as if there is no media. This situation might be acceptable if the CD-ROM is being used for a temporary, noncritical operation such as a patch.
Table 3‑2. Differences Between Legacy FT and FT (Continued)

vStorage APIs - Data Protection backups
    Legacy FT: Not supported
    FT: Supported

Eager-zeroed thick .vmdk disk files
    Legacy FT: Required
    FT: Not required because FT supports all disk file types, including thick and thin

.vmdk redundancy
    Legacy FT: Only a single copy
    FT: Primary VMs and Secondary VMs always maintain independent copies, which can be placed on different datastores to increase redundancy.
Enable Legacy Fault Tolerance

To use legacy Fault Tolerance, you must configure an advanced option for the virtual machine. Legacy FT can be used only with single vCPU virtual machines that are not already using FT. To enable legacy FT for each VM that is to use it, you must set the vm.uselegacyft advanced option to a value of true.

Procedure
1 In the vSphere Web Client, browse to the virtual machine.