MediaCentral Platform Services Concepts and Clustering Guide
Legal Notices Product specifications are subject to change without notice and do not represent a commitment on the part of Avid Technology, Inc. This product is subject to the terms and conditions of a software license agreement provided with the software. The product may only be used in accordance with the license agreement. This product may be protected by one or more U.S. and non-U.S. patents. Details are available at www.avid.com/patents. This document is protected under copyright law.
Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation. This software is provided "as is" without express or implied warranty. Copyright 1996 Daniel Dardailler.
Avid Interplay contains components licensed from LavanTech. These components may only be used as part of and in connection with Avid Interplay. This product includes FFmpeg, which is covered by the GNU Lesser General Public License. This product includes software that is based in part on the work of the FreeType Team. This software is based in part on the work of the Independent JPEG Group. This product includes libjpeg-turbo, which is covered by the wxWindows Library License, Version 3.1.
Trademarks 003, 192 Digital I/O, 192 I/O, 96 I/O, 96i I/O, Adrenaline, AirSpeed, ALEX, Alienbrain, AME, AniMatte, Archive, Archive II, Assistant Station, AudioPages, AudioStation, AutoLoop, AutoSync, Avid, Avid Active, Avid Advanced Response, Avid DNA, Avid DNxcel, Avid DNxHD, Avid DS Assist Station, Avid Ignite, Avid Liquid, Avid Media Engine, Avid Media Processor, Avid MEDIArray, Avid Mojo, Avid Remote Response, Avid Unity, Avid Unity ISIS, Avid VideoRAID, AvidRAID, AvidShare, AVIDstripe, AVX, Beat Detect
Contents
Using This Guide .......... 9
Chapter 1  Overview .......... 10
Single Server Deployments .......... 11
Multi-Server Deployments
Verifying the Startup Configuration for Avid Services .......... 49
Services Start Order and Dependencies .......... 50
Chapter 4  Validating the Cluster .......... 53
Verifying Node Connectivity .......... 53
Verifying the "Always-On" IP Address
Performing a Rolling Reboot .......... 94
Chapter 7  User Management .......... 95
Identifying Connected Users and Sessions .......... 95
Backing Up the UMS Database
Using This Guide
This guide is intended for the individuals responsible for installing, maintaining, or performing administrative tasks on an Avid MediaCentral Platform Services (MCS) system. This document serves as an educational tool, providing background and technical information on MCS. Additionally, it explains the specifics of an MCS cluster, how each service operates in a cluster, and provides guidance on best practices for cluster administration.
1 Overview MediaCentral Platform Services (MCS) is a collection of services running on one or more servers, providing a base infrastructure for solutions including MediaCentral UX, Media Composer Cloud, and Interplay MAM. Multiple MCS servers can be grouped together in a cluster configuration to provide high-availability and increased scale. Every server in a cluster is identified as a “node”. The first two nodes in a cluster are known as the primary (master) and secondary (slave).
• Replicated Cache. The media transcoded by one node in the cluster is automatically replicated on the other nodes. If another node receives the same playback request, the media is available without the need to re-transcode.
• Cluster Monitoring. A cluster resource monitor lets you actively monitor the status of the cluster. In addition, if a node fails or a serious problem is detected, designated system administrators are alerted to the issue through an automatically generated e-mail.
Multi-Server Deployments Two or more MCS servers connect to each other through clustering software installed and configured on each server. In a basic deployment, a cluster consists of a master/slave pair of nodes configured for high-availability. All MCS traffic is routed through the master node which is running all MCS services. Select MCS services and databases are replicated to the slave node.
How Failover Works
Failover in MCS operates at two distinct levels: service and node, both of which are managed by a cluster monitoring system. If a service fails, it is quickly restarted by the cluster monitor, which also tracks the service's fail count. If the service fails too often (or cannot be restarted), the cluster monitor gives responsibility for the service to the standby node in the cluster, in a process referred to as a failover. A service restart in itself is not enough to trigger a failover.
Note: In a correctly sized cluster, a single node can fail and the cluster will properly service its users. However, if two nodes fail, the remaining servers are likely under-provisioned for expected use and will be oversubscribed. Users should expect reduced performance in this scenario. If the primary and secondary nodes both fail, the system will be unavailable until the situation is resolved.
The master node is treated differently in that 30% of its CPU capacity is always reserved for the duties performed by the master node alone, which include serving the UI, handling logins and user session information, and so on. When the system is under heavy usage, the master node will not take on additional playback jobs. All other nodes can reach 100% CPU saturation to service playback requests. The following illustration shows a typical load-balanced cluster.
Working with Linux
Red Hat Enterprise Linux (RHEL) is a commercially supported, open source version of the Linux operating system. If you have run DOS commands in Windows or have used the Mac terminal window, the Linux environment will be familiar to you. While many aspects of the MCS installation are automated, much of it requires entering commands and editing files using the Linux command-line.
Note: RHEL is not free, and Avid does not redistribute it or include it as part of the MCS installation.
Key Linux Directories
Like other file systems, the Linux filesystem is represented as a hierarchical tree. In Linux, certain directories are reserved for particular purposes. The following table presents some of the key Linux directories encountered during the MCS installation and configuration:
Directory   Description
/           The root of the filesystem.
/dev        Contains device files, including those identifying hard drive partitions, USB and CD drives, and so on.
Linux Command Line
The Linux command line is a powerful tool that lets you perform both simple and complex actions with speed and ease.
Command   Description
dmesg     Displays messages from the Linux kernel buffer. Useful to see if a device (such as a USB key) mounted correctly.
find      Searches for files. For example, the following use of the find command searches for <filename> on all local filesystems (avoiding network mounts):
          find / -mount -name <filename>
grep      Searches for the named regular expression.
Command   Description
tail      Shows you the last 10 (or n) lines in a file:
          tail <filename>
          tail -50 <filename>
          tail -f <filename>
          The "-f" option keeps the tail command outputting appended data as the file grows. Useful for monitoring log files.
udevadm   Requests device events from the Linux kernel. Can be used to replay device events and create/update the 70-persistent-net.rules file, e.g.:
          udevadm trigger --action=add
vi        Starts a vi editing session.
The following table presents a few of the more useful vi commands:
Key Press   Description
i           Insert text before the cursor, until you press <Esc>
I           Insert text at beginning of current line
a           Insert text after the cursor
A           Insert text at end of current line
w           Next word
b           Previous word
Shift-g     Move cursor to last line of the file
D           Delete remainder of line
x           Delete character under the cursor
dd          Delete current line
yy          "Yank" (copy) a whole line in command mode.
Tip   Description
"command not found" error   A common experience for users new to the Linux command line is to receive a "command not found" error after invoking a command or script that is definitely in the current directory. Linux has a PATH variable, but for reasons of security, the current directory — "." in Linux — is not included in it by default. To run a script located in the current directory, prefix its name with "./" (for example, ./myscript.sh).
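As a quick illustration (the script name here is only a placeholder), invoking a script in the current directory by its bare name fails, while the "./" form succeeds:
$ myscript.sh
-bash: myscript.sh: command not found
$ ./myscript.sh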
2 System Architecture
MediaCentral Platform Services comprises multiple systems: messaging systems, user management services, cluster management infrastructure, and so on. While many of these systems are independent, they must work together to create a cohesive environment. The following diagram shows how these systems operate at distinct layers of the architecture.
The following table explains the role of each layer: System Architecture Layer Description Client Applications MCS clients are defined as any system that takes advantage of the MCS platform. Clients can range in complexity from a single MediaCentral UX session on a web browser to a complex system such as Interplay MAM. Additional client examples include Media Composer Cloud, and MediaCentral UX on a mobile device.
System Architecture Layer   Description
Load-Balancing Services     The mid-level service layer includes the services that run on all servers, regardless of a single server or cluster configuration. In a cluster, these services are load-balanced.
• AvidConnectivityMon - Verifies that the "always on" cluster IP is reachable.
• AvidAll - Encapsulates all other ICPS back-end services.
• AvidICPS - Interplay Central Playback Services: Transcodes and serves transcoded media.
Databases
System Architecture Layer   Description
File systems   The standard Linux file system. This layer also conceptually includes GlusterFS, the Gluster "network file system" used for cache replication. GlusterFS performs its replication at the file level. Unlike the Linux file system, GlusterFS operates in "user space", the advantage being that any GlusterFS malfunction does not bring down the system.
Hardware   At the lowest layer is the server hardware.
Cluster Networking • Virtual IP Address (unicast) During the configuration process, a unicast IP address is assigned to the cluster. This IP is associated with a virtual hostname in the site’s DNS system. Clients use these virtual identifiers to communicate with the cluster. If a cluster node is offline, clients are still able to communicate with the cluster using the virtual host name or IP. The virtual IP address is managed by the cluster in the form of the AvidClusterIP resource.
Note: HP servers identify network adapters with an "eth" prefix, whereas Dell servers identify the adapters as "em1", "p1p1", or "p2p1". The following is true for the example above:
• "eth0" is the node IP address. This is the IP address of the server. Each node will have a listing for this. In this example, "192.168.10.51" is the unicast IP address for this node. This physical adapter has a state of "UP" which means the adapter is available and active.
MCS Services, Resources and Cluster Databases
The following table lists the main MCS services and resources managed by Pacemaker, and where they run (Node 1 = Master, Node 2 = Slave):
• IPC Core Services, "the middleware" (avid-interplay-central) - resource AvidIPC: ON on Node 1; OFF on all other nodes.
• User Management Service (avid-ums) - resource AvidUMS: ON on Node 1; OFF on all other nodes.
• UMS session cache service (redis) - resource Redis: ON on Node 1; OFF on all other nodes.
• User Setting Service (avid-uss) - resource AvidUSS: ON on Node 1; OFF on all other nodes.
The following table lists the bus-dependent services:
• AAF Generator* (avid-aaf-gen): ON on all nodes.
• MCS Messaging (avid-acs-messenger & avid-acs-mail): ON on all nodes.
* The AAF Generator runs on all nodes. However, since it is used by the MCS Core Service ("the middleware"), it is only in operation on the master and slave nodes.
Clustering Infrastructure Services
The MCS services and databases presented in the previous section depend on a functioning clustered infrastructure. The infrastructure is supported by a small number of open-source software components designed specifically (or very well suited) for clustering.
RabbitMQ
RabbitMQ is the message broker ("task queue") used by the MCS top level services. MCS makes use of RabbitMQ in an active/active configuration, with all queues mirrored to exactly two nodes, and partition handling set to ignore. The RabbitMQ cluster operates independently of the MCS master/slave corosync cluster, but is often co-located on the same two nodes. The MCS installation scripts create the RabbitMQ cluster without the need for human intervention.
Suggestions for Further Reading
• Clustering: http://www.rabbitmq.com/clustering.html
• Mirrored queues: http://www.rabbitmq.com/ha.html
• Network Partitions: http://www.rabbitmq.com/partitions.html
DRBD and Database Replication
Recall the file system layout of a typical node. The system drive (in RAID 1) consists of three partitions: sda1, sda2 and sda3. As noted earlier, sda2 is the partition used for the MCS databases, which are stored as PostgreSQL databases.
The following illustration shows DRBD volume mirroring of the sda2 partition across the master and slave.
Corosync and Pacemaker
Corosync and Pacemaker are independent systems that work closely together to create the core cluster monitoring and failover capabilities. Corosync is the messaging layer used by the cluster. Its primary purpose is to maintain awareness of node membership - nodes joining or leaving the cluster.
Disk and File System Layout
It is helpful to have an understanding of a system's disk and file system layout. The following illustration represents the layout of a typical MCS server:
The above illustration shows a set of two drives in bays 1 and 2 in a RAID 1 configuration. These drives house the operating system and MCS software. The drives in bays 3 - 8 are configured in a RAID 5 group for the purpose of storing and streaming the transcoded media in the /cache folder.
• sda3 contains the system swap disk and the root partition.
• sdb1 is the RAID 5 cache volume used to store transcoded media and various other temporary files.
The following configurations require a RAID 5 volume as a temporary file cache:
• MediaCentral UX installations that intend to stream media to iOS or Android mobile devices. In this case, the media on ISIS is transcoded to MPEG-TS (MPEG-2 transport stream) and stored locally in the MCS server's /cache folder.
Gluster and Cache Replication The replication process is controlled by Gluster, an open source software solution for creating shared file systems. In MCS, Gluster manages data replication using its own highly efficient network protocol. In this respect, it can be helpful to think of Gluster as a “network file system” or even a “network RAID” system. Gluster operates independently of other clustering services.
3 Services and Resources Services are highly important to the operation and health of an MCS system. As noted in “System Architecture” on page 23, services are responsible for all aspects of MCS activity, from the ACS bus, to end-user management and transcoding. Additional services supply the clustering infrastructure. In a cluster, some MCS services are managed by Pacemaker, for the purposes of high-availability and failover readiness. Services overseen by Pacemaker are called resources.
Tables of Services and Resources
The tables in this section provide lists of essential services that need to be running on single-node and clustered configurations. It includes four tables:
• Single Server: The services that must be running in a single server deployment.
• Cluster - Master Node Only: The services that must be running on the master node only.
Service   Description
avid-acs-messenger   The services related to the IPC end-user messaging feature:
• "messenger" service (handles delivery of user messages)
• "mail" service (handles mail-forwarding feature)
This service registers itself on the ACS bus. All instances are available for handling requests, which are received by way of the bus via a round-robin-type distribution system.
Service   Description
avid-mpd (if installed)   Services related to Media Distribute include:
• avid-media-central-mpd
• avid-mpd
• servicemix
Operates similarly to the avid-acs-messenger service described above. This service is only available when Media Distribute (separate installer) is installed on the system.
Cluster - Master Node Only
The following table presents the services that must be running on a cluster master node.
Service   Description
redis   Redis is a key-value data store used to store user session data. This allows MCS to cache active session data and not continuously make calls to the PostgreSQL database to retrieve user information.
Service   Description
avid-aaf-gen   AAF Generator service, the service responsible for saving sequences. To reduce bottlenecks when the system is under heavy load, five instances of this service run concurrently, by default. Installed on all nodes but only used on the master or slave node, depending on where the IPC Core service (avid-interplay-central) is running.
Cluster - Pacemaker Resources
The following table lists the cluster resources overseen and managed by Pacemaker. For additional details, query the Cluster Resource Manager using the following command:
crm configure show
In the output that appears, "primitive" is the token that defines a cluster resource. The resources include AvidAll, AvidACS, AvidClusterMon, MongoDB, AvidUSS (avid-uss), and the DRBD / PostgreSQL resources (drbd and postgresql-9.1).
Resource   Description
AvidCCC    Encapsulates: avid-ccc
"Multi-Zone resources"    The following resources (and related services) are used in Multi-Zone configurations:
• pgpool (pgpool)
• pgpoolchecker (pgpoolchecker)
"Media Index resources"   The following resources (and related services) are used in Media Index configurations:
• AvidSearch (avid-acs-search)
• AvidSearchAutoComplete (avid-acs-autocomplete)
• AvidSearchConfig (avid-acs-media-index-configuration)
Interacting with Services
MCS services are standard Linux applications and/or daemons, and you interact with them using standard Linux commands and conventions.
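As a quick illustration, the standard Linux service command reports or changes the state of an individual service; the avid-ums service shown here is one used elsewhere in this guide:
service avid-ums status     # report whether the User Management Service is running
service avid-ums restart    # stop and then start the service
In a cluster, services that are managed by Pacemaker should normally be controlled through the cluster (for example, crm resource stop AvidUMS) rather than stopped directly, so that the cluster monitor does not treat the manual stop as a failure.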
Issuing the crm resource status command without specifying a resource returns the status of all cluster resources (similar to what you would see in the crm_mon tool). For more information, see the discussion of the Cluster Resource Monitor tool, crm_mon, in "Cluster Resource Monitor" on page 67.
Using the avid-ics Utility Script
"avid-ics" is a utility script (not a service) that can be used to verify the status of all the major MCS services. The script verifies the status of the following services:
• avid-all
• avid-interplay-central
• avid-acs-messenger
• acs-ctrl-core
• avid-ums
The utility script enables you to stop, start and view the status of all the services it encapsulates at once.
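A minimal sketch of the script in use; the exact subcommands assumed here (status, stop, start) follow the description above and should be verified on your system:
avid-ics status    # report the status of all encapsulated MCS services
avid-ics stop      # stop all encapsulated services
avid-ics start     # start all encapsulated services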
Services Start Order and Dependencies
When direct intervention with a service is required, take special care with regard to stopping, starting, or restarting. The services on a node operate within a framework of dependencies. Services must be stopped and started in a specific order. This order is particularly important when you have to restart an individual service (in comparison to rebooting the entire server).
The following table summarizes the order in which services can be safely started.
Start Order   Service Name                                 Process Name       Notes
1             DRBD                                         drbd               Only applies to cluster configurations.
2             PostgreSQL                                   postgresql-9.1
3             MongoDB                                      mongod
4             RabbitMQ                                     rabbitmq-server
5             Avid Common Service bus (ACS: "the bus")     acs-ctrl-core
6             Node.js
5. Restart UMS (#7).
6. Restart services #8, #9, and #12, in that order.
For a closer look at the start orders assigned to Linux services, see the content of the /etc/rc3.d directory. The files in this directory are prefixed Sxx or Kxx (e.g. S24, S26, K02). The prefix Sxx indicates the start order. Kxx indicates the shutdown order. You can list the content of a typical /etc/rc3.d directory with the ls command.
Note: The Linux start order as reflected in the /etc/rc3.d directory ...
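To examine the directory on a running server (standard RHEL commands; the entries you see depend on your installation):
ls /etc/rc3.d                    # list the Sxx (start) and Kxx (kill) links
ls -l /etc/rc3.d | grep avid     # show only the Avid-related entries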
4 Validating the Cluster
This chapter includes a series of tests for determining if the underlying systems upon which the MCS cluster is built are operating as expected. Many of the procedures in this chapter only need to be completed once, after the initial configuration of the cluster. However, if a new node has been added to the cluster or if conditions on the network have changed (for example, a network switch has been altered or replaced), cluster verification tests should be repeated.
Verifying Node Connectivity
Verifying the "Always-On" IP Address
The "pingable IP" or "always-on" IP address is used by the Avid Connectivity Monitor cluster components to determine if a particular node is still in the cluster. For example, if the Connectivity Monitor on a slave node can no longer communicate with the master node, it "pings" the always-on IP address (in practice, usually a router).
Verifying Node Connectivity The system responds by outputting its efforts to reach the specified host, and the results. For example, output similar to the following indicates success: PING wavd-mcs02.wavd.com (192.168.10.52) 56(84) bytes of data. 64 bytes from wavd-mcs02.wavd.com (192.168.10.52): icmp_seq=1 ttl=64 64 bytes from wavd-mcs02.wavd.com (192.168.10.52): icmp_seq=2 ttl=64 64 bytes from wavd-mcs02.wavd.com (192.168.10.52): icmp_seq=3 ttl=64 64 bytes from wavd-mcs02.wavd.com (192.168.10.
Repeat the traceroute tests to verify the routing to each node. Each node should have the same number of "hops". If one or more nodes has a different number of hops than the others, this should be investigated and optimized if possible.
Verifying DNS Host Name Resolution
It is important that the Domain Name System (DNS) servers correctly identify the nodes in the cluster. This is true of all physical nodes and the virtual cluster IP and hostname.
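For example, a forward lookup can be checked from any node using the dig utility; the hostname shown is one of the sample names used throughout this guide:
dig wavd-mcs01.wavd.com
# The ANSWER SECTION should return the node's IP address (192.168.10.51 in this guide's examples),
# and the header should report status: NOERROR.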
Additionally, the ">>HEADER<<" section indicates a status of NOERROR. This verifies that the DNS server (192.168.10.10 in this example) has a valid entry for the host in question.
Validating the FQDN for External Access
It is vital that the fully qualified domain name (FQDN) for each MCS server is resolvable by the domain name server (DNS) tasked with doing so. This is particularly important when MediaCentral will be accessed from the MediaCentral mobile application (iPad, iPhone or Android device) or when connecting from outside the corporate firewall through Network Address Translation (NAT).
3. Verify the output of the command.
Item   Description
xlb_node_full_name   The FQDN of the assigned node. If connecting to MediaCentral from outside the corporate firewall through NAT, this domain name must resolve to an external (public) IP address.
Note: An example of a failed connection from the Safari browser on an iOS device appears as follows: "Safari cannot open the page because the server cannot be found."
If you are still unsuccessful and you are not using NAT, an alternative option exists. MCS v2.0.2 added a feature for altering the "application.properties" file to instruct the MCS servers to return an IP address during the load-balancing handshake instead of a hostname.
Note: This process is not supported for single-server systems using NAT.
To adjust the application.properties file:
1. Log in to the MCS server as the 'root' user.
Verifying Time Synchronization
Verifying time synchronization across multiple networked servers in Linux is a challenge, and there is no simple way to do it that provides entirely satisfactory results. The major impediment is the nature of the Linux Network Time Protocol (NTP) itself. Time synchronization is particularly important in a cluster, since Pacemaker and Corosync rely on time stamps for accuracy in communication.
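One quick, if imperfect, check is to ask the NTP daemon on each node how closely it tracks its configured time sources (standard RHEL tooling; this supplements rather than replaces a full verification):
ntpq -p
# The "offset" column reports each node's drift from its time sources, in milliseconds.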
Verifying the Pacemaker / Corosync Cluster Status Verifying the Pacemaker / Corosync Cluster Status For all important events, such as a master node failover, the cluster sends automated e-mails to cluster administrator e-mail address(es). It is nevertheless important to regularly check up on the cluster manually. Recall that cluster resources are Linux services under management by Pacemaker.
Verifying the DRBD Status
Recall that DRBD is responsible for mirroring the MCS database on the two servers in the master/slave configuration. It does not run on any other nodes. In this section you run the DRBD drbd-overview utility to ensure there is connectivity between the two DRBD nodes, and to verify database replication is taking place.
Element   Description
Primary/Secondary   The roles for the local and peer (remote) DRBD resources. The local role is always presented first (i.e. local/peer).
• Primary - The active resource.
• Secondary - The resource that receives updates from its peer (the primary).
• Unknown - The resource's role is currently not known. This status is only ever displayed for the peer resource (i.e. Primary/Unknown).
UpToDate/UpToDate   The resource's disk state.
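On a healthy master node, drbd-overview output looks broadly like the following sketch, composed from the states described above; device numbers and sizes vary by system:
drbd-overview
  1:r0/0  Connected Primary/Secondary UpToDate/UpToDate C r----- /mnt/drbd ext4 20G 397M 18G 3%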
Verifying ACS Bus Functionality
The Avid Common Services bus ("the bus") provides essential bus services needed for the overall platform to work. Numerous services depend upon it, and will not start — or will throw serious errors — if the bus is not running. You can easily verify ACS bus functionality using the acs-query command. On a master node, this tests the ACS bus directly.
Verifying the AAF Generator Service To verify the status and/or stop the AAF Generator service: 1. Log in to both the master and slave nodes as root. Though the AAF Generator service is active in saving sequences only on the master node, you should verify its status on the slave node too, to prepare for any failover. 2.
5 Cluster Resource Monitor The easiest way to verify that all nodes are participating in the cluster and that all resources are up is through the Pacemaker Cluster Resource Monitor, crm_mon. This utility provides a real-time view of the cluster status including information on failures and failure counts. This section provides information to assist in interpreting the output of the Cluster Resource Monitor.
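To open the monitor from any node, enter the following; the -f flag, used throughout this guide, adds fail counts and the migration summary to the display:
crm_mon -f
# Press Ctrl-C to exit the monitor.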
Interpreting the Output of CRM 1) 2) 3) 4) 5) 6) 7) 8) 9) ============ Last updated: Thu Jul 16 16:20:01 2015 Last change: Mon Jul 13 10:06:51 2015 via crm_attribute on wavd-mcs02 Stack: classic openais (with plugin) Current DC: wavd-mcs04 - partition with quorum Version: 1.1.
Line(s)   Description
10        Lists the cluster nodes including their current status (online, offline, standby).
11-12     The AvidConnectivityMon resource monitors the pingable IP address specified during the cluster setup.
13        The resource that sends the automated e-mails.
14        The MongoDB resource.
15        The Redis resource.
16-19     The PostgreSQL resource group.
          · postgres_fs: Responsible for mounting the drbd device as a file system.
The master node can be identified in a number of ways:
• It is always the owner of the AvidClusterIP resource.
• It is listed as "master" under the drbd_postgres resource.
• It will be the owner of multiple other resources such as: MongoDB, AvidIPC, AvidUMS and more.
The slave node can be identified as "slave" under the drbd_postgres resource. It will also run additional load-balancing resources such as AvidICPS and AvidAll.
Note the total number of "Resources configured" at the top of the tool. There are 24 resources in the example image. The resources are identified in bold text and a count has been added on the right. Some resources run on the master node only while other resources, such as AvidICPS, run on multiple nodes. The counts listed on the right equal the total number of configured resources.
Identifying Failures in CRM
Started: [ wavd-mcs01 wavd-mcs02 wavd-mcs03 wavd-mcs04 ]
Migration summary:
* Node wavd-mcs01:
   Redis: migration-threshold=20 fail-count=5 last-failure='Wed Jul 15 16:46:45 2015'
   AvidUMS: migration-threshold=20 fail-count=3 last-failure='Wed Jul 15 15:26:30 2015'
   AvidACS: migration-threshold=20 fail-count=1 last-failure='Wed Jul 15 18:30:08 2015'
* Node wavd-mcs02:
   AvidConnectivityMon: migration-threshold=1000000 fail-count=1 last-failure='Wed Jul 15 18:30:49 2015'
* Node wavd-mc
Failures at the bottom of the tool can be cleared using the following command in a second terminal window (a terminal window other than the one showing crm_mon):
crm resource cleanup <resource> [<node>]
• <resource> is the resource name of interest: AvidIPC, AvidUMS, AvidACS, etc.
• <node> (optional) is the node of interest. Omitting the node cleans up the resource on all nodes.
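For example, to clear the AvidUMS failure recorded against node wavd-mcs01 in the output shown earlier:
crm resource cleanup AvidUMS wavd-mcs01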
Interpreting Failures in the Cluster
The following sections provide additional details on what users should expect from service, resource or node failures.
What impact does a failover have upon users?
Most service failures result in an immediate service restart on the same node in the cluster. In such cases, users generally do not notice the failure.
6 Cluster Maintenance and Administration
MCS is based on the Linux operating system, which is generally considered to be a very reliable platform; therefore, suggestions for regular maintenance are limited. Avid does not recommend regularly rebooting the MCS servers, as is often recommended for Windows-based systems. Server reboots should only be completed as part of troubleshooting efforts if the situation arises.
Adding Nodes to a Cluster
Additional nodes are often added to existing MCS clusters to add horizontal scale, which accommodates increased client capacity and system load. The process for adding a new node or nodes is similar to that of a new cluster installation. If the GlusterFS volume replication system has been configured on the existing nodes, Gluster needs to be installed and configured on the new node(s) as well. In the following process, "MCS Install Guide" refers to the v2.4 MediaCentral Platform Services Installation and Configuration Guide.
To Add Node(s) to GlusterFS
1. Complete "Starting GlusterFS" in the MCS Install Guide.
2. Complete "Creating the Trusted Storage Pool" in the MCS Install Guide. Only the new node or nodes need to be probed.
3. Similar to the gluster volume create command used in the "Configuring the GlusterFS Volumes" process found in the MCS Install Guide, you will use the add-brick command to add the new node to Gluster, as shown in the sketch below. Complete this step on a node other than the one you are adding.
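A minimal sketch of the add-brick command; the volume name (gl-cache-dl) and brick path shown here are assumptions based on a default MCS Gluster configuration, so substitute the names configured at your site and set the replica count to the new total number of nodes:
gluster volume add-brick gl-cache-dl replica <new-node-count> <new-node>:/cache/gluster/gluster_data_download
gluster volume info    # confirm the new brick appears and the volume is healthy
Repeat the add-brick command for each Gluster volume in your configuration.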
Permanently Removing a Node As discussed, a node can be temporarily removed from the cluster by putting it into standby. Permanently removing a node involves a reconfiguration of the Corosync / Pacemaker cluster and the GlusterFS shares. The following is an overview of the steps required to remove a node. In the following process, “MCS Install Guide” refers to the v2.4 MediaCentral Platform Services Installation and Configuration Guide.
All cluster nodes, including the one you want to remove, should be listed. Example:
[root@wavd-mcs02 etc]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@wavd-mcs02' ...
[{nodes,[{disc,['rabbit@wavd-mcs01','rabbit@wavd-mcs02', 'rabbit@wavd-mcs03']}]},
 {running_nodes,['rabbit@wavd-mcs01','rabbit@wavd-mcs02']},
 {cluster_name,<<"rabbit@wavd-mcs01">>},
 {partitions,[]}]
...done.
b. Stop the rabbitmq service on the node to be removed:
service rabbitmq-server stop
c.
This command will bring the cluster back online. 8. Open the Cluster Resource Monitor to verify the status of the cluster. crm_mon -f The number of “Nodes configured” and the number of “expected votes” should match the number of actual nodes in your cluster (one less than before). 9. The node is now removed from the cluster. However, a residual reference to the node might still exist in the “Load Balancer” section of MediaCentral UX. If this reference exists, it should be removed. a.
2. Similar to the gluster volume create command used in the "Configuring the GlusterFS Volumes" process found in the MCS Install Guide, you will use the remove-brick command to remove the node from Gluster.
Reviewing the Cluster Configuration File During the cluster installation, a configuration file was created which contains information about the cluster and the resources managed by Pacemaker. You can review the contents of the configuration file at any time by typing: crm configure show For example, the AvidClusterIP primitive contains the cluster IP address and the network interface being used (e.g. eth0). If necessary, press Q to get back to the Linux command line prompt.
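As an illustration, the AvidClusterIP entry in the output has roughly the following shape; this is a sketch only, and the IP address, netmask, NIC name and other parameters will reflect your own configuration:
primitive AvidClusterIP ocf:heartbeat:IPaddr2 \
        params ip="192.168.10.50" cidr_netmask="24" nic="eth0" \
        op monitor interval="30s"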
3. Find the line containing the cluster administrator e-mail address. Example:
rsc_defaults rsc_defaults-options: \
    admin-email="admin@wavd.com"
4. Alter the existing e-mail address or add additional e-mail addresses by separating each contact with a comma. Example:
rsc_defaults rsc_defaults-options: \
    admin-email="admin@wavd.com,engineering@wavd.com"
5. Save the changes and exit using the same commands as you would use in a "vi" editing session (for example, :wq).
Changing IP Address in a Cluster
In the event that you need to alter the IP address of a node or an entire cluster, follow the procedures below as they apply to your network change requirements. Recall that a cluster has multiple IP addresses:
• Node IP addresses. Each node is assigned a standard unicast address.
• Cluster IP address. This address is used by the nodes to communicate with each other within the cluster. By default, this is a multicast address.
• If you need to alter the virtual cluster IP address, see Changing the Virtual IP Address below. Once all required changes have been made, continue with step 3 of this process.
3. Bring the cluster back online on the master node:
service pacemaker start
service corosync start
4. Open the Cluster Resource Monitor to verify the status of the cluster:
crm_mon -f
Wait for the master node to start all resources.
5.
6. If you are changing the IP address of the master and / or slave nodes, you must edit the drbd configuration file. a. Open the file for editing: vi /etc/drbd.d/r0.res b. Find and change the IP address(es) associated with the altered node(s): on wavd-mcs02 { device /dev/drbd1; disk /dev/sda2; address 192.168.10.52:7789; meta-disk internal; } on wavd-mcs01 { device /dev/drbd1; disk /dev/sda2; address 192.168.10.51:7789; meta-disk internal; } } c. Save and exit the vi session.
Changing the Virtual IP Address
1. On the Master node, run the cluster setup-cluster command with your updated IP address information to update the cluster configuration file. See "Starting the Cluster Services on the Master Node" in the MediaCentral Platform Services Installation and Configuration Guide for details. This command will start the cluster services on the master node.
2.
Taking Nodes Offline and Forcing a Failover
At times it might be necessary to take a node offline for troubleshooting. Pacemaker offers an easy way to temporarily remove and reactivate a node in the cluster, as shown in the sketch below. The same commands can be used to force a failover of the cluster, which is useful when testing a fully functional system.
Note: Be aware that since the playback service is load-balanced across all cluster nodes, taking a node offline can result in an interruption in playback.
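A minimal sketch of the standby commands (standard Pacemaker crm shell syntax; the node name is one of the sample hostnames used in this guide):
crm node standby wavd-mcs03    # temporarily remove the node from the cluster
crm node online wavd-mcs03     # return the node to active service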
1. Log in to any node in the cluster as root and open the Cluster Resource Monitor utility:
crm_mon -f
This returns the status of all cluster-related services on all nodes. Ensure all nodes are active and operating normally prior to the test. Any failures should be investigated and cleared so as not to initiate additional unexpected failovers.
2. Note the line identifying the master node:
AvidClusterIP (ocf::heartbeat:IPaddr2): Started wavd-mcs01
3.
Shutting Down or Rebooting a Single Cluster Node The Linux reboot process is thorough and robust, and automatically shuts down and restarts all the MCS and clustering infrastructure services on a server in the correct order. However, when the server is a node in an MCS cluster, care must be taken to remove the node from the cluster — that is, stop all clustering activity first — before shutting down or rebooting the individual node.
Shut down or reboot the cluster node:
1. Log into the node as the Linux root user.
2. Stop the Pacemaker and Corosync services:
service pacemaker stop && service corosync stop
The services should stop with a green [OK] status.
Note: You can safely stop these cluster services without putting the nodes in Standby. If you are stopping pacemaker and corosync on the master node, the cluster will fail over to the slave node and it will become the cluster master. That is expected and normal behavior.
3.
Shutting Down the Cluster When shutting down an entire cluster, the nodes must be shut down and restarted in a specific order. Rebooting nodes in the incorrect order can cause DRBD to become confused about which node is master, resulting in a “split brain” condition. Rebooting in the incorrect order can also cause RabbitMQ to enter into a state of disarray, and hang. Both DRBD and RabbitMQ malfunctions can present misleading symptoms and can be difficult to resolve.
Starting the Cluster When bringing the cluster online, it is important to bring up the original master first. This was the last node down, and must be the first back up. This is primarily for the sake of RabbitMQ, which runs on all nodes and maintains its own “master” (called a “disc node” in RabbitMQ parlance). The non-master RabbitMQ nodes (called “ram nodes”) look to the last known disc node for their configuration information.
Performing a Rolling Reboot A rolling reboot is a process in which one or more cluster nodes are rebooted in sequence and only one machine at a time is off-line. A rolling reboot allows the entire cluster to be restarted with minimal disruption of service to the clients. The following list shows the correct order for a rolling reboot: 1. Power-cycle the load-balancing nodes. 2. Power-cycle the slave node. 3. Power-cycle the master node.
7 User Management The MediaCentral | UX Administration Guide provides details on user creation and general user management. Appendix A of the Administration Guide provides additional information regarding commands that can be used with the avid-ums service. This chapter includes information on determining what users are connected to the MCS system and a process for manually backing up and restoring the MCS user database.
Identifying Connected Users and Sessions
4. Click the plus sign (+) to the left of one of the nodes. Information regarding client connections to this node appears. Example:
The Host column indicates the IP address of the system that is making the connection to MediaCentral UX.
2015-07-29 15:25:43.324 -0400 INFO com.avid.uls.bl.session.impl.SessionHolder - Logging in: logon=MessierTest, role=Journalist, userId=249, isAvidAdministrator=false, clientIp=192.168.10.117
2015-07-29 15:25:43.326 -0400 INFO com.avid.uls.bl.session.impl.SessionHolder - Session created, SID=-8917047212884686433, logon=TESTJOURN
Note: For best results when viewing the log file, use an application such as Notepad++ which will correctly interpret carriage returns.
Backing Up the UMS Database
The MediaCentral Platform Services Upgrade Guide includes a process for backing up the MCS databases and system settings through the use of the system-backup.sh script. That process includes a backup of the UMS user database. However, in some situations you might need to back up only the UMS data.
To restore the UMS database:
1. Log in to the MCS server as the root user. In a clustered configuration, log in to the master node.
2. Stop the UMS service:
- For a single server: service avid-ums stop
- For a cluster: crm resource stop AvidUMS
3. Copy the backup of the UMS database to your destination MCS server.
4.
Migrating the 1.4.x / 1.5.x UMS Database
To extract the UMS database from an ICS 1.4.x/1.5.x system and load it into an MCS 2.x system, you must use PostgreSQL tools directly, at both ends.
To extract the UMS database from an ICS 1.4.x/1.5.x system:
1. Log in to the master node as root and dump the UMS database:
pg_dump -U postgres uls > uls_backup.sql
2. Move the file to a safe location (off the server) in preparation for restoring it to the MCS 2.x system.
8 MCS Troubleshooting and System Logs
This chapter presents troubleshooting tips and procedures as well as the location and description of the log files produced by MCS systems.
Common Troubleshooting Commands
The following table lists some helpful commands for general troubleshooting:
Command   Description
ics_version   Prints MCS version information to the screen.
drbd-overview (cluster only)   Prints DRBD status information to the screen.
Command   Description
gluster (cluster only)   Queries GlusterFS peers. e.g. gluster peer [command], such as:
gluster peer probe <hostname>
acs-query   Tests the RabbitMQ message bus.
watch service rabbitmq-server status   Provides a live status of the rabbitmq-server. This command can be used for troubleshooting, but do not leave it running for long periods of time to ensure system performance is not affected.
Responding to Automated Cluster E-mail
By default Pacemaker is configured to send automated e-mails to notify the cluster administrators of important events. The following table presents the e-mail types that can be sent and the remedial action needed.
E-mail Type   Description   Action Needed
Node Up / Joined Cluster
• A node that was put into standby has been added back into the cluster.
• During installation, a new node has successfully joined the cluster.
Action needed: None.
Troubleshooting RabbitMQ
The Avid Knowledge Base includes a page that provides detailed instructions on reviewing the status of RabbitMQ and troubleshooting any related errors. See the following link for details:
http://avid.force.
Fanout.Broadcasts present                    [ OK ]
Bindings:
Fanout.Broadcasts -> Local.Broadcasts        [ OK ]
Fanout.Broadcasts -> MultiZone.Broadcasts    [ OK ]
Fanout.Channels -> Local.Channels            [ OK ]
Fanout.Channels -> MultiZone.Channels        [ OK ]
An "OK" response indicates that the acs-broker and rabbitmq communication is normal.
Troubleshooting DRBD
Recall that DRBD runs on the master and slave nodes only, and is responsible for mirroring the contents of a partition between master and slave.
Master Node: WFConnection
1:r0/0  WFConnection Primary/Unknown UpToDate/DUnknown C r----- /mnt/drbd ext4 20G 397M 18G 3%
Summary: The DRBD master node cannot connect to the DRBD slave node:
WFConnection   The master node is waiting for a connection from the slave node (i.e. the slave node cannot be found on the network).
Primary/Unknown   This node is the master, but the slave node cannot be reached.
Both Nodes: Secondary/Secondary
1:r0/0  Connected Secondary/Secondary UpToDate/UpToDate C r-----
Summary: The nodes are connected, but neither is master.
Details:
Connected   A connection is established.
Secondary/Secondary   Both nodes are operating as the slave node. That is, each is acting as the peer that receives updates.
UpToDate/UpToDate   The databases on both nodes are up to date.
Both Nodes: Standalone and Primary
1:r0/0  StandAlone Primary/Unknown UpToDate/Unknown C r----- /mnt/drbd ext4 20G 397M 18G 3%
1:r0/0  StandAlone Primary/Unknown UpToDate/Unknown C r-----
Summary: A DRBD "split brain" has occurred. Both nodes are operating independently, reporting themselves as the master node, and claiming their database is up to date.
StandAlone   The node has no connection to its peer and is not attempting to reconnect; each node is operating independently.
Manually Connecting the DRBD Slave to the Master
When the master and slave nodes are not connecting automatically, you will have to make the connection manually. You do so by telling the slave node to connect to the resource owned by the master.
To manually connect the DRBD slave to the master:
1. Log in to any node in the cluster as root and start the Pacemaker Cluster Resource Monitor utility:
crm_mon
2.
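The connection itself is made with the standard DRBD administration tool. A minimal sketch, assuming the resource name r0 from the /etc/drbd.d/r0.res file shown earlier, run on the slave node:
drbdadm connect r0    # tell the local resource to reconnect to its peer
drbd-overview         # confirm the state returns to Connected and UpToDate/UpToDate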
Correcting a DRBD Split Brain
Note: Discarding the database on the slave node does not result in a full re-synchronization from master to slave. The slave node has its local modifications rolled back, and modifications made to the master are propagated to the slave.
To recover from a DRBD split brain:
1. Log in to any node in the cluster as root and start the Pacemaker Cluster Resource Monitor:
crm_mon
2. Identify the master node. To identify the master, look for the line containing "Master/Slave Set".
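The standard DRBD recovery pattern looks broadly like the following sketch. It assumes the resource name r0 and that the master node identified above is the node whose data should be kept; the steps in the official procedure may differ, so treat this as an illustration only:
# On the slave (the node whose local changes will be discarded):
drbdadm secondary r0
drbdadm connect --discard-my-data r0
# On the master, if it is also in the StandAlone state:
drbdadm connect r0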
Working with Cluster Logs Working with Cluster Logs MCS and its supporting services — such as Pacemaker, Corosync, and RabbitMQ — produce numerous logs. These are stored in the standard RHEL directory and subdirectories: /var/log Typically, log files have a name of the following form: .log For example: spooler.log spooler.log-201310.25.gz spooler.log.old20131024_141055 Note the following: • *.log are current log files, for the active process. • *.
Understanding Log Rotation and Compression
The Linux logrotate utility runs and compresses the old logs daily. Although it is invoked by the Linux cron daemon, the exact runtime for logrotate cannot be stated with accuracy. It varies, for example, depending on when the system was most recently rebooted, but it does not run at a fixed time after the reboot. This is by design, in order to vary and minimize the impact on other system resources.
• grep - Use the grep command to search for regular expressions within a log file from the command line. For example, the following command searches all log files in the current directory for the term "fail-count":
grep fail-count *.log
Adding a -r option to the same command recursively searches the log files in the current directory and all subdirectories for the specified <term>:
grep -r <term> *.log
Note: WinSCP uses the standard TCP port 22 for its SSH connection. If you can establish an SSH connection to the server outside of WinSCP, you can use WinSCP.
4. Click Login.
The following message is displayed: "Continue connecting and add host key to the cache?"
5. Click Yes.
The WinSCP interface is displayed. The left pane represents your source Windows system. The right pane represents your MCS server.
Note: WinSCP automatically opens in the home directory of the logged in user.
Important Log Files at a Glance Important Log Files at a Glance The following tables detail the name, location and purpose of the logs found on an MCS server. RHEL Logs in /var/log The following table presents the standard RHEL logs found in the /var/log directory: Log File Description /var/log/anaconda.log Linux installation messages. /var/log/boot.log Information pertaining to boot time. /var/log/btmp.log Failed login attempts. /var/log/cron Information logged by the Linux cron daemon.
RHEL Subdirectories in /var/log
The following table presents the standard RHEL subdirectories found in the /var/log directory:
Log File   Description
/var/log/audit   Logs stored by the RHEL audit daemon.
/var/log/ConsoleKit   Logs stored related to user sessions. Deprecated.
/var/log/cups   Logs related to printing.
/var/log/httpd   The Apache web server access and error logs. As of ICS 1.8 Apache is no longer used.
/var/log/ntpstats   Logs relating to the NTP daemon.
Avid Logs in /var/log
The following table presents logs specifically related to MCS and related systems found in /var/log and its associated subdirectories:
Log File   Description
/var/log
• MediaCentral_Services_Build_Linux.log - Logs any errors encountered during an MCS software installation.
• ICS_installer__.log - Logs related primarily to the Linux phase of the installation.
• fuse_avidfos.
Log File   Description
/var/log/avid/acs
• avid-acs-attributes.log - Log file for the avid-acs-attributes service which stores service configuration attributes.
• avid-acs-federation.log - Log file for the avid-acs-federation service which stores bus configuration information for multi-zone.
• avid-acs-infrastructure.log - Log file for the avid-acs-infrastructure service which is used to track bus server connection information used by the Bus Access Layer component.
Log File   Description
/var/log/avid/avid-interplay-central
• YYYY_MM_DD.request.log - Daily request logs
• acs-bal-YYYY-MM-DD.0.log -
• interplay_central_#.log - MediaCentral server log. Helpful for troubleshooting a variety of problems including login issues and failed searches.
• osgi.log
• osgi-framework.log
• service_startup.log
• uls.
Log File   Description
/var/log/cluster   Corosync log files. These log files are only available in clustered MCS configurations.
/var/log/elasticsearch   Logs related to the elasticsearch component of Media Index. Logs are only available if Media Index has been configured.
/var/log/elasticsearch-tribe   Logs related to the elasticsearch component of Media Index.
• <hostname>.log - <hostname> is the hostname of the single node or the virtual cluster name of the MCS system.
MediaCentral Distribution Service Logs
The following table presents log information for the MediaCentral Distribution Service (MCDS), which supports Interplay Production send-to-playback workflows. MCDS is generally installed on a Windows server hosting other Interplay Production services.
Log File   Description
C:\ProgramData\Avid\Interplay Central Distribution Service
• STPService_nn.log - Messages from the MediaCentral Distribution Service
• STPTimerTask_nn.
Mobile Device Logs
Logs are available for both iOS and Android devices. However, logging is not enabled by default and must be manually selected per device. To ensure best performance of the device, logging should only be enabled temporarily to create a log for a specific issue.
Enable logging for iOS and Android Devices:
1. Sign in to your mobile client.
2. Select the application menu to access the Preferences or Settings.
3. Select the option to enable logging.
6. Once you have reproduced the issue, select "Send Log" from the application menu. In the example below, the Android app is pictured on the left and the iOS app is pictured on the right.
7. Send an e-mail with the log to yourself or an Avid representative for analysis.