
Communication Failures During Phase 3 Takeover Processing
If one RDF subsystem is unable to reach the backup system of another RDF subsystem during
phase 3 processing, phase 3 processing stalls until the communication line comes back up. This
can lengthen the overall duration of takeover operations on all backup systems. Should this type
of stall occur, the RDF subsystem issues an event message alerting operators to the situation.
Takeover Delays and Purger Restarts
During phase 3 purger work, the network master's purger needs information from the other purger
processes in the RDF network; during the latter part of phase 3 processing, the purgers on the
non-network-master systems need information from the network master's purger. When a purger process
is waiting for information from another purger, it waits for up to 60 seconds, during which time
it does not respond to certain requests (such as STATUS RDF). After a purger has waited 60
seconds, it quits the operation and restarts. This allows the purger to read the $RECEIVE file,
respond to messages that have been waiting for replies, and then retry phase 3 processing.
Takeover Restartability
As has always been the case, the RDFCOM TAKEOVER command is restartable. Therefore, if a
takeover operation terminates prematurely for any reason on any system in an RDF network, it
can be restarted.
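For example, if a takeover operation fails partway through on one backup system in the network,
you can reissue the command on that system once the cause of the failure has been corrected. The
following sequence is only an illustration and assumes you are at an RDFCOM prompt on the backup
system for the affected RDF configuration; check the state of the subsystem first, then restart
the takeover:

    STATUS RDF
    TAKEOVER

Because the TAKEOVER command is restartable, reissuing it completes the interrupted takeover
operation.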
Takeover and File Recovery
When a takeover operation completes in an RDF network environment, the purger logs two
events: one reports a safe MAT position (indicating that all committed data up to that location
was successfully applied to the backup database), and the second (888 or 858) reports whether
or not a File Recovery position is available for use on the primary system. RDF event 888 reports
that a File Recovery position is available and includes the exact sequence number (sno) and relative
byte address (rba) to be used for a File Recovery operation on the primary system. If, however, “kept-commits” have been encountered
during phase 2 processing, a File Recovery position is not available; this is reported in RDF event
858. This last situation will never occur in an RDF/ZLT environment because a File Recovery
position is always available with RDF/ZLT.
If an RDF event 888 is reported, then the specified File Recovery position is based on both phase
1 and phase 3 processing. Each system logs its own File Recovery position. While that position
can differ from one backup system to the next, the logged position for any single system is correct.
If you supply the returned File Recovery position to the TMF file recovery process on the primary
system, the process recovers the files on the primary database up to that point. If you perform File
Recovery to a MAT position on every primary system in the RDF network, in each case using the
File Recovery position returned for that system, then your primary distributed database will be
consistent across the RDF network.
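As an illustration only, supplying the position reported in RDF event 888 to TMF file recovery on
a primary system might take the form of the following TMFCOM command, where <file-set> is a
placeholder for the audited files of your primary database and <sno> and <rba> stand for the
sequence number and relative byte address taken from the event; the exact RECOVER FILES syntax
and options are described in the TMF documentation for your RVU:

    RECOVER FILES <file-set>, TOMATPOSITION (<sno>, <rba>)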
You would use the File Recovery position with File Recovery in situations such as the following. Assume you
have had an outage of your primary system, you have executed the RDF takeover operation on
your backup system, and you have resumed business transactions on your backup system.
Assume further that the former primary system has been repaired, it is back online, and you
want to switch your business transactions from the active backup database back to the former
primary database. To do so, you merely execute a planned RDF switchover from the backup to
the newly restored primary.
The problem with doing a planned switchover from backup to primary after an RDF takeover
operation is that some transactions might have committed on the primary system immediately
prior to the unplanned outage, and the outage might have brought down the extractor before it could
send that data to the backup system. In such a case, when you bring the primary system back up, the
two databases are no longer synchronized because the primary database contains committed
transactions that are not in the backup database. Such transactions cannot be recovered.