Transactd High Availability (THA)

Overview

Transactd High Availability (THA) is available from Transactd version 3.5. A redundant THA configuration with multiple servers provides a high-availability system with much shorter downtime than a single-server configuration. It also makes load balancing easy, for example by running read operations on a slave.

THA has the following features:

Structures of THA

THA consists of one master server and one or more slave servers that use MySQL GTID replication. You can optionally balance read operations across the slave servers. Alternatively, the master can handle all access while the slaves are kept as spares for failover. At least one slave is required; two or more are preferable.

All servers must run the same version of MySQL (5.6 or later) or MariaDB (10.0 or later), because THA uses GTID replication. See the MySQL / MariaDB documentation to set up replication with GTID.

Mechanism

THA consists of these elements:

Hostname resolver (THNR)

THNR (Transactd HostName Resolver) is a hostname resolver embedded in the client library.

You can use it easily in your application. The application accesses the database with a virtual hostname instead of a real hostname. THNR, which is embedded in the tdclc library, converts the virtual hostname to the real one. THNR alone is enough to switch masters; it does not depend on other components such as the OS, network devices, or DNS.

One THNR instance is shared by all threads in a process.

Server role

Typically, the server role is master or slave. The master could be detected from the replication configuration, but THA requires the role to be specified explicitly in a server variable, in order to support more complicated configurations in the future.

On failover, the failover program changes the server roles accordingly.

THNR detects a change of server role by checking whether the role requested by the client matches the server's actual role. This mechanism prevents problems such as writing to a slave by mistake.

The role requested by the client is determined by whether the virtual hostname it used is the master one or the slave one.
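
The role check described above can be pictured with a small sketch. This is not the THNR implementation; the Role type and both helper functions are hypothetical names introduced only for illustration.

```cpp
#include <string>

// Server roles as described in the text: master or slave.
enum class Role { Master, Slave };

// Hypothetical helper: the role a client requests is implied by which
// virtual hostname it connected with.
inline Role requestedRole(const std::string& virtualHost,
                          const std::string& masterVirtualHost) {
    return virtualHost == masterVirtualHost ? Role::Master : Role::Slave;
}

// Hypothetical helper: true when the server's actual role does not match
// the requested one, for example after a failover has changed the roles.
inline bool roleChanged(Role requested, Role actual) {
    return requested != actual;
}
```

With the virtual hostnames used later in this document, a client that connects through master_host and reaches a server whose role is slave would see roleChanged() return true, which is the situation in which a write must not proceed.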

Client reconnecting

The client can reconnect to the server when a network error occurs.

If reconnection succeeds, the current record and lock status are recovered, and the application can continue as if nothing had happened. However, the client does not try to reconnect if a transaction has been started with beginTrn() or beginSnapshot(). In this case the application receives an error; it can abort the transaction and report the result to the user. Reconnection is attempted on the next operation.
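
As a rough sketch of this rule, the decision on a network error depends only on whether a transaction or snapshot is open. ClientState, OnNetworkError and decideOnNetworkError are hypothetical names; the real client library keeps this state internally.

```cpp
// Hypothetical view of the client state relevant to reconnection.
struct ClientState {
    bool inTransaction = false;  // true after beginTrn()
    bool inSnapshot = false;     // true after beginSnapshot()
};

enum class OnNetworkError { Reconnect, ReportError };

// Reconnect transparently only when no transaction or snapshot is open;
// otherwise the error is passed to the application, which can abort.
inline OnNetworkError decideOnNetworkError(const ClientState& s) {
    return (s.inTransaction || s.inSnapshot) ? OnNetworkError::ReportError
                                             : OnNetworkError::Reconnect;
}
```

The transparent path is limited to the idle case because only the record position and lock status are described as recoverable; an open transaction on the lost connection is not.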

Alive monitoring

THNR performs alive monitoring and calls the failover program. If a network error occurs because the master server has died, the client tries to reconnect according to the error status. At that point THNR is notified that this is a reconnection.

When THNR receives the reconnection notification, it discards its cache and searches for the correct host. If it cannot connect to the master during this process, it concludes that the master server has died. It then calls the failover program, switches the master, and searches for the new master.
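
The flow above (discard the cache, search for the master, fail over if none is found, search again) can be sketched as follows. ResolverSketch and its members are hypothetical and only model the order of the steps; the isMaster and runFailover callbacks stand in for the real server probe and the call to haMgr.

```cpp
#include <functional>
#include <string>
#include <vector>

// Illustrative model of the resolver flow; not the THNR implementation.
struct ResolverSketch {
    std::vector<std::string> hosts;                     // real host names
    std::function<bool(const std::string&)> isMaster;   // stand-in probe
    std::function<void()> runFailover;                  // stand-in for haMgr
    std::string cachedMaster;                           // empty = no cache

    // Returns the first host that currently reports the master role.
    std::string findMaster() const {
        for (const std::string& h : hosts)
            if (isMaster(h)) return h;
        return std::string();
    }

    // reconnecting == true corresponds to the reconnection notification.
    std::string resolveMaster(bool reconnecting) {
        if (reconnecting) cachedMaster.clear();          // discard the cache
        if (!cachedMaster.empty()) return cachedMaster;  // normal fast path
        cachedMaster = findMaster();
        if (cachedMaster.empty()) {                      // master has died
            runFailover();
            cachedMaster = findMaster();                 // search new master
        }
        return cachedMaster;
    }
};
```

In this model nothing is probed while the cache is valid, so monitoring costs nothing until a client actually hits an error.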

This alive monitoring is implemented by handling client access errors. It has these advantages:

Failover program (haMgr)

The failover program haMgr is an independent program that runs as a separate process. (c.f. haMgr reference)

Failover is performed if THNR can call haMgr (that is, haMgr is on the PATH). If it cannot be called, the application simply receives a network error.

Only the first client that starts the failover process can perform it. Clients that try later cannot lock the HA object on the server, so their failover attempt fails.

Clients that fail to get the lock immediately start resolving hostnames with THNR. During this, they try to lock the HA object for up to 60 seconds in order to check whether the servers are still in the failover process. Once the failover process has finished and the lock has been released, they get the new hostname.
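
The waiting behavior of the later clients can be sketched as a bounded retry loop. This is only an illustration: waitForFailoverEnd is a hypothetical helper, the tryLock callback stands in for the attempt to lock the HA object on the server, and the polling interval is an assumption (the text only states the 60-second limit).

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Retries tryLock until it succeeds or the time limit elapses.
// Returns true if the lock was obtained, meaning the failover that held
// it has finished; returns false if the limit passed first.
inline bool waitForFailoverEnd(
    const std::function<bool()>& tryLock,
    std::chrono::milliseconds limit = std::chrono::seconds(60),
    std::chrono::milliseconds interval = std::chrono::milliseconds(100)) {
    const auto deadline = std::chrono::steady_clock::now() + limit;
    for (;;) {
        if (tryLock()) return true;
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(interval);
    }
}
```

A caller would resolve the master hostname again once this returns true, since the roles may have changed while it was waiting.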

Minimize downtime

The mechanism above enables automatic failover to a new master, so processing can continue with almost no downtime. The time failover takes depends on the number of slaves, but detection takes several seconds and switching takes less than 1 second.

No errors occur between the detection of the master server fault and the start of the failover program, because the actual clients monitor the server directly. Only transactions that were in progress when the master went down fail.

THA also supports switchover, which switches the master intentionally. It waits for transactions on the current master to finish, so no data is lost.

Configure THA

This section explains how to set up THA. The examples assume these servers:

Server     IP address
Master     192.168.0.2
Slave 1    192.168.0.3
Slave 2    192.168.0.4

  1. Configure replication between a master and multiple slaves with MySQL 5.6 or later, or MariaDB 10.0 or later. c.f: details of MySQL/MariaDB GTID replication (Japanese).

    At this point, use the same settings on all servers, such as the username and password for Transactd, the channel, and the username and password for replication.

  2. Specify the hostname or IP address in my.cnf on each server, like report-host=192.168.0.x. If you use a port other than the default (8610), specify it like report-host=192.168.0.x:8611.

    This hostname is used by the start function and by health_check, which are described later.

  3. Restart the master server with the boot option --transactd-startup_ha=1.

  4. Put the haMgr program in a directory from which the Transactd client application can call it (set the PATH).

  5. Check failover availability with the haMgr -c health_check command.

    ./haMgr64 -c health_check -o 192.168.0.2 -s 192.168.0.3,192.168.0.4 -u root -p abcd
    2016-07-05T18:34:28 Starting health check...
      SLAVE_LIST=192.168.0.3,192.168.0.4
      192.168.0.2: Role = Master OK!
      192.168.0.2: HA lock OK!
      192.168.0.3: Role = Slave OK!
      192.168.0.3: Failover is disabled NG!
      192.168.0.3: HA lock OK!
      192.168.0.3: channel name=
      192.168.0.3: SQL thread running OK!
      192.168.0.3: IO thread running OK!
      192.168.0.3: SQL thread delay=0
      192.168.0.4: Role = Slave OK!
      192.168.0.4: Failover is disabled NG!
      192.168.0.4: HA lock OK!
      192.168.0.4: channel name=
      192.168.0.4: SQL thread running OK!
      192.168.0.4: IO thread running OK!
      192.168.0.4: SQL thread delay=0
          2 errors detected.
    2016-07-05T18:34:29 Done!

    There are two "Failover is disabled NG!" errors, but these are not a problem at this point. Failover will be enabled later.

  6. Add the THNR start function to the Transactd client application.

    haNameResolver::start("master_host", "slave_host", "192.168.0.2,192.168.0.3,192.168.0.4", 1, "root", "abcd");

    See the haNameResolver SDK document (Japanese) for details of this function.

  7. Replace the real hostname with the virtual hostname in the Transactd client application.

    /* Access to the master */
    db->open("tdap://root@master_host/db/test?pwd=abcd");
    /* Access to the slave */
    db->open("tdap://root@slave_host/db/test?pwd=abcd");

  8. After testing the application, enable failover with the haMgr program.

    ./haMgr64 -c set_failover_enable -v 1 -o 192.168.0.2
    2016-07-06T10:21:04 Done!

The transactd-startup_ha option specifies the server role at startup.

The role can be changed without restarting the server, using the haMgr program or the Transactd API.

If you restart a server after the master has been changed by a switchover, the server starts with the role specified in this boot option, even though the master has changed. To restore the role the server last had, add 4 to the transactd-startup_ha value.

However, a failed server retains the master role it last had. After it is repaired, start it with transactd-startup_ha=0 specified explicitly so that it starts as a slave.

The transactd-startup_ha value is saved in data_dir/transactd_srv_master.info. Deleting this file disables role restoring.
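
The rules above can be summarized in a small sketch. It rests on assumptions that the text does not confirm: that the value is treated as bit flags, with the lowest bit giving the role (0 = slave, 1 = master) and the added 4 acting as a "restore last role" flag. startupRole and kRestoreLastRole are hypothetical names.

```cpp
// Roles with the numeric values used by transactd-startup_ha in the text.
enum class Role { Slave = 0, Master = 1 };

// Assumption: "add 4" sets a flag bit that requests role restoring.
constexpr int kRestoreLastRole = 4;

// Returns the role a server would start with under these assumptions.
// hasSavedRole models whether the saved role file still exists.
inline Role startupRole(int startupHa, bool hasSavedRole, Role savedRole) {
    const bool restore = (startupHa & kRestoreLastRole) != 0;
    if (restore && hasSavedRole) return savedRole;  // restore the last role
    return (startupHa & 1) ? Role::Master : Role::Slave;
}
```

Under this reading, a value of 5 on the original master starts it as a slave after a switchover, while a repaired failed master started with 0 becomes a slave regardless of what was saved.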

Notes on THA applications

Two database objects and transactions in the same thread

For example, assume that the program has two database object instances, DB_A and DB_B.

Suppose a transaction is started with DB_A, and then an error occurs with DB_B during that transaction. If DB_A and DB_B are in the same thread, reconnection is not attempted. DB_B would have to wait for the DB_A transaction to finish before reconnecting, but the program cannot proceed because DB_A and DB_B are in the same thread.

THA administration

Keep health

To maintain high availability, check daily that failover will work when a fault occurs. haMgr provides the health_check function for this purpose. Keep THA healthy by scheduling this test. c.f: Health check.

The health check does not validate the accounts specified by repl_user and username. Check by other means that the account has the privilege to execute CHANGE MASTER TO.

Check whether failover was done

CHANGE MASTER TO ... is logged to the MySQL error.log (or the event log on Windows) when THNR performs a failover. Monitor this log to check whether a failover has occurred.

Detailed log of failover process

A more detailed log is written to the client log. (If THNR calls haMgr, the log is redirected to the error log. If you run haMgr yourself, the log is shown on the console.)

Specifically, the log looks like this: (c.f: error log)

2016-07-07T09:29:33 Starting fail over...
  SLAVE_LIST=192.168.0.2,192.168.0.3,192.168.0.4
  HP6730B:8611: promote to master 
  HP6730B:8611: channel name=
  HP6730B:8611: set role=MASTER 
  HP6730B:8612: channel name=
  HP6730B:8612: stop slave all 
  HP6730B:8612: change master to new master, pos=slave_pos
  HP6730B:8612: start slave 
2016-07-07T09:29:33 Done!

THA with both Transactd Client and SQL

THA performs best with applications that use only the Transactd API. If high availability is the most important issue in your project, we recommend modifying your application accordingly.

THA does not interfere with SQL access. It works closely with the Transactd API and is transparent to SQL access. If you need both, there are several ways to combine the THA elements. The major choices are explained below.

Use THA mainly

Set up THA according to the instructions above. Modify the SQL part of your application to use haNameResolver::master() as the master hostname and haNameResolver::slave() as the slave hostname. There is no problem in normal operation.

A problem occurs when the first access after a master fault comes from SQL. If the first access comes from the Transactd API, THA performs fault detection, failover, and hostname resolution automatically. If the first access comes from SQL, these recovery processes are not performed automatically and errors occur. The recovery processes are performed at the next Transactd API access.

To speed up recovery, modify the application so that it tries a Transactd API access whenever an SQL access returns an error.
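
One possible shape for this modification is sketched below. sqlWithTransactdFallback is a hypothetical helper, and both callbacks are placeholders for the application's own SQL access and any lightweight Transactd API access. Retrying the SQL access once afterwards is an assumption of this sketch, not something THA requires.

```cpp
#include <functional>

// Runs the SQL access. If it fails, performs a Transactd API access so
// that THA can detect the fault, fail over and re-resolve the hostname,
// then retries the SQL access once.
inline bool sqlWithTransactdFallback(
    const std::function<bool()>& sqlAccess,
    const std::function<bool()>& transactdAccess) {
    if (sqlAccess()) return true;          // normal case: no fault
    if (!transactdAccess()) return false;  // recovery did not succeed
    return sqlAccess();                    // retry after recovery
}
```

For the retry to reach the new master, the SQL callback should obtain the hostname from haNameResolver::master() on every call instead of keeping an old value.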

If you do not need automatic failover, this approach is the best choice. Projects for which manual switchover on a fault is sufficient can use it without any problems. In this case, disable failover:

./haMgr64 -c set_failover_enable -v 0 -o 192.168.0.2

"The first access after the master fault" applies to each client process, because all threads in a client process share one THNR.

Use MySQL's HA

A traditional MySQL HA setup is built like this:

  1. Monitoring with heartbeat
  2. Failover with the mysqlfailover program
  3. Change hostname with VirtualIP
  4. Update router ARP cache

With this approach, enable reconnection in the client application so that it reconnects to the new server after failover.

/* At the start of the application */
nsdatabase::setEnableAutoReconnect(true);

There are some problems with this approach:

Use MySQL's HA mainly, and use haMgr for failover

If you use haMgr as the failover program, Transactd API transactions do not fail at switchover, because haMgr waits for them to finish.

haMgr reference

haMgr is the program that controls THA. It has these features:

Options

command line option:
  -c [ --command ]         command [switchover | failover | demote_to_slave | set_failover_enable | set_server_role | health_check]
  -o [ --cur_master ]      current master host name
  -n [ --new_master ]      new master host name
  -C [ --channel ]         new master channel name
  -P [ --repl_port ]       new master port
  -r [ --repl_user ]       new master repl user
  -d [ --repl_passwd ]     new master repl password
  -O [ --repl_option ]     option params for change master(ex:MASTER_CONNECT_RETRY=30)
  -s [ --slaves ]          slave list for failover
  -a [ --portmap ]         port map ex:3307:8611
  -v [ --value ]           value (For set_failover_enable or set_server_role)
  -u [ --username ]        transactd username
  -p [ --password ]        transactd password
  -R [ --readonly ]        0 | 1: When it is 1, the READONLY variable will be set ON to slaves and OFF to a master
  -D [ --disable_demote ]  0 | 1: disable old master demote

Switchover

Change the master.

e.g. Change the master from 192.168.0.2 to 192.168.0.3.

./haMgr64 -c switchover -o 192.168.0.2 -n 192.168.0.3 -R1 -r replication_user -d abcd -u root -p xxxx

Switchover changes the master and demotes the old master to a slave at the same time.

Failover

Perform a failover manually.

e.g. Failover with the two current slaves 192.168.0.3 and 192.168.0.4.

./haMgr64 -c failover -s 192.168.0.3,192.168.0.4 -u root -p xxxx

Demote the master

e.g. Add the old master 192.168.0.2 as a slave of the current master 192.168.0.3.

./haMgr64 -c demote_to_slave -o 192.168.0.2 -n 192.168.0.3 -r replication_user -d abcd -u root -p xxxx

Enable / disable failover

Use 1 to enable failover. Use 0 to disable it.

e.g. Enable failover on the current master 192.168.0.3.

./haMgr64 -c set_failover_enable -v 1 -o 192.168.0.3 -u root -p xxxx

Change server role

Use 1 to set the server as the master. Use 0 to set it as a slave.

e.g. Set 192.168.0.3 as master.

./haMgr64 -c set_server_role -o 192.168.0.3 -v 1 -u root -p xxxx

Health check

Health check reports the replication status, server roles, lock status, and failover enablement of the master and its slaves to stdout.

The program returns 0 if no error was found. If any error was found, it returns 1.

e.g. Health check with the master 192.168.0.2 and the two slaves 192.168.0.3 and 192.168.0.4.

./haMgr64 -c health_check -o 192.168.0.2 -s 192.168.0.3,192.168.0.4 -u root -p xxxx
2016-07-05T18:34:28 Starting health check...
  SLAVE_LIST=192.168.0.3,192.168.0.4
  192.168.0.2: Role = Master OK!
  192.168.0.2: HA lock OK!
  192.168.0.3: Role = Slave OK!
  192.168.0.3: Failover is enabled OK!
  192.168.0.3: HA lock OK!
  192.168.0.3: channel name=
  192.168.0.3: SQL thread running OK!
  192.168.0.3: IO thread running OK!
  192.168.0.3: SQL thread delay=0
  192.168.0.4: Role = Slave OK!
  192.168.0.4: Failover is enabled OK!
  192.168.0.4: HA lock OK!
  192.168.0.4: channel name=
  192.168.0.4: SQL thread running OK!
  192.168.0.4: IO thread running OK!
  192.168.0.4: SQL thread delay=0
      No errors detected.
2016-07-05T18:34:29 Done!

The health check shows whether the system can fail over when a fault occurs. Running it regularly improves the reliability of the system.

For the health check to be accurate, it is important to pass the same values as those given to haNameResolver::start().
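
When the health check is scheduled, the exit code (0 or 1) is the simplest thing to monitor. If the captured output is also kept, the failing items can be counted from it. The sketch below assumes only what the sample output in this document shows, namely that a failed item ends with "NG!"; countNgLines is a hypothetical helper and not part of haMgr.

```cpp
#include <sstream>
#include <string>

// Counts the report lines that end with "NG!" in captured haMgr
// health_check output. Trailing spaces, tabs and CRs are ignored.
inline int countNgLines(const std::string& output) {
    std::istringstream in(output);
    std::string line;
    int count = 0;
    while (std::getline(in, line)) {
        const std::string::size_type end = line.find_last_not_of(" \t\r");
        if (end == std::string::npos) continue;  // blank line
        line.erase(end + 1);                     // trim the right side
        if (line.size() >= 3 &&
            line.compare(line.size() - 3, 3, "NG!") == 0) {
            ++count;
        }
    }
    return count;
}
```

A monitoring job could record this count together with the exit code, so that a report such as "Failover is disabled NG!" is visible without reading the full log.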