Transactd High Availability (THA)
Overview
Transactd High Availability (THA) is available in Transactd 3.5 and later. A redundant THA configuration with multiple servers gives you a high-availability system with much shorter downtime than a single-server configuration. It also makes load balancing easy, for example by sending read operations to a slave.
THA has the following features:
- It provides every function needed for failover: fault detection, master switching, hostname resolution, and so on.
- It minimizes downtime on a fault (to within a few seconds).
- Automatic master failover.
- Alive monitoring of the master server.
- Resolution of virtual hostnames to real hostnames.
- No dependence on other products such as the OS, external devices, or DNS.
- It can be adopted by changing only a few lines of the application program.
- Switchover.
- Health checking.
- It works in a wide range of environments: MySQL/MariaDB, Linux/Windows/Mac OS X, C++/PHP/Ruby/COM.
- It needs no libraries other than the Transactd plugin and the Transactd clients.
Index
- Structures of THA
- Configure THA
- Note of THA applications
- THA administration
- THA with both of Transactd Client and SQL
- haMgr reference
Structures of THA
THA consists of one master server and one or more slave servers that use MySQL GTID replication. You can optionally balance read operations across the slave servers, or you can send all access to the master and keep the slaves as spares for failover. At least one slave is required; two or more are preferable.
All servers must run the same version of MySQL (5.6 or later) or MariaDB (10.0 or later), because THA uses GTID replication. See the MySQL / MariaDB documentation to set up replication with GTID.
- Multi-master replication is not currently supported. Only a single master is available.
- We recommend semi-synchronous replication to prevent data loss.
Mechanism
THA is made up of these elements:
- A master server and slave servers
- Hostname resolver (THNR)
- Server role
- Client reconnecting
- Alive monitoring
- Failover program (haMgr)
Hostname resolver (THNR)
THNR (Transactd HostName Resolver) is a hostname resolver embedded in the client library.
You can easily use it in your application.
The application accesses the database with a virtual hostname instead of a real hostname.
THNR is embedded in the tdclc library. It converts virtual hostnames to real hostnames.
Master switching is handled by THNR alone, without relying on other products such as the OS, network devices, or DNS.
One instance of THNR is shared by all threads in a process.
Server role
Typically, the server role is master or slave. The master could be detected from the replication configuration, but THA requires the role to be specified explicitly in a server variable, in order to support more complicated configurations in the future. On failover, the failover program updates the server roles accordingly.
THNR detects a change of server role by checking whether the role requested by the client matches the server's actual role. This mechanism prevents problems such as writing to a slave by mistake.
The role requested by the client is determined by whether it connects with the master virtual hostname or the slave virtual hostname.
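The role check described above can be pictured with a small stand-alone sketch. This is not THNR's actual code: the `VirtualHosts` struct and the functions `isVirtualHost`, `requestedRole`, and `mustResolveAgain` are hypothetical names introduced only for illustration, and the real resolver obtains the server's actual role from the server rather than from an argument.

```cpp
#include <cassert>
#include <string>

enum class Role { Master, Slave };

// The two virtual hostnames an application registers (hypothetical holder type).
struct VirtualHosts {
    std::string master;  // e.g. "master_host"
    std::string slave;   // e.g. "slave_host"
};

// True when `host` is one of the registered virtual hostnames.
bool isVirtualHost(const VirtualHosts& names, const std::string& host)
{
    return host == names.master || host == names.slave;
}

// The role the client is asking for is implied by which virtual hostname it used.
// Precondition: `host` is a virtual hostname.
Role requestedRole(const VirtualHosts& names, const std::string& host)
{
    assert(isVirtualHost(names, host));
    return host == names.master ? Role::Master : Role::Slave;
}

// A mismatch between the requested role and the server's actual role means the
// cached real hostname is stale and the name must be resolved again.
bool mustResolveAgain(Role requested, Role actual)
{
    return requested != actual;
}
```

For example, a client that connects with "master_host" while the cached server now reports the slave role gets `mustResolveAgain(...) == true`, which is how a write to a slave by mistake is avoided in this model.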
Client reconnecting
The client can reconnect to the server when a network error occurs.
If the reconnection succeeds, the current record position and lock status are restored,
and the application can continue its processing as if nothing had happened.
However, the client does not try to reconnect if a transaction has been started with
beginTrn() or beginSnapshot().
In this case, the application receives an error.
It can then abort the transaction and report the result to the user.
Reconnection is attempted on the next operation.
Alive monitoring
THNR performs the alive monitoring and calls the failover program. If a network error occurs because the master server has died, the client tries to reconnect according to the error status. At this point THNR is notified that this is a reconnection.
When THNR receives the reconnection notification, it discards its cache and searches for the correct host. If it cannot connect to the master during this process, it concludes that the master server has died. It then calls the failover program, switches the master, and searches for the new master.
This alive monitoring is implemented by handling client access errors. It has these advantages:
- Time-based polling is not required.
- Polling places no load on the server.
- There is no time lag between a failure and its detection.
- No access errors occur between a failure and its detection.
Failover program (haMgr)
The failover program haMgr is an independent program that runs as its own process. (c.f. haMgr reference)
Failover is performed if THNR can call haMgr (that is, haMgr is on the PATH).
If it cannot be called, the application simply receives a network error.
Only the first client that starts the failover process can perform it. Clients that try later cannot lock the HA object on the server, so their attempt fails.
The clients that fail to acquire the lock immediately start resolving hostnames through THNR. While doing so, they try to lock the HA object for up to 60 seconds to check whether the servers are still in the failover process. Once the failover process has finished and the lock has been released, they obtain the new hostname.
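The "first client wins, the others wait" rule above can be modeled in a few lines. The sketch below is an in-memory illustration only: `HaLock`, `onMasterUnreachable`, and `waitForFailoverEnd` are hypothetical names, the real HA object lives on the server, and the `tick` callback stands in for one second of waiting so that the example stays deterministic.

```cpp
#include <cassert>
#include <string>

// In-memory stand-in for the HA object lock held on the server (hypothetical).
class HaLock {
public:
    // Succeeds only for the first client; later clients are refused.
    bool tryAcquire(const std::string& clientId)
    {
        if (!owner_.empty()) return false;
        owner_ = clientId;
        return true;
    }
    // Only the owner can release the lock.
    void release(const std::string& clientId)
    {
        if (owner_ == clientId) owner_.clear();
    }
    bool held() const { return !owner_.empty(); }

private:
    std::string owner_;
};

enum class Action { RunFailover, WaitForFailover };

// The client that gets the lock runs the failover; every other client waits.
Action onMasterUnreachable(HaLock& lock, const std::string& clientId)
{
    return lock.tryAcquire(clientId) ? Action::RunFailover
                                     : Action::WaitForFailover;
}

// Waits up to maxWaitSeconds for the lock to be released. `tick` represents one
// second passing; real code would block on the server-side lock instead.
// Returns true if the failover finished within the time budget.
template <class Tick>
bool waitForFailoverEnd(const HaLock& lock, int maxWaitSeconds, Tick tick)
{
    assert(maxWaitSeconds >= 0);
    for (int waited = 0; waited < maxWaitSeconds; ++waited) {
        if (!lock.held()) return true;
        tick();
    }
    return !lock.held();
}
```

A waiting client would call `waitForFailoverEnd(lock, 60, ...)` and, when it returns true, resolve the virtual hostname again to pick up the new master.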
Minimize downtime
The mechanism above enables automatic failover to a new master, so processing continues with minimal downtime. The time a failover takes depends on the number of slaves, but detection takes several seconds and switching takes less than 1 second.
No errors occur between the master server fault being detected and the failover program starting, because the client itself monitors the server directly. Only transactions that were in progress when the master went down fail.
THA also supports switchover, which switches the master intentionally. It waits for transactions on the current master to finish, so no data is lost.
Configure THA
This section explains how to set up THA. The examples assume these servers:
| Server | IP address |
|---|---|
| Master | 192.168.0.2 |
| Slave 1 | 192.168.0.3 |
| Slave 2 | 192.168.0.4 |
- Configure replication between a master and multiple slaves with MySQL 5.6 or later, or MariaDB 10.0 or later. c.f: details of MySQL/MariaDB GTID replication (Japanese).
  At this point, use the same configuration on all servers, such as the username and password for Transactd, the channel, and the username and password for replication.
- Specify the hostname or IP address, like report-host=192.168.0.x, in my.cnf on each server. If you use a port other than the default (8610), specify it like report-host=192.168.0.x:8611.
  This hostname is used by the start function and by health_check, which are described later.
- Restart the master server with the boot option --transactd-startup_ha=1.
- Put the haMgr program in a directory from which the Transactd client application can call it (set the path).
- Check failover availability with the haMgr -c health_check command.

      ./haMgr64 -c health_check -o 192.168.0.2 -s 192.168.0.3,192.168.0.4 -u root -p abcd
      2016-07-05T18:34:28 Starting health check...
      SLAVE_LIST=192.168.0.3,192.168.0.4
      192.168.0.2: Role = Master OK!
      192.168.0.2: HA lock OK!
      192.168.0.3: Role = Slave OK!
      192.168.0.3: Failover is disabled NG!
      192.168.0.3: HA lock OK!
      192.168.0.3: channel name=
      192.168.0.3: SQL thread running OK!
      192.168.0.3: IO thread running OK!
      192.168.0.3: SQL thread delay=0
      192.168.0.4: Role = Slave OK!
      192.168.0.4: Failover is disabled NG!
      192.168.0.4: HA lock OK!
      192.168.0.4: channel name=
      192.168.0.4: SQL thread running OK!
      192.168.0.4: IO thread running OK!
      192.168.0.4: SQL thread delay=0
      2 errors detected.
      2016-07-05T18:34:29 Done!

  There are two "Failover is disabled NG!" errors, but they are not a problem at this stage. Failover is enabled in a later step.
- Add the THNR start function to the Transactd client application.

      haNameResolver::start("master_host", "slave_host", "192.168.0.2,192.168.0.3,192.168.0.4", 1, "root", "abcd");

  See the haNameResolver SDK document (Japanese) for details of this function.
- Replace the real hostname with the virtual hostname in the Transactd client application.

      /* Access to the master */
      db->open("tdap://root@master_host/db/test?pwd=abcd");
      /* Access to the slave */
      db->open("tdap://root@slave_host/db/test?pwd=abcd");

- After testing the application, enable failover with the haMgr program.

      ./haMgr64 -c set_failover_enable -v 1 -o 192.168.0.2
      2016-07-06T10:21:04 Done!

The transactd-startup_ha option specifies the server role at startup.
- transactd-startup_ha=0, or no entry in my.cnf: the server starts as a slave.
- transactd-startup_ha=1: the server starts as a master.
The role can also be changed without restarting the server, with the haMgr program or the Transactd API.
If you restart a server after the master has been changed by switchover,
the server starts with the role specified in this boot option, even though the master has changed.
To restore the role the server had last time, add 4 to the transactd-startup_ha value.
- transactd-startup_ha=4: Restore the last role if role information exists. Start as a slave if it does not.
- transactd-startup_ha=5: Restore the last role if role information exists. Start as a master if it does not.
However, a failed server still holds the master role in its saved role information.
After it is repaired, you need to specify transactd-startup_ha=0 explicitly so that it starts as a slave.
The transactd-startup_ha value is saved in data_dir/transactd_srv_master.info.
If you delete this file, role restoring is disabled.
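The four documented values can be summarized as a small decision function. This is a sketch of the rules as written above, not the plugin's implementation: `SavedRole` and `startupRole` are hypothetical names, and treating values of 4 and above as "restore" with the remainder as the fallback role is an assumption that is only known to hold for the documented values 0, 1, 4, and 5.

```cpp
#include <cassert>

enum class Role { Slave, Master };

// Role information left over from the previous run, if any (hypothetical type).
struct SavedRole {
    bool exists;
    Role role;
};

// Returns the role a server would start with for a given transactd-startup_ha.
Role startupRole(int startupHa, const SavedRole& saved)
{
    // Only the documented values are handled.
    assert(startupHa == 0 || startupHa == 1 || startupHa == 4 || startupHa == 5);

    const bool restore = startupHa >= 4;  // "add 4" requests role restoring
    const Role fallback = (startupHa % 4 == 1) ? Role::Master : Role::Slave;

    if (restore && saved.exists) return saved.role;
    return fallback;
}
```

The case `startupRole(4, {true, Role::Master})` returns `Role::Master`, which shows why a repaired former master has to be started with transactd-startup_ha=0 explicitly.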
Note of THA applications
Two database objects and transactions in the same thread
For example, suppose the program has two database object instances, DB_A and DB_B.
A transaction is started on DB_A, and then an error occurs on DB_B during that transaction.
In this case, if DB_A and DB_B are in the same thread, reconnection is not attempted.
DB_B would have to wait for the DB_A transaction to finish before reconnecting,
but the program cannot proceed because DB_A and DB_B are in the same thread.
THA administration
Keep health
To keep availability high, check daily whether failover would succeed if a fault occurred.
haMgr has a health_check function for this purpose.
Keep THA healthy by scheduling this test. c.f: Health check.
The health check does not validate the accounts specified by repl_user and username.
Verify by other means that the account has the privilege to run CHANGE MASTER TO.
Check whether failover was done
CHANGE MASTER TO ... is written to the MySQL error.log (or the event log on Windows) when THNR performs a failover. Monitor this log to check whether a failover has occurred.
Detailed log of failover process
A more detailed log is written to the client log. (If THNR calls haMgr, the log is redirected to the error log.
If you run haMgr yourself, the log is shown on the console.)
Specifically, the log looks like this: (c.f: error log)
2016-07-07T09:29:33 Starting fail over...
SLAVE_LIST=192.168.0.2,192.168.0.3,192.168.0.4
HP6730B:8611: promote to master
HP6730B:8611: channel name=
HP6730B:8611: set role=MASTER
HP6730B:8612: channel name=
HP6730B:8612: stop slave all
HP6730B:8612: change master to new master, pos=slave_pos
HP6730B:8612: start slave
2016-07-07T09:29:33 Done!
THA with both of Transactd Client and SQL
THA performs best with applications that use only the Transactd API. If high availability is the most important issue in your project, we recommend modifying your application accordingly.
THA does not interfere with SQL access. It is designed for the Transactd API and is transparent to SQL access. If you need both, there are several ways to combine the THA elements. The main choices are explained below.
Use THA mainly
Set up THA according to the instructions above. Modify the SQL part of your application to use haNameResolver::master() as the master hostname and haNameResolver::slave() as the slave hostname. Normal operation works without problems.
A problem occurs when the first access after a master fault comes from SQL. If the first access comes from the Transactd API, THA performs fault detection, failover, and hostname resolution automatically. But if the first access comes from SQL, this recovery is not performed automatically, and errors occur. The recovery is performed at the next Transactd API access.
To speed up recovery, modify the application to "try a Transactd API access if an error occurs with SQL access".
If you do not need automatic failover, this is the best choice. Projects for which manual switchover on a fault is sufficient can use this approach without any problems. In this case, disable failover:
./haMgr64 -c set_failover_enable -v 0 -o 192.168.0.2
"The first access after the master fault" applies per client process, because all threads in a client process share one THNR.
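The advice to "try a Transactd API access if an error occurs with SQL access" could be structured roughly as follows. This is a sketch under assumptions: `runSqlWithThaRecovery` is a hypothetical helper, and the two callbacks stand for application code (a SQL operation that reconnects using haNameResolver::master(), and any cheap Transactd API access that gives THNR the chance to detect the fault and resolve the new master).

```cpp
#include <cassert>
#include <functional>

// Runs a SQL operation. If it fails, performs one Transactd API access so that
// THNR can run its recovery, then retries the SQL operation once.
// Both callbacks return true on success and are supplied by the application.
bool runSqlWithThaRecovery(const std::function<bool()>& sqlOperation,
                           const std::function<bool()>& transactdAccess)
{
    assert(sqlOperation && transactdAccess);  // both callbacks must be set

    if (sqlOperation()) return true;          // normal case: no fault

    // The SQL access failed. Trigger recovery through the Transactd API.
    if (!transactdAccess()) return false;     // recovery did not succeed

    // The SQL callback is expected to reconnect using the hostname that
    // haNameResolver::master() now returns.
    return sqlOperation();
}
```

Retrying only once keeps the helper predictable; an application that needs several attempts would wrap it in its own retry policy.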
Use MySQL's HA
Traditional MySQL HA is built like this:
- Monitoring with heartbeat
- Failover with the mysqlfailover program
- Changing the hostname with a virtual IP
- Updating the router's ARP cache
With this approach, enable reconnection in the client application so that it reconnects to the new server after failover.
/* At the start of the application */
nsdatabase::setEnableAutoReconnect(true);
This approach has some problems:
- Access fails from the time of the fault until failover finishes.
- Transactions using the Transactd API fail on switchover, because failover programs other than haMgr do not wait for them.
- Updating the IP address or ARP cache depends on the OS or other devices.
Use MySQL's HA mainly, and use haMgr for failover
If you use haMgr as the failover program, transactions using the Transactd API do not fail at switchover,
because haMgr waits for them to finish.
haMgr reference
haMgr is the program to control THA. It has these features:
- Switchover
- Failover manually
- Demote master to slave
- Change failover enable/disable
- Change server role
- Health check
Options
command line option:
-c [ --command ] command [switchover | failover | demote_to_slave | set_failover_enable | set_server_role | health_check]
-o [ --cur_master ] current master host name
-n [ --new_master ] new master host name
-C [ --channel ] new master channel name
-P [ --repl_port ] new master port
-r [ --repl_user ] new master repl user
-d [ --repl_passwd ] new master repl password
-O [ --repl_option ] option params for change master(ex:MASTER_CONNECT_RETRY=30)
-s [ --slaves ] slave list for failover
-a [ --portmap ] port map ex:3307:8611
-v [ --value ] value (For set_failover_enable or set_server_role)
-u [ --username ] transactd username
-p [ --password ] transactd password
-R [ --readonly ] 0 | 1: When it is 1, the READONLY variable will be set ON to slaves and OFF to a master
-D [ --disable_demote ] 0 | 1: disable old master demote
- --command: Specify the command.
- --cur_master: Specify the current master.
- --new_master: Specify the server that becomes the new master on switchover.
- --channel: Specify the channel name used to connect to the new master on switchover. The default value is "". It is available on MySQL 5.7 or MariaDB 10.0 or later.
- --repl_port: Specify the replication port number used by the CHANGE MASTER TO command.
- --repl_user: Specify the username used for replication.
- --repl_passwd: Specify the password for repl_user.
- --repl_option: Specify additional option(s) for the CHANGE MASTER TO command, e.g. MASTER_CONNECT_RETRY=30. Separate options with commas.
- --slaves: Specify the comma-separated slave server names for failover.
- --portmap: Specify the Transactd port and MySQL port if you changed them from the defaults, e.g. -a 3307:8611.
- --value: Specify the value used by the set_failover_enable and set_server_role commands.
- --username: Specify the username used for Transactd access.
- --password: Specify the password for username.
- --readonly: Specify 0 or 1. Use 1 to set the server variable READONLY on switchover or failover. It is set OFF on the master and ON on the slaves. READONLY affects only SQL access. Transactd API access is not affected by it.
- --disable_demote: Specify 0 or 1. Use 1 to detach the old master on switchover (it is then not demoted to a slave).
Switchover
Change the master.
e.g. Change the master from 192.168.0.2 to 192.168.0.3.
./haMgr64 -c switchover -o 192.168.0.2 -n 192.168.0.3 -R1 -r replication_user -d abcd -u root -p xxxx
Switchover changes the master and, at the same time, demotes the old master to a slave.
- If --disable_demote 1 is specified, the old master is not demoted. It is detached instead.
- If --readonly 1 is specified, the old master is set to READONLY, so SQL access to it is limited to read-only.
Failover
Perform a failover manually.
e.g. Fail over with the two current slaves 192.168.0.3 and 192.168.0.4.
./haMgr64 -c failover -s 192.168.0.3,192.168.0.4 -u root -p xxxx
Demote the master
e.g. Add the old master 192.168.0.2 as a slave of the current master 192.168.0.3.
./haMgr64 -c demote_to_slave -o 192.168.0.2 -n 192.168.0.3 -r replication_user -d abcd -u root -p xxxx
Change enable / disable of failover
Use 1 to enable failover. Use 0 to disable it.
e.g. Enable failover on current master 192.168.0.3.
./haMgr64 -c set_failover_enable -v 1 -o 192.168.0.3 -u root -p xxxx
Change server role
Use 1 to set as the master. Use 0 to set as slave.
e.g. Set 192.168.0.3 as master.
./haMgr64 -c set_server_role -o 192.168.0.3 -v 1 -u root -p xxxx
Health check
Health check reports the replication status, server roles, lock status, and failover enablement of the master and its slaves to stdout.
The program returns 0 if no errors are found, and 1 if any error is found.
e.g. Health check with the master 192.168.0.2 and the two slaves 192.168.0.3 and 192.168.0.4.
./haMgr64 -c health_check -o 192.168.0.2 -s 192.168.0.3,192.168.0.4 -u root -p xxxx
2016-07-05T18:34:28 Starting health check...
SLAVE_LIST=192.168.0.3,192.168.0.4
192.168.0.2: Role = Master OK!
192.168.0.2: HA lock OK!
192.168.0.3: Role = Slave OK!
192.168.0.3: Failover is enabled OK!
192.168.0.3: HA lock OK!
192.168.0.3: channel name=
192.168.0.3: SQL thread running OK!
192.168.0.3: IO thread running OK!
192.168.0.3: SQL thread delay=0
192.168.0.4: Role = Slave OK!
192.168.0.4: Failover is enabled OK!
192.168.0.4: HA lock OK!
192.168.0.4: channel name=
192.168.0.4: SQL thread running OK!
192.168.0.4: IO thread running OK!
192.168.0.4: SQL thread delay=0
No errors detected.
2016-07-05T18:34:29 Done!
Health check shows whether the system can fail over when a fault occurs. Running it regularly improves the reliability of the system.
For the health check to be accurate, it is important to pass the same values that are passed to haNameResolver::start().
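When the health check is scheduled, a monitoring job may want to report which checks failed in addition to looking at the exit code. The sketch below scans captured health_check output for failed lines. It assumes, based only on the sample outputs in this document, that a failed check ends with "NG!"; `endsWith` and `findFailedChecks` are hypothetical helper names and not part of haMgr.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// True when `text` ends with `suffix`.
bool endsWith(const std::string& text, const std::string& suffix)
{
    assert(!suffix.empty());
    return text.size() >= suffix.size() &&
           text.compare(text.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Returns the lines of captured health_check output that end with "NG!".
std::vector<std::string> findFailedChecks(const std::string& output)
{
    std::vector<std::string> failed;
    std::istringstream lines(output);
    std::string line;
    while (std::getline(lines, line)) {
        // Drop trailing spaces and a possible carriage return.
        while (!line.empty() && (line.back() == ' ' || line.back() == '\r'))
            line.pop_back();
        if (endsWith(line, "NG!"))
            failed.push_back(line);
    }
    return failed;
}
```

The exit code remains the authoritative result (0 for no errors, 1 otherwise); the parsed lines are only useful for building a readable alert message.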